Git Repo - linux.git/log
5 months ago rtnetlink: Add assertion helpers for per-netns RTNL.
Kuniyuki Iwashima [Fri, 4 Oct 2024 22:10:30 +0000 (15:10 -0700)]
rtnetlink: Add assertion helpers for per-netns RTNL.

Once an RTNL scope is converted with rtnl_net_lock(), we will replace
RTNL helper functions inside the scope with the following per-netns
alternatives:

  ASSERT_RTNL()           -> ASSERT_RTNL_NET(net)
  rcu_dereference_rtnl(p) -> rcu_dereference_rtnl_net(net, p)

Note that the per-netns helpers are equivalent to the conventional
helpers unless CONFIG_DEBUG_NET_SMALL_RTNL is enabled.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago rtnetlink: Add per-netns RTNL.
Kuniyuki Iwashima [Fri, 4 Oct 2024 22:10:29 +0000 (15:10 -0700)]
rtnetlink: Add per-netns RTNL.

The goal is to break RTNL down into per-netns mutexes.

This patch adds a per-netns mutex and its helper functions, rtnl_net_lock()
and rtnl_net_unlock().

rtnl_net_lock() acquires the global RTNL and per-netns RTNL mutex, and
rtnl_net_unlock() releases them.

We will replace 800+ rtnl_lock() calls with rtnl_net_lock() and finally
remove the rtnl_lock() call inside rtnl_net_lock().

When we need to nest per-netns RTNL mutexes, we will use __rtnl_net_lock(),
and its locking order is defined by rtnl_net_lock_cmp_fn() as follows:

  1. init_net is first
  2. netns address ascending order
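The ordering rule above can be sketched as a comparison function. This is a minimal userspace illustration, not the kernel's rtnl_net_lock_cmp_fn(); struct net_sketch, init_net_sketch and net_lock_cmp_sketch() are hypothetical names invented for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct net; only its address matters here. */
struct net_sketch {
	int id;
};

/* Hypothetical stand-in for the kernel's init_net. */
static struct net_sketch init_net_sketch;

/*
 * Returns a negative value if a must be locked before b, a positive
 * value if a must be locked after b, and 0 if both are the same netns.
 *
 * Rule 1: init_net is first.
 * Rule 2: otherwise, ascending address order.
 */
static int net_lock_cmp_sketch(const struct net_sketch *a,
			       const struct net_sketch *b)
{
	if (a == b)
		return 0;
	if (a == &init_net_sketch)
		return -1;
	if (b == &init_net_sketch)
		return 1;
	return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
}
```

Taking nested locks in the order this function defines gives every caller the same total order, which is what rules out ABBA deadlocks between two namespaces.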

Note that the conversion will be done under CONFIG_DEBUG_NET_SMALL_RTNL
with LOCKDEP so that we can carefully add the extra mutex without slowing
down RTNL operations during conversion.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Revert "rtnetlink: add guard for RTNL"
Kuniyuki Iwashima [Fri, 4 Oct 2024 22:10:28 +0000 (15:10 -0700)]
Revert "rtnetlink: add guard for RTNL"

This reverts commit 464eb03c4a7cfb32cb3324249193cf6bb5b35152.

Once we have a per-netns RTNL, we won't use guard(rtnl).

Also, there are no users for now.

  $ grep -rnI "guard(rtnl" || true
  $

Suggested-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/netdev/CANn89i+KoYzUH+VPLdGmLABYf5y4TW0hrM4UAeQQJ9AREty0iw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Merge branch 'net-fec-add-pps-channel-configuration'
Paolo Abeni [Tue, 8 Oct 2024 10:29:37 +0000 (12:29 +0200)]
Merge branch 'net-fec-add-pps-channel-configuration'

Francesco Dolcini says:

====================
net: fec: add PPS channel configuration

Make the FEC Ethernet PPS channel configurable from device tree.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: fec: make PPS channel configurable
Francesco Dolcini [Fri, 4 Oct 2024 15:24:19 +0000 (17:24 +0200)]
net: fec: make PPS channel configurable

Depending on the SoC into which the FEC is integrated, the PPS channel
might be routed to different timer instances. Make this configurable
from the devicetree.

When the related DT property is not present, fall back to the previous
default and use channel 0.
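The fallback behaviour can be sketched as follows. This is a userspace illustration only: the property lookup is modelled as a pointer that is NULL when the property is absent, and pps_channel_from_prop() and DEFAULT_PPS_CHANNEL_SKETCH are hypothetical names, not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Previous hard-coded default described in the commit message. */
#define DEFAULT_PPS_CHANNEL_SKETCH 0u

/*
 * prop points at the value read from the device tree, or is NULL when
 * the property is not present (an assumption of this sketch).
 */
static uint32_t pps_channel_from_prop(const uint32_t *prop)
{
	return prop ? *prop : DEFAULT_PPS_CHANNEL_SKETCH;
}
```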

Reviewed-by: Frank Li <[email protected]>
Tested-by: Rafael Beims <[email protected]>
Signed-off-by: Francesco Dolcini <[email protected]>
Reviewed-by: Csókás, Bence <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: fec: refactor PPS channel configuration
Francesco Dolcini [Fri, 4 Oct 2024 15:24:18 +0000 (17:24 +0200)]
net: fec: refactor PPS channel configuration

Preparation patch to allow for PPS channel configuration, no functional
change intended.

Signed-off-by: Francesco Dolcini <[email protected]>
Reviewed-by: Frank Li <[email protected]>
Reviewed-by: Csókás, Bence <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago dt-bindings: net: fec: add pps channel property
Francesco Dolcini [Fri, 4 Oct 2024 15:24:17 +0000 (17:24 +0200)]
dt-bindings: net: fec: add pps channel property

Add fsl,pps-channel property to select where to connect the PPS signal.
This depends on the internal SoC routing and on the board, for example
on the i.MX8 SoC it can be connected to an external pin (using channel 1)
or to internal eDMA as DMA request (channel 0).

Signed-off-by: Francesco Dolcini <[email protected]>
Acked-by: Conor Dooley <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Merge branch 'net-sparx5-prepare-for-lan969x-switch-driver'
Paolo Abeni [Tue, 8 Oct 2024 10:07:06 +0000 (12:07 +0200)]
Merge branch 'net-sparx5-prepare-for-lan969x-switch-driver'

Daniel Machon says:

====================
net: sparx5: prepare for lan969x switch driver

== Description:

This series is the first of a multi-part series that prepares and adds
support for the new lan969x switch driver.

The upstreaming effort is split into multiple series (this might change
a bit as we go along):

    1) Prepare the Sparx5 driver for lan969x (this series)
    2) Add support for lan969x (same basic features as Sparx5 provides +
       RGMII, excl. FDMA and VCAP)
    3) Add support for lan969x FDMA
    4) Add support for lan969x VCAP

== Lan969x in short:

The lan969x Ethernet switch family [1] provides a rich set of
switching features and port configurations (up to 30 ports) from 10Mbps
to 10Gbps, with support for RGMII, SGMII, QSGMII, USGMII, and USXGMII,
ideal for industrial & process automation infrastructure applications,
transport, grid automation, power substation automation, and ring &
intra-ring topologies. The LAN969x family is hardware and software
compatible and scalable supporting 46Gbps to 102Gbps switch bandwidths.

== Preparing Sparx5 for lan969x:

The lan969x switch chip reuses many of the IP blocks of the Sparx5 switch
chip, therefore it has been decided to add support through the existing
Sparx5 driver, in order to avoid a bunch of duplicate code. However, in
order to reuse the Sparx5 switch driver, we have to introduce some
mechanisms to handle the chip differences that are there.  These
mechanisms are:

    - Platform match data to contain all the differences that need to
      be handled (constants, ops etc.)

    - Register macro indirection layer so that we can reuse the existing
      register macros.

    - Function for branching out on platform type where required.

In some places we ops out functions and in other places we branch on the
chip type. Which of the two we choose is a judgment call in each case.

After this series is applied, the Sparx5 driver will be prepared for
lan969x and still function exactly as before.

== Patch breakdown:

Patch #1        adds private match data

Patch #2        adds register macro indirection layer

Patch #3-#4     does some preparation work

Patch #5-#7     adds chip constants and updates the code to use them

Patch #8-#13    adds and uses ops for handling functions differently on the
                two platforms.

Patch #14       adds and uses a macro for branching out on the chip type.

Patch #15 (NEW) redefines macros for internal ports and PGID's.

[1] https://www.microchip.com/en-us/product/lan9698

To: David S. Miller <[email protected]>
To: Eric Dumazet <[email protected]>
To: Jakub Kicinski <[email protected]>
To: Paolo Abeni <[email protected]>
To: Lars Povlsen <[email protected]>
To: Steen Hegelund <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: Richard Cochran <[email protected]>
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/20241004-b4-sparx5-lan969x-switch-driver-v2-0-d3290f581663@microchip.com
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: redefine internal ports and PGID's as offsets
Daniel Machon [Fri, 4 Oct 2024 13:19:41 +0000 (15:19 +0200)]
net: sparx5: redefine internal ports and PGID's as offsets

Internal ports and PGID's are both defined relative to the number of
front ports on Sparx5. This will not work on lan969x. Instead make them
offsets to the number of front ports and add two helpers to retrieve
them. Use the helpers throughout.
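A minimal sketch of the offset idea, assuming a per-chip constants struct that holds the number of front ports. The names chip_consts_sketch, internal_port_offset_sketch and get_internal_port_sketch() are hypothetical and do not reflect the driver's actual helpers.

```c
/* Hypothetical per-chip constants; only the front port count is modelled. */
struct chip_consts_sketch {
	unsigned int n_ports;	/* number of front ports on this chip */
};

/* Internal ports expressed as offsets past the last front port. */
enum internal_port_offset_sketch {
	PORT_OFFSET_CPU_0 = 0,
	PORT_OFFSET_CPU_1 = 1,
};

/* Resolve an internal port number for the chip at hand. */
static unsigned int
get_internal_port_sketch(const struct chip_consts_sketch *consts,
			 enum internal_port_offset_sketch offset)
{
	return consts->n_ports + (unsigned int)offset;
}
```

With 65 front ports the first CPU port resolves to 65; on a chip with 30 front ports the same offset resolves to 30, so one definition can serve both platforms.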

Reviewed-by: Steen Hegelund <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add is_sparx5 macro and use it throughout
Daniel Machon [Fri, 4 Oct 2024 13:19:40 +0000 (15:19 +0200)]
net: sparx5: add is_sparx5 macro and use it throughout

We don't want to ops out each time a function needs to do something
platform-specific. In particular, there are a few places where it would
be convenient to just branch on the platform type. Add the function
is_sparx5() and, initially, use it for:

    - register writes that should only be done on Sparx5 (QSYS_CAL_CTRL,
      CLKGEN_LCPLL1_CORE_CLK).

    - function calls that should only be done on Sparx5
      (ethtool_op_get_ts_info())

    - register writes that are chip-exclusive (MASK_CFG1/2, PGID_CFG1/2,
      these are replicated for n_ports >32 on Sparx5).

The is_sparx5() function simply checks the target chip type, to
determine if this is a Sparx5 SKU or not.
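A rough sketch of such a check, assuming the chip type is stored as an enum in the driver context. The enum values and the names target_sketch, switch_sketch and is_sparx5_sketch() are invented for illustration; the real driver's target identifiers are not shown in this log.

```c
#include <stdbool.h>

/* Hypothetical chip identifiers, not the driver's real target IDs. */
enum target_sketch {
	TARGET_SPARX5_SKU_A,
	TARGET_SPARX5_SKU_B,
	TARGET_LAN969X_SKU_A,
};

/* Hypothetical driver context holding the detected chip type. */
struct switch_sketch {
	enum target_sketch target;
};

/* True when the target chip belongs to the Sparx5 family. */
static bool is_sparx5_sketch(const struct switch_sketch *sw)
{
	switch (sw->target) {
	case TARGET_SPARX5_SKU_A:
	case TARGET_SPARX5_SKU_B:
		return true;
	default:
		return false;
	}
}
```

A caller would then guard a Sparx5-only register write with `if (is_sparx5_sketch(sw))` rather than adding a dedicated op for it.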

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: ops out function for DSM calendar calculation
Daniel Machon [Fri, 4 Oct 2024 13:19:39 +0000 (15:19 +0200)]
net: sparx5: ops out function for DSM calendar calculation

The DSM (Disassembler) calendar grants each port access to internal
busses. The configuration of the calendar is done differently on Sparx5
and lan969x. Therefore ops out the function that calculates the
calendar.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: ops out PTP IRQ handler
Daniel Machon [Fri, 4 Oct 2024 13:19:38 +0000 (15:19 +0200)]
net: sparx5: ops out PTP IRQ handler

The PTP registers are located in two different register targets on
Sparx5 and lan969x. We can't handle this with the register macros, so
ops out the handler.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: ops out function for setting the port mux
Daniel Machon [Fri, 4 Oct 2024 13:19:37 +0000 (15:19 +0200)]
net: sparx5: ops out function for setting the port mux

Port muxing is configured based on the supported port modes. As these
modes can differ on Sparx5 and lan969x we ops out the port muxing
function.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: ops out functions for getting certain array values
Daniel Machon [Fri, 4 Oct 2024 13:19:36 +0000 (15:19 +0200)]
net: sparx5: ops out functions for getting certain array values

Add getters for getting values in arrays: sdlb_groups and
sparx5_hsch_max_group_rate and ops out the getters, as these arrays will
differ on lan969x.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: ops out chip port to device index/bit functions
Daniel Machon [Fri, 4 Oct 2024 13:19:35 +0000 (15:19 +0200)]
net: sparx5: ops out chip port to device index/bit functions

The chip port device index and mode bit can be obtained using the port
number.  However the mapping of port number to chip device index and
mode bit differs on Sparx5 and lan969x. Therefore ops out the function.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add ops to match data
Daniel Machon [Fri, 4 Oct 2024 13:19:34 +0000 (15:19 +0200)]
net: sparx5: add ops to match data

Add new struct sparx5_ops, containing functions whose implementation
differs between Sparx5 and lan969x. Initially we add functions for
checking the port type (2g5, 5g, 10g or 25g) based on the port number.
Update the code to use the ops instead of the platform-specific
functions.
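The "ops out" pattern can be sketched with a table of function pointers selected per chip. Everything here is illustrative: the struct and function names are hypothetical, and the port ranges are made up for the example rather than taken from either chip.

```c
#include <stdbool.h>

/* Hypothetical ops table; the real struct sparx5_ops is not shown here. */
struct chip_ops_sketch {
	bool (*is_port_10g)(unsigned int portno);
};

/* Made-up port ranges, purely for illustration. */
static bool chip_a_is_port_10g(unsigned int portno)
{
	return portno >= 12 && portno <= 15;
}

static bool chip_b_is_port_10g(unsigned int portno)
{
	return portno < 4;
}

static const struct chip_ops_sketch chip_a_ops = {
	.is_port_10g = chip_a_is_port_10g,
};

static const struct chip_ops_sketch chip_b_ops = {
	.is_port_10g = chip_b_is_port_10g,
};

/* Driver context pointing at the ops chosen at probe time. */
struct chip_sketch {
	const struct chip_ops_sketch *ops;
};

/* Common code calls through the table and never names a chip. */
static bool port_is_10g(const struct chip_sketch *chip, unsigned int portno)
{
	return chip->ops->is_port_10g(portno);
}
```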

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: use SPX5_CONST for constants which do not have a symbol
Daniel Machon [Fri, 4 Oct 2024 13:19:33 +0000 (15:19 +0200)]
net: sparx5: use SPX5_CONST for constants which do not have a symbol

Now that we have identified all the chip constants, update the use of
them where a symbol is not defined for the constant.

Reviewed-by: Steen Hegelund <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: use SPX5_CONST for constants which already have a symbol
Daniel Machon [Fri, 4 Oct 2024 13:19:32 +0000 (15:19 +0200)]
net: sparx5: use SPX5_CONST for constants which already have a symbol

Now that we have identified all the chip constants, update the use of
them where a symbol is already defined for the constant.

Reviewed-by: Steen Hegelund <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add constants to match data
Daniel Machon [Fri, 4 Oct 2024 13:19:31 +0000 (15:19 +0200)]
net: sparx5: add constants to match data

Add new struct sparx5_consts, containing all the chip constants that are
known to be different for Sparx5 and lan969x.

Reviewed-by: Steen Hegelund <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add *sparx5 argument to a few functions
Daniel Machon [Fri, 4 Oct 2024 13:19:30 +0000 (15:19 +0200)]
net: sparx5: add *sparx5 argument to a few functions

The *sparx5 context pointer is required in functions that need to access
platform constants (which will be added in a subsequent patch).  Prepare
for this by updating the prototype and use of such functions.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: modify SPX5_PORTS_ALL macro
Daniel Machon [Fri, 4 Oct 2024 13:19:29 +0000 (15:19 +0200)]
net: sparx5: modify SPX5_PORTS_ALL macro

In preparation for lan969x, we need to define the SPX5_PORTS_ALL macro
as 70 (65 front ports + 5 internal ports). This is required as the
SPX5_PORT_CPU will be redefined as an offset to the number of front
ports, in a subsequent patch.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add indirection layer to register macros
Daniel Machon [Fri, 4 Oct 2024 13:19:28 +0000 (15:19 +0200)]
net: sparx5: add indirection layer to register macros

The register macros are used to read and write to the switch registers.
The registers are largely the same on Sparx5 and lan969x, however in some
cases they differ. The differences can be one or more of the following:
target size, register address, register count, group address, group
count, group size, field position, field size.

In order to handle these differences, we introduce a new indirection
layer, that defines and maps them to corresponding values, based on the
platform. As the register macro arguments can now be non-constants, we
also add non-constant variants of FIELD_GET and FIELD_PREP.
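To illustrate why non-constant variants are needed: the kernel's FIELD_GET() and FIELD_PREP() expect a compile-time constant mask, whereas the indirection layer looks up field position and size at run time. The sketch below shows one way such helpers could work; field_prep_rt(), field_get_rt() and mask_shift() are hypothetical names, not the helpers added by the patch.

```c
#include <stdint.h>

/* Index of the lowest set bit in mask; mask must be non-zero. */
static unsigned int mask_shift(uint32_t mask)
{
	unsigned int shift = 0;

	while (!(mask & 1u)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Place val into the field described by a run-time mask. */
static uint32_t field_prep_rt(uint32_t mask, uint32_t val)
{
	return (val << mask_shift(mask)) & mask;
}

/* Extract the field described by a run-time mask from reg. */
static uint32_t field_get_rt(uint32_t mask, uint32_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}
```

Because the mask is an ordinary value, the same register macro can resolve to different field positions depending on which platform's table was selected.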

Since the indirection layer contributes to longer macros, we have
changed the formatting of them slightly, to adhere to a 80 character
limit, and added a comment if a macro is platform-specific.

With these additions, we can reuse all the existing macros for
lan969x.

Reviewed-by: Steen Hegelund <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: sparx5: add support for private match data
Daniel Machon [Fri, 4 Oct 2024 13:19:27 +0000 (15:19 +0200)]
net: sparx5: add support for private match data

In preparation for lan969x, add support for private match data. This
will be needed for abstracting away differences between the Sparx5 and
lan969x platforms. We initially add values for: iomap, iomap size and
ioranges. Update the use of these throughout.

Reviewed-by: Steen Hegelund <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Daniel Machon <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Documentation: networking: add Twisted Pair Ethernet diagnostics at OSI Layer 1
Oleksij Rempel [Fri, 4 Oct 2024 12:18:24 +0000 (14:18 +0200)]
Documentation: networking: add Twisted Pair Ethernet diagnostics at OSI Layer 1

This patch introduces a diagnostic guide for troubleshooting Twisted
Pair Ethernet variants at OSI Layer 1. It provides detailed steps for
detecting and resolving common link issues, such as incorrect wiring,
cable damage, and power delivery problems. The guide also includes
interface verification steps and PHY-specific diagnostics.

Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Merge branch 'net-phy-support-master-slave-config-via-device-tree'
Paolo Abeni [Tue, 8 Oct 2024 08:50:16 +0000 (10:50 +0200)]
Merge branch 'net-phy-support-master-slave-config-via-device-tree'

Oleksij Rempel says:

====================
net: phy: Support master-slave config via device tree

This patch series adds support for configuring the master/slave role of
PHYs via the device tree. A new `master-slave` property is introduced in
the device tree bindings, allowing PHYs to be forced into either master
or slave mode. This is particularly necessary for Single Pair Ethernet
(SPE) PHYs (1000/100/10Base-T1), where hardware strap pins may not be
available or correctly configured, but it is applicable to all PHY
types.

changes v5:
- sync DT options with ethtool naming.

changes v4:
- add Reviewed-by
- rebase against latest net-next

changes v3:
- rename master-slave to timing-role
- add prefer-master/slave support
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: phy: Add support for PHY timing-role configuration via device tree
Oleksij Rempel [Fri, 4 Oct 2024 09:01:00 +0000 (11:01 +0200)]
net: phy: Add support for PHY timing-role configuration via device tree

Introduce support for configuring the master/slave role of PHYs based on
the `timing-role` property in the device tree. While this functionality
is necessary for Single Pair Ethernet (SPE) PHYs (1000/100/10Base-T1)
where hardware strap pins may be unavailable or incorrectly set, it
works for any PHY type.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Divya Koppera <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago dt-bindings: net: ethernet-phy: Add timing-role role property for ethernet PHYs
Oleksij Rempel [Fri, 4 Oct 2024 09:00:59 +0000 (11:00 +0200)]
dt-bindings: net: ethernet-phy: Add timing-role role property for ethernet PHYs

This patch introduces a new `timing-role` property in the device tree
bindings for configuring the master/slave role of PHYs. This is
essential for scenarios where hardware strap pins are unavailable or
incorrectly configured.

The `timing-role` property supports the following values:
- `forced-master`: Forces the PHY to operate as a master (clock source).
- `forced-slave`: Forces the PHY to operate as a slave (clock receiver).
- `preferred-master`: Prefers the PHY to be master but allows negotiation.
- `preferred-slave`: Prefers the PHY to be slave but allows negotiation.
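A consumer of this property has to translate the string into a role setting. The sketch below shows one possible mapping in plain C; the enum role_sketch values and parse_timing_role() are hypothetical and are not the kernel's ethtool constants or the PHY core's parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical role values for illustration only. */
enum role_sketch {
	ROLE_UNKNOWN = 0,
	ROLE_MASTER_PREFERRED,
	ROLE_SLAVE_PREFERRED,
	ROLE_MASTER_FORCE,
	ROLE_SLAVE_FORCE,
};

/* Map a timing-role string to a role; unknown input yields ROLE_UNKNOWN. */
static enum role_sketch parse_timing_role(const char *value)
{
	static const struct {
		const char *name;
		enum role_sketch role;
	} table[] = {
		{ "forced-master",    ROLE_MASTER_FORCE },
		{ "forced-slave",     ROLE_SLAVE_FORCE },
		{ "preferred-master", ROLE_MASTER_PREFERRED },
		{ "preferred-slave",  ROLE_SLAVE_PREFERRED },
	};

	if (!value)
		return ROLE_UNKNOWN;

	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(value, table[i].name) == 0)
			return table[i].role;
	}
	return ROLE_UNKNOWN;
}
```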

The terms "master" and "slave" are retained in this context to align
with the IEEE 802.3 standards, where they are used to describe the roles
of PHY devices in managing clock signals for data transmission. In
particular, the terms are used in specifications for 1000Base-T and
MultiGBASE-T PHYs, among others. Although there is an effort to adopt
more inclusive terminology, replacing these terms could create
discrepancies between the Linux kernel and the established standards,
documentation, and existing hardware interfaces.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Rob Herring (Arm) <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Divya Koppera <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: qcom/emac: Find sgmii_ops by device_for_each_child()
Zijun Hu [Thu, 3 Oct 2024 17:27:27 +0000 (10:27 -0700)]
net: qcom/emac: Find sgmii_ops by device_for_each_child()

To prepare for constifying the following old driver core API:

struct device *device_find_child(struct device *dev, void *data,
int (*match)(struct device *dev, void *data));
to new:
struct device *device_find_child(struct device *dev, const void *data,
int (*match)(struct device *dev, const void *data));

The new API does not allow its match function (*match)() to modify the
caller's match data @*data. emac_sgmii_acpi_match(), the match function
used with the old API, does modify its match data, so it is no longer
suitable for the new API. Solve this by adapting the function to find
sgmii_ops in the same way and passing it to device_for_each_child()
instead of device_find_child().

This commit does not change any existing logic.
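The difference between the two styles can be modelled in userspace. In this sketch the iterator passes a mutable context pointer to the callback, which records its result there and returns non-zero to stop the walk. All names (child_sketch, find_ctx, for_each_child_sketch(), record_ops()) are hypothetical; this is not the driver core API or the emac driver code.

```c
#include <stddef.h>

/* Hypothetical child device: ops is NULL when it offers none. */
struct child_sketch {
	const char *name;
	const void *ops;
};

/* Mutable context the callback is allowed to write to. */
struct find_ctx {
	const void *found_ops;
};

typedef int (*child_fn)(struct child_sketch *child, void *data);

/* Call fn on each child; stop and return its value once it is non-zero. */
static int for_each_child_sketch(struct child_sketch *children, size_t n,
				 void *data, child_fn fn)
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(&children[i], data);

		if (ret)
			return ret;
	}
	return 0;
}

/* Record the first child's ops in the context and stop the iteration. */
static int record_ops(struct child_sketch *child, void *data)
{
	struct find_ctx *ctx = data;

	if (!child->ops)
		return 0;
	ctx->found_ops = child->ops;
	return 1;
}
```

A "find"-style matcher with const data could only answer yes or no; the "for each" style keeps the write to the caller's data legitimate because the data pointer is not const.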

Signed-off-by: Zijun Hu <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: phy: mxl-gpy: add missing support for TRIGGER_NETDEV_LINK_10
Daniel Golle [Fri, 4 Oct 2024 15:56:35 +0000 (16:56 +0100)]
net: phy: mxl-gpy: add missing support for TRIGGER_NETDEV_LINK_10

The PHY also supports 10MBit/s links, and the corresponding link
indication trigger can be offloaded. Add TRIGGER_NETDEV_LINK_10 to the
supported triggers.

Signed-off-by: Daniel Golle <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/cc5da0a989af8b0d49d823656d88053c4de2ab98.1728057367.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago vmxnet3: support higher link speeds from vmxnet3 v9
Ronak Doshi [Fri, 4 Oct 2024 17:43:03 +0000 (10:43 -0700)]
vmxnet3: support higher link speeds from vmxnet3 v9

Until now, vmxnet3 reported 10Gbps as the link speed by default.
Vmxnet3 v9 adds support for the user to configure higher link speeds.
The user can configure the link speed via the VM's advanced parameters
options in vCenter. This speed is reported in Gbps by the hypervisor.

This patch adds support for vmxnet3 to report higher link speeds and
converts the speed to Mbps, as expected by the Linux stack.
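The unit conversion amounts to a multiplication by 1000. A small sketch follows; link_speed_mbps() is a hypothetical helper, and treating a reported value of 0 as "not configured, use the old 10Gbps default" is an assumption of this example rather than something stated in the commit.

```c
#include <stdint.h>

/* Previous fixed report: 10Gbps, expressed in Mbps. */
#define DEFAULT_SPEED_MBPS_SKETCH 10000u

/*
 * Convert a speed reported in Gbps to Mbps. A value of 0 is treated as
 * "no speed configured" and falls back to the old default (assumption).
 */
static uint32_t link_speed_mbps(uint32_t reported_gbps)
{
	if (reported_gbps == 0)
		return DEFAULT_SPEED_MBPS_SKETCH;
	return reported_gbps * 1000u;
}
```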

Signed-off-by: Ronak Doshi <[email protected]>
Acked-by: Guolin Yang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago dt-bindings: net: realtek: Use proper node names
Linus Walleij [Fri, 4 Oct 2024 08:08:50 +0000 (10:08 +0200)]
dt-bindings: net: realtek: Use proper node names

We eventually want to get to a place where we fix all DTS files
so that we can simply disallow switch/port/ports without the
ethernet-* prefix so the DTS files are more readable.

Replace:
- switch with ethernet-switch
- ports with ethernet-ports
- port with ethernet-port

Reviewed-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch 'ipv4-preliminary-work-for-per-netns-rtnl'
Jakub Kicinski [Mon, 7 Oct 2024 23:46:32 +0000 (16:46 -0700)]
Merge branch 'ipv4-preliminary-work-for-per-netns-rtnl'

Eric Dumazet says:

====================
ipv4: preliminary work for per-netns RTNL

Inspired by 9b8ca04854fd ("ipv4: avoid quadratic behavior in
FIB insertion of common address") and per-netns RTNL conversion
started by Kuniyuki this week.

ip_fib_check_default() can use RCU instead of a shared spinlock.

fib_info_lock can be removed, RTNL is already used.

fib_info_devhash[] can be removed in favor of a single
pointer in net_device.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago ipv4: remove fib_info_devhash[]
Eric Dumazet [Fri, 4 Oct 2024 13:47:20 +0000 (13:47 +0000)]
ipv4: remove fib_info_devhash[]

Upcoming per-netns RTNL conversion needs to get rid
of shared hash tables.

fib_info_devhash[] is one of them.

It is unclear why we used a hash table, because
a single hlist_head per net device was cheaper and scalable.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago ipv4: remove fib_info_lock
Eric Dumazet [Fri, 4 Oct 2024 13:47:19 +0000 (13:47 +0000)]
ipv4: remove fib_info_lock

After the prior patch, fib_info_lock became redundant
because all of its users are holding RTNL.

BH protection is not needed.

Remove the READ_ONCE()/WRITE_ONCE() annotations around fib_info_cnt,
since it is protected by RTNL.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago ipv4: use rcu in ip_fib_check_default()
Eric Dumazet [Fri, 4 Oct 2024 13:47:18 +0000 (13:47 +0000)]
ipv4: use rcu in ip_fib_check_default()

fib_info_devhash[] is not resized in fib_info_hash_move().

fib_nh structs are already freed after an rcu grace period.

This will allow us to remove fib_info_lock in the following patch.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago ipv4: remove fib_devindex_hashfn()
Eric Dumazet [Fri, 4 Oct 2024 13:47:17 +0000 (13:47 +0000)]
ipv4: remove fib_devindex_hashfn()

fib_devindex_hashfn() converts a 32-bit ifindex value to an 8-bit hash.

It makes no sense doing this from fib_info_hashfn() and
fib_find_info_nh().

It is better to keep as many bits as possible to let
fib_info_hashfn_result() have better spread.

Only fib_info_devhash_bucket() needs to perform this operation, so
we can inline the trivial fib_devindex_hashfn() in it.
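To make the bit-width argument concrete, here is a sketch of folding a 32-bit value into a small number of hash bits with a multiplicative hash. The constant and the helper name hash32_sketch() are chosen for this example; whether fib_devindex_hashfn() used exactly this function is an assumption.

```c
#include <stdint.h>

/* Golden-ratio multiplier commonly used for 32-bit multiplicative hashing. */
#define GOLDEN_RATIO_32_SKETCH 0x61C88647u

/* Keep the top `bits` bits of the product; bits must be in 1..32. */
static uint32_t hash32_sketch(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32_SKETCH) >> (32 - bits);
}
```

Reducing to 8 bits this way leaves only 256 possible values, so feeding the reduced value into a second, wider hash discards entropy. Passing the full 32-bit ifindex to the wider hash and reducing only where a bucket index is needed avoids that loss.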

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago lib: packing: catch kunit_kzalloc() failure in the pack() test
Vladimir Oltean [Fri, 4 Oct 2024 11:00:12 +0000 (14:00 +0300)]
lib: packing: catch kunit_kzalloc() failure in the pack() test

kunit_kzalloc() may fail. Other call sites check for this, either using
a direct comparison with the NULL pointer, or the
KUNIT_ASSERT_NOT_NULL() or KUNIT_ASSERT_NOT_ERR_OR_NULL() macros.

Pick KUNIT_ASSERT_NOT_NULL() as the error handling method that made the
most sense to me. It's an unlikely thing to happen, but at least we call
__kunit_abort() instead of dereferencing this NULL pointer.

Fixes: e9502ea6db8a ("lib: packing: add KUnit tests adapted from selftests")
Signed-off-by: Vladimir Oltean <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago mlxsw: spectrum_acl_flex_keys: Constify struct mlxsw_afk_element_inst
Christophe JAILLET [Fri, 4 Oct 2024 05:26:05 +0000 (07:26 +0200)]
mlxsw: spectrum_acl_flex_keys: Constify struct mlxsw_afk_element_inst

'struct mlxsw_afk_element_inst' are not modified in these drivers.

Constifying these structures moves some data to a read-only section,
which increases overall security.

Update a few functions and struct mlxsw_afk_block accordingly.

On x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
   4278    4032       0    8310    2076 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.o

After:
=====
   text    data     bss     dec     hex filename
   7934     352       0    8286    205e drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.o

Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://patch.msgid.link/8ccfc7bfb2365dcee5b03c81ebe061a927d6da2e.1727541677.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: dsa: remove obsolete phylink dsa_switch operations
Russell King (Oracle) [Thu, 3 Oct 2024 11:52:17 +0000 (12:52 +0100)]
net: dsa: remove obsolete phylink dsa_switch operations

No driver now uses the DSA switch phylink members, so we can remove
the method pointers, but we need to leave empty shim functions to allow
those drivers that do not provide a phylink MAC operations structure to
continue functioning.

Signed-off-by: Russell King (oracle) <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Tested-by: Vladimir Oltean <[email protected]> # sja1105, felix, dsa_loop
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: tcp: refresh tcp_mstamp for compressed ack in timer
Menglong Dong [Thu, 3 Oct 2024 08:22:31 +0000 (16:22 +0800)]
net: tcp: refresh tcp_mstamp for compressed ack in timer

For now, we refresh tcp_mstamp for delayed acks and keepalives, but
not for the compressed ack in tcp_compressed_ack_kick().

I have not found out the effect of tcp_mstamp when sending an ack, but
we can still refresh it for the compressed ack for consistency.

Signed-off-by: Menglong Dong <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago hv_netvsc: Link queues to NAPIs
Joe Damato [Mon, 30 Sep 2024 17:27:09 +0000 (17:27 +0000)]
hv_netvsc: Link queues to NAPIs

Use netif_queue_set_napi to link queues to NAPI instances so that they
can be queried with netlink.

Shradha Gupta tested the patch and reported that the results are
as expected:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                           --dump queue-get --json='{"ifindex": 2}'

 [{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
  {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
  {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'},
  {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'},
  {'id': 4, 'ifindex': 2, 'napi-id': 8197, 'type': 'rx'},
  {'id': 5, 'ifindex': 2, 'napi-id': 8198, 'type': 'rx'},
  {'id': 6, 'ifindex': 2, 'napi-id': 8199, 'type': 'rx'},
  {'id': 7, 'ifindex': 2, 'napi-id': 8200, 'type': 'rx'},
  {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},
  {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'tx'},
  {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
  {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'},
  {'id': 4, 'ifindex': 2, 'napi-id': 8197, 'type': 'tx'},
  {'id': 5, 'ifindex': 2, 'napi-id': 8198, 'type': 'tx'},
  {'id': 6, 'ifindex': 2, 'napi-id': 8199, 'type': 'tx'},
  {'id': 7, 'ifindex': 2, 'napi-id': 8200, 'type': 'tx'}]

Signed-off-by: Joe Damato <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Tested-by: Shradha Gupta <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago Merge branch 'sfc-per-q-stats'
David S. Miller [Sun, 6 Oct 2024 15:02:24 +0000 (16:02 +0100)]
Merge branch 'sfc-per-q-stats'

Edward Cree says:

====================
sfc: per-queue stats

This series implements the netdev_stat_ops interface for per-queue
 statistics in the sfc driver, partly using existing counters that
 were originally added for ethtool -S output.

Changed in v4:
* remove RFC tags

Changed in v3:
* make TX stats count completions rather than enqueues
* add new patch #4 to account for XDP TX separately from netdev
  traffic and include it in base_stats
* move the tx_queue->old_* members out of the fastpath cachelines
* note on patch #6 that our hw_gso stats still count enqueues
* RFC since net-next is closed right now

Changed in v2:
* exclude (dedicated) XDP TXQ stats from per-queue TX stats
* explain patch #3 better
====================

Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: add per-queue RX bytes stats
Edward Cree [Mon, 30 Sep 2024 13:52:45 +0000 (14:52 +0100)]
sfc: add per-queue RX bytes stats

While this does add overhead to the fast path, it should be minimal
 as the cacheline should already be held for write from updating the
 queue's rx_packets stat.

Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: implement per-queue TSO (hw_gso) stats
Edward Cree [Mon, 30 Sep 2024 13:52:44 +0000 (14:52 +0100)]
sfc: implement per-queue TSO (hw_gso) stats

Use our existing TSO stats, which count enqueued TSO TXes.
Users may expect them to count completions, as tx-packets and
 tx-bytes do; however, these are the counters we have, and the
 qstats documentation doesn't actually specify.

Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: implement per-queue rx drop and overrun stats
Edward Cree [Mon, 30 Sep 2024 13:52:43 +0000 (14:52 +0100)]
sfc: implement per-queue rx drop and overrun stats

Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: account XDP TXes in netdev base stats
Edward Cree [Mon, 30 Sep 2024 13:52:42 +0000 (14:52 +0100)]
sfc: account XDP TXes in netdev base stats

When we handle a TX completion for an XDP packet, it is not counted
 in the per-TXQ netdev stats.  Record it in new internal counters,
 and include those in the device-wide total in efx_get_base_stats().

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: add n_rx_overlength to ethtool stats
Edward Cree [Mon, 30 Sep 2024 13:52:41 +0000 (14:52 +0100)]
sfc: add n_rx_overlength to ethtool stats

The previous patch changed when we increment the RX queue's rx_packets
 counter, to match the semantics of netdev per-queue stats.  The
 differences between the old and new counts are scatter errors (which
 produce a WARN_ON) and this counter, which is incremented by
 efx_rx_packet__check_len() when an RX packet (which was placed in a
 single buffer by SG, i.e. n_frags == 1) has a length (from the RX
 event) which is too long to fit in the RX buffer.  If this occurs, we
 drop the packet and fire a ratelimited netif_err().
The counter previously was not reported anywhere; add it to ethtool -S
 output to ensure users still have this information.

Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: implement basic per-queue stats
Edward Cree [Mon, 30 Sep 2024 13:52:40 +0000 (14:52 +0100)]
sfc: implement basic per-queue stats

Just RX and TX packet counts and TX bytes for now.  We do not
 have per-queue RX byte counts, which causes us to fail
 stats.pkt_byte_sum selftest with "Drivers should always report
 basic keys" error.
Per-queue counts are since the last time the queue was inited
 (typically by efx_start_datapath(), on ifup or reconfiguration);
 device-wide total (efx_get_base_stats()) is since driver probe.
 This is not the same lifetime as rtnl_link_stats64, which uses
 firmware stats which count since FW (re)booted; this can cause a
 "Qstats are lower" or "RTNL stats are lower" failure in
 stats.pkt_byte_sum selftest.
Move the increment of rx_queue->rx_packets to match the semantics
 specified for netdev per-queue stats, i.e. just before handing
 the packet to XDP (if present) or the netstack (through GRO).
 This will affect the existing ethtool -S output which also
 reports these counters.
XDP TX packets are not yet counted into base_stats.

Signed-off-by: Edward Cree <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago sfc: remove obsolete counters from struct efx_channel
Edward Cree [Mon, 30 Sep 2024 13:52:39 +0000 (14:52 +0100)]
sfc: remove obsolete counters from struct efx_channel

The n_rx_tobe_disc and n_rx_mcast_mismatch counters are a legacy
 from farch, and are never written in EF10 or EF100 code.  Remove
 them from the struct and from ethtool -S output, saving a bit of
 memory and avoiding user confusion.

Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago Merge branch 'net-switch-back-to-struct-platform_driver-remove'
Jakub Kicinski [Fri, 4 Oct 2024 23:39:59 +0000 (16:39 -0700)]
Merge branch 'net-switch-back-to-struct-platform_driver-remove'

Uwe Kleine-König says:

====================
net: Switch back to struct platform_driver::remove()

I already sent a patch last week that is very similar to patch #1 of
this series. However the previous submission was based on plain next.
I was asked to resend based on net-next once the merge window closed,
so here comes this v2.  The additional patches address drivers/net/dsa,
drivers/net/mdio and the rest of drivers/net apart from wireless which
has its own tree and will be addressed separately at a later point in time.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: Switch back to struct platform_driver::remove()
Uwe Kleine-König [Thu, 3 Oct 2024 10:01:06 +0000 (12:01 +0200)]
net: Switch back to struct platform_driver::remove()

After commit 0edb555a65d1 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.

Convert all platform drivers below drivers/net after the previous
conversion commits apart from the wireless drivers to use .remove(),
with the eventual goal to drop struct platform_driver::remove_new(). As
.remove() and .remove_new() have the same prototypes, conversion is done
by just changing the structure member name in the driver initializer.
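
The rename described above can be illustrated with a small userspace sketch.
The struct definitions below are simplified stand-ins written for this
example, not the kernel's own definitions, and foo_remove/foo_driver are
hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; assumptions for illustration. */
struct platform_device {
	int removed;
};

struct platform_driver {
	int  (*probe)(struct platform_device *pdev);
	void (*remove)(struct platform_device *pdev);
	const char *name;
};

/* Hypothetical driver callback: .remove() returns void. */
static void foo_remove(struct platform_device *pdev)
{
	assert(pdev != NULL);
	pdev->removed = 1;
}

/*
 * Before the conversion the initializer would have read
 * ".remove_new = foo_remove". Only the member name changes;
 * the callback itself is untouched.
 */
static struct platform_driver foo_driver = {
	.remove = foo_remove,
	.name   = "foo",
};
```

Because both members share one prototype, the callback body needs no change.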

Signed-off-by: Uwe Kleine-König <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Sergey Ryazanov <[email protected]>
Acked-by: Stefan Schmidt <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: mdio: Switch back to struct platform_driver::remove()
Uwe Kleine-König [Thu, 3 Oct 2024 10:01:05 +0000 (12:01 +0200)]
net: mdio: Switch back to struct platform_driver::remove()

After commit 0edb555a65d1 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.

Convert all platform drivers below drivers/net/mdio to use .remove(),
with the eventual goal to drop struct platform_driver::remove_new(). As
.remove() and .remove_new() have the same prototypes, conversion is done
by just changing the structure member name in the driver initializer.

Signed-off-by: Uwe Kleine-König <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/0b60d8bfc45a3de8193f953794dda241e11032a9.1727949050.git.u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: dsa: Switch back to struct platform_driver::remove()
Uwe Kleine-König [Thu, 3 Oct 2024 10:01:04 +0000 (12:01 +0200)]
net: dsa: Switch back to struct platform_driver::remove()

After commit 0edb555a65d1 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.

Convert all platform drivers below drivers/net/dsa to use .remove(),
with the eventual goal to drop struct platform_driver::remove_new(). As
.remove() and .remove_new() have the same prototypes, conversion is done
by just changing the structure member name in the driver initializer.

Signed-off-by: Uwe Kleine-König <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/36da477cb9fa0bffec32d50c2cf3d18e94a0e7e3.1727949050.git.u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: ethernet: Switch back to struct platform_driver::remove()
Uwe Kleine-König [Thu, 3 Oct 2024 10:01:03 +0000 (12:01 +0200)]
net: ethernet: Switch back to struct platform_driver::remove()

After commit 0edb555a65d1 ("platform: Make platform_driver::remove()
return void") .remove() is (again) the right callback to implement for
platform drivers.

Convert all platform drivers below drivers/net/ethernet to use
.remove(), with the eventual goal to drop struct
platform_driver::remove_new(). As .remove() and .remove_new() have the
same prototypes, conversion is done by just changing the structure
member name in the driver initializer.

Signed-off-by: Uwe Kleine-König <[email protected]>
Link: https://patch.msgid.link/18f7c585a1a8a8ac8b03a2fca7de19bd5c52ac2b.1727949050.git.u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: dsa: bcm_sf2: fix crossbar port bitwidth logic
Sam Edwards [Thu, 3 Oct 2024 21:23:01 +0000 (14:23 -0700)]
net: dsa: bcm_sf2: fix crossbar port bitwidth logic

The SF2 crossbar register is a packed bitfield, giving the index of the
external port selected for each of the internal ports. On BCM4908 (the
only currently-supported switch family with a crossbar), there are 2
internal ports and 3 external ports, so there are 2 bits per internal
port.

The driver currently conflates the "bits per port" and "number of ports"
concepts, lumping both into the `num_crossbar_int_ports` field. Since it
is currently only possible for either of these counts to have a value of
2, there is no behavioral error resulting from this situation for now.

Make the code more readable (and support the future possibility of
larger crossbars) by adding a `num_crossbar_ext_bits` field to represent
the "bits per port" count and relying on this where appropriate instead.
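
To make the distinction concrete, here is a minimal userspace sketch of
reading a packed crossbar register with the two counts kept separate. The
struct and the helper crossbar_ext_port() are hypothetical; only the two
field names come from the text above, and the real register layout may
differ:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout description; field names follow the commit text. */
struct crossbar_layout {
	unsigned int num_crossbar_int_ports; /* number of internal ports */
	unsigned int num_crossbar_ext_bits;  /* bits used per internal port */
};

/* Return the external port index selected for internal port int_port. */
static unsigned int crossbar_ext_port(const struct crossbar_layout *l,
				      uint32_t reg, unsigned int int_port)
{
	uint32_t mask = (1u << l->num_crossbar_ext_bits) - 1;

	/* The port count bounds the index ... */
	assert(int_port < l->num_crossbar_int_ports);
	/* ... while the bit width sets the shift and the mask. */
	return (reg >> (int_port * l->num_crossbar_ext_bits)) & mask;
}
```

With 2 internal ports and 2 bits per port both counts equal 2, which is why
conflating them has no visible effect on BCM4908.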

Signed-off-by: Sam Edwards <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch 'net-prepare-pacing-offload-support'
Jakub Kicinski [Fri, 4 Oct 2024 22:37:58 +0000 (15:37 -0700)]
Merge branch 'net-prepare-pacing-offload-support'

Eric Dumazet says:

====================
net: prepare pacing offload support

Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ
packet scheduler.

Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.

In order to upstream the NIC support, this series adds :

1) timing wheel horizon as a per-device attribute.

2) FQ packet scheduler support, to let paced packets
   below the timing wheel horizon be handled by the driver.

v1: https://lore.kernel.org/20240930152304[email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net_sched: sch_fq: add the ability to offload pacing
Jeffrey Ji [Thu, 3 Oct 2024 12:12:19 +0000 (12:12 +0000)]
net_sched: sch_fq: add the ability to offload pacing

Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ packet
scheduler.

Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.

This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ packet
scheduler.

Its value is capped by the device max_pacing_offload_horizon,
added in the prior patch.

It allows FQ to let packets within pacing offload horizon
to be delivered to the device, which will handle the needed
delay without host involvement.
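
The decision described above can be sketched in userspace as follows. Both
helpers are hypothetical and all times are in nanoseconds. The sketch treats
the device limit as an upper bound on acceptable values; whether an
out-of-range request is rejected or clamped is not stated above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A requested horizon is acceptable only within the device limit. */
static bool fq_horizon_allowed(uint64_t requested_ns, uint64_t dev_max_ns)
{
	return requested_ns <= dev_max_ns;
}

/*
 * A packet may be handed to the device now if its departure time is
 * no later than now + horizon; otherwise the scheduler keeps holding it.
 */
static bool fq_can_offload(uint64_t time_to_send_ns, uint64_t now_ns,
			   uint64_t horizon_ns)
{
	/* Guard against wraparound in this simplified model. */
	assert(now_ns <= UINT64_MAX - horizon_ns);
	return time_to_send_ns <= now_ns + horizon_ns;
}
```

With a horizon of zero, this model only releases packets whose departure
time has already been reached.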

Signed-off-by: Jeffrey Ji <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute
Eric Dumazet [Thu, 3 Oct 2024 12:12:18 +0000 (12:12 +0000)]
net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute

Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ
packet scheduler.

Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.

This patch adds dev->max_pacing_offload_horizon expressing
this timing wheel horizon in nsec units.

This is a read-only attribute.

Unless a driver sets it, dev->max_pacing_offload_horizon
is zero.

v2: addressed Jakub feedback ( https://lore.kernel.org/netdev/20240930152304[email protected]/T/#mf6294d714c41cc459962154cc2580ce3c9693663 )
v3: added yaml doc (also per Jakub feedback)

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago selftest/ptp: update ptp selftest to exercise the gettimex options
Mahesh Bandewar [Thu, 3 Oct 2024 10:15:06 +0000 (03:15 -0700)]
selftest/ptp: update ptp selftest to exercise the gettimex options

With the inclusion of commit c259acab839e ("ptp/ioctl: support
MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDED") clock_gettime()
now allows retrieval of pre/post timestamps for CLOCK_MONOTONIC and
CLOCK_MONOTONIC_RAW timebases along with the previously supported
CLOCK_REALTIME.

This patch adds a command line option 'y' to the testptp program to
choose one of the allowed timebases (realtime aka system, monotonic,
and monotonic-raw).

Signed-off-by: Mahesh Bandewar <[email protected]>
Cc: Shuah Khan <[email protected]>
Acked-by: Richard Cochran <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch 'tcp-add-fast-path-in-timer-handlers'
Jakub Kicinski [Fri, 4 Oct 2024 22:34:42 +0000 (15:34 -0700)]
Merge branch 'tcp-add-fast-path-in-timer-handlers'

Eric Dumazet says:

====================
tcp: add fast path in timer handlers

As mentioned in Netconf 2024:

TCP retransmit and delack timers are not stopped from
inet_csk_clear_xmit_timer() because we do not define
INET_CSK_CLEAR_TIMERS.

Enabling INET_CSK_CLEAR_TIMERS leads to lower performance,
mainly because del_timer() and mod_timer() happen from
different cpus quite often.

What we can do instead is to add fast paths to tcp_write_timer()
and tcp_delack_timer() to avoid socket spinlock acquisition.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago tcp: add a fast path in tcp_delack_timer()
Eric Dumazet [Wed, 2 Oct 2024 17:30:42 +0000 (17:30 +0000)]
tcp: add a fast path in tcp_delack_timer()

delack timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.

This is a conscious choice : inet_csk_clear_xmit_timer()
is often called from another cpu. Calling del_timer()
would cause false sharing and lock contention.

This means that very often, tcp_delack_timer() is called
at the timer expiration, while there is no ACK to transmit.

This can be detected very early, avoiding the socket spinlock.

Notes:
- test about tp->compressed_ack is racy,
  but in the unlikely case there is a race, the dedicated
  compressed_ack_timer hrtimer would close it.

- Even if the fast path is not taken, reading
  icsk->icsk_ack.pending and tp->compressed_ack
  before acquiring the socket spinlock reduces
  acquisition time and chances of contention.
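
The shape of such a fast path can be modelled in userspace with C11 atomics.
This is a sketch under stated assumptions, not the kernel code: struct
fake_sock, its fields and fake_delack_timer() are hypothetical, and an
atomic_flag spin loop stands in for the socket spinlock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket state touched by the timer. */
struct fake_sock {
	atomic_flag lock;            /* plays the role of the socket spinlock */
	atomic_uint ack_pending;     /* models icsk->icsk_ack.pending */
	atomic_uint compressed_ack;  /* models tp->compressed_ack */
	unsigned int lock_acquisitions;
};

/* Returns true if the slow path (lock taken) was needed. */
static bool fake_delack_timer(struct fake_sock *sk)
{
	assert(sk != NULL);

	/* Fast path: lockless peek; with nothing to send, skip the lock. */
	if (!atomic_load_explicit(&sk->ack_pending, memory_order_acquire) &&
	    !atomic_load_explicit(&sk->compressed_ack, memory_order_relaxed))
		return false;

	while (atomic_flag_test_and_set_explicit(&sk->lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
	sk->lock_acquisitions++;
	/* An ACK would be sent here; then the pending state is cleared. */
	atomic_store_explicit(&sk->ack_pending, 0, memory_order_relaxed);
	atomic_store_explicit(&sk->compressed_ack, 0, memory_order_relaxed);
	atomic_flag_clear_explicit(&sk->lock, memory_order_release);
	return true;
}
```

An expiry that finds nothing pending returns without touching the lock.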

Signed-off-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago tcp: add a fast path in tcp_write_timer()
Eric Dumazet [Wed, 2 Oct 2024 17:30:41 +0000 (17:30 +0000)]
tcp: add a fast path in tcp_write_timer()

retransmit timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.

This is a conscious choice : for active TCP flows, it is better
to only call mod_timer(), because there are more chances of
keeping the timer unchanged. Also inet_csk_clear_xmit_timer()
is often called from another cpu, and calling del_timer()
would cause false sharing and lock contention.

This means that very often, tcp_write_timer() is called
at the timer expiration, while there is nothing to retransmit.

This can be detected very early, avoiding the socket spinlock.

Signed-off-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago tcp: annotate data-races around icsk->icsk_pending
Eric Dumazet [Wed, 2 Oct 2024 17:30:40 +0000 (17:30 +0000)]
tcp: annotate data-races around icsk->icsk_pending

icsk->icsk_pending can be read locklessly already.

The following patch in the series will add another lockless read.

Add smp_load_acquire() and smp_store_release() annotations
because the following patch will add a test in tcp_write_timer(),
and READ_ONCE()/WRITE_ONCE() alone would possibly lead to races.
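
As a rough userspace analogue, C11 release/acquire atomics express the same
pairing as smp_store_release() and smp_load_acquire(). The sketch shows the
generic publication pattern only. struct fake_icsk and both helpers are
hypothetical, and which fields the kernel actually orders against
icsk_pending is not stated above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields involved; not the kernel layout. */
struct fake_icsk {
	uint64_t timeout;      /* plain data published by the writer */
	atomic_uchar pending;  /* models icsk->icsk_pending; 0 means no timer */
};

/* Writer: fill in the data first, then publish with release semantics. */
static void fake_arm_timer(struct fake_icsk *icsk, unsigned char what,
			   uint64_t when)
{
	assert(what != 0);
	icsk->timeout = when;
	atomic_store_explicit(&icsk->pending, what, memory_order_release);
}

/*
 * Lockless reader: the acquire load pairs with the release store above,
 * so a non-zero pending value implies the matching timeout is visible.
 * Returns 1 and fills *when if a timer is pending, 0 otherwise.
 */
static int fake_timer_pending(struct fake_icsk *icsk, uint64_t *when)
{
	if (!atomic_load_explicit(&icsk->pending, memory_order_acquire))
		return 0;
	*when = icsk->timeout;
	return 1;
}
```

A relaxed load and store would keep each access tear-free but would not
order the flag against the data it guards.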

Signed-off-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch 'selftests-net-ioam-add-tunsrc-support'
Jakub Kicinski [Fri, 4 Oct 2024 22:34:10 +0000 (15:34 -0700)]
Merge branch 'selftests-net-ioam-add-tunsrc-support'

Justin Iurman says:

====================
selftests: net: ioam: add tunsrc support

TL;DR This patch comes from a discussion we had with Jakub and Paolo on
aligning the ioam selftests with its new "tunsrc" feature.

This patch updates the IOAM selftests to support the new "tunsrc"
feature of IOAM. As a consequence, some changes were required. For
example, the IPv6 header must be accessed to check some fields (i.e.,
the source address for the "tunsrc" feature), which is not possible
AFAIK with IPv6 raw sockets. The latter is currently used with
IPV6_RECVHOPOPTS and was introduced by commit 187bbb6968af ("selftests:
ioam: refactoring to align with the fix") to fix an issue. But, we
really need packet sockets actually... which is one of the changes in
this patch (see the description of the topology at the top of ioam6.sh
for explanations). Another change is that all IPv6 addresses used in the
topology are now based on the documentation prefix (2001:db8::/32).
Also, the tests have been improved and there are now many more of them.
Overall, the script is more robust.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago selftests: net: add new ioam tests
Justin Iurman [Wed, 2 Oct 2024 16:27:31 +0000 (18:27 +0200)]
selftests: net: add new ioam tests

This patch re-adds the (updated) ioam selftests with support for the
tunsrc feature.

Signed-off-by: Justin Iurman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago selftests: net: remove ioam tests
Justin Iurman [Wed, 2 Oct 2024 16:27:30 +0000 (18:27 +0200)]
selftests: net: remove ioam tests

This patch entirely removes the ioam selftests to prepare for the next
patch in this series, which re-adds the new ioam selftests for better
readability.

Signed-off-by: Justin Iurman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net: dsa: mv88e6xxx: Support LED control
Linus Walleij [Tue, 1 Oct 2024 09:27:21 +0000 (11:27 +0200)]
net: dsa: mv88e6xxx: Support LED control

This adds control over the hardware LEDs in the Marvell
MV88E6xxx DSA switch and enables it for MV88E6352.

This fixes an imminent problem on the Inteno XG6846 which
has a WAN LED that simply does not work with hardware
defaults: driver amendment is necessary.

The patch is modeled after Christian Marangi's LED support
code for the QCA8k DSA switch; I got help with the register
definitions from Tim Harvey.

After this patch it is possible to activate hardware link
indication like this (or with a similar script):

  cd /sys/class/leds/Marvell\ 88E6352:05:00:green:wan/
  echo netdev > trigger
  echo 1 > link

This makes the green link indicator come up on any link
speed. It is also possible to be more elaborate, like this:

  cd /sys/class/leds/Marvell\ 88E6352:05:00:green:wan/
  echo netdev > trigger
  echo 1 > link_1000
  cd /sys/class/leds/Marvell\ 88E6352:05:01:amber:wan/
  echo netdev > trigger
  echo 1 > link_100

This makes the green LED come on for a gigabit link and the
amber LED come on for a 100 Mbit link.

Each port has 2 LED slots (the hardware may use just one or
none) and the hardware triggers are specified in four bits per
LED, and some of the hardware triggers are only available on the
SFP (fiber) uplink. The restrictions are described in the
port.h header file where the registers are described. For
example, selector 1 set for LED 1 on port 5 or 6 will indicate
Fiber 1000 (gigabit) and activity with a blinking LED, but
ONLY for an SFP connection. If port 5/6 is used with something
not SFP, this selector is a noop: something else needs to be
selected.

After the previous series rewriting the MV88E6xxx DT
bindings to use YAML a "leds" subnode is already valid
for each port, in my scratch device tree it looks like
this:

   leds {
     #address-cells = <1>;
     #size-cells = <0>;

     led@0 {
       reg = <0>;
       color = <LED_COLOR_ID_GREEN>;
       function = LED_FUNCTION_LAN;
       default-state = "off";
       linux,default-trigger = "netdev";
     };
     led@1 {
       reg = <1>;
       color = <LED_COLOR_ID_AMBER>;
       function = LED_FUNCTION_LAN;
       default-state = "off";
     };
   };

This DT config does not yet configure everything: when the netdev
default trigger is assigned the hw acceleration callbacks are
not called, and there is no way to set the netdev sub-trigger
type (such as link_1000) from the device tree, for example if you want
a gigabit link indicator. This has to be done from userspace at
this point.

We add LED operations to all switches in the 6352 family:
6172, 6176, 6240 and 6352.

Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago hv_netvsc: Don't assume cpu_possible_mask is dense
Michael Kelley [Thu, 3 Oct 2024 03:53:33 +0000 (20:53 -0700)]
hv_netvsc: Don't assume cpu_possible_mask is dense

Current code allocates the pcpu_sum array with size num_possible_cpus().
This code assumes the cpu_possible_mask is dense, which is not true in
the general case per [1]. If cpu_possible_mask is sparse, the array
might be indexed by a value beyond the size of the array.

However, the configurations that Hyper-V provides to guest VMs on x86
and ARM64 hardware, in combination with how architecture specific code
assigns Linux CPU numbers, *do* always produce a dense cpu_possible_mask.
So the dense assumption is not currently causing failures. But for
robustness against future changes in how cpu_possible_mask is populated,
update the code to no longer assume dense.

The correct approach is to allocate and initialize the array using size
"nr_cpu_ids". While this leaves unused array entries corresponding to
holes in cpu_possible_mask, the holes are assumed to be minimal and hence
the amount of memory wasted by unused entries is minimal.
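
The sizing problem can be reproduced in a small userspace model. A 64-bit
integer stands in for cpu_possible_mask, and the helper names are
hypothetical: mask_weight() plays the role of num_possible_cpus() and
mask_nr_ids() the role of nr_cpu_ids:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of set bits: what num_possible_cpus() would report. */
static unsigned int mask_weight(uint64_t possible_mask)
{
	unsigned int n = 0;

	for (; possible_mask; possible_mask &= possible_mask - 1)
		n++;
	return n;
}

/* Highest set bit plus one: what nr_cpu_ids would report. */
static unsigned int mask_nr_ids(uint64_t possible_mask)
{
	unsigned int n = 0;

	for (; possible_mask; possible_mask >>= 1)
		n++;
	return n;
}

/* Allocate zeroed per-CPU sums so that every possible CPU id is a valid index. */
static uint64_t *alloc_pcpu_sum(uint64_t possible_mask)
{
	assert(possible_mask != 0);
	return calloc(mask_nr_ids(possible_mask), sizeof(uint64_t));
}
```

For a sparse mask with CPUs 0, 2 and 5 the weight is 3 but the highest index
is 5, so an array sized by the weight would be overrun when indexed by CPU 5.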

[1] https://lore.kernel.org/lkml/SN6PR02MB4157210CC36B2593F8572E5ED4692@SN6PR02MB4157.namprd02.prod.outlook.com/

Signed-off-by: Michael Kelley <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago ethtool: rss: fix rss key initialization warning
Daniel Zahka [Thu, 3 Oct 2024 16:23:10 +0000 (09:23 -0700)]
ethtool: rss: fix rss key initialization warning

This warning is emitted when a driver does not default populate an rss
key when one is not provided from userspace. Some devices do not
support individual rss keys per context. For these devices, it is ok
to leave the key zeroed out in ethtool_rxfh_context. Do not warn on
zeroed key when ethtool_ops.rxfh_per_ctx_key == 0.
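
The resulting condition can be sketched as a pure function. Both helpers
below are hypothetical; only the rxfh_per_ctx_key capability name is taken
from the text above, and the real check lives in kernel code that is not
shown here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every byte of the key is zero. */
static bool key_is_zero(const uint8_t *key, size_t len)
{
	size_t i;

	assert(key != NULL || len == 0);
	for (i = 0; i < len; i++)
		if (key[i])
			return false;
	return true;
}

/*
 * Warn about a zeroed key only when the device supports per-context
 * keys; without that capability a zeroed key is expected.
 */
static bool should_warn_zero_key(bool rxfh_per_ctx_key,
				 const uint8_t *key, size_t len)
{
	return rxfh_per_ctx_key && key_is_zero(key, len);
}
```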

Signed-off-by: Daniel Zahka <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Fri, 4 Oct 2024 19:30:18 +0000 (12:30 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-10-01 (ice)

This series contains updates to ice driver only.

Karol cleans up current PTP GPIO pin handling, fixes minor bugs,
refactors implementation for all products, introduces SDP (Software
Definable Pins) for E825C and implements reading SDP section from NVM
for E810 products.

Sergey replaces multiple aux buses and devices used in the PTP support
code with struct ice_adapter holding the necessary shared data.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Drop auxbus use for PTP to finalize ice_adapter move
  ice: Use ice_adapter for PTP shared data instead of auxdev
  ice: Initial support for E825C hardware in ice_adapter
  ice: Add ice_get_ctrl_ptp() wrapper to simplify the code
  ice: Introduce ice_get_phy_model() wrapper
  ice: Enable 1PPS out from CGU for E825C products
  ice: Read SDP section from NVM for pin definitions
  ice: Disable shared pin on E810 on setfunc
  ice: Cache perout/extts requests and check flags
  ice: Align E810T GPIO to other products
  ice: Add SDPs support for E825C
  ice: Implement ice_ptp_pin_desc
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago Merge branch 'add-option-to-provide-opt_id-value-via-cmsg'
Jakub Kicinski [Fri, 4 Oct 2024 18:52:22 +0000 (11:52 -0700)]
Merge branch 'add-option-to-provide-opt_id-value-via-cmsg'

Vadim Fedorenko says:

====================
Add option to provide OPT_ID value via cmsg

SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of the lockless
nature of UDP transmit: several threads may send packets in parallel. In
the case of RAW sockets the MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds a new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing an ID with each sendmsg.

The first patch in the series adds all needed definitions and implements
the function for UDP sockets. The explicit check of socket's type is not
added because subsequent patches in the series will add support for other
types of sockets. The documentation is also included into the first
patch.

Patch 2/4 adds support for TCP sockets. This part is simple and
straightforward.

Patch 3/4 adds support for RAW sockets. It's a bit tricky because the
sock_tx_timestamp functions have to be refactored to receive full socket
cookie information to fill in ID. The commit b534dc46c8ae ("net_tstamp:
add SOF_TIMESTAMPING_OPT_ID_TCP") did the conversion of sk_tsflags to
u32 but sock_tx_timestamp functions were not converted and still receive
16b flags. It wasn't a problem because SOF_TIMESTAMPING_OPT_ID_TCP was
not checked in these functions, that's why no backporting is needed.

Patch 4/4 adds selftests for new feature.

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago selftests: txtimestamp: add SCM_TS_OPT_ID test
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:16 +0000 (05:57 -0700)]
selftests: txtimestamp: add SCM_TS_OPT_ID test

Extend txtimestamp test to run with fixed tskey using
SCM_TS_OPT_ID control message for all types of sockets.

Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net_tstamp: add SCM_TS_OPT_ID for RAW sockets
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:15 +0000 (05:57 -0700)]
net_tstamp: add SCM_TS_OPT_ID for RAW sockets

The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add the new option, this patch converts all callers (direct and
indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of
tsflags. While here, also fix __sock_tx_timestamp to receive tsflags as
__u32 instead of __u16.
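
Why the parameter width matters can be shown with a short userspace sketch.
The FAKE_ constants mirror the bit positions of SOF_TIMESTAMPING_OPT_ID
(bit 7) and SOF_TIMESTAMPING_OPT_ID_TCP (bit 16) and are defined locally so
the example does not depend on header versions. The two helpers are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the flag bits; defined here as an assumption. */
#define FAKE_SOF_TIMESTAMPING_OPT_ID     (1u << 7)
#define FAKE_SOF_TIMESTAMPING_OPT_ID_TCP (1u << 16)

/* Old-style helper: a 16-bit parameter silently drops bits 16 and above. */
static uint32_t flags_via_u16(uint16_t tsflags)
{
	return tsflags;
}

/* Converted helper: a 32-bit parameter keeps every flag. */
static uint32_t flags_via_u32(uint32_t tsflags)
{
	/* Sanity check for this sketch: no bits outside the 32-bit type. */
	assert((uint64_t)tsflags <= UINT32_MAX);
	return tsflags;
}
```

Passing both flags through the 16-bit helper returns only the OPT_ID bit,
while the 32-bit helper returns both.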

Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months ago net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message
Vadim Fedorenko [Tue, 1 Oct 2024 12:57:14 +0000 (05:57 -0700)]
net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message

SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of the lockless
nature of UDP transmit: several threads may send packets in parallel. In
the case of RAW sockets the MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds a new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing an ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.
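
As an illustration of the control message described above, the following is a minimal user-space sketch of attaching a caller-chosen ID to a sendmsg() call. The helper name attach_ts_opt_id is hypothetical, and the fallback value 81 for SCM_TS_OPT_ID is an assumption taken from the asm-generic uapi header; some architectures may use a different number. The socket is assumed to already have SOF_TIMESTAMPING_OPT_ID enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: asm-generic value; older libc/kernel headers may lack it. */
#ifndef SCM_TS_OPT_ID
#define SCM_TS_OPT_ID 81
#endif

/*
 * Attach a caller-chosen timestamp ID to msg as ancillary data.
 * buf must be suitably aligned for struct cmsghdr and hold at least
 * CMSG_SPACE(sizeof(uint32_t)) bytes. (Hypothetical helper.)
 */
static void attach_ts_opt_id(struct msghdr *msg, void *buf, size_t buflen,
                             uint32_t ts_id)
{
    struct cmsghdr *cmsg;

    assert(buflen >= CMSG_SPACE(sizeof(ts_id)));
    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(ts_id));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_TS_OPT_ID;
    cmsg->cmsg_len = CMSG_LEN(sizeof(ts_id));
    memcpy(CMSG_DATA(cmsg), &ts_id, sizeof(ts_id));
}
```

After this call the caller fills in the iovec and destination as usual and passes msg to sendmsg(); the timestamp later read from the error queue should then carry the supplied ID rather than a kernel-generated one.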

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'net-mlx5-hw-counters-refactor'
Jakub Kicinski [Fri, 4 Oct 2024 18:33:48 +0000 (11:33 -0700)]
Merge branch 'net-mlx5-hw-counters-refactor'

Tariq Toukan says:

====================
net/mlx5: hw counters refactor

This is a patchset re-post, see:
https://lore.kernel.org/20240815054656.2210494[email protected]

In this patchset, Cosmin refactors hw counters and solves a perf scaling
issue.

Series generated against:
commit c824deb1a897 ("cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"")

HW counters are central to mlx5 driver operations. They are hardware
objects created and used alongside most steering operations, and queried
from a variety of places. Most counters are queried in bulk from a
periodic task in fs_counters.c.

Counter performance is important and as such, a variety of improvements
have been done over the years. Currently, counters are allocated from
pools, which are bulk allocated to amortize the cost of firmware
commands. Counters are managed through an IDR, a doubly linked list and
two atomic singly linked lists. Adding/removing counters is a complex
dance between user contexts requesting it and the mlx5_fc_stats_work
task which does most of the work.

Under high load (e.g. from connection tracking flow insertion/deletion),
the counter code becomes a bottleneck, as seen on flame graphs. Whenever
a counter is deleted, it gets added to a list and the wq task is
scheduled to run immediately to actually delete it. This is done via
mod_delayed_work which uses an internal spinlock. In some tests, waiting
for this spinlock took up to 66% of all samples.

This series refactors the counter code to use a more straightforward
approach, avoiding the mod_delayed_work problem and making the code
easier to understand. For that:

- patch #1 moves counters data structs to a more appropriate place.
- patch #2 simplifies the bulk query allocation scheme by using vmalloc.
- patch #3 replaces the IDR+3 lists with an xarray. This is the main
  patch of the series, solving the spinlock congestion issue.
- patch #4 removes an unnecessary cacheline alignment causing a lot of
  memory to be wasted.
- patches #5 and #6 are small cleanups enabled by the refactoring.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Remove mlx5_fc_create_ex
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:09 +0000 (13:37 +0300)]
net/mlx5: hw counters: Remove mlx5_fc_create_ex

It no longer serves any purpose and is identical to mlx5_fc_create, on
which it was originally based.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Don't maintain a counter count
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:08 +0000 (13:37 +0300)]
net/mlx5: hw counters: Don't maintain a counter count

num_counters is only used for deciding whether to grow the bulk query
buffer, which is done once more counters than a small initial threshold
are present. After that, maintaining num_counters serves no purpose.

This commit replaces that with an actual xarray traversal to count the
counters. This appears expensive at first sight, but is only done when
the number of counters is less than the initial threshold (8) and only
once every sampling interval. Once the number of counters goes above the
threshold, the bulk query buffer is grown to max size and the xarray
traversal is never done again.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Drop unneeded cacheline alignment
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:07 +0000 (13:37 +0300)]
net/mlx5: hw counters: Drop unneeded cacheline alignment

The mlx5_fc struct has a cache for values queried from hw, which is
cacheline aligned. On x86_64, this results in:

struct mlx5_fc {
        u32                    id;                   /*     0     4 */
        bool                   aging;                /*     4     1 */

        /* XXX 3 bytes hole, try to pack */

        struct mlx5_fc_bulk *  bulk;                 /*     8     8 */

        /* XXX 48 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct mlx5_fc_cache   cache __attribute__((__aligned__(64))); /*    64    24 */
        u64                    lastpackets;          /*    88     8 */
        u64                    lastbytes;            /*    96     8 */

        /* size: 128, cachelines: 2, members: 6 */
        /* sum members: 53, holes: 2, sum holes: 51 */
        /* padding: 24 */
        /* forced aligns: 1, forced holes: 1, sum forced holes: 48 */
} __attribute__((__aligned__(64)));

(output from pahole).

...So a 48+24=72 byte waste. As far as I can determine, this serves no
purpose other than maybe making sure that the values in the cache do not
span two cachelines in the worst case scenario, but that's not a valid
enough reason to waste 72 bytes per counter, especially since this code
is not performance-critical. There could potentially be hundreds of
thousands of counters (e.g. for connection-tracking), so this quickly
adds up to multiple MB wasted.

This commit removes the alignment, resulting in:
struct mlx5_fc {
        [...]
        /* size: 56, cachelines: 1, members: 6 */
        /* sum members: 53, holes: 1, sum holes: 3 */
        /* last cacheline: 56 bytes */
};
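
The two layouts can be reproduced in user space with a small sketch. The struct and field names below are hypothetical stand-ins (the only property taken from the pahole output is that the cache member is 24 bytes), and the asserted sizes assume an LP64 ABI such as x86_64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mlx5_fc_cache: three u64 fields, 24 bytes. */
struct fc_cache {
    uint64_t packets;
    uint64_t bytes;
    uint64_t lastuse;
};

/* Layout before the patch: cache is forced to 64-byte alignment. */
struct fc_aligned {
    uint32_t id;
    bool aging;
    void *bulk;
    struct fc_cache cache __attribute__((aligned(64)));
    uint64_t lastpackets;
    uint64_t lastbytes;
};

/* Layout after the patch: natural alignment only. */
struct fc_natural {
    uint32_t id;
    bool aging;
    void *bulk;
    struct fc_cache cache;
    uint64_t lastpackets;
    uint64_t lastbytes;
};

/* Sanity checks that hold on x86_64 (LP64); other ABIs may differ. */
static void check_layout(void)
{
    assert(offsetof(struct fc_aligned, cache) == 64);
    assert(sizeof(struct fc_aligned) == 128);
    assert(sizeof(struct fc_natural) == 56);
}

/* Bytes saved across n counters by dropping the forced alignment. */
static size_t bytes_saved(size_t n)
{
    return n * (sizeof(struct fc_aligned) - sizeof(struct fc_natural));
}
```

On x86_64 the difference is 72 bytes per counter, so bytes_saved(100000) comes to 7200000 bytes, in line with the "multiple MB" estimate above.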

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Replace IDR+lists with xarray
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:06 +0000 (13:37 +0300)]
net/mlx5: hw counters: Replace IDR+lists with xarray

Previously, managing counters was a complicated affair involving an IDR,
a sorted doubly linked list, two singly linked lists and a complex dance
between a non-periodic wq task and users adding/deleting counters.

Adding was done by inserting new counters into the IDR and into a singly
linked list, leaving the wq to process the list and actually add the
counters into the doubly linked list, maintained sorted with the IDR.

Deleting involved adding the counter into another singly linked list,
leaving the wq to actually unlink the counter from the other structures
and release it.

Dumping the counters is done with the bulk query API, which relies on
the counter list being sorted and immutable during querying to
efficiently retrieve cached counter values.

Finally, the IDR data struct is deprecated.

This commit replaces all of that with an xarray.

Adding is now done directly, by using xa_lock.
Deleting is also done directly, under the xa_lock.

Querying is done from a periodic task running every sampling_interval
(default 1s) and uses the bulk query API for efficiency.
It works by iterating over the xarray:
- when a new bulk needs to be started, the bulk information is computed
  under the xa_lock.
- the xa iteration state is saved and the xa_lock dropped.
- the HW is queried for bulk counter values.
- the xa_lock is reacquired.
- counter caches with ids covered by the bulk response are updated.
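
The steps above can be sketched in user space as follows. This is a simplified simulation, not the driver code: a sorted array stands in for the xarray, a boolean flag stands in for xa_lock, hw_bulk_query is a fake that returns id * 10 for each counter, and BULK_LEN is kept tiny so the run is easy to trace. All names are hypothetical, and concurrent add/delete while the lock is dropped is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BULK_LEN 4  /* hypothetical; the real max bulk length is far larger */

struct counter {
    uint32_t id;      /* hardware counter id; the array is sorted by id */
    uint64_t cached;  /* last value read from the fake hardware */
};

static bool locked;   /* stand-in for xa_lock */
static void fake_lock(void)   { assert(!locked); locked = true; }
static void fake_unlock(void) { assert(locked);  locked = false; }

/* Fake hardware: fills BULK_LEN values for ids starting at base_id. */
static void hw_bulk_query(uint32_t base_id, uint64_t out[BULK_LEN])
{
    assert(!locked);  /* the hardware is never queried under the lock */
    for (uint32_t i = 0; i < BULK_LEN; i++)
        out[i] = (uint64_t)(base_id + i) * 10;
}

/* Refresh every cached value; returns the number of bulk queries issued. */
static int query_all(struct counter *c, size_t n)
{
    uint64_t data[BULK_LEN];
    size_t i = 0;
    int queries = 0;

    fake_lock();
    while (i < n) {
        uint32_t base = c[i].id;  /* bulk info computed under the lock */
        size_t resume = i;        /* saved iteration state */

        fake_unlock();
        hw_bulk_query(base, data);
        queries++;
        fake_lock();

        /* Update caches for ids covered by this bulk response. */
        for (i = resume; i < n && c[i].id < base + BULK_LEN; i++)
            c[i].cached = data[c[i].id - base];
    }
    fake_unlock();
    return queries;
}
```

With ids 1, 2, 3, 9, 10 and 20, the loop issues three bulk queries (starting at ids 1, 9 and 20), and every cache is updated while the lock is held.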

Querying always requests the max bulk length, for simplicity.

Counters could be added/deleted while the HW is queried. This is safe,
as the HW API simply returns unknown values for counters not in HW, but
those values won't be accessed. Only counters present in xarray before
bulk query will actually read queried cache values.

This cuts down the size of mlx5_fc by 4 pointers (88->56 bytes), which
amounts to ~3MB / 100K counters.
But more importantly, this solves the wq spinlock congestion issue seen
happening on high-rate counter insertion+deletion.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Use kvmalloc for bulk query buffer
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:05 +0000 (13:37 +0300)]
net/mlx5: hw counters: Use kvmalloc for bulk query buffer

The bulk query buffer starts out small (see [1]) and as soon as the
number of counters goes past the initial threshold grows to max
size (32K entries, 512KB) with a retry scheme.

This commit switches to using kvmalloc for the buffer, which has a near
zero likelihood of failing, and thus the explicit retry scheme becomes
superfluous and is taken out. On the low chance the allocation fails, it
will still be retried every sampling_interval, when the wq task runs.

[1] commit b247f32aecad ("net/mlx5: Dynamically resize flow counters
query buffer")

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet/mlx5: hw counters: Make fc_stats & fc_pool private
Cosmin Ratiu [Tue, 1 Oct 2024 10:37:04 +0000 (13:37 +0300)]
net/mlx5: hw counters: Make fc_stats & fc_pool private

The mlx5_fc_stats and mlx5_fc_pool structs are only used from
fs_counters.c. As such, make them private there.

mlx5_fc_pool is not used or referenced at all outside fs_counters.

mlx5_fc_stats is referenced from mlx5_core_dev, so instead of having it
as a direct member (which requires exporting it from fs_counters), store
a pointer to it, allocate it on init and clear it on destroy.
One caveat is that a simple container_of to get from a 'work' struct to
the outermost mlx5_core_dev struct directly no longer works, so an extra
pointer had to be added to mlx5_fc_stats back to the parent dev.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoocteontx2-af: Change block parameter to const pointer in get_lf_str_list
Riyan Dhiman [Tue, 1 Oct 2024 11:05:43 +0000 (16:35 +0530)]
octeontx2-af: Change block parameter to const pointer in get_lf_str_list

Convert struct rvu_block block to const struct rvu_block *block in
get_lf_str_list() function parameter. This improves efficiency by
avoiding structure copying and reflects the function's read-only
access to block.

Signed-off-by: Riyan Dhiman <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: macb: Adding support for Jumbo Frames up to 10240 Bytes in SAMA5D2
Aleksander Jan Bajkowski [Thu, 3 Oct 2024 17:19:41 +0000 (19:19 +0200)]
net: macb: Adding support for Jumbo Frames up to 10240 Bytes in SAMA5D2

According to the SAMA5D2 device specification, the device supports jumbo
frames of up to 10240 bytes. However, the corresponding capability flag
and maximum frame length were never set in this driver's config
structure.

Setting an MTU greater than 1500 therefore failed with:
sudo ifconfig eth1 mtu 9000
SIOCSIFMTU: Invalid argument

Add this support to driver so that it works as expected and designed.

Signed-off-by: Aleksander Jan Bajkowski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'net-airoha-fix-pse-memory-configuration'
Jakub Kicinski [Fri, 4 Oct 2024 16:54:29 +0000 (09:54 -0700)]
Merge branch 'net-airoha-fix-pse-memory-configuration'

Lorenzo Bianconi says:

====================
net: airoha: Fix PSE memory configuration

Align PSE memory configuration to vendor SDK.
Increase initial value of PSE reserved memory in
airoha_fe_pse_ports_init() by the value used for the second Packet
Processor Engine (PPE2).
Do not overwrite the default value for the number of PSE reserved pages
in airoha_fe_set_pse_oq_rsv().
These changes fix issues which are not visible to the user.

v1: https://lore.kernel.org/20240930-airoha-eth-pse-fix-v1-0-f41f2f35abb9@kernel.org
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: airoha: fix PSE memory configuration in airoha_fe_pse_ports_init()
Lorenzo Bianconi [Tue, 1 Oct 2024 10:10:25 +0000 (12:10 +0200)]
net: airoha: fix PSE memory configuration in airoha_fe_pse_ports_init()

Align PSE memory configuration to vendor SDK. In particular, increase
initial value of PSE reserved memory in airoha_fe_pse_ports_init()
routine by the value used for the second Packet Processor Engine (PPE2)
and do not overwrite the default value.

Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet support
for EN7581 SoC")

Signed-off-by: Lorenzo Bianconi <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: airoha: read default PSE reserved pages value before updating
Lorenzo Bianconi [Tue, 1 Oct 2024 10:10:24 +0000 (12:10 +0200)]
net: airoha: read default PSE reserved pages value before updating

Store the default value for the number of PSE reserved pages in orig_val
at the beginning of airoha_fe_set_pse_oq_rsv routine, before updating it
with airoha_fe_set_pse_queue_rsv_pages().
Introduce airoha_fe_get_pse_all_rsv utility routine.

Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet support
for EN7581 SoC")

Signed-off-by: Lorenzo Bianconi <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'net-switch-to-scoped-device_for_each_child_node'
Jakub Kicinski [Fri, 4 Oct 2024 16:28:28 +0000 (09:28 -0700)]
Merge branch 'net-switch-to-scoped-device_for_each_child_node'

Javier Carrasco says:

====================
net: switch to scoped device_for_each_child_node()

This series switches from the device_for_each_child_node() macro to its
scoped variant. This makes the code more robust if new early exits are
added to the loops, because there is no need for explicit calls to
fwnode_handle_put(), which also simplifies existing code.

The non-scoped macros to walk over nodes turn error-prone as soon as
the loop contains early exits (break, goto, return), and patches to
fix them show up regularly, sometimes due to new error paths in an
existing loop [1].

Note that the child node is now declared in the macro, and therefore the
explicit declaration is no longer required.

The general functionality should not be affected by this modification.
If functional changes are found, please report them back as errors.

Link: https://lore.kernel.org/[email protected]
v1: https://lore.kernel.org/r/20240930-net-device_for_each_child_node_scoped-v1-0-bbdd7f9fd649@gmail.com
====================

Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-0-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: hns: hisilicon: hns_dsaf_mac: switch to scoped device_for_each_child_node()
Javier Carrasco [Mon, 30 Sep 2024 20:38:26 +0000 (22:38 +0200)]
net: hns: hisilicon: hns_dsaf_mac: switch to scoped device_for_each_child_node()

Use device_for_each_child_node_scoped() to simplify the code by removing
the need for explicit calls to fwnode_handle_put() in every error path.
This approach also accounts for any error path that could be added.

Signed-off-by: Javier Carrasco <[email protected]>
Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-2-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: mdio: thunder: switch to scoped device_for_each_child_node()
Javier Carrasco [Mon, 30 Sep 2024 20:38:25 +0000 (22:38 +0200)]
net: mdio: thunder: switch to scoped device_for_each_child_node()

There has already been an issue with the handling of early exits from
device_for_each_child() in this driver, and it was solved with commit
b1de5c78ebe9 ("net: mdio: thunder: Add missing fwnode_handle_put()") by
adding a call to fwnode_handle_put() right after the loop.

That solution is indeed valid, but it will fail if a new error path with a
'return' is added to the loop. A more robust approach
is using the scoped variant of the macro, which automatically
decrements the refcount of the child node when it goes out of scope,
removing the need for explicit calls to fwnode_handle_put().
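
The mechanism behind the scoped variant can be illustrated with a user-space sketch built on the compiler's cleanup attribute (GCC and Clang). Everything here is a hypothetical stand-in: struct node plays the role of a refcounted fwnode handle, node_put() the role of fwnode_handle_put(), and for_each_node_scoped() mimics the shape of the scoped macro without being the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted node standing in for a fwnode handle. */
struct node {
    int refcount;
    int value;
};

static struct node *node_get(struct node *n)
{
    if (n)
        n->refcount++;
    return n;
}

static void node_put(struct node *n)
{
    if (n) {
        assert(n->refcount > 0);
        n->refcount--;
    }
}

/* Cleanup handler: runs automatically when the variable leaves scope. */
static void node_put_cleanup(struct node **np)
{
    node_put(*np);
}

/* Drop the reference on cur and take one on its successor, if any. */
static struct node *next_node(struct node *arr, size_t n, struct node *cur)
{
    size_t idx = (size_t)(cur - arr) + 1;

    node_put(cur);
    return idx < n ? node_get(&arr[idx]) : NULL;
}

/* Scoped iterator: the loop variable is declared by the macro itself. */
#define for_each_node_scoped(arr, n, child)                               \
    for (struct node *child __attribute__((cleanup(node_put_cleanup))) = \
             node_get((n) > 0 ? &(arr)[0] : NULL);                        \
         child;                                                           \
         child = next_node(arr, n, child))

/* Index of the first negative value, or -1. The early return needs no
 * explicit node_put(): the cleanup handler releases the held child. */
static int find_negative(struct node *arr, size_t n)
{
    for_each_node_scoped(arr, n, child) {
        if (child->value < 0)
            return (int)(child - arr);
    }
    return -1;
}
```

Whether the loop runs to completion or exits early, every reference count ends at zero, which is the property the scoped macro is meant to guarantee for new error paths.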

Signed-off-by: Javier Carrasco <[email protected]>
Link: https://patch.msgid.link/20240930-net-device_for_each_child_node_scoped-v2-1-35f09333c1d7@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'qed-ethtool-d-faster-less-latency'
Jakub Kicinski [Fri, 4 Oct 2024 16:25:18 +0000 (09:25 -0700)]
Merge branch 'qed-ethtool-d-faster-less-latency'

Michal Schmidt says:

====================
qed: 'ethtool -d' faster, less latency

Here is a patch to make 'ethtool -d' on a qede network device a lot
faster and 3 patches to make it cause less latency for other tasks on
non-preemptible kernels.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoqed: put cond_resched() in qed_dmae_operation_wait()
Michal Schmidt [Mon, 30 Sep 2024 20:13:07 +0000 (22:13 +0200)]
qed: put cond_resched() in qed_dmae_operation_wait()

It is OK to sleep in qed_dmae_operation_wait, because it is called only
in process context, while holding p_hwfn->dmae_info.mutex from one of
the qed_dmae_{host,grc}2{host,grc} functions.
The udelay(DMAE_MIN_WAIT_TIME=2) in the function is too short to replace
with usleep_range, but at least it's a suitable point for checking if we
should give up the CPU with cond_resched().

This lowers the latency caused by 'ethtool -d' from 10 ms to less than
2 ms on my test system with voluntary preemption.

Signed-off-by: Michal Schmidt <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoqed: allow the callee of qed_mcp_nvm_read() to sleep
Michal Schmidt [Mon, 30 Sep 2024 20:13:06 +0000 (22:13 +0200)]
qed: allow the callee of qed_mcp_nvm_read() to sleep

qed_mcp_nvm_read has a loop where it calls qed_mcp_nvm_rd_cmd with the
argument b_can_sleep=false. And it sleeps once every 0x1000 bytes
read.

Simplify this by letting qed_mcp_nvm_rd_cmd itself sleep
(b_can_sleep=true). It will have slept at least once when successful
(in the "Wait for the MFW response" loop). So the extra sleep once every
0x1000 bytes becomes superfluous. Delete it.

On my test system with voluntary preemption, this lowers the latency
caused by 'ethtool -d' from 53 ms to 10 ms.

Signed-off-by: Michal Schmidt <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoqed: put cond_resched() in qed_grc_dump_ctx_data()
Michal Schmidt [Mon, 30 Sep 2024 20:13:05 +0000 (22:13 +0200)]
qed: put cond_resched() in qed_grc_dump_ctx_data()

On a kernel with preemption none or voluntary, 'ethtool -d'
on a qede network device can cause a big latency spike.
The biggest part of it is the loop in qed_grc_dump_ctx_data.

The function is called only from the .get_size and .perform_dump
callbacks for the "grc" feature defined in qed_features_lookup[].
As far as I can see, they are used in:
 - qed's devlink health reporter .dump op
 - qede's ethtool get_regs/get_regs_len/get_dump_data ops
 - qedf's qedf_get_grc_dump, called from:
   - qedf_sysfs_write_grcdump - "grcdump" sysfs attribute write
   - qedf_wq_grcdump - a workqueue

It is safe to sleep in all of them.
Let's insert a cond_resched() in the outer loop to let other tasks run.

Measured using this script:

  #!/bin/bash
  DEV=ens3f1
  echo wakeup_rt > /sys/kernel/tracing/current_tracer
  echo 0 > /sys/kernel/tracing/tracing_max_latency
  echo 1 > /sys/kernel/tracing/tracing_on
  echo "Setting the task CPU affinity"
  taskset -p 1 $$ > /dev/null
  echo "Starting the real-time task"
  chrt -f 50 bash -c 'while sleep 0.01; do :; done' &
  sleep 1
  echo "Running: ethtool -d $DEV"
  time ethtool -d $DEV > /dev/null
  kill %1
  echo 0 > /sys/kernel/tracing/tracing_on
  echo "Measured latency: $(</sys/kernel/tracing/tracing_max_latency) us"
  echo "To see the latency trace: less /sys/kernel/tracing/trace"

The patch lowers the latency from 180 ms to 53 ms on my test system with
voluntary preemption.

Signed-off-by: Michal Schmidt <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoqed: make 'ethtool -d' 10 times faster
Michal Schmidt [Mon, 30 Sep 2024 20:13:04 +0000 (22:13 +0200)]
qed: make 'ethtool -d' 10 times faster

As a side effect of commit 5401c3e09928 ("qed: allow sleep in
qed_mcp_trace_dump()"), 'ethtool -d' became much slower.
Almost all the time is spent collecting the "mcp_trace".
It is caused by sleeping too long in _qed_mcp_cmd_and_union.
When called with sleeping not allowed, the function delays for 10 µs
between firmware polls. But if sleeping is allowed, it sleeps for 10 ms
instead.

The sleeps in _qed_mcp_cmd_and_union are unnecessarily long.
Replace msleep with usleep_range, which achieves a polling interval
similar to that of the no-sleeping mode (10 - 20 µs).

The only caller, qed_mcp_cmd_and_union, can stop doing the
multiplication/division of the usecs/max_retries. The polling interval
and the number of retries do not need to be parameters at all.

On my test system, 'ethtool -d' now takes 4 seconds instead of 44.

Signed-off-by: Michal Schmidt <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'net-mv643xx-devm-fixes'
Jakub Kicinski [Fri, 4 Oct 2024 15:59:49 +0000 (08:59 -0700)]
Merge branch 'net-mv643xx-devm-fixes'

Rosen Penev says:

====================
net: mv643xx: devm fixes

Small simplification and a fix for a seemingly wrong function usage.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: mv643xx: fix wrong devm_clk_get usage
Rosen Penev [Mon, 30 Sep 2024 20:29:51 +0000 (13:29 -0700)]
net: mv643xx: fix wrong devm_clk_get usage

This clock should be optional. In addition, PTR_ERR can be -EPROBE_DEFER
in which case it should return.

devm_clk_get_optional_enabled also allows removing explicit clock enable
and disable calls.

Signed-off-by: Rosen Penev <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: mv643xx: use devm_platform_ioremap_resource
Rosen Penev [Mon, 30 Sep 2024 20:29:50 +0000 (13:29 -0700)]
net: mv643xx: use devm_platform_ioremap_resource

This combines multiple steps in one function.

Signed-off-by: Rosen Penev <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agoMerge branch 'net-ag71xx-small-cleanups'
Jakub Kicinski [Fri, 4 Oct 2024 15:56:41 +0000 (08:56 -0700)]
Merge branch 'net-ag71xx-small-cleanups'

Rosen Penev says:

====================
net: ag71xx: small cleanups

More devm and some loose ends.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: ag71xx: move assignment into main loop
Rosen Penev [Mon, 30 Sep 2024 18:18:23 +0000 (11:18 -0700)]
net: ag71xx: move assignment into main loop

Effectively what's going on here is there's a main loop and an identical
one below with a single assignment. Simpler to move it up.

Signed-off-by: Rosen Penev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
5 months agonet: ag71xx: replace INIT_LIST_HEAD
Rosen Penev [Mon, 30 Sep 2024 18:18:22 +0000 (11:18 -0700)]
net: ag71xx: replace INIT_LIST_HEAD

LIST_HEAD is a shorter macro. No real difference.
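
For readers unfamiliar with the two forms, here is a simplified user-space sketch of why they are interchangeable. The definitions are modelled on the kernel's list helpers but reduced for illustration (for example, without WRITE_ONCE), so treat them as an approximation rather than the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

struct list_head {
    struct list_head *next, *prev;
};

/* One-step form: declare the head and point it at itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* Two-step form: initialise a head that was declared separately. */
static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

static inline bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Both forms leave the head as an empty, self-referencing list. */
static void check_both_forms(void)
{
    struct list_head a;

    INIT_LIST_HEAD(&a);
    LIST_HEAD(b);

    assert(list_empty(&a) && a.prev == &a);
    assert(list_empty(&b) && b.prev == &b);
}
```

The only difference is that LIST_HEAD folds the declaration and the initialisation into a single statement.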

Signed-off-by: Rosen Penev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>