Git Repo - linux.git/log

tools/net/ynl: improve async notification handling

The notification handling in ynl is currently very simple, using sleep()
to wait a period of time and then handling all the buffered messages in
a single batch.

This patch changes the notification handling so that messages are
processed as they are received. This makes it possible to use ynl as a
library that supplies notifications in a timely manner.

- Change check_ntf() to be a generator that yields 1 notification at a
  time and blocks until a notification is available.
- Use the --sleep parameter to set an alarm and exit when it fires.

This means that the CLI has the same interface, but notifications get
printed as they are received:

./tools/net/ynl/cli.py --spec <SPEC> --subscribe <TOPIC> [ --sleep <SECS> ]

Here is an example python snippet that shows how to use ynl as a library
for receiving notifications:

    ynl = YnlFamily(f"{dir}/rt_route.yaml")
    ynl.ntf_subscribe('rtnlgrp-ipv4-route')

    for event in ynl.check_ntf():
        handle(event)

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20241018093228.25477-1-donald.hunter@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-pcs-xpcs-yet-more-cleanups'

Russell King says:

====================
net: pcs: xpcs: yet more cleanups

I've found yet more potential for cleanups in the XPCS driver.

The first patch switches to using generic register definitions.

Next, there's an overly complex bit of code in xpcs_link_up_1000basex()
which can be simplified down to a simple if() statement.

Then, rearrange xpcs_link_up_1000basex() to separate out the warnings
from the functional bit.

Next, realising that the functional bit is just the helper function we
already have and are using in the SGMII version of this function,
switch over to that.

We can now see that xpcs_link_up_1000basex() and xpcs_link_up_sgmii()
are basically functionally identical except for the warnings, so merge
the two functions.

Next, xpcs_config_usxgmii() seems misnamed, so rename it to follow the
established pattern.

Lastly, "return foo();" where foo is a void function and the function
being returned from is also void is a weird programming pattern.
Replace this with something more conventional.

With these changes, we see yet another reduction in the amount of
code in this driver.

Tested-by: Serge Semin <fancer.lancer@gmail.com>
====================

Link: https://patch.msgid.link/ZxD6cVFajwBlC9eN@shell.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: remove return statements in void function

While using "return" when calling a void returning function inside a
function that returns void doesn't cause a compiler warning, it looks
weird. Convert the bunch of if() statements to a switch() and remove
these return statements.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: rename xpcs_config_usxgmii()

xpcs_config_usxgmii() is only called from the xpcs_link_up() method, so
let's name it similarly to the SGMII and 1000BASEX functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: combine xpcs_link_up_{1000basex,sgmii}()

xpcs_link_up_sgmii() and xpcs_link_up_1000basex() are almost identical
with the exception of checking the speed and duplex for 1000BASE-X.
Combine the two functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: replace open-coded mii_bmcr_encode_fixed()

We can now see that we have an open-coded version of
mii_bmcr_encode_fixed() when this is called with SPEED_1000:

        val = BMCR_SPEED1000;
        if (duplex == DUPLEX_FULL)
                val |= BMCR_FULLDPLX;

Replace this with a call to mii_bmcr_encode_fixed().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: rearrange xpcs_link_up_1000basex()

Rearrange xpcs_link_up_1000basex() to make it more obvious what will
happen in the following commit.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: remove switch() in xpcs_link_up_1000basex()

Remove an unnecessary switch() statement in xpcs_link_up_1000basex().
The only value this switch statement is interested in is SPEED_1000,
all other values lead to an error. Replace this with a simple if()
statement.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: use generic register definitions

As a general policy, we refer our generic register definitions over
vendor specific definitions. In XPCS, it appears that the register
layout follows a BMCR, BMSR and ADVERTISE register definition. We
already refer to this BMCR register using several different macros
which is confusing.

Convert the following register definitions to generic versions:

DW_VR_MII_MMD_CTRL => MII_BMCR
MDIO_CTRL1 => MII_BMCR
AN_CL37_EN => BMCR_ANENABLE
SGMII_SPEED_SS6 => BMCR_SPEED1000
SGMII_SPEED_SS13 => BMCR_SPEED100
MDIO_CTRL1_RESET => BMCR_RESET

DW_VR_MII_MMD_STS => MII_BMSR
DW_VR_MII_MMD_STS_LINK_STS => BMSR_LSTATUS

DW_FULL_DUPLEX => ADVERTISE_1000XFULL
iDW_HALF_DUPLEX => ADVERTISE_1000XHALF

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netlink: specs: Add missing bitset attrs to ethtool spec

There are a couple of attributes missing from the 'bitset' attribute-set
in the ethtool netlink spec. Add them to the spec.

Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Closes: https://lore.kernel.org/netdev/20241017180551.1259bf5c@kmaincent-XPS-13-7390/
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20241018090630.22212-1-donald.hunter@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netdev_tx_sent_queue() small optimization

Change smp_mb() imediately following a set_bit()
with smp_mb__after_atomic().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241018052310.2612084-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netpoll: remove ndo_netpoll_setup() second argument

npinfo is not used in any of the ndo_netpoll_setup() methods.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241018052108.2610827-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv4: Switch inet_addr_hash() to less predictable hash.

Recently, commit 4a0ec2aa0704 ("ipv6: switch inet6_addr_hash()
to less predictable hash") and commit 4daf4dc275f1 ("ipv6: switch
inet6_acaddr_hash() to less predictable hash") hardened IPv6
address hash functions.

inet_addr_hash() is also highly predictable, and a malicious use
could abuse a specific bucket.

Let's follow the change on IPv4 by using jhash_1word().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241018014100.93776-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ip6mr: Add __init to ip6_mr_cleanup().

kernel test robot reported a section mismatch in ip6_mr_cleanup().

WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x0 (section: .text) -> 0xffffffff (section: .init.rodata)
WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x14 (section: .text) -> ip6mr_rtnl_msg_handlers (section: .init.rodata)

ip6_mr_cleanup() uses ip6mr_rtnl_msg_handlers[] that has
__initconst_or_module qualifier.

ip6_mr_cleanup() is only called from inet6_init() but does
not have __init qualifier.

Let's add __init to ip6_mr_cleanup().

Fixes: 3ac84e31b33e ("ipmr: Use rtnl_register_many().")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410180139.B3HeemsC-lkp@intel.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20241017174732.39487-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_api: unexport tcf_action_dump_1()

This isn't used outside act_api.c, but is called by tcf_dump_walker()
prior to its definition. So move it upwards and make it static.

Simultaneously, reorder the variable declarations so that they follow
the networking "reverse Christmas tree" coding style.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241017161934.3599046-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sysctl-allow-dump_cpumask-to-handle-higher-numbers-of-cpus'

Antoine Tenart says:

====================
net: sysctl: allow dump_cpumask to handle higher numbers of CPUs

The main goal of this series is to allow dump_cpumask to handle higher
numbers of CPUs (patch 3). While doing so I had the opportunity to make
the function a bit simpler, which is done in patches 1-2.

None of those is net material IMO.
====================

Link: https://patch.msgid.link/20241017152422.487406-1-atenart@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sysctl: allow dump_cpumask to handle higher numbers of CPUs

This fixes the output of rps_default_mask and flow_limit_cpu_bitmap when
the CPU count is > 448, as it was truncated.

The underlying values are actually stored correctly when writing to
these sysctl but displaying them uses a fixed length temporary buffer in
dump_cpumask. This buffer can be too small if the CPU count is > 448.

Fix this by dynamically allocating the buffer in dump_cpumask, using a
guesstimate of what we need.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sysctl: do not reserve an extra char in dump_cpumask temporary buffer

When computing the length we'll be able to use out of the buffers, one
char is removed from the temporary one to make room for a newline. It
should be removed from the output buffer length too, but in reality this
is not needed as the later call to scnprintf makes sure a null char is
written at the end of the buffer which we override with the newline.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sysctl: remove always-true condition

Before adding a new line at the end of the temporary buffer in
dump_cpumask, a length check is performed to ensure there is space for
it.

  len = min(sizeof(kbuf) - 1, *lenp);
  len = scnprintf(kbuf, len, ...);
  if (len < *lenp)
          kbuf[len++] = '\n';

Note that the check is currently logically wrong, the written length is
compared against the output buffer, not the temporary one. However this
has no consequence as this is always true, even if fixed: scnprintf
includes a null char at the end of the buffer but the returned length do
not include it and there is always space for overriding it with a
newline.

Remove the condition.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: use sock_valbool_flag() only in __sock_set_timestamps()

sock_{,re}set_flag() are contained in sock_valbool_flag(),
it would be cleaner to just use sock_valbool_flag().

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Link: https://patch.msgid.link/20241017133435.2552-1-yajun.deng@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netdevsim: macsec: pad u64 to correct length in logs

Commit 02b34d03a24b ("netdevsim: add dummy macsec offload") pads u64
number to 8 characters using "%08llx" format specifier.

Changing format specifier to "%016llx" ensures that no matter the value
the representation of number in log is always the same length.

Before this patch, entry in log for value '1' would say:
removing SecY with SCI 00000001 at index 2
After this patch is applied, entry in log will say:
removing SecY with SCI 0000000000000001 at index 2

Signed-off-by: Ales Nezbeda <anezbeda@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20241017131933.136971-1-anezbeda@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mv643xx: use ethtool_puts

Allows simplifying get_strings and avoids manual pointer manipulation.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Message-ID: <20241018200522.12506-1-rosenp@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

net: atlantic: support reading SFP module info

Add support for reading SFP module info and digital diagnostic
monitoring data if supported by the module. The only Aquantia
controller without an integrated PHY is the AQC100 which belongs to
the B0 revision, that's why it's only implemented there.

The register information was extracted from a diagnostic tool made
publicly available by Dell, but all code was written from scratch by me.

This has been tested to work with a variety of both optical and direct
attach modules I had lying around and seems to work fine with all of
them, including the diagnostics if supported by an optical module.
All tests have been done with an AQC100 on an TL-NT521F card on firmware
version 3.1.121 (current at the time of this patch).

Signed-off-by: Lorenz Brun <lorenz@brun.one>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <20241018171721.2577386-1-lorenz@brun.one>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in otx2_dcbnl.c

Add error pointer check after calling otx2_mbox_get_rsp().

Fixes: 8e67558177f8 ("octeontx2-pf: PFC config support with DCBx")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in otx2_dmac_flt.c

Add error pointer checks after calling otx2_mbox_get_rsp().

Fixes: 79d2be385e9e ("octeontx2-pf: offload DMAC filters to CGX/RPM block")
Fixes: fa5e0ccb8f3a ("octeontx2-pf: Add support for exact match table.")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in cn10k.c

Add error pointer check after calling otx2_mbox_get_rsp().

Fixes: 2ca89a2c3752 ("octeontx2-pf: TC_MATCHALL ingress ratelimiting offload")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in otx2_flows.c

Adding error pointer check after calling otx2_mbox_get_rsp().

Fixes: 9917060fc30a ("octeontx2-pf: Cleanup flow rule management")
Fixes: f0a1913f8a6f ("octeontx2-pf: Add support for ethtool ntuple filters")
Fixes: 674b3e164238 ("octeontx2-pf: Add additional checks while configuring ucast/bcast/mcast rules")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in otx2_ethtool.c

Add error pointer check after calling otx2_mbox_get_rsp().

Fixes: 75f36270990c ("octeontx2-pf: Support to enable/disable pause frames via ethtool")
Fixes: d0cf9503e908 ("octeontx2-pf: ethtool fec mode support")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

octeontx2-pf: handle otx2_mbox_get_rsp errors in otx2_common.c

Add error pointer check after calling otx2_mbox_get_rsp().

Fixes: ab58a416c93f ("octeontx2-pf: cn10k: Get max mtu supported from admin function")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

virtchnl: fix m68k build.

The kernel test robot reported a build failure on m68k in the intel
driver due to the recent shapers-related changes.

The mentioned arch has funny alignment properties, let's be explicit
about the binary layout expectation introducing a padding field.

Fixes: 608a5c05c39b ("virtchnl: support queue rate limit and quanta size configuration")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410131710.71Wt6LKO-lkp@intel.com/
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/e45d1c9f17356d431b03b419f60b8b763d2ff768.1729000481.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-netconsole-refactoring-and-warning-fix'

Breno Leitao says:

====================
net: netconsole refactoring and warning fix

The netconsole driver was showing a warning related to userdata
information, depending on the message size being transmitted:

------------[ cut here ]------------
WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0
? write_ext_msg+0x3b6/0x3d0
console_flush_all+0x1e9/0x330
...

Identifying the cause of this warning proved to be non-trivial due to:

* The write_ext_msg() function being over 100 lines long
* Extensive use of pointer arithmetic
* Inconsistent naming conventions and concept application

The send_ext_msg() function grew organically over time:

* Initially, the UDP packet consisted of a header and body
* Later additions included release prepend and userdata
* Naming became inconsistent (e.g., "body" excludes userdata, "header"
excludes prepended release)

This lack of consistency made investigating issues like the above warning
more challenging than what it should be.

To address these issues, the following steps were taken:

* Breaking down write_ext_msg() into smaller functions with clear scopes
* Improving readability and reasoning about the code
* Simplifying and clarifying naming conventions

Warning Fix
-----------

The warning occurred when there was insufficient buffer space to append
userdata. While this scenario is acceptable (as userdata can be sent in a
separate packet later), the kernel was incorrectly raising a warning. A
one-line fix has been implemented to resolve this issue.

The fix was already sent to net, and is already available in net-next
also.

v4:
* https://lore.kernel.org/all/20240930131214.3771313-1-leitao@debian.org/

v3:
* https://lore.kernel.org/all/20240910100410.2690012-1-leitao@debian.org/

v2:
* https://lore.kernel.org/all/20240909130756.2722126-1-leitao@debian.org/

v1:
* https://lore.kernel.org/all/20240903140757.2802765-1-leitao@debian.org/
====================

Link: https://patch.msgid.link/20241017095028.3131508-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: split send_msg_fragmented

Refactor the send_msg_fragmented() function by extracting the logic for
sending the message body into a new function called
send_fragmented_body().

Now, send_msg_fragmented() handles appending the release and header, and
then delegates the task of breaking up the body and sending the
fragments to send_fragmented_body().

This is the final flow now:

When send_ext_msg_udp() is called to send a message, it will:
  - call send_msg_no_fragmentation() if no fragmentation is needed
  or
  - call send_msg_fragmented() if fragmentation is needed
    * send_msg_fragmented() appends the header to the buffer, which is
      be persisted until the function returns
      * call send_fragmented_body() to iterate and populate the body of
the message. It will not touch the header, and it will only
replace the body, writing the msgbody and/or userdata.

Also add some comment to make the code easier to review.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: do not pass userdata up to the tail

Do not pass userdata to send_msg_fragmented, since we can get it later.

This will be more useful in the next patch, where send_msg_fragmented()
will be split even more, and userdata is only necessary in the last
function.

Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: extract release appending into separate function

Refactor the code by extracting the logic for appending the
release into the buffer into a separate function.

The goal is to reduce the size of send_msg_fragmented() and improve
code readability.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: track explicitly if msgbody was written to buffer

The current check to determine if the message body was fully sent is
difficult to follow. To improve clarity, introduce a variable that
explicitly tracks whether the message body (msgbody) has been completely
sent, indicating when it's time to begin sending userdata.

Additionally, add comments to make the code more understandable for
others who may work with it.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: introduce variable to track body length

This new variable tracks the total length of the data to be sent,
encompassing both the message body (msgbody) and userdata, which is
collectively called body.

By explicitly defining body_len, the code becomes clearer and easier to
reason about, simplifying offset calculations and improving overall
readability of the function.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: rename body to msg_body

With the introduction of the userdata concept, the term body has become
ambiguous and less intuitive.

To improve clarity, body is renamed to msg_body, making it clear that
the body is not the only content following the header.

In an upcoming patch, the term body_len will also be revised for further
clarity.

The current packet structure is as follows:

release, header, body, [msg_body + userdata]

Here, [msg_body + userdata] collectively forms what is currently
referred to as "body." This renaming helps to distinguish and better
understand each component of the packet.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: separate fragmented message handling in send_ext_msg

Following the previous change, where the non-fragmented case was moved
to its own function, this update introduces a new function called
send_msg_fragmented to specifically manage scenarios where message
fragmentation is required.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: split send_ext_msg_udp() function

The send_ext_msg_udp() function has become quite large, currently
spanning 102 lines. Its complexity, along with extensive pointer and
offset manipulation, makes it difficult to read and error-prone.

The function has evolved over time, and it’s now due for a refactor.

To improve readability and maintainability, isolate the case where no
message fragmentation occurs into a separate function, into a new
send_msg_no_fragmentation() function. This scenario covers about 95% of
the messages.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: netconsole: remove msg_ready variable

Variable msg_ready is useless, since it does not represent anything. Get
rid of it, using buf directly instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tools: ynl-gen: use big-endian netlink attribute types

Change ynl-gen-c.py to use NLA_BE16 and NLA_BE32 types to represent
big-endian u16 and u32 ynl types.

Doing this enables those attributes to have range checks applied, as
the validator will then convert to host endianness prior to validation.

The autogenerated kernel/uapi code have been regenerated by running:
  ./tools/net/ynl/ynl-regen.sh -f

This changes the policy types of the following attributes:

  FOU_ATTR_PORT (NLA_U16 -> NLA_BE16)
  FOU_ATTR_PEER_PORT (NLA_U16 -> NLA_BE16)
    These two are used with nla_get_be16/nla_put_be16().

  MPTCP_PM_ADDR_ATTR_ADDR4 (NLA_U32 -> NLA_BE32)
    This one is used with nla_get_in_addr/nla_put_in_addr(),
    which uses nla_get_be32/nla_put_be32().

IOWs the generated changes are AFAICT aligned with their implementations.

The generated userspace code remains identical, and have been verified
by comparing the output generated by the following command:
  make -C tools/net/ynl/generated

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241017094704.3222173-1-ast@fiberby.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'selftests-net-introduce-deferred-commands'

Petr Machata says:

====================
selftests: net: Introduce deferred commands

Recently, a defer helper was added to Python selftests. The idea is to keep
cleanup commands close to their dirtying counterparts, thereby making it
more transparent what is cleaning up what, making it harder to miss a
cleanup, and make the whole cleanup business exception safe. All these
benefits are applicable to bash as well, exception safety can be
interpreted in terms of safety vs. a SIGINT.

This patchset therefore introduces a framework of several helpers that
serve to schedule cleanups in bash selftests.

- Patch #1 has more details about the primitives being introduced.
  Patch #2 adds a fallback cleanup() function to lib.sh, because ideally
  selftests wouldn't need to introduce a dedicated cleanup function at all.

- Patch #3 adds a parameter to stop_traffic(), which makes it possible to
  start other background processes after the traffic is started without
  confusing the cleanup.

- Patches #4 to #10 convert a number of selftests.

  The goal was to convert all tests that use start_traffic / stop_traffic
  to the defer framework. Leftover traffic generators are a particularly
  painful sort of a missed cleanup. Normal unfinished cleanups can usually
  be cleaned up simply by rerunning the test and interrupting it early to
  let the cleanups run again / in full. This does not work with
  stop_traffic, because it is only issued at the end of the test case that
  starts the traffic. At the same time, leftover traffic generators
  influence follow-up test runs, and are hard to notice.

  The tests were however converted whole-sale, not just their traffic bits.
  Thus they form a proof of concept of the defer framework.
====================

Link: https://patch.msgid.link/cover.1729157566.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mlxsw: devlink_trap_police: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Note that the start_traffic commands in __burst_test() are each sending a
fixed number of packets (note the -c flag) and then ending. They therefore
do not need a matching stop_traffic.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mlxsw: qos_max_descriptors: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mlxsw: qos_ets_strict: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: mlxsw: qos_mc_aware: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: ETS: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: TBF: Use defer for test cleanup

Use the defer framework to schedule cleanups as soon as the command is
executed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: RED: Use defer for test cleanup

Instead of having a suite of dedicated cleanup functions, use the defer
framework to schedule cleanups right as their setup functions are run.

The sleep after stop_traffic() in mlxsw selftests is necessary, but
scheduling it as "defer sleep; defer stop_traffic" is silly. Instead, add a
local helper to stop traffic and sleep afterwards.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: forwarding: lib: Allow passing PID to stop_traffic()

Now that it is possible to schedule a deferral of stop_traffic() right
after the traffic is started, we do not have to rely on the %% magic to
kill the background process that was started last. Instead we can just give
the PID explicitly. This makes it possible to start other background
processes after the traffic is started without confusing the cleanup.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: forwarding: Add a fallback cleanup()

Consistent use of defers obviates the need for a separate test-specific
cleanup function -- everything is just taken care of in defers. So in this
patch, introduce a cleanup() helper in the forwarding lib.sh, which calls
just pre_cleanup() and defer_scopes_cleanup(). Selftests are obviously
still free to override the function.

Since pre_cleanup() is too entangled with forwarding-specific minutia, the
function cannot currently be in net/lib.sh.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: net: lib: Introduce deferred commands

In commit 8510801a9dbd ("selftests: drv-net: add ability to schedule
cleanup with defer()"), a defer helper was added to Python selftests.
The idea is to keep cleanup commands close to their dirtying counterparts,
thereby making it more transparent what is cleaning up what, making it
harder to miss a cleanup, and make the whole cleanup business exception
safe. All these benefits are applicable to bash as well, exception safety
can be interpreted in terms of safety vs. a SIGINT.

This patch therefore introduces a framework of several helpers that serve
to schedule cleanups in bash selftests:

- defer_scope_push(), defer_scope_pop(): Deferred statements can be batched
  together in scopes. When a scope is popped, the deferred commands
  scheduled in that scope are executed in the order opposite to order of
  their scheduling.

- defer(): Schedules a defer to the most recently pushed scope (or the
  default scope if none was pushed.)

- defer_prio(): Schedules a defer on the priority track. The priority defer
  queue is run before the default defer queue when scope is popped.

  The issue that this is addressing is specifically the one of restoring
  devlink shared buffer threshold type. When setting up static thresholds,
  one has to first change the threshold type to static, then override the
  individual thresholds. When cleaning up, it would be natural to reset the
  threshold values first, then change the threshold type. But the values
  that are valid for dynamic thresholds are generally invalid for static
  thresholds and vice versa. Attempts to restore the values first would be
  bounced. Thus one has to first reset the threshold type, then adjust the
  thresholds.

  (You could argue that the shared buffer threshold type API is broken and
  you would be right, but here we are.)

  This cannot be solved by pure defers easily. I considered making it
  possible to disable an existing defer, so that one could then schedule a
  new defer and disable the original. But this forward-shifting of the
  defer job would have to take place after every threshold-adjusting
  command, which would make it very awkward to schedule these jobs.

- defer_scopes_cleanup(): Pops any unpopped scopes, including the default
  one. The selftests that use defer should run this in their exit trap.
  This is important to get cleanups of interrupted scripts.

- in_defer_scope(): Sometimes a function would like to introduce a new
  defer scope, then run whatever it is that it wants to run, and then pop
  the scope to run the deferred cleanups. The helper in_defer_scope() can
  be used to run another command within such environment, such that any
  scheduled defers run after the command finishes.

The framework is added as a separate file lib/sh/defer.sh so that it can be
used by all bash selftests, including those that do not currently use
lib.sh. lib.sh however includes the file by default, because ideally all
tests would use these helpers instead of hand-rolling their cleanups.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: marvell: Add mdix status reporting

Report MDI-X resolved state after link up.

Tested on Linkstreet 88E6193X internal PHYs.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241017015026.255224-1-paul.davey@alliedtelesis.co.nz
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Programming sequence for VLAN packets with split header

Currently reset state configuration of split header works fine for
non-tagged packets and we see no corruption in payload of any size

We need additional programming sequence with reset configuration to
handle VLAN tagged packets to avoid corruption in payload for packets
of size greater than 256 bytes.

Without this change ping application complains about corruption
in payload when the size of the VLAN packet exceeds 256 bytes.

With this change tagged and non-tagged packets of any size works fine
and there is no corruption seen.

Current configuration which has the issue for VLAN packet
----------------------------------------------------------

Split happens at the position at Layer 3 header
|MAC-DA|MAC-SA|Vlan Tag|Ether type|IP header|IP data|Rest of the payload|
                         2 bytes            ^
                                            |

With the fix we are making sure that the split happens now at
Layer 2 which is end of ethernet header and start of IP payload

Ip traffic split
-----------------

Bits which take care of this are SPLM and SPLOFST
SPLM = Split mode is set to Layer 2
SPLOFST = These bits indicate the value of offset from the beginning
of Length/Type field at which header split should take place when the
appropriate SPLM is selected. Reset value is 2bytes.

Un-tagged data (without VLAN)
|MAC-DA|MAC-SA|Ether type|IP header|IP data|Rest of the payload|
                  2bytes ^
|

Tagged data (with VLAN)
|MAC-DA|MAC-SA|VLAN Tag|Ether type|IP header|IP data|Rest of the payload|
                          2bytes  ^
  |

Non-IP traffic split such AV packet
------------------------------------

Bits which take care of this are
SAVE = Split AV Enable
SAVO = Split AV Offset, similar to SPLOFST but this is for AVTP
packets.

|Preamble|MAC-DA|MAC-SA|VLAN tag|Ether type|IEEE 1722 payload|CRC|
    2bytes ^
   |

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241016234313.3992214-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'rtnetlink-refactor-rtnl_-new-del-set-link-for-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
rtnetlink: Refactor rtnl_{new,del,set}link() for per-netns RTNL.

This is a prep for the next series where we will push RTNL down to
rtnl_{new,del,set}link().

That means, for example, __rtnl_newlink() is always under RTNL, but
rtnl_newlink() has a non-RTNL section.

As a prerequisite for per-netns RTNL, we will move netns validation
(and RTNL-independent validations if possible) to that section.

rtnl_link_ops and rtnl_af_ops will be protected with SRCU not to
depend on RTNL.

Changes:
  v2:
    * Add Eric's Reviewed-by to patch 1-4,6,8-11, (no tag on 5,7,12-14)
    * Patch 7
      * Handle error of init_srcu_struct().
      * Call cleanup_srcu_struct() after synchronize_srcu().
    * Patch 12
      * Move put_net() before errorout label
    * Patch 13
      * Newly added as prep for patch 14
    * Patch 14
      * Handle error of init_srcu_struct().
      * Call cleanup_srcu_struct() after synchronize_srcu().

  v1: https://lore.kernel.org/netdev/20241009231656.57830-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20241016185357.83849-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Protect struct rtnl_af_ops with SRCU.

Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_af_ops is alive during inflight RTM_SETLINK
even when its module is being unloaded.

Let's use SRCU to protect ops.

rtnl_af_lookup() now iterates rtnl_af_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_af_put()
to release the pointer after the use.

Also, rtnl_af_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_SETLINK requests to
complete.

Note that rtnl_af_ops needs to be protected by its dedicated lock
when RTNL is removed.

Note also that BUG_ON() in do_setlink() is changed to the normal
error handling as a different af_ops might be found after
validate_linkmsg().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Return int from rtnl_af_register().

The next patch will add init_srcu_struct() in rtnl_af_register(),
then we need to handle its error.

Let's add the error handling in advance to make the following
patch cleaner.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call rtnl_link_get_net_capable() in do_setlink().

We will push RTNL down to rtnl_setlink().

RTM_SETLINK could call rtnl_link_get_net_capable() in do_setlink()
to move a dev to a new netns, but the netns needs to be fetched before
holding rtnl_net_lock().

Let's move it to rtnl_setlink() and pass the netns to do_setlink().

Now, RTM_NEWLINK paths (rtnl_changelink() and rtnl_group_changelink())
can pass the prefetched netns to do_setlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Clean up rtnl_setlink().

We will push RTNL down to rtnl_setlink().

Let's unify the error path to make it easy to place rtnl_net_lock().

While at it, keep the variables in reverse xmas order.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Clean up rtnl_dellink().

We will push RTNL down to rtnl_delink().

Let's unify the error path to make it easy to place rtnl_net_lock().

While at it, keep the variables in reverse xmas order.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Fetch IFLA_LINK_NETNSID in rtnl_newlink().

Another netns option for RTM_NEWLINK is IFLA_LINK_NETNSID and
is fetched in rtnl_newlink_create().

This must be done before holding rtnl_net_lock().

Let's move IFLA_LINK_NETNSID processing to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call rtnl_link_get_net_capable() in rtnl_newlink().

As a prerequisite of per-netns RTNL, we must fetch netns before
looking up dev or moving it to another netns.

rtnl_link_get_net_capable() is called in rtnl_newlink_create() and
do_setlink(), but both of them need to be moved to the RTNL-independent
region, which will be rtnl_newlink().

Let's call rtnl_link_get_net_capable() in rtnl_newlink() and pass the
netns down to where needed.

Note that the latter two have not passed the nets to do_setlink() yet
but will do so after the remaining rtnl_link_get_net_capable() is moved
to rtnl_setlink() later.

While at it, dest_net is renamed to tgt_net in rtnl_newlink_create() to
align with rtnl_{del,set}link().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Protect struct rtnl_link_ops with SRCU.

Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_link_ops is alive during inflight RTM_NEWLINK
even when its module is being unloaded.

Let's use SRCU to protect ops.

rtnl_link_ops_get() now iterates link_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_link_ops_put()
to release the pointer after the use.

Also, __rtnl_link_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_NEWLINK requests to
complete.

Note that link_ops needs to be protected by its dedicated lock
when RTNL is removed.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move ops->validate to rtnl_newlink().

ops->validate() does not require RTNL.

Let's move it to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move rtnl_link_ops_get() and retry to rtnl_newlink().

Currently, if neither dev nor rtnl_link_ops is found in __rtnl_newlink(),
we release RTNL and redo the whole process after request_module(), which
complicates the logic.

The ops will be RTNL-independent later.

Let's move the ops lookup to rtnl_newlink() and do the retry earlier.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Move simple validation from __rtnl_newlink() to rtnl_newlink().

We will push RTNL down to rtnl_newlink().

Let's move RTNL-independent validation to rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Factorise do_setlink() path from __rtnl_newlink().

__rtnl_newlink() got too long to maintain.

For example, netdev_master_upper_dev_get()->rtnl_link_ops is fetched even
when IFLA_INFO_SLAVE_DATA is not specified.

Let's factorise the single dev do_setlink() path to a separate function.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Call validate_linkmsg() in do_setlink().

There are 3 paths that finally call do_setlink(), and validate_linkmsg()
is called in each path.

  1. RTM_NEWLINK
    1-1. dev is found in __rtnl_newlink()
    1-2. dev isn't found, but IFLA_GROUP is specified in
          rtnl_group_changelink()
  2. RTM_SETLINK

The next patch factorises 1-1 to a separate function.

As a preparation, let's move validate_linkmsg() calls to do_setlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rtnetlink: Allocate linkinfo[] as struct rtnl_newlink_tbs.

We will move linkinfo to rtnl_newlink() and pass it down to other
functions.

Let's pack it into rtnl_newlink_tbs.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mlx5-refactor-esw-qos-to-support-generalized-operations'

Tariq Toukan says:

====================
net/mlx5: Refactor esw QoS to support generalized operations

This patch series from the team to mlx5 core driver consists of one main
QoS part followed by small misc patches.

This main part (patches 1 to 11) by Carolina refactors the QoS handling
to generalize operations on scheduling groups and vports. These changes
are necessary to support new features that will extend group
functionality, introduce new group types, and support deeper
hierarchies.

Additionally, this refactor updates the terminology from "group" to
"node" to better reflect the hardware’s rate hierarchy and its use
of scheduling element nodes.

Simplify group scheduling element creation:
- net/mlx5: Refactor QoS group scheduling element creation

Refactor to support generalized operations for QoS:
- net/mlx5: Introduce node type to rate group structure
- net/mlx5: Add parent group support in rate group structure
- net/mlx5: Restrict domain list insertion to root TSAR ancestors
- net/mlx5: Rename vport QoS group reference to parent
- net/mlx5: Introduce node struct and rename group terminology to node
- net/mlx5: Refactor vport scheduling element creation function
- net/mlx5: Refactor vport QoS to use scheduling node structure
- net/mlx5: Remove vport QoS enabled flag

Support generalized operations for QoS elements:
- net/mlx5: Simplify QoS scheduling element configuration
- net/mlx5: Generalize QoS operations for nodes and vports

On top, patch 12 by Moshe handles FW request to move to drop mode.

In patch 13, Benjamin Poirier removes an empty eswitch flow table when
not used, which improves packet processing performance.

Patches 14 and 15 by Moshe are small field renamings as preparation for
future fields addition to these structures.

Series generated against:
commit c531f2269a53 ("net: bcmasp: enable SW timestamping")
====================

Link: https://patch.msgid.link/20241016173617.217736-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: fs, rename modify header struct member action

As preparation for HW Steering support, rename modify header struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.

Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: fs, rename packet reformat struct member action

As preparation for HW Steering support, rename packet reformat struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.

Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Only create VEPA flow table when in VEPA mode

Currently, when VFs are created, two flow tables are added for the eswitch:
the "fdb" table, which contains rules for each VF and the "vepa_fdb" table.
In the default VEB mode, the vepa_fdb table is empty. When switching to
VEPA mode, flow steering rules are added to vepa_fdb. Even though the
vepa_fdb table is empty in VEB mode, its presence adds some cost to packet
processing. In some workloads, this leads to drops which are reported by
the rx_discards_phy ethtool counter.

In order to improve performance, only create vepa_fdb when in VEPA mode.

Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between
both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as
reported by testpmd.

Without changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35257998,35264499
numvfs=1,1
24590124,24590888
testpmd on VF with numvfs=1,1
20434338,20434887
traffic to VF mac
testpmd on VF with numvfs=1,1
30341014,30340749

With changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35404361,35383378
numvfs=1,1
29801247,29790757
testpmd on VF with numvfs=1,1
24310435,24309084
traffic to VF mac
testpmd on VF with numvfs=1,1
34811436,34781706

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Add sync reset drop mode support

On sync reset flow, firmware may request a PF, which already
acknowledged the unload event, to move to drop mode. Drop mode means
that this PF will reduce polling frequency, as this PF is not going to
have another active part in the reset, but only reload back after the
reset.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Generalize QoS operations for nodes and vports

Refactor QoS normalization and rate calculation functions to operate
on mlx5_esw_sched_node, allowing for generalized handling of both
vports and nodes.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Simplify QoS scheduling element configuration

Simplify the configuration of QoS scheduling elements by removing the
separate functions `esw_qos_node_config` and `esw_qos_vport_config`.

Instead, directly use the existing `esw_qos_sched_elem_config` function
for both nodes and vports.

This unification helps in generalizing operations on scheduling
elements nodes.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Remove vport QoS enabled flag

Remove the `enabled` flag from the `vport->qos` struct, as QoS now
relies solely on the `sched_node` pointer to determine whether QoS
features are in use.

Currently, the vport `qos` struct consists only of the `sched_node`,
introducing an unnecessary two-level reference. However, the qos struct
is retained as it will be extended in future patches to support new QoS
features.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor vport QoS to use scheduling node structure

Refactor the vport QoS structure by moving group membership and
scheduling details into the `mlx5_esw_sched_node` structure.

This change consolidates the vport into the rate hierarchy by unifying
the handling of different types of scheduling element nodes.

In addition, add a direct reference to the mlx5_vport within the
mlx5_esw_sched_node structure, to ensure that the vport is easily
accessible when a scheduling node is associated with a vport.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor vport scheduling element creation function

Modify the vport scheduling element creation function to get the parent
node directly, aligning it with the group creation function.

This ensures a consistent flow for scheduling elements creation, as the
parent nodes already contain the device and parent element index.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Introduce node struct and rename group terminology to node

Introduce the `mlx5_esw_sched_node` struct, consolidating all rate
hierarchy related details, including membership and scheduling
parameters.

Since the group concept aligns with the `mlx5_esw_sched_node`, replace
the `mlx5_esw_rate_group` struct with it and rename the "group"
terminology to "node" throughout the rate hierarchy.

All relevant code paths and structures have been updated to use the
"node" terminology accordingly, laying the groundwork for future
patches that will unify the handling of different types of members
within the rate hierarchy.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Rename vport QoS group reference to parent

Rename the `group` field in the `mlx5_vport` structure to `parent` to
clarify the vport's role as a member of a parent group and distinguish
it from the concept of a general group.

Additionally, rename `group_entry` to `parent_entry` to reflect this
update.

This distinction will be important for handling more complex group
structures and scheduling elements.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Restrict domain list insertion to root TSAR ancestors

Update the logic for adding rate groups to the E-Switch domain list,
ensuring only groups with the root Transmit Scheduling Arbiter as their
parent are included.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Add parent group support in rate group structure

Introduce a `parent` field in the `mlx5_esw_rate_group` structure to
support hierarchical group relationships.

The `parent` can reference another group or be set to `NULL`,
indicating the group is connected to the root TSAR.

This change enables the ability to manage groups in a hierarchical
structure for future enhancements.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Introduce node type to rate group structure

Introduce the `sched_node_type` enum to represent both the group and
its members as scheduling nodes in the rate hierarchy.

Add the `type` field to the rate group structure to specify the type of
the node membership in the rate hierarchy.

Generalize comments to reflect this flexibility within the rate group
structure.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Refactor QoS group scheduling element creation

Introduce `esw_qos_create_group_sched_elem` to handle the creation of
group scheduling elements for E-Switch QoS, Transmit Scheduling
Arbiter (TSAR).

This reduces duplication and simplifies code for TSAR setup.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-support-of-hibmcge-ethernet-driver'

Jijie Shao says:

====================
Add support of HIBMCGE Ethernet Driver

This patch set adds the support of Hisilicon BMC Gigabit Ethernet Driver.

This patch set includes basic Rx/Tx functionality. It also includes
the registration and interrupt codes.

This work provides the initial support to the HIBMCGE and
would incrementally add features or enhancements.
====================

Link: https://patch.msgid.link/20241015123516.4035035-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add maintainer for hibmcge

Add myself as the maintainer for the hibmcge ethernet driver.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add a Makefile and update Kconfig for hibmcge

Add a Makefile and update Kconfig to build hibmcge driver.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement some ethtool_ops functions

Implement the .get_drvinfo .get_link .get_link_ksettings to get
the basic information and working status of the driver.
Implement the .set_link_ksettings to modify the rate, duplex,
and auto-negotiation status.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement rx_poll function to receive packets

Implement rx_poll function to read the rx descriptor after
receiving the rx interrupt. Adjust the skb based on the
descriptor to complete the reception of the packet.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement .ndo_start_xmit function

Implement .ndo_start_xmit function to fill the information of the packet
to be transmitted into the tx descriptor, and then the hardware will
transmit the packet using the information in the tx descriptor.
In addition, we also implemented the tx_handler function to enable the
tx descriptor to be reused, and .ndo_tx_timeout function to print some
information when the hardware is busy.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Implement some .ndo functions

Implement the .ndo_open() .ndo_stop() .ndo_set_mac_address()
and .ndo_change_mtu functions().
And .ndo_validate_addr calls the eth_validate_addr function directly

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add interrupt supported in this module

The driver supports four interrupts: TX interrupt, RX interrupt,
mdio interrupt, and error interrupt.

Actually, the driver does not use the mdio interrupt.
Therefore, the driver does not request the mdio interrupt.

The error interrupt distinguishes different error information
by using different masks. To distinguish different errors,
the statistics count is added for each error.

To ensure the consistency of the code process, masks are added for the
TX interrupt and RX interrupt.

This patch implements interrupt request, and provides a
unified entry for the interrupt handler function. However,
the specific interrupt handler function of each interrupt
is not implemented currently.

Because of pcim_enable_device(), the interrupt vector
is already device managed and does not need to be free actively.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add mdio and hardware configuration supported in this module

Implements the C22 read and write PHY registers interfaces.

Some hardware interfaces related to the PHY are also implemented
in this patch.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add read/write registers supported through the bar space

Add support for to read and write registers through the pic bar space.

Some driver parameters, such as mac_id, are determined by the
board form. Therefore, these parameters are initialized
from the register as device specifications.

the device specifications register are initialized and written by bmc.
driver will read these registers when loading.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: Add pci table supported in this module

Add pci table supported in this module, and implement pci_driver function
to initialize this driver.

hibmcge is a passthrough network device. Its software runs
on the host side, and the MAC hardware runs on the BMC side
to reduce the host CPU area. The software interacts with the
MAC hardware through the PCIe.

  ┌─────────────────────────┐
  │ HOST CPU network device │
  │    ┌──────────────┐     │
  │    │hibmcge driver│     │
  │    └─────┬─┬──────┘     │
  │          │ │            │
  │HOST  ┌───┴─┴───┐        │
  │      │ PCIE RC │        │
  └──────┴───┬─┬───┴────────┘
             │ │
            PCIE
             │ │
  ┌──────┬───┴─┴───┬────────┐
  │      │ PCIE EP │        │
  │BMC   └───┬─┬───┘        │
  │          │ │            │
  │ ┌────────┴─┴──────────┐ │
  │ │        GE           │ │
  │ │ ┌─────┐    ┌─────┐  │ │
  │ │ │ MAC │    │ MAC │  │ │
  └─┴─┼─────┼────┼─────┼──┴─┘
      │ PHY │    │ PHY │
      └─────┘    └─────┘

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: sfp: change quirks for Alcatel Lucent G-010S-P

Seems Alcatel Lucent G-010S-P also have the same problem that it uses
TX_FAULT pin for SOC uart. So apply sfp_fixup_ignore_tx_fault to it.

Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
Link: https://patch.msgid.link/TYCPR01MB84373677E45A7BFA5A28232C98792@TYCPR01MB8437.jpnprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.12-rc4).

Conflicts:

107a034d5c1e ("net/mlx5: qos: Store rate groups in a qos domain")
1da9cfd6c41c ("net/mlx5: Unregister notifier on eswitch init failure")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ftgmac100: correct the phy interface of NC-SI mode

In NC-SI specification, NC-SI is using RMII, not MII.

Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Message-ID: <20241018053331.1900100-1-jacky_chou@aspeedtech.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>

eth: Fix typo 'accelaration'. 'exprienced' and 'rewritting'

There are some spelling mistakes of 'accelaration', 'exprienced' and
'rewritting' in comments which should be 'acceleration', 'experienced'
and 'rewriting'.

Suggested-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/all/20241017162846.GA51712@kernel.org/
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <90D42CB167CA0842+20241018021910.31359-1-wangyuli@uniontech.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>