Git Repo - linux.git/log

eventpoll: Control irq suspension for prefer_busy_poll

When events are reported to userland and prefer_busy_poll is set, irqs
are temporarily suspended using napi_suspend_irqs.

If no events are found and ep_poll would go to sleep, irq suspension is
cancelled using napi_resume_irqs.

Signed-off-by: Martin Karsten <[email protected]>
Co-developed-by: Joe Damato <[email protected]>
Signed-off-by: Joe Damato <[email protected]>
Tested-by: Joe Damato <[email protected]>
Tested-by: Martin Karsten <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Sridhar Samudrala <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set

Setting prefer_busy_poll now leads to an effectively nonblocking
iteration though napi_busy_loop, even when busy_poll_usecs is 0.

Signed-off-by: Martin Karsten <[email protected]>
Co-developed-by: Joe Damato <[email protected]>
Signed-off-by: Joe Damato <[email protected]>
Tested-by: Joe Damato <[email protected]>
Tested-by: Martin Karsten <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Sridhar Samudrala <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: Add control functions for irq suspension

The napi_suspend_irqs routine bootstraps irq suspension by elongating
the defer timeout to irq_suspend_timeout.

The napi_resume_irqs routine effectively cancels irq suspension by
forcing the napi to be scheduled immediately.

Signed-off-by: Martin Karsten <[email protected]>
Co-developed-by: Joe Damato <[email protected]>
Signed-off-by: Joe Damato <[email protected]>
Tested-by: Joe Damato <[email protected]>
Tested-by: Martin Karsten <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Sridhar Samudrala <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: Add napi_struct parameter irq_suspend_timeout

Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.

This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.

Signed-off-by: Martin Karsten <[email protected]>
Co-developed-by: Joe Damato <[email protected]>
Signed-off-by: Joe Damato <[email protected]>
Tested-by: Joe Damato <[email protected]>
Tested-by: Martin Karsten <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Sridhar Samudrala <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: add unlocked version of bnxt_refclk_read

Serialization of PHC read with FW reset mechanism uses ptp_lock which
also protects timecounter updates. This means we cannot grab it when
called from bnxt_cc_read(). Let's move locking into different function.

Fixes: 6c0828d00f07 ("bnxt_en: replace PTP spinlock with seqlock")
Signed-off-by: Vadim Fedorenko <[email protected]>
Reviewed-by: Pavan Chebbi <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'rtnetlink-convert-rtnl_newlink-to-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
rtnetlink: Convert rtnl_newlink() to per-netns RTNL.

Patch 1 - 3 removes __rtnl_link_unregister and protect link_ops by
its dedicated mutex to move synchronize_srcu() out of RTNL scope.

Patch 4 introduces struct rtnl_nets and helper functions to acquire
multiple per-netns RTNL in rtnl_newlink().

Patch 5 - 8 are to prefetch the peer device's netns in rtnl_newlink().

Patch 9 converts rtnl_newlink() to per-netns RTNL.

Patch 10 pushes RTNL down to rtnl_dellink() and rtnl_setlink(), but
the conversion will not be completed unless we support cases with
peer/upper/lower devices.

I confirmed v3 survived ./rtnetlink.sh; rmmod netdevsim.ko; without
lockdep splat.

v3: https://lore.kernel.org/20241107022900 [email protected]
v2: https://lore.kernel.org/20241106022432 [email protected]
v1: https://lore.kernel.org/20241105020514 [email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.

Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.

For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.

The same situation happens when we remove a device.

rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be ?

Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().

This will unblock converting RTNL users where such devices are not related.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Convert RTM_NEWLINK to per-netns RTNL.

Now, we are ready to convert rtnl_newlink() to per-netns RTNL;
rtnl_link_ops is protected by SRCU and netns is prefetched in
rtnl_newlink().

Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and
push RTNL down as rtnl_nets_lock().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

netkit: Set IFLA_NETKIT_PEER_INFO to netkit_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vxcan: Set VXCAN_INFO_PEER to vxcan_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

veth: Set VETH_INFO_PEER to veth_link_ops.peer_type.

For per-netns RTNL, we need to prefetch the peer device's netns.

Let's set rtnl_link_ops.peer_type and accordingly remove duplicated
validation in ->newlink().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Add peer_type in struct rtnl_link_ops.

In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with
a net pointer, which is the first argument of ->newlink().

rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID
and IFLA_NET_NS_FD in the peer device's attributes.

We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink()
for per-netns RTNL.

All of the three get the peer netns in the same way:

  1. Call rtnl_nla_parse_ifinfomsg()
  2. Call ops->validate() (vxcan doesn't have)
  3. Call rtnl_link_get_net_tb()

Let's add a new field peer_type to struct rtnl_link_ops and prefetch
netns in the peer ifla to add it to rtnl_nets in rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Introduce struct rtnl_nets and helpers.

rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.

We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.

rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].

Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.

Let's apply the helpers to rtnl_newlink().

When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Remove __rtnl_link_register()

link_ops is protected by link_ops_mutex and no longer needs RTNL,
so we have no reason to have __rtnl_link_register() separately.

Let's remove it and call rtnl_link_register() from ifb.ko and
dummy.ko.

Note that both modules' init() work on init_net only, so we need
not export pernet_ops_rwsem and can use rtnl_net_lock() there.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Protect link_ops by mutex.

rtnl_link_unregister() holds RTNL and calls synchronize_srcu(),
but rtnl_newlink() will acquire SRCU frist and then RTNL.

Then, we need to unlink ops and call synchronize_srcu() outside
of RTNL to avoid the deadlock.

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

Let's move as such and add a mutex to protect link_ops.

Now, link_ops is protected by its dedicated mutex and
rtnl_link_register() no longer needs to hold RTNL.

While at it, we move the initialisation of ops->dellink and
ops->srcu out of the mutex scope.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rtnetlink: Remove __rtnl_link_unregister().

rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(),
where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests
for per-netns RTNL.

We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko
and dummy.ko.

However, rtnl_newlink() will acquire SRCU before RTNL later in this
series.  Then, lockdep will detect the deadlock:

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

To avoid the problem, we must call synchronize_srcu() before RTNL in
rtnl_link_unregister().

As a preparation, let's remove __rtnl_link_unregister().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: use helper r8169_mod_reg8_cond to simplify rtl_jumbo_config

Use recently added helper r8169_mod_reg8_cond() to simplify jumbo
mode configuration.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'selftests-ncdevmem-add-ncdevmem-to-ksft'

Stanislav Fomichev says:

====================
selftests: ncdevmem: Add ncdevmem to ksft

The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.

ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Add automated test

Only RX side for now and small message to test the setup.
In the future, we can extend it to TX side and to testing
both sides with a couple of megs of data.

  make \
   -C tools/testing/selftests \
   TARGETS="drivers/hw/net" \
   install INSTALL_PATH=~/tmp/ksft

  scp ~/tmp/ksft ${HOST}:
  scp ~/tmp/ksft ${PEER}:

  cfg+="NETIF=${DEV}\n"
  cfg+="LOCAL_V6=${HOST_IP}\n"
  cfg+="REMOTE_V6=${PEER_IP}\n"
  cfg+="REMOTE_TYPE=ssh\n"
  cfg+="REMOTE_ARGS=root@${PEER}\n"

  echo -e "$cfg" | ssh root@${HOST} "cat > ksft/drivers/net/net.config"
  ssh root@${HOST} "cd ksft && ./run_kselftest.sh -t drivers/net:devmem.py"

Reviewed-by: Mina Almasry <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Move ncdevmem under drivers/net/hw

This is where all the tests that depend on the HW functionality live in
and this is where the automated test is gonna be added in the next
patch.

Reviewed-by: Mina Almasry <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Run selftest when none of the -s or -c has been provided

This will be used as a 'probe' mode in the selftest to check whether
the device supports the devmem or not. Use hard-coded queue layout
(two last queues) and prevent user from passing custom -q and/or -t.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Remove hard-coded queue numbers

Use single last queue of the device and probe it dynamically.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Use YNL to enable TCP header split

In the next patch the hard-coded queue numbers are gonna be removed.
So introduce some initial support for ethtool YNL and use
it to enable header split.

Also, tcp-data-split requires latest ethtool which is unlikely
to be present in the distros right now.

(ideally, we should not shell out to ethtool at all).

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Properly reset flow steering

ntuple off/on might be not enough to do it on all NICs.
Add a bunch of shell crap to explicitly remove the rules.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Switch to AF_INET6

Use dualstack socket to support both v4 and v6. v4-mapped-v6 address
can be used to do v4.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Remove default arguments

To make it clear what's required and what's not. Also, some of the
values don't seem like a good defaults; for example eth1.

Move the invocation comment to the top, add missing -s to the client
and cleanup the client invocation a bit to make more readable.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Make client_ip optional

Support 3-tuple filtering by making client_ip optional. When -c is
not passed, don't specify src-ip/src-port in the filter.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Unify error handling

There is a bunch of places where error() calls look out of place.
Use the same error(1, errno, ...) pattern everywhere.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Separate out dmabuf provider

So we can plug the other ones in the future if needed.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: ncdevmem: Redirect all non-payload output to stderr

That should make it possible to do expected payload validation on
the caller side.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Joe Damato <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-stmmac-dwmac4-fixes-issues-in-dwmac4'

Ley Foon Tan says:

====================
net: stmmac: dwmac4: Fixes issues in dwmac4

This patch series fixes issues in the dwmac4 driver. These three patches
don't cause any user-visible issues, so they are targeted for net-next.

Patch #1:
Corrects the masking logic in the MTL Operation Mode RTC mask and shift
macros. The current code lacks the use of the ~ operator, which is
necessary to clear the bits properly.

Patch #2:
Addresses inaccuracies in the MTL_OP_MODE_*_MASK macros. The RTC fields
are located in bits [1:0], and this patch ensures the mask and shift
macros use the appropriate values to reflect this.

Patch #3:
Moves the handling of the Receive Watchdog Timeout (RWT) out of the
Abnormal Interrupt Summary (AIS) condition. According to the databook,
the RWT interrupt is not included in the AIS.

v1: https://lore.kernel.org/20241023112005 [email protected]
v2: https://lore.kernel.org/20241101082336.1552084 [email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: dwmac4: Receive Watchdog Timeout is not in abnormal interrupt summary

The Receive Watchdog Timeout (RWT, bit[9]) is not part of Abnormal
Interrupt Summary (AIS). Move the RWT handling out of the AIS
condition statement.

From databook, the AIS is the logical OR of the following interrupt bits:

- Bit 1: Transmit Process Stopped
- Bit 7: Receive Buffer Unavailable
- Bit 8: Receive Process Stopped
- Bit 10: Early Transmit Interrupt
- Bit 12: Fatal Bus Error
- Bit 13: Context Descriptor Error

Signed-off-by: Ley Foon Tan <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: dwmac4: Fix the MTL_OP_MODE_*_MASK operation

In order to mask off the bits, we need to use the '~' operator to invert
all the bits of _MASK and clear them.

Signed-off-by: Ley Foon Tan <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: dwmac4: Fix MTL_OP_MODE_RTC mask and shift macros

RTC fields are located in bits [1:0]. Correct the _MASK and _SHIFT
macros to use the appropriate mask and shift.

Signed-off-by: Ley Foon Tan <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: aquantia: Add mdix config and reporting

Add support for configuring MDI-X state of PHY.
Add reporting of resolved MDI-X state in status information.

Tested on AQR113C.

Signed-off-by: Paul Davey <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'introduce-vlan-support-in-hsr'

MD Danish Anwar says:

====================
Introduce VLAN support in HSR

This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.

[1] https://gist.githubusercontent.com/danish-ti/d309f92c640134ccc4f2c0c442de5be1/raw/9cfb5f8bd12b374ae591f4bd9ba3e91ae509ed4f/hsr_vlan_logs
v1 https://lore.kernel.org/all/20241004074715 [email protected]/
v2 https://lore.kernel.org/all/20241024103056.3201071 [email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: hsr: Add test for VLAN

Add test for VLAN ping for HSR. The test adds vlan interfaces to the hsr
interface and then verifies if ping to them works.

Signed-off-by: MD Danish Anwar <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ti: icssg-prueth: Add VLAN support for HSR mode

Add support for VLAN addition/deletion in HSR mode.
In HSR mode, even if the host port is not a member of
the VLAN domain, the slave ports should simply forward the
frames. So allow forwarding of all VLAN frames in HSR mode.

Signed-off-by: Ravi Gunasekaran <[email protected]>
Signed-off-by: MD Danish Anwar <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hsr: Add VLAN CTAG filter support

This patch adds support for VLAN ctag based filtering at slave devices.
The slave ethernet device may be capable of filtering ethernet packets
based on VLAN ID. This requires that when the VLAN interface is created
over an HSR/PRP interface, it passes the VID information to the
associated slave ethernet devices so that it updates the hardware
filters to filter ethernet frames based on VID. This patch adds the
required functions to propagate the vid information to the slave
devices.

Signed-off-by: Murali Karicheri <[email protected]>
Signed-off-by: MD Danish Anwar <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hsr: Add VLAN support

Add support for creating VLAN interfaces over HSR/PRP interface.

Signed-off-by: WingMan Kwok <[email protected]>
Signed-off-by: Murali Karicheri <[email protected]>
Signed-off-by: MD Danish Anwar <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'side-mdio-support-for-lan937x-switches'

Oleksij Rempel says:

====================
Side MDIO Support for LAN937x Switches

This patch set introduces support for an internal MDIO bus in LAN937x
switches, enabling the use of a side MDIO channel for PHY management
while keeping SPI as the main interface for switch configuration.

other changelogs are added to separate patches.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: parse PHY config from device tree

Introduce ksz_parse_dt_phy_config() to validate and parse PHY
configuration from the device tree for KSZ switches. This function
ensures proper setup of internal PHYs by checking `phy-handle`
properties, verifying expected PHY IDs, and handling parent node
mismatches. Sets the PHY mask on the MII bus if validation is
successful. Returns -EINVAL on configuration errors.

Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: add support for side MDIO interface in LAN937x

Implement side MDIO channel support for LAN937x switches, providing an
alternative to SPI for PHY management alongside existing SPI-based
switch configuration. This is needed to reduce SPI load, as SPI can be
relatively expensive for small packets compared to MDIO support.

Also, implemented static mappings for PHY addresses for various LAN937x
models to support different internal PHY configurations. Since the PHY
address mappings are not equal to the port indexes, this patch also
provides PHY address calculation based on hardware strapping
configuration.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: cleanup error handling in ksz_mdio_register

Replace repeated cleanup code with a single error path using a label.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: Refactor MDIO handling for side MDIO access

Add support for accessing PHYs via a side MDIO interface in LAN937x
switches. The existing code already supports accessing PHYs via main
management interfaces, which can be SPI, I2C, or MDIO, depending on the
chip variant. This patch enables using a side MDIO bus, where SPI is
used for the main switch configuration and MDIO for managing the
integrated PHYs. On LAN937x, this is optional, allowing them to operate
in both configurations: SPI only, or SPI + MDIO. Typically, the SPI
interface is used for switch configuration, while MDIO handles PHY
management.

Additionally, update interrupt controller code to support non-linear
port to PHY address mapping, enabling correct interrupt handling for
configurations where PHY addresses do not directly correspond to port
indexes. This change ensures that the interrupt mechanism properly
aligns with the new, flexible PHY address mappings introduced by side
MDIO support.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: dsa: microchip: add mdio-parent-bus property for internal MDIO

Introduce `mdio-parent-bus` property in the ksz DSA bindings to
reference the parent MDIO bus when the internal MDIO bus is attached to
it, bypassing the main management interface.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Rob Herring (Arm) <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: dsa: microchip: add internal MDIO bus description

Add description for the internal MDIO bus, including integrated PHY
nodes, to ksz DSA bindings.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Rob Herring (Arm) <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: atlantic: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
   order to prevent IRQs from being moved to certain CPUs that run a
   real-time workload.

2. atlantic device reopening will resets the affinity
   in aq_ndev_open().

3. atlantic has no idea about irqbalance's config, so it may move an IRQ to
   a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nfp: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
   order to prevent IRQs from being moved to certain CPUs that run a
   real-time workload.

2. nfp device reopening will resets the affinity
   in nfp_net_netdev_open().

3. nfp has no idea about irqbalance's config, so it may move an IRQ to
   a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Louis Peens <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: use irq_update_affinity_hint()

irq_set_affinity_hint() is deprecated, Use irq_update_affinity_hint()
instead. This removes the side-effect of actually applying the affinity.

The driver does not really need to worry about spreading its IRQs across
CPUs. The core code already takes care of that. when the driver applies the
affinities by itself, it breaks the users' expectations:

1. The user configures irqbalance with IRQBALANCE_BANNED_CPULIST in
    order to prevent IRQs from being moved to certain CPUs that run a
    real-time workload.

2. bnxt_en device reopening will resets the affinity
    in bnxt_open().

3. bnxt_en has no idea about irqbalance's config, so it may move an IRQ to
    a banned CPU. The real-time workload suffers unacceptable latency.

Signed-off-by: Mohammad Heib <[email protected]>
Reviewed-by: Andy Gospodarek <[email protected]>
Reviewed-by: Somnath Kotur <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rxrpc: Add a tracepoint for aborts being proposed

Add a tracepoint to rxrpc to trace the proposal of an abort. The abort is
performed asynchronously by the I/O thread.

Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: Simon Horman <[email protected]>
cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv6: Fix soft lockups in fib6_select_path under high next hop churn

Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.

Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.

Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.

Kernel log:

0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
    [exception RIP: fib6_select_path+299]
    RIP: ffffffff8ddafe7b  RSP: ffffbd13003d37b8  RFLAGS: 00000287
    RAX: ffff975850b43600  RBX: ffff975850b40200  RCX: 0000000000000000
    RDX: 000000003fffffff  RSI: 0000000051d383e4  RDI: ffff975850b43618
    RBP: ffffbd13003d3800   R8: 0000000000000000   R9: ffff975850b40200
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffbd13003d3830
    R13: ffff975850b436a8  R14: ffff975850b43600  R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <[email protected]>
Signed-off-by: Omid Ehtemam-Haghighi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Ido Schimmel <[email protected]>
Cc: Kuniyuki Iwashima <[email protected]>
Cc: Simon Horman <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'knobs-for-npc-default-rule-counters'

Linu Cherian says:

====================
Knobs for NPC default rule counters

Patch 1 introduce _rvu_mcam_remove/add_counter_from/to_rule
by refactoring existing code

Patch 2 adds a devlink param to enable/disable counters for default
rules. Once enabled, counters can

Patch 3 adds documentation for devlink params

v4: https://lore.kernel.org/20241029035739.1981839 [email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

devlink: Add documentation for OcteonTx2 AF

Add documentation for the following devlink params
- npc_mcam_high_zone_percent
- npc_def_rule_cntr
- nix_maxlf

Signed-off-by: Linu Cherian <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

octeontx2-af: Knobs for NPC default rule counters

Add devlink knobs to enable/disable counters on NPC
default rule entries.

Sample command to enable default rule counters:
devlink dev param set <dev> name npc_def_rule_cntr value true cmode runtime

Sample command to read the counter:
cat /sys/kernel/debug/cn10k/npc/mcam_rules

Signed-off-by: Linu Cherian <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

octeontx2-af: Refactor few NPC mcam APIs

Introduce lowlevel variant of rvu_mcam_remove/add_counter_from/to_rule
for better code reuse, which assumes necessary locks are taken at
higher level.

These low level functions would be used for implementing default rule
counter APIs in the subsequent patch.

Signed-off-by: Linu Cherian <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mlx5/core: deduplicate {mlx5_,}eq_update_ci()

The logic of eq_update_ci() is duplicated in mlx5_eq_update_ci(). The
only additional work done by mlx5_eq_update_ci() is to increment
eq->cons_index. Call eq_update_ci() from mlx5_eq_update_ci() to avoid
the duplication.

Signed-off-by: Caleb Sander Mateos <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mlx5/core: relax memory barrier in eq_update_ci()

The memory barrier in eq_update_ci() after the doorbell write is a
significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see
3% of CPU time spent on the mfence instruction.

98df6d5b877c ("net/mlx5: A write memory barrier is sufficient in EQ ci
update") already relaxed the full memory barrier to just a write barrier
in mlx5_eq_update_ci(), which duplicates eq_update_ci(). So replace mb()
with wmb() in eq_update_ci() too.

On strongly ordered architectures, no barrier is actually needed because
the MMIO writes to the doorbell register are guaranteed to appear to the
device in the order they were made. However, the kernel's ordered MMIO
primitive writel() lacks a convenient big-endian interface.
Therefore, we opt to stick with __raw_writel() + a barrier.

Signed-off-by: Caleb Sander Mateos <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'macsec-inherit-lower-device-s-features-and-tso-limits-when-offloading'

Sabrina Dubroca says:

====================
macsec: inherit lower device's features and TSO limits when offloading

When macsec is offloaded to a NIC, we can take advantage of some of
its features, mainly TSO and checksumming. This increases performance
significantly. Some features cannot be inherited, because they require
additional ops that aren't provided by the macsec netdevice.

We also need to inherit TSO limits from the lower device, like
VLAN/macvlan devices do.

This series also moves the existing macsec offload selftest to the
netdevsim selftests before adding tests for the new features. To allow
this new selftest to work, netdevsim's hw_features are expanded.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: netdevsim: add ethtool features to macsec offload tests

The test verifies that available features aren't changed by toggling
offload on the device. Creating a device with offload off and then
enabling it later should result in the same features as creating the
device with offload enabled directly.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/ba801bd0a75b02de2dddbfc77f9efceb8b3d8a2e.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: netdevsim: add test toggling macsec offload

The test verifies that toggling offload works (both via rtnetlink and
macsec's genetlink APIs). This is only possible when no SA is
configured.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/bf8e27ee0d921caa4eb35f1e830eca6d4080ddb2.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: move macsec offload tests from net/rtnetlink to drivers/net/netdvesim

We're going to expand this test, and macsec offload is only lightly
related to rtnetlink.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/a1f92c250cc129b4bb111a206c4b560bab4e24a5.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

macsec: inherit lower device's TSO limits when offloading

If macsec is offloaded, we need to follow the lower device's
capabilities, like VLAN devices do.

Leave the limits unchanged when the offload is disabled.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/8240c0181e851f169d815f59658a01fb9dfc5073.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

macsec: clean up local variables in macsec_notify

For all events, we need to loop over the list of secys, so let's move
the common variables out of the switch/case.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/9b8996af518fbeb3b7d527feb15d5788495e3108.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

macsec: add some of the lower device's features when offloading

This commit extends the set of netdevice features supported by macsec
devices when offload is enabled, which increases performance
significantly (for a single TCP stream: 17.5Gbps to 38.5Gbps on my
test machines).

Commit c850240b6c41 ("net: macsec: report real_dev features when HW
offloading is enabled") previously attempted something similar, but
had to be reverted (commit 8bcd560ae878 ("Revert "net: macsec: report
real_dev features when HW offloading is enabled"")) because the set of
features it exposed was too large.

During initialization, all features are set, and they're then removed
via ndo_fix_features (macsec_fix_features). This allows the
offloadable features to be automatically enabled if offloading is
turned on after device creation.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/8b32c3011d269d6f149724e80c1ffe67c9534067.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: netdevsim: add a test checking ethtool features

Add a test checking that some features are active by default and
changeable.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/fff58fa70f8a300440958b5020f6a4eb2e9dad61.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

netdevsim: add more hw_features

netdevsim currently only set HW_TC in its hw_features, but other
features should also be present to better reflect the behavior of real
HW.

In my macsec offload testing, this ends up as HW_CSUM being missing
from hw_features, so it doesn't stick in wanted_features when offload
is turned off. Then HW_CSUM (and thus TSO, thanks to
netdev_fix_features) is not automatically turned back on when offload
is re-enabled.

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/b918dc4dd76410a57f7516a855f66b0a2bd58326.1730929545.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'replace-page_frag-with-page_frag_cache-part-1'

Yunsheng Lin says:

====================
Replace page_frag with page_frag_cache (Part-1)

This is part 1 of "Replace page_frag with page_frag_cache",
which mainly contain refactoring and optimization for the
implementation of page_frag API before the replacing.

As the discussion in [1], it would be better to target net-next
tree to get more testing as all the callers page_frag API are
in networking, and the chance of conflicting with MM tree seems
low as implementation of page_frag API seems quite self-contained.

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().

This patchset tries to unfiy the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seems matter that much now as pcp
is also supported for high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")

As the related change is mostly related to networking, so
targeting the net-next. And will try to replace the rest
of page_frag in the follow patchset.

After this patchset:
1. Unify the page frag implementation by taking the best out of
   two the existing implementations: we are able to save some space
   for the 'page_frag_cache' API user, and avoid 'get_page()' for
   the old 'page_frag' API user.
2. Future bugfix and performance can be done in one place, hence
   improving maintainability of page_frag's implementation.

Kernel Image changing:
    Linux Kernel   total |      text      data        bss
    ------------------------------------------------------
    after     45250307 |   27274279   17209996     766032
    before    45254134 |   27278118   17209984     766032
    delta        -3827 |      -3839        +12         +0

1. https://lore.kernel.org/all/add10dd4-7f5d-4aa1-aa04-767590f944e0@redhat.com/
2. https://lore.kernel.org/all/20240228093013 [email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: page_frag: use __alloc_pages() to replace alloc_pages_node()

It seems there is about 24Bytes binary size increase for
__page_frag_cache_refill() after refactoring in arm64 system
with 64K PAGE_SIZE. By doing the gdb disassembling, It seems
we can have more than 100Bytes decrease for the binary size
by using __alloc_pages() to replace alloc_pages_node(), as
there seems to be some unnecessary checking for nid being
NUMA_NO_NODE, especially when page_frag is part of the mm
system.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'

Currently there is one 'struct page_frag' for every 'struct
sock' and 'struct task_struct', we are about to replace the
'struct page_frag' with 'struct page_frag_cache' for them.
Before begin the replacing, we need to ensure the size of
'struct page_frag_cache' is not bigger than the size of
'struct page_frag', as there may be tens of thousands of
'struct sock' and 'struct task_struct' instances in the
system.

By or'ing the page order & pfmemalloc with lower bits of
'va' instead of using 'u16' or 'u32' for page size and 'u8'
for pfmemalloc, we are able to avoid 3 or 5 bytes space waste.
And page address & pfmemalloc & order is unchanged for the
same page in the same 'page_frag_cache' instance, it makes
sense to fit them together.

After this patch, the size of 'struct page_frag_cache' should be
the same as the size of 'struct page_frag'.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

xtensa: remove the get_order() implementation

As the get_order() implemented by xtensa supporting 'nsau'
instruction seems be the same as the generic implementation
in include/asm-generic/getorder.h when size is not a constant
value as the generic implementation calling the fls*() is also
utilizing the 'nsau' instruction for xtensa.

So remove the get_order() implemented by xtensa, as using the
generic implementation may enable the compiler to do the
computing when size is a constant value instead of runtime
computing and enable the using of get_order() in BUILD_BUG_ON()
macro in next patch.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Acked-by: Max Filippov <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: page_frag: avoid caller accessing 'page_frag_cache' directly

Use appropriate frag_page API instead of caller accessing
'page_frag_cache' directly.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Acked-by: Chuck Lever <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: page_frag: use initial zero offset for page_frag_alloc_align()

We are about to use page_frag_alloc_*() API to not just
allocate memory for skb->data, but also use them to do
the memory allocation for skb frag too. Currently the
implementation of page_frag in mm subsystem is running
the offset as a countdown rather than count-up value,
there may have several advantages to that as mentioned
in [1], but it may have some disadvantages, for example,
it may disable skb frag coalescing and more correct cache
prefetching

We have a trade-off to make in order to have a unified
implementation and API for page_frag, so use a initial zero
offset in this patch, and the following patch will try to
make some optimization to avoid the disadvantages as much
as possible.

1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c [email protected]/

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: move the page fragment allocator from page_alloc into its own file

Inspired by [1], move the page fragment allocator from page_alloc
into its own c file and header file, as we are about to make more
change for it to replace another page_frag implementation in
sock.c

As this patchset is going to replace 'struct page_frag' with
'struct page_frag_cache' in sched.h, including page_frag_cache.h
in sched.h has a compiler error caused by interdependence between
mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler
error by moving 'struct page_frag_cache' to mm_types_task.h as
suggested by Alexander, see [3].

1. https://lore.kernel.org/all/20230411160902.4134381 [email protected]/
2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/
3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/
CC: David Howells <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Acked-by: Andrew Morton <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm: page_frag: add a test module for page_frag

The testing is done by ensuring that the fragment allocated
from a frag_frag_cache instance is pushed into a ptr_ring
instance in a kthread binded to a specified cpu, and a kthread
binded to a specified cpu will pop the fragment from the
ptr_ring and free the fragment.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: convert to nla_get_*_default()

Most of the original conversion is from the spatch below,
but I edited some and left out other instances that were
either buggy after conversion (where default values don't
fit into the type) or just looked strange.

    @@
    expression attr, def;
    expression val;
    identifier fn =~ "^nla_get_.*";
    fresh identifier dfn = fn ## "_default";
    @@
    (
    -if (attr)
    -  val = fn(attr);
    -else
    -  val = def;
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  val = def;
    -else
    -  val = fn(attr);
    +val = dfn(attr, def);
    |
    -if (!attr)
    -  return def;
    -return fn(attr);
    +return dfn(attr, def);
    |
    -attr ? fn(attr) : def
    +dfn(attr, def)
    |
    -!attr ? def : fn(attr)
    +dfn(attr, def)
    )

Signed-off-by: Johannes Berg <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid
Signed-off-by: Jakub Kicinski <[email protected]>

net: netlink: add nla_get_*_default() accessors

There are quite a number of places that use patterns
such as

  if (attr)
     val = nla_get_u16(attr);
  else
     val = DEFAULT;

Add nla_get_u16_default() and friends like that to
not have to type this out all the time.

Acked-by: Toke Høiland-Jørgensen <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
Link: https://patch.msgid.link/20241108114145.acd2aadb03ac.I3df6aac71d38a5baa1c0a03d0c7e82d4395c030e@changeid
Signed-off-by: Jakub Kicinski <[email protected]>

bridge: Allow deleting FDB entries with non-existent VLAN

It is currently impossible to delete individual FDB entries (as opposed
to flushing) that were added with a VLAN that no longer exists:

# ip link add name dummy1 up type dummy
# ip link add name br1 up type bridge vlan_filtering 1
# ip link set dev dummy1 master br1
# bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1
# bridge vlan del vid 1 dev dummy1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static
# bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1
RTNETLINK answers: Invalid argument
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static

This is in contrast to MDB entries that can be deleted after the VLAN
was deleted:

# bridge vlan add vid 10 dev dummy1
# bridge mdb add dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge vlan del vid 10 dev dummy1
# bridge mdb get dev br1 grp 239.1.1.1 vid 10
dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge mdb del dev br1 port dummy1 grp 239.1.1.1 permanent vid 10
# bridge mdb get dev br1 grp 239.1.1.1 vid 10
Error: bridge: MDB entry not found.

Align the two interfaces and allow user space to delete FDB entries that
were added with a VLAN that no longer exists:

# ip link add name dummy1 up type dummy
# ip link add name br1 up type bridge vlan_filtering 1
# ip link set dev dummy1 master br1
# bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1
# bridge vlan del vid 1 dev dummy1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static
# bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1
# bridge fdb get 00:11:22:33:44:55 br br1 vlan 1
Error: Fdb entry not found.

Add a selftest to make sure this behavior does not regress:

# ./rtnetlink.sh -t kci_test_fdb_del
PASS: bridge fdb del

Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Andy Roulin <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mlx5/core: Schedule EQ comp tasklet only if necessary

Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet
to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs
whose completions don't need to be processed in tasklet context, this
adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU
time spent on the tasklet_trylock() in tasklet_action_common(), with a
smaller amount spent on the atomic operations in tasklet_schedule(),
tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb().
TCP completions are handled by mlx5e_completion_event(), which schedules
NAPI to poll the queue, so they don't need tasklet processing.

Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this
overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs
to be processed in tasklet context, so it can schedule the tasklet. CQs
that need tasklet processing have their interrupt comp handler set to
mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that
don't need tasklet processing won't schedule the tasklet. To avoid
scheduling the tasklet multiple times during the same interrupt, only
schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work
queue was empty before the new CQ was pushed to it.

The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE,
may add a small cost for the userspace Infiniband CQs whose completions
are processed in tasklet context. But this seems worth it to avoid the
tasklet overhead for CQs that don't need it.

Note that the mlx4 driver works the same way: it schedules the tasklet
in mlx4_add_cq_to_tasklet() and only if the work queue was empty before.

Signed-off-by: Caleb Sander Mateos <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Acked-by: Tariq Toukan <[email protected]>
Acked-by: Saeed Mahameed <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'improve-neigh_flush_dev-performance'

Gilad Naaman says:

====================
Improve neigh_flush_dev performance

This patchsets improves the performance of neigh_flush_dev.

Currently, the only way to implement it requires traversing
all neighbours known to the kernel, across all network-namespaces.

This means that some flows are slowed down as a function of neigh-scale,
even if the specific link they're handling has little to no neighbours.

In order to solve this, this patchset adds a netdev->neighbours list,
as well as making the original linked-list doubly-, so that it is
possible to unlink neighbours without traversing the hash-bucket to
obtain the previous neighbour.

The original use-case we encountered was mass-deletion of links (12K
VLANs) while there are 50K ARPs and 50K NDPs in the system; though the
slowdowns would also appear when the links are set down.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Create netdev->neighbour association

Create a mapping between a netdev and its neighoburs,
allowing for much cheaper flushes.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Remove bare neighbour::next pointer

Remove the now-unused neighbour::next pointer, leaving struct neighbour
solely with the hlist_node implementation.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Convert iteration to use hlist+macro

Remove all usage of the bare neighbour::next pointer,
replacing them with neighbour::hash and its for_each macro.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Convert seq_file functions to use hlist

Convert seq_file-related neighbour functionality to use neighbour::hash
and the related for_each macro.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Define neigh_for_each_in_bucket

Introduce neigh_for_each_in_bucket in neighbour.h, to help iterate over
the neighbour table more succinctly.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

neighbour: Add hlist_node to struct neighbour

Add a doubly-linked node to neighbours, so that they
can be deleted without iterating the entire bucket they're in.

Signed-off-by: Gilad Naaman <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'r8169-improve-wol-suspend-related-code'

Heiner Kallweit says:

====================
r8169: improve wol/suspend-related code

This series improves wol/suspend-related code parts.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: align WAKE_PHY handling with r8125/r8126 vendor drivers

Vendor drivers r8125/r8126 apply this additional magic setting when
enabling WAKE_PHY, so do the same here.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: improve rtl_set_d3_pll_down

Make use of new helper r8169_mod_reg8_cond() and move from a switch()
to an if() clause. Benefit is that we don't have to touch this piece of
code each time support for a new chip version is added.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: improve __rtl8169_set_wol

Add helper r8169_mod_reg8_cond() what allows to significantly simplify
__rtl8169_set_wol().

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tc: fix typo probabilty in tc.yaml doc

Fix spelling of "probability" in tc.yaml documentation. This corrects
the max-P field description in struct tc_sfq_qopt_v1.

Signed-off-by: Abhinav Saxena <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mISDN: Fix typos

Fix typos:
  - syncronized -> synchronized
  - interfacs -> interface
  - otherwhise -> otherwise
  - ony -> only
  - busses -> buses
  - maxinum -> maximum

Via codespell.

Reported-by: Simon Horman <[email protected]>
Signed-off-by: Andrew Kreimer <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

hv_sock: Initializing vsk->trans to NULL to prevent a dangling pointer

When hvs is released, there is a possibility that vsk->trans may not
be initialized to NULL, which could lead to a dangling pointer.
This issue is resolved by initializing vsk->trans to NULL.

Signed-off-by: Hyunwoo Kim <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://patch.msgid.link/Zys4hCj61V+mQfX2@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <[email protected]>

net: sfc: use ethtool string helpers

The latter is the preferred way to copy ethtool strings.

Avoids manually incrementing the pointer. Cleans up the code quite well.

Signed-off-by: Rosen Penev <[email protected]>
Acked-by: Edward Cree <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mptcp: remove the redundant assignment of 'new_ctx->tcp_sock' in subflow_ulp_clone()

The variable has already been assigned in the subflow_create_ctx(),
So we don't need to reassign this variable in the subflow_ulp_clone().

Signed-off-by: MoYuanhao <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: mctp: Expose transport binding identifier via IFLA attribute

MCTP control protocol implementations are transport binding dependent.
Endpoint discovery is mandatory based on transport binding.
Message timing requirements are specified in each respective transport
binding specification.

However, we currently have no means to get this information from MCTP
links.

Add a IFLA_MCTP_PHYS_BINDING netlink link attribute, which represents
the transport type using the DMTF DSP0239-defined type numbers, returned
as part of RTM_GETLINK data.

We get an IFLA_MCTP_PHYS_BINDING attribute for each MCTP link, for
example:

- 0x00 (unspec) for loopback interface;
- 0x01 (SMBus/I2C) for mctpi2c%d interfaces; and
- 0x05 (serial) for mctpserial%d interfaces.

Signed-off-by: Khang Nguyen <[email protected]>
Reviewed-by: Matt Johnston <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bonding: add ESP offload features when slaves support

Add NETIF_F_GSO_ESP bit to bond's gso_partial_features if all slaves
support it, such that ESP segmentation is handled by hardware if possible.

Signed-off-by: Jianbo Liu <[email protected]>
Reviewed-by: Boris Pismenny <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'netlink-specs-add-neigh-and-rule-ynl-specs'

Donald Hunter says:

====================
netlink: specs: Add neigh and rule YNL specs

Add YNL specs for the FDB neighbour tables and FIB rules from the
rtnelink families.

Example usage:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_neigh.yaml \
    --dump getneigh
[{'cacheinfo': {'confirmed': 122664055,
                'refcnt': 0,
                'updated': 122658055,
                'used': 122658055},
  'dst': '0.0.0.0',
  'family': 2,
  'flags': set(),
  'ifindex': 5,
  'lladr': '',
  'probes': 0,
  'state': {'noarp'},
  'type': 'broadcast'},
  ...]

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_rule.yaml \
    --dump getrule --json '{"family": 2}'

[{'action': 'to-tbl',
  'dst-len': 0,
  'family': 2,
  'flags': 0,
  'protocol': 2,
  'src-len': 0,
  'suppress-prefixlen': '0xffffffff',
  'table': 255,
  'tos': 0},
  ... ]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

netlink: specs: Add a spec for FIB rule management

Add a YNL spec for FIB rules:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_rule.yaml \
    --dump getrule --json '{"family": 2}'

[{'action': 'to-tbl',
  'dst-len': 0,
  'family': 2,
  'flags': 0,
  'protocol': 2,
  'src-len': 0,
  'suppress-prefixlen': '0xffffffff',
  'table': 255,
  'tos': 0},
  ... ]

Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Donald Hunter <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

netlink: specs: Add a spec for neighbor tables in rtnetlink

Add a YNL spec for neighbour tables and neighbour entries in rtnetlink.

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_neigh.yaml \
    --dump getneigh
[{'cacheinfo': {'confirmed': 122664055,
                'refcnt': 0,
                'updated': 122658055,
                'used': 122658055},
  'dst': '0.0.0.0',
  'family': 2,
  'flags': set(),
  'ifindex': 5,
  'lladr': '',
  'probes': 0,
  'state': {'noarp'},
  'type': 'broadcast'},
  ...]

Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Donald Hunter <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>