Git Repo - linux.git/log

mlxsw: spectrum_fid: Add an op to get PGT allocation size

In the CFF flood mode, the PGT allocation size of RFID family will not
depend on number of FIDs, but rather number of ports and LAGs. Therefore
introduce a FID family operation to calculate the PGT allocation size.

The way that size is calculated in the CFF mode depends on calling fallible
functions. Thus express the op as returning an int, with the size returned
via a pointer argument.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/1174651b7160fcedbef50010ae4b68201112fe6f.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Add an op for flood table initialization

In controlled flood mode, for each bridge FID family (i.e., 802.1Q and
802.1D) and packet type (i.e., UUC/MC/BC), the hardware needs to be told
which PGT address to use as the base address for the flood table and how
to determine the offset from the base for each FID.

The above is not needed in CFF mode where each FID has its own flood
table instead of the FID family itself.

Therefore, create a new FID family operation for the above configuration
and only implement it for the 802.1Q and 802.1D families in controlled
flood mode.

No functional changes intended.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/06f71415eec75811585ec597e1dd101b6dff77e7.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Move mlxsw_sp_fid_flood_table_init() up

Move the function to the point where it will need to be to be visible for
the 802.1d ops.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/aef09e26b0c2dd077531e665d7135b300bdaf0a8.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Make mlxsw_sp_fid_ops.setup return an int

This operation will be fallible for rFIDs in CFF mode, which will be
introduced in follow-up patches. Have it return an int, and handle
the failures in the caller.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/75f1b85c0cb86bea5501fcc8657042f221a78b32.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Split a helper out of mlxsw_sp_fid_flood_table_mid()

In future patches, for CFF flood mode support, we will need a way to
determine a PGT base dynamically, as an op. Therefore, for symmetry,
split out a helper, mlxsw_sp_fid_pgt_base_ctl(), that determines a PGT base
in the controlled mode as well.

Now that the helper is available, use it in mlxsw_sp_fid_flood_table_init()
which currently invokes the FID->MID helper to that end.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/fd41c66a1df4df6499d3da34f40e7b9efa15bc3e.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Rename FID ops, families, arrays

Currently, mlxsw always uses a "controlled" flood mode on all Nvidia
Spectrum generations. The following patches will however introduce a
possibility to run a "CFF" (for Compressed FID Flooding) mode on newer
machines, if the FW supports it.

To reflect that, label all FID ops, FID families and FID family arrays with
a _ctl suffix. This will make it clearer what is what when the CFF families
are introduced in later patches.

Keep the dummy family intact. Since the dummy family has no flood tables
in either CTL or CFF mode, there are no flood-mode-specific callbacks.

Additionally, add a remark at two fields that they are only relevant when
flood mode is not CFF.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/96b6da5439bb662fa86e795bbcec9dc3ccfa59fd.1701183892.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Privatize FID families

Currently, mlxsw always uses a "controlled" flood mode on all Nvidia
Spectrum generations. The following patches will however introduce a
possibility to run a "CFF" (for Compressed FID Flooding) mode on newer
machines, if the FW supports it.

Several operations will differ between how they need to be done in
controlled mode vs. CFF mode. Thus the per-FID-family ops will differ
between controlled and CFF, thus the FID family array as such will
differ depending on whether the mode negotiated with FW is controlled
or CFF.

The simple approach of having several globally visible arrays for
spectrum.c to statically choose from no longer works. Instead privatize all
FID initialization and finalization logic, and expose it as ops instead.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/d3fa390d97cf3dbd2f7a28741be69b311e2059e4.1701183891.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: aquantia: drop wrong endianness conversion for addr and CRC

On further testing on BE target with kernel test robot, it was notice
that the endianness conversion for addr and CRC in fw_load_memory was
wrong.

Drop the cpu_to_le32 conversion for addr load as it's not needed.

Use get_unaligned_le32 instead of get_unaligned for FW data word load to
correctly convert data in the correct order to follow system endian.

Also drop the cpu_to_be32 for CRC calculation as it's wrong and would
cause different CRC on BE system.
The loaded word is swapped internally and MAILBOX calculates the CRC on
the swapped word. To correctly calculate the CRC to be later matched
with the one from MAILBOX, use an u8 struct and swap the word there to
keep the same order on both LE and BE for crc_ccitt_false function.
Also add additional comments on how the CRC verification for the loaded
section works.

CRC is calculated as we load the section and verified with the MAILBOX
only after the entire section is loaded to skip additional slowdown by
loop the section data again.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Fixes: e93984ebc1c8 ("net: phy: aquantia: add firmware load support")
Tested-by: Robert Marko <[email protected]> # ipq8072 LE device
Signed-off-by: Christian Marangi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Fix VF Reset paths when interface in a failed over aggregate

There is an error when an interface has the following conditions:
- PF is in an aggregate (bond)
- PF has VFs created on it
- bond is in a state where it is failed-over to the secondary interface
- A VF reset is issued on one or more of those VFs

The issue is generated by the originating PF trying to rebuild or
reconfigure the VF resources. Since the bond is failed over to the
secondary interface the queue contexts are in a modified state.

To fix this issue, have the originating interface reclaim its resources
prior to the tear-down and rebuild or reconfigure. Then after the process
is complete, move the resources back to the currently active interface.

There are multiple paths that can be used depending on what triggered the
event, so create a helper function to move the queues and use paired calls
to the helper (back to origin, process, then move back to active interface)
under the same lag_mutex lock.

Fixes: 1e0f9881ef79 ("ice: Flesh out implementation of support for SRIOV on bonded interface")
Signed-off-by: Dave Ertman <[email protected]>
Tested-by: Sujai Buvaneswaran <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: adin: allow control of Fast Link Down

Add support to allow Fast Link Down (aka "Enhanced link detection") to
be controlled via the ETHTOOL_PHY_FAST_LINK_DOWN tunable. These PHYs
have this feature enabled by default.

Signed-off-by: Vincent Whitchurch <[email protected]>
Acked-by: Nuno Sa <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: improve handling task scheduling

If we know that the task is going to be a no-op, don't even schedule it.
And remove the check for netif_running() in the worker function, the
check for flag RTL_FLAG_TASK_ENABLED is sufficient. Note that we can't
remove the check for flag RTL_FLAG_TASK_ENABLED in the worker function
because we have no guarantee when it will be executed.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
wireless fixes:
- debugfs had a deadlock (removal vs. use of files),
   fixes going through wireless ACKed by Greg
- support for HT STAs on 320 MHz channels, even if it's
   not clear that should ever happen (that's 6 GHz), best
   not to WARN()
- fix for the previous CQM fix that broke most cases
- various wiphy locking fixes
- various small driver fixes

* tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: use wiphy locked debugfs for sdata/link
  wifi: mac80211: use wiphy locked debugfs helpers for agg_status
  wifi: cfg80211: add locked debugfs wrappers
  debugfs: add API to allow debugfs operations cancellation
  debugfs: annotate debugfs handlers vs. removal with lockdep
  debugfs: fix automount d_fsdata usage
  wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_cap
  wifi: avoid offset calculation on NULL pointer
  wifi: cfg80211: hold wiphy mutex for send_interface
  wifi: cfg80211: lock wiphy mutex for rfkill poll
  wifi: cfg80211: fix CQM for non-range use
  wifi: mac80211: do not pass AP_VLAN vif pointer to drivers during flush
  wifi: iwlwifi: mvm: fix an error code in iwl_mvm_mld_add_sta()
  wifi: mt76: mt7925: fix typo in mt7925_init_he_caps
  wifi: mt76: mt7921: fix 6GHz disabled by the missing default CLC config
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-11-30

We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 10 files changed, 66 insertions(+), 15 deletions(-).

The main changes are:

1) Fix AF_UNIX splat from use after free in BPF sockmap,
   from John Fastabend.

2) Fix a syzkaller splat in netdevsim by properly handling offloaded
   programs (and not device-bound ones), from Stanislav Fomichev.

3) Fix bpf_mem_cache_alloc_flags() to initialize the allocation hint,
   from Hou Tao.

4) Fix netkit by rejecting IFLA_NETKIT_PEER_INFO in changelink,
   from Daniel Borkmann.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  netkit: Reject IFLA_NETKIT_PEER_INFO in netkit_change_link
  bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
  netdevsim: Don't accept device bound programs
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'create-a-binding-for-the-marvell-mv88e6xxx-dsa-switches'

Linus Walleij says:

====================
Create a binding for the Marvell MV88E6xxx DSA switches

The Marvell switches are lacking DT bindings.

I need proper schema checking to add LED support to the
Marvell switch. Just how it is, it can't go on like this.

Some Device Tree fixes are included in the series, these
remove the major and most annoying warnings fallout noise:
some warnings remain, and these are of more serious nature,
such as missing phy-mode. They can be applied individually,
or to the networking tree with the rest of the patches.

Thanks to Andrew Lunn, Vladimir Oltean and Russell King
for excellent review and feedback!

This latest version employs special compatibles in the
odd ABI device trees.

v8: https://lore.kernel.org/r/20231114-marvell-88e6152-wan-led-v8-0-50688741691b@linaro.org
v7: https://lore.kernel.org/r/20231024-marvell-88e6152-wan-led-v7-0-2869347697d1@linaro.org
v6: https://lore.kernel.org/r/20231024-marvell-88e6152-wan-led-v6-0-993ab0949344@linaro.org
v5: https://lore.kernel.org/r/20231023-marvell-88e6152-wan-led-v5-0-0e82952015a7@linaro.org
v4: https://lore.kernel.org/r/20231018-marvell-88e6152-wan-led-v4-0-3ee0c67383be@linaro.org
v3: https://lore.kernel.org/r/20231016-marvell-88e6152-wan-led-v3-0-38cd449dfb15@linaro.org
v2: https://lore.kernel.org/r/20231014-marvell-88e6152-wan-led-v2-0-7fca08b68849@linaro.org
v1: https://lore.kernel.org/r/20231013-marvell-88e6152-wan-led-v1-0-0712ba99857c@linaro.org
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: marvell: Add Marvell MV88E6060 DSA schema

The Marvell MV88E6060 is one of the oldest DSA switches from
Marvell, and it has DT bindings used in the wild. Let's define
them properly.

It is different enough from the rest of the MV88E6xxx switches
that it deserves its own binding.

Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: marvell: Rewrite MV88E6xxx in schema

This is an attempt to rewrite the Marvell MV88E6xxx switch bindings
in YAML schema.

The current text binding says:
  WARNING: This binding is currently unstable. Do not program it into a
  FLASH never to be changed again. Once this binding is stable, this
  warning will be removed.

Well that never happened before we switched to YAML markup,
we can't have it like this, what about fixing the mess?

Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: ethernet-switch: Accept special variants

Accept special node naming variants for Marvell switches with
special node names as ABI.

This is maybe not the prettiest but it avoids special-casing
the Marvell MV88E6xxx bindings by copying a lot of generic
binding code down into that one binding just to special-case
these unfixable nodes.

Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: mvusb: Fix up DSA example

When adding a proper schema for the Marvell mx88e6xxx switch,
the scripts start complaining about this embedded example:

  dtschema/dtc warnings/errors:
  net/marvell,mvusb.example.dtb: switch@0: ports: '#address-cells'
  is a required property
  from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml#
  net/marvell,mvusb.example.dtb: switch@0: ports: '#size-cells'
  is a required property
  from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml#

Fix this up by extending the example with those properties in
the ports node.

While we are at it, rename "ports" to "ethernet-ports" and rename
"switch" to "ethernet-switch" as this is recommended practice.

Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: dsa: Require ports or ethernet-ports

Bindings using dsa.yaml#/$defs/ethernet-ports specify that
a DSA switch node need to have a ports or ethernet-ports
subnode, and that is actually required, so add requirements
using oneOf.

Suggested-by: Rob Herring <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'tools-ynl-fixes-for-the-page-pool-sample-and-the-generation-process'

Jakub Kicinski says:

====================
tools: ynl: fixes for the page-pool sample and the generation process

Minor fixes to the new sample and the Makefiles.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: don't skip regeneration from make targets

Commit 2b7ac0c87d98 ("tools: ynl-gen: don't touch the output file if
content is the same") is working too well. It was added so that
ynl-regen -f doesn't make us rebuild half of the kernel, if there
are no actual changes in any generated code.

When ynl-gen-c is called by make, however, we're better off trusting
make's tracking and overwrite the file. Otherwise if output is identical
we won't update file timestamps and make will retry code gen on every
invocation.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: order building samples after generated code

Parallel builds of ynl:

make -C tools/net/ynl/ -j 4

don't work correctly right now. samples get handled before
generated, so build of samples does not notice that protos.a
has changed. Order samples to be last.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: make sure we use local headers for page-pool

Building samples generates the following warning:

  In file included from page-pool.c:11:
  generated/netdev-user.h:21:45: warning: ‘enum netdev_xdp_rx_metadata’ declared inside parameter list will not be visible outside of this definition or declaration
   21 | const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value);
      |                                             ^~~~~~~~~~~~~~~~~~~~~~

Our magic way of including uAPI headers assumes the sample
name matches the family name. We need to copy the flags over.

Fixes: 637567e4a3ef ("tools: ynl: add sample for getting page-pool information")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: fix build of the page-pool sample

The name of the "destroyed" field in the reply was not changed
in the sample after we started calling it "detach_time".

page-pool.c: In function ‘main’:
page-pool.c:84:33: error: ‘struct <anonymous>’ has no member named ‘destroyed’
84 | if (pp->_present.destroyed)
| ^

Fixes: 637567e4a3ef ("tools: ynl: add sample for getting page-pool information")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bpf, sockmap: Add af_unix test with both sockets in map

This adds a test where both pairs of a af_unix paired socket are put into a
BPF map. This ensures that when we tear down the af_unix pair we don't have
any issues on sockmap side with ordering and reference counting.

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Jakub Sitnicki <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf, sockmap: af_unix stream sockets need to hold ref for pair sock

AF_UNIX stream sockets are a paired socket. So sending on one of the pairs
will lookup the paired socket as part of the send operation. It is possible
however to put just one of the pairs in a BPF map. This currently increments
the refcnt on the sock in the sockmap to ensure it is not free'd by the
stack before sockmap cleans up its state and stops any skbs being sent/recv'd
to that socket.

But we missed a case. If the peer socket is closed it will be free'd by the
stack. However, the paired socket can still be referenced from BPF sockmap
side because we hold a reference there. Then if we are sending traffic through
BPF sockmap to that socket it will try to dereference the free'd pair in its
send logic creating a use after free. And following splat:

   [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
   [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
   [...]
   [59.905468] Call Trace:
   [59.905787]  <TASK>
   [59.906066]  dump_stack_lvl+0x130/0x1d0
   [59.908877]  print_report+0x16f/0x740
   [59.910629]  kasan_report+0x118/0x160
   [59.912576]  sk_wake_async+0x31/0x1b0
   [59.913554]  sock_def_readable+0x156/0x2a0
   [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
   [59.916398]  sock_sendmsg+0x20e/0x250
   [59.916854]  skb_send_sock+0x236/0xac0
   [59.920527]  sk_psock_backlog+0x287/0xaa0

To fix let BPF sockmap hold a refcnt on both the socket in the sockmap and its
paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The
primarily problem with keeping this logic in bpf_unix was: In the sock close()
we could handle the deref by having a close handler. But, when we are destroying
the psock through a map delete operation we wouldn't have gotten any signal
thorugh the proto struct other than it being replaced. If we do the deref from
the proto replace its too early because we need to deref the sk_pair after the
backlog worker has been stopped.

Given all this it seems best to just cache it at the end of the psock and eat 8B
for the af_unix and vsock users. Notice dgram sockets are OK because they handle
locking already.

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Jakub Sitnicki <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge branch 'xsk-tx-metadata'

Stanislav Fomichev says:

====================
xsk: TX metadata

This series implements initial TX metadata (offloads) for AF_XDP.
See patch #2 for the main implementation and mlx5/stmmac ones for the
example on how to consume the metadata on the device side.

Starting with two types of offloads:
- request TX timestamp (and write it back into the metadata area)
- request TX checksum offload

Changes since v5:
- preserve xsk_tx_metadata flags across tx and completion by moving
  them out of 'request' union (Jesper)
- fix xdp_metadata checksum failure in big endian (Alexei)
- add SPDX to xdp-rx-metadata.rst (Simon)

v5: https://lore.kernel.org/bpf/20231102225837.1141915 [email protected]/

Performance (mlx5):

I've implemented a small xskgen tool to try to saturate single tx queue:
https://github.com/fomichev/xskgen/tree/master

Here are the performance numbers with some analysis.

1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
Support bpf_get_func_ip helper in uprobes'"), nothing from this series:

- with 1400 bytes of payload: 98 gbps, 8 mpps
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps

- with 200 bytes of payload: 49 gbps, 23 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps

2. Adding single commit that supports reserving tx_metadata_len
   changes nothing numbers-wise.

- baseline for 1400
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps

- baseline for 200
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps

3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
   doesn't set XDP_TX_METADATA descriptor option.

- new baseline for 1400 (with only filling the metadata)
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps

- new baseline for 200 (with only filling the metadata)
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
(the numbers go sligtly up here, not really sure why, maybe some cache-related
side-effects?

4. Next, I'm running the same test but with the commit that adds actual
   general infra to parse XDP_TX_METADATA (but no driver support).
   Essentially applying "xsk: add TX timestamp and TX checksum offload support"
   from this series. Numbers are the same.

- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps

- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps

- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps

- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps

5. Now, for the most interesting part, I'm adding mlx5 driver support.
   The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
   _only_ when I enable the metadata. This looks like a side effect
   of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
   Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
   to not affect the existing non-metadata use-cases. Since this is not
   regressing existing workloads, I'm not spending any time trying to
   optimize it more (and leaving it up to mlx owners to purse if
   they see any good way to do it).

- same baseline
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps

./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps

- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps

- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps

- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps

- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps

Acked-by: Magnus Karlsson <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add TX side to xdp_hw_metadata

When we get a packet on port 9091, we swap src/dst and send it out.
At this point we also request the timestamp and checksum offloads.

Checksum offload is verified by looking at the tcpdump on the other side.
The tool prints pseudo-header csum and the final one it expects.
The final checksum actually matches the incoming packets checksum
because we only flip the src/dst and don't change the payload.

Some other related changes:
- switched to zerocopy mode by default; new flag can be used to force
  old behavior
- request fixed tx_metadata_len headroom
- some other small fixes (umem size, fill idx+i, etc)

mvbz3:~# ./xdp_hw_metadata eth3
...
xsk_ring_cons__peek: 1
0x19546f8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0x80B7EA8B with RSS type:0x2A
rx_timestamp:  1697580171852147395 (sec:1697580171.8521)
HW RX-time:   1697580171852147395 (sec:1697580171.8521), delta to User RX-time sec:0.2797 (279673.082 usec)
XDP RX-time:   1697580172131699047 (sec:1697580172.1317), delta to User RX-time sec:0.0001 (121.430 usec)
0x19546f8: ping-pong with csum=3b8e (want d862) csum_start=54 csum_offset=6
0x19546f8: complete tx idx=0 addr=8
tx_timestamp:  1697580172056756493 (sec:1697580172.0568)
HW TX-complete-time:   1697580172056756493 (sec:1697580172.0568), delta to User TX-complete-time sec:0.0852 (85175.537 usec)
XDP RX-time:   1697580172131699047 (sec:1697580172.1317), delta to User TX-complete-time sec:0.0102 (10232.983 usec)
HW RX-time:   1697580171852147395 (sec:1697580171.8521), delta to HW TX-complete-time sec:0.2046 (204609.098 usec)
0x19546f8: complete rx idx=128 addr=80100

mvbz4:~# nc  -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091

mvbz4:~# tcpdump -vvx -i eth3 udp
        tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
        0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1087 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1077 d9ff 2383 000b 3b8e
        0x0030:  7864 70
12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
        0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1077 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1087 2383 d9ff 000b de7e
        0x0030:  7864 70

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP

This is the recommended way to run AF_XDP, so let's use it in the test.

Also, some unrelated changes to now blow up the log too much:
- change default mode to zerocopy and add -c to use copy mode
- small fixes for the flags/sizes/prints
- add print_tstamp_delta to print timestamp + reference

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add TX side to xdp_metadata

Request TX timestamp and make sure it's not empty.
Request TX checksum offload (SW-only) and make sure it's resolved
to the correct one.

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add csum helpers

Checksum helpers will be used to calculate pseudo-header checksum in
AF_XDP metadata selftests.

The helpers are mirroring existing kernel ones:
- csum_tcpudp_magic : IPv4 pseudo header csum
- csum_ipv6_magic : IPv6 pseudo header csum
- csum_fold : fold csum and do one's complement

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/xsk: Support tx_metadata_len

Add new config field and propagate to UMEM registration setsockopt.

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Add option to calculate TX checksum in SW

For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM
to call skb_checksum_help in transmit path. Might be useful
to debugging issues with real hardware. I also use this mode
in the selftests.

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Validate xsk_tx_metadata flags

Accept only the flags that the kernel knows about to make
sure we can extend this field in the future. Note that only
in XDP_COPY mode we propagate the error signal back to the user
(via sendmsg). For zerocopy mode we silently skip the metadata
for the descriptors that have wrong flags (since we process
the descriptors deep in the driver).

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Document tx_metadata_len layout

- how to use
- how to query features
- pointers to the examples

Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

net: stmmac: Add Tx HWTS support to XDP ZC

This patch enables transmit hardware timestamp support to XDP zero copy
via XDP Tx metadata framework.

This patchset is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel Tiger Lake platform. Below are the test steps and results.

Command on DUT:
sudo ./xdp_hw_metadata <interface name>
sudo hwstamp_ctl -i <interface name> -t 1 -r 1

Command on Link Partner:
echo -n xdp | nc -u -q1 <destination IPv4 addr> 9091

Result:
xsk_ring_cons__peek: 1
0x55bbbf08b6d0: rx_desc[2]->addr=8c100 addr=8c100 comp_addr=8c100 EoP
No rx_hash err=-95
rx_timestamp:  1677762688429141540 (sec:1677762688.4291)
HW RX-time:   1677762688429141540 (sec:1677762688.4291) delta to User RX-time sec:0.0003 (250.665 usec)
XDP RX-time:   1677762688429375597 (sec:1677762688.4294) delta to User RX-time sec:0.0000 (16.608 usec)
0x55bbbf08b6d0: ping-pong with csum=561c (want f488) csum_start=34 csum_offset=6
0x55bbbf08b6d0: complete tx idx=2 addr=2008
tx_timestamp:  1677762688431127273 (sec:1677762688.4311)
HW TX-complete-time:   1677762688431127273 (sec:1677762688.4311) delta to User TX-complete-time sec:0.0083 (8331.655 usec)
XDP RX-time:   1677762688429375597 (sec:1677762688.4294) delta to User TX-complete-time sec:0.0101 (10083.331 usec)
HW RX-time:   1677762688429141540 (sec:1677762688.4291) delta to HW TX-complete-time sec:0.0020 (1985.733 usec)
0x55bbbf08b6d0: complete rx idx=130 addr=8c100

Signed-off-by: Song Yoong Siang <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

net/mlx5e: Implement AF_XDP TX timestamp and checksum offload

TX timestamp:
- requires passing clock, not sure I'm passing the correct one (from
cq->mdev), but the timestamp value looks convincing

TX checksum:
- looks like device does packet parsing (and doesn't accept custom
start/offset), so I'm ignoring user offsets

Cc: Saeed Mahameed <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

tools: ynl: Print xsk-features from the sample

In a similar fashion we do for the other bit masks.
Fix mask parsing (>= vs >) while we are it.

Signed-off-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Add TX timestamp and TX checksum offload support

This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).

The struct is forward-compatible and can be extended in the future
by appending more fields.

Reviewed-by: Song Yoong Siang <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Support tx_metadata_len

For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
timestamp in the completion)
- non-zero

This data is not interpreted in any way right now.

Reviewed-by: Song Yoong Siang <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

tools: ynl-gen: always construct struct ynl_req_state

struct ynl_req_state carries reply-related info from generated code
into generic YNL code. While we don't need reply info to execute
a request without a reply, we still need to pass in the struct, because
it's also where we get the pointer to struct ynl_sock from. Passing NULL
results in crashes if kernel returns an error or an unexpected reply.

Fixes: dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ethtool: don't propagate EOPNOTSUPP from dumps

The default dump handler needs to clear ret before returning.
Otherwise if the last interface returns an inconsequential
error this error will propagate to user space.

This may confuse user space (ethtool CLI seems to ignore it,
but YNL doesn't). It will also terminate the dump early
for mutli-skb dump, because netlink core treats EOPNOTSUPP
as a real error.

Fixes: 728480f12442 ("ethtool: default handlers for GET requests")
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'fine-tune-flow-control-and-speed-configurations-in-microchip-ksz8xxx-dsa-driver'

Oleksij Rempel says:

====================
Fine-Tune Flow Control and Speed Configurations in Microchip KSZ8xxx DSA Driver

This patch set focuses on enhancing the configurability of flow
control, speed, and duplex settings in the Microchip KSZ8xxx DSA driver.

The first patch allows more granular control over the CPU port's flow
control, speed, and duplex settings. The second patch introduces a
method for port configurations for port with integrated PHYs, primarily
concerning flow control based on duplex mode.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: make phylink_mac_link_up() not optional

Last part of the driver do now support phylink_mac_link_up(). So, make it
not optional.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: ksz8: Add function to configure ports with integrated PHYs

This patch introduces the function 'ksz8_phy_port_link_up' to the
Microchip KSZ8xxx driver. This function is responsible for setting up
flow control and duplex settings for the ports that are integrated with
PHYs.

The KSZ8795 switch supports asymmetric pause control, which can't be
fully utilized since a single bit controls both RX and TX pause. Despite
this, the flow control can be adjusted based on the auto-negotiation
process, taking into account the capabilities of both link partners.

On the other hand, the KSZ8873's PORT_FORCE_FLOW_CTRL bit can be set by
the hardware bootstrap, which ignores the auto-negotiation result.
Therefore, even in auto-negotiation mode, we need to ensure that this
bit is correctly set.

When auto-negotiation isn't in use, we enforce symmetric pause control
for the KSZ8795 switch.

Please note, forcing flow control disable on a port while still
advertising pause support isn't possible. While this scenario
might not be practical or desired, it's important to be aware of this
limitation when working with the KSZ8873 and similar devices.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: microchip: ksz8: Make flow control, speed, and duplex on CPU port configurable

Allow flow control, speed, and duplex settings on the CPU port to be
configurable. Previously, the speed and duplex relied on default switch
values, which limited flexibility. Additionally, flow control was
hardcoded and only functional in duplex mode. This update enhances the
configurability of these parameters.

Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'gve-add-support-for-non-4k-page-sizes'

John Fraker says:

====================
gve: Add support for non-4k page sizes.

This patch series adds support for non-4k page sizes to the driver. Prior
to this patch series, the driver assumes a 4k page size in many small
ways, and will crash in a kernel compiled for a different page size.

This changeset aims to be a minimal changeset that unblocks certain arm
platforms with large page sizes.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

gve: Remove dependency on 4k page size.

Prior to this change, gve crashes when attempting to run in kernels with
page sizes other than 4k. This change removes unnecessary references to
PAGE_SIZE and replaces them with more meaningful constants.

Signed-off-by: Jordan Kimbrough <[email protected]>
Signed-off-by: John Fraker <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

gve: Add page size register to the register_page_list command.

This register is required on platforms with page sizes greater than 4k.
This is because the tx side of the driver vmaps the entire queue page
list of pages into a single flat address space, then uses the entire
space. Without communicating the guest page size to the backend, the
backend will only access the first 4k of each page in the queue page list.

Signed-off-by: Jordan Kimbrough <[email protected]>
Signed-off-by: John Fraker <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

gve: Remove obsolete checks that rely on page size.

These checks are safe to remove as they are no longer enforced by the
backend. Retaining them would require updating them to work differently
with page sizes larger than 4k.

Signed-off-by: Jordan Kimbrough <[email protected]>
Signed-off-by: John Fraker <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

gve: Deprecate adminq_pfn for pci revision 0x1.

adminq_pfn assumes a page size of 4k, causing this mechanism to break in
kernels compiled with different page sizes. A new PCI device revision was
needed for the device to be able to communicate with the driver how to
set up the admin queue prior to having access to the admin queue.

Signed-off-by: Jordan Kimbrough <[email protected]>
Signed-off-by: John Fraker <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

gve: Perform adminq allocations through a dma_pool.

This allows the adminq to be smaller than a page, paving the way for
non 4k page support. This is to support platforms where PAGE_SIZE
is not 4k, such as some ARM platforms.

Signed-off-by: Jordan Kimbrough <[email protected]>
Signed-off-by: John Fraker <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'pinctrl-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Fix a really interesting potential core bug in the list iterator
   requireing the use of READ_ONCE() discovered when testing kernel
   compiles with clang.

- Check devm_kcalloc() return value and an array bounds in the STM32
   driver.

- Fix an exotic string truncation issue in the s32cc driver, found by
   the kernel test robot (impressive!)

- Fix an undocumented struct member in the cy8c95x0 driver.

- Fix a symbol overlap with MIPS in the Lochnagar driver, MIPS defines
   a global symbol "RST" which is a bit too generic and collide with
   stuff. OK this one should be renamed too, we will fix that as well.

- Fix erroneous branch taking in the Realtek driver.

- Fix the mail address in MAINTAINERS for the s32g2 driver.

* tag 'pinctrl-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  dt-bindings: pinctrl: s32g2: change a maintainer email address
  pinctrl: realtek: Fix logical error when finding descriptor
  pinctrl: lochnagar: Don't build on MIPS
  pinctrl: avoid reload of p state in list iteration
  pinctrl: cy8c95x0: Fix doc warning
  pinctrl: s32cc: Avoid possible string truncation
  pinctrl: stm32: fix array read out of bound
  pinctrl: stm32: Add check for devm_kcalloc

Merge branch 'selftests-bpf-use-pkg-config-to-determine-ld-flags'

Akihiko Odaki says:

====================
selftests/bpf: Use pkg-config to determine ld flags

When linking statically, libraries may require other dependencies to be
included to ld flags. In particular, libelf may require libzstd. Use
pkg-config to determine such dependencies.

V4 -> V5: Introduced variables LIBELF_CFLAGS and LIBELF_LIBS.
(Daniel Borkmann)
Added patch "selftests/bpf: Choose pkg-config for the target".
V3 -> V4: Added "2> /dev/null".
V2 -> V3: Added missing "echo".
V1 -> V2: Implemented fallback, referring to HOSTPKG_CONFIG.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>

selftests/bpf: Use pkg-config for libelf

When linking statically, libraries may require other dependencies to be
included to ld flags. In particular, libelf may require libzstd. Use
pkg-config to determine such dependencies.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Override PKG_CONFIG for static builds

A library may need to depend on additional archive files for static
builds so pkg-config should be instructed to list them.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Choose pkg-config for the target

pkg-config is used to build sign-file executable. It should use the
library for the target instead of the host as it is called during tests.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge branch 'bpf-add-link_info-support-for-uprobe-multi-link'

Jiri Olsa says:

====================
bpf: Add link_info support for uprobe multi link

hi,
this patchset adds support to get bpf_link_info details for
uprobe_multi links and adding support for bpftool link to
display them.

v4 changes:
  - move flags field up in bpf_uprobe_multi_link [Andrii]
  - include zero terminating byte in path_size [Andrii]
  - return d_path error directly [Yonghong]
  - use SEC(".probes") for semaphores [Yonghong]
  - fix ref_ctr_offsets leak in test [Yonghong]
  - other smaller fixes [Yonghong]

thanks,
jirka
---
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>

bpftool: Add support to display uprobe_multi links

Adding support to display details for uprobe_multi links,
both plain:

  # bpftool link -p
  ...
  24: uprobe_multi  prog 126
          uprobe.multi  path /home/jolsa/bpf/test_progs  func_cnt 3  pid 4143
          offset             ref_ctr_offset     cookies
          0xd1f88            0xf5d5a8           0xdead
          0xd1f8f            0xf5d5aa           0xbeef
          0xd1f96            0xf5d5ac           0xcafe

and json:

  # bpftool link -p
  [{
  ...
      },{
          "id": 24,
          "type": "uprobe_multi",
          "prog_id": 126,
          "retprobe": false,
          "path": "/home/jolsa/bpf/test_progs",
          "func_cnt": 3,
          "pid": 4143,
          "funcs": [{
                  "offset": 860040,
                  "ref_ctr_offset": 16111016,
                  "cookie": 57005
              },{
                  "offset": 860047,
                  "ref_ctr_offset": 16111018,
                  "cookie": 48879
              },{
                  "offset": 860054,
                  "ref_ctr_offset": 16111020,
                  "cookie": 51966
              }
          ]
      }
  ]

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Add link_info test for uprobe_multi link

Adding fill_link_info test for uprobe_multi link.

Setting up uprobes with bogus ref_ctr_offsets and cookie values
to test all the bpf_link_info::uprobe_multi fields.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Use bpf_link__destroy in fill_link_info tests

The fill_link_info test keeps skeleton open and just creates
various links. We are wrongly calling bpf_link__detach after
each test to close them, we need to call bpf_link__destroy.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Add link_info support for uprobe multi link

Adding support to get uprobe_link details through bpf_link_info
interface.

Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.

The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.

All the non-array fields (path/count/flags/pid) are always set.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Store ref_ctr_offsets values in bpf_uprobe array

We will need to return ref_ctr_offsets values through link_info
interface in following change, so we need to keep them around.

Storing ref_ctr_offsets values directly into bpf_uprobe array.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

libbpf: Add st_type argument to elf_resolve_syms_offsets function

We need to get offsets for static variables in following changes,
so making elf_resolve_syms_offsets to take st_type value as argument
and passing it to elf_sym_iter_new.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-11-27 (i40e, iavf)

This series contains updates to i40e and iavf drivers.

Ivan Vecera performs more cleanups on i40e and iavf drivers; removing
unused fields, defines, and unneeded fields.

Petr Oros utilizes iavf_schedule_aq_request() helper to replace open
coded equivalents.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: use iavf_schedule_aq_request() helper
  iavf: Remove queue tracking fields from iavf_adminq_ring
  i40e: Remove queue tracking fields from i40e_adminq_ring
  i40e: Remove AQ register definitions for VF types
  i40e: Delete unused and useless i40e_pf fields
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: remove multicast filter limit

Once upon a time, when r8169 was new, the multicast filter limit code
was copied from RTL8139 driver. There the filter limit is even
user-configurable.
The filtering is hash-based and we don't have perfect filtering.
Actually the mc filtering on RTL8125 still seems to be the same
as used on 8390/NE2000. So it's not clear to me which benefit it
should bring when switching to all-multi mode once a certain number
of filter bits is set. More the opposite: Filtering out at least
some unwanted mc traffic is better than no filtering.
Also the available chip documentation doesn't mention any restriction.
Therefore remove the filter limit.

Signed-off-by: Heiner Kallweit <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ice: fix error code in ice_eswitch_attach()

Set the "err" variable on this error path.

Fixes: fff292b47ac1 ("ice: add VF representors one by one")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nfp: ethtool: support TX/RX pause frame on/off

Add support for ethtool -A tx on/off and rx on/off.

Signed-off-by: Yu Xiao <[email protected]>
Signed-off-by: Louis Peens <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ravb: Fix races between ravb_tx_timeout_work() and net related ops

Fix races between ravb_tx_timeout_work() and functions of net_device_ops
and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
ravb_tx_timeout_work() should calls rtnl_trylock(). Otherwise, a deadlock
may happen in ravb_tx_timeout_work() like below:

CPU0 CPU1
ravb_tx_timeout()
schedule_work()
...
__dev_close_many()
// Under rtnl lock
ravb_close()
cancel_work_sync()
// Waiting
ravb_tx_timeout_work()
rtnl_lock()
// This is possible to cause a deadlock

If rtnl_trylock() fails, rescheduling the work with sleep for 1 msec.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Yoshihiro Shimoda <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few fixes and message updates:

   - for simple quotas, handle the case when a snapshot is created and
     the target qgroup already exists

   - fix a warning when file descriptor given to send ioctl is not
     writable

   - fix off-by-one condition when checking chunk maps

   - free pages when page array allocation fails during compression
     read, other cases were handled

   - fix memory leak on error handling path in ref-verify debugging
     feature

   - copy missing struct member 'version' in 64/32bit compat send ioctl

   - tree-checker verifies inline backref ordering

   - print messages to syslog on first mount and last unmount

   - update error messages when reading chunk maps"

* tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: ensure send_fd is writable
  btrfs: free the allocated memory if btrfs_alloc_page_array() fails
  btrfs: fix 64bit compat send ioctl arguments not initializing version member
  btrfs: make error messages more clear when getting a chunk map
  btrfs: fix off-by-one when checking chunk map includes logical address
  btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
  btrfs: add dmesg output for first mount and last unmount of a filesystem
  btrfs: do not abort transaction if there is already an existing qgroup
  btrfs: tree-checker: add type and sequence check for inline backrefs

Merge branch 'net-page_pool-add-netlink-based-introspection'

Jakub Kicinski says:

====================
net: page_pool: add netlink-based introspection

We recently started to deploy newer kernels / drivers at Meta,
making significant use of page pools for the first time.
We immediately run into page pool leaks both real and false positive
warnings. As Eric pointed out/predicted there's no guarantee that
applications will read / close their sockets so a page pool page
may be stuck in a socket (but not leaked) forever. This happens
a lot in our fleet. Most of these are obviously due to application
bugs but we should not be printing kernel warnings due to minor
application resource leaks.

Conversely the page pool memory may get leaked at runtime, and
we have no way to detect / track that, unless someone reconfigures
the NIC and destroys the page pools which leaked the pages.

The solution presented here is to expose the memory use of page
pools via netlink. This allows for continuous monitoring of memory
used by page pools, regardless if they were destroyed or not.
Sample in patch 15 can print the memory use and recycling
efficiency:

$ ./page-pool
    eth0[2] page pools: 10 (zombies: 0)
refs: 41984 bytes: 171966464 (refs: 0 bytes: 0)
recycling: 90.3% (alloc: 656:397681 recycle: 89652:270201)

v4:
- use dev_net(netdev)->loopback_dev
- extend inflight doc
v3: https://lore.kernel.org/all/20231122034420.1158898 [email protected]/
- ID is still here, can't decide if it matters
- rename destroyed -> detach-time, good enough?
- fix build for netsec
v2: https://lore.kernel.org/r/20231121000048 [email protected]
- hopefully fix build with PAGE_POOL=n
v1: https://lore.kernel.org/all/20231024160220.3973311 [email protected]/
- The main change compared to the RFC is that the API now exposes
   outstanding references and byte counts even for "live" page pools.
   The warning is no longer printed if page pool is accessible via netlink.
RFC: https://lore.kernel.org/all/20230816234303.3786178 [email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

tools: ynl: add sample for getting page-pool information

Regenerate the tools/ code after netdev spec changes.

Add sample to query page-pool info in a concise fashion:

$ ./page-pool
eth0[2] page pools: 10 (zombies: 0)
refs: 41984 bytes: 171966464 (refs: 0 bytes: 0)
recycling: 90.3% (alloc: 656:397681 recycle: 89652:270201)

Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: mute the periodic warning for visible page pools

Mute the periodic "stalled pool shutdown" warning if the page pool
is visible to user space. Rolling out a driver using page pools
to just a few hundred hosts at Meta surfaces applications which
fail to reap their broken sockets. Obviously it's best if the
applications are fixed, but we don't generally print warnings
for application resource leaks. Admins can now depend on the
netlink interface for getting page pool info to detect buggy
apps.

While at it throw in the ID of the pool into the message,
in rare cases (pools from destroyed netns) this will make
finding the pool with a debugger easier.

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: expose page pool stats via netlink

Dump the stats into netlink. More clever approaches
like dumping the stats per-CPU for each CPU individually
to see where the packets get consumed can be implemented
in the future.

A trimmed example from a real (but recently booted system):

$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
           --dump page-pool-stats-get
[{'info': {'id': 19, 'ifindex': 2},
  'alloc-empty': 48,
  'alloc-fast': 3024,
  'alloc-refill': 0,
  'alloc-slow': 48,
  'alloc-slow-high-order': 0,
  'alloc-waive': 0,
  'recycle-cache-full': 0,
  'recycle-cached': 0,
  'recycle-released-refcnt': 0,
  'recycle-ring': 0,
  'recycle-ring-full': 0},
{'info': {'id': 18, 'ifindex': 2},
  'alloc-empty': 66,
  'alloc-fast': 11811,
  'alloc-refill': 35,
  'alloc-slow': 66,
  'alloc-slow-high-order': 0,
  'alloc-waive': 0,
  'recycle-cache-full': 1145,
  'recycle-cached': 6541,
  'recycle-released-refcnt': 0,
  'recycle-ring': 1275,
  'recycle-ring-full': 0},
{'info': {'id': 17, 'ifindex': 2},
  'alloc-empty': 73,
  'alloc-fast': 62099,
  'alloc-refill': 413,
...

Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: report when page pool was destroyed

Report when page pool was destroyed. Together with the inflight
/ memory use reporting this can serve as a replacement for the
warning about leaked page pools we currently print to dmesg.

Example output for a fake leaked page pool using some hacks
in netdevsim (one "live" pool, and one "leaked" on the same dev):

$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 2, 'ifindex': 3},
{'id': 1, 'ifindex': 3, 'destroyed': 133, 'inflight': 1}]

Tested-by: Dragos Tatulea <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: report amount of memory held by page pools

Advanced deployments need the ability to check memory use
of various system components. It makes it possible to make informed
decisions about memory allocation and to find regressions and leaks.

Report memory use of page pools. Report both number of references
and bytes held.

Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: add netlink notifications for state changes

Generate netlink notifications about page pool state changes.

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: implement GET in the netlink API

Expose the very basic page pool information via netlink.

Example using ynl-py for a system with 9 queues:

$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 19, 'ifindex': 2, 'napi-id': 147},
{'id': 18, 'ifindex': 2, 'napi-id': 146},
{'id': 17, 'ifindex': 2, 'napi-id': 145},
{'id': 16, 'ifindex': 2, 'napi-id': 144},
{'id': 15, 'ifindex': 2, 'napi-id': 143},
{'id': 14, 'ifindex': 2, 'napi-id': 142},
{'id': 13, 'ifindex': 2, 'napi-id': 141},
{'id': 12, 'ifindex': 2, 'napi-id': 140},
{'id': 11, 'ifindex': 2, 'napi-id': 139},
{'id': 10, 'ifindex': 2, 'napi-id': 138}]

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: add nlspec for basic access to page pools

Add a Netlink spec in YAML for getting very basic information
about page pools.

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

eth: link netdev to page_pools in drivers

Link page pool instances to netdev for the drivers which
already link to NAPI. Unless the driver is doing something
very weird per-NAPI should imply per-netdev.

Add netsec as well, Ilias indicates that it fits the mold.

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: stash the NAPI ID for easier access

To avoid any issues with race conditions on accessing napi
and having to think about the lifetime of NAPI objects
in netlink GET - stash the napi_id to which page pool
was linked at creation time.

Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: record pools per netdev

Link the page pools with netdevs. This needs to be netns compatible
so we have two options. Either we record the pools per netns and
have to worry about moving them as the netdev gets moved.
Or we record them directly on the netdev so they move with the netdev
without any extra work.

Implement the latter option. Since pools may outlast netdev we need
a place to store orphans. In time honored tradition use loopback
for this purpose.

Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: id the page pools

To give ourselves the flexibility of creating netlink commands
and ability to refer to page pool instances in uAPIs create
IDs for page pools.

Reviewed-by: Ilias Apalodimas <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: page_pool: factor out uninit

We'll soon (next change in the series) need a fuller unwind path
in page_pool_create() so create the inverse of page_pool_init().

Reviewed-by: Ilias Apalodimas <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

r8169: prevent potential deadlock in rtl8169_close

ndo_stop() is RTNL-protected by net core, and the worker function takes
RTNL as well. Therefore we will deadlock when trying to execute a
pending work synchronously. To fix this execute any pending work
asynchronously. This will do no harm because netif_running() is false
in ndo_stop(), and therefore the work function is effectively a no-op.
However we have to ensure that no task is running or pending after
rtl_remove_one(), therefore add a call to cancel_work_sync().

Fixes: abe5fc42f9ce ("r8169: use RTNL to protect critical sections")
Signed-off-by: Heiner Kallweit <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

r8169: fix deadlock on RTL8125 in jumbo mtu mode

The original change results in a deadlock if jumbo mtu mode is used.
Reason is that the phydev lock is held when rtl_reset_work() is called
here, and rtl_jumbo_config() calls phy_start_aneg() which also tries
to acquire the phydev lock. Fix this by calling rtl_reset_work()
asynchronously.

Fixes: 621735f59064 ("r8169: fix rare issue with broken rx after link-down on RTL8125")
Reported-by: Ian Chen <[email protected]>
Tested-by: Ian Chen <[email protected]>
Cc: [email protected]
Signed-off-by: Heiner Kallweit <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

efi/unaccepted: Fix off-by-one when checking for overlapping ranges

When a task needs to accept memory it will scan the accepting_list
to see if any ranges already being processed by other tasks overlap
with its range. Due to an off-by-one in the range comparisons, a task
might falsely determine that an overlapping range is being accepted,
leading to an unnecessary delay before it begins processing the range.

Fix the off-by-one in the range comparison to prevent this and slightly
improve performance.

Fixes: 50e782a86c98 ("efi/unaccepted: Fix soft lockups caused by parallel memory acceptance")
Link: https://lore.kernel.org/linux-mm/[email protected]/T/#me2eceb9906fcae5fe958b3fe88e41f920f8335b6
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>

neighbour: Fix __randomize_layout crash in struct neighbour

Previously, one-element and zero-length arrays were treated as true
flexible arrays, even though they are actually "fake" flex arrays.
The __randomize_layout would leave them untouched at the end of the
struct, similarly to proper C99 flex-array members.

However, this approach changed with commit 1ee60356c2dc ("gcc-plugins:
randstruct: Only warn about true flexible arrays"). Now, only C99
flexible-array members will remain untouched at the end of the struct,
while one-element and zero-length arrays will be subject to randomization.

Fix a `__randomize_layout` crash in `struct neighbour` by transforming
zero-length array `primary_key` into a proper C99 flexible-array member.

Fixes: 1ee60356c2dc ("gcc-plugins: randstruct: Only warn about true flexible arrays")
Closes: https://lore.kernel.org/linux-hardening/[email protected]/
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Joey Gouly <[email protected]>
Link: https://lore.kernel.org/r/ZWJoRsJGnCPdJ3+2@work
Signed-off-by: Paolo Abeni <[email protected]>

octeontx2-pf: Restore TC ingress police rules when interface is up

TC ingress policer rules depends on interface receive queue
contexts since the bandwidth profiles are attached to RQ
contexts. When an interface is brought down all the queue
contexts are freed. This in turn frees bandwidth profiles in
hardware causing ingress police rules non-functional after
the interface is brought up. Fix this by applying all the ingress
police rules config to hardware in otx2_open. Also allow
adding ingress rules only when interface is running
since no contexts exist for the interface when it is down.

Fixes: 68fbff68dbea ("octeontx2-pf: Add police action for TC flower")
Signed-off-by: Subbaraya Sundeep <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64

When more than 64 VFs are enabled for a PF then mbox communication
between VF and PF is not working as mbox work queueing for few VFs
are skipped due to wrong calculation of VF numbers.

Fixes: d424b6c02415 ("octeontx2-pf: Enable SRIOV and added VF mbox handling")
Signed-off-by: Geetha sowjanya <[email protected]>
Signed-off-by: Subbaraya Sundeep <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: stmmac: xgmac: Disable FPE MMC interrupts

Commit aeb18dd07692 ("net: stmmac: xgmac: Disable MMC interrupts
by default") tries to disable MMC interrupts to avoid a storm of
unhandled interrupts, but leaves the FPE(Frame Preemption) MMC
interrupts enabled, FPE MMC interrupts can cause the same problem.
Now we mask FPE TX and RX interrupts to disable all MMC interrupts.

Fixes: aeb18dd07692 ("net: stmmac: xgmac: Disable MMC interrupts by default")
Reviewed-by: Larysa Zaremba <[email protected]>
Signed-off-by: Furong Xu <[email protected]>
Reviewed-by: Serge Semin <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

octeontx2-af: Fix possible buffer overflow

A loop in rvu_mbox_handler_nix_bandprof_free() contains
a break if (idx == MAX_BANDPROF_PER_PFFUNC),
but if idx may reach MAX_BANDPROF_PER_PFFUNC
buffer '(*req->prof_idx)[layer]' overflow happens before that check.

The patch moves the break to the
beginning of the loop.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: e8e095b3b370 ("octeontx2-af: cn10k: Bandwidth profiles config support").
Signed-off-by: Elena Salomatkina <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Subbaraya Sundeep <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge tag 'wireless-next-2023-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.8

The first features pull request for v6.8. Not so big in number of
commits but we removed quite a few ancient drivers: libertas 16-bit
PCMCIA support, atmel, hostap, zd1201, orinoco, ray_cs, wl3501 and
rndis_wlan.

Major changes:

cfg80211/mac80211
- extend support for scanning while Multi-Link Operation (MLO) connected

* tag 'wireless-next-2023-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (68 commits)
  wifi: nl80211: Documentation update for NL80211_CMD_PORT_AUTHORIZED event
  wifi: mac80211: Extend support for scanning while MLO connected
  wifi: cfg80211: Extend support for scanning while MLO connected
  wifi: ieee80211: fix PV1 frame control field name
  rfkill: return ENOTTY on invalid ioctl
  MAINTAINERS: update iwlwifi maintainers
  wifi: rtw89: 8922a: read efuse content from physical map
  wifi: rtw89: 8922a: read efuse content via efuse map struct from logic map
  wifi: rtw89: 8852c: read RX gain offset from efuse for 6GHz channels
  wifi: rtw89: mac: add to access efuse for WiFi 7 chips
  wifi: rtw89: mac: use mac_gen pointer to access about efuse
  wifi: rtw89: 8922a: add 8922A basic chip info
  wifi: rtlwifi: drop unused const_amdpci_aspm
  wifi: mwifiex: mwifiex_process_sleep_confirm_resp(): remove unused priv variable
  wifi: rtw89: regd: update regulatory map to R65-R44
  wifi: rtw89: regd: handle policy of 6 GHz according to BIOS
  wifi: rtw89: acpi: process 6 GHz band policy from DSM
  wifi: rtlwifi: simplify rtl_action_proc() and rtl_tx_agg_start()
  wifi: rtw89: pci: update interrupt mitigation register for 8922AE
  wifi: rtw89: pci: correct interrupt mitigation register for 8852CE
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'selftests-tc-testing-updates-and-cleanups-for-tdc'

Pedro Tammela says:

====================
selftests: tc-testing: updates and cleanups for tdc

Address the recommendations from the previous series and cleanup some
leftovers.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tc-testing: remove unused import

Remove this leftover from the times we pre-allocated everything

Signed-off-by: Pedro Tammela <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tc-testing: cleanup on Ctrl-C

Cleanup net namespaces and other resources if we get a SIGINT (Ctrl-C).
As user visible resources are allocated on a per test basis, it's only
required to catch this condition when (possibly) running tests.

So far calling post_suite is enough to free up anything that might
linger.

A missing keyword replacement for nsPlugin is also included.

Signed-off-by: Pedro Tammela <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tc-testing: prefix iproute2 functions with "ipr2"

As suggested by Simon, prefix the functions that operate on iproute2
commands in contrast with the "nl" netlink prefix.

Cc: Simon Horman <[email protected]>
Signed-off-by: Pedro Tammela <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tc-testing: remove unnecessary time.sleep

This operation is redundant and it's not stabilizing nor waiting
for anything.

Signed-off-by: Pedro Tammela <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tc-testing: remove buildebpf plugin

As tdc only tests loading/deleting and anything more complicated is
better left to the ebpf test suite, provide a pre-compiled version of
'action.c' and don't bother compiling it in kselftests or on the fly
at all.

Cc: Davide Caratti <[email protected]>
Signed-off-by: Pedro Tammela <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'selftests-net-fix-a-few-small-compiler-warnings'

Willem de Bruijn says:

====================
selftests/net: fix a few small compiler warnings

Observed a clang warning when backporting cmsg_sender.
Ran the same build against all the .c files under selftests/net.

This is clang-14 with -Wall
Which is what tools/testing/selftests/net/Makefile also enables.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>