In keeping up with my status as a hero who removes code: another one bites the
dust.
The tc ipt action was intended to run all netfilter/iptables target.
Unfortunately it has not benefitted over the years from proper updates when
netfilter changes, and for that reason it has remained rudimentary.
Pinging a bunch of people that i was aware were using this indicates that
removing it wont affect them.
Retire it to reduce maintenance efforts.
So Long, ipt, and Thanks for all the Fish.
====================
Jamal Hadi Salim [Thu, 21 Dec 2023 21:31:03 +0000 (16:31 -0500)]
net/sched: Retire ipt action
The tc ipt action was intended to run all netfilter/iptables target.
Unfortunately it has not benefitted over the years from proper updates when
netfilter changes, and for that reason it has remained rudimentary.
Pinging a bunch of people that i was aware were using this indicates that
removing it wont affect them.
Retire it to reduce maintenance efforts. Buh-bye.
David S. Miller [Mon, 1 Jan 2024 18:38:57 +0000 (18:38 +0000)]
Merge branch 'phy-listing-link_topology-tracking'
Maxime Chevallier says:
====================
Introduce PHY listing and link_topology tracking
Here's a V5 of the multi-PHY support series.
At a glance, besides some minor fixes and R'd-by from Andrew, one of the
thing this series does is remove the ASSERT_RTNL() from the
topo_add_phy/del_phy operations.
These operations will take a PHY device and put it into the list of
devices associated to a netdevice. The main thing to protect here is the
list itself, but since we use xarrays, my naive understanding of it is
that it contains its own protection scheme. There shouldn't be a need
for more locking, as the insertion/deletion paths are already hooked
into the PHY connection to a netdev, or disconnection from it.
Now for the rest of the cover :
As a remainder, this ongoing work aims ultimately at supporting complex
link topologies that involve multiplexing multiple PHYs/SFPs on a single
netdevice. As a first step, it's required that we are able to enumerate the
PHYs on a given ethernet interface.
By just doing so, we also improve already-existing use-cases, namely the
copper SFP modules support when a media-converter is used (as we have 2
PHYs on the link, but only one is referenced by net_device.phydev, which
is used on a variety of netlink commands).
The series is architectured as follows :
- The first patch adds the notion of phy_link_topology, which tracks
all PHYs attached to a netdevice.
- Patches 2, 3 and 4 adds some plumbing into SFP and phylib to be able
to connect the dots when building the topology tree, to know which PHY
is connected to which SFP bus, trying not to be too invasive on phylib.
- Patch 5 allows passing a PHY_INDEX to ethnl commands. I'm uncertain about
this, as there are at least 4 netlink commands ( 5 with the one introduced
in patch 7 ) that targets PHYs directly or indirectly, which to me makes
it worth-it to have a generic way to pass a PHY index to commands, however
the approach taken may be too generic.
- Patch 6 is the netlink spec update + ethtool-user.c|h autogenerated code
update (the autogenerated code triggers checkpatch warning though)
- Patch 7 introduces a new netlink command set to list PHYs on a netdevice.
It implements a custom DUMP and GET operation to allow filtered dumps,
that lists all PHYs on a given netdevice. I couldn't use most of ethnl's
plumbing though.
- Patch 8 is the netlink spec update + ethtool-user.c|h update for that
new command
- Patch 8,9,10 and 11 updates the PLCA, strset, cable-test and pse netlink
commands to use the user-provided PHY instead of net_device.phydev.
- Finally patch 12 adds some documentation for this whole work.
Examples
========
Here's a short overview of the kind of operations you can have regarding
the PHY topology. These tests were performed on a MacchiatoBin, which
has 3 interfaces :
eth0 and eth1 have the following layout:
MAC - PHY - SFP
eth2 has this more classic topology :
MAC - PHY - RJ45
finally eth3 has the following topology :
MAC - SFP
When performing a dump with all interfaces down, we don't get any
result, as no PHY has been attached to their respective net_device :
None
The following output is with eth0, eth2 and eth3 up, but no SFP module
inserted in none of the interfaces :
-- Note that this shouldn't actually work as the 88x3310 PHY doesn't allow
a 1G SFP to be connected to its SFP interface, and I don't have a 10G copper SFP,
so for the sake of the demo I applied the following modification, which
of courses gives a non-functionnal link, but the PHY attach still works,
which is what I want to demonstrate :
Changed in V5:
- Removed the RTNL assertion in the topology ops
- Made the phy_topo_get_phy inline
- Fixed the PSE-PD multi-PHY support by re-adding a wrongly dropped
check
- Fixed some typos in the documentation
- Fixed reverse xmas trees
Changes in V4:
- Dropped the RFC flag
- Made the net_device integration independent to having phylib enabled
- Removed the autogenerated ethtool-user code for the YNL specs
Changes in V3:
- Added RTNL assertions where needed
- Fixed issues in the DUMP code for PHY_GET, which crashed when running it
twice in a row
- Added the documentation, and moved in-source docs around
- renamed link_topology to phy_link_topology
Changes in V2:
- Added the DUMP operation
- Added much more information in the reported data, to be able to reconstruct
precisely the topology tree
- renamed phy_list to link_topology
====================
The newly introduced phy_link_topology tracks all ethernet PHYs that are
attached to a netdevice. Document the base principle, internal and
external APIs. As the phy_link_topology is expected to be extended, this
documentation will hold any further improvements and additions made
relative to topology handling.
net: ethtool: pse-pd: Target the command to the requested PHY
PSE and PD configuration is a PHY-specific command. Instead of targeting
the command towards dev->phydev, use the request to pick the targeted
PHY device.
The PHY_GET command, supporting both DUMP and GET operations, is used to
retrieve the list of PHYs connected to a netdevice, and get topology
information to know where exactly it sits on the physical link.
Add the netlink specs corresponding to that command.
net: ethtool: Introduce a command to list PHYs on an interface
As we have the ability to track the PHYs connected to a net_device
through the link_topology, we can expose this list to userspace. This
allows userspace to use these identifiers for phy-specific commands and
take the decision of which PHY to target by knowing the link topology.
Add PHY_GET and PHY_DUMP, which can be a filtered DUMP operation to list
devices on only one interface.
net: ethtool: Allow passing a phy index for some commands
Some netlink commands are target towards ethernet PHYs, to control some
of their features. As there's several such commands, add the ability to
pass a PHY index in the ethnl request, which will populate the generic
ethnl_req_info with the relevant phydev when the command targets a PHY.
net: phy: add helpers to handle sfp phy connect/disconnect
There are a few PHY drivers that can handle SFP modules through their
sfp_upstream_ops. Introduce Phylib helpers to keep track of connected
SFP PHYs in a netdevice's namespace, by adding the SFP PHY to the
upstream PHY's netdev's namespace.
By doing so, these SFP PHYs can be enumerated and exposed to users,
which will be able to use their capabilities.
net: sfp: pass the phy_device when disconnecting an sfp module's PHY
Pass the phy_device as a parameter to the sfp upstream .disconnect_phy
operation. This is preparatory work to help track phy devices across
a net_device's link.
net: phy: Introduce ethernet link topology representation
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.
With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.
The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.
Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.
The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.
This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.
The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to.
The following patchset contains Netfilter updates for net-next:
1) Add locking for NFT_MSG_GETSETELEM_RESET requests, to address a
race scenario with two concurrent processes running a dump-and-reset
which exposes negative counters to userspace, from Phil Sutter.
2) Use GFP_KERNEL in pipapo GC, from Florian Westphal.
3) Reorder nf_flowtable struct members, place the read-mostly parts
accessed by the datapath first. From Florian Westphal.
4) Set on dead flag for NFT_MSG_NEWSET in abort path,
from Florian Westphal.
5) Support filtering zone in ctnetlink, from Felix Huettner.
6) Bail out if user tries to redefine an existing chain with different
type in nf_tables.
====================
David S. Miller [Mon, 1 Jan 2024 14:45:21 +0000 (14:45 +0000)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next-for-netdev
The following pull-request contains BPF updates for your *net-next* tree.
We've added 22 non-merge commits during the last 3 day(s) which contain
a total of 23 files changed, 652 insertions(+), 431 deletions(-).
The main changes are:
1) Add verifier support for annotating user's global BPF subprogram arguments
with few commonly requested annotations for a better developer experience,
from Andrii Nakryiko.
These tags are:
- Ability to annotate a special PTR_TO_CTX argument
- Ability to annotate a generic PTR_TO_MEM as non-NULL
2) Support BPF verifier tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like, from
Menglong Dong.
3) Fix a warning in bpf_mem_cache's check_obj_size() as reported by LKP, from Hou Tao.
4) Re-support uid/gid options when mounting bpffs which had to be reverted with
the prior token series revert to avoid conflicts, from Daniel Borkmann.
5) Fix a libbpf NULL pointer dereference in bpf_object__collect_prog_relos() found
from fuzzing the library with malformed ELF files, from Mingyi Zhang.
6) Skip DWARF sections in libbpf's linker sanity check given compiler options to
generate compressed debug sections can trigger a rejection due to misalignment,
from Alyssa Ross.
7) Fix an unnecessary use of the comma operator in BPF verifier, from Simon Horman.
8) Fix format specifier for unsigned long values in cpustat sample, from Colin Ian King.
====================
net: mdio: get/put device node during (un)registration
The __of_mdiobus_register() function was storing the device node in
dev.of_node without increasing its reference count. It implicitly relied
on the caller to maintain the allocated node until the mdiobus was
unregistered.
Now, __of_mdiobus_register() will acquire the node before assigning it,
and of_mdiobus_unregister_callback() will be called at the end of
mdio_unregister().
Drivers can now release the node immediately after MDIO registration.
Some of them are already doing that even before this patch.
David S. Miller [Fri, 29 Dec 2023 22:35:13 +0000 (22:35 +0000)]
Merge tag 'mlx5-updates-2023-12-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-12-20
mlx5 Socket direct support and management PF profile.
Tariq Says:
===========
Support Socket-Direct multi-dev netdev
This series adds support for combining multiple devices (PFs) of the
same port under one netdev instance. Passing traffic through different
devices belonging to different NUMA sockets saves cross-numa traffic and
allows apps running on the same netdev from different numas to still
feel a sense of proximity to the device and achieve improved
performance.
We achieve this by grouping PFs together, and creating the netdev only
once all group members are probed. Symmetrically, we destroy the netdev
once any of the PFs is removed.
The channels are distributed between all devices, a proper configuration
would utilize the correct close numa when working on a certain app/cpu.
We pick one device to be a primary (leader), and it fills a special
role. The other devices (secondaries) are disconnected from the network
in the chip level (set to silent mode). All RX/TX traffic is steered
through the primary to/from the secondaries.
Currently, we limit the support to PFs only, and up to two devices
(sockets).
===========
Armen Says:
===========
Management PF support and module integration
This patch rolls out comprehensive support for the Management Physical
Function (MGMT PF) within the mlx5 driver. It involves updating the
mlx5 interface header to introduce necessary definitions for MGMT PF
and adding a new management PF netdev profile, which will allow the host
side to communicate with the embedded linux on Blue-field devices.
Ido Schimmel [Wed, 20 Dec 2023 15:43:58 +0000 (17:43 +0200)]
genetlink: Use internal flags for multicast groups
As explained in commit e03781879a0d ("drop_monitor: Require
'CAP_SYS_ADMIN' when joining "events" group"), the "flags" field in the
multicast group structure reuses uAPI flags despite the field not being
exposed to user space. This makes it impossible to extend its use
without adding new uAPI flags, which is inappropriate for internal
kernel checks.
Solve this by adding internal flags (i.e., "GENL_MCAST_*") and convert
the existing users to use them instead of the uAPI flags.
Tested using the reproducers in commit 44ec98ea5ea9 ("psample: Require
'CAP_NET_ADMIN' when joining "packets" group") and commit e03781879a0d
("drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group").
Now that the driver core can properly handle constant struct bus_type,
move the iucv_bus variable to be a constant structure as well, placing
it into read-only memory which can not be modified at runtime.
Jonathan Corbet [Tue, 19 Dec 2023 23:55:31 +0000 (16:55 -0700)]
ethtool: reformat kerneldoc for struct ethtool_fec_stats
The kerneldoc comment for struct ethtool_fec_stats attempts to describe the
"total" and "lanes" fields of the ethtool_fec_stat substructure in a way
leading to these warnings:
./include/linux/ethtool.h:424: warning: Excess struct member 'lane' description in 'ethtool_fec_stats'
./include/linux/ethtool.h:424: warning: Excess struct member 'total' description in 'ethtool_fec_stats'
Reformat the comment to retain the information while eliminating the
warnings.
Jonathan Corbet [Tue, 19 Dec 2023 23:53:46 +0000 (16:53 -0700)]
ethtool: reformat kerneldoc for struct ethtool_link_settings
The kernel doc comments for struct ethtool_link_settings includes
documentation for three fields that were never present there, leading to
these docs-build warnings:
./include/uapi/linux/ethtool.h:2207: warning: Excess struct member 'supported' description in 'ethtool_link_settings'
./include/uapi/linux/ethtool.h:2207: warning: Excess struct member 'advertising' description in 'ethtool_link_settings'
./include/uapi/linux/ethtool.h:2207: warning: Excess struct member 'lp_advertising' description in 'ethtool_link_settings'
Remove the entries to make the warnings go away. There was some
information there on how data in >link_mode_masks is formatted; move that
to the body of the comment to preserve it.
Remove a couple of kerneldoc entries for struct members that do not exist,
addressing these warnings:
./include/net/sock.h:548: warning: Excess struct member '__sk_flags_offset' description in 'sock'
./include/net/sock.h:548: warning: Excess struct member 'sk_padding' description in 'sock'
Kevin Hao [Tue, 19 Dec 2023 23:37:57 +0000 (07:37 +0800)]
net: pktgen: Use wait_event_freezable_timeout() for freezable kthread
A freezable kernel thread can enter frozen state during freezing by
either calling try_to_freeze() or using wait_event_freezable() and its
variants. So for the following snippet of code in a kernel thread loop:
wait_event_interruptible_timeout();
try_to_freeze();
We can change it to a simple wait_event_freezable_timeout() and then
eliminate a function call.
David S. Miller [Wed, 27 Dec 2023 13:08:10 +0000 (13:08 +0000)]
Merge branch 'net-tja11xx-macsec-support'
Radu Pirea says:
====================
Add MACsec support for TJA11XX C45 PHYs
This is the MACsec support for TJA11XX PHYs. The MACsec block encrypts
the ethernet frames on the fly and has no buffering. This operation will
grow the frames by 32 bytes. If the frames are sent back to back, the
MACsec block will not have enough room to insert the SecTAG and the ICV
and the frames will be dropped.
To mitigate this, the PHY can parse a specific ethertype with some
padding bytes and replace them with the SecTAG and ICV. These padding
bytes might be dummy or might contain information about TX SC that must
be used to encrypt the frame.
====================
Add MACsec statistics callbacks.
The statistic registers must be set to 0 if the SC/SA is
deleted to read relevant values next time when the SC/SA is used.
Add MACsec support.
The MACsec block has four TX SCs and four RX SCs. The driver supports up
to four SecY. Each SecY with one TX SC and one RX SC.
The RX SCs can have two keys, key A and key B, written in hardware and
enabled at the same time.
The TX SCs can have two keys written in hardware, but only one can be
active at a given time.
On TX, the SC is selected using the MAC source address. Due of this
selection mechanism, each offloaded netdev must have a unique MAC
address.
On RX, the SC is selected by SCI(found in SecTAG or calculated using MAC
SA), or using RX SC 0 as implicit.
Offloading MACsec in PHYs requires inserting the SecTAG and the ICV in
the ethernet frame. This operation will increase the frame size with up
to 32 bytes. If the frames are sent at line rate, the PHY will not have
enough room to insert the SecTAG and the ICV.
Some PHYs use a hardware buffer to store a number of ethernet frames and,
if it fills up, a pause frame is sent to the MAC to control the flow.
This HW implementation does not need any modification in the stack.
Other PHYs might offer to use a specific ethertype with some padding
bytes present in the ethernet frame. This ethertype and its associated
bytes will be replaced by the SecTAG and ICV.
mdo_insert_tx_tag allows the PHY drivers to add any specific tag in the
skb.
Lin Ma [Wed, 20 Dec 2023 16:34:51 +0000 (00:34 +0800)]
bridge: cfm: fix enum typo in br_cc_ccm_tx_parse
It appears that there is a typo in the code where the nlattr array is
being parsed with policy br_cfm_cc_ccm_tx_policy, but the instance is
being accessed via IFLA_BRIDGE_CFM_CC_RDI_INSTANCE, which is associated
with the policy br_cfm_cc_rdi_policy.
This problem was introduced by commit 2be665c3940d ("bridge: cfm: Netlink
SET configuration Interface.").
Though it seems like a harmless typo since these two enum owns the exact
same value (1 here), it is quite misleading hence fix it by using the
correct enum IFLA_BRIDGE_CFM_CC_CCM_TX_INSTANCE here.
====================
mptcp: cleanup and support more ephemeral ports sockopts
Patch 1 is a cleanup one: mptcp_is_tcpsk() helper was modifying sock_ops
in some cases which is unexpected with that name.
Patch 2 to 4 add support for two socket options: IP_LOCAL_PORT_RANGE and
IP_BIND_ADDRESS_NO_PORT. The first one is a preparation patch, the
second one adds the support while the last one modifies an existing
selftest to validate the new features.
====================
Maxim Galaganov [Tue, 19 Dec 2023 21:31:06 +0000 (22:31 +0100)]
mptcp: sockopt: support IP_LOCAL_PORT_RANGE and IP_BIND_ADDRESS_NO_PORT
Support for IP_BIND_ADDRESS_NO_PORT sockopt was introduced in [1].
Recently [2] allowed its value to be accessed without locking the
socket.
Support for (newer) IP_LOCAL_PORT_RANGE sockopt was introduced in [3].
In the same series a selftest was added in [4]. This selftest also
covers the IP_BIND_ADDRESS_NO_PORT sockopt.
This patch enables getsockopt()/setsockopt() on MPTCP sockets for these
socket options, syncing set values to subflows in sync_socket_options().
Ephemeral port range is synced to subflows, enabling NAT usecase
described in [3].
Davide Caratti [Tue, 19 Dec 2023 21:31:04 +0000 (22:31 +0100)]
mptcp: don't overwrite sock_ops in mptcp_is_tcpsk()
Eric Dumazet suggests:
> The fact that mptcp_is_tcpsk() was able to write over sock->ops was a
> bit strange to me.
> mptcp_is_tcpsk() should answer a question, with a read-only argument.
re-factor code to avoid overwriting sock_ops inside that function. Also,
change the helper name to reflect the semantics and to disambiguate from
its dual, sk_is_mptcp(). While at it, collapse mptcp_stream_accept() and
mptcp_accept() into a single function, where fallback / non-fallback are
separated into a single sk_is_mptcp() conditional.
David S. Miller [Tue, 26 Dec 2023 21:20:09 +0000 (21:20 +0000)]
Merge branch 'net-sched-tc-block-ports-tracking'
Victor Nogueira says:
====================
net/sched: Introduce tc block ports tracking and use
__context__
The "tc block" is a collection of netdevs/ports which allow qdiscs to share
match-action block instances (as opposed to the traditional tc filter per
netdev/port)[1].
Up to this point in the implementation, the block is unaware of its ports.
This patch makes the tc block ports available to the datapath.
For the datapath we provide a use case of the tc block in a mirred
action in patch 3. For users can levarage mirred to do something like
the following:
In this case, if the packet arrives on ens8, it will be copied and sent to
all ports in the block excluding ens8. Note that the packet is still in
the pipeline at this point - meaning other actions could be added after the
mirror because mirred copies/clones the skb. Example the following is
valid:
redirect behavior always steals the packet from the pipeline and therefore
the skb is no longer available for a subsequent action as illustrated above
(in redirecting to dummy0).
The behavior of redirecting to a tc block is therefore adapted to work in
the same manner. So a setup as such:
$ tc qdisc add dev ens7 ingress_block 22
$ tc qdisc add dev ens8 ingress_block 22
$ tc qdisc add dev ens9 ingress_block 22
$ tc filter add block 22 protocol ip pref 25 \
flower dst_ip 192.168.0.0/16 action mirred egress redirect blockid 22
for a matching packet arriving on ens7 will first send a copy/clone to ens8
(as in the "mirror" behavior) then to ens9 as in the redirect behavior
above. Once this processing is done - no other actions are able to process
this skb. i.e it is removed from the "pipeline".
In this case, if the packet arrives on ens8, it will be copied and sent to
all ports in the block excluding ens8.
Patch 1 separates/exports mirror and redirect functions from act_mirred
Patch 2 introduces the required infra.
Patch 3 Allows mirred to blocks
Subsequent patches will come with tdc test cases.
__Acknowledgements__
Suggestions from Vlad Buslov and Marcelo Ricardo Leitner made this patchset
better. The idea of integrating the ports into the tc block was suggested
by Jiri Pirko.
[1] See commit ca46abd6f89f ("Merge branch'net-sched-allow-qdiscs-to-share-filter-block-instances'")
Changes in v2:
- Remove RFC tag
- Add more details in patch 0(Jiri)
- When CONFIG_NET_TC_SKB_EXT is selected we have unused qdisc_cb Reported-by: kernel test robot <[email protected]> (and [email protected])
- Fix bad dev dereference in printk of blockcast action (Simon)
Changes in v3:
- Add missing xa_destroy (pointed out by Vlad)
- Remove bugfix pointed by Vlad (will send in separate patch)
- Removed ports from subject in patch #2 and typos (suggested by
Marcelo)
- Remove net_notice_ratelimited debug messages in error
cases (suggested by Marcelo)
- Minor changes to appease sparse's lock context warning
Changes in v4:
- Avoid code repetition using gotos in cast_one (suggested by Paolo)
- Fix typo in cover letter (pointed out by Paolo)
- Create a module description for act_blockcast
(reported by Paolo and CI)
Changes in v5:
- Add new patch which separated mirred into mirror and redirect
functions (suggested by Jiri)
- Instead of repeating the code to mirror in blockcast use mirror
exported function by patch1 (tcf_mirror_act)
- Make Block ID into act_blockcast's parameter passed by user space
instead of always getting it from SKB (suggested by Jiri)
- Add tx_type parameter which will specify what transmission behaviour
we want (as described earlier)
Changes in v6:
- Remove blockcast and make it a part of mirred (suggestd by Jiri)
- Block ID is now a mirred parameter
- We now allow redirecting and mirroring to either ingress or egress
Changes in v7:
- Remove set but not used variable in tcf_mirred_act (pointed out by
Jakub)
Changes in v8:
- Fix uapi issues (pointed out by Jiri)
- Separate last patch into 3 - two as preparations for adding
block ID to mirred and one allowing mirred to block (suggested by Jiri)
- Remove declaration initialisation of eg_block and in_block in
qdisc_block_add_dev (suggested by Jiri)
- Avoid unnecessary if guards in qdisc_block_add_dev (suggested by Jiri)
- Remove unncessary block_index retrieval in __qdisc_destroy
(suggested by Jiri)
====================
Victor Nogueira [Tue, 19 Dec 2023 18:16:23 +0000 (15:16 -0300)]
net/sched: act_mirred: Allow mirred to block
So far the mirred action has dealt with syntax that handles
mirror/redirection for netdev. A matching packet is redirected or mirrored
to a target netdev.
In this patch we enable mirred to mirror to a tc block as well.
IOW, the new syntax looks as follows:
... mirred <ingress | egress> <mirror | redirect> [index INDEX] < <blockid BLOCKID> | <dev <devname>> >
Examples of mirroring or redirecting to a tc block:
$ tc filter add block 22 protocol ip pref 25 \
flower dst_ip 192.168.0.0/16 action mirred egress mirror blockid 22
Victor Nogueira [Tue, 19 Dec 2023 18:16:22 +0000 (15:16 -0300)]
net/sched: act_mirred: Add helper function tcf_mirred_replace_dev
The act of replacing a device will be repeated by the init logic for the
block ID in the patch that allows mirred to a block. Therefore we
encapsulate this functionality in a function (tcf_mirred_replace_dev) so
that we can reuse it and avoid code repetition.
Victor Nogueira [Tue, 19 Dec 2023 18:16:21 +0000 (15:16 -0300)]
net/sched: act_mirred: Create function tcf_mirred_to_dev and improve readability
As a preparation for adding block ID to mirred, separate the part of
mirred that redirect/mirrors to a dev into a specific function so that it
can be called by blockcast for each dev.
Also improve readability. Eg. rename use_reinsert to dont_clone and skb2
to skb_to_send.
Victor Nogueira [Tue, 19 Dec 2023 18:16:20 +0000 (15:16 -0300)]
net/sched: cls_api: Expose tc block to the datapath
The datapath can now find the block of the port in which the packet arrived
at.
In the next patch we show a possible usage of this patch in a new
version of mirred that multicasts to all ports except for the port in
which the packet arrived on.
Victor Nogueira [Tue, 19 Dec 2023 18:16:19 +0000 (15:16 -0300)]
net/sched: Introduce tc block netdev tracking infra
This commit makes tc blocks track which ports have been added to them.
And, with that, we'll be able to use this new information to send
packets to the block's ports. Which will be done in the patch #3 of this
series.
David S. Miller [Tue, 26 Dec 2023 20:24:33 +0000 (20:24 +0000)]
Merge branch 'net-smcv2.1-ISM-device-support'
Wen Gu says:
====================
net/smc: implement SMCv2.1 virtual ISM device support
The fourth edition of SMCv2 adds the SMC version 2.1 feature updates for
SMC-Dv2 with virtual ISM. Virtual ISM are created and supported mainly by
OS or hypervisor software, comparable to IBM ISM which is based on platform
firmware or hardware.
With the introduction of virtual ISM, SMCv2.1 makes some updates:
- Introduce feature bitmask to indicate supplemental features.
- Reserve a range of CHIDs for virtual ISM.
- Support extended GIDs (128 bits) in CLC handshake.
So this patch set aims to implement these updates in Linux kernel. And it
acts as the first part of SMC-D virtual ISM extension & loopback-ism [1].
v8->v7:
- Patch #7: v7 mistakenly changed the type of gid_ext in
smc_clc_msg_accept_confirm to u64 instead of __be64 as previous versions
when fixing the rebase conflicts. So fix this mistake.
v7->v6: Link: https://lore.kernel.org/netdev/[email protected]/
- Collect the Reviewed-by tag in v6;
- Patch #3: redefine the struct smc_clc_msg_accept_confirm;
- Patch #7: Because that the Patch #3 already adds '__packed' to
smc_clc_msg_accept_confirm, so Patch #7 doesn't need to do the same thing.
But this is a minor change, so I kept the 'Reviewed-by' tag.
Other changes in previous versions but not yet acked:
- Patch #1: Some minor changes in subject and fix the format issue
(length exceeds 80 columns) compared to v3.
- Patch #5: removes useless ini->feature_mask assignment in __smc_connect()
and smc_listen_v2_check() compared to v4.
- Patch #8: new added, compared to v3.
v6->v5: Link: https://lore.kernel.org/netdev/[email protected]/
- Add 'Reviewed-by' label given in the previous versions:
* Patch #4, #6, #9, #10 have nothing changed since v3;
- Patch #2:
* fix the format issue (Alignment should match open parenthesis) compared to v5;
* remove useless clc->hdr.length assignment in smcr_clc_prep_confirm_accept()
compared to v5;
- Patch #3: new added compared to v5.
- Patch #7: some minor changes like aclc_v2->aclc or clc_v2->clc compared to v5
due to the introduction of Patch #3. Since there were no major changes, I kept
the 'Reviewed-by' label.
Other changes in previous versions but not yet acked:
- Patch #1: Some minor changes in subject and fix the format issue
(length exceeds 80 columns) compared to v3.
- Patch #5: removes useless ini->feature_mask assignment in __smc_connect()
and smc_listen_v2_check() compared to v4.
- Patch #8: new added, compared to v3.
v4->v3:
https://lore.kernel.org/netdev/1701920994[email protected]/
- Patch #6: use SMCD_CLC_MAX_V2_GID_ENTRIES to indicate the max gid
entries in CLC proposal and using SMC_MAX_V2_ISM_DEVS to indicate the
max devices to propose;
- Patch #6: use i and i+1 in smc_find_ism_v2_device_serv();
- Patch #2: replace the large if-else block in smc_clc_send_confirm_accept()
with 2 subfunctions;
- Fix missing byte order conversion of GID and token in CLC handshake,
which is in a separate patch sending to net:
https://lore.kernel.org/netdev/1701882157[email protected]/
- Patch #7: add extended GID in SMC-D lgr netlink attribute;
v3->v2:
https://lore.kernel.org/netdev/1701343695[email protected]/
- Rename smc_clc_fill_fce as smc_clc_fill_fce_v2x;
- Remove ISM_IDENT_MASK from drivers/s390/net/ism.h;
- Add explicitly assigning 'false' to ism_v2_capable in ism_dev_init();
- Remove smc_ism_set_v2_capable() helper for now, and introduce it in
later loopback-ism implementation;
v2->v1:
- Fix sparse complaint;
- Rebase to the latest net-next;
====================
Wen Gu [Tue, 19 Dec 2023 14:26:16 +0000 (22:26 +0800)]
net/smc: manage system EID in SMC stack instead of ISM driver
The System EID (SEID) is an internal EID that is used by the SMCv2
software stack that has a predefined and constant value representing
the s390 physical machine that the OS is executing on. So it should
be managed by SMC stack instead of ISM driver and be consistent for
all ISMv2 device (including virtual ISM devices) on s390 architecture.
Wen Gu [Tue, 19 Dec 2023 14:26:15 +0000 (22:26 +0800)]
net/smc: disable SEID on non-s390 archs where virtual ISM may be used
The system EID (SEID) is an internal EID used by SMC-D to represent the
s390 physical machine that OS is executing on. On s390 architecture, it
predefined by fixed string and part of cpuid and is enabled regardless
of whether underlay device is virtual ISM or platform firmware ISM.
However on non-s390 architectures where SMC-D can be used with virtual
ISM devices, there is no similar information to identify physical
machines, especially in virtualization scenarios. So in such cases, SEID
is forcibly disabled and the user-defined UEID will be used to represent
the communicable space.
Wen Gu [Tue, 19 Dec 2023 14:26:14 +0000 (22:26 +0800)]
net/smc: support extended GID in SMC-D lgr netlink attribute
Virtual ISM devices introduced in SMCv2.1 requires a 128 bit extended
GID vs. the existing ISM 64bit GID. So the 2nd 64 bit of extended GID
should be included in SMC-D linkgroup netlink attribute as well.
Wen Gu [Tue, 19 Dec 2023 14:26:13 +0000 (22:26 +0800)]
net/smc: compatible with 128-bits extended GID of virtual ISM device
According to virtual ISM support feature defined by SMCv2.1, GIDs of
virtual ISM device are UUIDs defined by RFC4122, which are 128-bits
long. So some adaptation work is required. And note that the GIDs of
existing platform firmware ISM devices still remain 64-bits long.
Wen Gu [Tue, 19 Dec 2023 14:26:10 +0000 (22:26 +0800)]
net/smc: support SMCv2.x supplemental features negotiation
This patch adds SMCv2.x supplemental features negotiation. Supported
SMCv2.x supplemental features are represented by feature_mask in FCE
field. The negotiation process is as follows.
Wen Gu [Tue, 19 Dec 2023 14:26:09 +0000 (22:26 +0800)]
net/smc: unify the structs of accept or confirm message for v1 and v2
The structs of CLC accept and confirm messages for SMCv1 and SMCv2 are
separately defined and often casted to each other in the code, which may
increase the risk of errors caused by future divergence of them. So
unify them into one struct for better maintainability.
Marek Behún [Tue, 19 Dec 2023 16:24:15 +0000 (17:24 +0100)]
net: sfp: fix PHY discovery for FS SFP-10G-T module
Commit 2f3ce7a56c6e ("net: sfp: rework the RollBall PHY waiting code")
changed the long wait before accessing RollBall / FS modules into
probing for PHY every 1 second, and trying 25 times.
Wei Lei reports that this does not work correctly on FS modules: when
initializing, they may report values different from 0xffff in PHY ID
registers for some MMDs, causing get_phy_c45_ids() to find some bogus
MMD.
Fix this by adding the module_t_wait member back, and setting it to 4
seconds for FS modules.
David S. Miller [Sat, 23 Dec 2023 01:18:59 +0000 (01:18 +0000)]
Merge branch 'dpaa2-switch-small-improvements'
Ioana Ciornei says:
====================
dpaa2-switch: small improvements
This patch set consists of a series of small improvements on the
dpaa2-switch driver ranging from adding some more verbosity when
encountering errors to reorganizing code to be easily extensible.
Changes in v3:
- 4/8: removed the fixes tag and moved it to the commit message
- 5/8: specified that there is no user-visible effect
- 6/8: removed the initialization of the err variable
Changes in v2:
- No changes to the actual diff, only rephrased some commit messages and
added more information.
====================
Ioana Ciornei [Tue, 19 Dec 2023 11:59:33 +0000 (13:59 +0200)]
dpaa2-switch: cleanup the egress flood of an unused FDB
In case a DPAA2 switch interface joins a bridge, the FDB used on the
port will be changed to the one associated with the bridge. What this
means exactly is that any VLAN installed on the port will need to be
removed and then installed back so that it points to the new FDB.
Once this is done, the previous FDB will become unused (no VLAN to
point to it). Even though no traffic will reach this FDB, it's best to
just cleanup the state of the FDB by zeroing its egress flood domain.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:32 +0000 (13:59 +0200)]
dpaa2-switch: move a check to the prechangeupper stage
Two different DPAA2 switch ports from two different DPSW instances
cannot be under the same bridge. Instead of checking for this
unsupported configuration in the CHANGEUPPER event, check it as early as
possible in the PRECHANGEUPPER one.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:31 +0000 (13:59 +0200)]
dpaa2-switch: reorganize the [pre]changeupper events
Create separate functions, dpaa2_switch_port_prechangeupper and
dpaa2_switch_port_changeupper, to be called directly when a DPSW port
changes its upper device.
This way we are not open-coding everything in the main event callback
and we can easily extent, for example, with bond offload.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:30 +0000 (13:59 +0200)]
dpaa2-switch: do not clear any interrupts automatically
The DPSW object has multiple event sources multiplexed over the same
IRQ. The driver has the capability to configure only some of these
events to trigger the IRQ.
The dpsw_get_irq_status() can clear events automatically based on the
value stored in the 'status' variable passed to it. We don't want that
to happen because we could get into a situation when we are clearing
more events than we actually handled.
Just resort to manually clearing the events that we handled. Also, since
status is not used on the out path we remove its initialization to zero.
This change does not have a user-visible effect because the dpaa2-switch
driver enables and handles all the DPSW events which exist at the
moment.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:29 +0000 (13:59 +0200)]
dpaa2-switch: add ENDPOINT_CHANGED to the irq_mask
Commit 84cba72956fd ("dpaa2-switch: integrate the MAC endpoint support")
added support for MAC endpoints in the dpaa2-switch driver but omitted
to add the ENDPOINT_CHANGED irq to the list of interrupt sources. Fix
this by extending the list of events which can raise an interrupt by
extending the mask passed to the dpsw_set_irq_mask() firmware API.
There is no user visible impact even without this patch since whenever a
switch interface is connected/disconnected from an endpoint both events
are set (LINK_CHANGED and ENDPOINT_CHANGED) and, luckily, the
LINK_CHANGED event could actually raise the interrupt and thus get the
MAC/PHY SW configuration started.
Even with this, it's better to just not rely on undocumented firmware
behavior which can change.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:28 +0000 (13:59 +0200)]
dpaa2-switch: print an error when the vlan is already configured
Print a netdev error when we hit a case in which a specific VLAN is
already configured on the port. While at it, change the already existing
netdev_warn into an _err for consistency purposes.
Ioana Ciornei [Tue, 19 Dec 2023 11:59:26 +0000 (13:59 +0200)]
dpaa2-switch: set interface MAC address only on endpoint change
There is no point in updating the MAC address of a switch interface each
time the link state changes, this only needs to happen in case the
endpoint changes (the switch interface is [dis]connected from/to a MAC).
Just move the call to dpaa2_switch_port_set_mac_addr() under
DPSW_IRQ_EVENT_ENDPOINT_CHANGED.
Roger Quadros [Tue, 19 Dec 2023 10:58:04 +0000 (12:58 +0200)]
net: ethernet: ti: am65-cpsw-qos: Add Frame Preemption MAC Merge support
Add driver support for viewing / changing the MAC Merge sublayer
parameters and seeing the verification state machine's current state
via ethtool.
As hardware does not support interrupt notification for verification
events we resort to polling on link up. On link up we try a couple of
times for verification success and if unsuccessful then give up.
The Frame Preemption feature is described in the Technical Reference
Manual [1] in section:
12.3.1.4.6.7 Intersperced Express Traffic (IET – P802.3br/D2.0)
Due to Silicon Errata i2208 [2] we set limit min IET fragment size to
124 (excluding 4 bytes mCRC).
This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows
not only setting up pri:tc mapping, but also configuring TX shapers
(rate-limiting) on external port FIFOs.
The MQPRIO Qdisc offload is expected to work with or without VLAN/priority
tagged packets.
The CPSW external Port FIFO has 8 Priority queues. The rate-limit can be
set for each of these priority queues. Which Priority queue a packet is
assigned to depends on PN_REG_TX_PRI_MAP register which maps header
priority to switch priority.
The header priority of a packet is assigned via the RX_PRI_MAP_REG which
maps packet priority to header priority.
The packet priority is either the VLAN priority (for VLAN tagged packets)
or the thread/channel offset.
For simplicity, we assign the same priority queue to all queues of a
Traffic Class so it can be rate-limited correctly.
ip link add link eth1 name eth1.100 type vlan id 100
ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
In the above example two ports share the same TX CPPI queue 0 for low
priority traffic. 3 traffic classes are defined for eth1 and mapped to:
TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit
TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s
TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s
Roger Quadros [Tue, 19 Dec 2023 10:57:59 +0000 (12:57 +0200)]
net: ethernet: am65-cpsw: Rename TI_AM65_CPSW_TAS to TI_AM65_CPSW_QOS
We will use this Kconfig option to not only enable TAS/EST offload
but also other QoS features like Multiqueue priority descriptors
and MAC-Merge/Frame Preemption. TI_AM65_CPSW_QOS seems a more
appropriate Kconfig option name than TI_AM65_CPSW_TAS.
Vladimir Oltean [Tue, 19 Dec 2023 10:57:56 +0000 (12:57 +0200)]
selftests: forwarding: ethtool_mm: support devices with higher rx-min-frag-size
Some devices have errata due to which they cannot report ETH_ZLEN (60)
in the rx-min-frag-size. This was foreseen of course, and lldpad has
logic that when we request it to advertise addFragSize 0, it will round
it up to the lowest value that is _actually_ supported by the hardware.
The problem is that the selftest expects lldpad to report back to us the
same value as we requested.
Make the selftest smarter by figuring out on its own what is a
reasonable value to expect.
====================
Convert net selftests to run in unique namespace (last part)
Here is the last part of converting net selftests to run in unique namespace.
This part converts all left tests. After the conversion, we can run the net
sleftests in parallel. e.g.
# ./run_kselftest.sh -n -t net:reuseport_bpf
TAP version 13
1..1
# selftests: net: reuseport_bpf
ok 1 selftests: net: reuseport_bpf
mod 10...
# Socket 0: 0
# Socket 1: 1
...
# Socket 4: 19
# Testing filter add without bind...
# SUCCESS
# ./run_kselftest.sh -p -n -t net:cmsg_so_mark.sh -t net:cmsg_time.sh -t net:cmsg_ipv6.sh
TAP version 13
1..3
# selftests: net: cmsg_so_mark.sh
ok 1 selftests: net: cmsg_so_mark.sh
# selftests: net: cmsg_time.sh
ok 2 selftests: net: cmsg_time.sh
# selftests: net: cmsg_ipv6.sh
ok 3 selftests: net: cmsg_ipv6.sh
# ./run_kselftest.sh -p -n -c net
TAP version 13
1..95
# selftests: net: reuseport_bpf_numa
ok 3 selftests: net: reuseport_bpf_numa
# selftests: net: reuseport_bpf_cpu
ok 2 selftests: net: reuseport_bpf_cpu
# selftests: net: sk_bind_sendto_listen
ok 9 selftests: net: sk_bind_sendto_listen
# selftests: net: reuseaddr_conflict
ok 5 selftests: net: reuseaddr_conflict
...
Hangbin Liu [Tue, 19 Dec 2023 09:48:54 +0000 (17:48 +0800)]
selftests/net: use unique netns name for setup_loopback.sh setup_veth.sh
The setup_loopback and setup_veth use their own way to create namespace.
So let's just re-define server_ns/client_ns to unique name.
At the same time update the namespace name in gro.sh and toeplitz.sh.
As I don't have env to run toeplitz.sh. Here is only the gro test result.
# ./gro.sh
running test ipv4 data
Expected {200 }, Total 1 packets
Received {200 }, Total 1 packets.
...
Gro::large test passed.
All Tests Succeeded!
Hangbin Liu [Tue, 19 Dec 2023 09:48:53 +0000 (17:48 +0800)]
selftests/net: convert xfrm_policy.sh to run it in unique namespace
Here is the test result after conversion.
# ./xfrm_policy.sh
PASS: policy before exception matches
PASS: ping to .254 bypassed ipsec tunnel (exceptions)
PASS: direct policy matches (exceptions)
PASS: policy matches (exceptions)
PASS: ping to .254 bypassed ipsec tunnel (exceptions and block policies)
PASS: direct policy matches (exceptions and block policies)
PASS: policy matches (exceptions and block policies)
PASS: ping to .254 bypassed ipsec tunnel (exceptions and block policies after hresh changes)
PASS: direct policy matches (exceptions and block policies after hresh changes)
PASS: policy matches (exceptions and block policies after hresh changes)
PASS: ping to .254 bypassed ipsec tunnel (exceptions and block policies after hthresh change in ns3)
PASS: direct policy matches (exceptions and block policies after hthresh change in ns3)
PASS: policy matches (exceptions and block policies after hthresh change in ns3)
PASS: ping to .254 bypassed ipsec tunnel (exceptions and block policies after htresh change to normal)
PASS: direct policy matches (exceptions and block policies after htresh change to normal)
PASS: policy matches (exceptions and block policies after htresh change to normal)
PASS: policies with repeated htresh change
Hangbin Liu [Tue, 19 Dec 2023 09:48:51 +0000 (17:48 +0800)]
selftests/net: convert rtnetlink.sh to run it in unique namespace
When running the test in namespace, the debugfs may not load automatically.
So add a checking to make sure debugfs loaded. Here is the test result
after conversion.
# ./rtnetlink.sh
PASS: policy routing
PASS: route get
...
PASS: address proto IPv4
PASS: address proto IPv6
Dmitry Safonov [Tue, 19 Dec 2023 02:03:05 +0000 (02:03 +0000)]
selftest/tcp-ao: Rectify out-of-tree build
Trivial fix for out-of-tree build that I wasn't testing previously:
1. Create a directory for library object files, fixes:
> gcc lib/kconfig.c -Wall -O2 -g -D_GNU_SOURCE -fno-strict-aliasing -I ../../../../../usr/include/ -iquote /tmp/kselftest/kselftest/net/tcp_ao/lib -I ../../../../include/ -o /tmp/kselftest/kselftest/net/tcp_ao/lib/kconfig.o -c
> Assembler messages:
> Fatal error: can't create /tmp/kselftest/kselftest/net/tcp_ao/lib/kconfig.o: No such file or directory
> make[1]: *** [Makefile:46: /tmp/kselftest/kselftest/net/tcp_ao/lib/kconfig.o] Error 1
2. Include $(KHDR_INCLUDES) that's exported by selftests/Makefile, fixes:
> In file included from lib/kconfig.c:6:
> lib/aolib.h:320:45: warning: ‘struct tcp_ao_add’ declared inside parameter list will not be visible outside of this definition or declaration
> 320 | extern int test_prepare_key_sockaddr(struct tcp_ao_add *ao, const char *alg,
> | ^~~~~~~~~~
...
3. While at here, clean-up $(KSFT_KHDR_INSTALL): it's not needed anymore
since commit f2745dc0ba3d ("selftests: stop using KSFT_KHDR_INSTALL")
4. Also, while at here, drop .DEFAULT_GOAL definition: that has a
self-explaining comment, that was valid when I made these selftests
compile on local v4.19 kernel, but not needed since
commit 8ce72dc32578 ("selftests: fix headers_install circular dependency")
Fixes: cfbab37b3da0 ("selftests/net: Add TCP-AO library") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Dmitry Safonov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jonathan Corbet [Tue, 19 Dec 2023 00:28:13 +0000 (17:28 -0700)]
tipc: Remove some excess struct member documentation
Remove documentation for nonexistent struct members, addressing these
warnings:
./net/tipc/link.c:228: warning: Excess struct member 'media_addr' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'timer' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'refcnt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'proto_msg' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'pmsg' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'backlog_limit' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'exp_msg_count' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'reset_rcv_checkpt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'transmitq' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'snt_nxt' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'deferred_queue' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'unacked_window' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'next_out' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'long_msg_seq_no' description in 'tipc_link'
./net/tipc/link.c:228: warning: Excess struct member 'bc_rcvr' description in 'tipc_link'
Jonathan Corbet [Tue, 19 Dec 2023 00:26:26 +0000 (17:26 -0700)]
net: skbuff: Remove some excess struct-member documentation
Remove documentation for nonexistent structure members, addressing these
warnings:
./include/linux/skbuff.h:1063: warning: Excess struct member 'sp' description in 'sk_buff'
./include/linux/skbuff.h:1063: warning: Excess struct member 'nf_bridge' description in 'sk_buff'