Git Repo - linux.git/log

net/mlx5e: SHAMPO, Re-enable HW-GRO

Add back HW-GRO to the reported features.

As the current implementation of HW-GRO uses KSMs with a
specific fixed buffer size (256B) to map its headers buffer,
we reported the feature only if the NIC is supporting KSM and
the minimum value for buffer size is below the requested one.

iperf3 bandwidth comparison:
+---------+--------+--------+-----------+
| streams | SW GRO | HW GRO | Unit      |
|---------+--------+--------+-----------|
| 1       | 36     | 42     | Gbits/sec |
| 4       | 34     | 39     | Gbits/sec |
| 8       | 31     | 35     | Gbits/sec |
+---------+--------+--------+-----------+

A downstream patch will add skb fragment coalescing which will improve
performance considerably.

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue

Signed-off-by: Yoray Zack <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Use KSMs instead of KLMs

KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact,
it is a faster mechanism than KLM.

SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer.
As it used KLMs with the same buffer size for each entry,
we can use KSMs instead.

This commit changes the Mkeys that map the SHAMPO headers buffer
from KLMs to KSMs.

Signed-off-by: Yoray Zack <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Add header-only ethtool counters for header data split

Count the number of header-only packets and bytes from SHAMPO.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Drop rx_gro_match_packets counter

After modifying rx_gro_packets to be more accurate, the
rx_gro_match_packets counter is redundant.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Make GRO counters more precise

Don't count non GRO packets. A non GRO packet is a packet with
a GRO cb count of 1.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Skipping on duplicate flush of the same SHAMPO SKB

SHAMPO SKB can be flushed in mlx5e_shampo_complete_rx_cqe().
If the SKB was flushed, rq->hw_gro_data->skb was also set to NULL.

We can skip on flushing the SKB in mlx5e_shampo_flush_skb
if rq->hw_gro_data->skb == NULL.

Signed-off-by: Yoray Zack <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Specialize mlx5e_fill_skb_data()

mlx5e_fill_skb_data() used to have multiple callers. But after the XDP
multibuf refactoring from commit 2cb0e27d43b4 ("net/mlx5e: RX, Prepare
non-linear striding RQ for XDP multi-buffer support") the SHAMPO code
path is the only caller.

Take advantage of this and specialize the function:
- Drop the redundant check.
- Assume that data_bcnt is > 0. This is needed in a downstream patch.

Rename the function as well to make things clear.

Signed-off-by: Dragos Tatulea <[email protected]>
Suggested-by: Tariq Toukan <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Simplify header page release in teardown

The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd)
has some complicated logic that comes from the fact that it is called
twice during teardown:
1) To release the posted header pages that didn't get any completions.
2) To release all remaining header pages.

This flow is not necessary: all header pages can be released from the
driver side in one go. Furthermore, the above flow is buggy. Taking the
8 headers per page example:
1) Release fragments 5-7. Page will be released.
2) Release remaining fragments 0-4. The bits in the header will indicate
that the page needs releasing. But this is incorrect: page was
released in step 1.

This patch releases all header pages in one go. This simplifies the
header page cleanup function. For consistency, the datapath header
page release API (mlx5e_free_rx_shampo_hd_entry()) is used.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Disable gso_size for non GRO packets

When HW GRO is enabled, forwarding of packets is broken due to gso_size
being set incorrectly on non GRO packets.

Non GRO packets have a skb GRO count of 1. mlx5 always sets gso_size on
the skb, even for non GRO packets. It leans on the fact that gso_size is
normally reset in napi_gro_complete(). But this happens only for packets
from GRO'able protocols (TCP/UDP) that have a gro_receive() handler.

The problematic scenarios are:

1) Non GRO protocol packets are received, validate_xmit_skb() will drop
   them (see EPROTONOSUPPORT in skb_mac_gso_segment()). The fix for
   this case would be to not set gso_size at all for SHAMPO packets with
   header size 0.

2) Packets from a GRO'ed protocol (TCP) are received but immediately
   flushed because they are not GRO'able (TCP SYN for example).
   mlx5e_shampo_update_hdr(), which updates the remaining GRO state on
   the skb, is not called because skb GRO count is 1. The fix here would
   be to always call mlx5e_shampo_update_hdr(), regardless of skb GRO
   count. But this call is expensive

The unified fix for both cases is to reset gso_size before calling
napi_gro_receive(). It is a change that is more effective (no call to
mlx5e_shampo_update_hdr() necessary) and simple (smallest code
footprint).

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Fix FCS config when HW GRO on

For the following scenario:

ethtool --features eth3 rx-gro-hw on
ethtool --features eth3 rx-fcs on
ethtool --features eth3 rx-fcs off

... there is a firmware error because the driver enables HW GRO first
while FCS is still enabled.

This patch fixes this by swapping the order of HW GRO and FCS for this
specific case. Take LRO into consideration as well for consistency.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Fix invalid WQ linked list unlink

When all the strides in a WQE have been consumed, the WQE is unlinked
from the WQ linked list (mlx5_wq_ll_pop()). For SHAMPO, it is possible
to receive CQEs with 0 consumed strides for the same WQE even after the
WQE is fully consumed and unlinked. This triggers an additional unlink
for the same wqe which corrupts the linked list.

Fix this scenario by accepting 0 sized consumed strides without
unlinking the WQE again.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Fix incorrect page release

Under the following conditions:
1) No skb created yet
2) header_size == 0 (no SHAMPO header)
3) header_index + 1 % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE == 0 (this is the
last page fragment of a SHAMPO header page)

a new skb is formed with a page that is NOT a SHAMPO header page (it
is a regular data page). Further down in the same function
(mlx5e_handle_rx_cqe_mpwrq_shampo()), a SHAMPO header page from
header_index is released. This is wrong and it leads to SHAMPO header
pages being released more than once.

Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/mlx5e: SHAMPO, Use net_prefetch API

Let the SHAMPO functions use the net-specific prefetch API,
similar to all other usages.

Signed-off-by: Tariq Toukan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: hsr: Extend the hsr_ping.sh test to use fixed MAC addresses

Fixed MAC addresses help with debugging as last four bytes identify the
network namespace.

Signed-off-by: Lukasz Majewski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: hsr: Extend the hsr_redbox.sh test to use fixed MAC addresses

Fixed MAC addresses help with debugging as last four bytes identify the
network namespace.

Moreover, it allows to mimic the real life setup with for example bridge
having the same MAC address on each port.

Signed-off-by: Lukasz Majewski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'vmxnet3-upgrade-to-version-9'

Ronak Doshi says:

====================
vmxnet3: upgrade to version 9

vmxnet3 emulation has recently added timestamping feature which allows the
hypervisor (ESXi) to calculate latency from guest virtual NIC driver to all
the way up to the physical NIC. This patch series extends vmxnet3 driver
to leverage these new feature.

Compatibility is maintained using existing vmxnet3 versioning mechanism as
follows:
- new features added to vmxnet3 emulation are associated with new vmxnet3
   version viz. vmxnet3 version 9.
- emulation advertises all the versions it supports to the driver.
- during initialization, vmxnet3 driver picks the highest version number
   supported by both the emulation and the driver and configures emulation
   to run at that version.

In particular, following changes are introduced:

Patch 1:
  This patch introduces utility macros for vmxnet3 version 9 comparison
  and updates Copyright information.

Patch 2:
  This patch adds support to timestamp the packets so as to allow latency
  measurement in the ESXi.

Patch 3:
  This patch adds support to disable certain offloads on the device based
  on the request specified by the user in the VM configuration.

Patch 4:
  With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver,
  with this patch, the driver can configure emulation to run at vmxnet3
  version 9.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vmxnet3: update to version 9

With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver,
the driver can configure emulation to run at vmxnet3 version 9, provided
the emulation advertises support for version 9.

Signed-off-by: Ronak Doshi <[email protected]>
Acked-by: Guolin Yang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vmxnet3: add command to allow disabling of offloads

This patch adds a new command to disable certain offloads. This
allows user to specify, using VM configuration, if certain offloads
need to be disabled.

Signed-off-by: Ronak Doshi <[email protected]>
Acked-by: Guolin Yang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vmxnet3: add latency measurement support in vmxnet3

This patch enhances vmxnet3 to support latency measurement.
This support will help to track the latency in packet processing
between guest virtual nic driver and host. For this purpose, we
introduce a new timestamp ring in vmxnet3 which will be per Tx/Rx
queue. This ring will be used to carry timestamp of the packets
which will be used to calculate the latency.

User can enable latency measurement using realtime knob in vnic
setting in VCenter.

Signed-off-by: Ronak Doshi <[email protected]>
Acked-by: Guolin Yang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vmxnet3: prepare for version 9 changes

vmxnet3 is currently at version 7 and this patch initiates the
preparation to accommodate changes for up to version 9. Introduced
utility macros for vmxnet3 version 9 comparison and update Copyright
information.

Signed-off-by: Ronak Doshi <[email protected]>
Acked-by: Guolin Yang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ionic: advertise 52-bit addressing limitation for MSI-X

Current ionic devices only support 52 internal physical address
lines. This is sufficient for x86_64 systems which have similar
limitations but does not apply to all other architectures,
notably IBM POWER (ppc64). To ensure that MSI/MSI-X vectors are
not set outside the physical address limits of the NIC, set the
no_64bit_msi value of the pci_dev structure during device probe.

Signed-off-by: David Christensen <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: fix atomic counter for ptp packets

atomic_dec_if_positive returns new value regardless if it is updated or
not. The commit in fixes changed the behavior of the condition to one
that differs from original code. Restore original condition to properly
maintain atomic counter.

Fixes: 165f87691a89 ("bnxt_en: add timestamping statistics support")
Reviewed-by: Michael Chan <[email protected]>
Signed-off-by: Vadim Fedorenko <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'tcp-rto-min-us'

Kevin Yang says:

====================
tcp: add sysctl_tcp_rto_min_us

Adding a sysctl knob to allow user to specify a default
rto_min at socket init time.

After this patch series, the rto_min will has multiple sources:
route option has the highest precedence, followed by the
TCP_BPF_RTO_MIN socket option, followed by this new
tcp_rto_min_us sysctl.

v3:
fix typo, simplify min/max_t to min/max

v2:
fit line width to 80 column.

v2: https://lore.kernel.org/netdev/20240530153436.2202800 [email protected]/
v1: https://lore.kernel.org/netdev/20240528171320.1332292 [email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: add sysctl_tcp_rto_min_us

Adding a sysctl knob to allow user to specify a default
rto_min at socket init time, other than using the hard
coded 200ms default rto_min.

Note that the rto_min route option has the highest precedence
for configuring this setting, followed by the TCP_BPF_RTO_MIN
socket option, followed by the tcp_rto_min_us sysctl.

Signed-off-by: Kevin Yang <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Yuchung Cheng <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: derive delack_max with tcp_rto_min helper

Rto_min now has multiple sources, ordered by preprecedence high to
low: ip route option rto_min, icsk->icsk_rto_min.

When derive delack_max from rto_min, we should not only use ip
route option, but should use tcp_rto_min helper to get the correct
rto_min.

Signed-off-by: Kevin Yang <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Yuchung Cheng <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp

These fields can be read and written locklessly, add annotations
around these minor races.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-af: Add debugfs support to dump NIX TM topology

This patch adds support to dump NIX transmit queue topology.
There are multiple levels of scheduling/shaping supported by
NIX and a packet traverses through multiple levels before sending
the packet out. At each level, there are set of scheduling/shaping
rules applied to a packet flow.

Each packet traverses through multiple levels
SQ->SMQ->TL4->TL3->TL2->TL1 and these levels are mapped in a parent-child
relationship.

This patch dumps the debug information related to all TM Levels in
the following way.

Example:
$ echo <nixlf> > /sys/kernel/debug/octeontx2/nix/tm_tree
$ cat /sys/kernel/debug/octeontx2/nix/tm_tree

A more desriptive set of registers at each level can be dumped
in the following way.

Example:
$ echo <nixlf> > /sys/kernel/debug/octeontx2/nix/tm_topo
$ cat /sys/kernel/debug/octeontx2/nix/tm_topo

Signed-off-by: Anshumali Gaur <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'devlink-const'

Christophe JAILLET says:

====================
devlink: Constify struct devlink_dpipe_table_ops

Patch 1 updates devl_dpipe_table_register() and struct
devlink_dpipe_table to accept "const struct devlink_dpipe_table_ops".

Then patch 2 updates the only user of this function.

This is compile tested only.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Constify struct devlink_dpipe_table_ops

'struct devlink_dpipe_table_ops' are not modified in this driver.

Constifying these structures moves some data to a read-only section, so
increase overall security.

On a x86_64, with allmodconfig:
Before:
======
   text    data     bss     dec     hex filename
  15557     712       0   16269    3f8d drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o

After:
=====
   text    data     bss     dec     hex filename
  15789     488       0   16277    3f95 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o

Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()

"struct devlink_dpipe_table_ops" only contains some function pointers.

Update "struct devlink_dpipe_table" and the 'table_ops' parameter of
devl_dpipe_table_register() so that structures in drivers can be
constified.

Constifying these structures will move some data to a read-only section, so
increase overall security.

Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: aquantia: add support for PHY LEDs

Aquantia Ethernet PHYs got 3 LED output pins which are typically used
to indicate link status and activity.
Add a minimal LED controller driver supporting the most common uses
with the 'netdev' trigger as well as software-driven forced control of
the LEDs.

Signed-off-by: Daniel Golle <[email protected]>
[ rework indentation, fix checkpatch error and improve some functions ]
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: aquantia: move priv and hw stat to header

In preparation for LEDs support, move priv and hw stat to header to
reference priv struct also in other .c outside aquantia.main

Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethtool: remove unused struct 'cable_test_tdr_req_info'

'cable_test_tdr_req_info' is unused since the original
commit f2bc8ad31a7f ("net: ethtool: Allow PHY cable test TDR data to
configured").

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: caif: remove unused structs

'cfpktq' has been unused since
commit 73d6ac633c6c ("caif: code cleanup").

'caif_packet_funcs' is declared but never defined.

Remove both of them.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: remove NULL-pointer net parameter in ip_metrics_convert

When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always triggers NULL
pointer crash. Then I digged into this part, realizing that we can remove
this one due to its uselessness.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: fix an inconsistent indentation

Smatch complains:
net/bridge/br_netlink_tunnel.c:
318 br_process_vlan_tunnel_info() warn: inconsistent indenting

Fix it with a proper indenting

Signed-off-by: Chen Hanxiao <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dt-bindings: dsa: Rewrite Vitesse VSC73xx in schema

This rewrites the Vitesse VSC73xx DSA switches DT binding in
schema.

It was a bit tricky since I needed to come up with some way
of applying the SPI properties only on SPI devices and not
platform devices, but I figured something out that works.

Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Rob Herring (Arm) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "ethernet: octeontx2: avoid linking objects into multiple modules"

This reverts commit 727c94c9539aa8865cdbf6a783da6a6585f1fec2.

Stephen reports that this commit causes a circular module dependency
for him. Revert, and we'll try to address the problem, again.

Reported-by: Stephen Rothwell <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

openvswitch: Remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Subbaraya Sundeep <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

openvswitch: Move stats allocation to core

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core instead
of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Move openvswitch driver to leverage the core allocation.

Signed-off-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'tcp-refactor-skb_cmp_decrypted-checks'

Jakub Kicinski says:

====================
tcp: refactor skb_cmp_decrypted() checks

Refactor the input patch coalescing checks and wrap "EOR forcing"
logic into a helper. This will hopefully make the code easier to
follow. While at it throw some DEBUG_NET checks into skb_shift().
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: skb: add compatibility warnings to skb_shift()

According to current semantics we should never try to shift data
between skbs which differ on decrypted or pp_recycle status.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

tcp: add a helper for setting EOR on tail skb

TLS (and hopefully soon PSP will) use EOR to prevent skbs
with different decrypted state from getting merged, without
adding new tests to the skb handling. In both cases once
the connection switches to an "encrypted" state, all subsequent
skbs will be encrypted, so a single "EOR fence" is sufficient
to prevent mixing.

Add a helper for setting the EOR bit, to make this arrangement
more explicit.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

tcp: wrap mptcp and decrypted checks into tcp_skb_can_collapse_rx()

tcp_skb_can_collapse() checks for conditions which don't make
sense on input. Because of this we ended up sprinkling a few
pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls
on the input path. Group them in a new helper. This should make
it less likely that someone will check mptcp and not decrypted
or vice versa when adding new code.

This implicitly adds a decrypted check early in tcp_collapse().
AFAIU this will very slightly increase our ability to collapse
packets under memory pressure, not a real bug.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'net-allow-dissecting-matching-tunnel-control-flags'

Davide Caratti says:

====================
net: allow dissecting/matching tunnel control flags

Ilya says: "for correct matching on decapsulated packets, we should match
on not only tunnel id and headers, but also on tunnel configuration flags
like TUNNEL_NO_CSUM and TUNNEL_DONT_FRAGMENT. This is done to distinguish
similar tunnels with slightly different configs. And it is important since
tunnel configuration is flow based, i.e. can be different for every packet,
even though the main tunnel port is the same."

- patch 1 extends the kernel's flow dissector to extract these flags
   from the packet's tunnel metadata.
- patch 2 extends TC flower to match on any combination of TUNNEL_NO_CSUM,
   TUNNEL_DONT_FRAGMENT, TUNNEL_OAM, TUNNEL_CRIT_OPT

v4:
- fix kernel-doc warning in flow_dissector.h (thanks Jakub)

v3:
- rebase on top of new uAPI bits and internals after commit 5832c4a77d69
   ("ip_tunnel: convert __be16 tunnel flags to bitmaps"). Use of network
   byte order is no more needed, since these bits match on metadata: convert
   netlink attributes to be u32.
- also include TUNNEL_CRIT_OPT

v2:
- use NL_REQ_ATTR_CHECK() where possible (thanks Jamal)
- don't overwrite 'ret' in the error path of fl_set_key_flags()
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net/sched: cls_flower: add support for matching tunnel control flags

extend cls_flower to match TUNNEL_FLAGS_PRESENT bits in tunnel metadata.

Suggested-by: Ilya Maximets <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

flow_dissector: add support for tunnel control flags

Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata.
This is a prerequisite for matching these control flags using TC flower.

Suggested-by: Ilya Maximets <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: count drops due to missing qdisc as dev->tx_drops

Catching and debugging missing qdiscs is pretty tricky. When qdisc
is deleted we replace it with a noop qdisc, which silently drops
all the packets. Since the noop qdisc has a single static instance
we can't count drops at the qdisc level. Count them as dev->tx_drops.

  ip netns add red
  ip link add type veth peer netns red
  ip            link set dev veth0 up
  ip -netns red link set dev veth0 up
  ip            a a dev veth0 10.0.0.1/24
  ip -netns red a a dev veth0 10.0.0.2/24
  ping -c 2 10.0.0.2
  #  2 packets transmitted, 2 received, 0% packet loss, time 1031ms
  ip -s link show dev veth0
  #  TX:  bytes packets errors dropped carrier collsns
  #        1314      17      0       0       0       0

  tc qdisc replace dev veth0 root handle 1234: mq
  tc qdisc replace dev veth0 parent 1234:1 pfifo
  tc qdisc del dev veth0 parent 1234:1
  ping -c 2 10.0.0.2
  #  2 packets transmitted, 0 received, 100% packet loss, time 1034ms
  ip -s link show dev veth0
  #  TX:  bytes packets errors dropped carrier collsns
  #        1314      17      0       3       0       0

Signed-off-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

r8152: Wake up the system if the we need a reset

If we get to the end of the r8152's suspend() routine and we find that
the USB device is INACCESSIBLE then it means that some of our
preparation for suspend didn't take place. We need a USB reset to get
ourselves back in a consistent state so we can try again and that
can't happen during system suspend. Call pm_wakeup_event() to wake the
system up in this case.

Signed-off-by: Douglas Anderson <[email protected]>
Acked-by: Hayes Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

r8152: If inaccessible at resume time, issue a reset

If we happened to get a USB transfer error during the transition to
suspend then the usb_queue_reset_device() that r8152_control_msg()
calls will get dropped on the floor. This is because
usb_lock_device_for_reset() (which usb_queue_reset_device() uses)
silently fails if it's called when a device is suspended or if too
much time passes.

Let's resolve this by resetting the device ourselves in r8152's
resume() function.

NOTE: due to timing, it's _possible_ that we could end up with two USB
resets: the one queued previously and the one called from the resume()
patch. This didn't happen in test cases I ran, though it's conceivably
possible. We can't easily know if this happened since
usb_queue_reset_device() can just silently drop the reset request. In
any case, it's not expected that this is a problem since the two
resets can't run at the same time (because of the device lock) and it
should be OK to reset the device twice. If somehow the double-reset
causes problems we could prevent resets from being queued up while
suspend is running.

Signed-off-by: Douglas Anderson <[email protected]>
Acked-by: Hayes Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'Felix-DSA-probing-cleanup'

Vladimir Oltean says:

====================
Probing cleanup for the Felix DSA driver

This is a follow-up to Russell King's request for code consolidation
among felix_vsc9959, seville_vsc9953 and ocelot_ext, stated here:
https://lore.kernel.org/all/[email protected]/

Details are in individual patches. Testing was done on NXP LS1028A
(felix_vsc9959).
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: unexport felix_phylink_mac_ops and felix_switch_ops

Now that the common felix_register_switch() from the umbrella driver
is the only entity that accesses these data structures, we can remove
them from the list of the exported symbols.

Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Sai Krishna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: common probing code

Russell King suggested that felix_vsc9959, seville_vsc9953 and
ocelot_ext have a large portion of duplicated init code, which could be
made common [1].

[1]: https://lore.kernel.org/all/[email protected]/

Here, we take the following common steps:
- "felix" and "ds" structure allocation
- "felix", "ocelot" and "ds" basic structure initialization
- dsa_register_switch() call

and we make a common function out of them.

For every driver except felix_vsc9959, this is also the entire probing
procedure. For felix_vsc9959, we also need to do some PCI-specific
stuff, which can easily be reordered to be done before, and unwound on
failure.

We also have to convert the bus-specific platform_set_drvdata() and
pci_set_drvdata() calls into dev_set_drvdata(). But this should have no
impact on the behavior.

Suggested-by: "Russell King (Oracle)" <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Tested-by: Colin Foster <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: use ds->num_tx_queues = OCELOT_NUM_TC for all models

Russell King points out that seville_vsc9953 populates
felix->info->num_tx_queues = 8, but this doesn't make it all the way
into ds->num_tx_queues (which is how the user interface netdev queues
get allocated) [1].

[1]: https://lore.kernel.org/all/20240415160150.yejcazpjqvn7vhxu@skbuf/

When num_tx_queues=0 for seville, this is implicitly converted to 1 by
dsa_user_create(), and this is good enough for basic operation for a
switch port. The tc qdisc offload layer works with netdev TX queues,
so for QoS offload we need to pretend we have multiple TX queues. The
VSC9953, like ocelot_ext, doesn't export QoS offload, so it doesn't
really matter. But we can definitely set num_tx_queues=8 for all
switches.

The felix->info->num_tx_queues construct itself seems unnecessary.
It was introduced by commit de143c0e274b ("net: dsa: felix: Configure
Time-Aware Scheduler via taprio offload") at a time when vsc9959
(LS1028A) was the only switch supported by the driver.

8 traffic classes, and 1 queue per traffic class, is a common
architectural feature of all switches in the family. So they could
all just set OCELOT_NUM_TC and be fine.

Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Colin Foster <[email protected]>
Tested-by: Colin Foster <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: move devm_request_threaded_irq() to felix_setup()

The current placement of devm_request_threaded_irq() is inconvenient.
It is between the allocation of the "felix" structure and
dsa_register_switch(), both of which we'd like to refactor into a
function that's common for all switches. But the IRQ is specific to
felix_vsc9959.

A closer inspection of the felix_irq_handler() code suggests that
it does things that depend on the data structures having been fully
initialized. For example, ocelot_get_txtstamp() takes
&port->tx_skbs.lock, which has only been initialized in
ocelot_init_port() which has not run yet.

It is not one of those IRQF_SHARED IRQs, so CONFIG_DEBUG_SHIRQ_FIXME
shouldn't apply here, and thus, it doesn't really matter, because in
practice, the IRQ will not be triggered so early. Nonetheless, it is a
good practice for the driver to be prepared for it to fire as soon as it
is requested.

Create a new felix->info method for running custom code for vsc9959 from
within felix_setup(), and move the request_irq() call there. The
ocelot_ext should have an IRQ as well, so this should be a step in the
right direction for that model (VSC7512) as well.

Some minor changes are made while moving the code. Casts from void *
aren't necessary, so drop them, and rename felix_irq_handler() to the
more specific vsc9959_irq_handler().

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: consistently use devres in felix_pci_probe()

Russell King suggested that felix_vsc9959, seville_vsc9953 and
ocelot_ext have a large portion of duplicated init and teardown code,
which could be made common [1]. The teardown code could even be
simplified away if we made use of devres, something which is used here
and there in the felix driver, just not very consistently.

[1] https://lore.kernel.org/all/[email protected]/

Prepare the ground in the felix_vsc9959 driver, by allocating the data
structures using devres and deleting the kfree() calls. This also
deletes the "Failed to allocate ..." message, since memory allocation
errors are extremely loud anyway, and it's hard to miss them.

Suggested-by: "Russell King (Oracle)" <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: delete open coded status = "disabled" parsing

Since commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled
status"), PCI device drivers with OF bindings no longer need this check.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: use devres in seville_probe()

Russell King suggested that felix_vsc9959, seville_vsc9953 and
ocelot_ext have a large portion of duplicated init and teardown code,
which could be made common [1]. The teardown code could even be
simplified away if we made use of devres, something which is used here
and there in the felix driver, just not very consistently.

[1] https://lore.kernel.org/all/[email protected]/

Prepare the ground in the seville_vsc9953 driver, by allocating the data
structures using devres and deleting the kfree() calls. This also
deletes the "Failed to allocate ..." message, since memory allocation
errors are extremely loud anyway, and it's hard to miss them.

Suggested-by: "Russell King (Oracle)" <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: ocelot: use devres in ocelot_ext_probe()

Russell King suggested that felix_vsc9959, seville_vsc9953 and
ocelot_ext have a large portion of duplicated init and teardown code,
which could be made common [1]. The teardown code could even be
simplified away if we made use of devres, something which is used here
and there in the felix driver, just not very consistently.

[1] https://lore.kernel.org/all/[email protected]/

Prepare the ground in the ocelot_ext driver, by allocating the data
structures using devres and deleting the kfree() calls. This also
deletes the "Failed to allocate ..." message, since memory allocation
errors are extremely loud anyway, and it's hard to miss them.

Suggested-by: "Russell King (Oracle)" <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Colin Foster <[email protected]>
Tested-by: Colin Foster <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-smc-snd_buf-rcv_buf'

Guangguan Wang says:

====================
net/smc: Change the upper boundary of SMC-R's snd_buf and rcv_buf to 512MB

SMCR_RMBE_SIZES is the upper boundary of SMC-R's snd_buf and rcv_buf.
The maximum bytes of snd_buf and rcv_buf can be calculated by 2^SMCR_
RMBE_SIZES * 16KB. SMCR_RMBE_SIZES = 5 means the upper boundary is 512KB.
TCP's snd_buf and rcv_buf max size is configured by net.ipv4.tcp_w/rmem[2]
whose default value is 4MB or 6MB, is much larger than SMC-R's upper
boundary.

In some scenarios, such as Recommendation System, the communication
pattern is mainly large size send/recv, where the size of snd_buf and
rcv_buf greatly affects performance. Due to the upper boundary
disadvantage, SMC-R performs poor than TCP in those scenarios. So it
is time to enlarge the upper boundary size of SMC-R's snd_buf and rcv_buf,
so that the SMC-R's snd_buf and rcv_buf can be configured to larger size
for performance gain in such scenarios.

The SMC-R rcv_buf's size will be transferred to peer by the field
rmbe_size in clc accept and confirm message. The length of the field
rmbe_size is four bits, which means the maximum value of SMCR_RMBE_SIZES
is 15. In case of frequently adjusting the value of SMCR_RMBE_SIZES
in different scenarios, set the value of SMCR_RMBE_SIZES to the maximum
value 15, which means the upper boundary of SMC-R's snd_buf and rcv_buf
is 512MB. As the real memory usage is determined by the value of
net.smc.w/rmem, not by the upper boundary, set the value of SMCR_RMBE_SIZES
to the maximum value has no side affects.
====================

Signed-off-by: David S. Miller <[email protected]>

net/smc: change SMCR_RMBE_SIZES from 5 to 15

SMCR_RMBE_SIZES is the upper boundary of SMC-R's snd_buf and rcv_buf.
The maximum bytes of snd_buf and rcv_buf can be calculated by 2^SMCR_
RMBE_SIZES * 16KB. SMCR_RMBE_SIZES = 5 means the upper boundary is 512KB.
TCP's snd_buf and rcv_buf max size is configured by net.ipv4.tcp_w/rmem[2]
whose default value is 4MB or 6MB, is much larger than SMC-R's upper
boundary.

In some scenarios, such as Recommendation System, the communication
pattern is mainly large size send/recv, where the size of snd_buf and
rcv_buf greatly affects performance. Due to the upper boundary
disadvantage, SMC-R performs poor than TCP in those scenarios. So it
is time to enlarge the upper boundary size of SMC-R's snd_buf and rcv_buf,
so that the SMC-R's snd_buf and rcv_buf can be configured to larger size
for performance gain in such scenarios.

The SMC-R rcv_buf's size will be transferred to peer by the field
rmbe_size in clc accept and confirm message. The length of the field
rmbe_size is four bits, which means the maximum value of SMCR_RMBE_SIZES
is 15. In case of frequently adjusting the value of SMCR_RMBE_SIZES
in different scenarios, set the value of SMCR_RMBE_SIZES to the maximum
value 15, which means the upper boundary of SMC-R's snd_buf and rcv_buf
is 512MB. As the real memory usage is determined by the value of
net.smc.w/rmem, not by the upper boundary, set the value of SMCR_RMBE_SIZES
to the maximum value has no side affects.

Signed-off-by: Guangguan Wang <[email protected]>
Co-developed-by: Wen Gu <[email protected]>
Signed-off-by: Wen Gu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: set rmb's SG_MAX_SINGLE_ALLOC limitation only when CONFIG_ARCH_NO_SG_CHAIN is defined

SG_MAX_SINGLE_ALLOC is used to limit maximum number of entries that
will be allocated in one piece of scatterlist. When the entries of
scatterlist exceeds SG_MAX_SINGLE_ALLOC, sg chain will be used. From
commit 7c703e54cc71 ("arch: switch the default on ARCH_HAS_SG_CHAIN"),
we can know that the macro CONFIG_ARCH_NO_SG_CHAIN is used to identify
whether sg chain is supported. So, SMC-R's rmb buffer should be limited
by SG_MAX_SINGLE_ALLOC only when the macro CONFIG_ARCH_NO_SG_CHAIN is
defined.

Fixes: a3fe3d01bd0d ("net/smc: introduce sg-logic for RMBs")
Signed-off-by: Guangguan Wang <[email protected]>
Co-developed-by: Wen Gu <[email protected]>
Signed-off-by: Wen Gu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

af_unix: Remove dead code in unix_stream_read_generic().

When splice() support was added in commit 2b514574f7e8 ("net:
af_unix: implement splice for stream af_unix sockets"), we had
to release unix_sk(sk)->readlock (current iolock) before calling
splice_to_pipe().

Due to the unlock, commit 73ed5d25dce0 ("af-unix: fix use-after-free
with concurrent readers while splicing") added a safeguard in
unix_stream_read_generic(); we had to bump the skb refcount before
calling ->recv_actor() and then check if the skb was consumed by a
concurrent reader.

However, the pipe side locking was refactored, and since commit
25869262ef7a ("skb_splice_bits(): get rid of callback"), we can
call splice_to_pipe() without releasing unix_sk(sk)->iolock.

Now, the skb is always alive after the ->recv_actor() callback,
so let's remove the unnecessary drop_skb logic.

This is mostly the revert of commit 73ed5d25dce0 ("af-unix: fix
use-after-free with concurrent readers while splicing").

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'lan78xx-enable-125-mhz-clk-and-auto-speed-configuration-for-lan7801-if-no-eeprom-is-detected'

Rengarajan says:

====================
lan78xx: Enable 125 MHz CLK and Auto Speed configuration for LAN7801 if NO EEPROM is detected

This patch series adds the support for 125 MHz clock, Auto speed and
auto duplex configuration for LAN7801 in the absence of EEPROM.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

lan78xx: Enable Auto Speed and Auto Duplex configuration for LAN7801 if NO EEPROM is detected

Enabled ASD/ADD configuration for LAN7801 in the absence of EEPROM.
After the lite reset these contents go back to defaults where ASD/
ADD is disabled. The check is already available for LAN7800.

Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Rengarajan S <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

lan78xx: Enable 125 MHz CLK configuration for LAN7801 if NO EEPROM is detected

The 125MHz and 25MHz clock configurations are enabled in the initialization
regardless of EEPROM (125MHz is needed for RGMII 1000Mbps operation). After
a lite reset (lan78xx_reset), these contents go back to defaults(all 0, so
no 125MHz or 25MHz clock).

Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Rengarajan S <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-ethernet-cortina-use-phylib-for-rx-and-tx-pause'

Linus Walleij says:

====================
net: ethernet: cortina: Use phylib for RX and TX pause

This patch series switches the Cortina Gemini ethernet
driver to use phylib to set up RX and TX pause for the
PHY.

v3: https://lore.kernel.org/r/20240513-gemini-ethernet-fix-tso-v3-0-b442540cc140@linaro.org
v2: https://lore.kernel.org/r/20240511-gemini-ethernet-fix-tso-v2-0-2ed841574624@linaro.org
v1: https://lore.kernel.org/r/20240509-gemini-ethernet-fix-tso-v1-0-10cd07b54d1c@linaro.org
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ethernet: cortina: Implement .set_pauseparam()

The Cortina Gemini ethernet can very well set up TX or RX
pausing, so add this functionality to the driver in a
.set_pauseparam() callback. Essentially just call down to
phylib and let phylib deal with this, .adjust_link()
will respect the setting from phylib.

Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ethernet: cortina: Use negotiated TX/RX pause

Instead of directly poking into registers of the PHY, use
the existing function to query phylib about this directly.

Suggested-by: Andrew Lunn <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ethernet: cortina: Rename adjust link callback

The callback passed to of_phy_get_and_connect() in the
Cortina Gemini driver is called "gmac_speed_set" which is
archaic, rename it to "gmac_adjust_link" following the
pattern of most other drivers.

Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-visibility-of-memory-limits-in-netns'

Matteo Croce says:

====================
net: visibility of memory limits in netns

Some programs need to know the size of the network buffers to operate
correctly, export the following sysctls read-only in network namespaces:

- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_default
- net.core.wmem_max
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: net: tests net.core.{r,w}mem_{default,max} sysctls in a netns

Add a selftest which checks that the sysctl is present in a netns,
that the value is read from the init one, and that it's readonly.

Signed-off-by: Matteo Croce <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: make net.core.{r,w}mem_{default,max} namespaced

The following sysctl are global and can't be read from a netns:

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

Make the following sysctl parameters available readonly from within a
network namespace, allowing a container to read them.

Signed-off-by: Matteo Croce <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: add timestamping statistics support

The ethtool_ts_stats structure was introduced earlier this year. Now
it's time to support this group of counters in more drivers.
This patch adds support to bnxt driver.

Signed-off-by: Vadim Fedorenko <[email protected]>
Reviewed-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'ice-introduce-eth56g-phy-model-for-e825c-products'

Jacob Keller says:

====================
ice: Introduce ETH56G PHY model for E825C products

E825C products have a different PHY model than E822, E823 and E810 products.
This PHY is ETH56G and its support is necessary to have functional PTP stack
for E825C products.

This series refactors the ice driver to add support for the new PHY model.

Karol introduces the ice_ptp_hw structure. This is used to replace some
hard-coded values relating to the PHY quad and port numbers, as well as to
hold the phy_model type.

Jacob refactors the driver code that converts between the ice_ptp_tmr_cmd
enumeration and hardware register values to better re-use logic and reduce
duplication when introducing another PHY type.

Sergey introduces functions to help enable and disable the Tx timestamp
interrupts. This makes the ice_ptp.c code more generic and encapsulates the
PHY specifics into ice_ptp_hw.c

Karol introduces helper functions to clear the valid bits for Tx and Rx
timestamps. This enables informing hardware to discard stale timestamps
after performing clock operations.

Sergey moves the Clock Generation Unit (CGU) logic out of the E822 specific
area of the ice_ptp_hw.c file as it will be re-used for other device PHY
models.

Jacob introduces a helper function for obtaining the base increment values,
moving this logic out of ice_ptp.c and into the ice_ptp_hw.c file to better
encapsulate hardware differences.

Sergey builds on these refactors to introduce the new ETH56G PHY model used
by the E825C products. This includes introducing the required helpers,
constants, and PHY model checks.

Karol simplifies the CGU logic by using anonymous structures, dropping an
unnecessary ".field" name for accessing the CGU data.

Michal Michalik updates the CGU logic to support the E825C hardware,
ensuring that the clock generation is configured properly.

Grzegorz Nitka adds support to read the NAC topology data from the device.
This is in preparation for supporting devices which combine two NACs
together, connecting all ports to the same clock source. This enables the
driver to determine if its operating on such a device, or if its operating
on the standard 1-NAC configuration.

Grzsecgorz Nitka adjusts the PTP initialization to prepare for the 2x50G
E825C devices, introducing special mapping for the PHY ports to prepare for
support of the 2-NAC devices.

With this, the ice driver is capable of handling PTP for the single-NAC
E825C devices. Complete support for the 2-NAC devices requirs some work on
how the ports connect to the clock owner. During review of this work, it
was pointed out that our existing use of auxiliary bus is disliked, and
Jiri requested that we change it. We are currently working on developing a
replacement solution for the auxiliary bus implementation and have dropped
the relevant changes out of this series. A future series will refactor the
port to clock connection, at which time we will finish the support for
2-NAC E825C devices.

Signed-off-by: Jacob Keller <[email protected]>
====================

Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-0-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Adjust PTP init for 2x50G E825C devices

>From FW/HW perspective, 2 port topology in E825C devices requires
merging of 2 port mapping internally and breakout mapping externally.
As a consequence, it requires different port numbering from PTP code
perspective.
For that topology, pf_id can not be used to index PTP ports. Even if
the 2nd port is identified as port with pf_id = 1, all PHY operations
need to be performed as it was port 2. Thus, special mapping is needed
for the 2nd port.
This change adds detection of 2x50G topology and applies 'custom'
mapping on the 2nd port.

Signed-off-by: Grzegorz Nitka <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-11-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Add NAC Topology device capability parser

Add new device capability ICE_AQC_CAPS_NAC_TOPOLOGY which allows to
determine the mode of operation (1 or 2 NAC).
Define a new structure to store data from new capability and
corresponding parser code.

Co-developed-by: Prathisna Padmasanan <[email protected]>
Signed-off-by: Prathisna Padmasanan <[email protected]>
Signed-off-by: Grzegorz Nitka <[email protected]>
Reviewed-by: Pawel Kaminski <[email protected]>
Reviewed-by: Mateusz Polchlopek <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-10-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Add support for E825-C TS PLL handling

The CGU layout of E825-C is a little different than E822/E823. Add
support the new hardware adding relevant functions.

Signed-off-by: Michal Michalik <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-9-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Change CGU regs struct to anonymous

Simplify the code by using anonymous struct in CGU registers instead of
naming each structure 'field'.

Suggested-by: Przemek Kitszel <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-8-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Introduce ETH56G PHY model for E825C products

E825C products feature a new PHY model - ETH56G.

Introduces all necessary PHY definitions, functions etc. for ETH56G PHY,
analogous to E82X and E810 ones with addition of a few HW-specific
functionalities for ETH56G like one-step timestamping.

It ensures correct PTP initialization and operation for E825C products.

Co-developed-by: Jacob Keller <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Co-developed-by: Michal Michalik <[email protected]>
Signed-off-by: Michal Michalik <[email protected]>
Signed-off-by: Sergey Temerkhanov <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Co-developed-by: Karol Kolacinski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-7-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Introduce ice_get_base_incval() helper

Add a new helper for getting base clock increment value for specific HW.

Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-6-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Move CGU block

Move CGU block to the beginning of ice_ptp_hw.c

Signed-off-by: Sergey Temerkhanov <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-5-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Add PHY OFFSET_READY register clearing

Add a possibility to mark all transmitted/received timestamps as invalid
by clearing PHY OFFSET_READY registers.

Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-4-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Implement Tx interrupt enablement functions

Introduce functions enabling/disabling Tx TS interrupts
for the E822 and ETH56G PHYs

Signed-off-by: Sergey Temerkhanov <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-3-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Introduce helper to get tmr_cmd_reg values

Multiple places in the driver code need to convert enum ice_ptp_tmr_cmd
values into register bits for both the main timer and the PHY port
timers. The main MAC register has one bit scheme for timer commands,
while the PHY commands use a different scheme.

The E810 and E830 devices use the same scheme for port commands as used
for the main timer. However, E822 and ETH56G hardware has a separate
scheme used by the PHY.

Introduce helper functions to convert the timer command enumeration into
the register values, reducing some code duplication, and making it
easier to later refactor the individual port write commands.

Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-2-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

ice: Introduce ice_ptp_hw struct

Create new ice_ptp_hw struct and use it for all HW and PTP-related
fields from struct ice_hw.
Replace definitions with struct fields, which values are set accordingly
to a specific device.

Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Arkadiusz Kubalewski <[email protected]>
Signed-off-by: Karol Kolacinski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/20240528-next-2024-05-28-ptp-refactors-v1-1-c082739bb6f6@intel.com
Signed-off-by: Jakub Kicinski <[email protected]>

net: validate SO_TXTIME clockid coming from userspace

Currently there are no strict checks while setting SO_TXTIME
from userspace. With the recent development in skb->tstamp_type
clockid with unsupported clocks results in warn_on_once, which causes
unnecessary aborts in some systems which enables panic on warns.

Add validation in setsockopt to support only CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace.

Link: https://lore.kernel.org/netdev/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 1693c5db6ab8 ("net: Add additional bit to support clockid_t timestamp type")
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=d7b227731ec589e7f4f0
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=30a35a2e9c5067cc43fa
Signed-off-by: Abhishek Chauhan <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'doc-mptcp-new-general-doc-and-fixes'

Matthieu Baerts says:

====================
doc: mptcp: new general doc and fixes

A general documentation about MPTCP was missing since its introduction
in v5.6. The last patch adds a new 'mptcp' page in the 'networking'
documentation.

The first patch is a fix for a missing sysctl entry introduced in v6.10
rc0, and the second one reorder the sysctl entries.

Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
====================

v2: https://lore.kernel.org/r/20240528-upstream-net-20240520-mptcp-doc-v2-0-47f2d5bc2ef3@kernel.org
v1: https://lore.kernel.org/r/20240520-upstream-net-20240520-mptcp-doc-v1-0-e3ad294382cb@kernel.org

Link: https://lore.kernel.org/r/20240530-upstream-net-20240520-mptcp-doc-v3-0-e94cdd9f2673@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>

doc: new 'mptcp' page in 'networking'

A general documentation about MPTCP was missing since its introduction
in v5.6.

Most of what is there comes from our recently updated mptcp.dev website,
with additional links to resources from the kernel documentation.

This is a first version, mainly targeting app developers and users.

Link: https://www.mptcp.dev
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240530-upstream-net-20240520-mptcp-doc-v3-3-e94cdd9f2673@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>

doc: mptcp: alphabetical order

Similar to what is done in other 'sysctl' pages: it looks clearer from a
readability perspective.

This might cause troubles in the short/mid-term with the backports, but
by not putting new entries at the end, this can help to reduce conflicts
in case of backports in the long term. We don't change the information
here too often, so it looks OK to do that.

Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240530-upstream-net-20240520-mptcp-doc-v3-2-e94cdd9f2673@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>

doc: mptcp: add missing 'available_schedulers' entry

This sysctl knob has been added recently, but the documentation has not
been updated.

This knob is used to show the available schedulers choices that are
registered, similar to 'net.ipv4.tcp_available_congestion_control'.

Fixes: 73c900aa3660 ("mptcp: add net.mptcp.available_schedulers")
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240530-upstream-net-20240520-mptcp-doc-v3-1-e94cdd9f2673@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>

net: smc91x: Fix pointer types

Use void __iomem pointers as parameters for mcf_insw() and mcf_outsw()
to align with the parameter types of readw() and writew() to fix the
following warnings reported by kernel test robot:

drivers/net/ethernet/smsc/smc91x.c:590:9: sparse: warning: incorrect type in argument 1 (different address spaces)
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    expected void *a
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    got void [noderef] __iomem *
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse: warning: incorrect type in argument 1 (different address spaces)
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    expected void *a
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    got void [noderef] __iomem *
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse: warning: incorrect type in argument 1 (different address spaces)
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    expected void *a
drivers/net/ethernet/smsc/smc91x.c:590:9: sparse:    got void [noderef] __iomem *
drivers/net/ethernet/smsc/smc91x.c:483:17: sparse: warning: incorrect type in argument 1 (different address spaces)
drivers/net/ethernet/smsc/smc91x.c:483:17: sparse:    expected void *a
drivers/net/ethernet/smsc/smc91x.c:483:17: sparse:    got void [noderef] __iomem *

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Acked-by: Nicolas Pitre <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Thorsten Blum <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: qstat: extend kdoc about get_base_stats

mlx5 has a dedicated queue for PTP packets. Clarify that
this sort of queues can also be accounted towards the base.

Reviewed-by: Joe Damato <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_classifier.c
abd5576b9c57 ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
56a5cf538c3f ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/

No other adjacent changes.

Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: make the attr and msg helpers more C++ friendly

Folks working on a C++ codegen would like to reuse the attribute
helpers directly. Add the few necessary casts, it's not too ugly.

Reviewed-by: Donald Hunter <[email protected]>
Reviewed-by: Nicolas Dichtel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-phylink-rearrange-ovr_an_inband-support'

Russell King says:

====================
net: phylink: rearrange ovr_an_inband support

This series addresses the use of the ovr_an_inband flag, which is used
by two drivers to indicate to phylink that they wish to use inband mode
without firmware specifying inband mode.

The issue with ovr_an_inband is that it overrides not only PHY mode,
but also fixed-link mode. Both of the drivers that set this flag
contain code to detect when fixed-link mode will be used, and then
either avoid setting it or explicitly clear the flag. This is
wasteful when phylink already knows this.

Therefore, the approach taken in this patch set is to replace the
ovr_an_inband flag with a default_an_inband flag which means that
phylink defaults to MLO_AN_INBAND instead of MLO_AN_PHY, and will
allow that default to be overriden if firmware specifies a fixed-link.
This allows users of ovr_an_inband to be simplified.

What's more is this requires minimal changes in phylink to allow this
new mode of operation.

This series changes phylink, and also updates the two drivers
(fman_memac and stmmac), and then removes the unnecessary complexity
from the drivers.

This series may depend on the stmmac cleanup series I've posted
earlier - this is something I have not checked, but I currently have
these patches on top of that series.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: dwmac-intel: remove checking for fixed link

With the new default_an_inband functionality in phylink, there is no
need to check for a fixed link when this flag is set, since a fixed
link will now override default_an_inband.

Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Andrew Halaney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: rename xpcs_an_inband to default_an_inband

Rename xpcs_an_inband to default_an_inband to reflect the change in
phylink and its changed functionality.

Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Andrew Halaney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: fman_memac: remove the now unnecessary checking for fixed-link

Since default_an_inband can be overriden by a fixed-link specification,
there is no need for memac to be checking for this before setting
default_an_inband. Remove this code and update the comment.

Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Sean Anderson <[email protected]>
Reviewed-by: Andrew Halaney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phylink: rename ovr_an_inband to default_an_inband

Since ovr_an_inband no longer overrides every MLO_AN_xxx mode, rename
it to reflect what it now does - it changes the default mode from
MLO_AN_PHY to MLO_AN_INBAND. Fix up the two users of this.

Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Andrew Halaney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>