net: phy: aquantia: wait for the GLOBAL_CFG to start returning real values
When the PHY is first coming up (or resuming from suspend), it's
possible that although the FW status shows as running, we still see
zeroes in the GLOBAL_CFG set of registers and cannot determine available
modes. Since all models support 10M, add a poll and wait for the config to
become available.
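A minimal userspace sketch of the polling idea described above, not the driver's actual code. The names poll_until_nonzero(), read_reg_fn and struct fake_phy are hypothetical, and a real driver would sleep between attempts:

```c
/* Hypothetical register accessor: returns the raw value, or a negative error. */
typedef int (*read_reg_fn)(void *ctx);

/*
 * Poll until the register reads back non-zero or the attempt budget runs out.
 * Returns the register value on success, a negative read error if one
 * occurs, or -1 when every attempt returned zero.
 */
static int poll_until_nonzero(read_reg_fn read_reg, void *ctx, int max_attempts)
{
	for (int i = 0; i < max_attempts; i++) {
		int val = read_reg(ctx);

		if (val < 0)
			return val;	/* propagate read errors */
		if (val != 0)
			return val;	/* config is populated */
		/* A real driver would sleep between attempts here. */
	}
	return -1;			/* gave up: still reading zeroes */
}

/* Fake PHY for illustration: reads zero for the first few attempts. */
struct fake_phy {
	int zero_reads_left;
	int cfg_value;
};

static int fake_read_global_cfg(void *ctx)
{
	struct fake_phy *phy = ctx;

	if (phy->zero_reads_left > 0) {
		phy->zero_reads_left--;
		return 0;
	}
	return phy->cfg_value;
}
```

The attempt budget bounds the wait, so a PHY that never populates the registers results in an error rather than a hang.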
net: phy: aquantia: wait for FW reset before checking the vendor ID
Checking the firmware register before it completes the boot process makes
no sense: it will report 0 even if FW is available from internal memory.
Always wait for FW to boot before continuing or we'll unnecessarily try
to load it from nvmem/filesystem and fail.
net: phy: aquantia: rename and export aqr107_wait_reset_complete()
This function is quite generic in this driver and not limited to aqr107.
We will use it outside its current compilation unit soon so rename it
and declare it in the header.
net/mlx5e: CT: Initialize err to 0 to avoid warning
It is theoretically possible to return bogus uninitialized values from
mlx5_tc_ct_entry_replace_rules, even though in practice this will never
be the case as the flow rule will be part of at least the regular ct
table or the ct nat table, if not both.
mlxsw: pci: Lock configuration space of upstream bridge during reset
The driver triggers a "Secondary Bus Reset" (SBR) by calling
__pci_reset_function_locked() which asserts the SBR bit in the "Bridge
Control Register" in the configuration space of the upstream bridge for
2ms. This is done without locking the configuration space of the
upstream bridge port, allowing user space to access it concurrently.
Linux 6.11 will start warning about such unlocked resets [1][2]:
pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0
Avoid the warning and the concurrent access by locking the configuration
space of the upstream bridge prior to the reset and unlocking it
afterwards.
mlxsw: core_thermal: Report valid current state during cooling device registration
Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to
thermal_debug_cdev_add()") changed the thermal core to read the current
state of the cooling device as part of the cooling device's
registration. This is incompatible with the current implementation of
the cooling device operations in mlxsw, leading to initialization
failure with errors such as:
mlxsw_spectrum 0000:01:00.0: Failed to register cooling device
mlxsw_spectrum 0000:01:00.0: cannot register bus device
The reason for the failure is that when the get current state operation
is invoked the driver tries to derive the index of the cooling device by
walking a per thermal zone array and looking for the matching cooling
device pointer. However, the pointer is returned from the registration
function and therefore only set in the array after the registration.
The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not
fail cdev registration because of invalid initial state") by not failing
the registration of the cooling device if it cannot report a valid
current state during registration, although drivers are responsible for
ensuring that this will not happen.
Therefore, make sure the driver is able to report a valid current state
for the cooling device during registration by passing to the
registration function a per cooling device private data that already has
the cooling device index populated.
While at it, call thermal_cooling_device_unregister() unconditionally
since the function returns immediately if the cooling device pointer is
NULL.
Petr Machata [Mon, 8 Jul 2024 14:23:40 +0000 (16:23 +0200)]
mlxsw: Warn about invalid accesses to array fields
A forgotten or buggy variable initialization can cause out-of-bounds access
to a register or other item array field. For an overflow, such access would
mangle adjacent parts of the register payload. For an underflow, due to all
variables being unsigned, the access would likely trample unrelated memory.
Since neither is correct, replace these accesses with accesses at the index
of 0, and warn about the issue.
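The clamping behaviour described above can be sketched as follows. This is an illustration only; checked_index() is a hypothetical helper, and the kernel would use its own warning machinery rather than fprintf():

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Return a safe index into an array of `count` items. An out-of-range
 * index is reported and replaced with 0, so the access stays inside
 * the array instead of touching adjacent or unrelated memory.
 * An "underflowed" unsigned index wraps to a huge value and is caught
 * by the same comparison.
 */
static size_t checked_index(size_t index, size_t count)
{
	if (index >= count) {
		fprintf(stderr, "invalid index %zu (array size %zu), using 0\n",
			index, count);
		return 0;
	}
	return index;
}
```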
Jakub Kicinski [Mon, 8 Jul 2024 21:36:27 +0000 (14:36 -0700)]
selftests: drv-net: rss_ctx: test flow rehashing without impacting traffic
Some workloads may want to rehash the flows in response to an imbalance.
Most effective way to do that is changing the RSS key. Check that changing
the key does not cause link flaps or traffic disruption.
Disrupting traffic for key update is not incorrect, but makes the key
update unusable for rehashing under load.
Jakub Kicinski [Mon, 8 Jul 2024 21:36:26 +0000 (14:36 -0700)]
selftests: drv-net: rss_ctx: check behavior of indirection table resizing
Some devices dynamically increase and decrease the size of the RSS
indirection table based on the number of enabled queues.
When that happens driver must maintain the balance of entries
(preferably duplicating the smaller table).
Jakub Kicinski [Mon, 8 Jul 2024 21:36:25 +0000 (14:36 -0700)]
selftests: drv-net: rss_ctx: test queue changes vs user RSS config
By default main RSS table should change to include all queues.
When user sets a specific RSS config the driver should preserve it,
even when queue count changes. Driver should refuse to deactivate
queues used in the user-set RSS config.
For additional contexts driver should still refuse to deactivate
queues in use. Whether the contexts should get resized like
context 0 when queue count increases is a bit unclear. I anticipate
most drivers today don't do that. Since main use case for additional
contexts is to set the indir table - it doesn't seem worthwhile to
care about behavior of the default table too much. Don't test that.
Jakub Kicinski [Mon, 8 Jul 2024 21:36:24 +0000 (14:36 -0700)]
selftests: drv-net: rss_ctx: factor out send traffic and check
Wrap up sending traffic and checking in which queues it landed
in a helper.
The method used for testing is to send a lot of iperf traffic
and check which queues received the most packets. Those should
be the queues where we expect iperf to land - either because we
installed a filter for the port iperf uses, or we didn't and
expect it to use context 0.
Contexts get disjoint queue sets, but the main context (AKA context 0)
may receive some background traffic (noise).
Jakub Kicinski [Mon, 8 Jul 2024 21:36:23 +0000 (14:36 -0700)]
selftests: drv-net: rss_ctx: fix cleanup in the basic test
The basic test may fail without resetting the RSS indir table.
Use the .exec() method to run cleanup early, since we re-test
with traffic to check that returning to the default state works.
While at it reformat the doc a tiny bit.
Add ptimer-handle property to link to ptp-timer node handle.
Fix below warning:
arch/arm64/boot/dts/freescale/fsl-ls1043a-rdb.dtb: fman@1a00000: 'ptimer-handle' do not match any of the regexes: '^ethernet@[a-f0-9]+$', '^mdio@[a-f0-9]+$', '^muram@[a-f0-9]+$', '^phc@[a-f0-9]+$', '^port@[a-f0-9]+$', 'pinctrl-[0-9]+'
Add dma-coherent property to fix below warning.
arch/arm64/boot/dts/freescale/fsl-ls1046a-rdb.dtb: fman@1a00000: 'dma-coherent', 'ptimer-handle' do not match any of the regexes: '^ethernet@[a-f0-9]+$', '^mdio@[a-f0-9]+$', '^muram@[a-f0-9]+$', '^phc@[a-f0-9]+$', '^port@[a-f0-9]+$', 'pinctrl-[0-9]+'
from schema $id: http://devicetree.org/schemas/net/fsl,fman.yaml#
Simon Horman [Mon, 8 Jul 2024 07:27:19 +0000 (08:27 +0100)]
net: tls: Pass union tls_crypto_context pointer to memzero_explicit
Pass union tls_crypto_context pointer, rather than struct
tls_crypto_info pointer, to memzero_explicit().
The address of the pointer is the same before and after.
But the new construct means that the size of the dereferenced pointer type
matches the size being zeroed. Which aids static analysis.
As reported by Smatch:
.../tls_main.c:842 do_tls_setsockopt_conf() error: memzero_explicit() 'crypto_info' too small (4 vs 56)
No functional change intended.
Compile tested only.
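To illustrate why the pointer type matters to static analysis, here is a small sketch with hypothetical stand-in types (struct crypto_info, struct cipher_params, union crypto_context) and plain memset() in place of memzero_explicit(). The address is the same either way; only the size of the pointed-to type differs:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: a small common header embedded in a larger union. */
struct crypto_info {
	unsigned short version;
	unsigned short cipher_type;
};

struct cipher_params {
	struct crypto_info info;
	unsigned char key[32];
};

union crypto_context {
	struct crypto_info info;
	struct cipher_params params;
};

/*
 * Zero the whole context through a pointer whose pointee type has the
 * size being cleared, so sizeof(*ctx) and the length argument agree.
 */
static void clear_context(union crypto_context *ctx)
{
	memset(ctx, 0, sizeof(*ctx));
}
```

Passing a struct crypto_info pointer with sizeof(union crypto_context) as the length would clear the same bytes, but a checker comparing the length against the pointee size would flag it.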
selftests: forwarding: Make vxlan-bridge-1d pass on debug kernels
The ageing time used by the test is too short for debug kernels and
results in entries being aged out prematurely [1].
Fix by increasing the ageing time.
The same change was done for the VLAN-aware version of the test in
commit dfbab74044be ("selftests: forwarding: Make vxlan-bridge-1q pass
on debug kernels").
[1]
# ./vxlan_bridge_1d.sh
[...]
# TEST: VXLAN: flood before learning [ OK ]
# TEST: VXLAN: show learned FDB entry [ OK ]
# TEST: VXLAN: learned FDB entry [FAIL]
# veth3: Expected to capture 0 packets, got 4.
# RTNETLINK answers: No such file or directory
# TEST: VXLAN: deletion of learned FDB entry [ OK ]
# TEST: VXLAN: Ageing of learned FDB entry [FAIL]
# veth3: Expected to capture 0 packets, got 2.
[...]
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
James Chapman [Thu, 4 Jul 2024 15:25:08 +0000 (16:25 +0100)]
l2tp: fix possible UAF when cleaning up tunnels
syzbot reported a UAF caused by a race when the L2TP work queue closes a
tunnel at the same time as a userspace thread closes a session in that
tunnel.
Tunnel cleanup is handled by a work queue which iterates through the
sessions contained within a tunnel, and closes them in turn.
Meanwhile, a userspace thread may arbitrarily close a session via
either netlink command or by closing the pppox socket in the case of
l2tp_ppp.
The race condition may occur when l2tp_tunnel_closeall walks the list
of sessions in the tunnel and deletes each one. Currently this is
implemented using list_for_each_safe, but because the list spinlock is
dropped in the loop body it's possible for other threads to manipulate
the list during list_for_each_safe's list walk. This can lead to the
list iterator being corrupted, leading to list_for_each_safe spinning.
One sequence of events which may lead to this is as follows:
* A tunnel is created, containing two sessions A and B.
* A thread closes the tunnel, triggering tunnel cleanup via the work
queue.
* l2tp_tunnel_closeall runs in the context of the work queue. It
removes session A from the tunnel session list, then drops the list
lock. At this point the list_for_each_safe temporary variable is
pointing to the other session on the list, which is session B, and
the list can be manipulated by other threads since the list lock has
been released.
* Userspace closes session B, which removes the session from its parent
tunnel via l2tp_session_delete. Since l2tp_tunnel_closeall has
released the tunnel list lock, l2tp_session_delete is able to call
list_del_init on the session B list node.
* Back on the work queue, l2tp_tunnel_closeall resumes execution and
will now spin forever on the same list entry until the underlying
session structure is freed, at which point UAF occurs.
The solution is to iterate over the tunnel's session list using
list_first_entry_or_null to avoid the possibility of the list
iterator pointing at a list item which may be removed during the walk.
Also, have l2tp_tunnel_closeall ref each session while it processes it
to prevent another thread from freeing it.
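The "always take the first entry" pattern described above can be sketched in userspace as follows. This is not the l2tp code: the list helpers are minimal re-implementations, struct session and close_all() are hypothetical, locking is only indicated by comments, and the reference count is a plain integer:

```c
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

struct session {
	int id;
	int refcount;
	int closed;
	struct list_node node;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialise, so unlinking the same node twice is harmless. */
static void list_del_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static struct session *first_session_or_null(struct list_node *head)
{
	if (head->next == head)
		return NULL;
	return (struct session *)((char *)head->next -
				  offsetof(struct session, node));
}

/*
 * Close every session by repeatedly taking the first list entry.
 * No iterator survives across the point where the lock is dropped,
 * so concurrent removals cannot leave a stale "next" pointer behind.
 */
static int close_all(struct list_node *head)
{
	struct session *s;
	int closed = 0;

	/* lock(list) would be taken here */
	while ((s = first_session_or_null(head)) != NULL) {
		s->refcount++;		/* hold a reference while working on it */
		list_del_init(&s->node);
		/* unlock(list): other threads may now modify the list */
		s->closed = 1;
		closed++;
		/* lock(list) again before looking at the list head */
		s->refcount--;
	}
	/* unlock(list) */
	return closed;
}
```

Because each pass re-reads the list head under the lock, a session removed by another thread in the meantime is simply no longer found.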
cpu1 cpu2
--- ---
pppol2tp_release()
spin_lock_bh(&tunnel->list_lock);
for (;;) {
session = list_first_entry_or_null(&tunnel->session_list,
struct l2tp_session, list);
if (!session)
break;
list_del_init(&session->list);
spin_unlock_bh(&tunnel->list_lock);
Calling l2tp_session_delete on the same session twice isn't a problem
per-se, but if cpu2 manages to destruct the socket and unref the
session to zero before cpu1 progresses then it would lead to UAF.
====================
net: stmmac: qcom-ethqos: enable 2.5G ethernet on sa8775p-ride
Here are the changes required to enable 2.5G ethernet on sa8775p-ride.
As advised by Andrew Lunn and Russell King, I am reusing the existing
stmmac infrastructure to enable the SGMII loopback and so I dropped the
patches adding new callbacks to the driver core. I also added more
details to the commit message and made sure the workaround is only
enabled on Rev 3 of the board (with AQR115C PHY). Also: dropped any
mentions of the OCSGMII mode.
net: stmmac: qcom-ethqos: enable SGMII loopback during DMA reset on sa8775p-ride-r3
On sa8775p-ride-r3 the RX clocks from the AQR115C PHY are not available at
the time of the DMA reset. We can however extract the RX clock from the
internal SERDES block. Once the link is up, we can revert to the
previous state.
The AQR115C PHY doesn't support in-band signalling so we can count on
getting the link up notification and safely reuse existing callbacks
which are already used by another HW quirk workaround which enables the
functional clock to avoid a DMA reset due to timeout.
Only enable loopback on revision 3 of the board - check the phy_mode to
make sure.
fexit_sleep test runs successfully now on the BPF CI so remove it
from the deny list. ftrace direct calls were blocking tracing programs
on arm64 but it has been resolved by now. For more details see also
discussion in [*].
I realized this while having a map containing both a struct bpf_timer and
a struct bpf_wq: the third argument provided to the bpf_wq callback is
not the struct bpf_wq pointer itself, but the pointer to the value in
the map.
Which means that the users need to double cast the provided "value" as
this is not a struct bpf_wq *.
This is a change of API, but there don't seem to be many users of bpf_wq
right now, so we should be able to go with this.
Signed-off-by: Benjamin Tissoires <[email protected]>
---
Changes in v2:
- amended the selftests to retrieve something from the third argument of
the callback
- Link to v1: https://lore.kernel.org/r/20240705-fix-wq-v1-0-91b4d82cd825@kernel.org
I realized this while having a map containing both a struct bpf_timer and
a struct bpf_wq: the third argument provided to the bpf_wq callback is
not the struct bpf_wq pointer itself, but the pointer to the value in
the map.
Which means that the users need to double cast the provided "value" as
this is not a struct bpf_wq *.
This is a change of API, but there don't seem to be many users of bpf_wq
right now, so we should be able to go with this.
While calling bpf_object__find_map_by_name with obj = NULL is
obviously incorrect, this should not lead to a segmentation
fault but rather be handled gracefully.
As __bpf_map__iter already handles this situation correctly, we
can delegate the check for the regular case there and only add
a check in case the prev or next parameter is NULL.
Implement the following three pieces required from the JIT:
- A "top-level" BPF prog (exception_boundary) must save all
non-volatile registers, and not only the ones that it clobbers.
- A "handler" BPF prog (exception_cb) must switch stack to that of
exception_boundary, and restore the registers that exception_boundary
saved.
- arch_bpf_stack_walk() must unwind the stack and provide the results
in a way that satisfies both bpf_throw() and exception_cb.
Using a mask instead of an array saves a small amount of memory and
allows marking multiple registers as seen with a simple "or". Another
positive side-effect is that it speeds up verification with jitterbug.
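A small sketch of the mask idea, assuming 16 general-purpose registers; reg_mask_t, reg_bit() and reg_seen() are hypothetical names, not the JIT's own:

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per register; 16 registers fit in a 16-bit mask. */
typedef uint16_t reg_mask_t;

static inline reg_mask_t reg_bit(unsigned int reg)
{
	return (reg_mask_t)(1u << reg);
}

static inline bool reg_seen(reg_mask_t seen, unsigned int reg)
{
	return (seen & reg_bit(reg)) != 0;
}
```

Marking several registers at once is then a single statement such as `seen |= reg_bit(6) | reg_bit(7) | reg_bit(15);`, where an array would need one store per register.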
Dan Carpenter [Thu, 4 Jul 2024 15:19:19 +0000 (10:19 -0500)]
bpf: Remove unnecessary loop in task_file_seq_get_next()
After commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU") this
loop always iterates exactly one time. Delete the for statement and pull
the code in a tab.
When BPF_TRAMP_F_CALL_ORIG is not set, stack space for passing arguments
on stack doesn't need to be reserved because the original function is
not called.
Only reserve space for stacked arguments when BPF_TRAMP_F_CALL_ORIG is
set.
net: dsa: microchip: lan9371/2: update MAC capabilities for port 4
Set proper MAC capabilities for port 4 on LAN9371 and LAN9372 switches with
integrated 100BaseTX PHY. And introduce the is_lan937x_tx_phy() function to
reuse it where applicable.
act_ct: prepare for stolen verdict coming from conntrack and nat engine
At this time, conntrack either returns NF_ACCEPT or NF_DROP.
To improve debugging it would be nice to be able to replace NF_DROP verdict
with NF_DROP_REASON() helper.
This helper releases the skb instantly (so drop_monitor can pinpoint
exact location) and returns NF_STOLEN.
Prepare call sites to deal with this before introducing such changes
in conntrack and nat core.
====================
net: pse-pd: Add new PSE c33 features
This patch series adds new c33 features to the PSE API.
- Expand the PSE PI status information with power, class and failure
reason
- Add the possibility to get and set the PSE PIs power limit
net: pse-pd: pd692x0: Enhance with new current limit and voltage read callbacks
This patch expands PSE callbacks with newly introduced
pi_get/set_current_limit() and pi_get_voltage() callback.
It also adds the power limit ranges description to the status returned.
The only way to set the pd692x0 port power limit is by configuring the power
class plus a small power supplement whose maximum depends on each class.
net: ethtool: Add new power limit get and set features
This patch expands the status information provided by ethtool for PSE c33
with available power limit and available power limit ranges. It also adds
a call to pse_ethtool_set_pw_limit() to configure the PSE control power
limit.
net: pse-pd: Add new power limit get and set c33 features
This patch adds a way to get and set the power limit of a PSE PI.
For that it uses regulator API callbacks wrapper like get_voltage() and
get/set_current_limit() as power is simply V * I.
We used mW unit as defined by the IEEE 802.3-2022 standards.
set_current_limit() uses the voltage returned by get_voltage() and the
desired power limit to calculate the current limit. The get_voltage()
callback is therefore mandatory to set the power limit.
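The P = V * I conversion above can be sketched as follows. This is an illustration under stated assumptions, not the PSE core code: the helper names are hypothetical, power is in mW as stated, and voltage and current are assumed to be in uV and uA:

```c
#include <stdint.h>

/*
 * Convert a power limit in mW into a current limit in uA for a given
 * voltage in uV, using I = P / V. Returns -1 if the voltage is unknown
 * (zero or negative), since the conversion is impossible without it.
 */
static int64_t power_mw_to_current_ua(uint32_t power_mw, int voltage_uv)
{
	if (voltage_uv <= 0)
		return -1;
	/* mW / uV = 1e3 A, hence the 1e9 factor to land in uA. */
	return (int64_t)(((uint64_t)power_mw * 1000000000ULL) /
			 (uint64_t)voltage_uv);
}

/* Inverse direction: power in mW from current in uA and voltage in uV. */
static int64_t current_ua_to_power_mw(int current_ua, int voltage_uv)
{
	if (current_ua < 0 || voltage_uv < 0)
		return -1;
	/* uA * uV = 1e-12 W = 1e-9 mW. */
	return (int64_t)(((uint64_t)current_ua * (uint64_t)voltage_uv) /
			 1000000000ULL);
}
```

For example, a 15400 mW limit at 48 V (48000000 uV) gives 320833 uA. The 64-bit intermediates avoid overflow, and integer division rounds down, so converting back yields 15399 mW rather than 15400.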
The get_current_limit() callback by default uses a driver callback and
falls back to extracting the current limit from _pse_ethtool_get_status()
if the driver does not set its callback. We prefer to leave the choice to
the user because ethtool_get_status returns much more information than the
current limit.
Expand the PSE status with c33_pw_limit_ranges to return the ranges available
to configure the power limit.
net: pse-pd: pd692x0: Expand ethtool status message
This update expands pd692x0_ethtool_get_status() callback with newly
introduced details such as the detected class, current power delivered,
and extended state information.
net: ethtool: pse-pd: Expand C33 PSE status with class, power and extended state
This update expands the status information provided by ethtool for PSE c33.
It includes details such as the detected class, current power delivered,
and extended state information.
** Background **
Currently, OVS supports several packet sampling mechanisms (sFlow,
per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
userspace action that needs to be handled by ovs-vswitchd's handler
threads only to be forwarded to some third party application that
will somehow process the sample and provide observability on the
datapath.
A particularly interesting use-case is controller-driven
per-flow IPFIX sampling where the OpenFlow controller can add metadata
to samples (via two 32bit integers) and this metadata is then available
to the sample-collecting system for correlation.
** Problem **
The fact that sampled traffic shares netlink sockets and handler thread
time with upcalls, apart from being a performance bottleneck in the
sample extraction itself, can severely compromise the datapath,
rendering this solution unfit for highly loaded production systems.
Users are left with few options other than guessing what sampling
rate will be OK for their traffic pattern and system load and dealing
with the lost accuracy.
Looking at available infrastructure, an obvious candidate would be
to use psample. However, its current state does not help with the
use-case at stake because sampled packets do not contain user-defined
metadata.
** Proposal **
This series is an attempt to fix this situation by extending the
existing psample infrastructure to carry a variable length
user-defined cookie.
The main existing user of psample is tc's act_sample. It is also
extended to forward the action's cookie to psample.
Finally, a new OVS action (OVS_SAMPLE_ATTR_PSAMPLE) is created.
It accepts a group and an optional cookie and uses psample to
multicast the packet and the metadata.
====================
Adrian Moreno [Thu, 4 Jul 2024 08:56:55 +0000 (10:56 +0200)]
net: psample: allow using rate as probability
Although not explicitly documented in the psample module itself, the
definition of PSAMPLE_ATTR_SAMPLE_RATE seems inherited from act_sample.
Quoting tc-sample(8):
"RATE of 100 will lead to an average of one sampled packet out of every
100 observed."
With these semantics, the rates that we can express with an unsigned
32-bit number are very unevenly distributed and concentrated towards
"sampling few packets".
For example, we can express a probability of 2.32E-8% but we
cannot express anything between 100% and 50%.
For sampling applications that are capable of sampling a decent
amount of packets, this sampling rate semantics is not very useful.
Add a new flag to the uAPI that indicates that the sampling rate is
expressed as a scaled probability, that is:
- 0 is 0% probability, no packets get sampled.
- U32_MAX is 100% probability, all packets get sampled.
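A sketch of how a sampler could interpret a rate under the scaled-probability semantics above. The function names are hypothetical, and this is one possible interpretation rather than the psample implementation, which only carries the value:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to sample a packet when the rate is a scaled
 * probability: 0 means never, UINT32_MAX means always.
 * `rnd` is a uniformly distributed 32-bit random number.
 */
static bool sample_by_probability(uint32_t rate, uint32_t rnd)
{
	if (rate == 0)
		return false;
	return rnd <= rate;
}

/* Convert a percentage into the scaled representation. */
static uint32_t percent_to_scaled(double percent)
{
	if (percent <= 0.0)
		return 0;
	if (percent >= 100.0)
		return UINT32_MAX;
	return (uint32_t)(percent / 100.0 * (double)UINT32_MAX);
}
```

With this encoding a probability such as 75% is representable (3221225471), whereas the "one out of every N" semantics jumps straight from 100% (N = 1) to 50% (N = 2).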
Implement operations to get and set flow-control link parameters.
Both are done by simply calling phylink_ethtool_{get,set}_pauseparam().
Fix whitespace in mtk_ethtool_ops while at it.
net: ethernet: mtk_ppe: Change PPE entries number to 16K
MT7981, MT7986 and MT7988 all support 32768 PPE entries, and MT7621/MT7620
support 16384 PPE entries, but the driver only sets 8192 entries. So
increase the maximum number of entries to 16384 instead.
This series adds the const modifier to the remaining regmap_bus and
regmap_config structs within the net subsystem that are effectively
used as const (i.e., only read after their declaration), but kept as
writable data.
====================
During the introduction of struct bpf_net_context handling for
XDP-redirect, the tun driver has been missed.
Jakub also pointed out that there is another call chain to
do_xdp_generic() originating from netif_receive_skb() and drivers may
use it outside of the NAPI context.
Set the bpf_net_context before invoking BPF XDP program within the TUN
driver. Set the bpf_net_context also in do_xdp_generic() if an XDP
program is available.
Daniel Golle [Mon, 1 Jul 2024 19:28:14 +0000 (20:28 +0100)]
net: ethernet: mediatek: Allow gaps in MAC allocation
Some devices with MediaTek SoCs don't use the first but only the second
MAC in the chip. Especially with MT7981 which got a built-in 1GE PHY
connected to the second MAC this is quite common.
Make sure to reset and enable PSE also in those cases by skipping gaps
using 'continue' instead of aborting the loop using 'break'.
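The difference between the two loop exits can be sketched like this, with a hypothetical struct mac and enable_present_macs() standing in for the driver's per-MAC setup:

```c
#include <stddef.h>

/* Hypothetical per-MAC slot; NULL marks a MAC that the board does not use. */
struct mac {
	int id;
	int enabled;
};

/*
 * Enable every populated MAC slot. Empty slots are skipped with
 * `continue`; using `break` would stop at the first gap and leave
 * later MACs unconfigured.
 */
static int enable_present_macs(struct mac *macs[], size_t count)
{
	int enabled = 0;

	for (size_t i = 0; i < count; i++) {
		if (!macs[i])
			continue;
		macs[i]->enabled = 1;
		enabled++;
	}
	return enabled;
}
```

With only the second slot populated, the loop still reaches it and enables exactly one MAC.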
openvswitch: prepare for stolen verdict coming from conntrack and nat engine
At this time, conntrack either returns NF_ACCEPT or NF_DROP.
To improve debugging it would be nice to be able to replace NF_DROP
verdict with NF_DROP_REASON() helper.
This helper releases the skb instantly (so drop_monitor can pinpoint
precise location) and returns NF_STOLEN.
Prepare call sites to deal with this before introducing such changes
in conntrack and nat core.
David S. Miller [Fri, 5 Jul 2024 08:35:52 +0000 (09:35 +0100)]
Merge branch 'pcs-xpcs-mmap' into main
Serge Semin says:
====================
net: pcs: xpcs: Add memory-mapped device support
The main goal of this series is to extend the DW XPCS device support in
the kernel. Particularly the patchset adds a support of the DW XPCS
device with the MCI/APB3 IO interface registered as a platform device. In
order to have them utilized by the DW XPCS core the fwnode-based DW XPCS
descriptor creation procedure has been introduced. Finally the STMMAC
driver has been altered to support the DW XPCS passed via the 'pcs-handle'
property.
Note the series has been significantly re-developed since v1. So I even
had to change the subject. Anyway I've done my best to take all the notes
into account.
The series traditionally starts with a set of the preparation patches.
First one just moves the generic DW XPCS IDs macros from the internal
header file to the external one where some other IDs also reside. Second
patch splits up the xpcs_create() method to a set of the coherent
sub-functions for the sake of the further easier updates and to have it
looking less complicated. The goal of the next three patches is to extend
the DW XPCS ID management code by defining a new dw_xpcs_info structure
with both PCS and PMA IDs.
The next two patches provide the DW XPCS device DT-bindings and the
respective platform-device driver for the memory-mapped DW XPCS devices.
In addition, the latter patch makes use of the introduced dw_xpcs_info structure
to pre-define the DW XPCS IDs based on the platform-device compatible
string. Thus if there is no way to auto-identify the XPCS device
capabilities it can be done based on the custom device IDs passed via the
MDIO-device platform data.
Final DW XPCS driver change is about adding a new method of the DW XPCS
descriptor creation. The xpcs_create_fwnode() function has been introduced
with the same semantics as a similar method recently added to the Lynx PCS
driver. It's supposed to be called with the fwnode pointing to the DW XPCS
device node, for which the XPCS descriptor will be created.
The series is terminated with two STMMAC driver patches. The former one
simplifies the DW XPCS descriptor registration procedure by dropping the
MDIO-bus scanning and creating the descriptor for the particular device
address. The latter patch alters the STMMAC PCS setup method so one would
support the DW XPCS specified via the "pcs-handle" property.
That's it for now. Thanks for review in advance. Any tests are very
welcome. After this series is merged in, I'll submit another one which
adds the generic 10GBase-R and 10GBase-X interfaces support to the STMMAC
and DW XPCS driver with the proper CSRs re-initialization, PMA
initialization and reference clock selection as it's described in the
Synopsys DW XPCS HW manual.
Link: https://lore.kernel.org/netdev/[email protected]
Changelog v2:
- Drop the patches:
[PATCH net-next 01/16] net: pcs: xpcs: Drop sentinel entry from 2500basex ifaces list
[PATCH net-next 02/16] net: pcs: xpcs: Drop redundant workqueue.h include directive
[PATCH net-next 03/16] net: pcs: xpcs: Return EINVAL in the internal methods
[PATCH net-next 04/16] net: pcs: xpcs: Explicitly return error on caps validation
as ones have already been merged into the kernel repo: Link: https://lore.kernel.org/netdev/[email protected]/
- Drop the patches:
[PATCH net-next 14/16] net: stmmac: Pass netdev to XPCS setup function
[PATCH net-next 15/16] net: stmmac: Add dedicated XPCS cleanup method
as ones have already been merged into the kernel repo: Link: https://lore.kernel.org/netdev/[email protected]/
- Drop the patches:
[PATCH net-next 06/16] net: pcs: xpcs: Avoid creating dummy XPCS MDIO device
[PATCH net-next 09/16] net: mdio: Add Synopsys DW XPCS management interface support
[PATCH net-next 11/16] net: pcs: xpcs: Change xpcs_create_mdiodev() suffix to "byaddr"
[PATCH net-next 13/16] net: stmmac: intel: Register generic MDIO device
as no longer relevant.
- Add new patches:
[PATCH net-next v2 03/10] net: pcs: xpcs: Convert xpcs_id to dw_xpcs_desc
[PATCH net-next v2 04/10] net: pcs: xpcs: Convert xpcs_compat to dw_xpcs_compat
[PATCH net-next v2 05/10] net: pcs: xpcs: Introduce DW XPCS info structure
[PATCH net-next v2 09/10] net: stmmac: Create DW XPCS device with particular address
- Use the xpcs_create_fwnode() function name and semantics similar to the
Lynx PCS driver.
- Add kdoc describing the DW XPCS registration functions.
- Convert the memory-mapped DW XPCS device driver to being the
platform-device driver.
- Convert the DW XPCS DT-bindings to defining both memory-mapped and MDIO
devices.
- Drop inline'es from the methods statically defined in *.c. (@Maxime)
- Preserve the strict refcount-ing pattern. (@Russell)
Link: https://lore.kernel.org/netdev/[email protected]/
Changelog v3:
- Implement the ordered clocks constraint. (@Rob)
- Convert xpcs_plat_pm_ops to being defined as static. (@Simon)
- Add the "@interface" argument kdoc to the xpcs_create_mdiodev()
function. (@Simon)
- Fix the "@fwnode" argument name in the xpcs_create_fwnode() method kdoc.
(@Simon)
- Move the return value descriptions to the "Return:" section of the
xpcs_create_mdiodev() and xpcs_create_fwnode() kdoc. (@Simon)
- Drop stmmac_mdio_bus_data::has_xpcs flag and define the PCS-address
mask with particular XPCS address instead.
Link: https://lore.kernel.org/netdev/[email protected]/
Changelog v4:
- Make sure the series is applicable to the net-next tree. (@Vladimir)
- Rename entry to desc in the xpcs_init_id() method. (@Andrew)
- Add a comment to the clock-names property constraint about the
oneOf-subschemas applicability. (@Conor)
- Convert "pclk" clock name to "csr" to match the DW XPCS IP-core
input signal name. (@Rob)
====================
net: stmmac: Add DW XPCS specified via "pcs-handle" support
Recently the DW XPCS DT-bindings have been introduced and the DW XPCS
driver has been altered to support the DW XPCS registered as a platform
device. In order to have the DW XPCS DT-device accessed from the STMMAC
driver let's alter the STMMAC PCS-setup procedure to support the
"pcs-handle" property containing the phandle reference to the DW XPCS
device DT-node. The respective fwnode will be then passed to the
xpcs_create_fwnode() function which in its turn will create the DW XPCS
descriptor utilized in the main driver for the PCS-related setups.
net: stmmac: Create DW XPCS device with particular address
Currently the only STMMAC platform driver using the DW XPCS code is the
Intel mGBE device driver. (It can be determined by finding all the drivers
having the stmmac_mdio_bus_data::has_xpcs flag set.) At the same time the
low-level platform driver masks out the DW XPCS MDIO-address from being
auto-detected as PHY by the MDIO subsystem core. Seeing the PCS MDIO ID is
known the procedure of the DW XPCS device creation can be simplified by
dropping the loop over all the MDIO IDs. From now the DW XPCS device
descriptor will be created for the MDIO-bus address pre-defined by the
platform drivers via the stmmac_mdio_bus_data::pcs_mask field.
Note that this should also speed up the Intel mGBE probing a bit.
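As a rough illustration of replacing the loop over all MDIO IDs with a single pre-defined address, the sketch below derives one bus address from a bit mask such as stmmac_mdio_bus_data::pcs_mask. The helper name pcs_addr_from_mask() is hypothetical and not taken from the driver; it only assumes that each set bit in the mask denotes one of the 32 MDIO bus addresses.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the driver's code): return the lowest MDIO
 * bus address whose bit is set in pcs_mask, or -1 if the mask is empty.
 * An MDIO bus has 32 addresses, 0..31.
 */
int pcs_addr_from_mask(uint32_t pcs_mask)
{
	for (int addr = 0; addr < 32; addr++) {
		if (pcs_mask & (UINT32_C(1) << addr))
			return addr;
	}

	return -1;
}
```

With the address known up front, a single device-creation attempt replaces probing every address on the bus.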
It's now possible to have the DW XPCS device defined as a standard
platform device for instance in the platform DT-file. Although that
functionality is useless unless there is a way to have the device found by
the client drivers (STMMAC/DW *MAC, NXP SJA1105 Eth Switch, etc). Provide
such ability by means of the xpcs_create_fwnode() method. It needs to be
called with the device DW XPCS fwnode instance passed. That node will be
then used to find the MDIO-device instance in order to create the DW XPCS
descriptor.
Note the method semantics and name are similar to what has been recently
introduced in the Lynx PCS driver.
Synopsys DesignWare XPCS IP-core can be synthesized with the device CSRs
being accessible over the MCI or APB3 interface instead of the MDIO bus
(see the CSR_INTERFACE HDL parameter). Thus all the PCS registers can be
just memory mapped and be a subject of the standard MMIO operations of
course taking into account the peculiarities of the Clause C45 CSRs
mapping. From that perspective the DW XPCS devices would look as just
normal platform devices for the kernel.
On the other hand in order to have the DW XPCS devices handled by the
pcs-xpcs.c driver they need to be registered in the framework of the
MDIO-subsystem. So the suggested change is about providing a DW XPCS
platform device driver registering a virtual MDIO-bus with a single
MDIO-device representing the DW XPCS device.
DW XPCS platform device is supposed to be described by the respective
compatible string "snps,dw-xpcs" (or with the PMA-specific compatible
string), CSRs memory space and optional peripheral bus and reference clock
sources. Depending on the INDIRECT_ACCESS IP-core synthesize parameter the
memory-mapped reg-space can be represented as either directly or
indirectly mapped Clause 45 space. In the former case the particular
address is determined based on the MMD device and the registers offset (5
+ 16 bits all together) within the device reg-space. In the latter case
only the 8 lower address bits are utilized for the registers mapping
(255 CSRs). The upper bits are supposed to be written into the respective
viewport CSR in order to select the respective MMD sub-page.
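The two mapping modes described above can be sketched as plain address arithmetic. This is a minimal illustration under stated assumptions: the function and macro names are hypothetical, the 21-bit address is simply the MMD number placed above the 16-bit register offset, and any scaling by the bus register width that a real implementation may need is left out.

```c
#include <stdint.h>

#define DEMO_MMD_MASK    0x1fu    /* 5-bit MMD device number */
#define DEMO_REG_MASK    0xffffu  /* 16-bit register offset within an MMD */
#define DEMO_REG_BITS    16
#define DEMO_PAGE_SHIFT  8        /* the low 8 bits address the mapped window */

/* Direct mode: combine MMD number and register offset into one CSR address. */
uint32_t demo_csr_addr(unsigned int mmd, unsigned int reg)
{
	return ((uint32_t)(mmd & DEMO_MMD_MASK) << DEMO_REG_BITS) |
	       (reg & DEMO_REG_MASK);
}

/* Indirect mode: the upper bits, to be written into the viewport CSR. */
uint32_t demo_csr_page(uint32_t addr)
{
	return addr >> DEMO_PAGE_SHIFT;
}

/* Indirect mode: the offset inside the currently selected sub-page. */
uint32_t demo_csr_offset(uint32_t addr)
{
	return addr & 0xffu;
}
```

In indirect mode an access would first write demo_csr_page() to the viewport and then touch the register at demo_csr_offset(), so the two steps need to be serialized against concurrent accesses.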
Note, only the peripheral bus clock source is requested in the platform
device probe procedure. The core and pad clocks handling has been
implemented in the framework of the xpcs_create() method intentionally
since the clocks-related setups are supposed to be performed later, during
the DW XPCS main configuration procedures. (For instance they will be
required for the DW Gen5 10G PMA configuration.)
Synopsys DesignWare XPCS IP-core is a Physical Coding Sublayer (PCS)
providing an interface between the Media Access Control (MAC) and Physical
Medium Attachment Sublayer (PMA) through a Media independent interface.
From software point of view it exposes IEEE std. Clause 45 CSR space and
can be accessible either by MDIO or MCI/APB3 bus interfaces. In the former
case the PCS device is supposed to be defined under the respective MDIO
bus DT-node. In the latter case the DW xPCS will be just a normal IO
memory-mapped device.
Besides that DW XPCS DT-nodes can have an interrupt signal and clock
source properties specified. The former one indicates the Clause 73/37
auto-negotiation events like: negotiation page received, AN is completed
or incompatible link partner. The clock DT-properties can describe up to
three clock sources: peripheral bus clock source, internal reference clock
and the externally connected reference clock.
Finally the DW XPCS IP-core can be optionally synthesized with a
vendor-specific interface connected to the Synopsys PMA (also called
DesignWare Consumer/Enterprise PHY). Alas that isn't auto-detectable in a
portable way. So if the DW XPCS device has the respective PMA attached
then it should be reflected in the DT-node compatible string so the driver
would be aware of the PMA-specific device capabilities (mainly connected
with CSRs available for the fine-tunings).
The being introduced structure will preserve the PCS and PMA IDs retrieved
from the respective DW XPCS MMDs or potentially pre-defined by the client
drivers. (The latter change will be introduced later in the framework of
the commit adding the memory-mapped DW XPCS devices support.)
The structure fields are filled in in the xpcs_get_id() function, which
used to be responsible for getting the PCS Device ID only. Besides the
PCS ID the method now fetches the PMA/PMD IDs too from MMD 1, which used
to be done in xpcs_dev_flag(). The retrieved PMA ID will be from now
utilized for the PMA-specific tweaks like it was introduced for the
Wangxun TxGBE PCS in the commit f629acc6f210 ("net: pcs: xpcs: support to
switch mode for Wangxun NICs").
Note 1. The xpcs_get_id() error-handling semantics has been changed. From
now the error number will be returned from the function. There is no point
in the next IOs or saving 0xffs and then looping over the actual device
IDs if device couldn't be reached. -ENODEV will be returned if the very
first IO operation failed thus indicating that no device could be found.
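The error-handling semantics from Note 1 might look roughly like the sketch below. It is not the driver's code: the accessor type, the demo_* names and the two stand-in read functions are hypothetical. Only the MMD numbers (1 for PMA/PMD, 3 for PCS) and the device identifier registers 2 and 3 follow the standard Clause 45 layout.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MMD_PMAPMD 1 /* Clause 45 PMA/PMD MMD */
#define DEMO_MMD_PCS    3 /* Clause 45 PCS MMD */
#define DEMO_DEVID1     2 /* device identifier, upper half */
#define DEMO_DEVID2     3 /* device identifier, lower half */

/* Hypothetical accessor: returns a 16-bit value or a negative errno. */
typedef int (*demo_read_fn)(int mmd, int reg);

struct demo_ids {
	uint32_t pcs;
	uint32_t pma;
};

int demo_get_ids(demo_read_fn rd, struct demo_ids *ids)
{
	int ret;

	/* A failure of the very first IO means no device is reachable. */
	ret = rd(DEMO_MMD_PCS, DEMO_DEVID1);
	if (ret < 0)
		return -ENODEV;
	ids->pcs = (uint32_t)ret << 16;

	/* Later failures are passed up unchanged. */
	ret = rd(DEMO_MMD_PCS, DEMO_DEVID2);
	if (ret < 0)
		return ret;
	ids->pcs |= (uint32_t)ret;

	ret = rd(DEMO_MMD_PMAPMD, DEMO_DEVID1);
	if (ret < 0)
		return ret;
	ids->pma = (uint32_t)ret << 16;

	ret = rd(DEMO_MMD_PMAPMD, DEMO_DEVID2);
	if (ret < 0)
		return ret;
	ids->pma |= (uint32_t)ret;

	return 0;
}

/* Stand-in accessors for demonstration only. */
int demo_read_ok(int mmd, int reg)
{
	return (mmd << 8) | reg;
}

int demo_read_dead(int mmd, int reg)
{
	(void)mmd;
	(void)reg;
	return -EIO;
}
```

Returning early avoids storing 0xff patterns as IDs and then matching them against the table of known devices.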
Note 2. The PCS and PMA IDs macros have been converted to enum'es. The
enum'es will be populated later in another commit with the virtual IDs
identifying the DW XPCS devices which have some platform-specifics, but
have been synthesized with the default PCS/PMA ID.
net: pcs: xpcs: Convert xpcs_compat to dw_xpcs_compat
The xpcs_compat structure has been left as the only dw-prefix-less
structure since the previous commit. Let's unify at least the structures
naming in the driver by adding the dw_-prefix to it.
A structure with the PCS/PMA MMD IDs data is being introduced in one of
the next commits. In order to prevent the names ambiguity let's convert
the xpcs_id structure name to dw_xpcs_desc. The latter name is more
suitable since the structure content is indeed the device descriptor
containing the data and callbacks required for the driver to correctly set
the device up.
net: pcs: xpcs: Split up xpcs_create() body to sub-functions
As an initial preparation before adding the fwnode-based DW XPCS device
support let's split the xpcs_create() function code up to a set of the
small sub-functions. Thus the xpcs_create() implementation will get to
look simpler and turn to be more coherent. Further updates will just touch
the new sub-functions a bit: add platform-specific device info, add the
reference clock getting and enabling.
The xpcs_create() method will now contain the next static methods calls:
xpcs_create_data() - create the DW XPCS device descriptor, pre-initialize
its fields and increase the mdio device refcount-er;
xpcs_init_id() - find XPCS ID instance and save it in the device
descriptor;
xpcs_init_iface() - find MAC/PCS interface descriptor and perform
basic initialization specific to it: soft-reset, disable polling.
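The call sequence listed above might be sketched in user-space C as follows. All demo_* names are hypothetical stand-ins rather than the driver's functions, and the fail_* arguments exist only to exercise the error path; the point is the staged initialization with a single unwind label.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_xpcs {
	int refcount;    /* stands in for the mdio device reference */
	int id_found;
	int iface_ready;
};

static struct demo_xpcs *demo_create_data(void)
{
	struct demo_xpcs *xpcs = calloc(1, sizeof(*xpcs));

	if (xpcs)
		xpcs->refcount = 1;
	return xpcs;
}

static int demo_init_id(struct demo_xpcs *xpcs, int fail)
{
	if (fail)
		return -ENODEV;
	xpcs->id_found = 1;
	return 0;
}

static int demo_init_iface(struct demo_xpcs *xpcs, int fail)
{
	if (fail)
		return -EINVAL;
	xpcs->iface_ready = 1;
	return 0;
}

/* Returns the descriptor on success, or NULL with *err set on failure. */
struct demo_xpcs *demo_create(int fail_id, int fail_iface, int *err)
{
	struct demo_xpcs *xpcs;
	int ret;

	xpcs = demo_create_data();
	if (!xpcs) {
		*err = -ENOMEM;
		return NULL;
	}

	ret = demo_init_id(xpcs, fail_id);
	if (ret)
		goto out;

	ret = demo_init_iface(xpcs, fail_iface);
	if (ret)
		goto out;

	*err = 0;
	return xpcs;

out:
	/* Undo demo_create_data(): drop the reference and free the data. */
	free(xpcs);
	*err = ret;
	return NULL;
}
```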
The update doesn't imply any semantic change but merely makes the code
look simpler and more ready for adding new features support.
Note the xpcs_destroy() has been moved to being defined below the
xpcs_create_mdiodev() function as the driver now implies having the
protagonist-then-antagonist functions definition order.
net: pcs: xpcs: Move native device ID macro to linux/pcs/pcs-xpcs.h
One of the next commits will alter the DW XPCS driver to support setting a
custom device ID for the particular MDIO-device detected on the platform.
The generic DW XPCS ID can be used as a custom ID as well in case if the
DW XPCS-device was erroneously synthesized with no or some undefined ID.
In addition to that having all supported DW XPCS device IDs defined in a
single place will improve the code maintainability and readability.
Note, while at it, rename the macros to be shorter and look similar to
the already defined NXP XPCS ID macro.
Convention is DT schemas should define all properties at the top-level
and not inside of if/then schemas. That minimizes the if/then schemas
and is more future proof.
Edward Cree [Wed, 3 Jul 2024 12:18:49 +0000 (13:18 +0100)]
ethtool: move firmware flashing flag to struct ethtool_netdev_state
Commit 31e0aa99dc02 ("ethtool: Veto some operations during firmware flashing process")
added a flag module_fw_flash_in_progress to struct net_device. As
this is ethtool related state, move it to the recently created
struct ethtool_netdev_state, accessed via the 'ethtool' member of
struct net_device.
Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].
Un-embed the net_devices from struct dpaa2_caam_priv_per_cpu by
converting them into pointers, and allocating them dynamically. Leverage
alloc_netdev_dummy() to allocate the net_device object at
dpaa2_dpseci_setup().
The free of the device occurs at dpaa2_dpseci_disable().
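The un-embedding pattern can be illustrated with a small user-space analogue. The demo_* types and functions below are hypothetical: demo_netdev merely stands in for a structure that ends in a flexible array member, and calloc()/free() stand in for the kernel allocation and release calls.

```c
#include <stdlib.h>

/*
 * Stand-in for a structure ending in a flexible array member. Standard C
 * does not allow such a structure to be a member of another structure,
 * so it cannot be embedded by value.
 */
struct demo_netdev {
	int weight;
	unsigned char priv[];
};

/* Before: "struct demo_netdev net_dev;" by value. After: a pointer. */
struct demo_percpu_priv {
	struct demo_netdev *net_dev;
	int other_state;
};

/* Allocate the device dynamically at setup time. */
int demo_setup(struct demo_percpu_priv *p, size_t priv_size)
{
	p->net_dev = calloc(1, sizeof(*p->net_dev) + priv_size);
	return p->net_dev ? 0 : -1;
}

/* Release it on the matching teardown path. */
void demo_disable(struct demo_percpu_priv *p)
{
	free(p->net_dev);
	p->net_dev = NULL;
}
```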
Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].
Un-embed the net_devices from struct caam_qi_pcpu_priv by converting them
into pointers, and allocating them dynamically. Leverage
alloc_netdev_dummy() to allocate the net_device object at
caam_qi_init().
The free of the device occurs at caam_qi_shutdown().
If caam module is built without OF support, the compiler returns the
following warning:
drivers/crypto/caam/ctrl.c:83:34: warning: 'imx8m_machine_match' defined but not used [-Wunused-const-variable=]
imx8m_machine_match is only referenced by of_match_node(), which is set
to NULL if CONFIG_OF is not set, as of commit 5762c20593b6b ("dt: Add
empty of_match_node() macro"):
#define of_match_node(_matches, _node) NULL
Do not create imx8m_machine_match if CONFIG_OF is not set.
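A self-contained analogue of the situation is sketched below. The demo_* names and the table entries are illustrative assumptions, not the driver's code: when CONFIG_OF is not defined the lookup macro discards its table argument, so the table itself is only built under the same condition.

```c
#include <stddef.h>
#include <string.h>

struct demo_device_id {
	const char *compatible;
};

#ifdef CONFIG_OF
/* Built only when the lookup below really references the table. */
static const struct demo_device_id demo_imx8m_match[] = {
	{ "fsl,imx8mq" },	/* illustrative entries */
	{ "fsl,imx8mm" },
	{ NULL }
};

static const struct demo_device_id *
demo_match(const struct demo_device_id *tbl, const char *compat)
{
	for (; tbl->compatible; tbl++) {
		if (!strcmp(tbl->compatible, compat))
			return tbl;
	}
	return NULL;
}

#define demo_match_machine(compat) demo_match(demo_imx8m_match, compat)
#else
/* Stub: nothing refers to a table here, so none is defined. */
#define demo_match_machine(compat) NULL
#endif

int demo_is_imx8m(const char *compat)
{
	(void)compat; /* unused when the stub macro is in effect */
	return demo_match_machine(compat) != NULL;
}
```

Compiled without -DCONFIG_OF, no unused table is left behind to trigger -Wunused-const-variable.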
Merge tag 'net-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, wireless and netfilter.
There's one fix for power management with Intel's e1000e here,
Thorsten tells us there's another problem that started in v6.9. We're
trying to wrap that up but I don't think it's blocking.
Current release - new code bugs:
- wifi: mac80211: disable softirqs for queued frame handling
- af_unix: fix uninit-value in __unix_walk_scc(), with the new
garbage collection algo
Previous releases - regressions:
- Bluetooth:
- qca: fix BT enable failure for QCA6390 after warm reboot
- add quirk to ignore reserved PHY bits in LE Extended Adv Report,
abused by some Broadcom controllers found on Apple machines
- wifi: wilc1000: fix ies_len type in connect path
Previous releases - always broken:
- tcp: fix DSACK undo in fast recovery to call tcp_try_to_open(),
avoid premature timeouts
- net: make sure skb_datagram_iter maps fragments page by page, in
case we somehow get compound highmem mixed in
- eth: bnx2x: fix multiple UBSAN array-index-out-of-bounds when more
queues are used
Misc:
- MAINTAINERS: Remembering Larry Finger"
* tag 'net-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
bnxt_en: Fix the resource check condition for RSS contexts
mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI file
inet_diag: Initialize pad field in struct inet_diag_req_v2
tcp: Don't flag tcp_sk(sk)->rx_opt.saw_unknown for TCP AO.
selftests: make order checking verbose in msg_zerocopy selftest
selftests: fix OOM in msg_zerocopy selftest
ice: use proper macro for testing bit
ice: Reject pin requests with unsupported flags
ice: Don't process extts if PTP is disabled
ice: Fix improper extts handling
selftest: af_unix: Add test case for backtrack after finalising SCC.
af_unix: Fix uninit-value in __unix_walk_scc()
bonding: Fix out-of-bounds read in bond_option_arp_ip_targets_set()
net: rswitch: Avoid use-after-free in rswitch_poll()
netfilter: nf_tables: unconditionally flush pending work before notifier
wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference
wifi: iwlwifi: mvm: avoid link lookup in statistics
wifi: iwlwifi: mvm: don't wake up rx_sync_waitq upon RFKILL
wifi: iwlwifi: properly set WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK
wifi: wilc1000: fix ies_len type in connect path
...
Merge tag 's390-6.10-8' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Fix and add physical to virtual address translations in dasd and
virtio_ccw drivers. For virtio_ccw this is just a minimal fix.
More code cleanup will follow.
- Small defconfig updates
* tag 's390-6.10-8' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/dasd: Fix invalid dereferencing of indirect CCW data pointer
s390/vfio_ccw: Fix target addresses of TIC CCWs
s390: Update defconfigs
Merge tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"6 hotfixes, all cc:stable. Some fixes for longstanding nilfs2 issues
and three unrelated MM fixes"
* tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix incorrect inode allocation from reserved inodes
nilfs2: add missing check for inode numbers on directory entries
nilfs2: fix inode number range checks
mm: avoid overflows in dirty throttling logic
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
mm: optimize the redundant loop of mm_update_owner_next()
bnxt_en: Fix the resource check condition for RSS contexts
While creating a new RSS context, bnxt_rfs_capable() currently
makes a strict check to see if the required VNICs are already
available. If the current VNICs are not what is required,
either too many or not enough, it will call the firmware to
reserve the exact number required.
There is a bug in the firmware when the driver tries to
relinquish some reserved VNICs and RSS contexts. It will
cause the default VNIC to lose its RSS configuration and
cause receive packets to be placed incorrectly.
Work around this problem by skipping the resource reduction.
The driver will not reduce the VNIC and RSS context reservations
when a context is deleted. The resources will be available for
use when new contexts are created later.
Potentially, this workaround can cause us to run out of VNIC
and RSS contexts if there are a lot of VF functions creating
and deleting RSS contexts. In the future, we will conditionally
disable this workaround when the firmware fix is available.
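The difference between the strict check and the workaround can be reduced to two predicates. The function names below are hypothetical and the sketch ignores everything else the driver checks; it only contrasts "reserve on any mismatch" with "reserve only on a shortage".

```c
#include <stdbool.h>

/* Strict check: a surplus or a shortage both trigger a new reservation. */
bool demo_reserve_strict(unsigned int reserved, unsigned int required)
{
	return reserved != required;
}

/* Workaround: only a shortage triggers a reservation; a surplus is kept. */
bool demo_reserve_relaxed(unsigned int reserved, unsigned int required)
{
	return reserved < required;
}
```

With the relaxed predicate a deleted context leaves its resources reserved, which matches the trade-off described above: nothing is handed back to the firmware, so heavy create/delete churn can exhaust the pool.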