Edward Cree [Thu, 27 Jun 2024 15:33:52 +0000 (16:33 +0100)]
sfc: use new rxfh_context API
The core is now responsible for allocating IDs and a memory region for
us to store our state (struct efx_rss_context_priv), so we no longer
need efx_alloc_rss_context_entry() and friends.
Since the contexts are now maintained by the core, use the core's lock
(net_dev->ethtool->rss_lock), rather than our own mutex (efx->rss_lock),
to serialise access against changes; and remove the now-unused
efx->rss_lock from struct efx_nic.
Edward Cree [Thu, 27 Jun 2024 15:33:51 +0000 (16:33 +0100)]
net: ethtool: add a mutex protecting RSS contexts
While this is not needed to serialise the ethtool entry points (which
are all under RTNL), drivers may have cause to asynchronously access
dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
do this safely without needing to take the RTNL.
Edward Cree [Thu, 27 Jun 2024 15:33:49 +0000 (16:33 +0100)]
net: ethtool: let the core choose RSS context IDs
Add a new API to create/modify/remove RSS contexts, that passes in the
newly-chosen context ID (not as a pointer) rather than leaving the
driver to choose it on create. Also pass in the ctx, allowing drivers
to easily use its private data area to store their hardware-specific
state.
Keep the existing .set_rxfh API for now as a fallback, but deprecate it
for custom contexts (rss_context != 0).
Edward Cree [Thu, 27 Jun 2024 15:33:48 +0000 (16:33 +0100)]
net: ethtool: record custom RSS contexts in the XArray
Since drivers are still choosing the context IDs, we have to force the
XArray to use the ID they've chosen rather than picking one ourselves,
and handle the case where they give us an ID that's already in use.
Edward Cree [Thu, 27 Jun 2024 15:33:47 +0000 (16:33 +0100)]
net: ethtool: attach an XArray of custom RSS contexts to a netdevice
Each context stores the RXFH settings (indir, key, and hfunc) as well
as optionally some driver private data.
Delete any still-existing contexts at netdev unregister time.
Jakub Kicinski [Thu, 27 Jun 2024 18:55:01 +0000 (11:55 -0700)]
selftests: drv-net: add ability to schedule cleanup with defer()
This implements what I was describing in [1]. When writing a test
author can schedule cleanup / undo actions right after the creation
completes, eg:
cmd("touch /tmp/file")
defer(cmd, "rm /tmp/file")
defer() takes the function name as first argument, and the rest are
arguments for that function. defer()red functions are called in
inverse order after test exits. It's also possible to capture them
and execute earlier (in which case they get automatically de-queued).
As a nice safety all exceptions from defer()ed calls are captured,
printed, and ignored (they do make the test fail, however).
This addresses the common problem of exceptions in cleanup paths
often being unhandled, leading to potential leaks.
There is a global action queue, flushed by ksft_run(). We could support
function level defers too, I guess, but there's no immediate need..
Jakub Kicinski [Thu, 27 Jun 2024 18:55:00 +0000 (11:55 -0700)]
selftests: net: ksft: avoid continue when handling results
Exception handlers print the result and use continue
to skip the non-exception result printing. This makes
inserting common post-test code hard. Refactor to
avoid the continues and have only one ktap_result() call.
Jon Kohler [Thu, 27 Jun 2024 20:20:13 +0000 (13:20 -0700)]
enic: add ethtool get_channel support
Add .get_channel to enic_ethtool_ops to enable basic ethtool -l
support to get the current channel configuration.
Note that the driver does not support dynamically changing queue
configuration, so .set_channel is intentionally unused. Instead, users
should use Cisco's hardware management tools (UCSM/IMC) to modify
virtual interface card configuration out of band.
Fix UBSAN warnings that occur when using a system with 32 physical
cpu cores or more, or when the user defines a number of Ethernet
queues greater than or equal to FP_SB_MAX_E1x using the num_queues
module parameter.
Currently there is a read/write out of bounds that occurs on the array
"struct stats_query_entry query" present inside the "bnx2x_fw_stats_req"
struct in "drivers/net/ethernet/broadcom/bnx2x/bnx2x.h".
Looking at the definition of the "struct stats_query_entry query" array:
FP_SB_MAX_E1x is defined as the maximum number of fast path interrupts and
has a value of 16, while BNX2X_FIRST_QUEUE_QUERY_IDX has a value of 3
meaning the array has a total size of 19.
Since accesses to "struct stats_query_entry query" are offset-ted by
BNX2X_FIRST_QUEUE_QUERY_IDX, that means that the total number of Ethernet
queues should not exceed FP_SB_MAX_E1x (16). However one of these queues
is reserved for FCOE and thus the number of Ethernet queues should be set
to [FP_SB_MAX_E1x -1] (15) if FCOE is enabled or [FP_SB_MAX_E1x] (16) if
it is not.
This is also described in a comment in the source code in
drivers/net/ethernet/broadcom/bnx2x/bnx2x.h just above the Macro definition
of FP_SB_MAX_E1x. Below is the part of this explanation that it important
for this patch
/*
* The total number of L2 queues, MSIX vectors and HW contexts (CIDs) is
* control by the number of fast-path status blocks supported by the
* device (HW/FW). Each fast-path status block (FP-SB) aka non-default
* status block represents an independent interrupts context that can
* serve a regular L2 networking queue. However special L2 queues such
* as the FCoE queue do not require a FP-SB and other components like
* the CNIC may consume FP-SB reducing the number of possible L2 queues
*
* If the maximum number of FP-SB available is X then:
* a. If CNIC is supported it consumes 1 FP-SB thus the max number of
* regular L2 queues is Y=X-1
* b. In MF mode the actual number of L2 queues is Y= (X-1/MF_factor)
* c. If the FCoE L2 queue is supported the actual number of L2 queues
* is Y+1
* d. The number of irqs (MSIX vectors) is either Y+1 (one extra for
* slow-path interrupts) or Y+2 if CNIC is supported (one additional
* FP interrupt context for the CNIC).
* e. The number of HW context (CID count) is always X or X+1 if FCoE
* L2 queue is supported. The cid for the FCoE L2 queue is always X.
*/
However this driver also supports NICs that use the E2 controller which can
handle more queues due to having more FP-SB represented by FP_SB_MAX_E2.
Looking at the commits when the E2 support was added, it was originally
using the E1x parameters: commit f2e0899f0f27 ("bnx2x: Add 57712 support").
Back then FP_SB_MAX_E2 was set to 16 the same as E1x. However the driver
was later updated to take full advantage of the E2 instead of having it be
limited to the capabilities of the E1x. But as far as we can tell, the
array "stats_query_entry query" was still limited to using the FP-SB
available to the E1x cards as part of an oversignt when the driver was
updated to take full advantage of the E2, and now with the driver being
aware of the greater queue size supported by E2 NICs, it causes the UBSAN
warnings seen in the stack traces below.
This patch increases the size of the "stats_query_entry query" array by
replacing FP_SB_MAX_E1x with FP_SB_MAX_E2 to be large enough to handle
both types of NICs.
Stack traces:
UBSAN: array-index-out-of-bounds in
drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c:1529:11
index 20 is out of range for type 'stats_query_entry [19]'
CPU: 12 PID: 858 Comm: systemd-network Not tainted 6.9.0-060900rc7-generic
#202405052133
Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9,
BIOS P89 10/21/2019
Call Trace:
<TASK>
dump_stack_lvl+0x76/0xa0
dump_stack+0x10/0x20
__ubsan_handle_out_of_bounds+0xcb/0x110
bnx2x_prep_fw_stats_req+0x2e1/0x310 [bnx2x]
bnx2x_stats_init+0x156/0x320 [bnx2x]
bnx2x_post_irq_nic_init+0x81/0x1a0 [bnx2x]
bnx2x_nic_load+0x8e8/0x19e0 [bnx2x]
bnx2x_open+0x16b/0x290 [bnx2x]
__dev_open+0x10e/0x1d0
RIP: 0033:0x736223927a0a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca
64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
RSP: 002b:00007ffc0bb2ada8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000583df50f9c78 RCX: 0000736223927a0a
RDX: 0000000000000020 RSI: 0000583df50ee510 RDI: 0000000000000003
RBP: 0000583df50d4940 R08: 00007ffc0bb2adb0 R09: 0000000000000080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000583df5103ae0
R13: 000000000000035a R14: 0000583df50f9c30 R15: 0000583ddddddf00
</TASK>
---[ end trace ]---
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in
drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c:1546:11
index 28 is out of range for type 'stats_query_entry [19]'
CPU: 12 PID: 858 Comm: systemd-network Not tainted 6.9.0-060900rc7-generic
#202405052133
Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9,
BIOS P89 10/21/2019
Call Trace:
<TASK>
dump_stack_lvl+0x76/0xa0
dump_stack+0x10/0x20
__ubsan_handle_out_of_bounds+0xcb/0x110
bnx2x_prep_fw_stats_req+0x2fd/0x310 [bnx2x]
bnx2x_stats_init+0x156/0x320 [bnx2x]
bnx2x_post_irq_nic_init+0x81/0x1a0 [bnx2x]
bnx2x_nic_load+0x8e8/0x19e0 [bnx2x]
bnx2x_open+0x16b/0x290 [bnx2x]
__dev_open+0x10e/0x1d0
RIP: 0033:0x736223927a0a
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca
64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
RSP: 002b:00007ffc0bb2ada8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000583df50f9c78 RCX: 0000736223927a0a
RDX: 0000000000000020 RSI: 0000583df50ee510 RDI: 0000000000000003
RBP: 0000583df50d4940 R08: 00007ffc0bb2adb0 R09: 0000000000000080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000583df5103ae0
R13: 000000000000035a R14: 0000583df50f9c30 R15: 0000583ddddddf00
</TASK>
---[ end trace ]---
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1895:8
index 29 is out of range for type 'stats_query_entry [19]'
CPU: 13 PID: 163 Comm: kworker/u96:1 Not tainted 6.9.0-060900rc7-generic
#202405052133
Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9,
BIOS P89 10/21/2019
Workqueue: bnx2x bnx2x_sp_task [bnx2x]
Call Trace:
<TASK>
dump_stack_lvl+0x76/0xa0
dump_stack+0x10/0x20
__ubsan_handle_out_of_bounds+0xcb/0x110
bnx2x_iov_adjust_stats_req+0x3c4/0x3d0 [bnx2x]
bnx2x_storm_stats_post.part.0+0x4a/0x330 [bnx2x]
? bnx2x_hw_stats_post+0x231/0x250 [bnx2x]
bnx2x_stats_start+0x44/0x70 [bnx2x]
bnx2x_stats_handle+0x149/0x350 [bnx2x]
bnx2x_attn_int_asserted+0x998/0x9b0 [bnx2x]
bnx2x_sp_task+0x491/0x5c0 [bnx2x]
process_one_work+0x18d/0x3f0
</TASK>
---[ end trace ]---
====================
Lift UDP_SEGMENT restriction for egress via device w/o csum offload
This is a follow-up to an earlier question [1] if we can make UDP GSO work
with any egress device, even those with no checksum offload capability.
That's the default setup for TUN/TAP.
Because there is a change in behavior - sendmsg() does no longer return
EIO error - I'm submitting through net-next tree, rather than net,
as per Willem's advice.
Jakub Sitnicki [Wed, 26 Jun 2024 17:51:27 +0000 (19:51 +0200)]
selftests/net: Add test coverage for UDP GSO software fallback
Extend the existing test to exercise UDP GSO egress through devices with
various offload capabilities, including lack of checksum offload, which is
the default case for TUN/TAP devices.
Test against a dummy device because it is simpler to set up then TUN/TAP.
This is due to a check in the udp stack if the egress device offers
checksum offload. While TUN/TAP devices, by default, don't advertise this
capability because it requires support from the TUN/TAP reader.
However, the GSO stack has a software fallback for checksum calculation,
which we can use. This way we don't force UDP_SEGMENT users to handle the
EIO error and implement a segmentation fallback.
Lift the restriction so that UDP_SEGMENT can be used with any egress
device. We also need to adjust the UDP GSO code to match the GSO stack
expectation about ip_summed field, as set in commit 8d63bee643f1 ("net:
avoid skb_warn_bad_offload false positives on UFO"). Otherwise we will hit
the bad offload check.
Users should, however, expect a potential performance impact when
batch-sending packets with UDP_SEGMENT without checksum offload on the
egress device. In such case the packet payload is read twice: first during
the sendmsg syscall when copying data from user memory, and then in the GSO
stack for checksum computation. This double memory read can be less
efficient than a regular sendmsg where the checksum is calculated during
the initial data copy from user memory.
Linus Torvalds [Fri, 28 Jun 2024 23:14:59 +0000 (16:14 -0700)]
Merge tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for vector load/store instruction decoding, which could result
in reserved vector element length encodings decoding as valid vector
instructions.
- Instruction patching now aggressively flushes the local instruction
cache, to avoid situations where patching functions on the flush path
results in torn instructions being fetched.
- A fix to prevent the stack walker from showing up as part of traces.
* tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: stacktrace: convert arch_stack_walk() to noinstr
riscv: patch: Flush the icache right after patching to avoid illegal insns
RISC-V: fix vector insn load/store width mask
This patch series tries to work around erratum DS80000789E 6 of the
mcp2518fd, found by Stefan Althöfer, the other variants of the chip
family (mcp2517fd and mcp251863) are probably also affected.
Erratum DS80000789E 6 says "reading of the FIFOCI bits in the FIFOSTA
register for an RX FIFO may be corrupted". However observation shows
that this problem is not limited to RX FIFOs but also effects the TEF
FIFO.
In the bad case, the driver reads a too large head index. In the
original code, the driver always trusted the read value.
For the RX FIDO this caused old, already processed CAN frames or new,
incompletely written CAN frames to be (re-)processed.
To work around this issue, keep a per FIFO timestamp of the last valid
received CAN frame and compare against the timestamp of every received
CAN frame.
Further tests showed that this workaround can recognize old CAN
frames, but a small time window remains in which partially written CAN
frames are not recognized but then processed. These CAN frames have
the correct data and time stamps, but the DLC has not yet been
updated.
For the TEF FIFO the original driver already detects the error, update
the error handling with the knowledge that it is causes by this erratum.
can: mcp251xfd: tef: update workaround for erratum DS80000789E 6 of mcp2518fd
This patch updates the workaround for a problem similar to erratum
DS80000789E 6 of the mcp2518fd, the other variants of the chip
family (mcp2517fd and mcp251863) are probably also affected.
Erratum DS80000789E 6 says "reading of the FIFOCI bits in the FIFOSTA
register for an RX FIFO may be corrupted". However observation shows
that this problem is not limited to RX FIFOs but also effects the TEF
FIFO.
In the bad case, the driver reads a too large head index. As the FIFO
is implemented as a ring buffer, this results in re-handling old CAN
transmit complete events.
Every transmit complete event contains with a sequence number that
equals to the sequence number of the corresponding TX request. This
way old TX complete events can be detected.
If the original driver detects a non matching sequence number, it
prints an info message and tries again later. As wrong sequence
numbers can be explained by the erratum DS80000789E 6, demote the info
message to debug level, streamline the code and update the comments.
Keep the behavior: If an old CAN TX complete event is detected, abort
the iteration and mark the number of valid CAN TX complete events as
processed in the chip by incrementing the FIFO's tail index.
can: mcp251xfd: tef: prepare to workaround broken TEF FIFO tail index erratum
This is a preparatory patch to work around a problem similar to
erratum DS80000789E 6 of the mcp2518fd, the other variants of the chip
family (mcp2517fd and mcp251863) are probably also affected.
Erratum DS80000789E 6 says "reading of the FIFOCI bits in the FIFOSTA
register for an RX FIFO may be corrupted". However observation shows
that this problem is not limited to RX FIFOs but also effects the TEF
FIFO.
When handling the TEF interrupt, the driver reads the FIFO header
index from the TEF FIFO STA register of the chip.
In the bad case, the driver reads a too large head index. In the
original code, the driver always trusted the read value, which caused
old CAN transmit complete events that were already processed to be
re-processed.
Instead of reading and trusting the head index, read the head index
and calculate the number of CAN frames that were supposedly received -
replace mcp251xfd_tef_ring_update() with mcp251xfd_get_tef_len().
The mcp251xfd_handle_tefif() function reads the CAN transmit complete
events from the chip, iterates over them and pushes them into the
network stack. The original driver already contains code to detect old
CAN transmit complete events, that will be updated in the next patch.
can: mcp251xfd: rx: add workaround for erratum DS80000789E 6 of mcp2518fd
This patch tries to works around erratum DS80000789E 6 of the
mcp2518fd, the other variants of the chip family (mcp2517fd and
mcp251863) are probably also affected.
In the bad case, the driver reads a too large head index. In the
original code, the driver always trusted the read value, which caused
old, already processed CAN frames or new, incompletely written CAN
frames to be (re-)processed.
To work around this issue, keep a per FIFO timestamp [1] of the last
valid received CAN frame and compare against the timestamp of every
received CAN frame. If an old CAN frame is detected, abort the
iteration and mark the number of valid CAN frames as processed in the
chip by incrementing the FIFO's tail index.
Further tests showed that this workaround can recognize old CAN
frames, but a small time window remains in which partially written CAN
frames [2] are not recognized but then processed. These CAN frames
have the correct data and time stamps, but the DLC has not yet been
updated.
[1] As the raw timestamp overflows every 107 seconds (at the usual
clock rate of 40 MHz) convert it to nanoseconds with the
timecounter framework and use this to detect stale CAN frames.
can: mcp251xfd: rx: prepare to workaround broken RX FIFO head index erratum
This is a preparatory patch to work around erratum DS80000789E 6 of
the mcp2518fd, the other variants of the chip family (mcp2517fd and
mcp251863) are probably also affected.
When handling the RX interrupt, the driver iterates over all pending
FIFOs (which are implemented as ring buffers in hardware) and reads
the FIFO header index from the RX FIFO STA register of the chip.
In the bad case, the driver reads a too large head index. In the
original code, the driver always trusted the read value, which caused
old CAN frames that were already processed, or new, incompletely
written CAN frames to be (re-)processed.
Instead of reading and trusting the head index, read the head index
and calculate the number of CAN frames that were supposedly received -
replace mcp251xfd_rx_ring_update() with mcp251xfd_get_rx_len().
The mcp251xfd_handle_rxif_ring() function reads the received CAN
frames from the chip, iterates over them and pushes them into the
network stack. Prepare that the iteration can be stopped if an old CAN
frame is detected. The actual code to detect old or incomplete frames
and abort will be added in the next patch.
can: mcp251xfd: mcp251xfd_handle_rxif_ring_uinc(): factor out in separate function
This is a preparation patch.
Sending the UINC messages followed by incrementing the tail pointer
will be called in more than one place in upcoming patches, so factor
this out into a separate function.
Also make mcp251xfd_handle_rxif_ring_uinc() safe to be called with a
"len" of 0.
The mcp251xfd chip is configured to provide a timestamp with each
received and transmitted CAN frame. The timestamp is derived from the
internal free-running timer, which can also be read from the TBC
register via SPI. The timer is 32 bits wide and is clocked by the
external oscillator (typically 20 or 40 MHz).
To avoid confusion, we call this timestamp "timestamp_raw" or "ts_raw"
for short.
Using the timecounter framework, the "ts_raw" is converted to 64 bit
nanoseconds since the epoch. This is what we call "timestamp".
This is a preparation for the next patches which use the "timestamp"
to work around a bug where so far only the "ts_raw" is used.
can: mcp251xfd: move mcp251xfd_timestamp_start()/stop() into mcp251xfd_chip_start/stop()
The mcp251xfd wakes up from Low Power or Sleep Mode when SPI activity
is detected. To avoid this, make sure that the timestamp worker is
stopped before shutting down the chip.
Split the starting of the timestamp worker out of
mcp251xfd_timestamp_init() into the separate function
mcp251xfd_timestamp_start().
Call mcp251xfd_timestamp_init() before mcp251xfd_chip_start(), move
mcp251xfd_timestamp_start() to mcp251xfd_chip_start(). In this way,
mcp251xfd_timestamp_stop() can be called unconditionally by
mcp251xfd_chip_stop().
Since the errata references have been added to the driver, new errata
sheets have been published. Update the references for the mcp2517fd
and mcp2518fd. For completeness add references for the mcp251863.
Linus Torvalds [Fri, 28 Jun 2024 21:27:22 +0000 (14:27 -0700)]
x86: stop playing stack games in profile_pc()
The 'profile_pc()' function is used for timer-based profiling, which
isn't really all that relevant any more to begin with, but it also ends
up making assumptions based on the stack layout that aren't necessarily
valid.
Basically, the code tries to account the time spent in spinlocks to the
caller rather than the spinlock, and while I support that as a concept,
it's not worth the code complexity or the KASAN warnings when no serious
profiling is done using timers anyway these days.
And the code really does depend on stack layout that is only true in the
simplest of cases. We've lost the comment at some point (I think when
the 32-bit and 64-bit code was unified), but it used to say:
Assume the lock function has either no stack frame or a copy
of eflags from PUSHF.
which explains why it just blindly loads a word or two straight off the
stack pointer and then takes a minimal look at the values to just check
if they might be eflags or the return pc:
Eflags always has bits 22 and up cleared unlike kernel addresses
but that basic stack layout assumption assumes that there isn't any lock
debugging etc going on that would complicate the code and cause a stack
frame.
It causes KASAN unhappiness reported for years by syzkaller [1] and
others [2].
With no real practical reason for this any more, just remove the code.
Just for historical interest, here's some background commits relating to
this code from 2006:
0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels") 31679f38d886 ("Simplify profile_pc on x86-64")
Przemek Kitszel [Mon, 17 Jun 2024 13:24:07 +0000 (15:24 +0200)]
ice: do not init struct ice_adapter more times than needed
Allocate and initialize struct ice_adapter object only once per physical
card instead of once per port. This is not a big deal by now, but we want
to extend this struct more and more in the near future. Our plans include
PTP stuff and a devlink instance representing whole-device/physical card.
Transactions requiring to be sleep-able (like those doing user (here ice)
memory allocation) must be performed with an additional (on top of xarray)
mutex. Adding it here removes need to xa_lock() manually.
Since this commit is a reimplementation of ice_adapter_get(), a rather new
scoped_guard() wrapper for locking is used to simplify the logic.
It's worth to mention that xa_insert() use gives us both slot reservation
and checks if it is already filled, what simplifies code a tiny bit.
Wolfram Sang [Thu, 27 Jun 2024 11:14:48 +0000 (13:14 +0200)]
i2c: testunit: discard write requests while old command is running
When clearing registers on new write requests was added, the protection
for currently running commands was missed leading to concurrent access
to the testunit registers. Check the flag beforehand.
Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <[email protected]> Reviewed-by: Andi Shyti <[email protected]>
Wolfram Sang [Thu, 27 Jun 2024 11:14:47 +0000 (13:14 +0200)]
i2c: testunit: don't erase registers after STOP
STOP fallsthrough to WRITE_REQUESTED but this became problematic when
clearing the testunit registers was added to the latter. Actually, there
is no reason to clear the testunit state after STOP. Doing it when a new
WRITE_REQUESTED arrives is enough. So, no need to fallthrough, at all.
Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <[email protected]> Reviewed-by: Andi Shyti <[email protected]>
Wolfram Sang [Fri, 28 Jun 2024 18:38:20 +0000 (20:38 +0200)]
Merge tag 'i2c-host-fixes-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
Fixed a build error following the major refactoring involving the
VIA-I2C modules. Originally, the code was split to group together
parts that would be used by different drivers. This caused build
issues when two modules linked to the same code.
1365 sk->sk_state = BT_CONFIG;
1366 }
1367 release_sock(sk);
1368 return 0;
1369 case BT_CONNECTED:
1370 if (test_bit(BT_SK_PA_SYNC,
Fixes: fbdc4bc47268 ("Bluetooth: ISO: Use defer setup to separate PA sync and BIG sync") Signed-off-by: Iulia Tanasescu <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
The problem occurs between the system call to close the sock and hci_rx_work,
where the former releases the sock and the latter accesses it without lock protection.
If hci_rx_work processes the data that needs to be received before the sock is
closed, then everything is normal; Otherwise, the work thread may access the
released sock when receiving data.
Add a chan mutex in the rx callback of the sock to achieve synchronization between
the sock release and recv cb.
Sock is dead, so set chan data to NULL, avoid others use invalid sock pointer.
hci_le_big_sync_established_evt is necessary to filter out cases where the
handle value is belonging to ida id range, otherwise ida will be erroneously
released in hci_conn_cleanup.
Fixes: 181a42edddf5 ("Bluetooth: Make handle of hci_conn be unique") Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=b2545b087a01a7319474 Signed-off-by: Edward Adam Davis <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Bluetooth: btnxpuart: Enable Power Save feature on startup
This sets the default power save mode setting to enabled.
The power save feature is now stable and stress test issues, such as the
TX timeout error, have been resolved.
commit c7ee0bc8db32 ("Bluetooth: btnxpuart: Resolve TX timeout error in
power save stress test")
With this setting, the driver will send the vendor command to FW at
startup, to enable power save feature.
User can disable this feature using the following vendor command:
hcitool cmd 3f 23 03 00 00 (HCI_NXP_AUTO_SLEEP_MODE)
Tetsuo Handa [Mon, 10 Jun 2024 11:00:32 +0000 (20:00 +0900)]
Bluetooth: hci_core: cancel all works upon hci_unregister_dev()
syzbot is reporting that calling hci_release_dev() from hci_error_reset()
due to hci_dev_put() from hci_error_reset() can cause deadlock at
destroy_workqueue(), for hci_error_reset() is called from
hdev->req_workqueue which destroy_workqueue() needs to flush.
We need to make sure that hdev->{rx_work,cmd_work,tx_work} which are
queued into hdev->workqueue and hdev->{power_on,error_reset} which are
queued into hdev->req_workqueue are no longer running by the moment
Zijun Hu [Thu, 16 May 2024 13:31:34 +0000 (21:31 +0800)]
Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm reboot
Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
serdev") will cause below regression issue:
BT can't be enabled after below steps:
cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
if property enable-gpios is not configured within DT|ACPI for QCA6390.
The commit is to fix a use-after-free issue within qca_serdev_shutdown()
by adding condition to avoid the serdev is flushed or wrote after closed
but also introduces this regression issue regarding above steps since the
VSC is not sent to reset controller during warm reboot.
Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
once BT was ever enabled, and the use-after-free issue is also fixed by
this change since the serdev is still opened before it is flushed or wrote.
Verified by the reported machine Dell XPS 13 9310 laptop over below two
kernel commits:
commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of bluetooth-next tree.
commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of linus mainline tree.
Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev") Cc: [email protected] Reported-by: Wren Turkal <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726 Signed-off-by: Zijun Hu <[email protected]> Tested-by: Wren Turkal <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Bluetooth: hci_event: Fix setting of unicast qos interval
qos->ucast interval reffers to the SDU interval, and should not
be set to the interval value reported by the LE CIS Established
event since the latter reffers to the ISO interval. These two
interval are not the same thing:
BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 6, Part G
Isochronous interval:
The time between two consecutive BIS or CIS events (designated
ISO_Interval in the Link Layer)
SDU interval:
The nominal time between two consecutive SDUs that are sent or
received by the upper layer.
So this instead uses the following formula from the spec to calculate
the resulting SDU interface:
BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 6, Part G
page 3075:
Sven Peter [Wed, 15 May 2024 18:02:58 +0000 (18:02 +0000)]
Bluetooth: Add quirk to ignore reserved PHY bits in LE Extended Adv Report
Some Broadcom controllers found on Apple Silicon machines abuse the
reserved bits inside the PHY fields of LE Extended Advertising Report
events for additional flags. Add a quirk to drop these and correctly
extract the Primary/Secondary_PHY field.
The following excerpt from a btmon trace shows a report received with
"Reserved" for "Primary PHY" on a 4388 controller:
> HCI Event: LE Meta Event (0x3e) plen 26
LE Extended Advertising Report (0x0d)
Num reports: 1
Entry 0
Event type: 0x2515
Props: 0x0015
Connectable
Directed
Use legacy advertising PDUs
Data status: Complete
Reserved (0x2500)
Legacy PDU Type: Reserved (0x2515)
Address type: Random (0x01)
Address: 00:00:00:00:00:00 (Static)
Primary PHY: Reserved
Secondary PHY: No packets
SID: no ADI field (0xff)
TX power: 127 dBm
RSSI: -60 dBm (0xc4)
Periodic advertising interval: 0.00 msec (0x0000)
Direct address type: Public (0x00)
Direct address: 00:00:00:00:00:00 (Apple, Inc.)
Data length: 0x00
Cc: [email protected] Fixes: 2e7ed5f5e69b ("Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync") Reported-by: Janne Grunau <[email protected]> Closes: https://lore.kernel.org/all/Zjz0atzRhFykROM9@robin Tested-by: Janne Grunau <[email protected]> Signed-off-by: Sven Peter <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Piotr Gardocki [Fri, 14 Jun 2024 10:38:11 +0000 (12:38 +0200)]
ice: Distinguish driver reset and removal for AQ shutdown
Admin queue command for shutdown AQ contains a flag to indicate driver
unload. However, the flag is always set in the driver, even for resets. It
can cause the firmware to consider driver as unloaded once the PF reset is
triggered on all ports of device, which could lead to unexpected results.
Add an additional function parameter to functions that shutdown AQ,
indicating whether the driver is actually unloading.
Paul Greenwalt [Thu, 6 Jun 2024 13:48:46 +0000 (09:48 -0400)]
ice: Allow different FW API versions based on MAC type
Allow the driver to be compatible with different FW API versions based
on the device's MAC type. Currently, E810 is only compatible with one
FW API version. Now the driver can be compatible with different FW API
versions for both E810 and E830. For example, E810 FW API version is
1.5.0 and E830 is 1.7.0.
Linus Torvalds [Fri, 28 Jun 2024 16:32:33 +0000 (09:32 -0700)]
Merge tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Due to a late review, revert and re-fix a recent crasher fix
* tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "nfsd: fix oops when reading pool_stats before server is started"
nfsd: initialise nfsd_info.mutex early.
Linus Torvalds [Fri, 28 Jun 2024 16:25:21 +0000 (09:25 -0700)]
Merge tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Simple stuff:
- NULL ptr/err ptr deref fixes
- fix for getting wedged on shutdown after journal error
- fix missing recalc_capacity() call, capacity now changes correctly
after a device goes read only
however: our capacity calculation still doesn't take into account
when we have mixed ro/rw devices and the ro devices have data on
them, that's going to be a more involved fix to separate accounting
for "capacity used on ro devices" and "capacity used on rw devices"
- boring syzbot stuff
Slightly more involved:
- discard, invalidate workers are now per device
this has the effect of simplifying how we take device refs in these
paths, and the device ref cleanup fixes a longstanding race between
the device removal path and the discard path
- fixes for how the debugfs code takes refs on btree_trans objects we
have debugfs code that prints in use btree_trans objects.
It uses closure_get() on trans->ref, which is mainly for the cycle
detector, but the debugfs code was using it on a closure that may
have hit 0, which is not allowed; for performance reasons we cannot
avoid having not-in-use transactions on the global list.
Introduce some new primitives to fix this and make the
synchronization here a whole lot saner"
* tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix kmalloc bug in __snapshot_t_mut
bcachefs: Discard, invalidate workers are now per device
bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
bcachefs: Add missing bch2_journal_do_writes() call
bcachefs: Fix null ptr deref in journal_pins_to_text()
bcachefs: Add missing recalc_capacity() call
bcachefs: Fix btree_trans list ordering
bcachefs: Fix race between trans_put() and btree_transactions_read()
closures: closure_get_not_zero(), closure_return_sync()
bcachefs: Make btree_deadlock_to_text() clearer
bcachefs: fix seqmutex_relock()
bcachefs: Fix freeing of error pointers
Linus Torvalds [Fri, 28 Jun 2024 16:21:27 +0000 (09:21 -0700)]
Merge tag 'block-6.10-20240628' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"NVMe fixes via Keith:
- Fabrics fixes (Hannes)
- Missing module description (Jeff)
- Clang warning fix (Nathan)"
* tag 'block-6.10-20240628' of git://git.kernel.dk/linux:
nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]
nvmet: make 'tsas' attribute idempotent for RDMA
nvme: fixup comment for nvme RDMA Provider Type
nvme-apple: add missing MODULE_DESCRIPTION()
nvmet: do not return 'reserved' for empty TSAS values
nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.
Linus Torvalds [Fri, 28 Jun 2024 16:18:01 +0000 (09:18 -0700)]
Merge tag 'iommu-fixes-v6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Two cache flushing fixes for Intel and AMD drivers
- AMD guest translation enabling fix
- Update IOMMU tree location in MAINTAINERS file
* tag 'iommu-fixes-v6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
MAINTAINERS: Update IOMMU tree location
iommu/amd: Fix GT feature enablement again
iommu/vt-d: Fix missed device TLB cache tag
iommu/amd: Invalidate cache before removing device from domain list
Linus Torvalds [Fri, 28 Jun 2024 16:15:13 +0000 (09:15 -0700)]
Merge tag 'gpio-fixes-for-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"An assortment of driver fixes and two commits addressing a bad
behavior of the GPIO uAPI when reconfiguring requested lines.
- fix a race condition in i2c transfers by adding a missing i2c lock
section in gpio-pca953x
- validate the number of obtained interrupts in gpio-davinci
- add missing raw_spinlock_init() in gpio-graniterapids
- fix bad character device behavior: disallow GPIO line
reconfiguration without set direction both in v1 and v2 uAPI"
* tag 'gpio-fixes-for-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: cdev: Ignore reconfiguration without direction
gpiolib: cdev: Disallow reconfiguration without direction (uAPI v1)
gpio: graniterapids: Add missing raw_spinlock_init()
gpio: davinci: Validate the obtained number of IRQs
gpio: pca953x: fix pca953x_irq_bus_sync_unlock race
Linus Torvalds [Fri, 28 Jun 2024 16:10:01 +0000 (09:10 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"A pair of small arm64 fixes for -rc6.
One is a fix for the recently merged uffd-wp support (which was
triggering a spurious warning) and the other is a fix to the clearing
of the initial idmap pgd in some configurations
Summary:
- Fix spurious page-table warning when clearing PTE_UFFD_WP in a live
pte
- Fix clearing of the idmap pgd when using large addressing modes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Clear the initial ID map correctly before remapping
arm64: mm: Permit PTE SW bits to change in live mappings
Linus Torvalds [Fri, 28 Jun 2024 16:04:33 +0000 (09:04 -0700)]
Merge tag 'v6.10-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat fixes from Len Brown:
"Fix three recent minor turbostat regressions"
* tag 'v6.10-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: Add local build_bug.h header for snapshot target
tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
tools/power turbostat: option '-n' is ambiguous
Phil Sutter [Wed, 26 Jun 2024 22:35:05 +0000 (00:35 +0200)]
netfilter: xt_recent: Lift restrictions on max hitcount value
Support tracking of up to 65535 packets per table entry instead of just
255 to better facilitate longer term tracking or higher throughput
scenarios.
Note how this aligns sizes of struct recent_entry's 'nstamps' and
'index' fields when 'nstamps' was larger before. This is unnecessary as
the value of 'nstamps' grows along with that of 'index' after being
initialized to 1 (see recent_entry_update()). Its value will thus never
exceed that of 'index' and therefore does not need to provide space for
larger values.
Florian Westphal [Tue, 25 Jun 2024 19:07:44 +0000 (21:07 +0200)]
selftests: netfilter: nft_queue.sh: add test for disappearing listener
If userspace program exits while the queue its subscribed to has packets
those need to be discarded.
commit dc21c6cc3d69 ("netfilter: nfnetlink_queue: acquire rcu_read_lock()
in instance_destroy_rcu()") fixed a (harmless) rcu splat that could be
triggered in this case.
tty: mxser: Remove __counted_by from mxser_board.ports[]
Work for __counted_by on generic pointers in structures (not just
flexible array members) has started landing in Clang 19 (current tip of
tree). During the development of this feature, a restriction was added
to __counted_by to prevent the flexible array member's element type from
including a flexible array member itself such as:
struct foo {
int count;
char buf[];
};
struct bar {
int count;
struct foo data[] __counted_by(count);
};
because the size of data cannot be calculated with the standard array
size formula:
sizeof(struct foo) * count
This restriction was downgraded to a warning but due to CONFIG_WERROR,
it can still break the build. The application of __counted_by on the
ports member of 'struct mxser_board' triggers this restriction,
resulting in:
drivers/tty/mxser.c:291:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct mxser_port' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size]
291 | struct mxser_port ports[] __counted_by(nports);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Remove this use of __counted_by to fix the warning/error. However,
rather than remove it altogether, leave it commented, as it may be
possible to support this in future compiler releases.
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.
The prior intent of the per-architecture limits were:
arm64: capped at 0x1FF (9 bits), 5 bits effective
powerpc: uncapped (10 bits), 6 or 7 bits effective
riscv: uncapped (10 bits), 6 bits effective
x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
s390: capped at 0xFF (8 bits), undocumented effective entropy
Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/string_kunit.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/string_helpers_kunit.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Niklas Cassel [Thu, 27 Jun 2024 10:55:52 +0000 (12:55 +0200)]
ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
We got another report that CT1000BX500SSD1 does not work with LPM.
If you look in libata-core.c, we have six different Crucial devices that
are marked with ATA_HORKAGE_NOLPM. This model would have been the seventh.
(This quirk is used on Crucial models starting with both CT* and
Crucial_CT*)
It is obvious that this vendor does not have a great history of supporting
LPM properly, therefore, add the ATA_HORKAGE_NOLPM quirk for all Crucial
BX SSD1 models.
Mickaël Salaün [Fri, 21 Jun 2024 18:06:05 +0000 (20:06 +0200)]
selftests/harness: Fix tests timeout and race condition
We cannot use CLONE_VFORK because we also need to wait for the timeout
signal.
Restore tests timeout by using the original fork() call in __run_test()
but also in __TEST_F_IMPL(). Also fix a race condition when waiting for
the test child process.
Because test metadata are shared between test processes, only the
parent process must set the test PID (child). Otherwise, t->pid may be
set to zero, leading to inconsistent error cases:
# RUN layout1.rule_on_mountpoint ...
# rule_on_mountpoint: Test ended in some other way [127]
# OK layout1.rule_on_mountpoint
ok 20 layout1.rule_on_mountpoint
As safeguards, initialize the "status" variable with a valid exit code,
and handle unknown test exits as errors.
The use of fork() introduces a new race condition in landlock/fs_test.c
which seems to be specific to hostfs bind mounts, but I haven't found
the root cause and it's difficult to trigger. I'll try to fix it with
another patch.
Leon Romanovsky [Thu, 27 Jun 2024 18:02:40 +0000 (21:02 +0300)]
net/mlx5e: Approximate IPsec per-SA payload data bytes count
ConnectX devices lack ability to count payload data byte size which is
needed for SA to return to libreswan for rekeying.
As a solution let's approximate that by decreasing headers size from
total size counted by flow steering. The calculation doesn't take into
account any other headers which can be in the packet (e.g. IP extensions).
Fixes: 5a6cddb89b51 ("net/mlx5e: Update IPsec per SA packets/bytes count") Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Leon Romanovsky [Thu, 27 Jun 2024 18:02:39 +0000 (21:02 +0300)]
net/mlx5e: Present succeeded IPsec SA bytes and packet
IPsec SA statistics presents successfully decrypted and encrypted
packet and bytes, and not total handled by this SA. So update the
calculation logic to take into account failures.
Fixes: 6fb7f9408779 ("net/mlx5e: Connect mlx5 IPsec statistics with XFRM core") Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jianbo Liu [Thu, 27 Jun 2024 18:02:38 +0000 (21:02 +0300)]
net/mlx5e: Add mqprio_rl cleanup and free in mlx5e_priv_cleanup()
In the cited commit, mqprio_rl cleanup and free are mistakenly removed
in mlx5e_priv_cleanup(), and it causes the leakage of host memory and
firmware SCHEDULING_ELEMENT objects while changing eswitch mode. So,
add them back.
Chris Mi [Thu, 27 Jun 2024 18:02:37 +0000 (21:02 +0300)]
net/mlx5: E-switch, Create ingress ACL when needed
Currently, ingress acl is used for three features. It is created only
when vport metadata match and prio tag are enabled. But active-backup
lag mode also uses it. It is independent of vport metadata match and
prio tag. And vport metadata match can be disabled using the
following devlink command:
# devlink dev param set pci/0000:08:00.0 name esw_port_metadata \
value false cmode runtime
If ingress acl is not created, will hit panic when creating drop rule
for active-backup lag mode. If always create it, there will be about
5% performance degradation.
Fix it by creating ingress acl when needed. If esw_port_metadata is
true, ingress acl exists, then create drop rule using existing
ingress acl. If esw_port_metadata is false, create ingress acl and
then create drop rule.
Fixes: 1749c4c51c16 ("net/mlx5: E-switch, add drop rule support to ingress ACL") Signed-off-by: Chris Mi <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Jurgens [Thu, 27 Jun 2024 18:02:36 +0000 (21:02 +0300)]
net/mlx5: Use max_num_eqs_24b when setting max_io_eqs
Due a bug in the device max_num_eqs doesn't always reflect a written
value. As a result, setting max_io_eqs may not work but appear
successful. Instead write max_num_eqs_24b, which reflects correct
value.
David S. Miller [Fri, 28 Jun 2024 09:55:38 +0000 (10:55 +0100)]
Merge branch 'net-selftests-mirroring-cleanup' into main
Petr Machata says:
====================
selftest: Clean-up and stabilize mirroring tests
The mirroring selftests work by sending ICMP traffic between two hosts.
Along the way, this traffic is mirrored to a gretap netdevice, and counter
taps are then installed strategically along the path of the mirrored
traffic to verify the mirroring took place.
The problem with this is that besides mirroring the primary traffic, any
other service traffic is mirrored as well. At the same time, because the
tests need to work in HW-offloaded scenarios, the ability of the device to
do arbitrary packet inspection should not be taken for granted. Most tests
therefore simply use matchall, one uses flower to match on IP address.
As a result, the selftests are noisy.
mirror_test() accommodated this noisiness by giving the counters an
allowance of several packets. But that only works up to a point, and on
busy systems won't be always enough.
In this patch set, clean up and stabilize the mirroring selftests. The
original intention was to port the tests over to UDP, but the logic of
ICMP ends up being so entangled in the mirroring selftests that the
changes feel overly invasive. Instead, ICMP is kept, but where possible,
we match on ICMP message type, thus filtering out hits by other ICMP
messages.
Where this is not practical (where the counter tap is put on a device
that carries encapsulated packets), switch the counter condition to _at
least_ X observed packets. This is less robust, but barely so --
probably the only scenario that this would not catch is something like
erroneous packet duplication, which would hopefully get caught by the
numerous other tests in this extensive suite.
- Patches #1 to #3 clean up parameters at various helpers.
- Patches #4 to #6 stabilize the mirroring selftests as described above.
- Mirroring tests currently allow testing SW datapath even on HW
netdevices by trapping traffic to the SW datapath. This complicates
the tests a bit without a good reason: to test SW datapath, just run
the selftests on the veth topology. Thus in patch #7, drop support for
this dual SW/HW testing.
- At this point, some cleanups were either made possible by the previous
patches, or were always possible. In patches #8 to #11, realize these
cleanups.
- In patch #12, fix mlxsw mirror_gre selftest to respect setting TESTS.
====================
Petr Machata [Thu, 27 Jun 2024 14:48:49 +0000 (16:48 +0200)]
selftests: mlxsw: mirror_gre: Obey TESTS
This test is unusual in that overriding TESTS does not change the tests to
be run. Split the individual tests into several functions and invoke them
through tests_run() as appropriate.
Petr Machata [Thu, 27 Jun 2024 14:48:44 +0000 (16:48 +0200)]
selftests: mirror: Drop dual SW/HW testing
The mirroring tests are currently run in a skip_hw and optionally a skip_sw
mode. The former tests the SW datapath, the latter the HW datapath, if
available. In order to be able to test SW datapath on HW loopbacks, traps
are installed on ingress to get traffic from the HW datapath to the SW one.
This adds an unnecessary complexity when it would be much simpler to just
use a veth-based topology to test the SW datapath. Thus drop all the code
that supports this dual testing.
Petr Machata [Thu, 27 Jun 2024 14:48:43 +0000 (16:48 +0200)]
selftests: mirror: mirror_test(): Allow exact count of packets
The mirroring selftests work by sending ICMP traffic between two hosts.
Along the way, this traffic is mirrored to a gretap netdevice, and counter
taps are then installed strategically along the path of the mirrored
traffic to verify the mirroring took place.
The problem with this is that besides mirroring the primary traffic, any
other service traffic is mirrored as well. At the same time, because the
tests need to work in HW-offloaded scenarios, the ability of the device to
do arbitrary packet inspection should not be taken for granted. Most tests
therefore simply use matchall, one uses flower to match on IP address.
As a result, the selftests are noisy, because besides the primary ICMP
traffic, any amount of other service traffic is mirrored as well.
mirror_test() accommodated this noisiness by giving the counters an
allowance of several packets. But in the previous patch, where possible,
counter taps were changed to match only on an exact ICMP message. At least
in those cases, we can demand an exact number of packets to match.
Where the tap is installed on a connective netdevice, the exact matching is
not practical (though with u32, anything is possible). In those places,
there should still be some leeway -- and probably bigger than before,
because experience shows that these tests are very noisy.
To that end, change mirror_test() so that it can be either called with an
exact number to expect, or with an expression. Where leeway is needed,
adjust callers to pass a ">= 10" instead of mere 10.
Petr Machata [Thu, 27 Jun 2024 14:48:42 +0000 (16:48 +0200)]
selftests: mirror: do_test_span_dir_ips(): Install accurate taps
The mirroring selftests work by sending ICMP traffic between two hosts.
Along the way, this traffic is mirrored to a gretap netdevice, and counter
taps are then installed strategically along the path of the mirrored
traffic to verify the mirroring took place.
The problem with this is that besides mirroring the primary traffic, any
other service traffic is mirrored as well. At the same time, because the
tests need to work in HW-offloaded scenarios, the ability of the device to
do arbitrary packet inspection should not be taken for granted. Most tests
therefore simply use matchall, one uses flower to match on IP address.
As a result, the selftests are noisy, because besides the primary ICMP
traffic, any amount of other service traffic is mirrored as well.
However, often the counter tap is installed at the remote end of the gretap
tunnel. Since this is a SW-datapath scenario anyway, we can make the filter
arbitrarily accurate.
Thus in this patch, add parameters forward_type and backward_type to
several mirroring test helpers, as some other helpers already have. Then
change do_test_span_dir_ips() to instead of installing one generic tap and
using it for test in both directions, install the tap for each direction
separately, matching on the ICMP type given by these parameters.
Petr Machata [Thu, 27 Jun 2024 14:48:41 +0000 (16:48 +0200)]
selftests: mirror_gre_lag_lacp: Check counters at tunnel
The test works by sending packets through a tunnel, whence they are
forwarded to a LAG. One of the LAG children is removed from the LAG prior
to the exercise, and the test then counts how many packets pass through the
other one. The issue with this is that it counts all packets, not just the
encapsulated ones.
So instead add a second gretap endpoint to receive the sent packets, and
check reception counters there.
Petr Machata [Thu, 27 Jun 2024 14:48:38 +0000 (16:48 +0200)]
selftests: libs: Expand "$@" where possible
In some functions, argument-forwarding through "$@" without listing the
individual arguments explicitly is fundamental to the operation of a
function. E.g. xfail_on_veth() should be able to run various tests in the
fail-to-xfail regime, and usage of "$@" is appropriate as an abstraction
mechanism. For functions such as simple_if_init(), $@ is a handy way to
pass an array.
In other functions, it's merely a mechanism to save some typing, which
however ends up obscuring the real arguments and makes life hard for those
that end up reading the code.
This patch adds some of the implicit function arguments and correspondingly
expands $@'s. In several cases this will come in handy as following patches
adjust the parameter lists.
David S. Miller [Fri, 28 Jun 2024 09:49:35 +0000 (10:49 +0100)]
Merge branch 'net-flash-modees-firmware' into main
Danielle Ratson says:
====================
Add ability to flash modules' firmware
CMIS compliant modules such as QSFP-DD might be running a firmware that
can be updated in a vendor-neutral way by exchanging messages between
the host and the module as described in section 7.2.2 of revision
4.0 of the CMIS standard.
According to the CMIS standard, the firmware update process is done
using a CDB commands sequence.
CDB (Command Data Block Message Communication) reads and writes are
performed on memory map pages 9Fh-AFh according to the CMIS standard,
section 8.12 of revision 4.0.
Add a pair of new ethtool messages that allow:
* User space to trigger firmware update of transceiver modules
* The kernel to notify user space about the progress of the process
The user interface is designed to be asynchronous in order to avoid RTNL
being held for too long and to allow several modules to be updated
simultaneously. The interface is designed with CMIS compliant modules in
mind, but kept generic enough to accommodate future use cases, if these
arise.
The kernel interface that will implement the firmware update using CDB
command will include 2 layers that will be added under ethtool:
* The upper layer that will be triggered from the module layer, is
cmis_ fw_update.
* The lower one is cmis_cdb.
In the future there might be more operations to implement using CDB
commands. Therefore, the idea is to keep the cmis_cdb interface clean and
the cmis_fw_update specific to the cdb commands handling it.
The communication between the kernel and the driver will be done using
two ethtool operations that enable reading and writing the transceiver
module EEPROM.
The operation ethtool_ops::get_module_eeprom_by_page, that is already
implemented, will be used for reading from the EEPROM the CDB reply,
e.g. reading module setting, state, etc.
The operation ethtool_ops::set_module_eeprom_by_page, that is added in
the current patchset, will be used for writing to the EEPROM the CDB
command such as start firmware image, run firmware image, etc.
Therefore in order for a driver to implement module flashing, that
driver needs to implement the two functions mentioned above.
Patchset overview:
Patch #1-#2: Implement the EEPROM writing in mlxsw.
Patch #3: Define the interface between the kernel and user space.
Patch #4: Add ability to notify the flashing firmware progress.
Patch #5: Veto operations during flashing.
Patch #6: Add extended compliance codes.
Patch #7: Add the cdb layer.
Patch #8: Add the fw_update layer.
Patch #9: Add ability to flash transceiver modules' firmware.
v8:
Patch #7:
* In the ethtool_cmis_wait_for_cond() evaluate the condition once more
to decide if the error code should be -ETIMEDOUT or something else.
* s/netdev_err/netdev_err_once.
v7:
Patch #4:
* Return -ENOMEM instead of PTR_ERR(attr) on
ethnl_module_fw_flash_ntf_put_err().
Patch #9:
* Fix Warning for not unlocking the spin_lock in the error flow
on module_flash_fw_work_list_add().
* Avoid the fall-through on ethnl_sock_priv_destroy().
v6:
* Squash some of the last patch to patch #5 and patch #9.
Patch #3:
* Add paragraph in .rst file.
Patch #4:
* Reserve '1' more place on SKB for NUL terminator in
the error message string.
* Add more prints on error flow, re-write the printing
function and add ethnl_module_fw_flash_ntf_put_err().
* Change the communication method so notification will be
sent in unicast instead of multicast.
* Add new 'struct ethnl_module_fw_flash_ntf_params' that holds
the relevant info for unicast communication and use it to
send notification to the specific socket.
* s/nla_put_u64_64bit/nla_put_uint/
Patch #7:
* In ethtool_cmis_cdb_init(), Use 'const' for the 'params'
parameter.
Patch #8:
* Add a list field to struct ethtool_module_fw_flash for
module_fw_flash_work_list that will be presented in the next
patch.
* Move ethtool_cmis_fw_update() cleaning to a new function that
will be represented in the next patch.
* Move some of the fields in struct ethtool_module_fw_flash to
a separate struct, so ethtool_cmis_fw_update() will get only
the relevant parameters for it.
* Edit the relevant functions to get the relevant params for
them.
* s/CMIS_MODULE_READY_MAX_DURATION_USEC/CMIS_MODULE_READY_MAX_DURATION_MSEC
Patch #9:
* Add a paragraph in the commit message.
* Rename labels in module_flash_fw_schedule().
* Add info to genl_sk_priv_*() and implement the relevant
callbacks, in order to handle properly a scenario of closing
the socket from user space before the work item was ended.
* Add a list the holds all the ethtool_module_fw_flash struct
that corresponds to the in progress work items.
* Add a new enum for the socket types.
* Use both above to identify a flashing socket, add it to the
list and when closing socket affect only the flashing type.
* Create a new function that will get the work item instead of
ethtool_cmis_fw_update().
* Edit the relevant functions to get the relevant params for
them.
* The new function will call the old ethtool_cmis_fw_update(),
and do the cleaning, so the existence of the list should be
completely isolated in module.c.
===================
Transceiver module firmware flashing started for device swp40
Transceiver module firmware flashing in progress for device swp40
Progress: 99%
Transceiver module firmware flashing completed for device swp40
In addition, add infrastructure that allows modules to set socket-specific
private data. This ensures that when a socket is closed from user space
during the flashing process, the right socket halts sending notifications
to user space until the work item is completed.
Danielle Ratson [Thu, 27 Jun 2024 14:08:55 +0000 (17:08 +0300)]
ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB
According to the CMIS standard, the firmware update process is done using
a CDB commands sequence.
Implement a work that will be triggered from the module layer in the
next patch the will initiate and execute all the CDB commands in order, to
eventually complete the firmware update process.
This flashing process includes, writing the firmware image, running the new
firmware image and committing it after testing, so that it will run upon
reset.
This work will also notify user space about the progress of the firmware
update process.
Danielle Ratson [Thu, 27 Jun 2024 14:08:54 +0000 (17:08 +0300)]
ethtool: cmis_cdb: Add a layer for supporting CDB commands
CDB (Command Data Block Message Communication) reads and writes are
performed on memory map pages 9Fh-AFh according to the CMIS standard,
section 8.20 of revision 5.2.
Page 9Fh is used to specify the CDB command to be executed and also
provides an area for a local payload (LPL).
According to the CMIS standard, the firmware update process is done using
a CDB commands sequence that will be implemented in the next patch.
The kernel interface that will implement the firmware update using CDB
command will include 2 layers that will be added under ethtool:
* The upper layer that will be triggered from the module layer, is
cmis_fw_update.
* The lower one is cmis_cdb.
In the future there might be more operations to implement using CDB
commands. Therefore, the idea is to keep the CDB interface clean and the
cmis_fw_update specific to the CDB commands handling it.
These two layers will communicate using the API the consists of three
functions:
Add the CDB layer to support initializing, finishing and executing CDB
commands:
* The initialization process will include creating of an ethtool_cmis_cdb
instance, querying the module CDB support, entering and validating the
password from user space (CMD 0x0000) and querying the module features
(CMD 0x0040).
* The finishing API will simply free the ethtool_cmis_cdb instance.
* The executing process will write the CDB command to EEPROM using
set_module_eeprom_by_page() that was presented earlier, and will
process the reply from EEPROM.
Danielle Ratson [Thu, 27 Jun 2024 14:08:52 +0000 (17:08 +0300)]
ethtool: Veto some operations during firmware flashing process
Some operations cannot be performed during the firmware flashing
process.
For example:
- Port must be down during the whole flashing process to avoid packet loss
while committing reset for example.
- Writing to EEPROM interrupts the flashing process, so operations like
ethtool dump, module reset, get and set power mode should be vetoed.
- Split port firmware flashing should be vetoed.
In order to veto those scenarios, add a flag in 'struct net_device' that
indicates when a firmware flash is taking place on the module and use it
to prevent interruptions during the process.
Danielle Ratson [Thu, 27 Jun 2024 14:08:50 +0000 (17:08 +0300)]
ethtool: Add an interface for flashing transceiver modules' firmware
CMIS compliant modules such as QSFP-DD might be running a firmware that
can be updated in a vendor-neutral way by exchanging messages between
the host and the module as described in section 7.3.1 of revision 5.2 of
the CMIS standard.
Add a pair of new ethtool messages that allow:
* User space to trigger firmware update of transceiver modules
* The kernel to notify user space about the progress of the process
The user interface is designed to be asynchronous in order to avoid
RTNL being held for too long and to allow several modules to be
updated simultaneously. The interface is designed with CMIS compliant
modules in mind, but kept generic enough to accommodate future use
cases, if these arise.
Ido Schimmel [Thu, 27 Jun 2024 14:08:49 +0000 (17:08 +0300)]
mlxsw: Implement ethtool operation to write to a transceiver module EEPROM
Implement the ethtool_ops::set_module_eeprom_by_page operation to allow
ethtool to write to a transceiver module EEPROM, in a similar fashion to
the ethtool_ops::get_module_eeprom_by_page operation.
Ido Schimmel [Thu, 27 Jun 2024 14:08:48 +0000 (17:08 +0300)]
ethtool: Add ethtool operation to write to a transceiver module EEPROM
Ethtool can already retrieve information from a transceiver module
EEPROM by invoking the ethtool_ops::get_module_eeprom_by_page operation.
Add a corresponding operation that allows ethtool to write to a
transceiver module EEPROM.
The new write operation is purely an in-kernel API and is not exposed to
user space.
The purpose of this operation is not to enable arbitrary read / write
access, but to allow the kernel to write to specific addresses as part
of transceiver module firmware flashing. In the future, more
functionality can be implemented on top of these read / write
operations.
Adjust the comments of the 'ethtool_module_eeprom' structure as it is
no longer used only for read access.
Neal Cardwell [Thu, 27 Jun 2024 02:42:27 +0000 (22:42 -0400)]
UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
In some production workloads we noticed that connections could
sometimes close extremely prematurely with ETIMEDOUT after
transmitting only 1 TLP and RTO retransmission (when we would normally
expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection
closes with ETIMEDOUT).
From tracing we determined that these workloads can suffer from a
scenario where in fast recovery, after some retransmits, a DSACK undo
can happen at a point where the scoreboard is totally clear (we have
retrans_out == sacked_out == lost_out == 0). In such cases, calling
tcp_try_keep_open() means that we do not execute any code path that
clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can
remain erroneously set to the start time of the undone fast recovery,
even after the fast recovery is undone. If minutes or hours elapse,
and then a TLP/RTO/RTO sequence occurs, then the start_ts value in
retransmits_timed_out() (which is from tp->retrans_stamp) will be
erroneously ancient (left over from the fast recovery undone via
DSACKs). Thus this ancient tp->retrans_stamp value can cause the
connection to die very prematurely with ETIMEDOUT via
tcp_write_err().
The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to
call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures
that if no retransmits are in flight at the time of DSACK undo in fast
recovery then we properly zero retrans_stamp. Note that calling
tcp_try_to_open() is more consistent with other loss recovery
behavior, since normal fast recovery (CA_Recovery) and RTO recovery
(CA_Loss) both normally end when tp->snd_una meets or exceeds
tp->high_seq and then in tcp_fastretrans_alert() the "default" switch
case executes tcp_try_to_open(). Also note that by inspection this
change to call tcp_try_to_open() implies at least one other nice bug
fix, where now an ECE-marked DSACK that causes an undo will properly
invoke tcp_enter_cwr() rather than ignoring the ECE mark.
Fix the definition of BSS_CHANGED_UNSOL_BCAST_PROBE_RESP so that
not all higher bits get set, 1<<31 is a signed variable, so when
we do
u64 changed = BSS_CHANGED_UNSOL_BCAST_PROBE_RESP;
we get sign expansion, so the value is 0xffff'ffff'8000'0000 and
that's clearly not desired. Use BIT_ULL() to make it unsigned as
well as the right type for the change flags.