Arnd Bergmann [Tue, 28 May 2024 12:06:30 +0000 (14:06 +0200)]
i2c: viai2c: turn common code into a proper module
The i2c-viai2c-common.c file is used by two drivers, but is not a proper
abstraction and can get linked into both modules in the same configuration,
which results in a warning:
scripts/Makefile.build:236: drivers/i2c/busses/Makefile: i2c-viai2c-common.o is added to multiple modules: i2c-wmt i2c-zhaoxin
The other problems with this include the incorrect use of a __weak function
when both are built-in, and the fact that the "common" module is sprinkled
with 'if (i2c->plat == ...)' checks that have knowledge about the differences
between the drivers using it.
Avoid the link time warning by making the common driver a proper module
with MODULE_LICENSE()/MODULE_AUTHOR() tags, and remove the __weak function
by slightly rearranging the code.
This adds a little more duplication between the two main drivers, but
those versions get more readable in the process.
David Howells [Tue, 25 Jun 2024 12:29:06 +0000 (13:29 +0100)]
netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has
taken the folio lock (as filemap_page_mkwrite() does). Without this,
generic/247 occasionally oopses with something like the following:
David Howells [Thu, 6 Jun 2024 10:10:44 +0000 (11:10 +0100)]
netfs: Delete some xarray-wangling functions that aren't used
Delete some xarray-based buffer wangling functions that are intended for
use with bounce buffering, but aren't used because bounce-buffering got
deferred to a later patch series. Now, however, the intention is to use
something other than an xarray to do this.
David Howells [Wed, 5 Jun 2024 21:18:04 +0000 (22:18 +0100)]
netfs: Fix early issue of write op on partial write to folio tail
During the writeback procedure, at the end of netfs_write_folio(), pending
write operations are flushed if the amount of write-streaming data stored
in a page is less than the size of the folio because if we haven't modified
a folio to the end, it cannot be contiguous with the following folio...
except if the dirty region of the folio is right at the end of the folio
space.
Fix the test to take the offset into the folio into account as well, such
that if the dirty region runs right up to the end of the folio, we leave
the flushing for later.
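A minimal sketch of the changed test, written as a toy model rather than the kernel code. The names `foff`, `flen` and `fsize` (offset of the dirty region in the folio, length of the dirty region, folio size) are assumptions chosen for illustration.

```python
def must_flush_old(flen: int, fsize: int) -> bool:
    """Old test: any folio that is not fully dirty forces a flush."""
    return flen < fsize


def must_flush_new(foff: int, flen: int, fsize: int) -> bool:
    """New test: flush only if the dirty region stops short of the folio end."""
    return foff + flen < fsize
```

With a 4096-byte folio whose dirty region covers bytes 1024 to 4095, the old test asks for a flush while the new one leaves the flushing for later, because the region can still be contiguous with the next folio.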
Sandeep Dhavale [Mon, 24 Jun 2024 22:02:05 +0000 (15:02 -0700)]
erofs: fix possible memory leak in z_erofs_gbuf_exit()
Because we incorrectly reused the variable `i` in `z_erofs_gbuf_exit()`
for the inner loop, we may exit early from the outer loop, resulting in a
memory leak. Fix this by using a separate variable to iterate through the
inner loop.
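The effect of sharing one index between nested loops can be shown with a toy model. This is only a sketch: the function names and the `page_counts` list (pages held by each buffer) are hypothetical, and `while` loops are used to mimic C-style index handling.

```python
def free_with_shared_index(page_counts):
    """Buggy shape: the inner loop reuses the outer loop's index `i`."""
    freed = []
    i = 0
    while i < len(page_counts):
        buf = i
        i = 0                        # inner loop clobbers the outer index
        while i < page_counts[buf]:
            i += 1                   # stand-in for releasing one page
        freed.append(buf)
        i += 1                       # outer increment continues from the page count
    return freed


def free_with_separate_index(page_counts):
    """Fixed shape: the inner loop has its own index `j`."""
    freed = []
    i = 0
    while i < len(page_counts):
        j = 0
        while j < page_counts[i]:
            j += 1                   # stand-in for releasing one page
        freed.append(i)
        i += 1
    return freed
```

With three buffers of four pages each, the shared-index version releases only buffer 0 and then leaves the outer loop, so buffers 1 and 2 are never freed. The example input is chosen so that the page count exceeds the buffer count, which is the early-exit case the commit describes.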
Darrick J. Wong [Wed, 19 Jun 2024 17:32:46 +0000 (10:32 -0700)]
xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
xfs_init_new_inode ignores the init_xattrs parameter for filesystems
that do not have ATTR enabled. As a result, the first init_xattrs file
to be created by the kernel will not have an attr fork created to store
acls. Storing that first acl will add ATTR to the superblock flags, so
subsequent files will be created with attr forks. The overhead of this
is so small that chances are that nobody has noticed this behavior.
However, this is disastrous on a filesystem with parent pointers because
it requires that a new linkable file /must/ have a pre-existing attr
fork, and the parent pointers code uses init_xattrs to create that fork.
The preproduction version of mkfs.xfs used to set this, but the V5 sb
verifier only requires ATTR2, not ATTR. There is no guard for
filesystems with (PARENT && !ATTR).
It turns out that I misunderstood the two flags -- ATTR means that we at
some point created an attr fork to store xattrs in a file; ATTR2
apparently means only that inodes have dynamic fork offsets or that the
filesystem was mounted with the "attr2" option.
Fixes: 2442ee15bb1e ("xfs: eager inode attr fork init needs attr feature awareness")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
Darrick J. Wong [Wed, 19 Jun 2024 17:32:45 +0000 (10:32 -0700)]
xfs: allow unlinked symlinks and dirs with zero size
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd876 changed the inode verifier to
prohibit zero-length symlinks and directories. If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.
Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
Fixes: 3c6f46eacd876 ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
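The relaxed rule for zero-size symlinks and directories can be sketched as a single predicate. This is a toy model, not the inode verifier: the function name and parameters are hypothetical, and every other check the verifier performs is left out.

```python
def di_size_ok(is_dir_or_symlink: bool, di_size: int, nlink: int) -> bool:
    """Toy size check: zero-size dirs/symlinks pass only when unlinked."""
    if is_dir_or_symlink and di_size == 0:
        # An inode being inactivated has nlink == 0 and may already have
        # had its size zeroed, so it must not be rejected.
        return nlink == 0
    return True
```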
Darrick J. Wong [Wed, 19 Jun 2024 17:32:44 +0000 (10:32 -0700)]
xfs: restrict when we try to align cow fork delalloc to cowextsz hints
xfs/205 produces the following failure when always_cow is enabled:
--- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800
+++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700
@@ -1,4 +1,5 @@
QA output created by 205
*** one file
+ !!! disk full (expected)
*** one file, a few bytes at a time
*** done
This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint. Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.
We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation. There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.
There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space. Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.
Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry. This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
I probably should have caught this six years ago when 6ca30729c206d was
being reviewed, but oh well. Update the comments to reflect what the
code does now.
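A rough sketch of the gating described above, assuming a simplified model in which alignment only rounds the request length up to a multiple of the hint (the real code also aligns the start of the range). The function and parameter names are hypothetical.

```python
def delalloc_blocks(want: int, cowextsz: int,
                    extending_write: bool, enospc_retry: bool) -> int:
    """Return how many blocks to reserve for a request of `want` blocks."""
    if extending_write or enospc_retry or cowextsz <= 1:
        # No expansion: reserve exactly what the write needs.
        return want
    # Round up to the next multiple of the CoW extent size hint.
    return -(-want // cowextsz) * cowextsz
```

In this model a 1-block request with a 32-block hint grows to 32 blocks for a non-extending write, but stays at 1 block on an ENOSPC retry, so the write can still land when space is tight.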
xfs: fix freeing speculative preallocations for preallocated files
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks. This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.
Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.
Zijun Hu [Thu, 13 Jun 2024 14:04:36 +0000 (22:04 +0800)]
net: rfkill: Correct return value in invalid parameter case
rfkill_set_hw_state_reason() does not return the current combined
block state when its parameter @reason is invalid, which is wrong
according to its comments. Fix it by correcting the returned value.
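The intended contract can be sketched with a toy model. The bit values, the mask of valid reasons and the function signature below are assumptions made for illustration; they are not the rfkill API.

```python
BLOCK_HW = 0x1        # hypothetical state bit: hard block
BLOCK_SW = 0x2        # hypothetical state bit: soft block
VALID_REASONS = 0x3   # hypothetical mask of accepted reason bits


def set_hw_state_reason(state: int, blocked: bool, reason: int):
    """Return (new_state, combined_block_state)."""
    if reason & ~VALID_REASONS:
        # Invalid reason: leave the state alone, but still report the
        # current combined block state.
        return state, bool(state & (BLOCK_HW | BLOCK_SW))
    if blocked:
        state |= BLOCK_HW
    else:
        state &= ~BLOCK_HW
    return state, bool(state & (BLOCK_HW | BLOCK_SW))
```

The point of the sketch is that the second return value always describes the combined state, on the error path as well as the normal one.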
Zong-Zhe Yang [Mon, 17 Jun 2024 11:52:17 +0000 (19:52 +0800)]
wifi: mac80211: fix NULL dereference at band check in starting tx ba session
In an MLD connection, link_data/link_conf are dynamically allocated. They
don't point to vif->bss_conf. So, there will be no chanreq assigned to
vif->bss_conf and then the chan will be NULL. Tweak the code to check
ht_supported/vht_supported/has_he/has_eht on sta deflink.
Crash log (with rtw89 version under MLO development):
[ 9890.526087] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 9890.526102] #PF: supervisor read access in kernel mode
[ 9890.526105] #PF: error_code(0x0000) - not-present page
[ 9890.526109] PGD 0 P4D 0
[ 9890.526114] Oops: 0000 [#1] PREEMPT SMP PTI
[ 9890.526119] CPU: 2 PID: 6367 Comm: kworker/u16:2 Kdump: loaded Tainted: G OE 6.9.0 #1
[ 9890.526123] Hardware name: LENOVO 2356AD1/2356AD1, BIOS G7ETB3WW (2.73 ) 11/28/2018
[ 9890.526126] Workqueue: phy2 rtw89_core_ba_work [rtw89_core]
[ 9890.526203] RIP: 0010:ieee80211_start_tx_ba_session (net/mac80211/agg-tx.c:618 (discriminator 1)) mac80211
[ 9890.526279] Code: f7 e8 d5 93 3e ea 48 83 c4 28 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 49 8b 84 24 e0 f1 ff ff 48 8b 80 90 1b 00 00 <83> 38 03 0f 84 37 fe ff ff bb ea ff ff ff eb cc 49 8b 84 24 10 f3
All code
========
0: f7 e8 imul %eax
2: d5 (bad)
3: 93 xchg %eax,%ebx
4: 3e ea ds (bad)
6: 48 83 c4 28 add $0x28,%rsp
a: 89 d8 mov %ebx,%eax
c: 5b pop %rbx
d: 41 5c pop %r12
f: 41 5d pop %r13
11: 41 5e pop %r14
13: 41 5f pop %r15
15: 5d pop %rbp
16: c3 retq
17: cc int3
18: cc int3
19: cc int3
1a: cc int3
1b: 49 8b 84 24 e0 f1 ff mov -0xe20(%r12),%rax
22: ff
23: 48 8b 80 90 1b 00 00 mov 0x1b90(%rax),%rax
2a:* 83 38 03 cmpl $0x3,(%rax) <-- trapping instruction
2d: 0f 84 37 fe ff ff je 0xfffffffffffffe6a
33: bb ea ff ff ff mov $0xffffffea,%ebx
38: eb cc jmp 0x6
3a: 49 rex.WB
3b: 8b .byte 0x8b
3c: 84 24 10 test %ah,(%rax,%rdx,1)
3f: f3 repz
Johannes Berg [Tue, 25 Jun 2024 16:51:14 +0000 (19:51 +0300)]
wifi: iwlwifi: trans: make bad state warnings
Kalle reported that this triggers very occasionally, but
we don't even know which place, except that it wasn't one
with a warning. Make all of them warnings since this is
really not meant to happen and indicates driver bugs.
Johannes Berg [Tue, 25 Jun 2024 16:51:09 +0000 (19:51 +0300)]
wifi: iwlwifi: mvm: use IWL_FW_CHECK for link ID check
The lookup function iwl_mvm_rcu_fw_link_id_to_link_conf() is
normally called with input from the firmware, so it should use
IWL_FW_CHECK() instead of WARN_ON().
Johannes Berg [Tue, 25 Jun 2024 16:51:08 +0000 (19:51 +0300)]
wifi: iwlwifi: mvm: don't flush BSSes on restart with MLD API
If the firmware has MLD APIs, it will handle all timing and we
don't need to give it timestamps. Therefore, we don't care about
the timestamps stored in the BSS table, so there's no need to
flush the BSS table.
Golan Ben Ami [Thu, 13 Jun 2024 14:11:23 +0000 (17:11 +0300)]
wifi: iwlwifi: remove AX101, AX201 and AX203 support from LNL
LNL is the codename for the upcoming Series 2 Core Ultra
processors designed by Intel. AX101, AX201 and AX203 devices
are not shipped on LNL platforms, so don't support them.
Johannes Berg [Tue, 18 Jun 2024 16:59:45 +0000 (19:59 +0300)]
wifi: iwlwifi: mvm: don't limit VLP/AFC to UATS-enabled
When UATS isn't enabled (no VLP/AFC AP support), we need to still
set the right bits in the channel/regulatory flags, so remove the
uats_enabled argument to the parsing etc.
Also, firmware deals just fine with getting the UATS table if it
supports the command even if the bits aren't set, so always send
it, since it's also needed if BIT(31) is set, but the driver need
not have any knowledge of that. Remove 'uats_enabled' entirely.
This isn't related to whether or not "fw can be loaded",
but rather requesting that ME go into a state where doing
a product reset is safe. This is related to FW load only
in the specific case of where it's used today in iwlmvm,
notably when it's known that the firmware itself will (or
at least may) do a product reset during load.
Clarify the documentation.
I was tempted to rename things too, but on the ME side it
really is also called PLDR (which is a Windows term and
may not even match the complete behaviour since doing a
full product reset from the driver also requires calling
an ACPI method first.) So keep the name aligned with ME.
Johannes Berg [Tue, 18 Jun 2024 16:44:11 +0000 (19:44 +0300)]
wifi: iwlwifi: mvm: rename 'pldr_sync'
PLDR (product level device reset) is a Windows term, and
is something the driver triggers there, AFAICT.
Really what 'pldr_sync' here wants to capture is whether
or not the firmware will/may do a product reset during
initialization, which makes the device drop off the bus,
requiring a rescan. If this is the case, obviously the
init will fail/time out, so we don't want to report all
kinds of errors etc., hence this tracking variable.
Rename it to 'fw_product_reset' to capture the meaning
better.
Miri Korenblit [Tue, 18 Jun 2024 16:44:05 +0000 (19:44 +0300)]
wifi: iwlwifi: trans: remove unused function parameter
iwl_trans_pcie_gen2_fw_alive doesn't use the scd_addr parameter;
it was there only because we needed the function to have the same
prototype as the iwl_trans_ops::fw_alive callback.
Now that the ops is removed, there is no reason to keep the parameter.
This allows suspending and resuming the system without resetting the
firmware, which reduces the resume time.
In case the fast_resume fails, stop the device and bring it up from
scratch.
Raise the timeout for the D3_END notification since in some iterations,
it took 240ms.
Johannes Berg [Tue, 18 Jun 2024 17:03:02 +0000 (20:03 +0300)]
wifi: iwlwifi: mvm: unify and fix interface combinations
AP interfaces fundamentally cannot leave the channel, so multi-
channel operation with them isn't really possible. We shouldn't
advertise support for such, at least not as long as we don't
have full multi-radio support. Thus, remove the AP bit from the
interface combinations for two channels and add another set for
just one channel that has it.
Also, to avoid duplicating everything even more, unify the NAN
and non-NAN cases.
wifi: iwlwifi: pcie: fix a few legacy register accesses for new devices
Do not access legacy bits on new devices; doing so has no effect.
Somehow, wowlan worked despite the usage of the wrong bits. Now
that we want to keep the firmware loaded during suspend even without
wowlan, this change is needed.
Johannes Berg [Wed, 26 Jun 2024 07:15:59 +0000 (09:15 +0200)]
wifi: mac80211: disable softirqs for queued frame handling
As noticed by syzbot, calling ieee80211_handle_queued_frames()
(and actually handling frames there) requires softirqs to be
disabled, since we call into the RX code. Fix that in the case
of cleaning up frames left over during shutdown.
Johannes Berg [Wed, 12 Jun 2024 12:38:10 +0000 (14:38 +0200)]
wifi: mac80211: check SSID in beacon
Check that the SSID in beacons is correct, if it's not hidden
and beacon protection is enabled (otherwise there's no value).
If it doesn't match, disconnect.
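A minimal sketch of the decision, under the assumption that a hidden SSID appears in the beacon as an empty or all-zero SSID element. The function name and parameters are hypothetical and do not correspond to mac80211 symbols.

```python
def beacon_ssid_mismatch(beacon_ssid: bytes, assoc_ssid: bytes,
                         beacon_protection: bool) -> bool:
    """Return True when the connection should be dropped."""
    if not beacon_protection:
        # Without beacon protection the beacon content is not trusted,
        # so the comparison has no value.
        return False
    hidden = len(beacon_ssid) == 0 or not any(beacon_ssid)
    if hidden:
        return False
    return beacon_ssid != assoc_ssid
```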
Johannes Berg [Wed, 12 Jun 2024 12:35:57 +0000 (14:35 +0200)]
wifi: mac80211: correctly limit wider BW TDLS STAs
When updating a channel context, the code can apply wider
bandwidth TDLS STA channel definitions to each and every
channel context used by the device, an approach that will
surely lead to problems if there is ever more than one.
Restrict the wider BW TDLS STA consideration to only TDLS
STAs that are actually related to links using the channel
context being updated.
Johannes Berg [Wed, 12 Jun 2024 12:32:06 +0000 (14:32 +0200)]
wifi: mac80211: update STA/chandef width during switch
In channel switch without an additional channel context,
where the reassign logic kicks in, we also need to update
the station bandwidth and chandef minimum width correctly
to avoid having station rate control configured to wider
bandwidth than the channel context. Do that now.
Johannes Berg [Wed, 12 Jun 2024 12:32:05 +0000 (14:32 +0200)]
wifi: mac80211: make ieee80211_chan_bw_change() able to use reserved
Make ieee80211_chan_bw_change() able to use the reserved chanreq
(really the chandef part of it) for the calculations, so it can
be used _without_ applying the changes first. Remove the comment
that indicates this is required, since it no longer is. However,
this capability only gets used later.
Also, this is not ideal, we really should not differentiate so much
between reserved and non-reserved usage, to simplify. That's a
further cleanup later though.
Johannes Berg [Wed, 12 Jun 2024 12:32:03 +0000 (14:32 +0200)]
wifi: mac80211: optionally pass chandef to ieee80211_sta_cap_rx_bw()
We'll need this function to take a new chandef in
(some) channel switching cases, so prepare for that
by allowing that to be passed and using it if so.
Clean up the code a little bit while at it.
Johannes Berg [Wed, 12 Jun 2024 12:28:37 +0000 (14:28 +0200)]
wifi: mac80211: handle protected dual of public action
The code currently handles ECSA (extended channel switch
announcement) public action frames. Handle also their
protected dual, which actually is protected.
Johannes Berg [Wed, 12 Jun 2024 12:28:36 +0000 (14:28 +0200)]
wifi: mac80211: restrict public action ECSA frame handling
Public action extended channel switch announcement (ECSA)
frames cannot be protected well, the spec is unclear about
what should happen in the presence of stations that can
receive protected dual and stations that cannot.
Mitigate these issues by not treating public action frames
as the absolute truth, only treat them as a hint to stop
transmitting (quiet mode), and do the remainder of the CSA
handling only when receiving the next beacon (or protected
action frame) that contains the CSA; or, if it doesn't,
simply stop being quiet and continue operating normally.
This limits the exposure to malicious ECSA public action
frames, since they cannot cause a disconnect now, only a
short interruption in traffic.
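The mitigation can be pictured as a small state machine. This is a sketch under simplifying assumptions: the class, its attributes and its method names are hypothetical, and timers and the details of CSA processing are left out.

```python
class CsaState:
    """Toy model of hint-only handling for public action ECSA frames."""

    def __init__(self):
        self.quiet = False       # transmissions stopped because of a hint
        self.switching = False   # real CSA handling is in progress

    def on_public_ecsa(self):
        # Unprotected frame: treat it as a hint to stop transmitting only.
        self.quiet = True

    def on_beacon_or_protected_frame(self, has_csa: bool):
        if has_csa:
            # Confirmed by a beacon or protected frame: do the real work.
            self.switching = True
        else:
            # Hint was not confirmed: resume normal operation.
            self.quiet = False
```

A forged public action frame followed by a beacon without a CSA leaves the station with `quiet` and `switching` both false, which matches the "short interruption in traffic" described above.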
Takashi Iwai [Tue, 25 Jun 2024 15:52:12 +0000 (17:52 +0200)]
ALSA: hda/realtek: Fix conflicting quirk for PCI SSID 17aa:3820
The recent fix for Lenovo IdeaPad 330-17IKB replaced the quirk entry,
and this eventually breaks the existing quirk for Lenovo Yoga Duet 7
13ITL6 equipped with the same PCI SSID 17aa:3820.
For applying a proper quirk for each model, check the codec SSID
additionally. Fortunately Yoga Duet has a different codec SSID,
0x17aa3802.
(Interestingly, 17aa:3802 has another conflict of SSID between another
Yoga model vs 14IRP8 which we had to work around similarly.)
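The two-level match can be sketched as follows. The quirk labels and the function are hypothetical; only the PCI SSID 17aa:3820 and the Yoga Duet codec SSID 0x17aa3802 come from the commit message.

```python
IDEAPAD_330_17IKB = "ideapad-330-17ikb"   # hypothetical quirk labels
YOGA_DUET_7 = "yoga-duet-7-13itl6"


def pick_quirk(pci_ssid: int, codec_ssid: int):
    """Pick a quirk by PCI SSID, then disambiguate by codec SSID."""
    if pci_ssid == 0x17AA3820:
        if codec_ssid == 0x17AA3802:
            return YOGA_DUET_7
        return IDEAPAD_330_17IKB
    return None
```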
The target debugfs entry is removed via async_schedule(), which isn't
drained when a target with the same name is added, so the failure
"Directory 'target11:0:0' with parent 'scsi_debug' already present!" can
be triggered easily.
Fix it by switching to domain async schedule, and draining it before
adding the new target debugfs entry.
====================
add ethernet driver for Tehuti Networks TN40xx chips
This patchset adds a new 10G ethernet driver for Tehuti Networks
TN40xx chips. Note in mainline, there is a driver for Tehuti Networks
(drivers/net/ethernet/tehuti/tehuti.[hc]), which supports TN30xx
chips.
Multiple vendors (DLink, Asus, Edimax, QNAP, etc) developed adapters
based on TN40xx chips. Tehuti Networks went out of business but the
drivers are still distributed under GPL2 with some of the hardware
(and also available on some sites). With some changes, I try to
upstream this driver with a new PHY driver in Rust.
The major change is replacing the PHY abstraction layer in the original
driver with phylink. TN40xx chips are used with various PHY hardware
(AMCC QT2025, TI TLK10232, Aqrate AQR105, and Marvell MV88X3120,
MV88X3310, and MV88E2010).
I've also been working on a new PHY driver for QT2025 in Rust [1]. For
now, I enable only adapters using QT2025 PHY in the PCI ID table of
this driver. I've tested this driver and the QT2025 PHY driver with
Edimax EN-9320 10G adapter and 10G-SR SFP+. In mainline, there are PHY
drivers for AQR105 and Marvell PHYs, which could work for some TN40xx
adapters with this driver.
To make reviewing easier, this patchset has only basic functions. Once
merged, I'll submit features like ethtool support.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:07 +0000 (08:55 +0900)]
net: tn40xx: add phylink support
This patch adds support for multiple PHY hardware with phylink. The
adapters with TN40xx chips use multiple PHY hardware: AMCC QT2025, TI
TLK10232, Aqrate AQR105, and Marvell 88X3120, 88X3310, and MV88E2010.
For now, the PCI ID table of this driver enables adapters using only
QT2025 PHY. I've tested this driver and the QT2025 PHY driver (SFP+
10G SR) with Edimax EN-9320 10G adapter.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:05 +0000 (08:55 +0900)]
net: tn40xx: add basic Rx handling
This patch adds basic Rx handling. The Rx logic uses three major data
structures: two ring buffers shared with the NIC and one database. One
ring buffer is used to tell the NIC which memory to store received
packets in. The other is used to get information from the NIC about
received packets. The database is used to keep the information about
DMA mapping. After a packet arrives, the db is used to pass the packet
to the network stack.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:04 +0000 (08:55 +0900)]
net: tn40xx: add basic Tx handling
This patch adds device specific structures to initialize the hardware
with basic Tx handling. The original driver loads the embedded
firmware in the header file. This driver is implemented to use the
firmware APIs.
The Tx logic uses three major data structures: two ring buffers shared
with the NIC and one database. One ring buffer is used to send
information to the NIC about packets to be sent. The other is used to
get information from the NIC about packets that have been sent. The
database is used to keep the information about DMA mapping. After a
packet is sent, the db is used to free the resources used for the
packet.
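The interplay of the two rings and the database can be sketched with a toy model. Everything here is an assumption made for illustration: the class and method names, in-order completion, and the use of plain integers as stand-ins for DMA addresses.

```python
from collections import deque


class TxPath:
    """Toy Tx path: a submit ring, a completion ring and a mapping db."""

    def __init__(self):
        self.to_nic = deque()     # ring 1: descriptors for packets to send
        self.from_nic = deque()   # ring 2: completions reported by the NIC
        self.db = deque()         # DMA mappings, in submission order

    def send(self, packet: bytes, dma_addr: int) -> None:
        self.db.append(dma_addr)                     # remember the mapping
        self.to_nic.append((dma_addr, len(packet)))  # tell the NIC

    def nic_transmit_one(self) -> None:
        """Stand-in for the hardware consuming one descriptor."""
        dma_addr, _length = self.to_nic.popleft()
        self.from_nic.append(dma_addr)

    def reap_completions(self) -> list:
        """Release the mapping of every packet the NIC reports as sent."""
        released = []
        while self.from_nic:
            self.from_nic.popleft()
            released.append(self.db.popleft())
        return released
```

After two `send()` calls and one `nic_transmit_one()`, `reap_completions()` releases only the first mapping; the second stays in the db until the NIC reports it.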
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:01 +0000 (08:55 +0900)]
PCI: Add Edimax Vendor ID to pci_ids.h
Add the Edimax Vendor ID (0x1432) for an ethernet driver for Tehuti
Networks TN40xx chips. This ID can be used for Realtek 8180 and Ralink
rt28xx wireless drivers.
Jakub Kicinski [Wed, 26 Jun 2024 00:48:35 +0000 (17:48 -0700)]
Merge branch 'gve-add-flow-steering-support'
Ziwei Xiao says:
====================
gve: Add flow steering support
To support flow steering in the GVE driver, two adminq changes need to
be made in advance.
The first one is adding adminq mutex lock, which is to allow the
incoming flow steering operations to be able to temporarily drop the
rtnl_lock to reduce the latency for registering flow rules among
several NICs at the same time. This could be achieved by the future
changes to reduce the drivers' dependencies on the rtnl lock for
particular ethtool ops.
The second one is to add the extended adminq command so that we can
support larger adminq command such as configure_flow_rule command. In
that patch, there is a newly added function called
gve_adminq_execute_extended_cmd with the attribute of __maybe_unused.
That attribute will be removed in the third patch of this series where
it will use the previously unused function.
And the other three patches are needed for the actual flow steering
feature support in driver.
====================
Jeroen de Borst [Tue, 25 Jun 2024 00:12:31 +0000 (00:12 +0000)]
gve: Add flow steering ethtool support
Implement the ethtool commands that can be used to configure and query
flow-steering rules.
A large part of this change consists of translating the ethtool
representation of 'ntuples' to our internal gve_flow_rule and vice-versa
in the newly created gve_flow_rule.c.
Considering the possibly large number of flow rules, the driver doesn't
store all the rules locally. When the user runs 'ethtool -n <nic>' to
check the registered rules, the driver sends an adminq command to
query a limited number of rules/rule ids (as many as fit in 4096 bytes
of dma memory) at a time as a cache for the ethtool queries. The adminq
query commands are repeated until ethtool has queried all the needed
rules.
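The repeated query can be sketched as a paging loop. This is a toy model: `query_rule_ids()` stands in for one adminq query, the rule ids are assumed to be 4 bytes each so that 1024 fit in the 4096-byte buffer, and all names are hypothetical.

```python
RULE_IDS_PER_QUERY = 4096 // 4   # assumes 4-byte rule ids in a 4096-byte buffer


def query_rule_ids(device_ids, start):
    """Stand-in for one adminq query: the next batch of ids >= start."""
    batch = [rule_id for rule_id in sorted(device_ids) if rule_id >= start]
    return batch[:RULE_IDS_PER_QUERY]


def all_rule_ids(device_ids):
    """Refill the cache repeatedly until every rule id has been seen."""
    collected, start = [], 0
    while True:
        batch = query_rule_ids(device_ids, start)
        if not batch:
            break
        collected.extend(batch)
        start = batch[-1] + 1    # continue after the last id returned
    return collected
```

For a device holding 2500 rules, the loop issues three queries that return data and one that comes back empty.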
Jeroen de Borst [Tue, 25 Jun 2024 00:12:30 +0000 (00:12 +0000)]
gve: Add flow steering adminq commands
Add new adminq commands for the driver to configure and query flow rules
that are stored in the device. Flow steering rules are assigned with a
location that determines the relative order of the rules.
The number of flow rules can run into the millions. In such cases,
storing a full copy of the rules in the driver to prepare for the
ethtool query is infeasible, and querying them from the device is
better. That needs to be optimized too so that we don't send a lot of
adminq commands. The solution here is to store a limited number of
rules/rule ids in the driver in a cache. Use dma_pool to allocate 4k
bytes, which lets the device write at most 46 flow rules (4096/88) or
1024 rule ids (4096/4) at a time.
For configuring flow rules, there are 3 sub-commands:
- ADD which adds a rule at the location supplied
- DEL which deletes the rule at the location supplied
- RESET which clears all currently active rules in the device
For querying flow rules, there are also 3 sub-commands:
- QUERY_RULES corresponds to ETHTOOL_GRXCLSRULE. It fills the rules in
the allocated cache after querying the device
- QUERY_RULES_IDS corresponds to ETHTOOL_GRXCLSRLALL. It fills the
rule_ids in the allocated cache after querying the device
- QUERY_RULES_STATS corresponds to ETHTOOL_GRXCLSRLCNT. It queries the
device's current flow rule number and the supported max flow rule
limit
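The cache capacities quoted above follow from integer division of the buffer size by the entry size. The constant names below are made up for the example; the sizes are the ones given in the commit message.

```python
CACHE_BYTES = 4096       # size of the dma_pool allocation
FLOW_RULE_BYTES = 88     # size of one queried flow rule
RULE_ID_BYTES = 4        # size of one rule id

max_rules = CACHE_BYTES // FLOW_RULE_BYTES          # whole rules that fit
max_rule_ids = CACHE_BYTES // RULE_ID_BYTES         # whole rule ids that fit
unused_bytes = CACHE_BYTES - max_rules * FLOW_RULE_BYTES
```

Here `max_rules` is 46 and `max_rule_ids` is 1024; a batch of 46 rules leaves 48 bytes of the buffer unused.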
Jeroen de Borst [Tue, 25 Jun 2024 00:12:29 +0000 (00:12 +0000)]
gve: Add flow steering device option
Add a new device option to signal to the driver that the device supports
flow steering. This device option also carries the maximum number of
flow steering rules that the device can store.
Jeroen de Borst [Tue, 25 Jun 2024 00:12:28 +0000 (00:12 +0000)]
gve: Add adminq extended command
The adminq command is limited to 64 bytes per entry, of which at most
56 bytes are available for the command itself. To support larger
commands, we need to dma_alloc a separate memory region, put the
command in that memory, and send the dma memory address instead of the
actual command.
Introduce an extended adminq command to wrap the real command with the
inner opcode and the allocated dma memory address specified. Once the
device receives it, it can get the real command from the given dma
memory address. As designed with the device, all the extended commands
will use inner opcode larger than 0xFF.
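A sketch of the wrapping decision, assuming a simplified model in which an entry is a dictionary. The field names and the `build_entry()` helper are hypothetical; only the 64-byte and 56-byte limits and the "inner opcode larger than 0xFF" rule come from the commit message.

```python
ADMINQ_ENTRY_BYTES = 64            # size of one adminq entry
MAX_INLINE_CMD_BYTES = 56          # room for the command itself
MIN_EXTENDED_INNER_OPCODE = 0x100  # extended inner opcodes are above 0xFF


def build_entry(opcode: int, command: bytes, dma_addr: int) -> dict:
    """Return an inline entry, or an extended wrapper for large commands."""
    if len(command) <= MAX_INLINE_CMD_BYTES:
        return {"kind": "inline", "opcode": opcode, "command": command}
    if opcode < MIN_EXTENDED_INNER_OPCODE:
        raise ValueError("extended commands use inner opcodes above 0xFF")
    # The real command lives in separately allocated DMA memory; the entry
    # carries only the inner opcode, the length and the address.
    return {
        "kind": "extended",
        "inner_opcode": opcode,
        "inner_length": len(command),
        "dma_addr": dma_addr,
    }
```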
Ziwei Xiao [Tue, 25 Jun 2024 00:12:27 +0000 (00:12 +0000)]
gve: Add adminq mutex lock
We were depending on the rtnl_lock to make sure there is only one adminq
command running at a time. But some commands, such as the upcoming flow
steering operations, may take too long to run while holding the
rtnl_lock. For such operations, the driver can temporarily drop the
rtnl_lock and rely on a new adminq lock instead, which keeps adminq
command execution thread-safe.
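The idea of a dedicated lock that serializes commands independently of any outer lock can be sketched in user space. This is a toy model with hypothetical names; `threading.Lock` stands in for the new adminq mutex.

```python
import threading


class AdminQueue:
    """Toy admin queue: a dedicated lock serializes command execution."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for the adminq lock
        self._in_flight = 0
        self.max_in_flight = 0
        self.executed = 0

    def execute(self, opcode: int) -> None:
        with self._lock:
            self._in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self._in_flight)
            self.executed += 1          # stand-in for issuing the command
            self._in_flight -= 1


def run_workers(queue: AdminQueue, workers: int, commands_each: int) -> None:
    """Issue commands from several threads at once."""
    def worker():
        for _ in range(commands_each):
            queue.execute(0)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

However many threads call `execute()`, `max_in_flight` never exceeds 1, which is the guarantee the rtnl_lock used to provide implicitly.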
Neal Cardwell [Mon, 24 Jun 2024 14:43:23 +0000 (14:43 +0000)]
tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
Testing determined that the recent commit 9e046bb111f1 ("tcp: clear
tp->retrans_stamp in tcp_rcv_fastopen_synack()") has a race, and does
not always ensure retrans_stamp is 0 after a TFO payload retransmit.
If transmit completion for the SYN+data skb happens after the client
TCP stack receives the SYNACK (which sometimes happens), then
retrans_stamp can erroneously remain non-zero for the lifetime of the
connection, causing a premature ETIMEDOUT later.
Testing and tracing showed that the buggy scenario is the following
somewhat tricky sequence:
+ Client attempts a TFO handshake. tcp_send_syn_data() sends SYN + TFO
cookie + data in a single packet in the syn_data skb. It hands the
syn_data skb to tcp_transmit_skb(), which makes a clone. Crucially,
it then reuses the same original (non-clone) syn_data skb,
transforming it by advancing the seq by one byte and removing the
FIN bit, and enqueues the resulting payload-only skb in the
sk->tcp_rtx_queue.
+ Client sets retrans_stamp to the start time of the three-way
handshake.
+ Cookie mismatches or server has TFO disabled, and server only ACKs
SYN.
+ tcp_ack() sees SYN is acked, tcp_clean_rtx_queue() clears
retrans_stamp.
+ Since the client SYN was acked but not the payload, the TFO failure
code path in tcp_rcv_fastopen_synack() tries to retransmit the
payload skb. However, in some cases the transmit completion for the
clone of the syn_data (which had SYN + TFO cookie + data) hasn't
happened. In those cases, skb_still_in_host_queue() returns true
for the retransmitted TFO payload, because the clone of the syn_data
skb has not had its tx completion.
+ Because skb_still_in_host_queue() finds skb_fclone_busy() is true,
it sets the TSQ_THROTTLED bit and the retransmit does not happen in
the tcp_rcv_fastopen_synack() call chain.
+ The tcp_rcv_fastopen_synack() code next implicitly assumes the
retransmit process is finished, and sets retrans_stamp to 0 to clear
it, but this is later overwritten (see below).
+ Later, upon tx completion, tcp_tsq_write() calls
tcp_xmit_retransmit_queue(), which puts the retransmit in flight and
sets retrans_stamp to a non-zero value.
+ The client receives an ACK for the retransmitted TFO payload data.
+ Since we're in CA_Open and there are no dupacks/SACKs/DSACKs/ECN to
make tcp_ack_is_dubious() true and make us call
tcp_fastretrans_alert() and reach a code path that clears
retrans_stamp, retrans_stamp stays nonzero.
+ Later, if there is a TLP, RTO, RTO sequence, then the connection
will suffer an early ETIMEDOUT due to the erroneously ancient
retrans_stamp.
The fix: this commit refactors the code to have
tcp_rcv_fastopen_synack() retransmit by reusing the relevant parts of
tcp_simple_retransmit() that enter CA_Loss (without changing cwnd) and
call tcp_xmit_retransmit_queue(). We have tcp_simple_retransmit() and
tcp_rcv_fastopen_synack() share code in this way because in both cases
we get a packet indicating non-congestion loss (MTU reduction or TFO
failure) and thus in both cases we want to retransmit as many packets
as cwnd allows, without reducing cwnd. And given that retransmits will
set retrans_stamp to a non-zero value (and may do so in a later
calling context due to TSQ), we also want to enter CA_Loss so that we
track when all retransmitted packets are ACked and clear retrans_stamp
when that happens (to ensure later recurring RTOs are using the
correct retrans_stamp and don't declare ETIMEDOUT prematurely).
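Why a stale retrans_stamp leads to a premature ETIMEDOUT can be seen in a toy model of the timeout check. The functions below are hypothetical and much simpler than the kernel's logic; they only assume that a stamp of 0 means "no retransmission episode in progress" and that the timeout is measured from the stamp.

```python
def stamp_on_retransmit(retrans_stamp_ms: int, now_ms: int) -> int:
    """Record the start of an episode only if none is being tracked."""
    return retrans_stamp_ms if retrans_stamp_ms else now_ms


def retransmit_timed_out(now_ms: int, retrans_stamp_ms: int,
                         timeout_ms: int) -> bool:
    """Give up once retransmissions have run for `timeout_ms`."""
    if retrans_stamp_ms == 0:
        return False
    return now_ms - retrans_stamp_ms >= timeout_ms
```

If the stamp from the TFO payload retransmit at, say, 1 s is never cleared, the first retransmit of an unrelated loss at 500 s already looks 499 s old and exceeds a 120 s timeout at once. With the stamp cleared in between, the same retransmit starts a fresh episode.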
====================
ethtool: provide the dim profile fine-tuning channel
The NetDIM library provides excellent acceleration for many modern
network cards. However, the default DIM profiles limit its maximum
capabilities for different NICs, so a way to custom-configure the NIC
is necessary.
Currently, the way is based on the commonly used "ethtool -C".
For example,
on the server side, the virtio-net NIC with rx dim enabled has 8
queues and runs nginx.
The client uses the following command to send traffic to the server:
./wrk http://server_ip:80 -c 64 -t 5 -d 30
Then adjust the default rx-profile for server dim to
Heng Qi [Fri, 21 Jun 2024 10:13:53 +0000 (18:13 +0800)]
virtio-net: support dim profile fine-tuning
Virtio-net has different types of back-end device implementations.
In order to effectively optimize the dim library's gains for different
device implementations, let's use the new interface params to
initialize and query dim results from a customized profile list.
Heng Qi [Fri, 21 Jun 2024 10:13:51 +0000 (18:13 +0800)]
ethtool: provide customized dim profile management
The NetDIM library, currently leveraged by an array of NICs, delivers
excellent acceleration benefits. Nevertheless, NICs vary significantly
in their dim profile list prerequisites.
Specifically, virtio-net backends may present diverse sw or hw device
implementation, making a one-size-fits-all parameter list impractical.
On Alibaba Cloud, the virtio DPU's performance under the default DIM
profile falls short of expectations, partly due to a mismatch in
parameter configuration.
I also noticed that ice/idpf/ena and other NICs have customized
profile lists or place some restrictions on dim capabilities.
Motivated by this, I tried adding new params for "ethtool -C" that provide
per-device control to modify and access a device's interrupt parameters.
Usage
========
The target NIC is named ethx.
Assume that ethx only declares support for rx profile setting
(with DIM_PROFILE_RX flag set in profile_flags) and supports modification
of usec and pkt fields.
1. Query the currently customized list of the device
Heng Qi [Fri, 21 Jun 2024 10:13:50 +0000 (18:13 +0800)]
dim: make DIMLIB dependent on NET
DIMLIB's capabilities are supplied by the dim, net_dim, and
rdma_dim objects, and dim's interfaces solely act as a base for
net_dim and rdma_dim and are not explicitly used anywhere else.
rdma_dim is utilized by the infiniband driver, while net_dim
is for network devices, excluding the soc/fsl driver.
In this patch, net_dim relies on some NET's interfaces, thus
DIMLIB needs to explicitly depend on the NET Kconfig.
The soc/fsl driver uses the functions provided by net_dim, so
it also needs to depend on NET.
Jakub Kicinski [Wed, 26 Jun 2024 00:07:06 +0000 (17:07 -0700)]
Merge branch 'ravb-add-mii-support-for-r-car-v4m'
Geert Uytterhoeven says:
====================
ravb: Add MII support for R-Car V4M
All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII
interface. In addition, the first two EtherAVB instances on R-Car V4M
also support the MII interface, but this is not yet supported by the
driver. This patch series adds support for MII on R-Car Gen4, after the
customary cleanup.
The corresponding pin control support is available in [1].
Compile-tested only, as all AVB interfaces on the Gray Hawk Single
development board are connected to RGMII PHYs.
No regressions on R-Car V4H.
All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII
interface. In addition, the first two EtherAVB instances on R-Car V4M
also support the MII interface, but this is not yet supported by the
driver.
Add support for MII on R-Car Gen4 by adding an R-Car Gen4-specific EMAC
initialization function that selects the MII clock instead of the RGMII
clock when the PHY interface is MII. Note that all implementations of
EtherAVB on R-Car Gen4 SoCs have the APSR register, but only MII-capable
instances are documented to have the MIISELECT bit, which has a
documented value of zero when reserved.
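The clock selection described above can be modelled in a few lines. This
is only a userspace sketch, not the driver code: the bit position of
MIISELECT, the enum, and the function name are all hypothetical and
chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position, chosen for illustration only; the real
 * definition belongs to the driver and the hardware manual. */
#define SKETCH_APSR_MIISELECT (UINT32_C(1) << 24)

/* Hypothetical stand-in for the kernel's PHY interface modes. */
enum sketch_phy_if {
	SKETCH_PHY_RGMII,
	SKETCH_PHY_MII,
};

/*
 * Compute the APSR value to write back: set MIISELECT only when the PHY
 * interface is MII and clear it otherwise, leaving every other bit alone.
 * Clearing matches the documented reserved value of zero on instances
 * that lack the bit.
 */
static uint32_t sketch_apsr_select_clock(uint32_t apsr,
					 enum sketch_phy_if phy_if)
{
	assert(phy_if == SKETCH_PHY_RGMII || phy_if == SKETCH_PHY_MII);

	apsr &= ~SKETCH_APSR_MIISELECT;
	if (phy_if == SKETCH_PHY_MII)
		apsr |= SKETCH_APSR_MIISELECT;
	return apsr;
}
```

Because the RGMII path only ever writes zero to the bit, the same
function is harmless on instances where the bit is reserved.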
Shannon Nelson [Mon, 24 Jun 2024 17:50:15 +0000 (10:50 -0700)]
ionic: use dev_consume_skb_any outside of napi
If we're not in a NAPI softirq context, we need to be careful
about how we call napi_consume_skb(): specifically, we need to
call it with budget==0 to signal to it that we're not in a
safe context.
This was found while running some configuration stress testing
with traffic flowing and a queue config change loop running, and
this curious note popped out:
I found that ionic_tx_clean() calls napi_consume_skb() which calls
napi_skb_cache_put(), but before that last call is the note
/* Zero budget indicate non-NAPI context called us, like netpoll */
and
DEBUG_NET_WARN_ON_ONCE(!in_softirq());
Those are pretty big hints that we're doing it wrong. We can pass a
context hint down through the calls to let ionic_tx_clean() know what
we're doing so it can call napi_consume_skb() correctly.
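The idea of passing a context hint down can be illustrated with a small
userspace model. Everything below is a hypothetical stand-in (the
sketch_* types and helpers are not kernel APIs); it only mirrors the
rule quoted above, that a zero budget tells the consume helper it was
not called from NAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: these hypothetical types and helpers mimic the
 * decision, not the kernel's skb handling. */
enum sketch_free_path {
	SKETCH_NOT_FREED,
	SKETCH_FREED_NAPI_CACHE,	/* fast path, NAPI softirq context only */
	SKETCH_FREED_ANY_CONTEXT,	/* safe from any context */
};

struct sketch_skb {
	enum sketch_free_path path;
};

/* Models napi_consume_skb(): a zero budget means "not called from NAPI". */
static void sketch_napi_consume_skb(struct sketch_skb *skb, int budget)
{
	if (budget == 0) {
		skb->path = SKETCH_FREED_ANY_CONTEXT;
		return;
	}
	skb->path = SKETCH_FREED_NAPI_CACHE;
}

/* Models a Tx clean routine that receives the caller's context as a hint. */
static void sketch_tx_clean(struct sketch_skb *skb, bool in_napi)
{
	assert(skb && skb->path == SKETCH_NOT_FREED);	/* free each skb once */
	sketch_napi_consume_skb(skb, in_napi ? 1 : 0);
}
```

A caller in the NAPI poll path would pass in_napi as true, while a
caller in a queue reconfiguration path would pass false and so end up
on the any-context path.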
Li RongQing [Fri, 21 Jun 2024 09:45:52 +0000 (17:45 +0800)]
virtio_net: Remove u64_stats_update_begin()/end() for stats fetch
This code only fetches the stats, so u64_stats_update_begin()/end()
should not be used. The fetcher is in the same context as the updater
of the stats, so no protection is needed.
Lin Ma [Fri, 31 May 2024 01:28:47 +0000 (09:28 +0800)]
netfilter: cttimeout: remove 'l3num' attr check
After commit dd2934a95701 ("netfilter: conntrack: remove l3->l4 mapping
information"), the attribute of type `CTA_TIMEOUT_L3PROTO` is not used
any more in function cttimeout_default_set.
However, the previous commit ea9cf2a55a7b ("netfilter: cttimeout: remove
set but not used variable 'l3num'") forgot to remove the attribute
present check when removing the related variable.
This commit removes that check to ensure consistency.
Yunjian Wang [Fri, 31 May 2024 03:48:47 +0000 (11:48 +0800)]
netfilter: nf_conncount: fix wrong variable type
Code checkers report a warning: implicit narrowing conversion from
type 'unsigned int' to small type 'u8' (the 'keylen' variable). Fix it
by removing the 'keylen' variable.
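For readers unfamiliar with this class of warning, the following
userspace sketch shows what an implicit narrowing conversion does to a
value. It is a generic illustration, not the nf_conncount code: the
function names are hypothetical, and uint8_t stands in for the kernel's
u8.

```c
#include <assert.h>
#include <stdint.h>

/* Implicit narrowing: the value is reduced modulo 256 with no diagnostic
 * unless a checker flags it. */
static uint8_t sketch_narrow_to_u8(unsigned int value)
{
	uint8_t small = value;	/* the conversion a checker would flag */

	return small;
}

/* Checked alternative: refuse values that do not fit in 8 bits. */
static uint8_t sketch_checked_u8(unsigned int value)
{
	assert(value <= UINT8_MAX);
	return (uint8_t)value;
}
```

With these helpers, sketch_narrow_to_u8(300) yields 44, because 300
modulo 256 is 44. Removing the narrow intermediate variable, as the
commit does, avoids the conversion entirely.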
Kent Overstreet [Sun, 23 Jun 2024 04:53:44 +0000 (00:53 -0400)]
bcachefs: Discard, invalidate workers are now per device
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.
Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.