====================
mptcp: fix endpoints with 'signal' and 'subflow' flags
When looking at improving the user experience around the MPTCP endpoints
setup, I noticed that setting an endpoint with both the 'signal' and the
'subflow' flags -- as it has been done in the past by users according to
bug reports we got -- was resulting on only announcing the endpoint, but
not using it to create subflows: the 'subflow' flag was then ignored.
My initial thought was to modify IPRoute2 to warn the user when the two
flags were set, but it doesn't sound normal to ignore one of them. I
then looked at modifying the kernel not to allow having the two flags
set, but when discussing about that with Mat, we thought it was maybe
not ideal to do that, as there might be use-cases, we might break some
configs. Then I saw it was working before v5.17. So instead, I fixed the
support on the kernel side (patch 5) using Paolo's suggestion. This also
includes a fix on the options side (patch 1: for v5.11+), an explicit
deny of some options combinations (patch 2: for v5.18+), and some
refactoring (patches 3 and 4) to ease the inclusion of the patch 5.
While at it, I added a new selftest (patch 7) to validate this case --
including a modification of the chk_add_nr helper to inverse the sides
were the counters are checked (patch 6) -- and allowed ADD_ADDR echo
just after the MP_JOIN 3WHS.
The selftests modification have the same Fixes tag as the previous
commit, but no 'Cc: Stable': if the backport can work, that's good --
but it still need to be verified by running the selftests -- if not, no
need to worry, many CIs will use the selftests from the last stable
version to validate previous stable releases.
====================
selftests: mptcp: join: test both signal & subflow
It should be quite uncommon to set both the subflow and the signal
flags: the initiator of the connection is typically the one creating new
subflows, not the other peer, then no need to announce additional local
addresses, and use it to create subflows.
But some people might be confused about the flags, and set both "just to
be sure at least the right one is set". To verify the previous fix, and
avoid future regressions, this specific case is now validated: the
client announces a new address, and initiates a new subflow from the
same address.
While working on this, another bug has been noticed, where the client
reset the new subflow because an ADD_ADDR echo got received as the 3rd
ACK: this new test also explicitly checks that no RST have been sent by
the client and server.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
selftests: mptcp: join: ability to invert ADD_ADDR check
In the following commit, the client will initiate the ADD_ADDR, instead
of the server. We need to way to verify the ADD_ADDR have been correctly
sent.
Note: the default expected counters for when the port number is given
are never changed by the caller, no need to accept them as parameter
then.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set
Up to the 'Fixes' commit, having an endpoint with both the 'signal' and
'subflow' flags, resulted in the creation of a subflow and an address
announcement using the address linked to this endpoint. After this
commit, only the address announcement was done, ignoring the 'subflow'
flag.
That's because the same bitmap is used for the two flags. It is OK to
keep this single bitmap, the already selected local endpoint simply have
to be re-used, but not via select_local_address() not to look at the
just modified bitmap.
Note that it is unusual to set the two flags together: creating a new
subflow using a new local address will implicitly advertise it to the
other peer. So in theory, no need to advertise it explicitly as well.
Maybe there are use-cases -- the subflow might not reach the other peer
that way, we can ask the other peer to try initiating the new subflow
without delay -- or very likely the user is confused, and put both flags
"just to be sure at least the right one is set". Still, if it is
allowed, the kernel should do what has been asked: using this endpoint
to announce the address and to create a new subflow from it.
An alternative is to forbid the use of the two flags together, but
that's probably too late, there are maybe use-cases, and it was working
before. This patch will avoid people complaining subflows are not
created using the endpoint they added with the 'subflow' and 'signal'
flag.
Note that with the current patch, the subflow might not be created in
some corner cases, e.g. if the 'subflows' limit was reached when sending
the ADD_ADDR, but changed later on. It is probably not worth splitting
id_avail_bitmap per target ('signal', 'subflow'), which will add another
large field to the msk "just" to track (again) endpoints. Anyway,
currently when the limits are changed, the kernel doesn't check if new
subflows can be created or removed, because we would need to keep track
of the received ADD_ADDR, and more. It sounds OK to assume that the
limits should be properly configured before establishing new
connections.
It sounds better to avoid wasting cycles and / or put extreme memory
pressure on the system by trying to create new subflows if it was not
possible to add a new item in the announce list.
While at it, a warning is now printed if the entry was already in the
list as it should not happen with the in-kernel path-manager. With this
PM, mptcp_pm_alloc_anno_list() should only fail in case of memory
pressure.
As mentioned in the 'Fixes' commit, the port flag is only supported by
the 'signal' flag, and not by the 'subflow' one. Then if both the
'signal' and 'subflow' flags are set, the problem is the same: the
feature cannot work with the 'subflow' flag.
Technically, if both the 'signal' and 'subflow' flags are set, it will
be possible to create the listening socket, but not to establish a
subflow using this source port. So better to explicitly deny it, not to
create some confusions because the expected behaviour is not possible.
mptcp: fully established after ADD_ADDR echo on MPJ
Before this patch, receiving an ADD_ADDR echo on the just connected
MP_JOIN subflow -- initiator side, after the MP_JOIN 3WHS -- was
resulting in an MP_RESET. That's because only ACKs with a DSS or
ADD_ADDRs without the echo bit were allowed.
Not allowing the ADD_ADDR echo after an MP_CAPABLE 3WHS makes sense, as
we are not supposed to send an ADD_ADDR before because it requires to be
in full established mode first. For the MP_JOIN 3WHS, that's different:
the ADD_ADDR can be sent on a previous subflow, and the ADD_ADDR echo
can be received on the recently created one. The other peer will already
be in fully established, so it is allowed to send that.
We can then relax the conditions here to accept the ADD_ADDR echo for
MPJ subflows.
Linus Torvalds [Thu, 1 Aug 2024 16:42:09 +0000 (09:42 -0700)]
Merge tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from wireless, bleutooth, BPF and netfilter.
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and
aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq"
* tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
mptcp: fix duplicate data handling
mptcp: fix bad RCVPRUNED mib accounting
ipv6: fix ndisc_is_useropt() handling for PIO
igc: Fix double reset adapter triggered from a single taprio cmd
net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
net: wan: fsl_qmc_hdlc: Discard received CRC
net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
net/mlx5e: Fix CT entry update leaks of modify header context
net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
net/mlx5: Fix missing lock on sync reset reload
net/mlx5: Lag, don't use the hardcoded value of the first port
net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
net/mlx5: Fix error handling in irq_pool_request_irq
net/mlx5: Always drain health in shutdown callback
net: Add skbuff.h to MAINTAINERS
r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
net: drop bad gso csum_start and offset in virtio_net_hdr
...
Paolo Abeni [Wed, 31 Jul 2024 10:10:15 +0000 (12:10 +0200)]
mptcp: fix duplicate data handling
When a subflow receives and discards duplicate data, the mptcp
stack assumes that the consumed offset inside the current skb is
zero.
With multiple subflows receiving data simultaneously such assertion
does not held true. As a result the subflow-level copied_seq will
be incorrectly increased and later on the same subflow will observe
a bad mapping, leading to subflow reset.
Address the issue taking into account the skb consumed offset in
mptcp_subflow_discard_data().
Paolo Abeni [Wed, 31 Jul 2024 10:10:14 +0000 (12:10 +0200)]
mptcp: fix bad RCVPRUNED mib accounting
Since its introduction, the mentioned MIB accounted for the wrong
event: wake-up being skipped as not-needed on some edge condition
instead of incoming skb being dropped after landing in the (subflow)
receive queue.
Paolo Abeni [Thu, 1 Aug 2024 10:08:28 +0000 (12:08 +0200)]
Merge tag 'nf-24-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Fix a possible null-ptr-deref sometimes triggered by iptables-restore at
boot time. Register iptables {ipv4,ipv6} nat table pernet in first place
to fix this issue. Patch #1 and #2 from Kuniyuki Iwashima.
netfilter pull request 24-07-31
* tag 'nf-24-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
====================
The current logic only works if the PIO is between two
other ND user options. This fixes it so that the PIO
can also be either before or after other ND user options
(for example the first or last option in the RA).
side note: there's actually Android tests verifying
a portion of the old broken behaviour, so:
https://android-review.googlesource.com/c/kernel/tests/+/3196704
fixes those up.
Faizal Rahim [Tue, 30 Jul 2024 17:33:02 +0000 (10:33 -0700)]
igc: Fix double reset adapter triggered from a single taprio cmd
Following the implementation of "igc: Add TransmissionOverrun counter"
patch, when a taprio command is triggered by user, igc processes two
commands: TAPRIO_CMD_REPLACE followed by TAPRIO_CMD_STATS. However, both
commands unconditionally pass through igc_tsn_offload_apply() which
evaluates and triggers reset adapter. The double reset causes issues in
the calculation of adapter->qbv_count in igc.
TAPRIO_CMD_REPLACE command is expected to reset the adapter since it
activates qbv. It's unexpected for TAPRIO_CMD_STATS to do the same
because it doesn't configure any driver-specific TSN settings. So, the
evaluation in igc_tsn_offload_apply() isn't needed for TAPRIO_CMD_STATS.
To address this, commands parsing are relocated to
igc_tsn_enable_qbv_scheduling(). Commands that don't require an adapter
reset will exit after processing, thus avoiding igc_tsn_offload_apply().
net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
To the best of my knowledge, Alex Elder is not being paid to support
Qualcomm IPA networking drivers, so drop the status from "supported" to
"maintained".
net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
The carrier_lock spinlock protects the carrier detection. While it is
held, framer_get_status() is called which in turn takes a mutex.
This is not correct and can lead to a deadlock.
A run with PROVE_LOCKING enabled detected the issue:
[ BUG: Invalid wait context ]
... c204ddbc (&framer->mutex){+.+.}-{3:3}, at: framer_get_status+0x40/0x78
other info that might help us debug this:
context-{4:4}
2 locks held by ifconfig/146:
#0: c0926a38 (rtnl_mutex){+.+.}-{3:3}, at: devinet_ioctl+0x12c/0x664
#1: c2006a40 (&qmc_hdlc->carrier_lock){....}-{2:2}, at: qmc_hdlc_framer_set_carrier+0x30/0x98
Avoid the spinlock usage and convert carrier_lock to a mutex.
Shahar Shitrit [Tue, 30 Jul 2024 06:16:37 +0000 (09:16 +0300)]
net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
Since the documentation for mlx5_toggle_port_link states that it should
only be used after setting the port register, we add a check for the
return value from mlx5_port_set_eth_ptys to ensure the register was
successfully set before calling it.
Chris Mi [Tue, 30 Jul 2024 06:16:36 +0000 (09:16 +0300)]
net/mlx5e: Fix CT entry update leaks of modify header context
The cited commit allocates a new modify header to replace the old
one when updating CT entry. But if failed to allocate a new one, eg.
exceed the max number firmware can support, modify header will be
an error pointer that will trigger a panic when deallocating it. And
the old modify header point is copied to old attr. When the old
attr is freed, the old modify header is lost.
Fix it by restoring the old attr to attr when failed to allocate a
new modify header context. So when the CT entry is freed, the right
modify header context will be freed. And the panic of accessing
error pointer is also fixed.
net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
Require mlx5 classifier action support when creating IPSec chains in
offload path. MLX5_IPSEC_CAP_PRIO should only be set if CONFIG_MLX5_CLS_ACT
is enabled. If CONFIG_MLX5_CLS_ACT=n and MLX5_IPSEC_CAP_PRIO is set,
configuring IPsec offload will fail due to the mlxx5 ipsec chain rules
failing to be created due to lack of classifier action support.
On sync reset reload work, when remote host updates devlink on reload
actions performed on that host, it misses taking devlink lock before
calling devlink_remote_reload_actions_performed() which results in
triggering lock assert like the following:
net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
This patch reduces the size of hw_ste_arr_optimized array that is
allocated on stack from 640 bytes (5 match STEs + 5 action STES)
to 448 bytes (2 match STEs + 5 action STES).
This fixes the 'stack guard page was hit' issue, while still fitting
majority of the usecases (up to 2 match STEs).
Jakub Kicinski [Thu, 1 Aug 2024 00:49:00 +0000 (17:49 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-07-31
We've added 2 non-merge commits during the last 2 day(s) which contain
a total of 2 files changed, 2 insertions(+), 2 deletions(-).
The main changes are:
1) Fix BPF selftest build after tree sync with regards to a _GNU_SOURCE
macro redefined compilation error, from Stanislav Fomichev.
2) Fix a wrong test in the ASSERT_OK() check in uprobe_syscall BPF selftest,
from Jiri Olsa.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf/selftests: Fix ASSERT_OK condition check in uprobe_syscall test
selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp
====================
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
ip6table_nat_table_init() accesses net->gen->ptr[ip6table_nat_net_ops.id],
but the function is exposed to user space before the entry is allocated
via register_pernet_subsys().
Let's call register_pernet_subsys() before xt_register_template().
Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
We had a report that iptables-restore sometimes triggered null-ptr-deref
at boot time. [0]
The problem is that iptable_nat_table_init() is exposed to user space
before the kernel fully initialises netns.
In the small race window, a user could call iptable_nat_table_init()
that accesses net_generic(net, iptable_nat_net_id), which is available
only after registering iptable_nat_net_ops.
Let's call register_pernet_subsys() before xt_register_template().
David Laight pointed out that we should deal with the min3() and max3()
mess too, which still does excessive expansion.
And our current macros are actually rather broken.
In particular, the macros did this:
#define min3(x, y, z) min((typeof(x))min(x, y), z)
#define max3(x, y, z) max((typeof(x))max(x, y), z)
and that not only is a nested expansion of possibly very complex
arguments with all that involves, the typing with that "typeof()" cast
is completely wrong.
For example, imagine what happens in max3() if 'x' happens to be a
'unsigned char', but 'y' and 'z' are 'unsigned long'. The types are
compatible, and there's no warning - but the result is just random
garbage.
No, I don't think we've ever hit that issue in practice, but since we
now have sane infrastructure for doing this right, let's just use it.
It fixes any excessive expansion, and also avoids these kinds of broken
type issues.
Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Namhyung Kim:
"Some more build fixes and a random crash fix:
- Fix cross-build by setting pkg-config env according to the arch
- Fix static build for missing library dependencies
- Fix Segfault when callchain has no symbols"
* tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf docs: Document cross compilation
perf: build: Link lib 'zstd' for static build
perf: build: Link lib 'lzma' for static build
perf: build: Only link libebl.a for old libdw
perf: build: Set Python configuration for cross compilation
perf: build: Setup PKG_CONFIG_LIBDIR for cross compilation
perf tool: fix dereferencing NULL al->maps
Jakub Kicinski [Wed, 31 Jul 2024 01:41:10 +0000 (18:41 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
ice: fix AF_XDP ZC timeout and concurrency issues
Maciej Fijalkowski says:
Changes included in this patchset address an issue that customer has
been facing when AF_XDP ZC Tx sockets were used in combination with flow
control and regular Tx traffic.
After executing:
ethtool --set-priv-flags $dev link-down-on-close on
ethtool -A $dev rx on tx on
launching multiple ZC Tx sockets on $dev + pinging remote interface (so
that regular Tx traffic is present) and then going through down/up of
$dev, Tx timeout occurred and then most of the time ice driver was unable
to recover from that state.
These patches combined together solve the described above issue on
customer side. Main focus here is to forbid producing Tx descriptors when
either carrier is not yet initialized or process of bringing interface
down has already started.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: xsk: fix txq interrupt mapping
ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog
ice: improve updating ice_{t,r}x_ring::xsk_pool
ice: toggle netif_carrier when setting up XSK pool
ice: modify error handling when setting XSK pool in ndo_bpf
ice: replace synchronize_rcu with synchronize_net
ice: don't busy wait for Rx queue disable in ice_qp_dis()
ice: respect netif readiness in AF_XDP ZC related ndo's
====================
Willem de Bruijn [Mon, 29 Jul 2024 20:10:12 +0000 (16:10 -0400)]
net: drop bad gso csum_start and offset in virtio_net_hdr
Tighten csum_start and csum_offset checks in virtio_net_hdr_to_skb
for GSO packets.
The function already checks that a checksum requested with
VIRTIO_NET_HDR_F_NEEDS_CSUM is in skb linear. But for GSO packets
this might not hold for segs after segmentation.
Syzkaller demonstrated to reach this warning in skb_checksum_help
offset = skb_checksum_start_offset(skb);
ret = -EINVAL;
if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
csum_offset: for GSO packets, deduce the correct value from gso_type.
This is already done for USO. Extend it to TSO. Let UFO be:
udp[46]_ufo_fragment ignores these fields and always computes the
checksum in software.
csum_start: finding the real offset requires parsing to the transport
header. Do not add a parser, use existing segmentation parsing. Thanks
to SKB_GSO_DODGY, that also catches bad packets that are hw offloaded.
Again test both TSO and USO. Do not test UFO for the above reason, and
do not test UDP tunnel offload.
GSO packet are almost always CHECKSUM_PARTIAL. USO packets may be
CHECKSUM_NONE since commit 10154dbded6d6 ("udp: Allow GSO transmit
from devices with no checksum offload"), but then still these fields
are initialized correctly in udp4_hwcsum/udp6_hwcsum_outgoing. So no
need to test for ip_summed == CHECKSUM_PARTIAL first.
This revises an existing fix mentioned in the Fixes tag, which broke
small packets with GSO offload, as detected by kselftests.
net: phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and aqr115c
Commit 708405f3e56e ("net: phy: aquantia: wait for the GLOBAL_CFG to
start returning real values") introduced a workaround for an issue
observed on aqr115c. However there were never any reports of it
happening on other models and the workaround has been reported to cause
and issue on aqr113c (and it may cause the same on any other model not
supporting 10M mode).
Let's limit the impact of the workaround to aqr113, aqr113c and aqr115c
and poll the 100M GLOBAL_CFG register instead as both models are known
to support it correctly.
net: phy: micrel: Fix the KSZ9131 MDI-X status issue
The MDIX status is not accurately reflecting the current state after the link
partner has manually altered its MDIX configuration while operating in forced
mode.
Access information about Auto mdix completion and pair selection from the
KSZ9131's Auto/MDI/MDI-X status register
Merge tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome-platform fix from Tzung-Bi Shih:
"Fix a race condition that sends multiple host commands at a time"
* tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_proto: Lock device when updating MKBP version
This clarifies the rules for min()/max()/clamp() type checking and makes
them a much more efficient macro expansion.
In particular, we now look at the type and range of the inputs to see
whether they work together, generating a mask of acceptable comparisons,
and then just verifying that the inputs have a shared case:
- an expression with a signed type can be used for
(1) signed comparisons
(2) unsigned comparisons if it is statically known to have a
non-negative value
- an expression with an unsigned type can be used for
(3) unsigned comparison
(4) signed comparisons if the type is smaller than 'int' and thus
the C integer promotion rules will make it signed anyway
Here rule (1) and (3) are obvious, and rule (2) is important in order to
allow obvious trivial constants to be used together with unsigned
values.
Rule (4) is not necessarily a good idea, but matches what we used to do,
and we have extant cases of this situation in the kernel. Notably with
bcachefs having an expression like
where bch2_bucket_sectors_dirty() returns an 's64', and
'ca->mi.bucket_size' is of type 'u16'.
Technically that bcachefs comparison is clearly sensible on a C type
level, because the 'u16' will go through the normal C integer promotion,
and become 'int', and then we're comparing two signed values and
everything looks sane.
However, it's not entirely clear that a 'min(s64,u16)' operation makes a
lot of conceptual sense, and it's possible that we will remove rule (4).
After all, the _reason_ we have these complicated type checks is exactly
that the C type promotion rules are not very intuitive.
But at least for now the rule is in place for backwards compatibility.
Also note that rule (2) existed before, but is hugely relaxed by this
commit. It used to be true only for the simplest compile-time
non-negative integer constants. The new macro model will allow cases
where the compiler can trivially see that an expression is non-negative
even if it isn't necessarily a constant.
because our old 'min()' macro would see that 'pia[addr] & 0x3F' is of
type 'int' and clearly not a C constant expression, so doing a 'min()'
with a 'size_t' is a signedness violation.
Our new 'min()' macro still sees that 'pia[addr] & 0x3F' is of type
'int', but is smart enough to also see that it is clearly non-negative,
and thus would allow that case without any complaints.
Dan Carpenter [Wed, 24 Jul 2024 16:06:56 +0000 (11:06 -0500)]
net: mvpp2: Don't re-use loop iterator
This function has a nested loop. The problem is that both the inside
and outside loop use the same variable as an iterator. I found this
via static analysis so I'm not sure the impact. It could be that it
loops forever or, more likely, the loop exits early.
David Sterba [Mon, 29 Jul 2024 19:59:24 +0000 (21:59 +0200)]
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
Alexandra Winter [Mon, 29 Jul 2024 12:28:16 +0000 (14:28 +0200)]
net/iucv: fix use after free in iucv_sock_close()
iucv_sever_path() is called from process context and from bh context.
iucv->path is used as indicator whether somebody else is taking care of
severing the path (or it is already removed / never existed).
This needs to be done with atomic compare and swap, otherwise there is a
small window where iucv_sock_close() will try to work with a path that has
already been severed and freed by iucv_callback_connrej() called by
iucv_tasklet_fn().
Note that bh_lock_sock() is not serializing the tasklet context against
process context, because the check for sock_owned_by_user() and
corresponding handling is missing.
Ideas for a future clean-up patch:
A) Correct usage of bh_lock_sock() in tasklet context, as described in Link: https://lore.kernel.org/netdev/1280155406.2899.407.camel@edumazet-laptop/
Re-enqueue, if needed. This may require adding return values to the
tasklet functions and thus changes to all users of iucv.
B) Change iucv tasklet into worker and use only lock_sock() in af_iucv.
platform/chrome: cros_ec_proto: Lock device when updating MKBP version
The cros_ec_get_host_command_version_mask() function requires that the
caller must have ec_dev->lock mutex before calling it. This requirement
was not met and as a result it was possible that two commands were sent
to the device at the same time.
The problem was observed while using UART backend which doesn't use any
additional locks, unlike SPI backend which locks the controller until
response is received.
In all the MPTCP backup related tests, the backup flag was set on one
side, and the expected behaviour is to have both sides respecting this
decision. That's also the "natural" way, and what the users seem to
expect.
On the scheduler side, only the 'backup' field was checked, which is
supposed to be set only if the other peer flagged a subflow as backup.
But in various places, this flag was also set when the local host
flagged the subflow as backup, certainly to have the expected behaviour
mentioned above.
Patch 1 modifies the packet scheduler to check if the backup flag has
been set on both directions, not to change its behaviour after having
applied the following patches. That's what the default packet scheduler
should have done since the beginning in v5.7.
Patch 2 fixes the backup flag being mirrored on the MPJ+SYN+ACK by
accident since its introduction in v5.7. Instead, the received and sent
backup flags are properly distinguished in requests.
Patch 3 stops setting the received backup flag as well when sending an
MP_PRIO, something that was done since the MP_PRIO support in v5.12.
Patch 4 adds related and missing MIB counters to be able to easily check
if MP_JOIN are sent with a backup flag. Certainly because these counters
were not there, the behaviour that is fixed by patches here was not
properly verified.
Patch 5 validates the previous patch by extending the MPTCP Join
selftest.
Patch 6 fixes the backup support in signal endpoints: if a signal
endpoint had the backup flag, it was not set in the MPJ+SYN+ACK as
expected. It was only set for ongoing connections, but not future ones
as expected, since the introduction of the backup flag in endpoints in
v5.10.
Patch 7 validates the previous patch by extending the MPTCP Join
selftest as well.
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
---
Matthieu Baerts (NGI0) (7):
mptcp: sched: check both directions for backup
mptcp: distinguish rcv vs sent backup flag in requests
mptcp: pm: only set request_bkup flag when sending MP_PRIO
mptcp: mib: count MPJ with backup flag
selftests: mptcp: join: validate backup in MPJ
mptcp: pm: fix backup support in signal endpoints
selftests: mptcp: join: check backup support in signal endp
selftests: mptcp: join: check backup support in signal endp
Before the previous commit, 'signal' endpoints with the 'backup' flag
were ignored when sending the MP_JOIN.
The MPTCP Join selftest has then been modified to validate this case:
the "single address, backup" test, is now validating the MP_JOIN with a
backup flag as it is what we expect it to do with such name. The
previous version has been kept, but renamed to "single address, switch
to backup" to avoid confusions.
The "single address with port, backup" test is also now validating the
MPJ with a backup flag, which makes more sense than checking the switch
to backup with an MP_PRIO.
The "mpc backup both sides" test is now validating that the backup flag
is also set in MP_JOIN from and to the addresses used in the initial
subflow, using the special ID 0.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
There was a support for signal endpoints, but only when the endpoint's
flag was changed during a connection. If an endpoint with the signal and
backup was already present, the MP_JOIN reply was not containing the
backup flag as expected.
That's confusing to have this inconsistent behaviour. On the other hand,
the infrastructure to set the backup flag in the SYN + ACK + MP_JOIN was
already there, it was just never set before. Now when requesting the
local ID from the path-manager, the backup status is also requested.
Note that when the userspace PM is used, the backup flag can be set if
the local address was already used before with a backup flag, e.g. if
the address was announced with the 'backup' flag, or a subflow was
created with the 'backup' flag.
A peer can notify the other one that a subflow has to be treated as
"backup" by two different ways: either by sending a dedicated MP_PRIO
notification, or by setting the backup flag in the MP_JOIN handshake.
The selftests were previously monitoring the former, but not the latter.
This is what is now done here by looking at these new MIB counters when
validating the 'backup' cases:
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it will help to validate a new fix for an issue introduced by this
commit ID.
Without such counters, it is difficult to easily debug issues with MPJ
not having the backup flags on production servers.
This is not strictly a fix, but it eases to validate the following
patches without requiring to take packet traces, to query ongoing
connections with Netlink with admin permissions, or to guess by looking
at the behaviour of the packet scheduler. Also, the modification is self
contained, isolated, well controlled, and the increments are done just
after others, there from the beginning. It looks then safe, and helpful
to backport this.
mptcp: distinguish rcv vs sent backup flag in requests
When sending an MP_JOIN + SYN + ACK, it is possible to mark the subflow
as 'backup' by setting the flag with the same name. Before this patch,
the backup was set if the other peer set it in its MP_JOIN + SYN
request.
It is not correct: the backup flag should be set in the MPJ+SYN+ACK only
if the host asks for it, and not mirroring what was done by the other
peer. It is then required to have a dedicated bit for each direction,
similar to what is done in the subflow context.
The 'mptcp_subflow_context' structure has two items related to the
backup flags:
- 'backup': the subflow has been marked as backup by the other peer
- 'request_bkup': the backup flag has been set by the host
Before this patch, the scheduler was only looking at the 'backup' flag.
That can make sense in some cases, but it looks like that's not what we
wanted for the general use, because either the path-manager was setting
both of them when sending an MP_PRIO, or the receiver was duplicating
the 'backup' flag in the subflow request.
Note that the use of these two flags in the path-manager are going to be
fixed in the next commits, but this change here is needed not to modify
the behaviour.
profiling: remove stale percpu flip buffer variables
For some reason I didn't see this issue on my arm64 or x86-64 builds,
but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling:
attempt to remove per-cpu profile flip buffer") left these static
variables around, and the powerpc build is unhappy about them:
kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
| ^~~~~~~~~~~~~~~~
..
So remove these stale left-over remnants too.
Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp
Jakub reports build failures when merging linux/master with net tree:
CXX test_cpp
In file included from <built-in>:454:
<command line>:2:9: error: '_GNU_SOURCE' macro redefined [-Werror,-Wmacro-redefined]
2 | #define _GNU_SOURCE
| ^
<built-in>:445:9: note: previous definition is here
445 | #define _GNU_SOURCE 1
The culprit is commit cc937dad85ae ("selftests: centralize -D_GNU_SOURCE= to
CFLAGS in lib.mk") which unconditionally added -D_GNU_SOUCE to CLFAGS.
Apparently clang++ also unconditionally adds it for the C++ targets [0]
which causes a conflict. Add small change in the selftests makefile
to filter it out for test_cpp.
Not sure which tree it should go via, targeting bpf for now, but net
might be better?
Merge tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- fixes for HID-BPF after the merge with the bpf tree (Arnd Bergmann
and Benjamin Tissoires)
- some tool type fix for the Wacom driver (Tatsunosuke Tobita)
- a reorder of the sensor discovery to ensure the HID AMD SFH is
removed when no sensors are available (Basavaraj Natikar)
* tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
selftests/hid: add test for attaching multiple time the same struct_ops
HID: bpf: prevent the same struct_ops to be attached more than once
selftests/hid: disable struct_ops auto-attach
selftests/hid: fix bpf_wq new API
HID: amd_sfh: Move sensor discovery before HID device initialization
hid: bpf: add BPF_JIT dependency
HID: wacom: more appropriate tool type categorization
HID: wacom: Modify pen IDs
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"The biggest thing here is the adminq change - but it looks like the
only way to avoid headq blocking causing indefinite stalls.
This fixes three issues:
- Prevent admin commands on one VF blocking another.
This prevents a bad VF from blocking a good one, as well as fixing
a scalability issue with large # of VFs
- Correctly return error on command failure on octeon. We used to
treat failed commands as a success.
- Fix modpost warning when building virtio_dma_buf. Harmless, but the
fix is trivial"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_pci_modern: remove admin queue serialization lock
virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result
virtio_pci_modern: pass cmd as an identification token
virtio_pci_modern: create admin queue of queried size
virtio: create admin queues alongside other virtqueues
virtio_pci: pass vq info as an argument to vp_setup_vq()
virtio: push out code to vp_avq_index()
virtio_pci_modern: treat vp_dev->admin_vq.info.vq pointer as static
virtio_pci: introduce vector allocation fallback for slow path virtqueues
virtio_pci: pass vector policy enum to vp_find_one_vq_msix()
virtio_pci: pass vector policy enum to vp_find_vqs_msix()
virtio_pci: simplify vp_request_msix_vectors() call a bit
virtio_pci: push out single vq find code to vp_find_one_vq_msix()
vdpa/octeon_ep: Fix error code in octep_process_mbox()
virtio: add missing MODULE_DESCRIPTION() macro
task_work: make TWA_NMI_CURRENT handling conditional on IRQ_WORK
The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that
isn't universally enabled everywhere.
Maybe the IRQ_WORK infrastructure should just be unconditional - x86
ends up indirectly enabling it through unconditionally enabling
PERF_EVENTS, for example. But it also gets enabled by having SMP
support, or even if you just have PRINTK enabled.
But in the meantime TWA_NMI_CURRENT causes tons of build failures on
various odd minimal configs. Which did show up in linux-next, but
despite that nobody bothered to fix it or even inform me until -rc1 was
out.
profiling: attempt to remove per-cpu profile flip buffer
This is the really old legacy kernel profiling code, which has long
since been obviated by "real profiling" (ie 'prof' and company), and
mainly remains as a source of syzbot reports.
There are anecdotal reports that people still use it for boot-time
profiling, but it's unlikely that such use would care about the old NUMA
optimizations in this code from 2004 (commit ad02973d42: "profile: 512x
Altix timer interrupt livelock fix" in the BK import archive at [1])
So in order to head off future syzbot reports, let's try to simplify
this code and get rid of the per-cpu profile buffers that are quite a
large portion of the complexity footprint of this thing (including CPU
hotplug callbacks etc).
It's unlikely anybody will actually notice, or possibly, as Thomas put
it: "Only people who indulge in nostalgia will notice :)".
That said, if it turns out that this code is actually actively used by
somebody, we can always revert this removal. Thus the "attempt" in the
summary line.
[ Note: in a small nod to "the profiling code can cause NUMA problems",
this also removes the "increment the last entry in the profiling array
on any unknown hits" logic. That would account any program counter in
a module to that single counter location, and might exacerbate any
NUMA cacheline bouncing issues ]
in profile_tick(); prof_cpu_mask remains uninitialzed until cpumask_copy()
completes while cpumask_available(prof_cpu_mask) returns true as soon as
alloc_cpumask_var(&prof_cpu_mask) completes.
We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
call cpumask_copy() from create_proc_profile() on only UP kernels, for
profile_online_cpu() calls cpumask_set_cpu() as needed via
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
removes prof_cpu_mask because it seems unnecessary.
The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
in profile_tick() is likely always true due to
a CPU cannot call profile_tick() if that CPU is offline
and
cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
CPU becomes offline
. This test could be false during transition between online and offline.
But according to include/linux/cpuhotplug.h , CPUHP_PROFILE_PREPARE
belongs to PREPARE section, which means that the CPU subjected to
profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of
use-after-free bug) because interrupt for that CPU is disabled during
PREPARE section. Therefore, this test is guaranteed to be true, and
can be removed. (Since profile_hits() checks prof_buffer != NULL, we
don't need to check prof_buffer != NULL here unless get_irq_regs() or
user_mode() is such slow that we want to avoid when prof_buffer == NULL).
do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But syzbot is also reporting that sometimes
do_profile_hits() is called while current thread is still doing vzalloc(),
where prof_buffer must be NULL at this moment. This indicates that multiple
threads concurrently tried to write to /sys/kernel/profiling interface,
which caused that somebody else try to re-allocate prof_buffer despite
somebody has already allocated prof_buffer. Fix this by using
serialization.
Filipe Manana [Fri, 26 Jul 2024 10:12:52 +0000 (11:12 +0100)]
btrfs: fix corruption after buffer fault in during direct IO append write
During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.
The problem happens like this:
1) We have an empty file, with size 0, for example;
2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
buffer is not currently faulted in;
3) We enter btrfs_direct_write(), lock the inode and call
generic_write_checks(), which calls generic_write_checks_count(), and
that function sets the iocb position to 0 with the following code:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
4) We call btrfs_dio_write() and enter into iomap, which will end up
calling btrfs_dio_iomap_begin() and that calls
btrfs_get_blocks_direct_write(), where we update the i_size of the
inode to 4096 bytes;
5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
the page of the write input buffer (at iomap_dio_bio_iter(), with a
call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
fault in the write buffer and then goto to the label 'relock';
7) We lock again the inode, do all the necessary checks again and call
again generic_write_checks(), which calls generic_write_checks_count()
again, and there we set the iocb's position to 4K, which is the current
i_size of the inode, with the following code pointed above:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
8) Then we go again to btrfs_dio_write() and enter iomap and the write
succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
the [0, 4K) range and an i_size of 8K, which goes against the
expectations of having the data written to the range [0, 4K) and get an
i_size of 4K.
Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").
A sample reproducer provided by a reporter is the following:
Naohiro Aota [Wed, 15 Feb 2023 00:18:02 +0000 (09:18 +0900)]
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.
Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.
This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.
Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.
Boris Burkov [Mon, 22 Jul 2024 23:49:45 +0000 (16:49 -0700)]
btrfs: make cow_file_range_inline() honor locked_page on error
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:
I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.
I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:
folio_unlock+5
__process_pages_contig+475
cow_file_range_inline.constprop.0+230
cow_file_range+803
btrfs_run_delalloc_range+566
writepage_delalloc+332
__extent_writepage # inlined in my stacktrace, but I added it here
extent_write_cache_pages+622
Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.
Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.
ice_cfg_txq_interrupt() internally handles XDP Tx ring. Do not use
ice_for_each_tx_ring() in ice_qvec_cfg_msix() as this causing us to
treat XDP ring that belongs to queue vector as Tx ring and therefore
misconfiguring the interrupts.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog
It is read by data path and modified from process context on remote cpu
so it is needed to use WRITE_ONCE to clear the pointer.
Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
xsk_buff_pool pointers that ice ring structs hold are updated via
ndo_bpf that is executed in process context while it can be read by
remote CPU at the same time within NAPI poll. Use synchronize_net()
after pointer update and {READ,WRITE}_ONCE() when working with mentioned
pointer.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
ice: toggle netif_carrier when setting up XSK pool
This so we prevent Tx timeout issues. One of conditions checked on
running in the background dev_watchdog() is netif_carrier_ok(), so let
us turn it off when we disable the queues that belong to a q_vector
where XSK pool is being configured. Turn carrier on in ice_qp_ena()
only when ice_get_link_status() tells us that physical link is up.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
ice: modify error handling when setting XSK pool in ndo_bpf
Don't bail out right when spotting an error within ice_qp_{dis,ena}()
but rather track error and go through whole flow of disabling and
enabling queue pair.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Given that ice_qp_dis() is called under rtnl_lock, synchronize_net() can
be called instead of synchronize_rcu() so that XDP rings can finish its
job in a faster way. Also let us do this as earlier in XSK queue disable
flow.
Additionally, turn off regular Tx queue before disabling irqs and NAPI.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
ice: don't busy wait for Rx queue disable in ice_qp_dis()
When ice driver is spammed with multiple xdpsock instances and flow
control is enabled, there are cases when Rx queue gets stuck and unable
to reflect the disable state in QRX_CTRL register. Similar issue has
previously been addressed in commit 13a6233b033f ("ice: Add support to
enable/disable all Rx queues before waiting").
To workaround this, let us simply not wait for a disabled state as later
patch will make sure that regardless of the encountered error in the
process of disabling a queue pair, the Rx queue will be enabled.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Michal Kubiak [Fri, 26 Jul 2024 18:17:09 +0000 (20:17 +0200)]
ice: respect netif readiness in AF_XDP ZC related ndo's
Address a scenario in which XSK ZC Tx produces descriptors to XDP Tx
ring when link is either not yet fully initialized or process of
stopping the netdev has already started. To avoid this, add checks
against carrier readiness in ice_xsk_wakeup() and in ice_xmit_zc().
One could argue that bailing out early in ice_xsk_wakeup() would be
sufficient but given the fact that we produce Tx descriptors on behalf
of NAPI that is triggered for Rx traffic, the latter is also needed.
Bringing link up is an asynchronous event executed within
ice_service_task so even though interface has been brought up there is
still a time frame where link is not yet ok.
Without this patch, when AF_XDP ZC Tx is used simultaneously with stack
Tx, Tx timeouts occur after going through link flap (admin brings
interface down then up again). HW seem to be unable to transmit
descriptor to the wire after HW tail register bump which in turn causes
bit __QUEUE_STATE_STACK_XOFF to be set forever as
netdev_tx_completed_queue() sees no cleaned bytes on the input.
Fixes: 126cdfe1007a ("ice: xsk: Improve AF_XDP ZC Tx and use batching API") Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Michal Kubiak <[email protected]> Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
David S. Miller [Mon, 29 Jul 2024 12:31:28 +0000 (13:31 +0100)]
Merge branch 'mptcp-endpoint-readd-fixes' into main
Matthieu Baerts says:
====================
mptcp: fix signal endpoint readd
Issue #501 [1] showed that the Netlink PM currently doesn't correctly
support removal and re-add of signal endpoints.
Patches 1 and 2 address the issue: the first one in the userspace path-
manager, introduced in v5.19 ; and the second one in the in-kernel path-
manager, introduced in v5.7.
Patch 3 introduces a related selftest. There is no 'Fixes' tag, because
it might be hard to backport it automatically, as missing helpers in
Bash will not be caught when compiling the kernel or the selftests.
The last two patches address two small issues in the MPTCP selftests,
one introduced in v6.6., and the other one in v5.17.
Paolo Abeni [Sat, 27 Jul 2024 09:04:00 +0000 (11:04 +0200)]
mptcp: fix NL PM announced address accounting
Currently the per connection announced address counter is never
decreased. As a consequence, after connection establishment, if
the NL PM deletes an endpoint and adds a new/different one, no
additional subflow is created for the new endpoint even if the
current limits allow that.
Address the issue properly updating the signaled address counter
every time the NL PM removes such addresses.
Paolo Abeni [Sat, 27 Jul 2024 09:03:59 +0000 (11:03 +0200)]
mptcp: fix user-space PM announced address accounting
Currently the per-connection announced address counter is never
decreased. When the user-space PM is in use, this just affect
the information exposed via diag/sockopt, but it could still foul
the PM to wrong decision.
Add the missing accounting for the user-space PM's sake.
rtnetlink: Don't ignore IFLA_TARGET_NETNSID when ifname is specified in rtnl_dellink().
The cited commit accidentally replaced tgt_net with net in rtnl_dellink().
As a result, IFLA_TARGET_NETNSID is ignored if the interface is specified
with IFLA_IFNAME or IFLA_ALT_IFNAME.
Let's pass tgt_net to rtnl_dev_get().
Fixes: cc6090e985d7 ("net: rtnetlink: introduce helper to get net_device instance by ifname") Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Andy Chiu [Fri, 26 Jul 2024 07:06:50 +0000 (15:06 +0800)]
net: axienet: start napi before enabling Rx/Tx
softirq may get lost if an Rx interrupt comes before we call
napi_enable. Move napi_enable in front of axienet_setoptions(), which
turns on the device, to address the issue.
tcp: Adjust clamping window for applications specifying SO_RCVBUF
tp->scaling_ratio is not updated based on skb->len/skb->truesize once
SO_RCVBUF is set leading to the maximum window scaling to be 25% of
rcvbuf after
commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
and 50% of rcvbuf after
commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio").
50% tries to emulate the behavior of older kernels using
sysctl_tcp_adv_win_scale with default value.
Systems which were using a different values of sysctl_tcp_adv_win_scale
in older kernels ended up seeing reduced download speeds in certain
cases as covered in https://lists.openwall.net/netdev/2024/05/15/13
While the sysctl scheme is no longer acceptable, the value of 50% is
a bit conservative when the skb->len/skb->truesize ratio is later
determined to be ~0.66.
Applications not specifying SO_RCVBUF update the window scaling and
the receiver buffer every time data is copied to userspace. This
computation is now used for applications setting SO_RCVBUF to update
the maximum window scaling while ensuring that the receive buffer
is within the application specified limit.
Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") Signed-off-by: Sean Tranchetti <[email protected]> Signed-off-by: Subash Abhinov Kasiviswanathan <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Mon, 29 Jul 2024 09:59:08 +0000 (10:59 +0100)]
Merge branch 'ethtool-rss-fixes' into main
Jakub Kicinski says;
====================
ethtool: more RSS fixes
More fixes for RSS setting. First two patches fix my own bugs
in bnxt conversion to the new API. The third patch fixes
what seems to be a 10 year old issue (present since the Linux
RSS API was created). Fourth patch fixes an issue with
the XArray state being out of sync. And then a small test.
====================
Jakub Kicinski [Thu, 25 Jul 2024 22:23:52 +0000 (15:23 -0700)]
ethtool: fix the state of additional contexts with old API
We expect drivers implementing the new create/modify/destroy
API to populate the defaults in struct ethtool_rxfh_context.
In legacy API ctx isn't even passed, and rxfh.indir / rxfh.key
are NULL so drivers can't give us defaults even if they want to.
Call get_rxfh() to fetch the values. We can reuse rxfh_dev
for the get_rxfh(), rxfh stores the input from the user.
This fixes IOCTL reporting 0s instead of the default key /
indir table for drivers using legacy API.
Add a check to try to catch drivers using the new API
but not populating the key.
Fixes: 7964e7884643 ("net: ethtool: use the tracking array for get_rxfh on custom RSS contexts") Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jakub Kicinski [Thu, 25 Jul 2024 22:23:51 +0000 (15:23 -0700)]
ethtool: fix setting key and resetting indir at once
The indirection table and the key follow struct ethtool_rxfh
in user memory.
To reset the indirection table user space calls SET_RXFH with
table of size 0 (OTOH to say "no change" it should use -1 / ~0).
The logic for calculating the offset where they key sits is
incorrect in this case, as kernel would still offset by the full
table length, while for the reset there is no indir table and
key is immediately after the struct.
Fixes: 3de0b592394d ("ethtool: Support for configurable RSS hash key") Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Fixes: 7964e7884643 ("net: ethtool: use the tracking array for get_rxfh on custom RSS contexts") Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Pavan Chebbi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jakub Kicinski [Thu, 25 Jul 2024 22:23:49 +0000 (15:23 -0700)]
eth: bnxt: reject unsupported hash functions
In commit under Fixes I split the bnxt_set_rxfh_context() function,
and attached the appropriate chunks to new ops. I missed that
bnxt_set_rxfh_context() gets called after some initial checks
in bnxt_set_rxfh(), namely that the hash function is Toeplitz.
Fixes: 5c466b4d4e75 ("eth: bnxt: move from .set_rxfh to .create_rxfh_context and friends") Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Pavan Chebbi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jeongjun Park [Thu, 25 Jul 2024 21:40:49 +0000 (06:40 +0900)]
tun: Add missing bpf_net_ctx_clear() in do_xdp_generic()
There are cases where do_xdp_generic returns bpf_net_context without
clearing it. This causes various memory corruptions, so the missing
bpf_net_ctx_clear must be added.
Now that we no longer have any C constant expression contexts (ie array
size declarations or static initializers) that use min() or max(), we
can simpify the implementation by not having to worry about the result
staying as a C constant expression.
So now we can unconditionally just use temporary variables of the right
type, and get rid of the excessive expansion that used to come from the
use of
__builtin_choose_expr(__is_constexpr(...), ..
to pick the specialized code for constant expressions.
Another expansion simplification is to pass the temporary variables (in
addition to the original expression) to our __types_ok() macro. That
may superficially look like it complicates the macro, but when we only
want the type of the expression, expanding the temporary variable names
is much simpler and smaller than expanding the potentially complicated
original expression.
As a result, on my machine, doing a
$ time make drivers/staging/media/atomisp/pci/isp/kernels/ynr/ynr_1.0/ia_css_ynr.host.i
goes from
real 0m16.621s
user 0m15.360s
sys 0m1.221s
to
real 0m2.532s
user 0m2.091s
sys 0m0.452s
because the token expansion goes down dramatically.
In particular, the longest line expansion (which was line 71 of that
'ia_css_ynr.host.c' file) shrinks from 23,338kB (yes, 23MB for one
single line) to "just" 1,444kB (now "only" 1.4MB).
And yes, that line is still the line from hell, because it's doing
multiple levels of "min()/max()" expansion thanks to some of them being
hidden inside the uDIGIT_FITTING() macro.
Lorenzo has a nice cleanup patch that makes that driver use inline
functions instead of macros for sDIGIT_FITTING() and uDIGIT_FITTING(),
which will fix that line once and for all, but the 16-fold reduction in
this case does show why we need to simplify these helpers.
minmax: don't use max() in situations that want a C constant expression
We only had a couple of array[] declarations, and changing them to just
use 'MAX()' instead of 'max()' fixes the issue.
This will allow us to simplify our min/max macros enormously, since they
can now unconditionally use temporary variables to avoid using the
argument values multiple times.
While working on simplifying the minmax functions, and avoiding
excessive macro expansion, it turns out that the sr.c use of the
'clamp()' macro has the arguments the wrong way around.
The clamp logic is
val = clamp(in, low, high);
and it returns the input clamped to the low/high limits. But sr.c ddid
speed = clamp(0, speed, 0xffff / 177);
which clamps the value '0' to the range '[speed, 0xffff / 177]' and ends
up being nonsensical.
minmax: make generic MIN() and MAX() macros available everywhere
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics. The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.
These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:
- trivial duplicated macro definitions have been removed
Note that 'trivial' here means that it's obviously kernel code that
already included all the major kernel headers, and thus gets the new
generic MIN/MAX macros automatically.
- non-trivial duplicated macro definitions are guarded with #ifndef
This is the "yes, they define their own versions, but no, the include
situation is not entirely obvious, and maybe they don't get the
generic version automatically" case.
- strange use case #1
A couple of drivers decided that the way they want to describe their
versioning is with
#define MAJ 1
#define MIN 2
#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
which adds zero value and I just did my Alexander the Great
impersonation, and rewrote that pointless Gordian knot as
#define DRV_VERSION "1.2"
instead.
- strange use case #2
A couple of drivers thought that it's a good idea to have a random
'MIN' or 'MAX' define for a value or index into a table, rather than
the traditional macro that takes arguments.
These values were re-written as C enum's instead. The new
function-line macros only expand when followed by an open
parenthesis, and thus don't clash with enum use.
Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.
Merge tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix RPM package build error caused by an incorrect locale setup
- Mark modules.weakdep as ghost in RPM package
- Fix the odd combination of -S and -c in stack protector scripts,
which is an error with the latest Clang
* tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: Fix '-S -c' in x86 stack protector scripts
kbuild: rpm-pkg: ghost modules.weakdep file
kbuild: rpm-pkg: Fix C locale setup
minmax: simplify and clarify min_t()/max_t() implementation
This simplifies the min_t() and max_t() macros by no longer making them
work in the context of a C constant expression.
That means that you can no longer use them for static initializers or
for array sizes in type definitions, but there were only a couple of
such uses, and all of them were converted (famous last words) to use
MIN_T/MAX_T instead.
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.
The complexity of those macros stems from two issues:
(a) trying to use them in situations that require a C constant
expression (in static initializers and for array sizes)
(b) the type sanity checking
and MIN_T/MAX_T avoids both of these issues.
Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.
But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.
However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.
This does exactly that.
Which in turn will then allow for much simpler implementations of
min_t()/max_t(). All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.
We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.