In bpf_timer_cancel_and_free, this patch frees the timer->timer
after a rcu grace period. This requires a rcu_head addition
to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init,
this does not need a kfree_rcu because it is still under the
spin_lock and timer->timer has not been visible by others yet.
In bpf_timer_cancel, rcu_read_lock() is added because this helper
can be used in a non rcu critical section context (e.g. from
a sleepable bpf prog). Other timer->timer usages in helpers.c
have been audited, bpf_timer_cancel() is the only place where
timer->timer is used outside of the spin_lock.
Another solution considered is to mark a t->flag in bpf_timer_cancel
and clear it after hrtimer_cancel() is done. In bpf_timer_cancel_and_free,
it busy waits for the flag to be cleared before kfree(t). This patch
goes with a straight forward solution and frees timer->timer after
a rcu grace period.
As reported by syzboot [1] and [2], when trying to read vsyscall page
by using bpf_probe_read_kernel() or bpf_probe_read(), oops may happen.
Thomas Gleixner had proposed a test patch [3], but it seems that no
formal patch is posted after about one month [4], so I post it instead
and add an Originally-by tag in patch #2.
Patch #1 makes is_vsyscall_vaddr() being a common helper. Patch #2 fixes
the problem by disallowing vsyscall page read for
copy_from_kernel_nofault(). Patch #3 adds one test case to ensure the
read of vsyscall page through bpf is rejected. Please see individual
patches for more details.
Change Log:
v3:
* rephrase commit message for patch #1 & #2 (Sohil)
* reword comments in copy_from_kernel_nofault_allowed() (Sohil)
* add Rvb tag for patch #1 and Acked-by tag for patch #3 (Sohil, Yonghong)
v2: https://lore.kernel.org/bpf/20240126115423.3943360[email protected]/
* move is_vsyscall_vaddr to asm/vsyscall.h instead (Sohil)
* elaborate on the reason for disallowing of vsyscall page read in
copy_from_kernel_nofault_allowed() (Sohil)
* update the commit message of patch #2 to more clearly explain how
the oops occurs. (Sohil)
* update the commit message of patch #3 to explain the expected return
values of various bpf helpers (Yonghong)
Hou Tao [Fri, 2 Feb 2024 10:39:35 +0000 (18:39 +0800)]
selftest/bpf: Test the read of vsyscall page under x86-64
Under x86-64, when using bpf_probe_read_kernel{_str}() or
bpf_probe_read{_str}() to read vsyscall page, the read may trigger oops,
so add one test case to ensure that the problem is fixed. Beside those
four bpf helpers mentioned above, testing the read of vsyscall page by
using bpf_probe_read_user{_str} and bpf_copy_from_user{_task}() as well.
The test case passes the address of vsyscall page to these six helpers
and checks whether the returned values are expected:
1) For bpf_probe_read_kernel{_str}()/bpf_probe_read{_str}(), the
expected return value is -ERANGE as shown below:
The occurrence of oops depends on the availability of CPU SMAP [1]
feature and there are three possible configurations of vsyscall page in
the boot cmd-line: vsyscall={xonly|none|emulate}, so there are a total
of six possible combinations. Under all these combinations, the test
case runs successfully.
1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
page and invokes copy_from_kernel_nofault() which in turn calls
__get_user_asm().
2) Because the vsyscall page address is not readable from kernel space,
a page fault exception is triggered accordingly.
3) handle_page_fault() considers the vsyscall page address as a user
space address instead of a kernel space address. This results in the
fix-up setup by bpf not being applied and a page_fault_oops() is invoked
due to SMAP.
Considering handle_page_fault() has already considered the vsyscall page
address as a userspace address, fix the problem by disallowing vsyscall
page read for copy_from_kernel_nofault().
The bpf_doc script refers to the GPL as the "GNU Privacy License".
I strongly suspect that the author wanted to refer to the GNU General
Public License, under which the Linux kernel is released, as, to the
best of my knowledge, there is no license named "GNU Privacy License".
This patch corrects the license name in the script accordingly.
xsk_build_skb() allocates a page and adds it to the skb via
skb_add_rx_frag() and specifies 0 for truesize. This leads to a warning
in skb_add_rx_frag() with CONFIG_DEBUG_NET enabled because size is
larger than truesize.
Increasing truesize requires to add the same amount to socket's
sk_wmem_alloc counter in order not to underflow the counter during
release in the destructor (sock_wfree()).
Pass the size of the allocated page as truesize to skb_add_rx_frag().
Add this mount to socket's sk_wmem_alloc counter.
Fedor Pchelkin [Thu, 25 Jan 2024 09:53:09 +0000 (12:53 +0300)]
nfc: nci: free rx_data_reassembly skb on NCI device cleanup
rx_data_reassembly skb is stored during NCI data exchange for processing
fragmented packets. It is dropped only when the last fragment is processed
or when an NTF packet with NCI_OP_RF_DEACTIVATE_NTF opcode is received.
However, the NCI device may be deallocated before that which leads to skb
leak.
As by design the rx_data_reassembly skb is bound to the NCI device and
nothing prevents the device to be freed before the skb is processed in
some way and cleaned, free it on the NCI device cleanup.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
net: hsr: remove WARN_ONCE() in send_hsr_supervision_frame()
Syzkaller reported [1] hitting a warning after failing to allocate
resources for skb in hsr_init_skb(). Since a WARN_ONCE() call will
not help much in this case, it might be prudent to switch to
netdev_warn_once(). At the very least it will suppress syzkaller
reports such as [1].
Just in case, use netdev_warn_once() in send_prp_supervision_frame()
for similar reasons.
Horatiu Vultur [Wed, 24 Jan 2024 10:17:58 +0000 (11:17 +0100)]
net: lan966x: Fix port configuration when using SGMII interface
In case the interface between the MAC and the PHY is SGMII, then the bit
GIGA_MODE on the MAC side needs to be set regardless of the speed at
which it is running.
Fixes: d28d6d2e37d1 ("net: lan966x: add port module support") Signed-off-by: Horatiu Vultur <[email protected]> Reviewed-by: Maxime Chevallier <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Simon Horman [Wed, 24 Jan 2024 09:16:22 +0000 (09:16 +0000)]
MAINTAINERS: Add connector headers to NETWORKING DRIVERS
Commit 46cf789b68b2 ("connector: Move maintainence under networking
drivers umbrella.") moved the connector maintenance but did not include
the connector header files.
It seems that it has always been implied that these headers were
maintained along with the rest of the connector code, both before and
after the cited commit. Make this explicit.
The original packet in ipmr_cache_report() may be queued and then forwarded
with ip_mr_forward(). This last function has the assumption that the skb
dst is set.
After the below commit, the skb dst is dropped by ipv4_pktinfo_prepare(),
which causes the oops.
Daniel Golle [Wed, 24 Jan 2024 05:17:25 +0000 (05:17 +0000)]
net: dsa: mt7530: fix 10M/100M speed on MT7988 switch
Setup PMCR port register for actual speed and duplex on internally
connected PHYs of the MT7988 built-in switch. This fixes links with
speeds other than 1000M.
Paolo Abeni [Thu, 25 Jan 2024 18:09:06 +0000 (19:09 +0100)]
selftests: net: give more time for GRO aggregation
The gro.sh test-case relay on the gro_flush_timeout to ensure
that all the segments belonging to any given batch are properly
aggregated.
The other end, the sender is a user-space program transmitting
each packet with a separate write syscall. A busy host and/or
stracing the sender program can make the relevant segments reach
the GRO engine after the flush timeout triggers.
Give the GRO flush timeout more slack, to avoid sporadic self-tests
failures.
Paolo Abeni [Thu, 25 Jan 2024 08:22:50 +0000 (09:22 +0100)]
selftests: net: add missing required classifier
the udpgro_fraglist self-test uses the BPF classifiers, but the
current net self-test configuration does not include it, causing
CI failures:
# selftests: net: udpgro_frglist.sh
# ipv6
# tcp - over veth touching data
# -l 4 -6 -D 2001:db8::1 -t rx -4 -t
# Error: TC classifier not found.
# We have an error talking to the kernel
# Error: TC classifier not found.
# We have an error talking to the kernel
Breno Leitao [Thu, 25 Jan 2024 13:41:03 +0000 (05:41 -0800)]
bnxt_en: Make PTP timestamp HWRM more silent
commit 056bce63c469 ("bnxt_en: Make PTP TX timestamp HWRM query silent")
changed a netdev_err() to netdev_WARN_ONCE().
netdev_WARN_ONCE() is it generates a kernel WARNING, which is bad, for
the following reasons:
* You do not a kernel warning if the firmware queries are late
* In busy networks, timestamp query failures fairly regularly
* A WARNING message doesn't bring much value, since the code path
is clear.
(This was discussed in-depth in [1])
Transform the netdev_WARN_ONCE() into a netdev_warn_once(), and print a
more well-behaved message, instead of a full WARN().
Wen Gu [Thu, 25 Jan 2024 12:39:16 +0000 (20:39 +0800)]
net/smc: fix incorrect SMC-D link group matching logic
The logic to determine if SMC-D link group matches is incorrect. The
correct logic should be that it only returns true when the GID is the
same, and the SMC-D device is the same and the extended GID is the same
(in the case of virtual ISM).
It can be fixed by adding brackets around the conditional (or ternary)
operator expression. But for better readability and maintainability, it
has been changed to an if-else statement.
Paolo Abeni [Fri, 26 Jan 2024 15:32:36 +0000 (16:32 +0100)]
selftests: net: add missing config for big tcp tests
The big_tcp test-case requires a few kernel knobs currently
not specified in the net selftests config, causing the
following failure:
# selftests: net: big_tcp.sh
# Error: Failed to load TC action module.
# We have an error talking to the kernel
...
# Testing for BIG TCP:
# CLI GSO | GW GRO | GW GSO | SER GRO
# ./big_tcp.sh: line 107: test: !=: unary operator expected
...
# on on on on : [FAIL_on_link1]
Daniel Golle [Wed, 24 Jan 2024 05:18:23 +0000 (05:18 +0000)]
net: phy: mediatek-ge-soc: sync driver with MediaTek SDK
Sync initialization and calibration routines with MediaTek's reference
driver. Improves compliance and resolves link stability issues with
CH340 IoT devices connected to MT798x built-in PHYs.
====================
nfp: flower: a few small conntrack offload fixes
This small series addresses two bugs in the nfp conntrack offloading
code.
The first patch is a check to prevent offloading for a case which is
currently not supported by the nfp.
The second patch fixes up parsing of layer4 mangling code so it can be
correctly offloaded. Since the masks are an inverse mask and we are
shifting it so it can be packed together with the destination we
effectively need to 'clear' the lower bits of the mask by setting it to
0xFFFF.
====================
Hui Zhou [Wed, 24 Jan 2024 15:19:09 +0000 (17:19 +0200)]
nfp: flower: fix hardware offload for the transfer layer port
The nfp driver will merge the tp source port and tp destination port
into one dword which the offset must be zero to do hardware offload.
However, the mangle action for the tp source port and tp destination
port is separated for tc ct action. Modify the mangle action for the
FLOW_ACT_MANGLE_HDR_TYPE_TCP and FLOW_ACT_MANGLE_HDR_TYPE_UDP to
satisfy the nfp driver offload check for the tp port.
The mangle action provides a 4B value for source, and a 4B value for
the destination, but only 2B of each contains the useful information.
For offload the 2B of each is combined into a single 4B word. Since the
incoming mask for the source is '0xFFFF<mask>' the shift-left will
throw away the 0xFFFF part. When this gets combined together in the
offload it will clear the destination field. Fix this by setting the
lower bits back to 0xFFFF, effectively doing a rotate-left operation on
the mask.
Hui Zhou [Wed, 24 Jan 2024 15:19:08 +0000 (17:19 +0200)]
nfp: flower: add hardware offload check for post ct entry
The nfp offload flow pay will not allocate a mask id when the out port
is openvswitch internal port. This is because these flows are used to
configure the pre_tun table and are never actually send to the firmware
as an add-flow message. When a tc rule which action contains ct and
the post ct entry's out port is openvswitch internal port, the merge
offload flow pay with the wrong mask id of 0 will be send to the
firmware. Actually, the nfp can not support hardware offload for this
situation, so return EOPNOTSUPP.
Hangbin Liu [Wed, 24 Jan 2024 06:13:44 +0000 (14:13 +0800)]
selftests/net/lib: update busywait timeout value
The busywait timeout value is a millisecond, not a second. So the
current setting 2 is too small. On slow/busy host (or VMs) the
current timeout can expire even on "correct" execution, causing random
failures. Let's copy the WAIT_TIMEOUT from forwarding/lib.sh and set
BUSYWAIT_TIMEOUT here.
Jakub Kicinski [Wed, 24 Jan 2024 23:36:30 +0000 (15:36 -0800)]
selftests: tcp_ao: set the timeout to 2 minutes
The default timeout for tests is 45sec, bench-lookups_ipv6
seems to take around 50sec when running in a VM without
HW acceleration. Give it a 2x margin and set the timeout
to 120sec.
Jakub Kicinski [Thu, 25 Jan 2024 23:59:24 +0000 (15:59 -0800)]
Merge branch 'selftests-net-a-few-fixes'
Paolo Abeni says:
====================
selftests: net: a few fixes
This series address self-tests failures for udp gro-related tests.
The first patch addresses the main problem I observe locally - the XDP
program required by such tests, xdp_dummy, is currently build in the
ebpf self-tests directory, not available if/when the user targets net
only. Arguably is more a refactor than a fix, but still targeting net
to hopefully
The second patch fixes the integration of such tests with the build
system.
Paolo Abeni [Wed, 24 Jan 2024 21:33:21 +0000 (22:33 +0100)]
selftests: net: included needed helper in the install targets
The blamed commit below introduce a dependency in some net self-tests
towards a newly introduce helper script.
Such script is currently not included into the TEST_PROGS_EXTENDED list
and thus is not installed, causing failure for the relevant tests when
executed from the install dir.
Paolo Abeni [Wed, 24 Jan 2024 21:33:20 +0000 (22:33 +0100)]
selftests: net: remove dependency on ebpf tests
Several net tests requires an XDP program build under the ebpf
directory, and error out if such program is not available.
That makes running successful net test hard, let's duplicate into the
net dir the [very small] program, re-using the existing rules to build
it, and finally dropping the bogus dependency.
Linus Torvalds [Thu, 25 Jan 2024 18:58:35 +0000 (10:58 -0800)]
Merge tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf, netfilter and WiFi.
Jakub is doing a lot of work to include the self-tests in our CI, as a
result a significant amount of self-tests related fixes is flowing in
(and will likely continue in the next few weeks).
Current release - regressions:
- bpf: fix a kernel crash for the riscv 64 JIT
- bnxt_en: fix memory leak in bnxt_hwrm_get_rings()
- revert "net: macsec: use skb_ensure_writable_head_tail to expand
the skb"
Previous releases - regressions:
- core: fix removing a namespace with conflicting altnames
- tcp:
- make sure init the accept_queue's spinlocks once
- fix autocork on CPUs with weak memory model
- udp: fix busy polling
- mlx5e:
- fix out-of-bound read in port timestamping
- fix peer flow lists corruption
- iwlwifi: fix a memory corruption
Previous releases - always broken:
- netfilter:
- nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress
basechain
- nft_limit: reject configurations that cause integer overflow
- bpf: fix bpf_xdp_adjust_tail() with XSK zero-copy mbuf, avoiding a
NULL pointer dereference upon shrinking
- llc: make llc_ui_sendmsg() more robust against bonding changes
- smc: fix illegal rmb_desc access in SMC-D connection dump
- dpll: fix pin dump crash for rebound module
- bnxt_en: fix possible crash after creating sw mqprio TCs
- hv_netvsc: calculate correct ring size when PAGE_SIZE is not 4kB
Misc:
- several self-tests fixes for better integration with the netdev CI
- added several missing modules descriptions"
* tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring
tsnep: Remove FCS for XDP data path
net: fec: fix the unhandled context fault from smmu
selftests: bonding: do not test arp/ns target with mode balance-alb/tlb
fjes: fix memleaks in fjes_hw_setup
i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
i40e: set xdp_rxq_info::frag_size
xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
ice: remove redundant xdp_rxq_info registration
i40e: handle multi-buffer packets that are shrunk by xdp prog
ice: work on pre-XDP prog frag count
xsk: fix usage of multi-buffer BPF helpers for ZC XDP
xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
xsk: recycle buffer in case Rx queue was full
net: fill in MODULE_DESCRIPTION()s for rvu_mbox
net: fill in MODULE_DESCRIPTION()s for litex
net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio
net: fill in MODULE_DESCRIPTION()s for fec
...
Linus Torvalds [Thu, 25 Jan 2024 18:52:30 +0000 (10:52 -0800)]
Merge tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs fix from Amir Goldstein:
"Change the on-disk format for the new "xwhiteouts" feature introduced
in v6.7
The change reduces unneeded overhead of an extra getxattr per readdir.
The only user of the "xwhiteout" feature is the external composefs
tool, which has been updated to support the new on-disk format.
This change is also designated for 6.7.y"
* tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: mark xwhiteouts directory with overlay.opaque='x'
Linus Torvalds [Thu, 25 Jan 2024 18:41:29 +0000 (10:41 -0800)]
Merge tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs fixes from Christian Brauner:
"This contains various fixes for the netfs work merged earlier this
cycle:
afs:
- Fix locking imbalance in afs_proc_addr_prefs_show()
- Remove afs_dynroot_d_revalidate() which is redundant
- Fix error handling during lookup
- Hide sillyrenames from userspace. This fixes a race between
silly-rename files being created/removed and userspace iterating
over directory entries
- Don't use unnecessary folio_*() functions
cifs:
- Don't use unnecessary folio_*() functions
cachefiles:
- erofs: Fix Null dereference when cachefiles are not doing
ondemand-mode
- Update mailing list
netfs library:
- Add Jeff Layton as reviewer
- Update mailing list
- Fix a error checking in netfs_perform_write()
- fscache: Check error before dereferencing
- Don't use unnecessary folio_*() functions"
* tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Fix missing/incorrect unlocking of RCU read lock
afs: Remove afs_dynroot_d_revalidate() as it is redundant
afs: Fix error handling with lookup via FS.InlineBulkStatus
afs: Hide silly-rename files from userspace
cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode
netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()
netfs, fscache: Prevent Oops in fscache_put_cache()
cifs: Don't use certain unnecessary folio_*() functions
afs: Don't use certain unnecessary folio_*() functions
netfs: Don't use certain unnecessary folio_*() functions
netfs: Add Jeff Layton as reviewer
netfs, cachefiles: Change mailing list
Linus Torvalds [Thu, 25 Jan 2024 18:26:52 +0000 (10:26 -0800)]
Merge tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix in-kernel RPC UDP transport
- Fix NFSv4.0 RELEASE_LOCKOWNER
* tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix RELEASE_LOCKOWNER
SUNRPC: use request size to initialize bio_vec in svc_udp_sendto()
Linus Torvalds [Thu, 25 Jan 2024 18:21:21 +0000 (10:21 -0800)]
Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux
Pull RCU fix from Neeraj Upadhyay:
"This fixes RCU grace period stalls, which are observed when an
outgoing CPU's quiescent state reporting results in wakeup of one of
the grace period kthreads, to complete the grace period.
If those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU.
Earlier migration of the hrtimers from the CPU introduced in commit 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU
earlier") results in this timer getting ignored.
If the RCU grace period kthreads are waiting for RT bandwidth to be
available, they may never be actually scheduled, resulting in RCU
stall warnings"
* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
rcu: Defer RCU kthreads wakeup when CPU is dying
Gerhard Engleder [Tue, 23 Jan 2024 20:09:18 +0000 (21:09 +0100)]
tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring
The fill ring of the XDP socket may contain not enough buffers to
completey fill the RX queue during socket creation. In this case the
flag XDP_RING_NEED_WAKEUP is not set as this flag is only set if the RX
queue is not completely filled during polling.
Set XDP_RING_NEED_WAKEUP flag also if RX queue is not completely filled
during XDP socket creation.
Fixes: 3fc2333933fd ("tsnep: Add XDP socket zero-copy RX support") Signed-off-by: Gerhard Engleder <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
Gerhard Engleder [Tue, 23 Jan 2024 20:09:17 +0000 (21:09 +0100)]
tsnep: Remove FCS for XDP data path
The RX data buffer includes the FCS. The FCS is already stripped for the
normal data path. But for the XDP data path the FCS is included and
acts like additional/useless data.
Remove the FCS from the RX data buffer also for XDP.
Paolo Abeni [Thu, 25 Jan 2024 10:42:27 +0000 (11:42 +0100)]
Merge tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2024-01-24
This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
* tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: fix a potential double-free in fs_any_create_groups
net/mlx5e: fix a double-free in arfs_create_groups
net/mlx5e: Ignore IPsec replay window values on sender side
net/mlx5e: Allow software parsing when IPsec crypto is enabled
net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO
net/mlx5: DR, Can't go to uplink vport on RX rule
net/mlx5: DR, Use the right GVMI number for drop action
net/mlx5: Bridge, fix multicast packets sent to uplink
net/mlx5: Fix a WARN upon a callback command failure
net/mlx5e: Fix peer flow lists handling
net/mlx5e: Fix inconsistent hairpin RQT sizes
net/mlx5e: Fix operation precedence bug in port timestamping napi_poll context
net/mlx5: Fix query of sd_group field
net/mlx5e: Use the correct lag ports number when creating TISes
====================
Paolo Abeni [Thu, 25 Jan 2024 10:30:31 +0000 (11:30 +0100)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-01-25
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 2 day(s) which contain
a total of 13 files changed, 190 insertions(+), 91 deletions(-).
The main changes are:
1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which
support XDP multi-buffer. The former triggered a NULL pointer
dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar.
2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and
epilogue for struct_ops programs, from Pu Lehui.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
i40e: set xdp_rxq_info::frag_size
xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
ice: remove redundant xdp_rxq_info registration
i40e: handle multi-buffer packets that are shrunk by xdp prog
ice: work on pre-XDP prog frag count
xsk: fix usage of multi-buffer BPF helpers for ZC XDP
xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
xsk: recycle buffer in case Rx queue was full
riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops
====================
Shenwei Wang [Tue, 23 Jan 2024 16:51:41 +0000 (10:51 -0600)]
net: fec: fix the unhandled context fault from smmu
When repeatedly changing the interface link speed using the command below:
ethtool -s eth0 speed 100 duplex full
ethtool -s eth0 speed 1000 duplex full
The following errors may sometimes be reported by the ARM SMMU driver:
[ 5395.035364] fec 5b040000.ethernet eth0: Link is Down
[ 5395.039255] arm-smmu 51400000.iommu: Unhandled context fault:
fsr=0x402, iova=0x00000000, fsynr=0x100001, cbfrsynra=0x852, cb=2
[ 5398.108460] fec 5b040000.ethernet eth0: Link is Up - 100Mbps/Full -
flow control off
It is identified that the FEC driver does not properly stop the TX queue
during the link speed transitions, and this results in the invalid virtual
I/O address translations from the SMMU and causes the context faults.
Hangbin Liu [Tue, 23 Jan 2024 07:59:17 +0000 (15:59 +0800)]
selftests: bonding: do not test arp/ns target with mode balance-alb/tlb
The prio_arp/ns tests hard code the mode to active-backup. At the same
time, The balance-alb/tlb modes do not support arp/ns target. So remove
the prio_arp/ns tests from the loop and only test active-backup mode.
Jakub Kicinski [Thu, 25 Jan 2024 05:03:16 +0000 (21:03 -0800)]
Merge tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Update nf_tables kdoc to keep it in sync with the code, from George Guo.
2) Handle NETDEV_UNREGISTER event for inet/ingress basechain.
3) Reject configuration that cause nft_limit to overflow,
from Florian Westphal.
4) Restrict anonymous set/map names to 16 bytes, from Florian Westphal.
5) Disallow to encode queue number and error in verdicts. This reverts
a patch which seems to have introduced an early attempt to support for
nfqueue maps, which is these days supported via nft_queue expression.
6) Sanitize family via .validate for expressions that explicitly refer
to NF_INET_* hooks.
* tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: validate NFPROTO_* family
netfilter: nf_tables: reject QUEUE/DROP verdict parameters
netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
netfilter: nft_limit: reject configurations that cause integer overflow
netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
netfilter: nf_tables: cleanup documentation
====================
However, when fjes_hw_setup fails, fjes_hw_exit won't be called and thus
all the resources allocated in fjes_hw_setup will be leaked. In this
patch, we free those resources in fjes_hw_setup and prevents such leaks.
Linus Torvalds [Thu, 25 Jan 2024 00:59:52 +0000 (16:59 -0800)]
Merge tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix to avoid triggering an assert in some cases where RBD exclusive
mappings are involved and a deprecated API cleanup"
* tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client:
rbd: don't move requests to the running list on errors
rbd: remove usage of the deprecated ida_simple_*() API
Linus Torvalds [Thu, 25 Jan 2024 00:51:59 +0000 (16:51 -0800)]
Merge tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity fix from Mimi Zohar:
"Revert patch that required user-provided key data, since keys can be
created from kernel-generated random numbers"
* tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
Revert "KEYS: encrypted: Add check for strsep"
====================
net: bpf_xdp_adjust_tail() and Intel mbuf fixes
Hey,
after a break followed by dealing with sickness, here is a v6 that makes
bpf_xdp_adjust_tail() actually usable for ZC drivers that support XDP
multi-buffer. Since v4 I tried also using bpf_xdp_adjust_tail() with
positive offset which exposed yet another issues, which can be observed
by increased commit count when compared to v3.
John, in the end I think we should remove handling
MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but it is out of the scope
for fixes set, IMHO.
v5:
- pick correct version of patch 5 [Simon]
- elaborate a bit more on what patch 2 fixes
v4:
- do not clear frags flag when deleting tail; xsk_buff_pool now does
that
- skip some NULL tests for xsk_buff_get_tail [Martin, John]
- address problems around registering xdp_rxq_info
- fix bpf_xdp_frags_increase_tail() for ZC mbuf
v3:
- add acks
- s/xsk_buff_tail_del/xsk_buff_del_tail
- address i40e as well (thanks Tirthendu)
v2:
- fix !CONFIG_XDP_SOCKETS builds
- add reviewed-by tag to patch 3
====================
i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.
i40e support XDP multi-buffer so it is supposed to use
__xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
frag_size. It can not be simply converted at existing callsite because
rx_buf_len could be un-initialized, so let us register xdp_rxq_info
within i40e_configure_rx_ring(), which happen to be called with already
initialized rx_buf_len value.
Commit 5180ff1364bc ("i40e: use int for i40e_status") converted 'err' to
int, so two variables to deal with return codes are not needed within
i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
from xdp_rxq_info registration.
xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
queue via subtracting xdp_buff::data_end from xdp_buff::data.
In bpf_xdp_frags_increase_tail(), when underlying memory type of
xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
fragment, so that later on user space will be able to take into account
the amount of bytes added by XDP program.
ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.
Use a bigger hammer and instead of unregistering only xdp_rxq_info's
memory model, unregister it altogether and register it again and have
xdp_rxq_info with correct frag_size value.
intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
Ice and i40e ZC drivers currently set offset of a frag within
skb_shared_info to 0, which is incorrect. xdp_buffs that come from
xsk_buff_pool always have 256 bytes of a headroom, so they need to be
taken into account to retrieve xdp_buff::data via skb_frag_address().
Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
xdp_buff::data_hard_start which would result in overwriting existing
payload.
xdp_rxq_info struct can be registered by drivers via two functions -
xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
which in turn will make it possible to grow the packet via
bpf_xdp_adjust_tail() BPF helper.
Currently, ice registers xdp_rxq_info in two spots:
1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
2) ice_vsi_cfg_rxq() // via __xdp_rxq_info_reg(), OK
Cited commit under fixes tag took care of setting up frag_size and
updated registration scheme in 2) but it did not help as
1) is called before 2) and as shown above it uses old registration
function. This means that 2) sees that xdp_rxq_info is already
registered and never calls __xdp_rxq_info_reg() which leaves us with
xdp_rxq_info::frag_size being set to 0.
To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
ice_setup_rx_ring().
Tirthendu Sarkar [Wed, 24 Jan 2024 19:15:56 +0000 (20:15 +0100)]
i40e: handle multi-buffer packets that are shrunk by xdp prog
XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
helper function. For multi-buffer packets this may lead to reduction of
frag count stored in skb_shared_info area of the xdp_buff struct. This
results in issues with the current handling of XDP_PASS and XDP_DROP
cases.
For XDP_PASS, currently skb is being built using frag count of
xdp_buffer before it was processed by XDP prog and thus will result in
an inconsistent skb when frag count gets reduced by XDP prog. To fix
this, get correct frag count while building the skb instead of using
pre-obtained frag count.
For XDP_DROP, current page recycling logic will not reuse the page but
instead will adjust the pagecnt_bias so that the page can be freed. This
again results in inconsistent behavior as the page refcnt has already
been changed by the helper while freeing the frag(s) as part of
shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
that are stillpart of the packet post-xdp prog run.
Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
socket.
Since support for handling multi-buffer frames was added to XDP, usage
of bpf_xdp_adjust_tail() helper within XDP program can free the page
that given fragment occupies and in turn decrease the fragment count
within skb_shared_info that is embedded in xdp_buff struct. In current
ice driver codebase, it can become problematic when page recycling logic
decides not to reuse the page. In such case, __page_frag_cache_drain()
is used with ice_rx_buf::pagecnt_bias that was not adjusted after
refcount of page was changed by XDP prog which in turn does not drain
the refcount to 0 and page is never freed.
To address this, let us store the count of frags before the XDP program
was executed on Rx ring struct. This will be used to compare with
current frag count from skb_shared_info embedded in xdp_buff. A smaller
value in the latter indicates that XDP prog freed frag(s). Then, for
given delta decrement pagecnt_bias for XDP_DROP verdict.
While at it, let us also handle the EOP frag within
ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
needed to be applied against freed frags are performed in the single
place.
This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.
To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().
xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
used by drivers to notify data path whether xdp_buff contains fragments
or not. Data path looks up mentioned flag on first buffer that occupies
the linear part of xdp_buff, so drivers only modify it there. This is
sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
stack or it resides within struct representing driver's queue and
fragments are carried via skb_frag_t structs. IOW, we are dealing with
only one xdp_buff.
ZC mode though relies on list of xdp_buff structs that is carried via
xsk_buff_pool::xskb_list, so ZC data path has to make sure that
fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
xsk_buff_free() could misbehave if it would be executed against xdp_buff
that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
take place when within supplied XDP program bpf_xdp_adjust_tail() is
used with negative offset that would in turn release the tail fragment
from multi-buffer frame.
Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
result in releasing all the nodes from xskb_list that were produced by
driver before XDP program execution, which is not what is intended -
only tail fragment should be deleted from xskb_list and then it should
be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
make it up to user space, so from AF_XDP application POV there would be
no traffic running, however due to free_list getting constantly new
nodes, driver will be able to feed HW Rx queue with recycled buffers.
Bottom line is that instead of traffic being redirected to user space,
it would be continuously dropped.
To fix this, let us clear the mentioned flag on xsk_buff_pool side
during xdp_buff initialization, which is what should have been done
right from the start of XSK multi-buffer support.
Jakub Kicinski [Wed, 24 Jan 2024 23:12:55 +0000 (15:12 -0800)]
Merge branch 'fix-module_description-for-net-p2'
Breno Leitao says:
====================
Fix MODULE_DESCRIPTION() for net (p2)
There are hundreds of network modules that misses MODULE_DESCRIPTION(),
causing a warnning when compiling with W=1. Example:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com90io.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/arc-rimi.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com20020.o
This part2 of the patchset focus on the drivers/net/ethernet drivers.
There are still some missing warnings in drivers/net/ethernet that will
be fixed in an upcoming patchset.
Jakub Kicinski [Tue, 23 Jan 2024 06:05:29 +0000 (22:05 -0800)]
selftests: netdevsim: fix the udp_tunnel_nic test
This test is missing a whole bunch of checks for interface
renaming and one ifup. Presumably it was only used on a system
with renaming disabled and NetworkManager running.
Jakub Kicinski [Mon, 22 Jan 2024 19:58:15 +0000 (11:58 -0800)]
selftests: net: fix rps_default_mask with >32 CPUs
If there is more than 32 cpus the bitmask will start to contain
commas, leading to:
./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
Remove the commas, bash doesn't interpret leading zeroes as oct
so that should be good enough. Switch to bash, Simon reports that
not all shells support this type of substitution.
Linus Torvalds [Wed, 24 Jan 2024 21:12:20 +0000 (13:12 -0800)]
uselib: remove use of __FMODE_EXEC
Jann Horn points out that uselib() really shouldn't trigger the new
FMODE_EXEC logic introduced by commit 4759ff71f23e ("exec: __FMODE_EXEC
instead of in_execve for LSMs").
In fact, it shouldn't even have ever triggered the old pre-existing
logic for __FMODE_EXEC (like the NFS code that makes executables not
need read permissions). Unlike a real execve(), that can work even with
files that are purely executable by the user (not readable), uselib()
has that MAY_READ requirement becasue it's really just a convenience
wrapper around mmap() for legacy shared libraries.
The whole FMODE_EXEC bit was originally introduced by commit b500531e6f5f ("[PATCH] Introduce FMODE_EXEC file flag"), primarily to
give ETXTBUSY error returns for distributed filesystems.
It has since grown a few other warts (like that NFS thing), but there
really isn't any reason to use it for uselib(), and now that we are
trying to use it to replace the horrid 'tsk->in_execve' flag, it's
actively wrong.
Of course, as Jann Horn also points out, nobody should be enabling
CONFIG_USELIB in the first place in this day and age, but that's a
different discussion entirely.
New encrypted keys are created either from kernel-generated random
numbers or user-provided decrypted data. Revert the change requiring
user-provided decrypted data.
Register value persist after booting the kernel using
kexec which results in kernel panic. Thus clear the
BM pool registers before initialisation to fix the issue.
Bernd Edlinger [Mon, 22 Jan 2024 18:19:09 +0000 (19:19 +0100)]
net: stmmac: Wait a bit for the reset to take effect
otherwise the synopsys_id value may be read out wrong,
because the GMAC_VERSION register might still be in reset
state, for at least 1 us after the reset is de-asserted.
Add a wait for 10 us before continuing to be on the safe side.
> From what have you got that delay value?
Just try and error, with very old linux versions and old gcc versions
the synopsys_id was read out correctly most of the time (but not always),
with recent linux versions and recnet gcc versions it was read out
wrongly most of the time, but again not always.
I don't have access to the VHDL code in question, so I cannot
tell why it takes so long to get the correct values, I also do not
have more than a few hardware samples, so I cannot tell how long
this timeout must be in worst case.
Experimentally I can tell that the register is read several times
as zero immediately after the reset is de-asserted, also adding several
no-ops is not enough, adding a printk is enough, also udelay(1) seems to
be enough but I tried that not very often, and I have not access to many
hardware samples to be 100% sure about the necessary delay.
And since the udelay here is only executed once per device instance,
it seems acceptable to delay the boot for 10 us.
Kees Cook [Wed, 24 Jan 2024 19:15:33 +0000 (11:15 -0800)]
exec: Distinguish in_execve from in_exec
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.
Kees Cook [Wed, 24 Jan 2024 19:22:32 +0000 (11:22 -0800)]
exec: Check __FMODE_EXEC instead of in_execve for LSMs
After commit 978ffcbf00d8 ("execve: open the executable file before
doing anything else"), current->in_execve was no longer in sync with the
open(). This broke AppArmor and TOMOYO which depend on this flag to
distinguish "open" operations from being "exec" operations.
Instead of moving around in_execve, switch to using __FMODE_EXEC, which
is where the "is this an exec?" intent is stored. Note that TOMOYO still
uses in_execve around cred handling.
core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
or 0.
Due to the reverted commit, its possible to provide a positive
value, e.g. NF_ACCEPT (1), which results in use-after-free.
Its not clear to me why this commit was made.
NF_QUEUE is not used by nftables; "queue" rules in nftables
will result in use of "nft_queue" expression.
If we later need to allow specifiying errno values from userspace
(do not know why), this has to call NF_DROP_GETERR and check that
"err <= 0" holds true.
Florian Westphal [Fri, 19 Jan 2024 12:34:32 +0000 (13:34 +0100)]
netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
nftables has two types of sets/maps, one where userspace defines the
name, and anonymous sets/maps, where userspace defines a template name.
For the latter, kernel requires presence of exactly one "%d".
nftables uses "__set%d" and "__map%d" for this. The kernel will
expand the format specifier and replaces it with the smallest unused
number.
As-is, userspace could define a template name that allows to move
the set name past the 256 bytes upperlimit (post-expansion).
I don't see how this could be a problem, but I would prefer if userspace
cannot do this, so add a limit of 16 bytes for the '%d' template name.
16 bytes is the old total upper limit for set names that existed when
nf_tables was merged initially.
Fixes: 387454901bd6 ("netfilter: nf_tables: Allow set names of up to 255 chars") Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
Remove netdevice from inet/ingress basechain in case NETDEV_UNREGISTER
event is reported, otherwise a stale reference to netdevice remains in
the hook list.
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <[email protected]> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay (AMD) <[email protected]>
Linus Torvalds [Wed, 24 Jan 2024 16:55:51 +0000 (08:55 -0800)]
Merge tag 'fbdev-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes and cleanups from Helge Deller:
"A crash fix in stifb which was missed to be included in the drm-misc
tree, two checks to prevent wrong userspace input in sisfb and
savagefb and two trivial printk cleanups:
- stifb: Fix crash in stifb_blank()
- savage/sis: Error out if pixclock equals zero
- minor trivial cleanups"
* tag 'fbdev-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: stifb: Fix crash in stifb_blank()
fbcon: Fix incorrect printed function name in fbcon_prepare_logo()
fbdev: sis: Error out if pixclock equals zero
fbdev: savage: Error out if pixclock equals zero
fbdev: vt8500lcdfb: Remove unnecessary print function dev_err()
NeilBrown [Mon, 22 Jan 2024 03:58:16 +0000 (14:58 +1100)]
nfsd: fix RELEASE_LOCKOWNER
The test on so_count in nfsd4_release_lockowner() is nonsense and
harmful. Revert to using check_for_locks(), changing that to not sleep.
First: harmful.
As is documented in the kdoc comment for nfsd4_release_lockowner(), the
test on so_count can transiently return a false positive resulting in a
return of NFS4ERR_LOCKS_HELD when in fact no locks are held. This is
clearly a protocol violation and with the Linux NFS client it can cause
incorrect behaviour.
If RELEASE_LOCKOWNER is sent while some other thread is still
processing a LOCK request which failed because, at the time that request
was received, the given owner held a conflicting lock, then the nfsd
thread processing that LOCK request can hold a reference (conflock) to
the lock owner that causes nfsd4_release_lockowner() to return an
incorrect error.
The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it
never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so
it knows that the error is impossible. It assumes the lock owner was in
fact released so it feels free to use the same lock owner identifier in
some later locking request.
When it does reuse a lock owner identifier for which a previous RELEASE
failed, it will naturally use a lock_seqid of zero. However the server,
which didn't release the lock owner, will expect a larger lock_seqid and
so will respond with NFS4ERR_BAD_SEQID.
So clearly it is harmful to allow a false positive, which testing
so_count allows.
The test is nonsense because ... well... it doesn't mean anything.
so_count is the sum of three different counts.
1/ the set of states listed on so_stateids
2/ the set of active vfs locks owned by any of those states
3/ various transient counts such as for conflicting locks.
When it is tested against '2' it is clear that one of these is the
transient reference obtained by find_lockowner_str_locked(). It is not
clear what the other one is expected to be.
In practice, the count is often 2 because there is precisely one state
on so_stateids. If there were more, this would fail.
In my testing I see two circumstances when RELEASE_LOCKOWNER is called.
In one case, CLOSE is called before RELEASE_LOCKOWNER. That results in
all the lock states being removed, and so the lockowner being discarded
(it is removed when there are no more references which usually happens
when the lock state is discarded). When nfsd4_release_lockowner() finds
that the lock owner doesn't exist, it returns success.
The other case shows an so_count of '2' and precisely one state listed
in so_stateid. It appears that the Linux client uses a separate lock
owner for each file resulting in one lock state per lock owner, so this
test on '2' is safe. For another client it might not be safe.
So this patch changes check_for_locks() to use the (newish)
find_any_file_locked() so that it doesn't take a reference on the
nfs4_file and so never calls nfsd_file_put(), and so never sleeps. With
this check is it safe to restore the use of check_for_locks() rather
than testing so_count against the mysterious '2'.
Dinghao Liu [Tue, 28 Nov 2023 09:29:01 +0000 (17:29 +0800)]
net/mlx5e: fix a potential double-free in fs_any_create_groups
When kcalloc() for ft->g succeeds but kvzalloc() for in fails,
fs_any_create_groups() will free ft->g. However, its caller
fs_any_create_table() will free ft->g again through calling
mlx5e_destroy_flow_table(), which will lead to a double-free.
Fix this by setting ft->g to NULL in fs_any_create_groups().
Zhipeng Lu [Wed, 17 Jan 2024 07:17:36 +0000 (15:17 +0800)]
net/mlx5e: fix a double-free in arfs_create_groups
When `in` allocated by kvzalloc fails, arfs_create_groups will free
ft->g and return an error. However, arfs_create_table, the only caller of
arfs_create_groups, will hold this error and call to
mlx5e_destroy_flow_table, in which the ft->g will be freed again.
Leon Romanovsky [Sun, 26 Nov 2023 09:08:10 +0000 (11:08 +0200)]
net/mlx5e: Ignore IPsec replay window values on sender side
XFRM stack doesn't prevent from users to configure replay window
in TX side and strongswan sets replay_window to be 1. It causes
to failures in validation logic when trying to offload the SA.
Replay window is not relevant in TX side and should be ignored.
Fixes: cded6d80129b ("net/mlx5e: Store replay window in XFRM attributes") Signed-off-by: Aya Levin <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Leon Romanovsky [Tue, 12 Dec 2023 11:52:55 +0000 (13:52 +0200)]
net/mlx5e: Allow software parsing when IPsec crypto is enabled
All ConnectX devices have software parsing capability enabled, but it is
more correct to set allow_swp only if capability exists, which for IPsec
means that crypto offload is supported.
Rahul Rameshbabu [Tue, 28 Nov 2023 22:01:54 +0000 (14:01 -0800)]
net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO
mlx5 devices have specific constants for choosing the CQ period mode. These
constants do not have to match the constants used by the kernel software
API for DIM period mode selection.
Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Rahul Rameshbabu <[email protected]> Reviewed-by: Jianbo Liu <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
net/mlx5: DR, Use the right GVMI number for drop action
When FW provides ICM addresses for drop RX/TX, the provided capability
is 64 bits that contain its GVMI as well as the ICM address itself.
In case of TX DROP this GVMI is different from the GVMI that the
domain is operating on.
This patch fixes the action to use these GVMI IDs, as provided by FW.
Moshe Shemesh [Sat, 30 Dec 2023 20:40:37 +0000 (22:40 +0200)]
net/mlx5: Bridge, fix multicast packets sent to uplink
To enable multicast packets which are offloaded in bridge multicast
offload mode to be sent also to uplink, FTE bit uplink_hairpin_en should
be set. Add this bit to FTE for the bridge multicast offload rules.