Linus Torvalds [Thu, 6 Jun 2024 16:55:27 +0000 (09:55 -0700)]
Merge tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from BPF and big collection of fixes for WiFi core and
drivers.
Current release - regressions:
- vxlan: fix regression when dropping packets due to invalid src
addresses
- bpf: fix a potential use-after-free in bpf_link_free()
- xdp: revert support for redirect to any xsk socket bound to the
same UMEM as it can result in a corruption
- virtio_net:
- add missing lock protection when reading return code from
control_buf
- fix false-positive lockdep splat in DIM
- Revert "wifi: wilc1000: convert list management to RCU"
- wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config
Previous releases - regressions:
- rtnetlink: make the "split" NLM_DONE handling generic, restore the
old behavior for two cases where we started coalescing those
messages with normal messages, breaking sloppily-coded userspace
- wifi:
- cfg80211: validate HE operation element parsing
- cfg80211: fix 6 GHz scan request building
- mt76: mt7615: add missing chanctx ops
- ath11k: move power type check to ASSOC stage, fix connecting to
6 GHz AP
- ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
- rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
- iwlwifi: mvm: fix a crash on 7265
Previous releases - always broken:
- ncsi: prevent multi-threaded channel probing, a spec violation
- vmxnet3: disable rx data ring on dma allocation failure
- ethtool: init tsinfo stats if requested, prevent unintentionally
reporting all-zero stats on devices which don't implement any
- dst_cache: fix possible races in less common IPv6 features
- tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED
- ax25: fix two refcounting bugs
- eth: ionic: fix kernel panic in XDP_TX action
Misc:
- tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"
* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
selftests: net: lib: set 'i' as local
selftests: net: lib: avoid error removing empty netns name
selftests: net: lib: support errexit with busywait
net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
ipv6: fix possible race in __fib6_drop_pcpu_from()
af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
af_unix: Annotate data-races around sk->sk_sndbuf.
af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
af_unix: Annotate data-race of sk->sk_state in unix_accept().
af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
af_unix: Annodate data-races around sk->sk_state for writers.
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
...
Jakub Kicinski [Thu, 6 Jun 2024 15:23:44 +0000 (08:23 -0700)]
Merge branch 'selftests-net-lib-small-fixes'
Matthieu Baerts says:
====================
selftests: net: lib: small fixes
While looking at using 'lib.sh' for the MPTCP selftests [1], we found
some small issues with 'lib.sh'. Here they are:
- Patch 1: fix 'errexit' (set -e) support with busywait. 'errexit' is
supported in some functions, not all. A fix for v6.8+.
- Patch 2: avoid confusing error messages linked to the cleaning part
when the netns setup fails. A fix for v6.8+.
- Patch 3: set a variable as local to avoid accidentally changing the
value of a another one with the same name on the caller side. A fix
for v6.10-rc1+.
Without this, the 'i' variable declared before could be overridden by
accident, e.g.
for i in "${@}"; do
__ksft_status_merge "${i}" ## 'i' has been modified
foo "${i}" ## using 'i' with an unexpected value
done
After a quick look, it looks like 'i' is currently not used after having
been modified in __ksft_status_merge(), but still, better be safe than
sorry. I saw this while modifying the same file, not because I suspected
an issue somewhere.
selftests: net: lib: avoid error removing empty netns name
If there is an error to create the first netns with 'setup_ns()',
'cleanup_ns()' will be called with an empty string as first parameter.
The consequences is that 'cleanup_ns()' will try to delete an invalid
netns, and wait 20 seconds if the netns list is empty.
Instead of just checking if the name is not empty, convert the string
separated by spaces to an array. Manipulating the array is cleaner, and
calling 'cleanup_ns()' with an empty array will be a no-op.
Note that with this series there is skb_queue_empty() left in
unix_dgram_disconnected() that needs to be changed to lockless
version, and unix_peer(other) access there should be protected
by unix_state_lock().
This will require some refactoring, so another series will follow.
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thues, unix_state_lock() does not protect qlen.
Let's use skb_queue_empty_lockless() in unix_release_sock().
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid data-race.
Now we remove unix_recvq_full() as no one uses it.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
calls unix_writable() which also reads sk->sk_state without holding
unix_state_lock().
Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().
While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as
that state does not exist for AF_UNIX socket since the code was added.
Fixes: 1586a5877db9 ("af_unix: do not report POLLOUT on listeners") Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
net: wwan: iosm: Fix tainted pointer delete is case of region creation fail
In case of region creation fail in ipc_devlink_create_region(), previously
created regions delete process starts from tainted pointer which actually
holds error code value.
Fix this bug by decreasing region index before delete.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
====================
Intel Wired LAN Driver Updates 2024-05-29 (ice, igc)
This series includes fixes for the ice driver as well as a fix for the igc
driver.
Jacob fixes two issues in the ice driver with reading the NVM for providing
firmware data via devlink info. First, fix an off-by-one error when reading
the Preserved Fields Area, resolving an infinite loop triggered on some
NVMs which lack certain data in the NVM. Second, fix the reading of the NVM
Shadow RAM on newer E830 and E825-C devices which have a variable sized CSS
header rather than assuming this header is always the same fixed size as in
the E810 devices.
Larysa fixes three issues with the ice driver XDP logic that could occur if
the number of queues is changed after enabling an XDP program. First, the
af_xdp_zc_qps bitmap is removed and replaced by simpler logic to track
whether queues are in zero-copy mode. Second, the reset and .ndo_bpf flows
are distinguished to avoid potential races with a PF reset occuring
simultaneously to .ndo_bpf callback from userspace. Third, the logic for
mapping XDP queues to vectors is fixed so that XDP state is restored for
XDP queues after a reconfiguration.
Sasha fixes reporting of Energy Efficient Ethernet support via ethtool in
the igc driver.
Sasha Neftin [Mon, 3 Jun 2024 21:42:35 +0000 (14:42 -0700)]
igc: Fix Energy Efficient Ethernet support declaration
The commit 01cf893bf0f4 ("net: intel: i40e/igc: Remove setting Autoneg in
EEE capabilities") removed SUPPORTED_Autoneg field but left inappropriate
ethtool_keee structure initialization. When "ethtool --show <device>"
(get_eee) invoke, the 'ethtool_keee' structure was accidentally overridden.
Remove the 'ethtool_keee' overriding and add EEE declaration as per IEEE
specification that allows reporting Energy Efficient Ethernet capabilities.
Examples:
Before fix:
ethtool --show-eee enp174s0
EEE settings for enp174s0:
EEE status: not supported
After fix:
EEE settings for enp174s0:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes: 100baseT/Full
1000baseT/Full
2500baseT/Full
Larysa Zaremba [Mon, 3 Jun 2024 21:42:34 +0000 (14:42 -0700)]
ice: map XDP queues to vectors in ice_vsi_map_rings_to_vectors()
ice_pf_dcb_recfg() re-maps queues to vectors with
ice_vsi_map_rings_to_vectors(), which does not restore the previous
state for XDP queues. This leads to no AF_XDP traffic after rebuild.
Map XDP queues to vectors in ice_vsi_map_rings_to_vectors().
Also, move the code around, so XDP queues are mapped independently only
through .ndo_bpf().
Larysa Zaremba [Mon, 3 Jun 2024 21:42:33 +0000 (14:42 -0700)]
ice: add flag to distinguish reset from .ndo_bpf in XDP rings config
Commit 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
has placed ice_vsi_free_q_vectors() after ice_destroy_xdp_rings() in
the rebuild process. The behaviour of the XDP rings config functions is
context-dependent, so the change of order has led to
ice_destroy_xdp_rings() doing additional work and removing XDP prog, when
it was supposed to be preserved.
Also, dependency on the PF state reset flags creates an additional,
fortunately less common problem:
* PFR is requested e.g. by tx_timeout handler
* .ndo_bpf() is asked to delete the program, calls ice_destroy_xdp_rings(),
but reset flag is set, so rings are destroyed without deleting the
program
* ice_vsi_rebuild tries to delete non-existent XDP rings, because the
program is still on the VSI
* system crashes
With a similar race, when requested to attach a program,
ice_prepare_xdp_rings() can actually skip setting the program in the VSI
and nevertheless report success.
Instead of reverting to the old order of function calls, add an enum
argument to both ice_prepare_xdp_rings() and ice_destroy_xdp_rings() in
order to distinguish between calls from rebuild and .ndo_bpf().
Larysa Zaremba [Mon, 3 Jun 2024 21:42:32 +0000 (14:42 -0700)]
ice: remove af_xdp_zc_qps bitmap
Referenced commit has introduced a bitmap to distinguish between ZC and
copy-mode AF_XDP queues, because xsk_get_pool_from_qid() does not do this
for us.
The bitmap would be especially useful when restoring previous state after
rebuild, if only it was not reallocated in the process. This leads to e.g.
xdpsock dying after changing number of queues.
Instead of preserving the bitmap during the rebuild, remove it completely
and distinguish between ZC and copy-mode queues based on the presence of
a device associated with the pool.
Jacob Keller [Mon, 3 Jun 2024 21:42:31 +0000 (14:42 -0700)]
ice: fix reads from NVM Shadow RAM on E830 and E825-C devices
The ice driver reads data from the Shadow RAM portion of the NVM during
initialization, including data used to identify the NVM image and device,
such as the ETRACK ID used to populate devlink dev info fw.bundle.
Currently it is using a fixed offset defined by ICE_CSS_HEADER_LENGTH to
compute the appropriate offset. This worked fine for E810 and E822 devices
which both have CSS header length of 330 words.
Other devices, including both E825-C and E830 devices have different sizes
for their CSS header. The use of a hard coded value results in the driver
reading from the wrong block in the NVM when attempting to access the
Shadow RAM copy. This results in the driver reporting the fw.bundle as 0x0
in both the devlink dev info and ethtool -i output.
The first E830 support was introduced by commit ba20ecb1d1bb ("ice: Hook up
4 E830 devices by adding their IDs") and the first E825-C support was
introducted by commit f64e18944233 ("ice: introduce new E825C devices
family")
The NVM actually contains the CSS header length embedded in it. Remove the
hard coded value and replace it with logic to read the length from the NVM
directly. This is more resilient against all existing and future hardware,
vs looking up the expected values from a table. It ensures the driver will
read from the appropriate place when determining the ETRACK ID value used
for populating the fw.bundle_id and for reporting in ethtool -i.
The CSS header length for both the active and inactive flash bank is stored
in the ice_bank_info structure to avoid unnecessary duplicate work when
accessing multiple words of the Shadow RAM. Both banks are read in the
unlikely event that the header length is different for the NVM in the
inactive bank, rather than being different only by the overall device
family.
Jacob Keller [Mon, 3 Jun 2024 21:42:30 +0000 (14:42 -0700)]
ice: fix iteration of TLVs in Preserved Fields Area
The ice_get_pfa_module_tlv() function iterates over the Type-Length-Value
structures in the Preserved Fields Area (PFA) of the NVM. This is used by
the driver to access data such as the Part Board Assembly identifier.
The function uses simple logic to iterate over the PFA. First, the pointer
to the PFA in the NVM is read. Then the total length of the PFA is read
from the first word.
A pointer to the first TLV is initialized, and a simple loop iterates over
each TLV. The pointer is moved forward through the NVM until it exceeds the
PFA area.
The logic seems sound, but it is missing a key detail. The Preserved
Fields Area length includes one additional final word. This is documented
in the device data sheet as a dummy word which contains 0xFFFF. All NVMs
have this extra word.
If the driver tries to scan for a TLV that is not in the PFA, it will read
past the size of the PFA. It reads and interprets the last dummy word of
the PFA as a TLV with type 0xFFFF. It then reads the word following the PFA
as a length.
The PFA resides within the Shadow RAM portion of the NVM, which is
relatively small. All of its offsets are within a 16-bit size. The PFA
pointer and TLV pointer are stored by the driver as 16-bit values.
In almost all cases, the word following the PFA will be such that
interpreting it as a length will result in 16-bit arithmetic overflow. Once
overflowed, the new next_tlv value is now below the maximum offset of the
PFA. Thus, the driver will continue to iterate the data as TLVs. In the
worst case, the driver hits on a sequence of reads which loop back to
reading the same offsets in an endless loop.
To fix this, we need to correct the loop iteration check to account for
this extra word at the end of the PFA. This alone is sufficient to resolve
the known cases of this issue in the field. However, it is plausible that
an NVM could be misconfigured or have corrupt data which results in the
same kind of overflow. Protect against this by using check_add_overflow
when calculating both the maximum offset of the TLVs, and when calculating
the next_tlv offset at the end of each loop iteration. This ensures that
the driver will not get stuck in an infinite loop when scanning the PFA.
Jakub Kicinski [Thu, 6 Jun 2024 02:03:07 +0000 (19:03 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-06-05
We've added 8 non-merge commits during the last 6 day(s) which contain
a total of 9 files changed, 34 insertions(+), 35 deletions(-).
The main changes are:
1) Fix a potential use-after-free in bpf_link_free when the link uses
dealloc_deferred to free the link object but later still tests for
presence of link->ops->dealloc, from Cong Wang.
2) Fix BPF test infra to set the run context for rawtp test_run callback
where syzbot reported a crash, from Jiri Olsa.
3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude
it for the case of !CONFIG_FPROBE, also from Jiri Olsa.
4) Fix a Coverity static analysis report to not close() a link_fd of -1
in the multi-uprobe feature detector, from Andrii Nakryiko.
5) Revert support for redirect to any xsk socket bound to the same umem
as it can result in corrupted ring state which can lead to a crash when
flushing rings. A different approach will be pursued for bpf-next to
address it safely, from Magnus Karlsson.
6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused
BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko.
7) Fix a coccicheck warning in BPF devmap that iterator variable cannot
be NULL, from Thorsten Blum.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
Revert "xsk: Document ability to redirect to any socket bound to the same umem"
Revert "xsk: Support redirect to any socket bound to the same umem"
bpf: Set run context for rawtp test_run callback
bpf: Fix a potential use-after-free in bpf_link_free()
bpf, devmap: Remove unnecessary if check in for loop
libbpf: don't close(-1) in multi-uprobe feature detector
bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c
====================
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
taprio_parse_mqprio_opt() must validate it, or userspace
can inject arbitrary data to the kernel, the second time
taprio_change() is called.
First call (with valid attributes) sets dev->num_tc
to a non zero value.
Second call (with arbitrary mqprio attributes)
returns early from taprio_parse_mqprio_opt()
and bad things can happen.
Linus Torvalds [Wed, 5 Jun 2024 22:28:20 +0000 (15:28 -0700)]
Merge tag 'thermal-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Fix issues related to the handling of invalid trip points in the
thermal core and in the thermal debug code that have been overlooked
by some recent thermal control core changes"
* tag 'thermal-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: trip: Trigger trip down notifications when trips involved in mitigation become invalid
thermal: core: Introduce thermal_trip_crossed()
thermal/debugfs: Allow tze_seq_show() to print statistics for invalid trips
thermal/debugfs: Print initial trip temperature and hysteresis in tze_seq_show()
Linus Torvalds [Wed, 5 Jun 2024 22:19:15 +0000 (15:19 -0700)]
Merge tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix the ACPI EC and AC drivers, the ACPI APEI error injection
driver and build issues related to the dev_is_pnp() macro referring to
pnp_bus_type that is not exported to modules.
Specifics:
- Fix error handling during EC operation region accesses in the ACPI
EC driver (Armin Wolf)
- Fix a memory leak in the APEI error injection driver introduced
during its converion to a platform driver (Dan Williams)
- Fix build failures related to the dev_is_pnp() macro by redefining
it as a proper function and exporting it to modules as appropriate
and unexport pnp_bus_type which need not be exported any more (Andy
Shevchenko)
- Update the ACPI AC driver to use power_supply_changed() to let the
power supply core handle configuration changes properly (Thomas
Weißschuh)"
* tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: AC: Properly notify powermanagement core about changes
PNP: Hide pnp_bus_type from the non-PNP code
PNP: Make dev_is_pnp() to be a function and export it for modules
ACPI: EC: Avoid returning AE_OK on errors in address space handler
ACPI: EC: Abort address space access upon error
ACPI: APEI: EINJ: Fix einj_dev release leak
Linus Torvalds [Wed, 5 Jun 2024 22:12:35 +0000 (15:12 -0700)]
Merge tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix the intel_pstate and amd-pstate cpufreq drivers and the
cpupower utility.
Specifics:
- Fix a recently introduced unchecked HWP MSR access in the
intel_pstate driver (Srinivas Pandruvada)
- Add missing conversion from MHz to KHz to amd_pstate_set_boost() to
address sysfs inteface inconsistency and fix P-state frequency
reporting on AMD Family 1Ah CPUs in the cpupower utility (Dhananjay
Ugwekar)
- Get rid of an excess global header file used by the amd-pstate
cpufreq driver (Arnd Bergmann)"
* tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Fix unchecked HWP MSR access
cpufreq: amd-pstate: Fix the inconsistency in max frequency units
cpufreq: amd-pstate: remove global header file
tools/power/cpupower: Fix Pstate frequency reporting on AMD Family 1Ah CPUs
net/mlx5: Fix tainted pointer delete is case of flow rules creation fail
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(),
instead of previously created rules, the tainted pointer is deleted
deveral times.
Fix this bug by using correct flow rules pointers.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Linus Torvalds [Wed, 5 Jun 2024 18:28:25 +0000 (11:28 -0700)]
Merge tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for fast fsync that needs to handle errors during writes after
some COW failure so it does not lead to an inconsistent state"
* tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ensure fast fsync waits for ordered extents after a write failure
Linus Torvalds [Wed, 5 Jun 2024 18:25:41 +0000 (11:25 -0700)]
Merge tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Just a few small fixes"
* tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix trans->locked assert
bcachefs: Rereplicate now moves data off of durability=0 devices
bcachefs: Fix GFP_KERNEL allocation in break_cycle()
Linus Torvalds [Wed, 5 Jun 2024 17:32:20 +0000 (10:32 -0700)]
Merge tag 'i2c-for-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"This should have been my second pull request during the merge window
but one dependency in the drm subsystem fell through the cracks and
was only applied for rc2.
Now we can finally remove I2C_CLASS_SPD"
* tag 'i2c-for-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: Remove I2C_CLASS_SPD
i2c: synquacer: Remove a clk reference from struct synquacer_i2c
Linus Torvalds [Wed, 5 Jun 2024 17:29:13 +0000 (10:29 -0700)]
Merge tag 'tpmdd-next-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"The bug fix for tpm_tis_core_init() is not that critical but still
makes sense to get into release for the sake of better quality.
I included the Intel CPU model define change mainly to help Tony just
a bit, as for this subsystem it cannot realistically speaking cause
any possible harm"
* tag 'tpmdd-next-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: Switch to new Intel CPU model defines
tpm_tis: Do *not* flush uninitialized work
Linus Torvalds [Wed, 5 Jun 2024 15:43:41 +0000 (08:43 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"This is dominated by a couple large series for ARM and x86
respectively, but apart from that things are calm.
ARM:
- Large set of FP/SVE fixes for pKVM, addressing the fallout from the
per-CPU data rework and making sure that the host is not involved
in the FP/SVE switching any more
- Allow FEAT_BTI to be enabled with NV now that FEAT_PAUTH is
completely supported
- Fix for the respective priorities of Failed PAC, Illegal Execution
state and Instruction Abort exceptions
- Fix the handling of AArch32 instruction traps failing their
condition code, which was broken by the introduction of
ESR_EL2.ISS2
- Allow vcpus running in AArch32 state to be restored in System mode
- Fix AArch32 GPR restore that would lose the 64 bit state under some
conditions
RISC-V:
- No need to use mask when hart-index-bits is 0
- Fix incorrect reg_subtype labels in
kvm_riscv_vcpu_set_reg_isa_ext()
x86:
- Fixes and debugging help for the #VE sanity check.
Also disable it by default, even for CONFIG_DEBUG_KERNEL, because
it was found to trigger spuriously (most likely a processor erratum
as the exact symptoms vary by generation).
- Avoid WARN() when two NMIs arrive simultaneously during an
NMI-disabled situation (GIF=0 or interrupt shadow) when the
processor supports virtual NMI.
While generally KVM will not request an NMI window when virtual
NMIs are supported, in this case it *does* have to single-step over
the interrupt shadow or enable the STGI intercept, in order to
deliver the latched second NMI.
- Drop support for hand tuning APIC timer advancement from userspace.
Since we have adaptive tuning, and it has proved to work well, drop
the module parameter for manual configuration and with it a few
stupid bugs that it had"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (32 commits)
KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr
KVM: arm64: Ensure that SME controls are disabled in protected mode
KVM: arm64: Refactor CPACR trap bit setting/clearing to use ELx format
KVM: arm64: Consolidate initializing the host data's fpsimd_state/sve in pKVM
KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM
KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM
KVM: arm64: Specialize handling of host fpsimd state on trap
KVM: arm64: Abstract set/clear of CPTR_EL2 bits behind helper
KVM: arm64: Fix prototype for __sve_save_state/__sve_restore_state
KVM: arm64: Reintroduce __sve_save_state
KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
KVM: SEV-ES: Delegate LBR virtualization to the processor
KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absent
KVM: SEV-ES: Prevent MSR access post VMSA encryption
RISC-V: KVM: Fix incorrect reg_subtype labels in kvm_riscv_vcpu_set_reg_isa_ext function
RISC-V: KVM: No need to use mask when hart-index-bit is 0
KVM: arm64: nv: Expose BTI and CSV_frac to a guest hypervisor
KVM: arm64: nv: Fix relative priorities of exceptions generated by ERETAx
KVM: arm64: AArch32: Fix spurious trapping of conditional instructions
KVM: arm64: Allow AArch32 PSTATE.M to be restored as System mode
...
- Fix a recently introduced unchecked HWP MSR access in the
intel_pstate driver (Srinivas Pandruvada).
- Add missing conversion from MHz to KHz to amd_pstate_set_boost()
to address sysfs inteface inconsistency (Dhananjay Ugwekar).
- Get rid of an excess global header file used by the amd-pstate
cpufreq driver (Arnd Bergmann).
* pm-cpufreq:
cpufreq: intel_pstate: Fix unchecked HWP MSR access
cpufreq: amd-pstate: Fix the inconsistency in max frequency units
cpufreq: amd-pstate: remove global header file
Merge ACPI EC driver fixes, an ACPI APEI fix and PNP fixes for
6.10-rc3:
- Fix error handling during EC operation region accesses in the ACPI EC
driver (Armin Wolf).
- Fix a memory leak in the APEI error injection driver introduced
during its converion to a platform driver (Dan Williams).
- Fix build failures related to the dev_is_pnp() macro by redefining it
as a proper function and exporting it to modules as appropriate and
unexport pnp_bus_type which need not be exported any more (Andy
Shevchenko).
* acpi-ec:
ACPI: EC: Avoid returning AE_OK on errors in address space handler
ACPI: EC: Abort address space access upon error
Shay Drory [Mon, 3 Jun 2024 21:04:43 +0000 (00:04 +0300)]
net/mlx5: Always stop health timer during driver removal
Currently, if teardown_hca fails to execute during driver removal, mlx5
does not stop the health timer. Afterwards, mlx5 continue with driver
teardown. This may lead to a UAF bug, which results in page fault
Oops[1], since the health timer invokes after resources were freed.
Hence, stop the health monitor even if teardown_hca fails.
Moshe Shemesh [Mon, 3 Jun 2024 21:04:42 +0000 (00:04 +0300)]
net/mlx5: Stop waiting for PCI if pci channel is offline
In case pci channel becomes offline the driver should not wait for PCI
reads during health dump and recovery flow. The driver has timeout for
each of these loops trying to read PCI, so it would fail anyway.
However, in case of recovery waiting till timeout may cause the pci
error_detected() callback fail to meet pci_dpc_recovered() wait timeout.
net: ethernet: mtk_eth_soc: handle dma buffer size soc specific
The mainline MTK ethernet driver suffers long time from rarly but
annoying tx queue timeouts. We think that this is caused by fixed
dma sizes hardcoded for all SoCs.
We suspect this problem arises from a low level of free TX DMADs,
the TX Ring alomost full.
The transmit timeout is caused by the Tx queue not waking up. The
Tx queue stops when the free counter is less than ring->thres, and
it will wake up once the free counter is greater than ring->thres.
If the CPU is too late to wake up the Tx queues, it may cause a
transmit timeout.
Therefore, we increased the TX and RX DMADs to improve this error
situation.
Use the dma-size implementation from SDK in a per SoC manner. In
difference to SDK we have no RSS feature yet, so all RX/TX sizes
should be raised from 512 to 2048 byte except fqdma on mt7988 to
avoid the tx timeout issue.
Jakub Kicinski [Mon, 3 Jun 2024 18:48:26 +0000 (11:48 -0700)]
rtnetlink: make the "split" NLM_DONE handling generic
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again")
Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).
Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.
Jason Xing [Mon, 3 Jun 2024 17:02:17 +0000 (01:02 +0800)]
mptcp: count CLOSE-WAIT sockets for MPTCP_MIB_CURRESTAB
Like previous patch does in TCP, we need to adhere to RFC 1213:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
So let's consider CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) Only if we change the state to ESTABLISHED.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Fixes: d9cd27b8cd19 ("mptcp: add CurrEstab MIB counter support") Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jason Xing [Mon, 3 Jun 2024 17:02:16 +0000 (01:02 +0800)]
tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
According to RFC 1213, we should also take CLOSE-WAIT sockets into
consideration:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
After this, CurrEstab counter will display the total number of
ESTABLISHED and CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) if we change the state to ESTABLISHED.
b) if we change the state from SYN-RECEIVED to CLOSE-WAIT.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Please note: there are two chances that old state of socket can be changed
to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED.
So we have to take care of the former case.
Tao Su [Tue, 28 May 2024 10:22:34 +0000 (18:22 +0800)]
KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr
Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn().
Before checking the mismatch of private vs. shared, mmu_invalidate_seq is
saved to fault->mmu_seq, which can be used to detect an invalidation
related to the gfn occurred, i.e. KVM will not install a mapping in page
table if fault->mmu_seq != mmu_invalidate_seq.
Currently there is a second snapshot of mmu_invalidate_seq, which may not
be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute
may be changed between the two snapshots, but the gfn may be mapped in
page table without hindrance. Therefore, drop the second snapshot as it
has no obvious benefits.
Paolo Bonzini [Wed, 5 Jun 2024 10:32:18 +0000 (06:32 -0400)]
Merge tag 'kvmarm-fixes-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.10, take #1
- Large set of FP/SVE fixes for pKVM, addressing the fallout
from the per-CPU data rework and making sure that the host
is not involved in the FP/SVE switching any more
- Allow FEAT_BTI to be enabled with NV now that FEAT_PAUTH
is copletely supported
- Fix for the respective priorities of Failed PAC, Illegal
Execution state and Instruction Abort exceptions
- Fix the handling of AArch32 instruction traps failing their
condition code, which was broken by the introduction of
ESR_EL2.ISS2
- Allow vpcus running in AArch32 state to be restored in
System mode
- Fix AArch32 GPR restore that would lose the 64 bit state
under some conditions
Hangbin Liu [Mon, 3 Jun 2024 09:30:19 +0000 (17:30 +0800)]
selftests: hsr: add missing config for CONFIG_BRIDGE
hsr_redbox.sh test need to create bridge for testing. Add the missing
config CONFIG_BRIDGE in config file.
Fixes: eafbf0574e05 ("test: hsr: Extend the hsr_redbox.sh to have more SAN devices connected") Signed-off-by: Hangbin Liu <[email protected]> Tested-by: Simon Horman <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Borkmann [Mon, 3 Jun 2024 08:59:26 +0000 (10:59 +0200)]
vxlan: Fix regression when dropping packets due to invalid src addresses
Commit f58f45c1e5b9 ("vxlan: drop packets from invalid src-address")
has recently been added to vxlan mainly in the context of source
address snooping/learning so that when it is enabled, an entry in the
FDB is not being created for an invalid address for the corresponding
tunnel endpoint.
Before commit f58f45c1e5b9 vxlan was similarly behaving as geneve in
that it passed through whichever macs were set in the L2 header. It
turns out that this change in behavior breaks setups, for example,
Cilium with netkit in L3 mode for Pods as well as tunnel mode has been
passing before the change in f58f45c1e5b9 for both vxlan and geneve.
After mentioned change it is only passing for geneve as in case of
vxlan packets are dropped due to vxlan_set_mac() returning false as
source and destination macs are zero which for E/W traffic via tunnel
is totally fine.
Fix it by only opting into the is_valid_ether_addr() check in
vxlan_set_mac() when in fact source address snooping/learning is
actually enabled in vxlan. This is done by moving the check into
vxlan_snoop(). With this change, the Cilium connectivity test suite
passes again for both tunnel flavors.
Hangyu Hua [Mon, 3 Jun 2024 07:13:03 +0000 (15:13 +0800)]
net: sched: sch_multiq: fix possible OOB write in multiq_tune()
q->bands will be assigned to qopt->bands to execute subsequent code logic
after kmalloc. So the old q->bands should not be used in kmalloc.
Otherwise, an out-of-bounds write will occur.
Fixes: c2999f7fb05b ("net: sched: multiq: don't call qdisc_put() while holding tree lock") Signed-off-by: Hangyu Hua <[email protected]> Acked-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Taehee Yoo [Mon, 3 Jun 2024 04:57:55 +0000 (04:57 +0000)]
ionic: fix kernel panic in XDP_TX action
In the XDP_TX path, ionic driver sends a packet to the TX path with rx
page and corresponding dma address.
After tx is done, ionic_tx_clean() frees that page.
But RX ring buffer isn't reset to NULL.
So, it uses a freed page, which causes kernel panic.
Tristram Ha [Fri, 31 May 2024 01:38:01 +0000 (18:38 -0700)]
net: phy: Micrel KSZ8061: fix errata solution not taking effect problem
KSZ8061 needs to write to a MMD register at driver initialization to fix
an errata. This worked in 5.0 kernel but not in newer kernels. The
issue is the main phylib code no longer resets PHY at the very beginning.
Calling phy resuming code later will reset the chip if it is already
powered down at the beginning. This wipes out the MMD register write.
Solution is to implement a phy resume function for KSZ8061 to take care
of this problem.
Fixes: 232ba3a51cc2 ("net: phy: Micrel KSZ8061: link failure after cable connect") Signed-off-by: Tristram Ha <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Wen Gu [Fri, 31 May 2024 08:54:17 +0000 (16:54 +0800)]
net/smc: avoid overwriting when adjusting sock bufsizes
When copying smc settings to clcsock, avoid setting clcsock's sk_sndbuf
to sysctl_tcp_wmem[1], since this may overwrite the value set by
tcp_sndbuf_expand() in TCP connection establishment.
And the other setting sk_{snd|rcv}buf to sysctl value in
smc_adjust_sock_bufsizes() can also be omitted since the initialization
of smc sock and clcsock has set sk_{snd|rcv}buf to smc.sysctl_{w|r}mem
or ipv4_sysctl_tcp_{w|r}mem[1].
octeontx2-af: Always allocate PF entries from low prioriy zone
PF mcam entries has to be at low priority always so that VF
can install longest prefix match rules at higher priority.
This was taken care currently but when priority allocation
wrt reference entry is requested then entries are allocated
from mid-zone instead of low priority zone. Fix this and
always allocate entries from low priority zone for PFs.
Fixes: 7df5b4b260dd ("octeontx2-af: Allocate low priority entries for PF") Signed-off-by: Subbaraya Sundeep <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ard Biesheuvel [Tue, 4 Jun 2024 15:45:20 +0000 (17:45 +0200)]
efi: Add missing __nocfi annotations to runtime wrappers
The EFI runtime wrappers are a sandbox for calling into EFI runtime
services, which are invoked using indirect calls. When running with kCFI
enabled, the compiler will require the target of any indirect call to be
type annotated.
Given that the EFI runtime services prototypes and calling convention
are governed by the EFI spec, not the Linux kernel, adding such type
annotations for firmware routines is infeasible, and so the compiler
must be informed that prototype validation should be omitted.
Add the __nocfi annotation at the appropriate places in the EFI runtime
wrapper code to achieve this.
Note that this currently only affects 32-bit ARM, given that other
architectures that support both kCFI and EFI use an asm wrapper to call
EFI runtime services, and this hides the indirect call from the
compiler.
This patch introduced a potential kernel crash when multiple napi instances
redirect to the same AF_XDP socket. By removing the queue_index check, it is
possible for multiple napi instances to access the Rx ring at the same time,
which will result in a corrupted ring state which can lead to a crash when
flushing the rings in __xsk_flush(). This can happen when the linked list of
sockets to flush gets corrupted by concurrent accesses. A quick and small fix
is not possible, so let us revert this for now.
Jiri Olsa [Tue, 4 Jun 2024 15:00:24 +0000 (17:00 +0200)]
bpf: Set run context for rawtp test_run callback
syzbot reported crash when rawtp program executed through the
test_run interface calls bpf_get_attach_cookie helper or any
other helper that touches task->bpf_ctx pointer.
Setting the run context (task->bpf_ctx pointer) for test_run
callback.
Jan Beulich [Wed, 29 May 2024 12:23:25 +0000 (15:23 +0300)]
tpm_tis: Do *not* flush uninitialized work
tpm_tis_core_init() may fail before tpm_tis_probe_irq_single() is
called, in which case tpm_tis_remove() unconditionally calling
flush_work() is triggering a warning for .func still being NULL.
Linus Torvalds [Tue, 4 Jun 2024 21:08:44 +0000 (14:08 -0700)]
Merge tag 'devicetree-fixes-for-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Fix regression in 'interrupt-map' handling affecting Apple M1 mini
(at least)
- Fix binding example warning in stm32 st,mlahb binding
- Fix schema error in Allwinner platform binding causing lots of
spurious warnings
- Add missing MODULE_DESCRIPTION() to DT kunit tests
* tag 'devicetree-fixes-for-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: property: Fix fw_devlink handling of interrupt-map
of/irq: Factor out parsing of interrupt-map parent phandle+args from of_irq_parse_raw()
dt-bindings: arm: stm32: st,mlahb: Drop spurious "reg" property from example
dt-bindings: arm: sunxi: Fix incorrect '-' usage
of: of_test: add MODULE_DESCRIPTION()
Linus Torvalds [Tue, 4 Jun 2024 17:34:13 +0000 (10:34 -0700)]
Merge tag 'linux_kselftest-fixes-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"Fixes to build warnings in several tests and fixes to ftrace tests"
* tag 'linux_kselftest-fixes-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/futex: don't pass a const char* to asprintf(3)
selftests/futex: don't redefine .PHONY targets (all, clean)
selftests/tracing: Fix event filter test to retry up to 10 times
selftests/futex: pass _GNU_SOURCE without a value to the compiler
selftests/overlayfs: Fix build error on ppc64
selftests/openat2: Fix build warnings on ppc64
selftests: cachestat: Fix build warnings on ppc64
tracing/selftests: Fix kprobe event name test for .isra. functions
selftests/ftrace: Update required config
selftests/ftrace: Fix to check required event file
kselftest/alsa: Ensure _GNU_SOURCE is defined
Fuad Tabba [Mon, 3 Jun 2024 12:28:51 +0000 (13:28 +0100)]
KVM: arm64: Ensure that SME controls are disabled in protected mode
KVM (and pKVM) do not support SME guests. Therefore KVM ensures
that the host's SME state is flushed and that SME controls for
enabling access to ZA storage and for streaming are disabled.
pKVM needs to protect against a buggy/malicious host. Ensure that
it wouldn't run a guest when protected mode is enabled should any
of the SME controls be enabled.
Fuad Tabba [Mon, 3 Jun 2024 12:28:50 +0000 (13:28 +0100)]
KVM: arm64: Refactor CPACR trap bit setting/clearing to use ELx format
When setting/clearing CPACR bits for EL0 and EL1, use the ELx
format of the bits, which covers both. This makes the code
clearer, and reduces the chances of accidentally missing a bit.
Fuad Tabba [Mon, 3 Jun 2024 12:28:48 +0000 (13:28 +0100)]
KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM
When running in protected mode we don't want to leak protected
guest state to the host, including whether a guest has used
fpsimd/sve. Therefore, eagerly restore the host state on guest
exit when running in protected mode, which happens only if the
guest has used fpsimd/sve.
Fuad Tabba [Mon, 3 Jun 2024 12:28:47 +0000 (13:28 +0100)]
KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM
Protected mode needs to maintain (save/restore) the host's sve
state, rather than relying on the host kernel to do that. This is
to avoid leaking information to the host about guests and the
type of operations they are performing.
As a first step towards that, allocate memory mapped at hyp, per
cpu, for the host sve state. The following patch will use this
memory to save/restore the host state.
Fuad Tabba [Mon, 3 Jun 2024 12:28:46 +0000 (13:28 +0100)]
KVM: arm64: Specialize handling of host fpsimd state on trap
In subsequent patches, n/vhe will diverge on saving the host
fpsimd/sve state when taking a guest fpsimd/sve trap. Add a
specialized helper to handle it.
Fuad Tabba [Mon, 3 Jun 2024 12:28:45 +0000 (13:28 +0100)]
KVM: arm64: Abstract set/clear of CPTR_EL2 bits behind helper
The same traps controlled by CPTR_EL2 or CPACR_EL1 need to be
toggled in different parts of the code, but the exact bits and
their polarity differ between these two formats and the mode
(vhe/nvhe/hvhe).
To reduce the amount of duplicated code and the chance of getting
the wrong bit/polarity or missing a field, abstract the set/clear
of CPTR_EL2 bits behind a helper.
Since (h)VHE is the way of the future, use the CPACR_EL1 format,
which is a subset of the VHE CPTR_EL2, as a reference.
Fuad Tabba [Mon, 3 Jun 2024 12:28:44 +0000 (13:28 +0100)]
KVM: arm64: Fix prototype for __sve_save_state/__sve_restore_state
Since the prototypes for __sve_save_state/__sve_restore_state at
hyp were added, the underlying macro has acquired a third
parameter for saving/restoring ffr.
Fix the prototypes to account for the third parameter, and
restore the ffr for the guest since it is saved.
Jakub Kicinski [Thu, 30 May 2024 23:26:07 +0000 (16:26 -0700)]
net: tls: fix marking packets as decrypted
For TLS offload we mark packets with skb->decrypted to make sure
they don't escape the host without getting encrypted first.
The crypto state lives in the socket, so it may get detached
by a call to skb_orphan(). As a safety check - the egress path
drops all packets with skb->decrypted and no "crypto-safe" socket.
The skb marking was added to sendpage only (and not sendmsg),
because tls_device injected data into the TCP stack using sendpage.
This special case was missed when sendpage got folded into sendmsg.
Jakub Kicinski [Tue, 4 Jun 2024 01:52:24 +0000 (18:52 -0700)]
Merge tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.10-rc3
The first fixes for v6.10. And we have a big one, I suspect the
biggest wireless pull request we ever had. There are fixes all over,
both in stack and drivers. Likely the most important here are mt76 not
working on mt7615 devices, ath11k not being able to connect to 6 GHz
networks and rtlwifi suffering from packet loss. But of course there's
much more.
* tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (37 commits)
wifi: rtlwifi: Ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
wifi: mt76: mt7615: add missing chanctx ops
wifi: wilc1000: document SRCU usage instead of SRCU
Revert "wifi: wilc1000: set atomic flag on kmemdup in srcu critical section"
Revert "wifi: wilc1000: convert list management to RCU"
wifi: mac80211: fix UBSAN noise in ieee80211_prep_hw_scan()
wifi: mac80211: correctly parse Spatial Reuse Parameter Set element
wifi: mac80211: fix Spatial Reuse element size check
wifi: iwlwifi: mvm: don't read past the mfuart notifcation
wifi: iwlwifi: mvm: Fix scan abort handling with HW rfkill
wifi: iwlwifi: mvm: check n_ssids before accessing the ssids
wifi: iwlwifi: mvm: properly set 6 GHz channel direct probe option
wifi: iwlwifi: mvm: handle BA session teardown in RF-kill
wifi: iwlwifi: mvm: Handle BIGTK cipher in kek_kck cmd
wifi: iwlwifi: mvm: remove stale STA link data during restart
wifi: iwlwifi: dbg_ini: move iwl_dbg_tlv_free outside of debugfs ifdef
wifi: iwlwifi: mvm: set properly mac header
wifi: iwlwifi: mvm: revert gen2 TX A-MPDU size to 64
wifi: iwlwifi: mvm: d3: fix WoWLAN command version lookup
wifi: iwlwifi: mvm: fix a crash on 7265
...
====================
Eric Dumazet [Fri, 31 May 2024 13:26:34 +0000 (13:26 +0000)]
ipv6: sr: block BH in seg6_output_core() and seg6_input_core()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in seg6_output_core() is not good enough,
because seg6_output_core() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter seg6_output_core()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Eric Dumazet [Fri, 31 May 2024 13:26:33 +0000 (13:26 +0000)]
net: ipv6: rpl_iptunnel: block BH in rpl_output() and rpl_input()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in rpl_output() is not good enough,
because rpl_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter rpl_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Eric Dumazet [Fri, 31 May 2024 13:26:32 +0000 (13:26 +0000)]
ipv6: ioam: block BH from ioam6_output()
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in ioam6_output() is not good enough,
because ioam6_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter ioam6_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Matthias Stocker [Fri, 31 May 2024 10:37:11 +0000 (12:37 +0200)]
vmxnet3: disable rx data ring on dma allocation failure
When vmxnet3_rq_create() fails to allocate memory for rq->data_ring.base,
the subsequent call to vmxnet3_rq_destroy_all_rxdataring does not reset
rq->data_ring.desc_size for the data ring that failed, which presumably
causes the hypervisor to reference it on packet reception.
To fix this bug, rq->data_ring.desc_size needs to be set to 0 to tell
the hypervisor to disable this feature.
Linus Torvalds [Mon, 3 Jun 2024 21:42:41 +0000 (14:42 -0700)]
Merge tag 'cxl-fixes-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Compile fix for cxl-test from missing linux/vmalloc.h
- Fix for memregion leaks in devm_cxl_add_region()
* tag 'cxl-fixes-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Fix memregion leaks in devm_cxl_add_region()
cxl/test: Add missing vmalloc.h for tools/testing/cxl/test/mem.c
Paolo Bonzini [Mon, 3 Jun 2024 17:09:55 +0000 (13:09 -0400)]
Merge branch 'kvm-fixes-6.10-1' into HEAD
* Fixes and debugging help for the #VE sanity check. Also disable
it by default, even for CONFIG_DEBUG_KERNEL, because it was found
to trigger spuriously (most likely a processor erratum as the
exact symptoms vary by generation).
* Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled
situation (GIF=0 or interrupt shadow) when the processor supports
virtual NMI. While generally KVM will not request an NMI window
when virtual NMIs are supported, in this case it *does* have to
single-step over the interrupt shadow or enable the STGI intercept,
in order to deliver the latched second NMI.
* Drop support for hand tuning APIC timer advancement from userspace.
Since we have adaptive tuning, and it has proved to work well,
drop the module parameter for manual configuration and with it a
few stupid bugs that it had.
KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
Remove support for specifying a static local APIC timer advancement value,
and instead present a read-only boolean parameter to let userspace enable
or disable KVM's dynamic APIC timer advancement. Realistically, it's all
but impossible for userspace to specify an advancement that is more
precise than what KVM's adaptive tuning can provide. E.g. a static value
needs to be tuned for the exact hardware and kernel, and if KVM is using
hrtimers, likely requires additional tuning for the exact configuration of
the entire system.
Dropping support for a userspace provided value also fixes several flaws
in the interface. E.g. KVM interprets a negative value other than -1 as a
large advancement, toggling between a negative and positive value yields
unpredictable behavior as vCPUs will switch from dynamic to static
advancement, changing the advancement in the middle of VM creation can
result in different values for vCPUs within a VM, etc. Those flaws are
mostly fixable, but there's almost no justification for taking on yet more
complexity (it's minimal complexity, but still non-zero).
The only arguments against using KVM's adaptive tuning is if a setup needs
a higher maximum, or if the adjustments are too reactive, but those are
arguments for letting userspace control the absolute max advancement and
the granularity of each adjustment, e.g. similar to how KVM provides knobs
for halt polling.
Ravi Bangoria [Fri, 31 May 2024 04:46:44 +0000 (04:46 +0000)]
KVM: SEV-ES: Delegate LBR virtualization to the processor
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. Although KVM currently enforces LBRV for SEV-ES guests, there
are multiple issues with it:
o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR
interception is used to dynamically toggle LBRV for performance reasons,
this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3:
Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR
is of swap type A.
o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the
VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA
encryption.