Linus Torvalds [Fri, 1 Nov 2024 00:56:19 +0000 (14:56 -1000)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to force a checkpoint when the program's jump
history becomes too long (Eduard Zingerman)
- Add several fixes to the BPF bits iterator addressing issues like
memory leaks and overflow problems (Hou Tao)
- Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong)
- Fix BPF test infra's LIVE_FRAME frame update after a page has been
recycled (Toke Høiland-Jørgensen)
- Fix BPF verifier and undo the 40-bytes extra stack space for
bpf_fastcall patterns due to various bugs (Eduard Zingerman)
- Fix a BPF sockmap race condition which could trigger a NULL pointer
dereference in sock_map_link_update_prog (Cong Wang)
- Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under
the socket lock (Jiayuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
selftests/bpf: Add three test cases for bits_iter
bpf: Use __u64 to save the bits in bits iterator
bpf: Check the validity of nr_words in bpf_iter_bits_new()
bpf: Add bpf_mem_alloc_check_size() helper
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
selftests/bpf: Add test for trie_get_next_key()
bpf: Fix out-of-bounds write in trie_get_next_key()
selftests/bpf: Test with a very short loop
bpf: Force checkpoint when jmp history is too long
bpf: fix filed access without lock
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
Linus Torvalds [Thu, 31 Oct 2024 22:39:58 +0000 (12:39 -1000)]
Merge tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi, bluetooth and netfilter.
No known new regressions outstanding.
Current release - regressions:
- wifi: mt76: do not increase mcu skb refcount if retry is not
supported
Current release - new code bugs:
- wifi:
- rtw88: fix the RX aggregation in USB 3 mode
- mac80211: fix memory corruption bug in struct ieee80211_chanctx
Previous releases - regressions:
- sched:
- stop qdisc_tree_reduce_backlog on TC_H_ROOT
- sch_api: fix xa_insert() error path in tcf_block_get_ext()
- wifi:
- revert "wifi: iwlwifi: remove retry loops in start"
- cfg80211: clear wdev->cqm_config pointer on free
- netfilter: fix potential crash in nf_send_reset6()
- ip_tunnel: fix suspicious RCU usage warning in ip_tunnel_find()
- bluetooth: fix null-ptr-deref in hci_read_supported_codecs
- eth: mlxsw: add missing verification before pushing Tx header
- eth: hns3: fixed hclge_fetch_pf_reg accesses bar space out of
bounds issue
Previous releases - always broken:
- wifi: mac80211: do not pass a stopped vif to the driver in
.get_txpower
- netfilter: sanitize offset and length before calling skb_checksum()
- core:
- fix crash when config small gso_max_size/gso_ipv4_max_size
- skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
- mptcp: protect sched with rcu_read_lock
- eth: ice: fix crash on probe for DPLL enabled E810 LOM
- eth: macsec: fix use-after-free while sending the offloading packet
- eth: stmmac: fix unbalanced DMA map/unmap for non-paged SKB data
- eth: hns3: fix kernel crash when 1588 is sent on HIP08 devices
- eth: mtk_wed: fix path of MT7988 WO firmware"
* tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
net: hns3: fix kernel crash when 1588 is sent on HIP08 devices
net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue
net: hns3: initialize reset_timer before hclgevf_misc_irq_init()
net: hns3: don't auto enable misc vector
net: hns3: Resolved the issue that the debugfs query result is inconsistent.
net: hns3: fix missing features due to dev->features configuration too early
net: hns3: fixed reset failure issues caused by the incorrect reset type
net: hns3: add sync command to sync io-pgtable
net: hns3: default enable tx bounce buffer when smmu enabled
netfilter: nft_payload: sanitize offset and length before calling skb_checksum()
net: ethernet: mtk_wed: fix path of MT7988 WO firmware
selftests: forwarding: Add IPv6 GRE remote change tests
mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address
mlxsw: pci: Sync Rx buffers for device
mlxsw: pci: Sync Rx buffers for CPU
mlxsw: spectrum_ptp: Add missing verification before pushing Tx header
net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6()
netfilter: Fix use-after-free in get_info()
...
Linus Torvalds [Thu, 31 Oct 2024 18:15:40 +0000 (08:15 -1000)]
Merge tag 'sound-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Here we see slightly more commits than wished, but basically all are
small and mostly trivial fixes.
The only core change is the workaround for __counted_by() usage in
ASoC DAPM code, while the rest are device-specific fixes for Intel
Baytrail devices, Cirrus and wcd937x codecs, and HD-audio / USB-audio
devices"
* tag 'sound-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: Fix headset mic on TUXEDO Stellaris 16 Gen6 mb1
ALSA: hda/realtek: Fix headset mic on TUXEDO Gemini 17 Gen3
ALSA: usb-audio: Add quirks for Dell WD19 dock
ASoC: codecs: wcd937x: relax the AUX PDM watchdog
ASoC: codecs: wcd937x: add missing LO Switch control
ASoC: dt-bindings: rockchip,rk3308-codec: add port property
ALSA: hda/realtek: Add subwoofer quirk for Infinix ZERO BOOK 13
ASoC: dapm: fix bounds checker error in dapm_widget_list_create
ASoC: Intel: sst: Fix used of uninitialized ctx to log an error
ASoC: cs42l51: Fix some error handling paths in cs42l51_probe()
ASoC: Intel: sst: Support LPE0F28 ACPI HID
ALSA: hda/realtek: Limit internal Mic boost on Dell platform
ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec
ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work()
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
The test_run code detects whether a page has been modified and
re-initialises the xdp_frame structure if it has, using
xdp_update_frame_from_buff(). However, xdp_update_frame_from_buff()
doesn't touch frame->mem, so that wasn't correctly re-initialised, which
led to the pages from page_pool not being returned correctly. Syzbot
noticed this as a memory leak.
Fix this by also copying the frame->mem structure when re-initialising
the frame, like we do on initialisation of a new page from page_pool.
Paolo Abeni [Thu, 31 Oct 2024 10:32:57 +0000 (11:32 +0100)]
Merge tag 'for-net-2024-10-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci: fix null-ptr-deref in hci_read_supported_codecs
* tag 'for-net-2024-10-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
====================
====================
There are some bugfix for the HNS3 ethernet driver
ChangeLog:
v2 -> v3:
- Rewrite the commit logs of net: hns3: add sync command to sync io-pgtable' to
add more verbose explanation, suggested Paolo.
- Add fixes tag for hardware issue, suggested Paolo and Simon Horman.
v2: https://lore.kernel.org/all/20241018101059.1718375[email protected]/
v1 -> v2:
- Pass IRQF_NO_AUTOEN to request_irq(), suggested by Jakub.
- Rewrite the commit logs of 'net: hns3: default enable tx bounce buffer when smmu enabled'
and 'net: hns3: add sync command to sync io-pgtable'.
v1: https://lore.kernel.org/all/20241011094521.3008298[email protected]/
====================
Jie Wang [Fri, 25 Oct 2024 09:29:38 +0000 (17:29 +0800)]
net: hns3: fix kernel crash when 1588 is sent on HIP08 devices
Currently, HIP08 devices does not register the ptp devices, so the
hdev->ptp is NULL. But the tx process would still try to set hardware time
stamp info with SKBTX_HW_TSTAMP flag and cause a kernel crash.
Hao Lan [Fri, 25 Oct 2024 09:29:37 +0000 (17:29 +0800)]
net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue
The TQP BAR space is divided into two segments. TQPs 0-1023 and TQPs
1024-1279 are in different BAR space addresses. However,
hclge_fetch_pf_reg does not distinguish the tqp space information when
reading the tqp space information. When the number of TQPs is greater
than 1024, access bar space overwriting occurs.
The problem of different segments has been considered during the
initialization of tqp.io_base. Therefore, tqp.io_base is directly used
when the queue is read in hclge_fetch_pf_reg.
The error message:
Unable to handle kernel paging request at virtual address ffff800037200000
pc : hclge_fetch_pf_reg+0x138/0x250 [hclge]
lr : hclge_get_regs+0x84/0x1d0 [hclge]
Call trace:
hclge_fetch_pf_reg+0x138/0x250 [hclge]
hclge_get_regs+0x84/0x1d0 [hclge]
hns3_get_regs+0x2c/0x50 [hns3]
ethtool_get_regs+0xf4/0x270
dev_ethtool+0x674/0x8a0
dev_ioctl+0x270/0x36c
sock_do_ioctl+0x110/0x2a0
sock_ioctl+0x2ac/0x530
__arm64_sys_ioctl+0xa8/0x100
invoke_syscall+0x4c/0x124
el0_svc_common.constprop.0+0x140/0x15c
do_el0_svc+0x30/0xd0
el0_svc+0x1c/0x2c
el0_sync_handler+0xb0/0xb4
el0_sync+0x168/0x180
Fixes: 939ccd107ffc ("net: hns3: move dump regs function to a separate file") Signed-off-by: Hao Lan <[email protected]> Signed-off-by: Jijie Shao <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
Jian Shen [Fri, 25 Oct 2024 09:29:36 +0000 (17:29 +0800)]
net: hns3: initialize reset_timer before hclgevf_misc_irq_init()
Currently the misc irq is initialized before reset_timer setup. But
it will access the reset_timer in the irq handler. So initialize
the reset_timer earlier.
Fixes: ff200099d271 ("net: hns3: remove unnecessary work in hclgevf_main") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Jijie Shao <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
Jian Shen [Fri, 25 Oct 2024 09:29:35 +0000 (17:29 +0800)]
net: hns3: don't auto enable misc vector
Currently, there is a time window between misc irq enabled
and service task inited. If an interrupte is reported at
this time, it will cause warning like below:
Hao Lan [Fri, 25 Oct 2024 09:29:34 +0000 (17:29 +0800)]
net: hns3: Resolved the issue that the debugfs query result is inconsistent.
This patch modifies the implementation of debugfs:
When the user process stops unexpectedly, not all data of the file system
is read. In this case, the save_buf pointer is not released. When the user
process is called next time, save_buf is used to copy the cached data
to the user space. As a result, the queried data is inconsistent. To solve
this problem, determine whether the function is invoked for the first time
based on the value of *ppos. If *ppos is 0, obtain the actual data.
Hao Lan [Fri, 25 Oct 2024 09:29:33 +0000 (17:29 +0800)]
net: hns3: fix missing features due to dev->features configuration too early
Currently, the netdev->features is configured in hns3_nic_set_features.
As a result, __netdev_update_features considers that there is no feature
difference, and the procedures of the real features are missing.
Hao Lan [Fri, 25 Oct 2024 09:29:32 +0000 (17:29 +0800)]
net: hns3: fixed reset failure issues caused by the incorrect reset type
When a reset type that is not supported by the driver is input, a reset
pending flag bit of the HNAE3_NONE_RESET type is generated in
reset_pending. The driver does not have a mechanism to clear this type
of error. As a result, the driver considers that the reset is not
complete. This patch provides a mechanism to clear the
HNAE3_NONE_RESET flag and the parameter of
hnae3_ae_ops.set_default_reset_request is verified.
Jian Shen [Fri, 25 Oct 2024 09:29:31 +0000 (17:29 +0800)]
net: hns3: add sync command to sync io-pgtable
To avoid errors in pgtable prefectch, add a sync command to sync
io-pagtable.
This is a supplement for the previous patch.
We want all the tx packet can be handled with tx bounce buffer path.
But it depends on the remain space of the spare buffer, checked by the
hns3_can_use_tx_bounce(). In most cases, maybe 99.99%, it returns true.
But once it return false by no available space, the packet will be handled
with the former path, which will map/unmap the skb buffer.
Then the driver will face the smmu prefetch risk again.
So add a sync command in this case to avoid smmu prefectch,
just protects corner scenes.
Peiyang Wang [Fri, 25 Oct 2024 09:29:30 +0000 (17:29 +0800)]
net: hns3: default enable tx bounce buffer when smmu enabled
The SMMU engine on HIP09 chip has a hardware issue.
SMMU pagetable prefetch features may prefetch and use a invalid PTE
even the PTE is valid at that time. This will cause the device trigger
fake pagefaults. The solution is to avoid prefetching by adding a
SYNC command when smmu mapping a iova. But the performance of nic has a
sharp drop. Then we do this workaround, always enable tx bounce buffer,
avoid mapping/unmapping on TX path.
This issue only affects HNS3, so we always enable
tx bounce buffer when smmu enabled to improve performance.
netfilter: nft_payload: sanitize offset and length before calling skb_checksum()
If access to offset + length is larger than the skbuff length, then
skb_checksum() triggers BUG_ON().
skb_checksum() internally subtracts the length parameter while iterating
over skbuff, BUG_ON(len) at the end of it checks that the expected
length to be included in the checksum calculation is fully consumed.
Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support") Reported-by: Slavin Liu <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
Daniel Golle [Sat, 26 Oct 2024 13:52:25 +0000 (14:52 +0100)]
net: ethernet: mtk_wed: fix path of MT7988 WO firmware
linux-firmware commit 808cba84 ("mtk_wed: add firmware for mt7988
Wireless Ethernet Dispatcher") added mt7988_wo_{0,1}.bin in the
'mediatek/mt7988' directory while driver current expects the files in
the 'mediatek' directory.
Change path in the driver header now that the firmware has been added.
Jakub Kicinski [Thu, 31 Oct 2024 01:24:41 +0000 (18:24 -0700)]
Merge branch 'mlxsw-fixes'
Petr Machata says:
====================
mlxsw: Fixes
In this patchset:
- Tx header should be pushed for each packet which is transmitted via
Spectrum ASICs. Patch #1 adds a missing call to skb_cow_head() to make
sure that there is both enough room to push the Tx header and that the
SKB header is not cloned and can be modified.
- Commit b5b60bb491b2 ("mlxsw: pci: Use page pool for Rx buffers
allocation") converted mlxsw to use page pool for Rx buffers allocation.
Sync for CPU and for device should be done for Rx pages. In patches #2
and #3, add the missing calls to sync pages for, respectively, CPU and
the device.
- Patch #4 then fixes a bug to IPv6 GRE forwarding offload. Patch #5 adds
a generic forwarding test that fails with mlxsw ports prior to the fix.
====================
Ido Schimmel [Fri, 25 Oct 2024 14:26:29 +0000 (16:26 +0200)]
selftests: forwarding: Add IPv6 GRE remote change tests
Test that after changing the remote address of an ip6gre net device
traffic is forwarded as expected. Test with both flat and hierarchical
topologies and with and without an input / output keys.
Ido Schimmel [Fri, 25 Oct 2024 14:26:28 +0000 (16:26 +0200)]
mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address
The device stores IPv6 addresses that are used for encapsulation in
linear memory that is managed by the driver.
Changing the remote address of an ip6gre net device never worked
properly, but since cited commit the following reproducer [1] would
result in a warning [2] and a memory leak [3]. The problem is that the
new remote address is never added by the driver to its hash table (and
therefore the device) and the old address is never removed from it.
Fix by programming the new address when the configuration of the ip6gre
net device changes and removing the old one. If the address did not
change, then the above would result in increasing the reference count of
the address and then decreasing it.
[1]
# ip link add name bla up type ip6gre local 2001:db8:1::1 remote 2001:db8:2::1 tos inherit ttl inherit
# ip link set dev bla type ip6gre remote 2001:db8:3::1
# ip link del dev bla
# devlink dev reload pci/0000:01:00.0
Amit Cohen [Fri, 25 Oct 2024 14:26:27 +0000 (16:26 +0200)]
mlxsw: pci: Sync Rx buffers for device
Non-coherent architectures, like ARM, may require invalidating caches
before the device can use the DMA mapped memory, which means that before
posting pages to device, drivers should sync the memory for device.
Sync for device can be configured as page pool responsibility. Set the
relevant flag and define max_len for sync.
Amit Cohen [Fri, 25 Oct 2024 14:26:26 +0000 (16:26 +0200)]
mlxsw: pci: Sync Rx buffers for CPU
When Rx packet is received, drivers should sync the pages for CPU, to
ensure the CPU reads the data written by the device and not stale
data from its cache.
Add the missing sync call in Rx path, sync the actual length of data for
each fragment.
Amit Cohen [Fri, 25 Oct 2024 14:26:25 +0000 (16:26 +0200)]
mlxsw: spectrum_ptp: Add missing verification before pushing Tx header
Tx header should be pushed for each packet which is transmitted via
Spectrum ASICs. The cited commit moved the call to skb_cow_head() from
mlxsw_sp_port_xmit() to functions which handle Tx header.
In case that mlxsw_sp->ptp_ops->txhdr_construct() is used to handle Tx
header, and txhdr_construct() is mlxsw_sp_ptp_txhdr_construct(), there is
no call for skb_cow_head() before pushing Tx header size to SKB. This flow
is relevant for Spectrum-1 and Spectrum-4, for PTP packets.
Add the missing call to skb_cow_head() to make sure that there is both
enough room to push the Tx header and that the SKB header is not cloned and
can be modified.
An additional set will be sent to net-next to centralize the handling of
the Tx header by pushing it to every packet just before transmission.
Benoît Monin [Thu, 24 Oct 2024 14:01:54 +0000 (16:01 +0200)]
net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
As documented in skbuff.h, devices with NETIF_F_IPV6_CSUM capability
can only checksum TCP and UDP over IPv6 if the IP header does not
contains extension.
This is enforced for UDP packets emitted from user-space to an IPv6
address as they go through ip6_make_skb(), which calls
__ip6_append_data() where a check is done on the header size before
setting CHECKSUM_PARTIAL.
But the introduction of UDP encapsulation with fou6 added a code-path
where it is possible to get an skb with a partial UDP checksum and an
IPv6 header with extension:
* fou6 adds a UDP header with a partial checksum if the inner packet
does not contains a valid checksum.
* ip6_tunnel adds an IPv6 header with a destination option extension
header if encap_limit is non-zero (the default value is 4).
The thread linked below describes in more details how to reproduce the
problem with GRE-in-UDP tunnel.
Add a check on the network header size in skb_csum_hwoffload_help() to
make sure no IPv6 packet with extension header is handed to a network
device with NETIF_F_IPV6_CSUM capability.
Linus Torvalds [Wed, 30 Oct 2024 21:17:47 +0000 (11:17 -1000)]
Merge tag 'perf-tools-fixes-for-v6.12-2-2024-10-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Update more header copies with the kernel sources, including const.h,
msr-index.h, arm64's cputype.h, kvm's, bits.h and unaligned.h
- The return from 'write' isn't a pid, fix cut'n'paste error in 'perf
trace'
- Fix up the python binding build on architectures without
HAVE_KVM_STAT_SUPPORT
- Add some more bounds checks to augmented_raw_syscalls.bpf.c (used to
collect syscall pointer arguments in 'perf trace') to make the
resulting bytecode to pass the kernel BPF verifier, allowing us to go
back accepting clang 12.0.1 as the minimum version required for
compiling BPF sources
- Add __NR_capget for x86 to fix a regression on running perf + intel
PT (hw tracing) as non-root setting up the capabilities as described
in https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
- Fix missing syscalltbl in non-explicitly listed architectures,
noticed on ARM 32-bit, that still needs a .tbl generator for the
syscall id<->name tables, should be added for v6.13
- Handle 'perf test' failure when handling broken DWARF for ASM files
* tag 'perf-tools-fixes-for-v6.12-2-2024-10-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf cap: Add __NR_capget to arch/x86 unistd
tools headers: Update the linux/unaligned.h copy with the kernel sources
tools headers arm64: Sync arm64's cputype.h with the kernel sources
tools headers: Synchronize {uapi/}linux/bits.h with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
perf python: Fix up the build on architectures without HAVE_KVM_STAT_SUPPORT
perf test: Handle perftool-testsuite_probe failure due to broken DWARF
tools headers UAPI: Sync kvm headers with the kernel sources
perf trace: Fix non-listed archs in the syscalltbl routines
perf build: Change the clang check back to 12.0.1
perf trace augmented_raw_syscalls: Add more checks to pass the verifier
perf trace augmented_raw_syscalls: Add extra array index bounds checking to satisfy some BPF verifiers
perf trace: The return from 'write' isn't a pid
tools headers UAPI: Sync linux/const.h with the kernel headers
====================
The patch set fixes several issues in bits iterator. Patch #1 fixes the
kmemleak problem of bits iterator. Patch #2~#3 fix the overflow problem
of nr_bits. Patch #4 fixes the potential stack corruption when bits
iterator is used on 32-bit host. Patch #5 adds more test cases for bits
iterator.
Please see the individual patches for more details. And comments are
always welcome.
---
v4:
* patch #1: add ack from Yafang
* patch #3: revert code-churn like changes:
(1) compute nr_bytes and nr_bits before the check of nr_words.
(2) use nr_bits == 64 to check for single u64, preventing build
warning on 32-bit hosts.
* patch #4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)"
v3: https://lore.kernel.org/bpf/20241025013233[email protected]/T/#t
* split the bits-iterator related patches from "Misc fixes for bpf"
patch set
* patch #1: use "!nr_bits || bits >= nr_bits" to stop the iteration
* patch #2: add a new helper for the overflow problem
* patch #3: decrease the limitation from 512 to 511 and check whether
nr_bytes is too large for bpf memory allocator explicitly
* patch #5: add two more test cases for bit iterator
Hou Tao [Wed, 30 Oct 2024 10:05:16 +0000 (18:05 +0800)]
selftests/bpf: Add three test cases for bits_iter
Add more test cases for bits iterator:
(1) huge word test
Verify the multiplication overflow of nr_bits in bits_iter. Without
the overflow check, when nr_words is 67108865, nr_bits becomes 64,
causing bpf_probe_read_kernel_common() to corrupt the stack.
(2) max word test
Verify correct handling of maximum nr_words value (511).
(3) bad word test
Verify early termination of bits iteration when bits iterator
initialization fails.
Also rename bits_nomem to bits_too_big to better reflect its purpose.
Hou Tao [Wed, 30 Oct 2024 10:05:15 +0000 (18:05 +0800)]
bpf: Use __u64 to save the bits in bits iterator
On 32-bit hosts (e.g., arm32), when a bpf program passes a u64 to
bpf_iter_bits_new(), bpf_iter_bits_new() will use bits_copy to store the
content of the u64. However, bits_copy is only 4 bytes, leading to stack
corruption.
The straightforward solution would be to replace u64 with unsigned long
in bpf_iter_bits_new(). However, this introduces confusion and problems
for 32-bit hosts because the size of ulong in bpf program is 8 bytes,
but it is treated as 4-bytes after passed to bpf_iter_bits_new().
Fix it by changing the type of both bits and bit_count from unsigned
long to u64. However, the change is not enough. The main reason is that
bpf_iter_bits_next() uses find_next_bit() to find the next bit and the
pointer passed to find_next_bit() is an unsigned long pointer instead
of a u64 pointer. For 32-bit little-endian host, it is fine but it is
not the case for 32-bit big-endian host. Because under 32-bit big-endian
host, the first iterated unsigned long will be the bits 32-63 of the u64
instead of the expected bits 0-31. Therefore, in addition to changing
the type, swap the two unsigned longs within the u64 for 32-bit
big-endian host.
Hou Tao [Wed, 30 Oct 2024 10:05:14 +0000 (18:05 +0800)]
bpf: Check the validity of nr_words in bpf_iter_bits_new()
Check the validity of nr_words in bpf_iter_bits_new(). Without this
check, when multiplication overflow occurs for nr_bits (e.g., when
nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur
due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008).
Fix it by limiting the maximum value of nr_words to 511. The value is
derived from the current implementation of BPF memory allocator. To
ensure compatibility if the BPF memory allocator's size limitation
changes in the future, use the helper bpf_mem_alloc_check_size() to
check whether nr_bytes is too larger. And return -E2BIG instead of
-ENOMEM for oversized nr_bytes.
Hou Tao [Wed, 30 Oct 2024 10:05:13 +0000 (18:05 +0800)]
bpf: Add bpf_mem_alloc_check_size() helper
Introduce bpf_mem_alloc_check_size() to check whether the allocation
size exceeds the limitation for the kmalloc-equivalent allocator. The
upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than
non-percpu allocation, so a percpu argument is added to the helper.
The helper will be used in the following patch to check whether the size
parameter passed to bpf_mem_alloc() is too big.
Hou Tao [Wed, 30 Oct 2024 10:05:12 +0000 (18:05 +0800)]
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the
bits are dynamically allocated. However, the check is incorrect and may
cause a kmemleak as shown below:
That is because nr_bits will be set as zero in bpf_iter_bits_next()
after all bits have been iterated.
Fix the issue by setting kit->bit to kit->nr_bits instead of setting
kit->nr_bits to zero when the iteration completes in
bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to
check whether the iteration is complete and still use "nr_bits > 64" to
indicate whether bits are dynamically allocated. The "!nr_bits" check is
necessary because bpf_iter_bits_new() may fail before setting
kit->nr_bits, and this condition will stop the iteration early instead
of accessing the zeroed or freed kit->bits.
Considering the initial value of kit->bits is -1 and the type of
kit->nr_bits is unsigned int, change the type of kit->nr_bits to int.
The potential overflow problem will be handled in the following patch.
Sungwoo Kim [Tue, 29 Oct 2024 19:44:41 +0000 (19:44 +0000)]
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
Fix __hci_cmd_sync_sk() to return not NULL for unknown opcodes.
__hci_cmd_sync_sk() returns NULL if a command returns a status event.
However, it also returns NULL where an opcode doesn't exist in the
hci_cc table because hci_cmd_complete_evt() assumes status = skb->data[0]
for unknown opcodes.
This leads to null-ptr-deref in cmd_sync for HCI_OP_READ_LOCAL_CODECS as
there is no hci_cc for HCI_OP_READ_LOCAL_CODECS, which always assumes
status = skb->data[0].
Linus Torvalds [Wed, 30 Oct 2024 18:16:23 +0000 (08:16 -1000)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two small fixes, both in drivers (ufs and scsi_debug)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix another deadlock during RTC update
scsi: scsi_debug: Fix do_device_access() handling of unexpected SG copy length
Jan Schär [Tue, 29 Oct 2024 22:12:49 +0000 (23:12 +0100)]
ALSA: usb-audio: Add quirks for Dell WD19 dock
The WD19 family of docks has the same audio chipset as the WD15. This
change enables jack detection on the WD19.
We don't need the dell_dock_mixer_init quirk for the WD19. It is only
needed because of the dell_alc4020_map quirk for the WD15 in
mixer_maps.c, which disables the volume controls. Even for the WD15,
this quirk was apparently only needed when the dock firmware was not
updated.
Takashi Iwai [Wed, 30 Oct 2024 13:46:35 +0000 (14:46 +0100)]
Merge tag 'asoc-fix-v6.12-rc5' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.12
The biggest set of changes here is Hans' fixes and quirks for various
Baytrail based platforms with RT5640 CODECs, and there's one core fix
for a missed length assignment for __counted_by() checking. Otherwise
it's small device specific fixes, several of them in the DT bindings.
Concurrent execution of module unload and get_info() trigered the warning.
The root cause is as follows:
cpu0 cpu1
module_exit
//mod->state = MODULE_STATE_GOING
ip6table_nat_exit
xt_unregister_template
kfree(t)
//removed from templ_list
getinfo()
t = xt_find_table_lock
list_for_each_entry(tmpl, &xt_templates[af]...)
if (strcmp(tmpl->name, name))
continue; //table not found
try_module_get
list_for_each_entry(t, &xt_net->tables[af]...)
return t; //not get refcnt
module_put(t->me) //uaf
unregister_pernet_subsys
//remove table from xt_net list
While xt_table module was going away and has been removed from
xt_templates list, we couldnt get refcnt of xt_table->me. Check
module in xt_net->tables list re-traversal to fix it.
Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Signed-off-by: Dong Chenchen <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
Eduard Zingerman [Tue, 29 Oct 2024 19:39:11 +0000 (12:39 -0700)]
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
Hou Tao reported an issue with bpf_fastcall patterns allowing extra
stack space above MAX_BPF_STACK limit. This extra stack allowance is
not integrated properly with the following verifier parts:
- backtracking logic still assumes that stack can't exceed
MAX_BPF_STACK;
- bpf_verifier_env->scratched_stack_slots assumes only 64 slots are
available.
Here is an example of an issue with precision tracking
(note stack slot -8 tracked as precise instead of -520):
This patch disables the additional allowance for the moment.
Also, two test cases are removed:
- bpf_fastcall_max_stack_ok:
it fails w/o additional stack allowance;
- bpf_fastcall_max_stack_fail:
this test is no longer necessary, stack size follows
regular rules, pattern invalidation is checked by other
test cases.
Linus Torvalds [Wed, 30 Oct 2024 02:41:30 +0000 (16:41 -1000)]
Merge tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cgroup_bpf_release_fn() could saturate system_wq with
cgrp->bpf.release_work which can then form a circular dependency
leading to deadlocks. Fix by using a dedicated workqueue. The
system_wq's max concurrency limit is being increased separately.
- Fix theoretical off-by-one bug when enforcing max cgroup hierarchy
depth
* tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix potential overflow issue when checking max_depth
cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
Linus Torvalds [Wed, 30 Oct 2024 02:35:40 +0000 (16:35 -1000)]
Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes
* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix enq_last_no_enq_fails selftest
sched_ext: Make cast_mask() inline
scx: Fix raciness in scx_ops_bypass()
scx: Fix exit selftest to use custom DSQ
sched_ext: Fix function pointer type mismatches in BPF selftests
selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
Linus Torvalds [Wed, 30 Oct 2024 02:24:02 +0000 (16:24 -1000)]
Merge tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fix for a slub_kunit test warning with MEM_ALLOC_PROFILING_DEBUG (Pei
Xiao)
- Fix for a MTE-based KASAN BUG in krealloc() (Qun-Wei Lin)
* tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm: krealloc: Fix MTE false alarm in __do_krealloc
slub/kunit: fix a WARNING due to unwrapped __kmalloc_cache_noprof
Linus Torvalds [Wed, 30 Oct 2024 02:19:15 +0000 (16:19 -1000)]
Merge tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes. 13 are cc:stable. 13 are MM and 8 are non-MM.
No particular theme here - mainly singletons, a couple of doubletons.
Please see the changelogs"
* tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
mm: avoid unconditional one-tick sleep when swapcache_prepare fails
mseal: update mseal.rst
mm: split critical region in remap_file_pages() and invoke LSMs in between
selftests/mm: fix deadlock for fork after pthread_create with atomic_bool
Revert "selftests/mm: replace atomic_bool with pthread_barrier_t"
Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
tools: testing: add expand-only mode VMA test
mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()
resource,kexec: walk_system_ram_res_rev must retain resource flags
nilfs2: fix kernel bug due to missing clearing of checked flag
mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id
ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
mm: shmem: fix data-race in shmem_getattr()
mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL
x86/traps: move kmsan check after instrumentation_begin
resource: remove dependency on SPARSEMEM from GET_FREE_REGION
mm/mmap: fix race in mmap_region() with ftruncate()
mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
fork: only invoke khugepaged, ksm hooks if no error
fork: do not invoke uffd on fork if error occurs
...
Jakub Kicinski [Wed, 30 Oct 2024 01:57:12 +0000 (18:57 -0700)]
Merge tag 'wireless-2024-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
wireless fixes for v6.12-rc6
Another set of fixes, mostly iwlwifi:
* fix infinite loop in 6 GHz scan if more than
255 colocated APs were reported
* revert removal of retry loops for now to work
around issues with firmware initialization on
some devices/platforms
* fix SAR table issues with some BIOSes
* fix race in suspend/debug collection
* fix memory leak in fw recovery
* fix link ID leak in AP mode for older devices
* fix sending TX power constraints
* fix link handling in FW restart
And also the stack:
* fix setting TX power from userspace with the new
chanctx emulation code for old-style drivers
* fix a memory corruption bug due to structure
embedding
* fix CQM configuration double-free when moving
between net namespaces
* tag 'wireless-2024-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: ieee80211_i: Fix memory corruption bug in struct ieee80211_chanctx
wifi: iwlwifi: mvm: fix 6 GHz scan construction
wifi: cfg80211: clear wdev->cqm_config pointer on free
mac80211: fix user-power when emulating chanctx
Revert "wifi: iwlwifi: remove retry loops in start"
wifi: iwlwifi: mvm: don't add default link in fw restart flow
wifi: iwlwifi: mvm: Fix response handling in iwl_mvm_send_recovery_cmd()
wifi: iwlwifi: mvm: SAR table alignment
wifi: iwlwifi: mvm: Use the sync timepoint API in suspend
wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd
wifi: iwlwifi: mvm: don't leak a link on AP removal
====================
Wang Liang [Wed, 23 Oct 2024 03:52:13 +0000 (11:52 +0800)]
net: fix crash when config small gso_max_size/gso_ipv4_max_size
Config a small gso_max_size/gso_ipv4_max_size will lead to an underflow
in sk_dst_gso_max_size(), which may trigger a BUG_ON crash,
because sk->sk_gso_max_size would be much bigger than device limits.
Call Trace:
tcp_write_xmit
tso_segs = tcp_init_tso_segs(skb, mss_now);
tcp_set_skb_tso_segs
tcp_skb_pcount_set
// skb->len = 524288, mss_now = 8
// u16 tso_segs = 524288/8 = 65535 -> 0
tso_segs = DIV_ROUND_UP(skb->len, mss_now)
BUG_ON(!tso_segs)
Add check for the minimum value of gso_max_size and gso_ipv4_max_size.
Byeonguk Jeong [Sat, 26 Oct 2024 05:04:58 +0000 (14:04 +0900)]
selftests/bpf: Add test for trie_get_next_key()
Add a test for out-of-bounds write in trie_get_next_key() when a full
path from root to leaf exists and bpf_map_get_next_key() is called
with the leaf node. It may crashes the kernel on failure, so please
run in a VM.
Byeonguk Jeong [Sat, 26 Oct 2024 05:02:43 +0000 (14:02 +0900)]
bpf: Fix out-of-bounds write in trie_get_next_key()
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.
- regarding the LO switch patch. I've got info about that from two persons
independently hence not sure what tags to put there and who should be
the author. Please let me know if that needs to be corrected.
- the wcd937x pdm watchdog is a problem for audio playback and needs to be
fixed. The minimal fix would be to at least increase timeout value but
it will still trigger in case of plenty of dbg messages or other
delay-generating things. Unfortunately, I can't test HPHL/R outputs hence
the patch is only for AUX. The other options would be introducing
module parameter for debugging and using HOLD_OFF bit for that or
adding Kconfig option.
Alexey Klimov (2):
ASoC: codecs: wcd937x: add missing LO Switch control
ASoC: codecs: wcd937x: relax the AUX PDM watchdog
Vladimir Oltean [Wed, 23 Oct 2024 10:05:41 +0000 (13:05 +0300)]
net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()
This command:
$ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
Error: block dev insert failed: -EBUSY.
fails because user space requests the same block index to be set for
both ingress and egress.
[ side note, I don't think it even failed prior to commit 913b47d3424e
("net/sched: Introduce tc block netdev tracking infra"), because this
is a command from an old set of notes of mine which used to work, but
alas, I did not scientifically bisect this ]
The problem is not that it fails, but rather, that the second time
around, it fails differently (and irrecoverably):
$ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
Error: dsa_core: Flow block cb is busy.
[ another note: the extack is added by me for illustration purposes.
the context of the problem is that clsact_init() obtains the same
&q->ingress_block pointer as &q->egress_block, and since we call
tcf_block_get_ext() on both of them, "dev" will be added to the
block->ports xarray twice, thus failing the operation: once through
the ingress block pointer, and once again through the egress block
pointer. the problem itself is that when xa_insert() fails, we have
emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the
offload never sees a corresponding FLOW_BLOCK_UNBIND. ]
Even correcting the bad user input, we still cannot recover:
$ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact
Error: dsa_core: Flow block cb is busy.
Basically the only way to recover is to reboot the system, or unbind and
rebind the net device driver.
To fix the bug, we need to fill the correct error teardown path which
was missed during code movement, and call tcf_block_offload_unbind()
when xa_insert() fails.
[ last note, fundamentally I blame the label naming convention in
tcf_block_get_ext() for the bug. The labels should be named after what
they do, not after the error path that jumps to them. This way, it is
obviously wrong that two labels pointing to the same code mean
something is wrong, and checking the code correctness at the goto site
is also easier ]
Zichen Xie [Tue, 22 Oct 2024 17:19:08 +0000 (12:19 -0500)]
netdevsim: Add trailing zero to terminate the string in nsim_nexthop_bucket_activity_write()
This was found by a static analyzer.
We should not forget the trailing zero after copy_from_user()
if we will further do some string operations, sscanf() in this
case. Adding a trailing zero will ensure that the function
performs properly.
Eduard Zingerman [Tue, 29 Oct 2024 17:26:41 +0000 (10:26 -0700)]
selftests/bpf: Test with a very short loop
The test added is a simplified reproducer from syzbot report [1].
If verifier does not insert checkpoint somewhere inside the loop,
verification of the program would take a very long time.
This would happen because mark_chain_precision() for register r7 would
constantly trace jump history of the loop back, processing many
iterations for each mark_chain_precision() call.
Eduard Zingerman [Tue, 29 Oct 2024 17:26:40 +0000 (10:26 -0700)]
bpf: Force checkpoint when jmp history is too long
A specifically crafted program might trick verifier into growing very
long jump history within a single bpf_verifier_state instance.
Very long jump history makes mark_chain_precision() unreasonably slow,
especially in case if verifier processes a loop.
Mitigate this by forcing new state in is_state_visited() in case if
current state's jump history is too long.
Use same constant as in `skip_inf_loop_check`, but multiply it by
arbitrarily chosen value 2 to account for jump history containing not
only information about jumps, but also information about stack access.
For an example of problematic program consider the code below,
w/o this patch the example is processed by verifier for ~15 minutes,
before failing to allocate big-enough chunk for jmp_history.
/* Maximum number of loops for bpf_loop and bpf_iter_num.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f514247ba8ba..75e88be3bb3e 100644
\--- a/kernel/bpf/verifier.c
\+++ b/kernel/bpf/verifier.c
\@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
skip_inf_loop_check:
if (!force_new_state &&
env->jmps_processed - env->prev_jmps_processed < 20 &&
- env->insn_processed - env->prev_insn_processed < 100)
+ env->insn_processed - env->prev_insn_processed < 100) {
+ verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n",
+ env->insn_idx,
+ env->jmps_processed - env->prev_jmps_processed,
+ cur->jmp_history_cnt);
add_new_state = false;
+ }
goto miss;
}
/* If sl->state is a part of a loop and this loop's entry is a part of
\@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
if (!add_new_state)
return 0;
+ verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n",
+ env->insn_idx);
+
/* There were no equivalent states, remember the current one.
* Technically the current state is not proven to be safe yet,
* but it will either reach outer most bpf_exit (which means it's safe)
from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0
6: R1=ctx() R7=ctx(...) R10=fp0
6: (b7) r0 = 0 ; R0_w=0
7: (95) exit
is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74
from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0
1: R1=ctx() R7_w=scalar(...) R10=fp0
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
... mark_precise 152 steps for r7 ...
2: R7_w=scalar(...)
is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
...
BPF program is too large. Processed 257 insn
The log output shows that checkpoint at label (1) is never created,
because it is suppressed by `skip_inf_loop_check` logic:
a. When 'if' at (2) is processed it pushes a state with insn_idx (1)
onto stack and proceeds to (3);
b. At (5) checkpoint is created, and this resets
env->{jmps,insns}_processed.
c. Verification proceeds and reaches `exit`;
d. State saved at step (a) is popped from stack and is_state_visited()
considers if checkpoint needs to be added, but because
env->{jmps,insns}_processed had been just reset at step (b)
the `skip_inf_loop_check` logic forces `add_new_state` to false.
e. Verifier proceeds with current state, which slowly accumulates
more and more entries in the jump history.
The accumulation of entries in the jump history is a problem because
of two factors:
- it eventually exhausts memory available for kmalloc() allocation;
- mark_chain_precision() traverses the jump history of a state,
meaning that if `r7` is marked precise, verifier would iterate
ever growing jump history until parent state boundary is reached.
(note: the log also shows a REG INVARIANTS VIOLATION warning
upon jset processing, but that's another bug to fix).
With this patch applied, the example above is rejected by verifier
under 1s of time, reaching 1M instructions limit.
The program is a simplified reproducer from syzbot report.
Previous discussion could be found at [1].
The patch does not cause any changes in verification performance,
when tested on selftests from veristat.cfg and cilium programs taken
from [2].
Changelog:
- v1 -> v2:
- moved patch to bpf tree;
- moved force_new_state variable initialization after declaration and
shortened the comment.
v1: https://lore.kernel.org/bpf/20241018020307.1766906[email protected]/
Pedro Tammela [Thu, 24 Oct 2024 16:55:47 +0000 (12:55 -0400)]
net/sched: stop qdisc_tree_reduce_backlog on TC_H_ROOT
In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed
to be either root or ingress. This assumption is bogus since it's valid
to create egress qdiscs with major handle ffff:
Budimir Markovic found that for qdiscs like DRR that maintain an active
class list, it will cause a UAF with a dangling class pointer.
In 066a3b5b2346, the concern was to avoid iterating over the ingress
qdisc since its parent is itself. The proper fix is to stop when parent
TC_H_ROOT is reached because the only way to retrieve ingress is when a
hierarchy which does not contain a ffff: major handle call into
qdisc_lookup with TC_H_MAJ(TC_H_ROOT).
In the scenario where major ffff: is an egress qdisc in any of the tree
levels, the updates will also propagate to TC_H_ROOT, which then the
iteration must stop.
Florian Westphal [Tue, 22 Oct 2024 15:23:18 +0000 (17:23 +0200)]
selftests: netfilter: nft_flowtable.sh: make first pass deterministic
The CI occasionaly encounters a failing test run. Example:
# PASS: ipsec tunnel mode for ns1/ns2
# re-run with random mtus: -o 10966 -l 19499 -r 31322
# PASS: flow offloaded for ns1/ns2
[..]
# FAIL: ipsec tunnel ... counter 1157059 exceeds expected value 878489
This script will re-exec itself, on the second run, random MTUs are
chosen for the involved links. This is done so we can cover different
combinations (large mtu on client, small on server, link has lowest
mtu, etc).
Furthermore, file size is random, even for the first run.
Rework this script and always use the same file size on initial run so
that at least the first round can be expected to have reproducible
behavior.
Second round will use random mtu/filesize.
Raise the failure limit to that of the file size, this should avoid all
errneous test errors. Currently, first fin will remove the offload, so if
one peer is already closing remaining data is handled by classic path,
which result in larger-than-expected counter and a test failure.
Given packet path also counts tcp/ip headers, in case offload is
completely broken this test will still fail (as expected).
The test counter limit could be made more strict again in the future
once flowtable can keep a connection in offloaded state until FINs
in both directions were seen.
Ido Schimmel [Wed, 23 Oct 2024 12:30:09 +0000 (15:30 +0300)]
ipv4: ip_tunnel: Fix suspicious RCU usage warning in ip_tunnel_find()
The per-netns IP tunnel hash table is protected by the RTNL mutex and
ip_tunnel_find() is only called from the control path where the mutex is
taken.
Add a lockdep expression to hlist_for_each_entry_rcu() in
ip_tunnel_find() in order to validate that the mutex is held and to
silence the suspicious RCU usage warning [1].
[1]
WARNING: suspicious RCU usage 6.12.0-rc3-custom-gd95d9a31aceb #139 Not tainted
-----------------------------
net/ipv4/ip_tunnel.c:221 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/362:
#0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60
Jiayuan Chen [Mon, 28 Oct 2024 06:52:26 +0000 (14:52 +0800)]
bpf: fix filed access without lock
The tcp_bpf_recvmsg_parser() function, running in user context,
retrieves seq_copied from tcp_sk without holding the socket lock, and
stores it in a local variable seq. However, the softirq context can
modify tcp_sk->seq_copied concurrently, for example, n tcp_read_sock().
As a result, the seq value is stale when it is assigned back to
tcp_sk->copied_seq at the end of tcp_bpf_recvmsg_parser(), leading to
incorrect behavior.
Due to concurrency, the copied_seq field in tcp_bpf_recvmsg_parser()
might be set to an incorrect value (less than the actual copied_seq) at
the end of function: 'WRITE_ONCE(tcp->copied_seq, seq)'. This causes the
'offset' to be negative in tcp_read_sock()->tcp_recv_skb() when
processing new incoming packets (sk->copied_seq - skb->seq becomes less
than 0), and all subsequent packets will be dropped.
The E810 Lan On Motherboard (LOM) design is vendor specific. Intel
provides the reference design, but it is up to vendor on the final
product design. For some cases, like Linux DPLL support, the static
values defined in the driver does not reflect the actual LOM design.
Current implementation of dpll pins is causing the crash on probe
of the ice driver for such DPLL enabled E810 LOM designs:
The number of dpll pins enabled by LOM vendor is greater than expected
and defined in the driver for Intel designed NICs, which causes the crash.
Prevent the crash and allow generic pin initialization within Linux DPLL
subsystem for DPLL enabled E810 LOM designs.
Newly designed solution for described issue will be based on "per HW
design" pin initialization. It requires pin information dynamically
acquired from the firmware and is already in progress, planned for
next-tree only.
During testing of SR-IOV, Red Hat QE encountered an issue where the
ip link up command intermittently fails for the igbvf interfaces when
using the PREEMPT_RT variant. Investigation revealed that
e1000_write_posted_mbx returns an error due to the lack of an ACK
from e1000_poll_for_ack.
The underlying issue arises from the fact that IRQs are threaded by
default under PREEMPT_RT. While the exact hardware details are not
available, it appears that the IRQ handled by igb_msix_other must
be processed before e1000_poll_for_ack times out. However,
e1000_write_posted_mbx is called with preemption disabled, leading
to a scenario where the IRQ is serviced only after the failure of
e1000_write_posted_mbx.
To resolve this, we set IRQF_NO_THREAD for the affected interrupt,
ensuring that the kernel handles it immediately, thereby preventing
the aforementioned error.
while true; do
ip link set ${nic_test} mtu 1500
ip link set ${vf} mtu 1500
ip link set $vf up
ip link set ${nic_test} vf 0 vlan ${ipaddr_vlan}
ip addr add 172.30.${ipaddr_vlan}.1/24 dev ${vf}
ip addr add 2021:db8:${ipaddr_vlan}::1/64 dev ${vf}
if ! ip link show $vf | grep 'state UP'; then
echo 'Error found'
break
fi
ip link set $vf down
done
Alexey Klimov [Tue, 22 Oct 2024 03:31:31 +0000 (04:31 +0100)]
ASoC: codecs: wcd937x: relax the AUX PDM watchdog
On a system with wcd937x, rxmacro and Qualcomm audio DSP, which is pretty
common set of devices on Qualcomm platforms, and due to the order of how
DAPM widgets are powered on (they are sorted), there is a small time window
when wcd937x chip is online and expects the flow of incoming data but
rxmacro is not yet online. When wcd937x is programmed to receive data
via AUX port then its AUX PDM watchdog is enabled in
wcd937x_codec_enable_aux_pa(). If due to some reasons the rxmacro and
soundwire machinery are delayed to start streaming data, then there is
a chance for this AUX PDM watchdog to reset the wcd937x codec. Such event
is not logged as a message and only wcd937x IRQ counter is increased
however there could be a lot of other reasons for that IRQ.
There is a similar opportunity for such delay during DAPM widgets power
down sequence.
If wcd937x codec reset happens on the start of the playback, then there
will be no sound and if such reset happens at the end of a playback then
it may generate additional clicks and pops noises.
On qrb4210 RB2 board without any debugging bits the wcd937x resets are
sometimes observed at the end of a playback though not always.
With some debugging messages or with some tracing enabled the AUX PDM
watchdog resets the wcd937x codec at the start of a playback and there
is no sound output at all.
In this patch:
- TIMEOUT_SEL bit in PDM_WD_CTL2 register is set to increase the watchdog
reset delay to 100ms which eliminates the AUX PDM watchdog IRQs on
qrb4210 RB2 board completely and decreases the number of unwanted clicks
noises;
- HOLD_OFF bit postpones triggering such watchdog IRQ till wcd937x codec
reset which usually happens at the end of a playback. This allows to
actually output some sound in case of debugging.
Alexey Klimov [Tue, 22 Oct 2024 03:31:30 +0000 (04:31 +0100)]
ASoC: codecs: wcd937x: add missing LO Switch control
The wcd937x supports also AUX input but the control that sets correct
soundwire port for this is missing. This control is required for audio
playback, for instance, on qrb4210 RB2 board as well as on other
SoCs.
Furong Xu [Mon, 21 Oct 2024 06:10:23 +0000 (14:10 +0800)]
net: stmmac: TSO: Fix unbalanced DMA map/unmap for non-paged SKB data
In case the non-paged data of a SKB carries protocol header and protocol
payload to be transmitted on a certain platform that the DMA AXI address
width is configured to 40-bit/48-bit, or the size of the non-paged data
is bigger than TSO_MAX_BUFF_SIZE on a certain platform that the DMA AXI
address width is configured to 32-bit, then this SKB requires at least
two DMA transmit descriptors to serve it.
For example, three descriptors are allocated to split one DMA buffer
mapped from one piece of non-paged data:
dma_desc[N + 0],
dma_desc[N + 1],
dma_desc[N + 2].
Then three elements of tx_q->tx_skbuff_dma[] will be allocated to hold
extra information to be reused in stmmac_tx_clean():
tx_q->tx_skbuff_dma[N + 0],
tx_q->tx_skbuff_dma[N + 1],
tx_q->tx_skbuff_dma[N + 2].
Now we focus on tx_q->tx_skbuff_dma[entry].buf, which is the DMA buffer
address returned by DMA mapping call. stmmac_tx_clean() will try to
unmap the DMA buffer _ONLY_IF_ tx_q->tx_skbuff_dma[entry].buf
is a valid buffer address.
The expected behavior that saves DMA buffer address of this non-paged
data to tx_q->tx_skbuff_dma[entry].buf is:
tx_q->tx_skbuff_dma[N + 0].buf = NULL;
tx_q->tx_skbuff_dma[N + 1].buf = NULL;
tx_q->tx_skbuff_dma[N + 2].buf = dma_map_single();
Unfortunately, the current code misbehaves like this:
tx_q->tx_skbuff_dma[N + 0].buf = dma_map_single();
tx_q->tx_skbuff_dma[N + 1].buf = NULL;
tx_q->tx_skbuff_dma[N + 2].buf = NULL;
On the stmmac_tx_clean() side, when dma_desc[N + 0] is closed by the
DMA engine, tx_q->tx_skbuff_dma[N + 0].buf is a valid buffer address
obviously, then the DMA buffer will be unmapped immediately.
There may be a rare case that the DMA engine does not finish the
pending dma_desc[N + 1], dma_desc[N + 2] yet. Now things will go
horribly wrong, DMA is going to access a unmapped/unreferenced memory
region, corrupted data will be transmited or iommu fault will be
triggered :(
In contrast, the for-loop that maps SKB fragments behaves perfectly
as expected, and that is how the driver should do for both non-paged
data and paged frags actually.
This patch corrects DMA map/unmap sequences by fixing the array index
for tx_q->tx_skbuff_dma[entry].buf when assigning DMA buffer address.
Ley Foon Tan [Mon, 21 Oct 2024 05:46:25 +0000 (13:46 +0800)]
net: stmmac: dwmac4: Fix high address display by updating reg_space[] from register values
The high address will display as 0 if the driver does not set the
reg_space[]. To fix this, read the high address registers and
update the reg_space[] accordingly.
Qun-Wei Lin [Fri, 25 Oct 2024 08:58:11 +0000 (16:58 +0800)]
mm: krealloc: Fix MTE false alarm in __do_krealloc
This patch addresses an issue introduced by commit 1a83a716ec233 ("mm:
krealloc: consider spare memory for __GFP_ZERO") which causes MTE
(Memory Tagging Extension) to falsely report a slab-out-of-bounds error.
The problem occurs when zeroing out spare memory in __do_krealloc. The
original code only considered software-based KASAN and did not account
for MTE. It does not reset the KASAN tag before calling memset, leading
to a mismatch between the pointer tag and the memory tag, resulting
in a false positive.
ALSA: hda/realtek: Add subwoofer quirk for Infinix ZERO BOOK 13
Infinix ZERO BOOK 13 has a 2+2 speaker system which isn't probed correctly.
This patch adds a quirk with the proper pin connections.
Also The mic in this laptop suffers too high gain resulting in mostly
fan noise being recorded,
This patch Also limit mic boost.
HW Probe for device; https://linux-hardware.org/?probe=a2e892c47b
Barry Song [Thu, 26 Sep 2024 21:19:36 +0000 (09:19 +1200)]
mm: avoid unconditional one-tick sleep when swapcache_prepare fails
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
introduced an unconditional one-tick sleep when `swapcache_prepare()`
fails, which has led to reports of UI stuttering on latency-sensitive
Android devices. To address this, we can use a waitqueue to wake up tasks
that fail `swapcache_prepare()` sooner, instead of always sleeping for a
full tick. While tasks may occasionally be woken by an unrelated
`do_swap_page()`, this method is preferable to two scenarios: rapid
re-entry into page faults, which can cause livelocks, and multiple
millisecond sleeps, which visibly degrade user experience.
Oven's testing shows that a single waitqueue resolves the UI stuttering
issue. If a 'thundering herd' problem becomes apparent later, a waitqueue
hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit
locks can be introduced.
Jeff Xu [Tue, 8 Oct 2024 04:09:41 +0000 (04:09 +0000)]
mseal: update mseal.rst
Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces
the can_modify_mm() function with an in-loop check, necessitates an update
to the mseal.rst documentation to reflect this change.
Furthermore, the document has received offline comments regarding the code
sample and suggestions for sentence clarification to enhance reader
comprehension.
mm: split critical region in remap_file_pages() and invoke LSMs in between
Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in
remap_file_pages()") fixed a security issue, it added an LSM check when
trying to remap file pages, so that LSMs have the opportunity to evaluate
such action like for other memory operations such as mmap() and
mprotect().
However, that commit called security_mmap_file() inside the mmap_lock
lock, while the other calls do it before taking the lock, after commit 8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem").
This caused lock inversion issue with IMA which was taking the mmap_lock
and i_mutex lock in the opposite way when the remap_file_pages() system
call was called.
Solve the issue by splitting the critical region in remap_file_pages() in
two regions: the first takes a read lock of mmap_lock, retrieves the VMA
and the file descriptor associated, and calculates the 'prot' and 'flags'
variables; the second takes a write lock on mmap_lock, checks that the VMA
flags and the VMA file descriptor are the same as the ones obtained in the
first critical region (otherwise the system call fails), and calls
do_mmap().
In between, after releasing the read lock and before taking the write
lock, call security_mmap_file(), and solve the lock inversion issue.
Edward Liaw [Fri, 18 Oct 2024 17:17:24 +0000 (17:17 +0000)]
selftests/mm: fix deadlock for fork after pthread_create with atomic_bool
Some additional synchronization is needed on Android ARM64; we see a
deadlock with pthread_create when the parent thread races forward before
the child has a chance to start doing work.
uffd_poll_thread may be called by other tests that do not initialize the
pthread_barrier, so this approach is not correct. This will revert to
using atomic_bool instead.
Edward Liaw [Fri, 18 Oct 2024 17:17:22 +0000 (17:17 +0000)]
Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
Patch series "selftests/mm: revert pthread_barrier change"
On Android arm, pthread_create followed by a fork caused a deadlock in
the case where the fork required work to be completed by the created
thread.
The previous patches incorrectly assumed that the parent would
always initialize the pthread_barrier for the child thread. This
reverts the change and replaces the fix for wp-fork-with-event with the
original use of atomic_bool.
fork_event_consumer may be called by other tests that do not initialize
the pthread_barrier, so this approach is not correct. The subsequent
patch will revert to using atomic_bool instead.
Lorenzo Stoakes [Thu, 17 Oct 2024 14:31:46 +0000 (15:31 +0100)]
tools: testing: add expand-only mode VMA test
Add a test to assert that VMG_FLAG_JUST_EXPAND functions as expected - that
is, when the VMA iterator is positioned at the previous VMA and no VMAs
proceed it, we observe an expansion with all state as expected.
Explicitly place a prior VMA that would otherwise fail this test if the
mode were not enabled (as it would traverse to the previous-previous VMA).
Lorenzo Stoakes [Thu, 17 Oct 2024 14:31:45 +0000 (15:31 +0100)]
mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()
Patch series "introduce VMA merge mode to improve brk() performance".
A ~5% performance regression was discovered on the
aim9.brk_test.ops_per_sec by the linux kernel test bot [0].
In the past to satisfy brk() performance we duplicated VMA expansion code
and special-cased do_brk_flags(). This is however horrid and undoes work
to abstract this logic, so in resolving the issue I have endeavoured to
avoid this.
Investigating further I was able to observe that the use of a
vma_iter_next_range() and vma_prev() pair, causing an unnecessary maple
tree walk. In addition there is work that we do that is simply
unnecessary for brk().
Therefore, add a special VMA merge mode VMG_FLAG_JUST_EXPAND to avoid
doing any of this - it assumes the VMA iterator is pointing at the
previous VMA and which skips logic that brk() does not require.
This mostly eliminates the performance regression reducing it to ~2% which
is in the realm of noise. In addition, the will-it-scale test brk2,
written to be more representative of real-world brk() usage, shows a
modest performance improvement - which gives me confidence that we are not
meaningfully regressing real workloads here.
This series includes a test asserting that the 'just expand' mode works as
expected.
With many thanks to Oliver Sang for helping with performance testing of
candidate patch sets!
We know in advance that do_brk_flags() wants only to perform a VMA
expansion (if the prior VMA is compatible), and that we assume no
mergeable VMA follows it.
These are the semantics of this function prior to the recent rewrite of
the VMA merging logic, however we are now doing more work than necessary -
positioning the VMA iterator at the prior VMA and performing tasks that
are not required.
Add a new field to the vmg struct to permit merge flags and add a new
merge flag VMG_FLAG_JUST_EXPAND which implies this behaviour, and have
do_brk_flags() use this.
This fixes a reported performance regression in a brk() benchmarking suite.
Gregory Price [Thu, 17 Oct 2024 19:03:47 +0000 (15:03 -0400)]
resource,kexec: walk_system_ram_res_rev must retain resource flags
walk_system_ram_res_rev() erroneously discards resource flags when passing
the information to the callback.
This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have
these resources selected during kexec to store kexec buffers if that
memory happens to be at placed above normal system ram.
This leads to undefined behavior after reboot. If the kexec buffer is
never touched, nothing happens. If the kexec buffer is touched, it could
lead to a crash (like below) or undefined behavior.
Tested on a system with CXL memory expanders with driver managed memory,
TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags
were being discarded and as a result the check for
IORESOURCE_SYSRAM_DRIVER_MANAGED passes.
Ryusuke Konishi [Thu, 17 Oct 2024 19:33:10 +0000 (04:33 +0900)]
nilfs2: fix kernel bug due to missing clearing of checked flag
Syzbot reported that in directory operations after nilfs2 detects
filesystem corruption and degrades to read-only,
__block_write_begin_int(), which is called to prepare block writes, may
fail the BUG_ON check for accesses exceeding the folio/page size,
triggering a kernel bug.
This was found to be because the "checked" flag of a page/folio was not
cleared when it was discarded by nilfs2's own routine, which causes the
sanity check of directory entries to be skipped when the directory
page/folio is reloaded. So, fix that.
This was necessary when the use of nilfs2's own page discard routine was
applied to more than just metadata files.
mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id
The acquired memory blocks for reserved may include blocks outside of
memory management. In this case, the nid variable is set to NUMA_NO_NODE
(-1), so an error occurs in node_set(). This adds a check using
numa_valid_node() to numa_clear_kernel_node_hotplug() that skips
node_set() when nid is set to NUMA_NO_NODE.
ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two
reasons for this: first, the parameter value passed is greater than
ocfs2_max_inline_data_with_xattr, second, the start and end parameters of
ocfs2_truncate_inline are "unsigned int".
So, we need to add a sanity check for byte_start and byte_len right before
ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater
than ocfs2_max_inline_data_with_xattr return -EINVAL.
read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0:
inode_get_ctime_nsec include/linux/fs.h:1623 [inline]
inode_get_ctime include/linux/fs.h:1629 [inline]
generic_fillattr+0x1dd/0x2f0 fs/stat.c:62
shmem_getattr+0x17b/0x200 mm/shmem.c:1157
vfs_getattr_nosec fs/stat.c:166 [inline]
vfs_getattr+0x19b/0x1e0 fs/stat.c:207
vfs_statx_path fs/stat.c:251 [inline]
vfs_statx+0x134/0x2f0 fs/stat.c:315
vfs_fstatat+0xec/0x110 fs/stat.c:341
__do_sys_newfstatat fs/stat.c:505 [inline]
__se_sys_newfstatat+0x58/0x260 fs/stat.c:499
__x64_sys_newfstatat+0x55/0x70 fs/stat.c:499
x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
value changed: 0x2755ae53 -> 0x27ee44d3
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
==================================================================
When calling generic_fillattr(), if you don't hold read lock, data-race
will occur in inode member variables, which can cause unexpected
behavior.
Since there is no special protection when shmem_getattr() calls
generic_fillattr(), data-race occurs by functions such as shmem_unlink()
or shmem_mknod(). This can cause unexpected results, so commenting it out
is not enough.
Therefore, when calling generic_fillattr() from shmem_getattr(), it is
appropriate to protect the inode using inode_lock_shared() and
inode_unlock_shared() to prevent data-race.
Jann Horn [Wed, 16 Oct 2024 15:07:53 +0000 (17:07 +0200)]
mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL
vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs have
already been torn down halfway (in a way we can't undo) but are still
present in the maple tree.
At this point, we *must* remove the VMAs from the VMA tree, otherwise we
get UAF.
Because removing VMA tree nodes can require memory allocation, the
existing code has an error path which tries to handle this by reattaching
the VMAs; but that can't be done safely.
A nicer way to fix it would probably be to preallocate enough maple tree
nodes for the removal before the point of no return, or something like
that; but for now, fix it the easy and kinda ugly way, by marking this
allocation __GFP_NOFAIL.
x86/traps: move kmsan check after instrumentation_begin
During x86_64 kernel build with CONFIG_KMSAN, the objtool warns following:
AR built-in.a
AR vmlinux.a
LD vmlinux.o
vmlinux.o: warning: objtool: handle_bug+0x4: call to
kmsan_unpoison_entry_regs() leaves .noinstr.text section
OBJCOPY modules.builtin.modinfo
GEN modules.builtin
MODPOST Module.symvers
CC .vmlinux.export.o
Moving kmsan_unpoison_entry_regs() _after_ instrumentation_begin() fixes
the warning.
There is decode_bug(regs->ip, &imm) is left before KMSAN unpoisoining, but
it has the return condition and if we include it after
instrumentation_begin() it results the warning "return with
instrumentation enabled", hence, I'm concerned that regs will not be KMSAN
unpoisoned if `ud_type == BUG_NONE` is true.
Huang Ying [Tue, 15 Oct 2024 05:15:54 +0000 (13:15 +0800)]
resource: remove dependency on SPARSEMEM from GET_FREE_REGION
We want to use the functions (get_free_mem_region()) configured via
GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION
depends on SPARSEMEM now. This makes resource kunit tests cannot be
built on some architectures lacking SPARSEMEM, or causes config warning
as follows,
WARNING: unmet direct dependencies detected for GET_FREE_REGION
Depends on [n]: SPARSEMEM [=n]
Selected by [y]:
- RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]
When get_free_mem_region() was introduced the only consumers were those
looking to pass the address range to memremap_pages(). That address
range needed to be mindful of the maximum addressable platform physical
address which at the time only SPARSMEM defined via MAX_PHYSMEM_BITS.
Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE,
it was easier to just depend on that definition than invent a general
MAX_PHYSMEM_BITS concept outside of SPARSEMEM.
Turns out that decision was buggy and did not account for KASAN
consumption of physical address space. That problem was resolved
recently with commit ea72ce5da228 ("x86/kaslr: Expose and use the end
of the physical memory address space"), and GET_FREE_REGION dropped its
MAX_PHYSMEM_BITS dependency.
Then commit 99185c10d5d9 ("resource, kunit: add test case for
region_intersects()"), went ahead and fixed up the only remaining
dependency on SPARSEMEM which was usage of the PA_SECTION_SHIFT macro
for setting the default alignment. A PAGE_SIZE fallback is fine in the
SPARSEMEM=n case.
With those build dependencies gone GET_FREE_REGION no longer depends on
SPARSEMEM. So, the patch removes dependency on SPARSEMEM from
GET_FREE_REGION to fix the build issues.
Liam R. Howlett [Wed, 16 Oct 2024 01:34:55 +0000 (21:34 -0400)]
mm/mmap: fix race in mmap_region() with ftruncate()
Avoiding the zeroing of the vma tree in mmap_region() introduced a race
with truncate in the page table walk. To avoid any races, create a hole
in the rmap during the operation by clearing the pagetable entries earlier
under the mmap write lock and (critically) before the new vma is installed
into the vma tree. The result is that the old vma(s) are left in the vma
tree, but free_pgtables() removes them from the rmap and clears the ptes
while holding the necessary locks.
This change extends the fix required for hugetblfs and the call_mmap()
function by moving the cleanup higher in the function and running it
unconditionally.
Matt Fleming [Fri, 11 Oct 2024 12:07:37 +0000 (13:07 +0100)]
mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to
fail even though free pages are available in the highatomic reserves.
GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock()
since it's only run from reclaim.
Given that such allocations will pass the watermarks in
__zone_watermark_unusable_free(), it makes sense to fallback to highatomic
reserves the same way that ALLOC_OOM can.
This fixes order-0 page allocation failures observed on Cloudflare's fleet
when handling network packets:
Lorenzo Stoakes [Tue, 15 Oct 2024 17:56:06 +0000 (18:56 +0100)]
fork: only invoke khugepaged, ksm hooks if no error
There is no reason to invoke these hooks early against an mm that is in an
incomplete state.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
Their placement early in dup_mmap() only appears to have been meaningful
for early error checking, and since functionally it'd require a very small
allocation to fail (in practice 'too small to fail') that'd only occur in
the most dire circumstances, meaning the fork would fail or be OOM'd in
any case.
Since both khugepaged and KSM tracking are there to provide optimisations
to memory performance rather than critical functionality, it doesn't
really matter all that much if, under such dire memory pressure, we fail
to register an mm with these.
As a result, we follow the example of commit d2081b2bf819 ("mm:
khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
void function also.
We only expose the mm to these functions once we are done with them and
only if no error occurred in the fork operation.
Lorenzo Stoakes [Tue, 15 Oct 2024 17:56:05 +0000 (18:56 +0100)]
fork: do not invoke uffd on fork if error occurs
Patch series "fork: do not expose incomplete mm on fork".
During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.
In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.
As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.
We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.
This patch (of 2):
Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.
This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().
This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.
The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().
We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.
Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.
mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present check
pmd_leaf()/pud_leaf() only implies a pmd_present()/pud_present() check on
some architectures. We really should check for
pmd_present()/pud_present() first.
This should explain the report we got on ppc64 (which has
CONFIG_PGTABLE_HAS_HUGE_LEAVES set in the config) that triggered:
VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp)));
Likely we had a PMD migration entry for which pmd_leaf() did not trigger.
We raced with restoring the PMD migration entry, and suddenly saw a
pmd_leaf(). In this case, pte_offset_map_lock() saved us from more
trouble, because it rechecks the PMD value, but we would not have
processed the migration entry -- which is not too bad because the only
user of FW_MIGRATION is KSM for unsharing, and KSM only applies to small
folios.
Further, we shouldn't re-read the PMD/PUD value for our warning, the
primary purpose of the VM_WARN_ON_ONCE() is to find spurious use of
pmd_leaf()/pud_leaf() without CONFIG_PGTABLE_HAS_HUGE_LEAVES.
As a side note, we are currently not implementing FW_MIGRATION support for
PUD migration entries, which likely should exist due to hugetlb. Add a
TODO so this won't fall through the cracks if more FW_MIGRATION users get
added.
Was able to write a quick reproducer and verify that the issue no longer triggers with this fix.
Aleksei Vetrov [Mon, 28 Oct 2024 22:50:30 +0000 (22:50 +0000)]
ASoC: dapm: fix bounds checker error in dapm_widget_list_create
The widgets array in the snd_soc_dapm_widget_list has a __counted_by
attribute attached to it, which points to the num_widgets variable. This
attribute is used in bounds checking, and if it is not set before the
array is filled, then the bounds sanitizer will issue a warning or a
kernel panic if CONFIG_UBSAN_TRAP is set.
This patch sets the size of the widgets list calculated with
list_for_each as the initial value for num_widgets as it is used for
allocating memory for the array. It is updated with the actual number of
added elements after the array is filled.
Benjamin Große [Sun, 20 Oct 2024 17:41:28 +0000 (18:41 +0100)]
usb: add support for new USB device ID 0x17EF:0x3098 for the r8152 driver
This patch adds support for another Lenovo Mini dock 0x17EF:0x3098 to the
r8152 driver. The device has been tested on NixOS, hotplugging and sleep
included.