Linus Torvalds [Fri, 18 Oct 2024 14:01:59 +0000 (07:01 -0700)]
Merge tag 's390-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Fix PCI error recovery by handling error events correctly
- Fix CCA crypto card behavior within protected execution environment
- Two KVM commits which fix virtual vs physical address handling bugs
in KVM pfault handling
- Fix return code handling in pckmo_key2protkey()
- Deactivate sclp console as late as possible so that outstanding
messages appear on the console instead of being dropped on reboot
- Convert newlines to CRLF instead of LFCR for the sclp vt220 driver,
as required by the vt220 specification
- Initialize also psw mask in perf_arch_fetch_caller_regs() to make
sure that user_mode(regs) will return false
- Update defconfigs
* tag 's390-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: Update defconfigs
s390: Initialize psw mask in perf_arch_fetch_caller_regs()
s390/sclp_vt220: Convert newlines to CRLF instead of LFCR
s390/sclp: Deactivate sclp after all its users
s390/pkey_pckmo: Return with success for valid protected key types
KVM: s390: Change virtual to physical address access in diag 0x258 handler
KVM: s390: gaccess: Check if guest address is in memslot
s390/ap: Fix CCA crypto card behavior within protected execution environment
s390/pci: Handle PCI error codes other than 0x3a
Linus Torvalds [Fri, 18 Oct 2024 02:12:38 +0000 (19:12 -0700)]
Merge tag 'x86_bugs_post_ibpb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 IBPB fixes from Borislav Petkov:
"This fixes the IBPB implementation of older AMDs (< gen4) that do not
flush the RSB (Return Address Stack) so you can still do some leaking
when using a "=ibpb" mitigation for Retbleed or SRSO. Fix it by doing
the flushing in software on those generations.
IBPB is not the default setting so this is not likely to affect
anybody in practice"
* tag 'x86_bugs_post_ibpb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Do not use UNTRAIN_RET with IBPB on entry
x86/bugs: Skip RSB fill at VMEXIT
x86/entry: Have entry_ibpb() invalidate return predictions
x86/cpufeatures: Add a IBPB_NO_RET BUG flag
x86/cpufeatures: Define X86_FEATURE_AMD_IBPB_RET
Linus Torvalds [Thu, 17 Oct 2024 23:33:06 +0000 (16:33 -0700)]
Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"28 hotfixes. 13 are cc:stable. 23 are MM.
It is the usual shower of unrelated singletons - please see the
individual changelogs for details"
* tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
maple_tree: add regression test for spanning store bug
maple_tree: correct tree corruption on spanning store
mm/mglru: only clear kswapd_failures if reclaimable
mm/swapfile: skip HugeTLB pages for unuse_vma
selftests: mm: fix the incorrect usage() info of khugepaged
MAINTAINERS: add Jann as memory mapping/VMA reviewer
mm: swap: prevent possible data-race in __try_to_reclaim_swap
mm: khugepaged: fix the incorrect statistics when collapsing large file folios
MAINTAINERS: kasan, kcov: add bugzilla links
mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs
Docs/damon/maintainer-profile: add missing '_' suffixes for external web links
maple_tree: check for MA_STATE_BULK on setting wr_rebalance
mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point
mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()
mm: remove unused stub for can_swapin_thp()
mailmap: add an entry for Andy Chiu
MAINTAINERS: add memory mapping/VMA co-maintainers
fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
...
Linus Torvalds [Thu, 17 Oct 2024 23:24:42 +0000 (16:24 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Two clk driver fixes and a unit test fix:
- Terminate the of_device_id table in the Samsung exynosautov920 clk
driver so that device matching logic doesn't run off the end of the
array into other memory and break matching for any kernel with this
driver loaded
- Properly limit the max clk ID in the Rockchip clk driver
- Use clk kunit helpers in the clk tests so that memory isn't leaked
after the test concludes"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: test: Fix some memory leaks
clk: rockchip: fix finding of maximum clock ID
clk: samsung: Fix out-of-bound access of of_match_node()
Linus Torvalds [Thu, 17 Oct 2024 16:43:36 +0000 (09:43 -0700)]
Merge tag 'arm-fixes-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"Most of the fixes this time are for platform specific drivers,
addressing issues found through build testing on freescale, ep93xx,
starfive, and npcm platforms, as as well as the ffa firmware.
The fixes for the scmi firmware driver address compatibility problems
found on broadcom machines.
There are only two devicetree fixes, addressing incorrect in
configuration on broadcom and marvell machines.
The changes to the Documentation and MAINTAINERS files are for
clarification only"
* tag 'arm-fixes-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
firmware: arm_ffa: Avoid string-fortify warning caused by memcpy()
firmware: arm_scmi: Queue in scmi layer for mailbox implementation
firmware: arm_ffa: Avoid string-fortify warning in export_uuid()
firmware: arm_scmi: Give SMC transport precedence over mailbox
firmware: arm_scmi: Fix the double free in scmi_debugfs_common_setup()
Documentation/process: maintainer-soc: clarify submitting patches
dmaengine: cirrus: check that output may be truncated
dmaengine: cirrus: ERR_CAST() ioremap error
MAINTAINERS: use the canonical soc mailing list address and mark it as L:
ARM: dts: bcm2837-rpi-cm3-io3: Fix HDMI hpd-gpio pin
arm64: dts: marvell: cn9130-sr-som: fix cp0 mdio pin numbers
soc: fsl: cpm1: qmc: Fix unused data compilation warning
soc: fsl: cpm1: qmc: Do not use IS_ERR_VALUE() on error pointers
reset: starfive: jh71x0: Fix accessing the empty member on JH7110 SoC
reset: npcm: convert comma to semicolon
Linus Torvalds [Thu, 17 Oct 2024 16:36:59 +0000 (09:36 -0700)]
Merge tag 'sound-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes, nothing really stands out:
- Usual HD-audio quirks / device-specific fixes
- Kconfig dependency fix for UM
- A series of minor fixes for SoundWire
- Updates of USB-audio LINE6 contact address"
* tag 'sound-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/conexant - Use cached pin control for Node 0x1d on HP EliteOne 1000 G2
ALSA/hda: intel-sdw-acpi: add support for sdw-manager-list property read
ALSA/hda: intel-sdw-acpi: simplify sdw-master-count property read
ALSA/hda: intel-sdw-acpi: fetch fwnode once in sdw_intel_scan_controller()
ALSA/hda: intel-sdw-acpi: cleanup sdw_intel_scan_controller
ALSA: hda/tas2781: Add new quirk for Lenovo, ASUS, Dell projects
ALSA: scarlett2: Add error check after retrieving PEQ filter values
ALSA: hda/cs8409: Fix possible NULL dereference
sound: Make CONFIG_SND depend on INDIRECT_IOMEM instead of UML
ALSA: line6: update contact information
ALSA: usb-audio: Fix NULL pointer deref in snd_usb_power_domain_set()
ALSA: hda/conexant - Fix audio routing for HP EliteOne 1000 G2
ALSA: hda: Sound support for HP Spectre x360 16 inch model 2024
Linus Torvalds [Thu, 17 Oct 2024 16:31:18 +0000 (09:31 -0700)]
Merge tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Current release - new code bugs:
- eth: mlx5: HWS, don't destroy more bwc queue locks than allocated
Previous releases - regressions:
- ipv4: give an IPv4 dev to blackhole_netdev
- udp: compute L4 checksum as usual when not segmenting the skb
- tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().
- eth: mlx5e: don't call cleanup on profile rollback failure
- eth: microchip: vcap api: fix memory leaks in
vcap_api_encode_rule_test()
- eth: enetc: disable Tx BD rings after they are empty
- eth: macb: avoid 20s boot delay by skipping MDIO bus registration
for fixed-link PHY
Previous releases - always broken:
- posix-clock: fix missing timespec64 check in pc_clock_settime()
- genetlink: hold RCU in genlmsg_mcast()
- mptcp: prevent MPC handshake on port-based signal endpoints
- eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame
- eth: stmmac: dwmac-tegra: fix link bring-up sequence
- eth: bcmasp: fix potential memory leak in bcmasp_xmit()
Misc:
- add Andrew Lunn as a co-maintainer of all networking drivers"
* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
net/mlx5e: Don't call cleanup on profile rollback failure
net/mlx5: Unregister notifier on eswitch init failure
net/mlx5: Fix command bitmask initialization
net/mlx5: Check for invalid vector index on EQ creation
net/mlx5: HWS, use lock classes for bwc locks
net/mlx5: HWS, don't destroy more bwc queue locks than allocated
net/mlx5: HWS, fixed double free in error flow of definer layout
net/mlx5: HWS, removed wrong access to a number of rules variable
mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
net: phy: mdio-bcm-unimac: Add BCM6846 support
dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
udp: Compute L4 checksum as usual when not segmenting the skb
genetlink: hold RCU in genlmsg_mcast()
net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
...
Lorenzo Stoakes [Mon, 7 Oct 2024 15:28:33 +0000 (16:28 +0100)]
maple_tree: add regression test for spanning store bug
Add a regression test to assert that, when performing a spanning store
which consumes the entirety of the rightmost right leaf node does not
result in maple tree corruption when doing so.
This achieves this by building a test tree of 3 levels and establishing a
store which ultimately results in a spanned store of this nature.
Lorenzo Stoakes [Mon, 7 Oct 2024 15:28:32 +0000 (16:28 +0100)]
maple_tree: correct tree corruption on spanning store
Patch series "maple_tree: correct tree corruption on spanning store", v3.
There has been a nasty yet subtle maple tree corruption bug that appears
to have been in existence since the inception of the algorithm.
This bug seems far more likely to happen since commit f8d112a4e657
("mm/mmap: avoid zeroing vma tree in mmap_region()"), which is the point
at which reports started to be submitted concerning this bug.
We were made definitely aware of the bug thanks to the kind efforts of
Bert Karwatzki who helped enormously in my being able to track this down
and identify the cause of it.
The bug arises when an attempt is made to perform a spanning store across
two leaf nodes, where the right leaf node is the rightmost child of the
shared parent, AND the store completely consumes the right-mode node.
This results in mas_wr_spanning_store() mitakenly duplicating the new and
existing entries at the maximum pivot within the range, and thus maple
tree corruption.
The fix patch corrects this by detecting this scenario and disallowing the
mistaken duplicate copy.
The fix patch commit message goes into great detail as to how this occurs.
This series also includes a test which reliably reproduces the issue, and
asserts that the fix works correctly.
Bert has kindly tested the fix and confirmed it resolved his issues. Also
Mikhail Gavrilov kindly reported what appears to be precisely the same
bug, which this fix should also resolve.
This patch (of 2):
There has been a subtle bug present in the maple tree implementation from
its inception.
This arises from how stores are performed - when a store occurs, it will
overwrite overlapping ranges and adjust the tree as necessary to
accommodate this.
A range may always ultimately span two leaf nodes. In this instance we
walk the two leaf nodes, determine which elements are not overwritten to
the left and to the right of the start and end of the ranges respectively
and then rebalance the tree to contain these entries and the newly
inserted one.
This kind of store is dubbed a 'spanning store' and is implemented by
mas_wr_spanning_store().
In order to reach this stage, mas_store_gfp() invokes
mas_wr_preallocate(), mas_wr_store_type() and mas_wr_walk() in turn to
walk the tree and update the object (mas) to traverse to the location
where the write should be performed, determining its store type.
When a spanning store is required, this function returns false stopping at
the parent node which contains the target range, and mas_wr_store_type()
marks the mas->store_type as wr_spanning_store to denote this fact.
When we go to perform the store in mas_wr_spanning_store(), we first
determine the elements AFTER the END of the range we wish to store (that
is, to the right of the entry to be inserted) - we do this by walking to
the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
have just determined contains the range over which we intend to write.
We then turn our attention to the entries to the left of the entry we are
inserting, whose state is represented by l_mas, and copy these into a 'big
node', which is a special node which contains enough slots to contain two
leaf node's worth of data.
We then copy the entry we wish to store immediately after this - the copy
and the insertion of the new entry is performed by mas_store_b_node().
After this we copy the elements to the right of the end of the range which
we are inserting, if we have not exceeded the length of the node (i.e.
r_mas.offset <= r_mas.end).
Herein lies the bug - under very specific circumstances, this logic can
break and corrupt the maple tree.
Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
that all ranges expressed in maple tree code are inclusive):
1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
determines that this is a spanning store across nodes B and C. The mas
state is set such that the current node from which we traverse further
is node A.
2. In mas_wr_spanning_store() we try to find elements to the right of pivot
0xffff by searching for an index of 0x10000:
- mas_wr_walk_index() invokes mas_wr_walk_descend() and
mas_wr_node_walk() in turn.
- mas_wr_node_walk() loops over entries in node A until EITHER it
finds an entry whose pivot equals or exceeds 0x10000 OR it
reaches the final entry.
- Since no entry has a pivot equal to or exceeding 0x10000, pivot
0xffff is selected, leading to node C.
- mas_wr_walk_traverse() resets the mas state to traverse node C. We
loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
in turn once again.
- Again, we reach the last entry in node C, which has a pivot of
0xffff.
3. We then copy the elements to the left of 0x4000 in node B to the big
node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
too.
4. We determine whether we have any entries to copy from the right of the
end of the range via - and with r_mas set up at the entry at pivot
0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
pivot 0xffff.
5. BUG! The maple tree is corrupted with a duplicate entry.
This requires a very specific set of circumstances - we must be spanning
the last element in a leaf node, which is the last element in the parent
node.
spanning store across two leaf nodes with a range that ends at that shared
pivot.
A potential solution to this problem would simply be to reset the walk
each time we traverse r_mas, however given the rarity of this situation it
seems that would be rather inefficient.
Instead, this patch detects if the right hand node is populated, i.e. has
anything we need to copy.
We do so by only copying elements from the right of the entry being
inserted when the maximum value present exceeds the last, rather than
basing this on offset position.
The patch also updates some comments and eliminates the unused bool return
value in mas_wr_walk_index().
The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma
tree in mmap_region()") seems to have made the probability of this event
much more likely, which is the point at which reports started to be
submitted concerning this bug.
The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.
After Bert very generously provided his time and ability to reproduce this
event consistently, I was able to finally identify that the issue
discussed in this commit message was occurring for him.
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:08 +0000 (12:32 +0300)]
net/mlx5e: Don't call cleanup on profile rollback failure
When profile rollback fails in mlx5e_netdev_change_profile, the netdev
profile var is left set to NULL. Avoid a crash when unloading the driver
by not calling profile->cleanup in such a case.
This was encountered while testing, with the original trigger that
the wq rescuer thread creation got interrupted (presumably due to
Ctrl+C-ing modprobe), which gets converted to ENOMEM (-12) by
mlx5e_priv_init, the profile rollback also fails for the same reason
(signal still active) so the profile is left as NULL, leading to a crash
later in _mlx5e_remove.
Shay Drory [Tue, 15 Oct 2024 09:32:06 +0000 (12:32 +0300)]
net/mlx5: Fix command bitmask initialization
Command bitmask have a dedicated bit for MANAGE_PAGES command, this bit
isn't Initialize during command bitmask Initialization, only during
MANAGE_PAGES.
In addition, mlx5_cmd_trigger_completions() is trying to trigger
completion for MANAGE_PAGES command as well.
Hence, in case health error occurred before any MANAGE_PAGES command
have been invoke (for example, during mlx5_enable_hca()),
mlx5_cmd_trigger_completions() will try to trigger completion for
MANAGE_PAGES command, which will result in null-ptr-deref error.[1]
Fix it by Initialize command bitmask correctly.
While at it, re-write the code for better understanding.
Maher Sanalla [Tue, 15 Oct 2024 09:32:05 +0000 (12:32 +0300)]
net/mlx5: Check for invalid vector index on EQ creation
Currently, mlx5 driver does not enforce vector index to be lower than
the maximum number of supported completion vectors when requesting a
new completion EQ. Thus, mlx5_comp_eqn_get() fails when trying to
acquire an IRQ with an improper vector index.
To prevent the case above, enforce that vector index value is
valid and lower than maximum in mlx5_comp_eqn_get() before handling the
request.
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:04 +0000 (12:32 +0300)]
net/mlx5: HWS, use lock classes for bwc locks
The HWS BWC API uses one lock per queue and usually acquires one of
them, except when doing changes which require locking all queues in
order. Naturally, lockdep isn't too happy about acquiring the same lock
class multiple times, so inform it that each queue lock is a different
class to avoid false positives.
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:03 +0000 (12:32 +0300)]
net/mlx5: HWS, don't destroy more bwc queue locks than allocated
hws_send_queues_bwc_locks_destroy destroyed more queue locks than
allocated, leading to memory corruption (occasionally) and warnings such
as DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock)) in __mutex_destroy because
sometimes, the 'mutex' being destroyed was random memory.
The severity of this problem is proportional to the number of queues
configured because the code overreaches beyond the end of the
bwc_send_queue_locks array by 2x its length.
Fix that by using the correct number of bwc queues.
net/mlx5: HWS, removed wrong access to a number of rules variable
Removed wrong access to the num_of_rules field of the matcher.
This is a usual u32 variable, but the access was as if it was atomic.
This fixes the following CI warnings:
mlx5hws_bwc.c:708:17: warning: large atomic operation may incur significant performance penalty;
the access size (4 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment]
mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
Syzkaller reported this splat:
==================================================================
BUG: KASAN: slab-use-after-free in mptcp_pm_nl_rm_addr_or_subflow+0xb44/0xcc0 net/mptcp/pm_netlink.c:881
Read of size 4 at addr ffff8880569ac858 by task syz.1.2799/14662
The buggy address belongs to the object at ffff8880569ac800
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 88 bytes inside of
freed 512-byte region [ffff8880569ac800, ffff8880569aca00)
That's because 'subflow' is used just after 'mptcp_close_ssk(subflow)',
which will initiate the release of its memory. Even if it is very likely
the release and the re-utilisation will be done later on, it is of
course better to avoid any issues and read the content of 'subflow'
before closing it.
Felix Fietkau [Tue, 15 Oct 2024 08:17:55 +0000 (10:17 +0200)]
net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
The loop responsible for allocating up to MTK_FQ_DMA_LENGTH buffers must
only touch as many descriptors, otherwise it ends up corrupting unrelated
memory. Fix the loop iteration count accordingly.
Daniel Borkmann [Mon, 14 Oct 2024 19:03:11 +0000 (21:03 +0200)]
vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
Andrew and Nikolay reported connectivity issues with Cilium's service
load-balancing in case of vmxnet3.
If a BPF program for native XDP adds an encapsulation header such as
IPIP and transmits the packet out the same interface, then in case
of vmxnet3 a corrupted packet is being sent and subsequently dropped
on the path.
vmxnet3_xdp_xmit_frame() which is called e.g. via vmxnet3_run_xdp()
through vmxnet3_xdp_xmit_back() calculates an incorrect DMA address:
The above assumes a fixed offset (VMXNET3_XDP_HEADROOM), but the XDP
BPF program could have moved xdp->data. While the passed buf_size is
correct (xdpf->len), the dma_addr needs to have a dynamic offset which
can be calculated as xdpf->data - (void *)xdpf, that is, xdp->data -
xdp->data_hard_start.
Wei Xu [Mon, 14 Oct 2024 22:12:11 +0000 (22:12 +0000)]
mm/mglru: only clear kswapd_failures if reclaimable
lru_gen_shrink_node() unconditionally clears kswapd_failures, which can
prevent kswapd from sleeping and cause 100% kswapd cpu usage even when
kswapd repeatedly fails to make progress in reclaim.
Only clear kswap_failures in lru_gen_shrink_node() if reclaim makes some
progress, similar to shrink_node().
I happened to run into this problem in one of my tests recently. It
requires a combination of several conditions: The allocator needs to
allocate a right amount of pages such that it can wake up kswapd
without itself being OOM killed; there is no memory for kswapd to
reclaim (My test disables swap and cleans page cache first); no other
process frees enough memory at the same time.
Liu Shixin [Tue, 15 Oct 2024 01:45:21 +0000 (09:45 +0800)]
mm/swapfile: skip HugeTLB pages for unuse_vma
I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The
problem can be reproduced by the following steps:
1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory.
2. Swapout the above anonymous memory.
3. run swapoff and we will get a bad pud error in kernel message:
We can tell that pud_clear_bad is called by pud_none_or_clear_bad in
unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never
be freed because we lost it from page table. We can skip HugeTLB pages
for unuse_vma to fix it.
Jann Horn [Mon, 14 Oct 2024 20:50:57 +0000 (22:50 +0200)]
MAINTAINERS: add Jann as memory mapping/VMA reviewer
Add myself as a reviewer for memory mapping / VMA code. I will probably
only reply to patches sporadically, but hopefully this will help me keep
up with changes that look interesting security-wise.
Jeongjun Park [Mon, 7 Oct 2024 07:06:23 +0000 (16:06 +0900)]
mm: swap: prevent possible data-race in __try_to_reclaim_swap
A report [1] was uploaded from syzbot.
In the previous commit 862590ac3708 ("mm: swap: allow cache reclaim to
skip slot cache"), the __try_to_reclaim_swap() function reads offset and
folio->entry from folio without folio_lock protection.
In the currently reported KCSAN log, it is assumed that the actual
data-race will not occur because the calltrace that does WRITE already
obtains the folio_lock and then writes.
However, the existing __try_to_reclaim_swap() function was already
implemented to perform reads under folio_lock protection [1], and there is
a risk of a data-race occurring through a function other than the one
shown in the KCSAN log.
Therefore, I think it is appropriate to change
read operations for folio to be performed under folio_lock.
[1]
==================================================================
BUG: KCSAN: data-race in __delete_from_swap_cache / __try_to_reclaim_swap
Baolin Wang [Mon, 14 Oct 2024 10:24:44 +0000 (18:24 +0800)]
mm: khugepaged: fix the incorrect statistics when collapsing large file folios
Khugepaged already supports collapsing file large folios (including shmem
mTHP) by commit 7de856ffd007 ("mm: khugepaged: support shmem mTHP
collapse"), and the control parameters in khugepaged:
'khugepaged_max_ptes_swap' and 'khugepaged_max_ptes_none', still compare
based on PTE granularity to determine whether a file collapse is needed.
However, the statistics for 'present' and 'swap' in
hpage_collapse_scan_file() do not take into account the large folios,
which may lead to incorrect judgments regarding the
khugepaged_max_ptes_swap/none parameters, resulting in unnecessary file
collapses.
To fix this issue, take into account the large folios' statistics for
'present' and 'swap' variables in the hpage_collapse_scan_file().
mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.
This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
starting the VM.
For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.
Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
without X86_FEATURE_PSE.
In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants. For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.
Kefeng Wang [Fri, 11 Oct 2024 10:24:44 +0000 (12:24 +0200)]
mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
Patch series "mm: don't install PMD mappings when THPs are disabled by the
hw/process/vma".
During testing, it was found that we can get PMD mappings in processes
where THP (and more precisely, PMD mappings) are supposed to be disabled.
While it works as expected for anon+shmem, the pagecache is the
problematic bit.
For s390 KVM this currently means that a VM backed by a file located on
filesystem with large folio support can crash when KVM tries accessing the
problematic page, because the readahead logic might decide to use a
PMD-sized THP and faulting it into the page tables will install a PMD
mapping, something that s390 KVM cannot tolerate.
This might also be a problem with HW that does not support PMD mappings,
but I did not try reproducing it.
Fix it by respecting the ways to disable THPs when deciding whether we can
install a PMD mapping. khugepaged should already be taking care of not
collapsing if THPs are effectively disabled for the hw/process/vma.
This patch (of 2):
Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by
shmem_allowable_huge_orders() and __thp_vma_allowable_orders().
DAMON GitHub repos have moved from awslabs GitHub org to damonitor org[1].
Following the change, URLs on documents are also updated[2]. However,
commit 2e9b3d6e2e59 ("Docs/damon/maintainer-profile: add links in place"),
which was added just after the update, was using the deprecated GitHub
URLs. Update those to use damonitor GitHub URLs instead.
Links to external web pages on DAMON's maintainer-profile.rst are missing
'_' suffixes. As a result, rendered document is having only verbose URLs
that cannot be clicked. Fix those.
Also, update the link texts for git trees to contain the names of the
trees, for better readability and avoiding below Sphinx warning.
Sidhartha Kumar [Fri, 11 Oct 2024 21:44:50 +0000 (17:44 -0400)]
maple_tree: check for MA_STATE_BULK on setting wr_rebalance
It is possible for a bulk operation (MA_STATE_BULK is set) to enter the
new_end < mt_min_slots[type] case and set wr_rebalance as a store type.
This is incorrect as bulk stores do not rebalance per write, but rather
after the all of the writes are done through the mas_bulk_rebalance()
path. Therefore, add a check to make sure MA_STATE_BULK is not set before
we return wr_rebalance as the store type.
Also add a test to make sure wr_rebalance is never the store type when
doing bulk operations via mas_expected_entries()
This is a hotfix for this rc however it has no userspace effects as there
are no users of the bulk insertion mode.
Jinjie Ruan [Thu, 10 Oct 2024 12:53:23 +0000 (20:53 +0800)]
mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()
The sysfs_target->regions allocated in damon_sysfs_regions_alloc() is not
freed in damon_sysfs_test_add_targets(), which cause the following memory
leak, free it to fix it.
Add myself and Liam as co-maintainers of the memory mapping and VMA code
alongside Andrew as we are heavily involved in its implementation and
maintenance.
Brahmajit Das [Sat, 5 Oct 2024 06:37:00 +0000 (12:07 +0530)]
fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
show show_smap_vma_flags() has been a using misspelled initializer in
mnemonics[] - it needed to initialize 2 element array of char and it used
NUL-padded 2 character string literals (i.e. 3-element initializer).
This has been spotted by gcc-15[*]; prior to that gcc quietly dropped the
3rd eleemnt of initializers. To fix this we are increasing the size of
mnemonics[] (from mnemonics[BITS_PER_LONG][2] to
mnemonics[BITS_PER_LONG][3]) to accomodate the NUL-padded string literals.
This also helps us in simplyfying the logic for printing of the flags as
instead of printing each character from the mnemonics[], we can just print
the mnemonics[] using seq_printf.
[*]: fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
917 | [0 ... (BITS_PER_LONG-1)] = "??",
| ^~~~
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
...
Stephen pointed out:
: The C standard explicitly allows for a string initializer to be too long
: due to the NUL byte at the end ... so this warning may be overzealous.
In mremap(), move_page_tables() looks at the type of the PMD entry and the
specified address range to figure out by which method the next chunk of
page table entries should be moved.
At that point, the mmap_lock is held in write mode, but no rmap locks are
held yet. For PMD entries that point to page tables and are fully covered
by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
which first takes rmap locks, then does move_normal_pmd().
move_normal_pmd() takes the necessary page table locks at source and
destination, then moves an entire page table from the source to the
destination.
The problem is: The rmap locks, which protect against concurrent page
table removal by retract_page_tables() in the THP code, are only taken
after the PMD entry has been read and it has been decided how to move it.
So we can race as follows (with two processes that have mappings of the
same tmpfs file that is stored on a tmpfs mount with huge=advise); note
that process A accesses page tables through the MM while process B does it
through the file rmap:
process A process B
========= =========
mremap
mremap_to
move_vma
move_page_tables
get_old_pmd
alloc_new_pmd
*** PREEMPT ***
madvise(MADV_COLLAPSE)
do_madvise
madvise_walk_vmas
madvise_vma_behavior
madvise_collapse
hpage_collapse_scan_file
collapse_file
retract_page_tables
i_mmap_lock_read(mapping)
pmdp_collapse_flush
i_mmap_unlock_read(mapping)
move_pgt_entry(NORMAL_PMD, ...)
take_rmap_locks
move_normal_pmd
drop_rmap_locks
When this happens, move_normal_pmd() can end up creating bogus PMD entries
in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect
depends on arch-specific and machine-specific details; on x86, you can end
up with physical page 0 mapped as a page table, which is likely
exploitable for user->kernel privilege escalation.
Fix the race by letting process B recheck that the PMD still points to a
page table after the rmap locks have been taken. Otherwise, we bail and
let the caller fall back to the PTE-level copying path, which will then
bail immediately at the pmd_none() check.
Bug reachability: Reaching this bug requires that you can create
shmem/file THP mappings - anonymous THP uses different code that doesn't
zap stuff under rmap locks. File THP is gated on an experimental config
flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need
shmem THP to hit this bug. As far as I know, getting shmem THP normally
requires that you can mount your own tmpfs with the right mount flags,
which would require creating your own user+mount namespace; though I don't
know if some distros maybe enable shmem THP by default or something like
that.
Bug impact: This issue can likely be used for user->kernel privilege
escalation when it is reachable.
The factors that increase the right side of the equation:
- PAGE_SIZE > 4KiB increases KMALLOC_SHIFT_HIGH
- For the local_lock_t in kmem_cache_cpu:
- PREEMPT_RT adds an actual lock.
- LOCKDEP increases the size of the lock.
- LOCK_STAT adds additional bytes plus padding to the lockdep
structure.
The net difference with and without PREEMPT_RT is 88 bytes for the
lock_lock_t, 96 bytes for kmem_cache_cpu due to additional padding. This
is enough to exceed the 80KiB limit with 16KiB page size - the 8KiB page
size is fine.
Increase PERCPU_DYNAMIC_SIZE_SHIFT to 13 on configs with PAGE_SIZE larger
than 4KiB and LOCKDEP enabled.
Edward Liaw [Thu, 3 Oct 2024 21:17:10 +0000 (21:17 +0000)]
selftests/mm: replace atomic_bool with pthread_barrier_t
Patch series "selftests/mm: fix deadlock after pthread_create".
On Android arm, pthread_create followed by a fork caused a deadlock in the
case where the fork required work to be completed by the created thread.
Update the synchronization primitive to use pthread_barrier instead of
atomic_bool.
Apply the same fix to the wp-fork-with-event test.
This patch (of 2):
Swap synchronization primitive with pthread_barrier, so that stdatomic.h
does not need to be included.
The synchronization is needed on Android ARM64; we see a deadlock with
pthread_create when the parent thread races forward before the child has a
chance to start doing work.
Ryusuke Konishi [Fri, 4 Oct 2024 03:35:31 +0000 (12:35 +0900)]
nilfs2: propagate directory read errors from nilfs_find_entry()
Syzbot reported that a task hang occurs in vcs_open() during a fuzzing
test for nilfs2.
The root cause of this problem is that in nilfs_find_entry(), which
searches for directory entries, ignores errors when loading a directory
page/folio via nilfs_get_folio() fails.
If the filesystem images is corrupted, and the i_size of the directory
inode is large, and the directory page/folio is successfully read but
fails the sanity check, for example when it is zero-filled,
nilfs_check_folio() may continue to spit out error messages in bursts.
Fix this issue by propagating the error to the callers when loading a
page/folio fails in nilfs_find_entry().
The current interface of nilfs_find_entry() and its callers is outdated
and cannot propagate error codes such as -EIO and -ENOMEM returned via
nilfs_find_entry(), so fix it together.
Lorenzo Stoakes [Wed, 2 Oct 2024 07:39:32 +0000 (08:39 +0100)]
mm/mmap: correct error handling in mmap_region()
Commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()")
changed how error handling is performed in mmap_region().
The error value defaults to -ENOMEM, but then gets reassigned immediately
to the result of vms_gather_munmap_vmas() if we are performing a MAP_FIXED
mapping over existing VMAs (and thus unmapping them).
This overwrites the error value, potentially clearing it.
After this, we invoke may_expand_vm() and possibly vm_area_alloc(), and
check to see if they failed. If they do so, then we perform error-handling
logic, but importantly, we do NOT update the error code.
This means that, if vms_gather_munmap_vmas() succeeds, but one of these
calls does not, the function will return indicating no error, but rather an
address value of zero, which is entirely incorrect.
Correct this and avoid future confusion by strictly setting error on each
and every occasion we jump to the error handling logic, and set the error
code immediately prior to doing so.
This way we can see at a glance that the error code is always correct.
Many thanks to Vegard Nossum who spotted this issue in discussion around
this problem.
Jinjie Ruan [Wed, 16 Oct 2024 02:26:58 +0000 (10:26 +0800)]
clk: test: Fix some memory leaks
CONFIG_CLK_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y
and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the following memory leak occurs.
If the KUNIT_ASSERT_*() fails, the latter (exit() or testcases)
clk_put() or clk_hw_unregister() will fail to release the clk resource
and cause memory leaks, use new clk_hw_register_kunit()
and clk_hw_get_clk_kunit() to automatically release them.
Fixes: 02cdeace1e1e ("clk: tests: Add tests for single parent mux") Fixes: 2e9cad1abc71 ("clk: tests: Add some tests for orphan with multiple parents") Fixes: 433fb8a611ca ("clk: tests: Add missing test case for ranges") Signed-off-by: Jinjie Ruan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Maxime Ripard <[email protected]> Signed-off-by: Stephen Boyd <[email protected]>
Linus Torvalds [Wed, 16 Oct 2024 20:37:59 +0000 (13:37 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Several miscellaneous fixes. A lot of bnxt_re activity, there will be
more rc patches there coming.
- Many bnxt_re bug fixes - Memory leaks, kasn, NULL pointer deref,
soft lockups, error unwinding and some small functional issues
- Error unwind bug in rdma netlink
- Two issues with incorrect VLAN detection for iWarp
- skb_splice_from_iter() splat in siw
- Give SRP slab caches unique names to resolve the merge window
WARN_ON regression"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/bnxt_re: Fix the GID table length
RDMA/bnxt_re: Fix a bug while setting up Level-2 PBL pages
RDMA/bnxt_re: Change the sequence of updating the CQ toggle value
RDMA/bnxt_re: Fix an error path in bnxt_re_add_device
RDMA/bnxt_re: Avoid CPU lockups due fifo occupancy check loop
RDMA/bnxt_re: Fix a possible NULL pointer dereference
RDMA/bnxt_re: Return more meaningful error
RDMA/bnxt_re: Fix incorrect dereference of srq in async event
RDMA/bnxt_re: Fix out of bound check
RDMA/bnxt_re: Fix the max CQ WQEs for older adapters
RDMA/srpt: Make slab cache names unique
RDMA/irdma: Fix misspelling of "accept*"
RDMA/cxgb4: Fix RDMA_CM_EVENT_UNREACHABLE error for iWARP
RDMA/siw: Add sendpage_ok() check to disable MSG_SPLICE_PAGES
RDMA/core: Fix ENODEV error for iWARP test over vlan
RDMA/nldev: Fix NULL pointer dereferences issue in rdma_nl_notify_event
RDMA/bnxt_re: Fix the max WQEs used in Static WQE mode
RDMA/bnxt_re: Add a check for memory allocation
RDMA/bnxt_re: Fix incorrect AVID type in WQE structure
RDMA/bnxt_re: Fix a possible memory leak
Linus Torvalds [Wed, 16 Oct 2024 16:30:20 +0000 (09:30 -0700)]
Merge tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix: dirty extents tracked in xarray for qgroups must be
adjusted for 32bit platforms
- fix potentially freeing uninitialized name in fscrypt structure
- fix warning about unneeded variable in a send callback
* tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix uninitialized pointer free on read_alloc_one_name() error
btrfs: send: cleanup unneeded return variable in changed_verity()
btrfs: fix uninitialized pointer free in add_inode_ref()
btrfs: use sector numbers as keys for the dirty extents xarray
Linus Torvalds [Wed, 16 Oct 2024 16:15:43 +0000 (09:15 -0700)]
Merge tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix race between session setup and session logoff
- add supplementary group support
* tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: add support for supplementary groups
ksmbd: fix user-after-free from session log off
Heiko Carstens [Thu, 10 Oct 2024 15:52:39 +0000 (17:52 +0200)]
s390: Initialize psw mask in perf_arch_fetch_caller_regs()
Also initialize regs->psw.mask in perf_arch_fetch_caller_regs().
This way user_mode(regs) will return false, like it should.
It looks like all current users initialize regs to zero, so that this
doesn't fix a bug currently. However it is better to not rely on callers
to do this.
s390/sclp_vt220: Convert newlines to CRLF instead of LFCR
According to the VT220 specification the possible character combinations
sent on RETURN are only CR or CRLF [0].
The Return key sends either a CR character (0/13) or a CR
character (0/13) and an LF character (0/10), depending on the
set/reset state of line feed/new line mode (LNM).
The sclp/vt220 driver however uses LFCR. This can confuse tools, for
example the kunit runner.
On reboot the SCLP interface is deactivated through a reboot notifier.
This happens before other components using SCLP have the chance to run
their own reboot notifiers.
Two of those components are the SCLP console and tty drivers which try
to flush the last outstanding messages.
At that point the SCLP interface is already unusable and the messages
are discarded.
Execute sclp_deactivate() as late as possible to avoid this issue.
Holger Dengler [Fri, 11 Oct 2024 08:48:00 +0000 (10:48 +0200)]
s390/pkey_pckmo: Return with success for valid protected key types
The key_to_protkey handler function in module pkey_pckmo should return
with success on all known protected key types, including the new types
introduced by fd197556eef5 ("s390/pkey: Add AES xts and HMAC clear key
token support").
Vasiliy Kovalev [Wed, 16 Oct 2024 08:07:13 +0000 (11:07 +0300)]
ALSA: hda/conexant - Use cached pin control for Node 0x1d on HP EliteOne 1000 G2
The cached version avoids redundant commands to the codec, improving
stability and reducing unnecessary operations. This change ensures
better power management and reliable restoration of pin configurations,
especially after hibernation (S4) and other power transitions.
Linus Torvalds [Wed, 16 Oct 2024 02:47:19 +0000 (19:47 -0700)]
Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- More issues reported in the enable/disable paths on large machines
with many tasks due to scx_tasks_lock being held too long. Break up
the task iterations
- Remove ops.select_cpu() dependency in bypass mode so that a
misbehaving implementation can't live-lock the machine by pushing all
tasks to few CPUs in bypass mode
- Other misc fixes
* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unnecessary cpu_relax()
sched_ext: Don't hold scx_tasks_lock for too long
sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
sched_ext: bypass mode shouldn't depend on ops.select_cpu()
sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
sched_ext: Start schedulers with consistent p->scx.slice values
Revert "sched_ext: Use shorter slice while bypassing"
sched_ext: use correct function name in pick_task_scx() warning message
selftests: sched_ext: Add sched_ext as proper selftest target
Vladimir Oltean [Mon, 14 Oct 2024 15:30:41 +0000 (18:30 +0300)]
net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
Similar to the situation described for sja1105 in commit 1f9fc48fd302
("net: dsa: sja1105: fix reception from VLAN-unaware bridges"), the
vsc73xx driver uses tag_8021q and doesn't need the ds->untag_bridge_pvid
request. In fact, this option breaks packet reception.
The ds->untag_bridge_pvid option strips VLANs from packets received on
VLAN-unaware bridge ports. But those VLANs should already be stripped
by tag_vsc73xx_8021q.c as part of vsc73xx_rcv() - they are not VLANs in
VLAN-unaware mode, but DSA tags. Thus, dsa_software_vlan_untag() tries
to untag a VLAN that doesn't exist, corrupting the packet.
net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
Recent work moving the reporting of Rx software timestamps to the core
[1] highlighted an issue where hardware time stamping was advertised
for the platforms where it is not supported.
Fix this by covering advertising support for hardware timestamps only if
the hardware supports it. Due to the Tx implementation in RAVB software
Tx timestamping is also only considered if the hardware supports
hardware timestamps. This should be addressed in future, but this fix
only reflects what the driver currently implements.
1. Commit 277901ee3a26 ("ravb: Remove setting of RX software timestamp")
Jinjie Ruan [Mon, 14 Oct 2024 12:19:22 +0000 (20:19 +0800)]
net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
Commit a3c1e45156ad ("net: microchip: vcap: Fix use-after-free error in
kunit test") fixed the use-after-free error, but introduced below
memory leaks by removing necessary vcap_free_rule(), add it to fix it.
As pointed out by Florian:
https://lore.kernel.org/linux-devicetree/b542b2e8-115c-4234-a464-e73aa6bece5c@broadcom.com/
The BCM6846 has a few extra registers and cannot reuse the
compatible string from other variants of the Unimac
MDIO block: we need to be able to tell them apart.
====================
The MDIO block in the BCM6846 is not identical to any of the
previous versions, but has extended registers not present in
the other variants. For this reason we need to use a new
compatible especially for this SoC.
Jakub Sitnicki [Fri, 11 Oct 2024 12:17:30 +0000 (14:17 +0200)]
udp: Compute L4 checksum as usual when not segmenting the skb
If:
1) the user requested USO, but
2) there is not enough payload for GSO to kick in, and
3) the egress device doesn't offer checksum offload, then
we want to compute the L4 checksum in software early on.
In the case when we are not taking the GSO path, but it has been requested,
the software checksum fallback in skb_segment doesn't get a chance to
compute the full checksum, if the egress device can't do it. As a result we
end up sending UDP datagrams with only a partial checksum filled in, which
the peer will discard.
Eric Dumazet [Fri, 11 Oct 2024 17:12:17 +0000 (17:12 +0000)]
genetlink: hold RCU in genlmsg_mcast()
While running net selftests with CONFIG_PROVE_RCU_LIST=y I saw
one lockdep splat [1].
genlmsg_mcast() uses for_each_net_rcu(), and must therefore hold RCU.
Instead of letting all callers guard genlmsg_multicast_allns()
with a rcu_read_lock()/rcu_read_unlock() pair, do it in genlmsg_mcast().
This also means the @flags parameter is useless, we need to always use
GFP_ATOMIC.
[1]
[10882.424136] =============================
[10882.424166] WARNING: suspicious RCU usage
[10882.424309] 6.12.0-rc2-virtme #1156 Not tainted
[10882.424400] -----------------------------
[10882.424423] net/netlink/genetlink.c:1940 RCU-list traversed in non-reader section!!
[10882.424469]
other info that might help us debug this:
Peter Rashleigh [Mon, 14 Oct 2024 20:43:42 +0000 (13:43 -0700)]
net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
According to the Marvell datasheet the 88E6361 has two VTU pages
(4k VIDs per page) so the max_vid should be 8191, not 4095.
In the current implementation mv88e6xxx_vtu_walk() gives unexpected
results because of this error. I verified that mv88e6xxx_vtu_walk()
works correctly on the MV88E6361 with this patch in place.
tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
Martin KaFai Lau reported use-after-free [0] in reqsk_timer_handler().
"""
We are seeing a use-after-free from a bpf prog attached to
trace_tcp_retransmit_synack. The program passes the req->sk to the
bpf_sk_storage_get_tracing kernel helper which does check for null
before using it.
"""
The commit 83fccfc3940c ("inet: fix potential deadlock in
reqsk_queue_unlink()") added timer_pending() in reqsk_queue_unlink() not
to call del_timer_sync() from reqsk_timer_handler(), but it introduced a
small race window.
Before the timer is called, expire_timers() calls detach_timer(timer, true)
to clear timer->entry.pprev and marks it as not pending.
If reqsk_queue_unlink() checks timer_pending() just after expire_timers()
calls detach_timer(), TCP will miss del_timer_sync(); the reqsk timer will
continue running and send multiple SYN+ACKs until it expires.
The reported UAF could happen if req->sk is close()d earlier than the timer
expiration, which is 63s by default.
The scenario would be
1. inet_csk_complete_hashdance() calls inet_csk_reqsk_queue_drop(),
but del_timer_sync() is missed
2. reqsk timer is executed and scheduled again
3. req->sk is accept()ed and reqsk_put() decrements rsk_refcnt, but
reqsk timer still has another one, and inet_csk_accept() does not
clear req->sk for non-TFO sockets
4. sk is close()d
5. reqsk timer is executed again, and BPF touches req->sk
Let's not use timer_pending() by passing the caller context to
__inet_csk_reqsk_queue_drop().
Note that reqsk timer is pinned, so the issue does not happen in most
use cases. [1]
[0]
BUG: KFENCE: use-after-free read in bpf_sk_storage_get_tracing+0x2e/0x1b0
Use-after-free read at 0x00000000a891fb3a (in kfence-#1):
bpf_sk_storage_get_tracing+0x2e/0x1b0
bpf_prog_5ea3e95db6da0438_tcp_retransmit_synack+0x1d20/0x1dda
bpf_trace_run2+0x4c/0xc0
tcp_rtx_synack+0xf9/0x100
reqsk_timer_handler+0xda/0x3d0
run_timer_softirq+0x292/0x8a0
irq_exit_rcu+0xf5/0x320
sysvec_apic_timer_interrupt+0x6d/0x80
asm_sysvec_apic_timer_interrupt+0x16/0x20
intel_idle_irq+0x5a/0xa0
cpuidle_enter_state+0x94/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
allocated by task 0 on cpu 9 at 260507.901592s:
sk_prot_alloc+0x35/0x140
sk_clone_lock+0x1f/0x3f0
inet_csk_clone_lock+0x15/0x160
tcp_create_openreq_child+0x1f/0x410
tcp_v6_syn_recv_sock+0x1da/0x700
tcp_check_req+0x1fb/0x510
tcp_v6_rcv+0x98b/0x1420
ipv6_list_rcv+0x2258/0x26e0
napi_complete_done+0x5b1/0x2990
mlx5e_napi_poll+0x2ae/0x8d0
net_rx_action+0x13e/0x590
irq_exit_rcu+0xf5/0x320
common_interrupt+0x80/0x90
asm_common_interrupt+0x22/0x40
cpuidle_enter_state+0xfb/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
freed by task 0 on cpu 9 at 260507.927527s:
rcu_core_si+0x4ff/0xf10
irq_exit_rcu+0xf5/0x320
sysvec_apic_timer_interrupt+0x6d/0x80
asm_sysvec_apic_timer_interrupt+0x16/0x20
cpuidle_enter_state+0xfb/0x273
cpu_startup_entry+0x15e/0x260
start_secondary+0x8a/0x90
secondary_startup_64_no_verify+0xfa/0xfb
Michael Ellerman [Fri, 20 Sep 2024 09:35:20 +0000 (19:35 +1000)]
powerpc/powernv: Free name on error in opal_event_init()
In opal_event_init() if request_irq() fails name is not freed, leading
to a memory leak. The code only runs at boot time, there's no way for a
user to trigger it, so there's no security impact.
Arnd Bergmann [Tue, 15 Oct 2024 20:39:42 +0000 (20:39 +0000)]
Merge tag 'scmi-fixes-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
Arm SCMI fixes for v6.12
Couple of fixes to address the issues found and reported on Broadcom
STB platforms following the recent refactor of all the SCMI transports
as standalone drivers.
One of the issue is that the effective timeout value is much less than
the intended value due to the way mailbox messages are queues in the
mailbox framework. Since we block or serialise the shmem access anyway,
there is no point in utilizing mailbox queues. The issue is fixed with
exclusive lock on the channel when sending the message.
The other issues is actually non-issue for upstream, but the workaround
is just changing the link order of the transport drivers which enables
Broadcom STB platforms to run both upstream and custom downstream kernel
without any device tree changes. So pushing this to help them test upstream
seamlessly as it has no practical or theoretical impact for others.
There is also a fix to address possible double freeing of the name string
in scmi_debugfs_common_cleanup() when devm_add_action_or_reset() fails.
* tag 'scmi-fixes-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Queue in scmi layer for mailbox implementation
firmware: arm_scmi: Give SMC transport precedence over mailbox
firmware: arm_scmi: Fix the double free in scmi_debugfs_common_setup()
Arnd Bergmann [Tue, 15 Oct 2024 20:38:27 +0000 (20:38 +0000)]
Merge tag 'ffa-fixes-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
Arm FF-A fixes for v6.12
Couple of fixes to avoid string-fortify warnings in export_uuid()
and memcpy() from the recently added functions to support
FFA_MSG_SEND_DIRECT_REQ2 and FFA_MSG_SEND_DIRECT_RESP2.
* tag 'ffa-fixes-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Avoid string-fortify warning caused by memcpy()
firmware: arm_ffa: Avoid string-fortify warning in export_uuid()
Arnd Bergmann [Tue, 15 Oct 2024 20:36:53 +0000 (20:36 +0000)]
Merge tag 'reset-fixes-for-v6.12' of git://git.pengutronix.de/pza/linux into arm/fixes
Reset controller fixes for v6.12
Fix a NULL pointer dereference in reset-starfive-jh71x0 and replace two
accidental commas at line endings with semicolons in reset-npcm.
* tag 'reset-fixes-for-v6.12' of git://git.pengutronix.de/pza/linux:
reset: starfive: jh71x0: Fix accessing the empty member on JH7110 SoC
reset: npcm: convert comma to semicolon
Linus Torvalds [Tue, 15 Oct 2024 18:18:44 +0000 (11:18 -0700)]
Merge tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fixes from Steven Rostedt:
- Fix ref counter of buffers assigned at boot up
A tracing instance can be created from the kernel command line. If it
maps to memory, it is considered permanent and should not be deleted,
or bad things can happen. If it is not mapped to memory, then the
user is fine to delete it via rmdir from the instances directory. But
the ref counts assumed 0 was free to remove and greater than zero was
not. But this was not the case. When an instance is created, it
should have the reference of 1, and if it should not be removed, it
must be greater than 1. The boot up code set normal instances with a
ref count of 0, which could get removed if something accessed it and
then released it. And memory mapped instances had a ref count of 1
which meant it could be deleted, and bad things happen. Keep normal
instances ref count as 1, and set memory mapped instances ref count
to 2.
- Protect sub buffer size (order) updates from other modifications
When a ring buffer is changing the size of its sub-buffers, no other
operations should be performed on the ring buffer. That includes
reading it. But the locking only grabbed the buffer->mutex that keeps
some operations from touching the ring buffer. It also must hold the
cpu_buffer->reader_lock as well when updates happen as other paths
use that to do some operations on the ring buffer.
* tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Fix reader locking when changing the sub buffer order
ring-buffer: Fix refcount setting of boot mapped buffers
Linus Torvalds [Tue, 15 Oct 2024 18:06:45 +0000 (11:06 -0700)]
Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- New metadata version inode_has_child_snapshots
This fixes bugs with handling of unlinked inodes + snapshots, in
particular when an inode is reattached after taking a snapshot;
deleted inodes now get correctly cleaned up across snapshots.
- Disk accounting rewrite fixes
- validation fixes for when a device has been removed
- fix journal replay failing with "journal_reclaim_would_deadlock"
- Some more small fixes for erasure coding + device removal
- Assorted small syzbot fixes
* tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits)
bcachefs: Fix sysfs warning in fstests generic/730,731
bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev
bcachefs: Fix kasan splat in new_stripe_alloc_buckets()
bcachefs: Add missing validation for bch_stripe.csum_granularity_bits
bcachefs: Fix missing bounds checks in bch2_alloc_read()
bcachefs: fix uaf in bch2_dio_write_done()
bcachefs: Improve check_snapshot_exists()
bcachefs: Fix bkey_nocow_lock()
bcachefs: Fix accounting replay flags
bcachefs: Fix invalid shift in member_to_text()
bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID
bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
bcachefs: Check if stuck in journal_res_get()
closures: Add closure_wait_event_timeout()
bcachefs: Fix state lock involved deadlock
bcachefs: Fix NULL pointer dereference in bch2_opt_to_text
bcachefs: Release transaction before wake up
bcachefs: add check for btree id against max in try read node
bcachefs: Disk accounting device validation fixes
bcachefs: bch2_inode_or_descendents_is_open()
...
====================
mptcp: prevent MPC handshake on port-based signal endpoints
MPTCP connection requests toward a listening socket created by the
in-kernel PM for a port based signal endpoint will never be accepted,
they need to be explicitly rejected.
- Patch 1: Explicitly reject such requests. A fix for >= v5.12.
- Patch 2: Cover this case in the MPTCP selftests to avoid regressions.
Paolo Abeni [Mon, 14 Oct 2024 14:06:01 +0000 (16:06 +0200)]
selftests: mptcp: join: test for prohibited MPC to port-based endp
Explicitly verify that MPC connection attempts towards a port-based
signal endpoint fail with a reset.
Note that this new test is a bit different from the other ones, not
using 'run_tests'. It is then needed to add the capture capability, and
the picking the right port which have been extracted into three new
helpers. The info about the capture can also be printed from a single
point, which simplifies the exit paths in do_transfer().
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
As noted by Cong Wang, the splat is false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port based signal endpoint.
Such connection will be never accepted; many of them can make the
listener queue full and preventing the creation of MPJ subflow via
such listener - its intended role.
Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.
Oleksij Rempel [Sun, 13 Oct 2024 05:29:16 +0000 (07:29 +0200)]
net: macb: Avoid 20s boot delay by skipping MDIO bus registration for fixed-link PHY
A boot delay was introduced by commit 79540d133ed6 ("net: macb: Fix
handling of fixed-link node"). This delay was caused by the call to
`mdiobus_register()` in cases where a fixed-link PHY was present. The
MDIO bus registration triggered unnecessary PHY address scans, leading
to a 20-second delay due to attempts to detect Clause 45 (C45)
compatible PHYs, despite no MDIO bus being attached.
The commit 79540d133ed6 ("net: macb: Fix handling of fixed-link node")
was originally introduced to fix a regression caused by commit 7897b071ac3b4 ("net: macb: convert to phylink"), which caused the driver
to misinterpret fixed-link nodes as PHY nodes. This resulted in warnings
like:
mdio_bus f0028000.ethernet-ffffffff: fixed-link has invalid PHY address
mdio_bus f0028000.ethernet-ffffffff: scan phy fixed-link at address 0
...
mdio_bus f0028000.ethernet-ffffffff: scan phy fixed-link at address 31
This patch reworks the logic to avoid registering and allocation of the
MDIO bus when:
- The device tree contains a fixed-link node.
- There is no "mdio" child node in the device tree.
If a child node named "mdio" exists, the MDIO bus will be registered to
support PHYs attached to the MACB's MDIO bus. Otherwise, with only a
fixed-link, the MDIO bus is skipped.
Tested on a sama5d35 based system with a ksz8863 switch attached to
macb0.
Sabrina Dubroca [Fri, 11 Oct 2024 15:16:37 +0000 (17:16 +0200)]
macsec: don't increment counters for an unrelated SA
On RX, we shouldn't be incrementing the stats for an arbitrary SA in
case the actual SA hasn't been set up. Those counters are intended to
track packets for their respective AN when the SA isn't currently
configured. Due to the way MACsec is implemented, we don't keep
counters unless the SA is configured, so we can't track those packets,
and those counters will remain at 0.
The RXSC's stats keeps track of those packets without telling us which
AN they belonged to. We could add counters for non-existent SAs, and
then find a way to integrate them in the dump to userspace, but I
don't think it's worth the effort.
Petr Pavlu [Tue, 15 Oct 2024 11:24:29 +0000 (13:24 +0200)]
ring-buffer: Fix reader locking when changing the sub buffer order
The function ring_buffer_subbuf_order_set() updates each
ring_buffer_per_cpu and installs new sub buffers that match the requested
page order. This operation may be invoked concurrently with readers that
rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or
the ring_buffer_per_cpu.pages and reader_page pointers. However, no
exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying
the mentioned data while a reader also operates on them can then result in
incorrect memory access and various crashes.
Fix the problem by taking the reader_lock when updating a specific
ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
Gavin Shan [Mon, 14 Oct 2024 00:47:24 +0000 (10:47 +1000)]
firmware: arm_ffa: Avoid string-fortify warning caused by memcpy()
Copying from a 144 byte structure arm_smccc_1_2_regs at an offset of 32
into an 112 byte struct ffa_send_direct_data2 causes a compile-time warning:
| In file included from drivers/firmware/arm_ffa/driver.c:25:
| In function 'fortify_memcpy_chk',
| inlined from 'ffa_msg_send_direct_req2' at drivers/firmware/arm_ffa/driver.c:504:3:
| include/linux/fortify-string.h:580:4: warning: call to '__read_overflow2_field'
| declared with 'warning' attribute: detected read beyond size of field
| (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
| __read_overflow2_field(q_size_field, size);
Fix it by not passing a plain buffer to memcpy() to avoid the overflow
warning.
Colin Ian King [Thu, 10 Oct 2024 15:45:19 +0000 (16:45 +0100)]
octeontx2-af: Fix potential integer overflows on integer shifts
The left shift int 32 bit integer constants 1 is evaluated using 32 bit
arithmetic and then assigned to a 64 bit unsigned integer. In the case
where the shift is 32 or more this can lead to an overflow. Avoid this
by shifting using the BIT_ULL macro instead.
Will Deacon [Mon, 14 Oct 2024 16:11:00 +0000 (17:11 +0100)]
kasan: Disable Software Tag-Based KASAN with GCC
Syzbot reports a KASAN failure early during boot on arm64 when building
with GCC 12.2.0 and using the Software Tag-Based KASAN mode:
| BUG: KASAN: invalid-access in smp_build_mpidr_hash arch/arm64/kernel/setup.c:133 [inline]
| BUG: KASAN: invalid-access in setup_arch+0x984/0xd60 arch/arm64/kernel/setup.c:356
| Write of size 4 at addr 03ff800086867e00 by task swapper/0
| Pointer tag: [03], memory tag: [fe]
Initial triage indicates that the report is a false positive and a
thorough investigation of the crash by Mark Rutland revealed the root
cause to be a bug in GCC:
> When GCC is passed `-fsanitize=hwaddress` or
> `-fsanitize=kernel-hwaddress` it ignores
> `__attribute__((no_sanitize_address))`, and instruments functions
> we require are not instrumented.
>
> [...]
>
> All versions [of GCC] I tried were broken, from 11.3.0 to 14.2.0
> inclusive.
>
> I think we have to disable KASAN_SW_TAGS with GCC until this is
> fixed
Disable Software Tag-Based KASAN when building with GCC by making
CC_HAS_KASAN_SW_TAGS depend on !CC_IS_GCC.
Paritosh Dixit [Thu, 10 Oct 2024 14:29:08 +0000 (10:29 -0400)]
net: stmmac: dwmac-tegra: Fix link bring-up sequence
The Tegra MGBE driver sometimes fails to initialize, reporting the
following error, and as a result, it is unable to acquire an IP
address with DHCP:
tegra-mgbe 6800000.ethernet: timeout waiting for link to become ready
As per the recommendation from the Tegra hardware design team, fix this
issue by:
- clearing the PHY_RDY bit before setting the CDR_RESET bit and then
setting PHY_RDY bit before clearing CDR_RESET bit. This ensures valid
data is present at UPHY RX inputs before starting the CDR lock.
- adding the required delays when bringing up the UPHY lane. Note we
need to use delays here because there is no alternative, such as
polling, for these cases. Using the usleep_range() instead of ndelay()
as sleeping is preferred over busy wait loop.
Without this change we would see link failures on boot sometimes as
often as 1 in 5 boots. With this fix we have not observed any failures
in over 1000 boots.
Lu Baolu [Mon, 14 Oct 2024 01:37:44 +0000 (09:37 +0800)]
iommu/vt-d: Fix incorrect pci_for_each_dma_alias() for non-PCI devices
Previously, the domain_context_clear() function incorrectly called
pci_for_each_dma_alias() to set up context entries for non-PCI devices.
This could lead to kernel hangs or other unexpected behavior.
Add a check to only call pci_for_each_dma_alias() for PCI devices. For
non-PCI devices, domain_context_clear_one() is called directly.
ALSA/hda: intel-sdw-acpi: add support for sdw-manager-list property read
The DisCo for SoundWire 2.0 spec adds support for a new
sdw-manager-list property. Add it in backwards-compatible mode with
'sdw-master-count', which assumed that all links between 0..count-1
exist.