net: pse-pd: Do not return EOPNOSUPP if config is null
For a PSE supporting both c33 and PoDL, setting config for one type of PoE
leaves the other type's config null. Currently, this case returns
EOPNOTSUPP, which is incorrect. Instead, we should do nothing if the
configuration is empty.
1) Fix esp_output_tail_tcp() on unsupported ESPINTCP.
From Hagar Hemdan.
2) Fix two bugs in the recently introduced SA direction separation.
From Antony Antony.
3) Fix unregister netdevice hang on hardware offload. We had to add another
list where skbs linked to that are unlinked from the lists (deleted)
but not yet freed.
4) Fix netdev reference count imbalance in xfrm_state_find.
From Jianbo Liu.
5) Call xfrm_dev_policy_delete when killingi them on offloaded policies.
Jianbo Liu.
* tag 'ipsec-2024-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: call xfrm_dev_policy_delete when kill policy
xfrm: fix netdev reference count imbalance
xfrm: Export symbol xfrm_dev_state_delete.
xfrm: Fix unregister netdevice hang on hardware offload.
xfrm: Log input direction mismatch error in one place
xfrm: Fix input error path memory access
net: esp: cleanup esp_output_tail_tcp() in case of unsupported ESPINTCP
====================
Merge tag 'net-6.10-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking fixes from Jakub Kicinski:
"A quick follow up to yesterday's pull. We got a regressions report for
the bnxt patch as soon as it got to your tree. The ethtool fix is also
good to have, although it's an older regression.
Current release - regressions:
- eth: bnxt_en: fix crash in bnxt_get_max_rss_ctx_ring() on older HW
when user tries to decrease the ring count
Previous releases - regressions:
- ethtool: fix RSS setting, accept "no change" setting if the driver
doesn't support the new features
- eth: i40e: remove needless retries of NVM update, don't wait 20min
when we know the firmware update won't succeed"
* tag 'net-6.10-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
bnxt_en: Fix crash in bnxt_get_max_rss_ctx_ring()
octeontx2-af: fix issue with IPv4 match for RSS
octeontx2-af: fix issue with IPv6 ext match for RSS
octeontx2-af: fix detection of IP layer
octeontx2-af: fix a issue with cpt_lf_alloc mailbox
octeontx2-af: replace cpt slot with lf id on reg write
i40e: fix: remove needless retries of NVM update
net: ethtool: Fix RSS setting
bp->rss_ctx_list is not initialized if the chip or firmware does not
support multiple RSS contexts. Fix it by adding a check in
bnxt_get_max_rss_ctx_ring() before proceeding to reference
bp->rss_ctx_list.
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
Merge tag 'pmdomain-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain fix from Ulf Hansson:
- qcom: Skip retention level for rpmhpd's
* tag 'pmdomain-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: qcom: rpmhpd: Skip retention level for Power Domains
Merge tag 'mmc-v6.10-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC host fixes from Ulf Hansson:
- davinci_mmc: Prevent transmitted data size from exceeding sgm's
length
- sdhci: Fix max_seg_size for 64KiB PAGE_SIZE
* tag 'mmc-v6.10-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: davinci_mmc: Prevent transmitted data size from exceeding sgm's length
mmc: sdhci: Fix max_seg_size for 64KiB PAGE_SIZE
Merge tag 'arm-fixes-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Most of these changes are Qualcomm SoC specific and came in just after
I sent out the last set of fixes. This includes two regression fixes
for SoC drivers, a defconfig change to ensure the Lenovo X13s is
usable and 11 changes to DT files to fix regressions and minor
platform specific issues.
Tony and Chunyan step back from their respective maintainership roles
on the omap and unisoc platforms, and Christophe in turn takes over
maintaining some of the Freescale SoC drivers that he has been taking
care of in practice already.
Lastly, there are two trivial fixes for the davinci and sunxi
platforms"
* tag 'arm-fixes-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: Update FREESCALE SOC DRIVERS and QUICC ENGINE LIBRARY
MAINTAINERS: Add more maintainers for omaps
ARM: davinci: Convert comma to semicolon
MAINTAINERS: Move myself from SPRD Maintainer to Reviewer
Revert "dt-bindings: cache: qcom,llcc: correct QDU1000 reg entries"
arm64: dts: qcom: qdu1000: Fix LLCC reg property
arm64: dts: qcom: sm6115: add iommu for sdhc_1
arm64: dts: qcom: x1e80100-crd: fix DAI used for headset recording
arm64: dts: qcom: x1e80100-crd: fix WCD audio codec TX port mapping
soc: qcom: pmic_glink: disable UCSI on sc8280xp
arm64: defconfig: enable Elan i2c-hid driver
arm64: dts: qcom: sc8280xp-crd: use external pull up for touch reset
arm64: dts: qcom: sc8280xp-x13s: fix touchscreen power on
arm64: dts: qcom: x1e80100: Fix PCIe 6a reg offsets and add MHI
arm64: dts: qcom: sa8775p: Correct IRQ number of EL2 non-secure physical timer
arm64: dts: allwinner: Fix PMIC interrupt number
arm64: dts: qcom: sc8280xp: Set status = "reserved" on PSHOLD
arm64: dts: qcom: x1e80100-*: Allocate some CMA buffers
arm64: dts: qcom: sc8180x: Fix LLCC reg property again
Merge tag 'char-misc-6.10-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver fixes from Greg KH:
"Here are some small remaining driver fixes for 6.10-final that have
all been in linux-next for a while and resolve reported issues.
Included in here are:
- mei driver fixes (and a spelling fix at the end just to be clean)
- iio driver fixes for reported problems
- fastrpc bugfixes
- nvmem small fixes"
* tag 'char-misc-6.10-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mei: vsc: Fix spelling error
mei: vsc: Enhance SPI transfer of IVSC ROM
mei: vsc: Utilize the appropriate byte order swap function
mei: vsc: Prevent timeout error with added delay post-firmware download
mei: vsc: Enhance IVSC chipset stability during warm reboot
nvmem: core: limit cell sysfs permissions to main attribute ones
nvmem: core: only change name to fram for current attribute
nvmem: meson-efuse: Fix return value of nvmem callbacks
nvmem: rmem: Fix return value of rmem_read()
misc: microchip: pci1xxxx: Fix return value of nvmem callbacks
hpet: Support 32-bit userspace
misc: fastrpc: Restrict untrusted app to attach to privileged PD
misc: fastrpc: Fix ownership reassignment of remote heap
misc: fastrpc: Fix memory leak in audio daemon attach operation
misc: fastrpc: Avoid updating PD type for capability request
misc: fastrpc: Copy the complete capability structure to user
misc: fastrpc: Fix DSP capabilities request
iio: light: apds9306: Fix error handing
iio: trigger: Fix condition for own trigger
Merge tag 'tty-6.10-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial fixes from Greg KH:
"Here are some small serial driver fixes for 6.10-final. Included in
here are:
- qcom-geni fixes for a much much much discussed issue and everyone
now seems to be agreed that this is the proper way forward to
resolve the reported lockups
- imx serial driver bugfixes
- 8250_omap errata fix
- ma35d1 serial driver bugfix
All of these have been in linux-next for over a week with no reported
issues"
* tag 'tty-6.10-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: qcom-geni: do not kill the machine on fifo underrun
serial: qcom-geni: fix hard lockup on buffer flush
serial: qcom-geni: fix soft lockup on sw flow control and suspend
serial: imx: ensure RTS signal is not left active after shutdown
tty: serial: ma35d1: Add a NULL check for of_node
serial: 8250_omap: Fix Errata i2310 with RX FIFO level check
serial: imx: only set receiver level if it is zero
Merge tag 'sound-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"The majority of changes here are small device-specific fixes for ASoC
SOF / Intel and usual HD-audio quirks.
The only significant high LOC is found in the Cirrus firmware driver,
but all those are for hardening against malicious firmware blobs, and
they look fine for taking as a last minute fix, too"
* tag 'sound-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: Enable Mute LED on HP 250 G7
firmware: cs_dsp: Use strnlen() on name fields in V1 wmfw files
ALSA: hda/realtek: Limit mic boost on VAIO PRO PX
ALSA: hda: cs35l41: Fix swapped l/r audio channels for Lenovo ThinBook 13x Gen4
ASoC: SOF: Intel: hda-pcm: Limit the maximum number of periods by MAX_BDL_ENTRIES
ASoC: rt711-sdw: add missing readable registers
ASoC: SOF: Intel: hda: fix null deref on system suspend entry
ALSA: hda/realtek: add quirk for Clevo V5[46]0TU
firmware: cs_dsp: Prevent buffer overrun when processing V2 alg headers
firmware: cs_dsp: Validate payload length before processing block
firmware: cs_dsp: Return error if block header overflows file
firmware: cs_dsp: Fix overflow checking of wmfw header
Merge tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs fixes from Kent Overstreet:
- revert the SLAB_ACCOUNT patch, something crazy is going on in memcg
and someone forgot to test
- minor fixes: missing rcu_read_lock(), scheduling while atomic (in an
emergency shutdown path)
- two lockdep fixes; these could have gone earlier, but were left to
bake awhile
* tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs:
bcachefs: bch2_gc_btree() should not use btree_root_lock
bcachefs: Set PF_MEMALLOC_NOFS when trans->locked
bcachefs; Use trans_unlock_long() when waiting on allocator
Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"
bcachefs: fix scheduling while atomic in break_cycle()
bcachefs: Fix RCU splat
Satheesh Paul [Wed, 10 Jul 2024 07:51:27 +0000 (13:21 +0530)]
octeontx2-af: fix issue with IPv4 match for RSS
While performing RSS based on IPv4, packets with
IPv4 options are not being considered. Adding changes
to match both plain IPv4 and IPv4 with option header.
Fixes: 41a7aa7b800d ("octeontx2-af: NIX Rx flowkey configuration for RSS") Signed-off-by: Satheesh Paul <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Signed-off-by: David S. Miller <[email protected]>
octeontx2-af: fix issue with IPv6 ext match for RSS
While performing RSS based on IPv6, extension ltype
is not being considered. This will be problem for
fragmented packets or packets with extension header.
Adding changes to match IPv6 ext header along with IPv6
ltype.
Fixes: 41a7aa7b800d ("octeontx2-af: NIX Rx flowkey configuration for RSS") Signed-off-by: Kiran Kumar K <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Michal Mazur [Wed, 10 Jul 2024 07:51:25 +0000 (13:21 +0530)]
octeontx2-af: fix detection of IP layer
Checksum and length checks are not enabled for IPv4 header with
options and IPv6 with extension headers.
To fix this a change in enum npc_kpu_lc_ltype is required which will
allow adjustment of LTYPE_MASK to detect all types of IP headers.
Fixes: 21e6699e5cd6 ("octeontx2-af: Add NPC KPU profile") Signed-off-by: Michal Mazur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
octeontx2-af: fix a issue with cpt_lf_alloc mailbox
This patch fixes CPT_LF_ALLOC mailbox error due to
incompatible mailbox message format. Specifically, it
corrects the `blkaddr` field type from `int` to `u8`.
Fixes: de2854c87c64 ("octeontx2-af: Mailbox changes for 98xx CPT block") Signed-off-by: Srujana Challa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
octeontx2-af: replace cpt slot with lf id on reg write
Replace slot id with global CPT lf id on reg read/write as
CPTPF/VF driver would send slot number instead of global
lf id in the reg offset. And also update the mailbox response
with the global lf's register offset.
Fixes: ae454086e3c2 ("octeontx2-af: add mailbox interface for CPT") Signed-off-by: Nithin Dabilpuram <[email protected]> Signed-off-by: David S. Miller <[email protected]>
MAINTAINERS: Update FREESCALE SOC DRIVERS and QUICC ENGINE LIBRARY
FREESCALE SOC DRIVERS has been orphaned since
commit eaac25d026a1 ("MAINTAINERS: Drop Li Yang as their email address
stopped working")
QUICC ENGINE LIBRARY has Qiang Zhao as maintainer but he hasn't
responded for years and when Li Yang was still maintaining FREESCALE
SOC DRIVERS he was also handling QUICC ENGINE LIBRARY directly.
As a maintainer of LINUX FOR POWERPC EMBEDDED PPC8XX AND PPC83XX, I
also need FREESCALE SOC DRIVERS to be actively maintained, so add
myself as maintainer of FREESCALE SOC DRIVERS and QUICC ENGINE LIBRARY.
Tony Lindgren [Tue, 9 Jul 2024 13:59:29 +0000 (16:59 +0300)]
MAINTAINERS: Add more maintainers for omaps
There are many generations of omaps to maintain, and I will be only active
as a hobbyist with time permitting. Let's add more maintainers to ensure
continued Linux support.
TI is interested in maintaining the active SoCs such as am3, am4 and
dra7. And the hobbyists are interested in maintaining some of the older
devices, mainly based on omap3 and 4 SoCs.
Kevin and Roger have agreed to maintain the active TI parts. Both Kevin
and Roger have been working on the omap variants for a long time, and
have a good understanding of the hardware.
Aaro and Andreas have agreed to maintain the community devices. Both Aaro
and Andreas have long experience on working with the earlier TI SoCs.
While at it, let's also change me to be a reviewer for the omap1, and
drop the link to my old omap web page.
Remove wrong EIO to EGAIN conversion and pass all errors as is.
After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
replace F/W specific error codes with Linux kernel generic, all EIO errors
suddenly started to be converted into EAGAIN which leads nvmupdate to retry
until it timeouts and sometimes fails after more than 20 minutes in the
middle of NVM update, so NVM becomes corrupted.
The bug affects users only at the time when they try to update NVM, and
only F/W versions that generate errors while nvmupdate. For example, X710DA2
with 0x8000ECB7 F/W is affected, but there are probably more...
Command for reproduction is just NVM update:
./nvmupdate64
The problematic code did silently convert EIO into EAGAIN which forced
nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
That's why NVM update takes 20+ minutes to finish with the fail in the end.
When user submits a rxfh set command without touching XFRM_SYM_XOR,
rxfh.input_xfrm is set to RXH_XFRM_NO_CHANGE, which is equal to 0xff.
Testing if (rxfh.input_xfrm & RXH_XFRM_SYM_XOR &&
!ops->cap_rss_sym_xor_supported)
return -EOPNOTSUPP;
Will always be true on devices that don't set cap_rss_sym_xor_supported,
since rxfh.input_xfrm & RXH_XFRM_SYM_XOR is always true, if input_xfrm
was not set, i.e RXH_XFRM_NO_CHANGE=0xff, which will result in failure
of any command that doesn't require any change of XFRM, e.g RSS context
or hash function changes.
To avoid this breakage, test if rxfh.input_xfrm != RXH_XFRM_NO_CHANGE
before testing other conditions. Note that the problem will only trigger
with XFRM-aware userspace, old ethtool CLI would continue to work.
Kent Overstreet [Fri, 5 Jul 2024 01:02:16 +0000 (21:02 -0400)]
bcachefs: bch2_gc_btree() should not use btree_root_lock
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.
Merge tag 'for-6.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mikulas Patocka:
- Fix broken discard for device mapper VDO target
* tag 'for-6.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm vdo: replace max_discard_sectors with max_hw_discard_sectors
Bruce Johnston [Tue, 9 Jul 2024 17:00:36 +0000 (13:00 -0400)]
dm vdo: replace max_discard_sectors with max_hw_discard_sectors
Commit 4f563a64732d ("block: add a max_user_discard_sectors queue
limit") changed block core to set max_discard_sectors to:
min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors)
Commit 825d8bbd2f32 ("dm: always manage discard support in terms
of max_hw_discard_sectors") fixed most dm targetss to deal with
this, by replacing max_discard_sectors with max_hw_discard_sectors.
Unfortunately, dm-vdo did not get fixed at that time.
Fixes: 825d8bbd2f32 ("dm: always manage discard support in terms of max_hw_discard_sectors") Signed-off-by: Bruce Johnston <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
Merge tag 'spi-fix-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"This fixes two regressions that have been bubbling along for a large
part of this release.
One is a revert of the multi mode support for the OMAP SPI controller,
this introduced regressions on a number of systems and while there has
been progress on fixing those we've not got something that works for
everyone yet so let's just drop the change for now.
The other is a series of fixes from David Lechner for his recent
message optimisation work, this interacted badly with spi-mux which
is altogether too clever with recursive use of the bus and creates
situations that hadn't been considered.
There are also a couple of small driver specific fixes, including one
more patch from David for sleep duration calculations in the AXI
driver"
* tag 'spi-fix-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: mux: set ctlr->bits_per_word_mask
spi: add defer_optimize_message controller flag
spi: don't unoptimize message in spi_async()
spi: omap2-mcspi: Revert multi mode support
spi: davinci: Unset POWERDOWN bit when releasing resources
spi: axi-spi-engine: fix sleep calculation
spi: imx: Don't expect DMA for i.MX{25,35,50,51,53} cspi devices
- eth: bnxt: fix crashes when reducing ring count with active RSS
contexts
Previous releases - regressions:
- sched: fix UAF when resolving a clash
- skmsg: skip zero length skb in sk_msg_recvmsg2
- sunrpc: fix kernel free on connection failure in
xs_tcp_setup_socket
- tcp: avoid too many retransmit packets
- tcp: fix incorrect undo caused by DSACK of TLP retransmit
- udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
- eth: ks8851: fix deadlock with the SPI chip variant
- eth: i40e: fix XDP program unloading while removing the driver
Previous releases - always broken:
- bpf:
- fix too early release of tcx_entry
- fail bpf_timer_cancel when callback is being cancelled
- bpf: fix order of args in call to bpf_map_kvcalloc
- netfilter: nf_tables: prefer nft_chain_validate
- ppp: reject claimed-as-LCP but actually malformed packets
* tag 'net-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socket
net/sched: Fix UAF when resolving a clash
net: ks8851: Fix potential TX stall after interface reopen
udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
netfilter: nf_tables: prefer nft_chain_validate
netfilter: nfnetlink_queue: drop bogus WARN_ON
ethtool: netlink: do not return SQI value if link is down
ppp: reject claimed-as-LCP but actually malformed packets
selftests/bpf: Add timer lockup selftest
net: ethernet: mtk-star-emac: set mac_managed_pm when probing
e1000e: fix force smbus during suspend flow
tcp: avoid too many retransmit packets
bpf: Defer work in bpf_timer_cancel_and_free
bpf: Fail bpf_timer_cancel when callback is being cancelled
bpf: fix order of args in call to bpf_map_kvcalloc
net: ethernet: lantiq_etop: fix double free in detach
i40e: Fix XDP program unloading while removing the driver
net: fix rc7's __skb_datagram_iter()
net: ks8851: Fix deadlock with the SPI chip variant
octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability()
...
Merge tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"cachefiles:
- Export an existing and add a new cachefile helper to be used in
filesystems to fix reference count bugs
- Use the newly added fscache_ty_get_volume() helper to get a
reference count on an fscache_volume to handle volumes that are
about to be removed cleanly
- After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
wait for all ongoing cookie lookups to complete and for the object
count to reach zero
- Propagate errors from vfs_getxattr() to avoid an infinite loop in
cachefiles_check_volume_xattr() because it keeps seeing ESTALE
- Don't send new requests when an object is dropped by raising
CACHEFILES_ONDEMAND_OJBSTATE_DROPPING
- Cancel all requests for an object that is about to be dropped
- Wait for the ondemand_boject_worker to finish before dropping a
cachefiles object to prevent use-after-free
- Use cyclic allocation for message ids to better handle id recycling
- Add missing lock protection when iterating through the xarray when
polling
netfs:
- Use standard logging helpers for debug logging
VFS:
- Fix potential use-after-free in file locks during
trace_posix_lock_inode(). The tracepoint could fire while another
task raced it and freed the lock that was requested to be traced
- Only increment the nr_dentry_negative counter for dentries that are
present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
used to detect this case. However, the flag is also raised in
combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
is used. So checking only DCACHE_LRU_LIST will lead to wrong
nr_dentry_negative count. Fix the check to not count dentries that
are on a shrink related list
Misc:
- hfsplus: fix an uninitialized value issue in copy_name
- minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
though we switched it to kmap_local_page() a while ago"
* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
minixfs: Fix minixfs_rename with HIGHMEM
hfsplus: fix uninit-value in copy_name
vfs: don't mod negative dentry count when on shrinker list
filelock: fix potential use-after-free in posix_lock_inode
cachefiles: add missing lock protection when polling
cachefiles: cyclic allocation of msg_id to avoid reuse
cachefiles: wait for ondemand_object_worker to finish when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: stop sending new request when dropping object
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
netfs: Switch debug logging to pr_debug()
Merge tag 'asoc-fix-v6.10-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.10
A few fairly small fixes for ASoC, there's a relatively large set of
hardening changes for the cs_dsp firmware file parsing and a couple of
other small device specific fixes.
Filipe Manana [Mon, 8 Jul 2024 14:42:45 +0000 (15:42 +0100)]
btrfs: avoid races when tracking progress for extent map shrinking
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization but we can have multiple tasks
calling into the shrinker during memory allocations when there's enough
memory pressure for example.
This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.
So add a lock to protect access to these progress fields.
Filipe Manana [Mon, 8 Jul 2024 14:42:44 +0000 (15:42 +0100)]
btrfs: stop extent map shrinker if reschedule is needed
The extent map shrinker can be called in a variety of contexts where we
are under memory pressure, and of them is when a task is trying to
allocate memory. For this reason the shrinker is typically called with a
value of struct shrink_control::nr_to_scan that is much smaller than what
we return in the nr_cached_objects callback of struct super_operations
(fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
not take a long time and cause high latencies. However we can still take
a lot of time in the shrinker even for a limited amount of nr_to_scan:
1) When traversing the red black tree that tracks open inodes in a root,
as for example with millions of open inodes we get a deep tree which
takes time searching for an inode;
2) Iterating over the extent map tree, which is a red black tree, of an
inode when doing the rb_next() calls and when removing an extent map
from the tree, since often that requires rebalancing the red black
tree;
3) When trying to write lock an inode's extent map tree we may wait for a
significant amount of time, because there's either another task about
to do IO and searching for an extent map in the tree or inserting an
extent map in the tree, and we can have thousands or even millions of
extent maps for an inode. Furthermore, there can be concurrent calls
to the shrinker so the lock might be busy simply because there is
already another task shrinking extent maps for the same inode;
4) We often reschedule if we need to, which further increases latency.
So improve on this by stopping the extent map shrinking code whenever we
need to reschedule and make it skip an inode if we can't immediately lock
its extent map tree.
Filipe Manana [Wed, 3 Jul 2024 10:07:26 +0000 (11:07 +0100)]
btrfs: use delayed iput during extent map shrinking
When putting an inode during extent map shrinking we're doing a standard
iput() but that may take a long time in case the inode is dirty and we are
doing the final iput that triggers eviction - the VFS will have to wait
for writeback before calling the btrfs evict callback (see
fs/inode.c:evict()).
This slows down the task running the shrinker which may have been
triggered while updating some tree for example, meaning locks are held
as well as an open transaction handle.
Also if the iput() ends up triggering eviction and the inode has no links
anymore, then we trigger item truncation which requires flushing delayed
items, space reservation to start a transaction and that may trigger the
space reclaim task and wait for it, resulting in deadlocks in case the
reclaim task needs for example to commit a transaction and the shrinker
is being triggered from a path holding a transaction handle.
Syzbot reported such a case with the following stack traces:
======================================================
WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted
------------------------------------------------------
kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Currently, when built with "make W=1", the following warnings are
generated:
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'work' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_firstn'
Update the crush_choose_firstn() kernel-doc to document these
parameters.
Currently, when built with "make W=1", the following warnings are
generated:
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'map' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'work' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'bucket' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'x' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'left' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'numrep' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'type' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'outpos' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'tries' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_tries' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_to_leaf' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out2' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'parent_r' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_indep'
These warnings are generated because the prologue comment for
crush_choose_indep() uses the kernel-doc prefix, but the actual
comment is a very brief description that is not in kernel-doc
format. Since this is a static function there is no need to fully
document the function, so replace the kernel-doc comment prefix with a
standard comment prefix to remove these warnings.
Paolo Abeni [Thu, 11 Jul 2024 10:57:10 +0000 (12:57 +0200)]
Merge tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
Patch #1 fixes a bogus WARN_ON splat in nfnetlink_queue.
Patch #2 fixes a crash due to stack overflow in chain loop detection
by using the existing chain validation routines
Both patches from Florian Westphal.
netfilter pull request 24-07-11
* tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: prefer nft_chain_validate
netfilter: nfnetlink_queue: drop bogus WARN_ON
====================
Paolo Abeni [Thu, 11 Jul 2024 10:38:33 +0000 (12:38 +0200)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-07-11
The following pull-request contains BPF updates for your *net* tree.
We've added 4 non-merge commits during the last 2 day(s) which contain
a total of 4 files changed, 262 insertions(+), 19 deletions(-).
The main changes are:
1) Fixes for a BPF timer lockup and a use-after-free scenario when timers
are used concurrently, from Kumar Kartikeya Dwivedi.
2) Fix the argument order in the call to bpf_map_kvcalloc() which could
otherwise lead to a compilation error, from Mohammad Shehar Yaar Tausif.
bpf-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add timer lockup selftest
bpf: Defer work in bpf_timer_cancel_and_free
bpf: Fail bpf_timer_cancel when callback is being cancelled
bpf: fix order of args in call to bpf_map_kvcalloc
====================
Daniel Borkmann [Thu, 4 Jul 2024 06:41:57 +0000 (08:41 +0200)]
net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socket
When using a BPF program on kernel_connect(), the call can return -EPERM. This
causes xs_tcp_setup_socket() to loop forever, filling up the syslog and causing
the kernel to potentially freeze up.
Neil suggested:
This will propagate -EPERM up into other layers which might not be ready
to handle it. It might be safer to map EPERM to an error we would be more
likely to expect from the network system - such as ECONNREFUSED or ENETDOWN.
ECONNREFUSED as error seems reasonable. For programs setting a different error
can be out of reach (see handling in 4fbac77d2d09) in particular on kernels
which do not have f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err
instead of allow boolean"), thus given that it is better to simply remap for
consistent behavior. UDP does handle EPERM in xs_udp_send_request().
The ct may be dropped if a clash has been resolved but is still passed to
the tcf_ct_flow_table_process_conn function for further usage. This issue
can be fixed by retrieving ct from skb again after confirming conntrack.
Ronald Wahl [Tue, 9 Jul 2024 19:58:45 +0000 (21:58 +0200)]
net: ks8851: Fix potential TX stall after interface reopen
The amount of TX space in the hardware buffer is tracked in the tx_space
variable. The initial value is currently only set during driver probing.
After closing the interface and reopening it the tx_space variable has
the last value it had before close. If it is smaller than the size of
the first send packet after reopeing the interface the queue will be
stopped. The queue is woken up after receiving a TX interrupt but this
will never happen since we did not send anything.
This commit moves the initialization of the tx_space variable to the
ks8851_net_open function right before starting the TX queue. Also query
the value from the hardware instead of using a hard coded value.
Only the SPI chip variant is affected by this issue because only this
driver variant actually depends on the tx_space variable in the xmit
function.
udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
syzkaller triggered the warning [0] in udp_v4_early_demux().
In udp_v[46]_early_demux() and sk_lookup(), we do not touch the refcount
of the looked-up sk and use sock_pfree() as skb->destructor, so we check
SOCK_RCU_FREE to ensure that the sk is safe to access during the RCU grace
period.
Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
into the hash table. Moreover, the SOCK_RCU_FREE check is done too early
in udp_v[46]_early_demux() and sk_lookup(), so there could be a small race
window:
nft_chain_validate already performs loop detection because a cycle will
result in a call stack overflow (ctx->level >= NFT_JUMP_STACK_SIZE).
It also follows maps via ->validate callback in nft_lookup, so there
appears no reason to iterate the maps again.
nf_tables_check_loops() and all its helper functions can be removed.
This improves ruleset load time significantly, from 23s down to 12s.
This also fixes a crash bug. Old loop detection code can result in
unbounded recursion:
BUG: TASK stack guard page was hit at ....
Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN
CPU: 4 PID: 1539 Comm: nft Not tainted 6.10.0-rc5+ #1
[..]
with a suitable ruleset during validation of register stores.
I can't see any actual reason to attempt to check for this from
nft_validate_register_store(), at this point the transaction is still in
progress, so we don't have a full picture of the rule graph.
For nf-next it might make sense to either remove it or make this depend
on table->validate_state in case we could catch an error earlier
(for improved error reporting to userspace).
Fixes: 20a69341f2d0 ("netfilter: nf_tables: add netlink set API") Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
ethtool: netlink: do not return SQI value if link is down
Do not attach SQI value if link is down. "SQI values are only valid if
link-up condition is present" per OpenAlliance specification of
100Base-T1 Interoperability Test suite [1]. The same rule would apply
for other link types.
ppp: reject claimed-as-LCP but actually malformed packets
Since 'ppp_async_encode()' assumes valid LCP packets (with code
from 1 to 7 inclusive), add 'ppp_check_packet()' to ensure that
LCP packet has an actual body beyond PPP_LCP header bytes, and
reject claimed-as-LCP but actually malformed data otherwise.
Add a selftest that tries to trigger a situation where two timer callbacks
are attempting to cancel each other's timer. By running them continuously,
we hit a condition where both run in parallel and cancel each other.
Without the fix in the previous patch, this would cause a lockup as
hrtimer_cancel on either side will wait for forward progress from the
callback.
Ensure that this situation leads to a EDEADLK error.
net: ethernet: mtk-star-emac: set mac_managed_pm when probing
The below commit introduced a warning message when phy state is not in
the states: PHY_HALTED, PHY_READY, and PHY_UP.
commit 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
mtk-star-emac doesn't need mdiobus suspend/resume. To fix the warning
message during resume, indicate the phy resume/suspend is managed by the
mac when probing.
Commit 861e8086029e ("e1000e: move force SMBUS from enable ulp function
to avoid PHY loss issue") resolved a PHY access loss during suspend on
Meteor Lake consumer platforms, but it affected corporate systems
incorrectly.
A better fix, working for both consumer and corporate systems, was
proposed in commit bfd546a552e1 ("e1000e: move force SMBUS near the end
of enable_ulp function"). However, it introduced a regression on older
devices, such as [8086:15B8], [8086:15F9], [8086:15BE].
This patch aims to fix the secondary regression, by limiting the scope of
the changes to Meteor Lake platforms only.
Eric Dumazet [Wed, 10 Jul 2024 00:14:01 +0000 (00:14 +0000)]
tcp: avoid too many retransmit packets
If a TCP socket is using TCP_USER_TIMEOUT, and the other peer
retracted its window to zero, tcp_retransmit_timer() can
retransmit a packet every two jiffies (2 ms for HZ=1000),
for about 4 minutes after TCP_USER_TIMEOUT has 'expired'.
The fix is to make sure tcp_rtx_probe0_timed_out() takes
icsk->icsk_user_timeout into account.
Before blamed commit, the socket would not timeout after
icsk->icsk_user_timeout, but would use standard exponential
backoff for the retransmits.
Also worth noting that before commit e89688e3e978 ("net: tcp:
fix unexcepted socket die when snd_wnd is 0"), the issue
would last 2 minutes instead of 4.
In this case, both callbacks will continue waiting for each other to
finish synchronously, causing a lockup.
The proposed fix adds support for tracking in-flight cancellations
*begun by other timer callbacks* for a particular BPF timer. Whenever
preparing to call hrtimer_cancel, a callback will increment the target
timer's counter, then inspect its in-flight cancellations, and if
non-zero, return -EDEADLK to avoid situations where the target timer's
callback is waiting for its completion.
This does mean that in cases where a callback is fired and cancelled, it
will be unable to cancel any timers in that execution. This can be
alleviated by maintaining the list of waiting callbacks in bpf_hrtimer
and searching through it to avoid interdependencies, but this may
introduce additional delays in bpf_timer_cancel, in addition to
requiring extra state at runtime which may need to be allocated or
reused from bpf_hrtimer storage. Moreover, extra synchronization is
needed to delete these elements from the list of waiting callbacks once
hrtimer_cancel has finished.
The second patch is for a deadlock situation similar to above in
bpf_timer_cancel_and_free, but also a UAF scenario that can occur if
timer is armed before entering it, if hrtimer_running check causes the
hrtimer_cancel call to be skipped.
As seen above, synchronous hrtimer_cancel would lead to deadlock (if
same callback tries to free its timer, or two timers free each other),
therefore we queue work onto the global workqueue to ensure outstanding
timers are cancelled before bpf_hrtimer state is freed.
Further details are in the patches.
====================
Currently, the same case as previous patch (two timer callbacks trying
to cancel each other) can be invoked through bpf_map_update_elem as
well, or more precisely, freeing map elements containing timers. Since
this relies on hrtimer_cancel as well, it is prone to the same deadlock
situation as the previous patch.
It would be sufficient to use hrtimer_try_to_cancel to fix this problem,
as the timer cannot be enqueued after async_cancel_and_free. Once
async_cancel_and_free has been done, the timer must be reinitialized
before it can be armed again. The callback running in parallel trying to
arm the timer will fail, and freeing bpf_hrtimer without waiting is
sufficient (given kfree_rcu), and bpf_timer_cb will return
HRTIMER_NORESTART, preventing the timer from being rearmed again.
However, there exists a UAF scenario where the callback arms the timer
before entering this function, such that if cancellation fails (due to
timer callback invoking this routine, or the target timer callback
running concurrently). In such a case, if the timer expiration is
significantly far in the future, the RCU grace period expiration
happening before it will free the bpf_hrtimer state and along with it
the struct hrtimer, that is enqueued.
Hence, it is clear cancellation needs to occur after
async_cancel_and_free, and yet it cannot be done inline due to deadlock
issues. We thus modify bpf_timer_cancel_and_free to defer work to the
global workqueue, adding a work_struct alongside rcu_head (both used at
_different_ points of time, so can share space).
Update existing code comments to reflect the new state of affairs.
Both bpf_timer_cancel calls would wait for the other callback to finish
executing, introducing a lockup.
Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps
track of all in-flight cancellation requests for a given BPF timer.
Whenever cancelling a BPF timer, we must check if we have outstanding
cancellation requests, and if so, we must fail the operation with an
error (-EDEADLK) since cancellation is synchronous and waits for the
callback to finish executing. This implies that we can enter a deadlock
situation involving two or more timer callbacks executing in parallel
and attempting to cancel one another.
Note that we avoid incrementing the cancelling counter for the target
timer (the one being cancelled) if bpf_timer_cancel is not invoked from
a callback, to avoid spurious errors. The whole point of detecting
cur->cancelling and returning -EDEADLK is to not enter a busy wait loop
(which may or may not lead to a lockup). This does not apply in case the
caller is in a non-callback context, the other side can continue to
cancel as it sees fit without running into errors.
Background on prior attempts:
Earlier versions of this patch used a bool 'cancelling' bit and used the
following pattern under timer->lock to publish cancellation status.
The store outside the critical section could overwrite a parallel
requests t->cancelling assignment to true, to ensure the parallely
executing callback observes its cancellation status.
It would be necessary to clear this cancelling bit once hrtimer_cancel
is done, but lack of serialization introduced races. Another option was
explored where bpf_timer_start would clear the bit when (re)starting the
timer under timer->lock. This would ensure serialized access to the
cancelling bit, but may allow it to be cleared before in-flight
hrtimer_cancel has finished executing, such that lockups can occur
again.
Thus, we choose an atomic counter to keep track of all outstanding
cancellation requests and use it to prevent lockups in case callbacks
attempt to cancel each other while executing in parallel.
bpf: fix order of args in call to bpf_map_kvcalloc
The original function call passed size of smap->bucket before the number of
buckets which raises the error 'calloc-transposed-args' on compilation.
Vlastimil Babka added:
The order of parameters can be traced back all the way to 6ac99e8f23d4
("bpf: Introduce bpf sk local storage") accross several refactorings,
and that's why the commit is used as a Fixes: tag.
In v6.10-rc1, a different commit 2c321f3f70bc ("mm: change inlined
allocation helpers to account at the call site") however exposed the
order of args in a way that gcc-14 has enough visibility to start
warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
then a macro alias for kvcalloc instead of a static inline wrapper.
To sum up the warning happens when the following conditions are all met:
- gcc-14 is used (didn't see it with gcc-13)
- commit 2c321f3f70bc is present
- CONFIG_MEMCG is not enabled in .config
- CONFIG_WERROR turns this from a compiler warning to error
Merge tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes, 15 of which are cc:stable.
No identifiable theme here - all are singleton patches, 19 are for MM"
* tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
filemap: replace pte_offset_map() with pte_offset_map_nolock()
arch/xtensa: always_inline get_current() and current_thread_info()
sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings
MAINTAINERS: mailmap: update Lorenzo Stoakes's email address
mm: fix crashes from deferred split racing folio migration
lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat
mm: gup: stop abusing try_grab_folio
nilfs2: fix kernel bug on rename operation of broken directory
mm/hugetlb_vmemmap: fix race with speculative PFN walkers
cachestat: do not flush stats in recency check
mm/shmem: disable PMD-sized page cache if needed
mm/filemap: skip to create PMD-sized page cache if needed
mm/readahead: limit page cache size in page_cache_ra_order()
mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray
mm/damon/core: merge regions aggressively when max_nr_regions is unmet
Fix userfaultfd_api to return EINVAL as expected
mm: vmalloc: check if a hash-index is in cpu_possible_mask
mm: prevent derefencing NULL ptr in pfn_section_valid()
...
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"One core change that moves a disk start message to a location where it
will only be printed once instead of twice plus a couple of error
handling race fixes in the ufs driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Do not repeat the starting disk message
scsi: ufs: core: Fix ufshcd_abort_one racing issue
scsi: ufs: core: Fix ufshcd_clear_cmd racing issue
Merge tag 'vfio-v6.10' of https://github.com/awilliam/linux-vfio
Pull VFIO fix from Alex Williamson:
- Recent stable backports are exposing a bug introduced in the v6.10
development cycle where a counter value is uninitialized. This leads
to regressions in userspace drivers like QEMU where where the kernel
might ask for an arbitrary buffer size or return out of memory itself
based on a bogus value. Zero initialize the counter. (Yi Liu)
* tag 'vfio-v6.10' of https://github.com/awilliam/linux-vfio:
vfio/pci: Init the count variable in collecting hot-reset devices
Merge tag 'usb-serial-6.10-rc8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for 6.10-rc8
Here's a fix for a long-standing issue in the mos7840 driver that can trigger
a crash when resuming from system suspend.
Included are also some new modem device ids.
All have been in linux-next with no reported issues.
* tag 'usb-serial-6.10-rc8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: mos7840: fix crash on resume
USB: serial: option: add Rolling RW350-GL variants
USB: serial: option: add support for Foxconn T99W651
USB: serial: option: add Netprisma LCUK54 series modules
Merge tag 'acpi-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix the sorting of _CST output data in the ACPI processor idle driver
(Kuan-Wei Chiu)"
* tag 'acpi-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor_idle: Fix invalid comparison with insertion sort for latency
Merge tag 'pm-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix two issues related to boost frequencies handling, one in the
cpufreq core and one in the ACPI cpufreq driver (Mario Limonciello)"
* tag 'pm-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: ACPI: Mark boost policy as enabled when setting boost
cpufreq: Allow drivers to advertise boost enabled
Merge tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"These fix a possible NULL pointer dereference in a thermal governor,
fix up the handling of thermal zones enabled before their temperature
can be determined and fix list sorting during thermal zone temperature
updates.
Specifics:
- Prevent the Power Allocator thermal governor from dereferencing a
NULL pointer if it is bound to a tripless thermal zone (NÃcolas
Prado)
- Prevent thermal zones enabled too early from staying effectively
dormant forever because their temperature cannot be determined
initially (Rafael Wysocki)
- Fix list sorting during thermal zone temperature updates to ensure
the proper ordering of trip crossing notifications (Rafael
Wysocki)"
* tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: core: Fix list sorting in __thermal_zone_device_update()
thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
thermal: gov_power_allocator: Return early in manage if trip_max is NULL
Merge tag 'devicetree-fixes-for-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fix from Rob Herring:
- One fix for PASemi Nemo board interrupts
* tag 'devicetree-fixes-for-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/irq: Disable "interrupt-map" parsing for PASEMI Nemo
Yi Liu [Wed, 10 Jul 2024 00:41:50 +0000 (17:41 -0700)]
vfio/pci: Init the count variable in collecting hot-reset devices
The count variable is used without initialization, it results in mistakes
in the device counting and crashes the userspace if the get hot reset info
path is triggered.
After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg:
account for certain kmem allocations to memcg")
Kent Overstreet [Sun, 30 Jun 2024 02:12:09 +0000 (22:12 -0400)]
closures: fix closure_sync + closure debugging
originally, stack closures were only used synchronously, and with the
original implementation of closure_sync() the ref never hit 0; thus,
closure_put_after_sub() assumes that if the ref hits 0 it's on the debug
list, in debug mode.
that's no longer true with the current implementation of closure_sync,
so we need a new magic so closure_debug_destroy() doesn't pop an assert.
libceph: fix race between delayed_work() and ceph_monc_stop()
The way the delayed work is handled in ceph_monc_stop() is prone to
races with mon_fault() and possibly also finish_hunting(). Both of
these can requeue the delayed work which wouldn't be canceled by any of
the following code in case that happens after cancel_delayed_work_sync()
runs -- __close_session() doesn't mess with the delayed work in order
to avoid interfering with the hunting interval logic. This part was
missed in commit b5d91704f53e ("libceph: behave in mon_fault() if
cur_mon < 0") and use-after-free can still ensue on monc and objects
that hang off of it, with monc->auth and monc->monmap being
particularly susceptible to quickly being reused.
To fix this:
- clear monc->cur_mon and monc->hunting as part of closing the session
in ceph_monc_stop()
- bail from delayed_work() if monc->cur_mon is cleared, similar to how
it's done in mon_fault() and finish_hunting() (based on monc->hunting)
- call cancel_delayed_work_sync() after the session is closed
Michal Kubiak [Mon, 8 Jul 2024 23:07:49 +0000 (16:07 -0700)]
i40e: Fix XDP program unloading while removing the driver
The commit 6533e558c650 ("i40e: Fix reset path while removing
the driver") introduced a new PF state "__I40E_IN_REMOVE" to block
modifying the XDP program while the driver is being removed.
Unfortunately, such a change is useful only if the ".ndo_bpf()"
callback was called out of the rmmod context because unloading the
existing XDP program is also a part of driver removing procedure.
In other words, from the rmmod context the driver is expected to
unload the XDP program without reporting any errors. Otherwise,
the kernel warning with callstack is printed out to dmesg.
Example failing scenario:
1. Load the i40e driver.
2. Load the XDP program.
3. Unload the i40e driver (using "rmmod" command).
Fix this by checking if the XDP program is being loaded or unloaded.
Then, block only loading a new program while "__I40E_IN_REMOVE" is set.
Also, move testing "__I40E_IN_REMOVE" flag to the beginning of XDP_SETUP
callback to avoid unnecessary operations and checks.
It's because hugetlb folio is passed to __folio_undo_large_rmappable()
unexpectedly. large_rmappable flag is imperceptibly set to hugetlb folio
since commit f6a8dd98a2ce ("hugetlb: convert alloc_buddy_hugetlb_folio to
use a folio"). Then commit be9581ea8c05 ("mm: fix crashes from deferred
split racing folio migration") makes folio_migrate_mapping() call
folio_undo_large_rmappable() triggering the bug. Fix this issue by
clearing large_rmappable flag for hugetlb folios. They don't need that
flag set anyway.
Miaohe Lin [Mon, 8 Jul 2024 02:51:27 +0000 (10:51 +0800)]
mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
There is a potential race between __update_and_free_hugetlb_folio() and
try_memory_failure_hugetlb():
CPU1 CPU2
__update_and_free_hugetlb_folio try_memory_failure_hugetlb
folio_test_hugetlb
-- It's still hugetlb folio.
folio_clear_hugetlb_hwpoison
spin_lock_irq(&hugetlb_lock);
__get_huge_page_for_hwpoison
folio_set_hugetlb_hwpoison
spin_unlock_irq(&hugetlb_lock);
spin_lock_irq(&hugetlb_lock);
__folio_clear_hugetlb(folio);
-- Hugetlb flag is cleared but too late.
spin_unlock_irq(&hugetlb_lock);
When the above race occurs, raw error page info will be leaked. Even
worse, raw error pages won't have hwpoisoned flag set and hit
pcplists/buddy. Fix this issue by deferring
folio_clear_hugetlb_hwpoison() until __folio_clear_hugetlb() is done. So
all raw error pages will have hwpoisoned flag set.
ZhangPeng [Wed, 13 Mar 2024 01:29:13 +0000 (09:29 +0800)]
filemap: replace pte_offset_map() with pte_offset_map_nolock()
The vmf->ptl in filemap_fault_recheck_pte_none() is still set from
handle_pte_fault(). But at the same time, we did a pte_unmap(vmf->pte).
After a pte_unmap(vmf->pte) unmap and rcu_read_unlock(), the page table
may be racily changed and vmf->ptl maybe fails to protect the actual page
table. Fix this by replacing pte_offset_map() with
pte_offset_map_nolock().
As David said, the PTL pointer might be stale so if we continue to use
it infilemap_fault_recheck_pte_none(), it might trigger UAF. Also, if
the PTL fails, the issue fixed by commit 58f327f2ce80 ("filemap: avoid
unnecessary major faults in filemap_fault()") might reappear.
The warning happens when these functions are called from an __init
function and they don't get inlined (remain in the .text section) while
the value they return points into .init.data section. Assuming
get_current() always returns a valid address, this situation can happen
only during init stage and accessing .init.data from .text section during
that stage should pose no issues.
The warnings happen when these functions are called from an __init
function and they don't get inlined (remain in the .text section) while
the value returned by get_current() points into .init.data section.
Assuming get_current() always returns a valid address, this situation can
happen only during init stage and accessing .init.data from .text section
during that stage should pose no issues.
X would not start in my old 32-bit partition (and the "n"-handling looks
just as wrong on 64-bit, but for whatever reason did not show up there):
"n" must be accumulated over all pages before it's added to "offset" and
compared with "copy", immediately after the skb_frag_foreach_page() loop.
David Lechner [Tue, 9 Jul 2024 01:05:30 +0000 (20:05 -0500)]
spi: mux: set ctlr->bits_per_word_mask
Like other SPI controller flags, bits_per_word_mask may be used by a
peripheral driver, so it needs to reflect the capabilities of the
underlying controller.
David Lechner [Tue, 9 Jul 2024 01:05:29 +0000 (20:05 -0500)]
spi: add defer_optimize_message controller flag
Adding spi_optimize_message() broke the spi-mux driver because it
calls spi_async() from it's transfer_one_message() callback. This
resulted in passing an incorrectly optimized message to the controller.
For example, if the underlying controller has an optimize_message()
callback, this would have not been called and can cause a crash when
the underlying controller driver tries to transfer the message.
Also, since the spi-mux driver swaps out the controller pointer by
replacing msg->spi, __spi_unoptimize_message() was being called with a
different controller than the one used in __spi_optimize_message(). This
could cause a crash when attempting to free the message resources when
__spi_unoptimize_message() is called in spi_finalize_current_message()
since it is being called with a controller that did not allocate the
resources.
This is fixed by adding a defer_optimize_message flag for controllers.
This flag causes all of the spi_[maybe_][un]optimize_message() calls to
be a no-op (other than attaching a pointer to the spi device to the
message).
This allows the spi-mux driver to pass an unmodified message to
spi_async() in spi_mux_transfer_one_message() after the spi device has
been swapped out. This causes __spi_optimize_message() and
__spi_unoptimize_message() to be called only once per message and with
the correct/same controller in each case.
David Lechner [Tue, 9 Jul 2024 01:05:28 +0000 (20:05 -0500)]
spi: don't unoptimize message in spi_async()
Calling spi_maybe_unoptimize_message() in spi_async() is wrong because
the message is likely to be in the queue and not transferred yet. This
can corrupt the message while it is being used by the controller driver.
spi_maybe_unoptimize_message() is already called in the correct place
in spi_finalize_current_message() to balance the call to
spi_maybe_optimize_message() in spi_async().
Dmitry Smirnov [Fri, 14 Jun 2024 22:45:56 +0000 (01:45 +0300)]
USB: serial: mos7840: fix crash on resume
Since commit c49cfa917025 ("USB: serial: use generic method if no
alternative is provided in usb serial layer"), USB serial core calls the
generic resume implementation when the driver has not provided one.
This can trigger a crash on resume with mos7840 since support for
multiple read URBs was added back in 2011. Specifically, both port read
URBs are now submitted on resume for open ports, but the context pointer
of the second URB is left set to the core rather than mos7840 port
structure.
Fix this by implementing dedicated suspend and resume functions for
mos7840.
Tested with Delock 87414 USB 2.0 to 4x serial adapter.
Signed-off-by: Dmitry Smirnov <[email protected]>
[ johan: analyse crash and rewrite commit message; set busy flag on
resume; drop bulk-in check; drop unnecessary usb_kill_urb() ] Fixes: d83b405383c9 ("USB: serial: add support for multiple read urbs") Cc: [email protected] # 3.3 Signed-off-by: Johan Hovold <[email protected]>
Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix access flags to address fuse incompatibility
- fix device type returned by get filesystem info
* tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: discard write access to the directory open
ksmbd: return FILE_DEVICE_DISK instead of super magic