Linus Torvalds [Sat, 8 Jun 2024 16:44:50 +0000 (09:44 -0700)]
Merge tag 'irq-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Fix possible memory leak the riscv-intc irqchip driver load failures
- Fix boot crash in the sifive-plic irqchip driver caused by recently
changed boot initialization order
- Fix race condition in the gic-v3-its irqchip driver
* tag 'irq-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix potential race condition in its_vlpi_prop_update()
irqchip/sifive-plic: Chain to parent IRQ after handlers are ready
irqchip/riscv-intc: Prevent memory leak when riscv_intc_init_common() fails
Linus Torvalds [Sat, 8 Jun 2024 16:36:08 +0000 (09:36 -0700)]
Merge tag 'x86-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Miscellaneous fixes:
- Fix kexec() crash if call depth tracking is enabled
- Fix SMN reads on inaccessible registers on certain AMD systems"
* tag 'x86-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_nb: Check for invalid SMN reads
x86/kexec: Fix bug with call depth tracking
Linus Torvalds [Sat, 8 Jun 2024 16:26:59 +0000 (09:26 -0700)]
Merge tag 'perf-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
"Fix race between perf_event_free_task() and perf_event_release_kernel()
that can result in missed wakeups and hung tasks"
* tag 'perf-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix missing wakeup when waiting for context reference
Linus Torvalds [Sat, 8 Jun 2024 16:03:46 +0000 (09:03 -0700)]
Merge tag 'locking-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking doc fix from Ingo Molnar:
"Fix typos in the kerneldoc of some of the atomic APIs"
* tag 'locking-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/atomic: scripts: fix ${atomic}_sub_and_test() kerneldoc
Linus Torvalds [Sat, 8 Jun 2024 00:01:10 +0000 (17:01 -0700)]
Merge tag 'mm-hotfixes-stable-2024-06-07-15-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"14 hotfixes, 6 of which are cc:stable.
All except the nilfs2 fix affect MM and all are singletons - see the
chagelogs for details"
* tag 'mm-hotfixes-stable-2024-06-07-15-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errors
mm: fix xyz_noprof functions calling profiled functions
codetag: avoid race at alloc_slab_obj_exts
mm/hugetlb: do not call vma_add_reservation upon ENOMEM
mm/ksm: fix ksm_zero_pages accounting
mm/ksm: fix ksm_pages_scanned accounting
kmsan: do not wipe out origin when doing partial unpoisoning
vmalloc: check CONFIG_EXECMEM in is_vmalloc_or_module_addr()
mm: page_alloc: fix highatomic typing in multi-block buddies
nilfs2: fix potential kernel bug due to lack of writeback flag waiting
memcg: remove the lockdep assert from __mod_objcg_mlstate()
mm: arm64: fix the out-of-bounds issue in contpte_clear_young_dirty_ptes
mm: huge_mm: fix undefined reference to `mthp_stats' for CONFIG_SYSFS=n
mm: drop the 'anon_' prefix for swap-out mTHP counters
Linus Torvalds [Fri, 7 Jun 2024 23:54:57 +0000 (16:54 -0700)]
Merge tag 'gpio-fixes-for-v6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- interrupt handling and Kconfig fixes for gpio-tqmx86
- add a buffer for storing output values in gpio-tqmx86 as reading back
the registers always returns the input values
- add missing MODULE_DESCRIPTION()s to several GPIO drivers
* tag 'gpio-fixes-for-v6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: add missing MODULE_DESCRIPTION() macros
gpio: tqmx86: fix broken IRQ_TYPE_EDGE_BOTH interrupt type
gpio: tqmx86: store IRQ trigger type and unmask status separately
gpio: tqmx86: introduce shadow register for GPIO output value
gpio: tqmx86: fix typo in Kconfig label
Linus Torvalds [Fri, 7 Jun 2024 23:45:48 +0000 (16:45 -0700)]
Merge tag 'block-6.10-20240607' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for null_blk block size validation (Andreas)
- NVMe pull request via Keith:
- Use reserved tags for special fabrics operations (Chunguang)
- Persistent Reservation status masking fix (Weiwen)
* tag 'block-6.10-20240607' of git://git.kernel.dk/linux:
null_blk: fix validation of block size
nvme: fix nvme_pr_* status code parsing
nvme-fabrics: use reserved tag for reg read/write command
Linus Torvalds [Fri, 7 Jun 2024 23:43:07 +0000 (16:43 -0700)]
Merge tag 'io_uring-6.10-20240607' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix a locking order issue with setting max async thread workers
(Hagar)
- Fix for a NULL pointer dereference for failed async flagged requests
using ring provided buffers. This doesn't affect the current kernel,
but it does affect older kernels, and is being queued up for 6.10
just to make the stable process easier (me)
- Fix for NAPI timeout calculations for how long to busy poll, and
subsequently how much to sleep post that if a wait timeout is passed
in (me)
- Fix for a regression in this release cycle, where we could end up
using a partially unitialized match value for io-wq (Su)
* tag 'io_uring-6.10-20240607' of git://git.kernel.dk/linux:
io_uring: fix possible deadlock in io_register_iowq_max_workers()
io_uring/io-wq: avoid garbage value of 'match' in io_wq_enqueue()
io_uring/napi: fix timeout calculation
io_uring: check for non-NULL file pointer in io_file_can_poll()
Linus Torvalds [Fri, 7 Jun 2024 22:13:12 +0000 (15:13 -0700)]
Merge tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix handling of folio private changes.
The private value holds pointer to our extent buffer structure
representing a metadata range. Release and create of the range was
not properly synchronized when updating the private bit which ended
up in double folio_put, leading to all sorts of breakage
- fix a crash, reported as duplicate key in metadata, but caused by a
race of fsync and size extending write. Requires prealloc target
range + fsync and other conditions (log tree state, timing)
- fix leak of qgroup extent records after transaction abort
* tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: protect folio::private when attaching extent buffer folios
btrfs: fix leak of qgroup extent records after transaction abort
btrfs: fix crash on racing fsync and size-extending write into prealloc
Linus Torvalds [Fri, 7 Jun 2024 21:47:38 +0000 (14:47 -0700)]
Merge tag 'riscv-for-linus-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- Another fix to avoid allocating pages that overlap with ERR_PTR,
which manifests on rv32
- A revert for the badaccess patch I incorrectly picked up an early
version of
* tag 'riscv-for-linus-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
Revert "riscv: mm: accelerate pagefault when badaccess"
riscv: fix overlap of allocated page and PTR_ERR
Linus Torvalds [Fri, 7 Jun 2024 21:44:53 +0000 (14:44 -0700)]
Merge tag 's390-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Do not create PT_LOAD program header for the kenel image when the
virtual memory informaton in OS_INFO data is not available. That
fixes stand-alone dump failures against kernels that do not provide
the virtual memory informaton
- Add KVM s390 shared zeropage selftest
* tag 's390-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
KVM: s390x: selftests: Add shared zeropage test
s390/crash: Do not use VM info if os_info does not have it
Linus Torvalds [Fri, 7 Jun 2024 21:36:57 +0000 (14:36 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- Fix spurious CPU hotplug warning message from SETEND emulation code
- Fix the build when GCC wasn't inlining our I/O accessor internals
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/io: add constant-argument check
arm64: armv8_deprecated: Fix warning in isndep cpuhp starting process
Linus Torvalds [Fri, 7 Jun 2024 21:13:46 +0000 (14:13 -0700)]
Merge tag 'platform-drivers-x86-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- Default silead touchscreen driver to 10 fingers and drop 10 finger
setting from all DMI quirks. More of a cleanup then a pure fix, but
since the DMI quirks always get updated through the fixes branch
this avoids conflicts.
- Kconfig fix for randconfig builds
- dell-smbios: Fix wrong token data in sysfs
- amd-hsmp: Fix driver poking unsupported hw when loaded manually
* tag 'platform-drivers-x86-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/amd/hsmp: Check HSMP support on AMD family of processors
platform/x86: dell-smbios: Simplify error handling
platform/x86: dell-smbios: Fix wrong token data in sysfs
platform/x86: yt2-1380: add CONFIG_EXTCON dependency
platform/x86: touchscreen_dmi: Use 2-argument strscpy()
platform/x86: touchscreen_dmi: Drop "silead,max-fingers" property
Input: silead - Always support 10 fingers
Linus Torvalds [Fri, 7 Jun 2024 20:34:53 +0000 (13:34 -0700)]
Merge tag 'iommu-fixes-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"Core:
- Make iommu-dma code recognize 'force_aperture' again
- Fix for potential NULL-ptr dereference from iommu_sva_bind_device()
return value
AMD IOMMU fixes:
- Fix lockdep splat for invalid wait context
- Add feature bit check before enabling PPR
- Make workqueue name fit into buffer
- Fix memory leak in sysfs code"
* tag 'iommu-fixes-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Fix Invalid wait context issue
iommu/amd: Check EFR[EPHSup] bit before enabling PPR
iommu/amd: Fix workqueue name
iommu: Return right value in iommu_sva_bind_device()
iommu/dma: Fix domain init
iommu/amd: Fix sysfs leak in iommu init
vmwgfx:
- filter modes greater than available graphics memory
- fix 3D vs STDU enable
- remove STDU logic from mode valid
- logging fix
- memcmp pointers fix
- remove unused struct
- screen target lifetime fix
komeda:
- unused struct removal"
* tag 'drm-fixes-2024-06-07' of https://gitlab.freedesktop.org/drm/kernel:
drm/vmwgfx: Don't memcmp equivalent pointers
drm/vmwgfx: remove unused struct 'vmw_stdu_dma'
drm/vmwgfx: Don't destroy Screen Target when CRTC is enabled but inactive
drm/vmwgfx: Standardize use of kibibytes when logging
drm/vmwgfx: Remove STDU logic from generic mode_valid function
drm/vmwgfx: 3D disabled should not effect STDU memory limits
drm/vmwgfx: Filter modes which exceed graphics memory
drm/amdgpu/pptable: Fix UBSAN array-index-out-of-bounds
drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms
drm/xe/pf: Update the LMTT when freeing VF GT config
drm/panel: sitronix-st7789v: Add check for of_drm_get_panel_orientation
drm/komeda: remove unused struct 'gamma_curve_segment'
Jeff Johnson [Fri, 7 Jun 2024 03:23:50 +0000 (20:23 -0700)]
gpio: add missing MODULE_DESCRIPTION() macros
On x86, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpio/gpio-gw-pld.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpio/gpio-mc33880.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpio/gpio-pcf857x.o
Add the missing invocations of the MODULE_DESCRIPTION() macro,
including the one missing in gpio-pl061.c, which is not built for x86.
Linus Torvalds [Thu, 6 Jun 2024 21:40:51 +0000 (14:40 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The core change is to detect unusually large number of VPD pages
(caused by device manufacturers having an endiannes issue) and reject
them rather than trying to parse a huge non-existent array.
The remaining fixes are in drivers the most user visible of which is
the ALUA state transition recognition (leads to intermittent I/O
errors in some situations otherwise)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: mcq: Fix error output and clean up ufshcd_mcq_abort()
scsi: core: Handle devices which return an unusually large VPD page count
scsi: mpt3sas: Add missing kerneldoc parameter descriptions
scsi: qedf: Set qed_slowpath_params to zero before use
scsi: qedf: Wait for stag work during unload
scsi: qedf: Don't process stag work during unload and recovery
scsi: sr: Fix unintentional arithmetic wraparound
scsi: core: alua: I/O errors for ALUA state transitions
scsi: mpi3mr: Use proper format specifier in mpi3mr_sas_port_add()
Linus Torvalds [Thu, 6 Jun 2024 21:28:11 +0000 (14:28 -0700)]
Merge tag 'pci-v6.10-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fix from Bjorn Helgaas:
- Revert lockdep checking on locking that protects device resets from
user-space config accesses; it exposed issues for which fixes are in
the works but are too risky for this cycle (Dan Williams)
* tag 'pci-v6.10-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Revert the cfg_access_lock lockdep mechanism
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
Linus Torvalds [Thu, 6 Jun 2024 16:55:27 +0000 (09:55 -0700)]
Merge tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from BPF and big collection of fixes for WiFi core and
drivers.
Current release - regressions:
- vxlan: fix regression when dropping packets due to invalid src
addresses
- bpf: fix a potential use-after-free in bpf_link_free()
- xdp: revert support for redirect to any xsk socket bound to the
same UMEM as it can result in a corruption
- virtio_net:
- add missing lock protection when reading return code from
control_buf
- fix false-positive lockdep splat in DIM
- Revert "wifi: wilc1000: convert list management to RCU"
- wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config
Previous releases - regressions:
- rtnetlink: make the "split" NLM_DONE handling generic, restore the
old behavior for two cases where we started coalescing those
messages with normal messages, breaking sloppily-coded userspace
- wifi:
- cfg80211: validate HE operation element parsing
- cfg80211: fix 6 GHz scan request building
- mt76: mt7615: add missing chanctx ops
- ath11k: move power type check to ASSOC stage, fix connecting to
6 GHz AP
- ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
- rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
- iwlwifi: mvm: fix a crash on 7265
Previous releases - always broken:
- ncsi: prevent multi-threaded channel probing, a spec violation
- vmxnet3: disable rx data ring on dma allocation failure
- ethtool: init tsinfo stats if requested, prevent unintentionally
reporting all-zero stats on devices which don't implement any
- dst_cache: fix possible races in less common IPv6 features
- tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED
- ax25: fix two refcounting bugs
- eth: ionic: fix kernel panic in XDP_TX action
Misc:
- tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"
* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
selftests: net: lib: set 'i' as local
selftests: net: lib: avoid error removing empty netns name
selftests: net: lib: support errexit with busywait
net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
ipv6: fix possible race in __fib6_drop_pcpu_from()
af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
af_unix: Annotate data-races around sk->sk_sndbuf.
af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
af_unix: Annotate data-race of sk->sk_state in unix_accept().
af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
af_unix: Annodate data-races around sk->sk_state for writers.
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
...
Jakub Kicinski [Thu, 6 Jun 2024 15:23:44 +0000 (08:23 -0700)]
Merge branch 'selftests-net-lib-small-fixes'
Matthieu Baerts says:
====================
selftests: net: lib: small fixes
While looking at using 'lib.sh' for the MPTCP selftests [1], we found
some small issues with 'lib.sh'. Here they are:
- Patch 1: fix 'errexit' (set -e) support with busywait. 'errexit' is
supported in some functions, not all. A fix for v6.8+.
- Patch 2: avoid confusing error messages linked to the cleaning part
when the netns setup fails. A fix for v6.8+.
- Patch 3: set a variable as local to avoid accidentally changing the
value of a another one with the same name on the caller side. A fix
for v6.10-rc1+.
Without this, the 'i' variable declared before could be overridden by
accident, e.g.
for i in "${@}"; do
__ksft_status_merge "${i}" ## 'i' has been modified
foo "${i}" ## using 'i' with an unexpected value
done
After a quick look, it looks like 'i' is currently not used after having
been modified in __ksft_status_merge(), but still, better be safe than
sorry. I saw this while modifying the same file, not because I suspected
an issue somewhere.
selftests: net: lib: avoid error removing empty netns name
If there is an error to create the first netns with 'setup_ns()',
'cleanup_ns()' will be called with an empty string as first parameter.
The consequences is that 'cleanup_ns()' will try to delete an invalid
netns, and wait 20 seconds if the netns list is empty.
Instead of just checking if the name is not empty, convert the string
separated by spaces to an array. Manipulating the array is cleaner, and
calling 'cleanup_ns()' with an empty array will be a no-op.
ata: pata_macio: Fix max_segment_size with PAGE_SIZE == 64K
The pata_macio driver advertises a max_segment_size of 0xff00, because
the hardware doesn't cope with requests >= 64K.
However the SCSI core requires max_segment_size to be at least
PAGE_SIZE, which is a problem for pata_macio when the kernel is built
with 64K pages.
In older kernels the SCSI core would just increase the segment size to
be equal to PAGE_SIZE, however since the commit tagged below it causes a
warning and the device fails to probe:
WARNING: CPU: 0 PID: 26 at block/blk-settings.c:202 .blk_validate_limits+0x2f8/0x35c
CPU: 0 PID: 26 Comm: kworker/u4:1 Not tainted 6.10.0-rc1 #1
Hardware name: PowerMac7,2 PPC970 0x390202 PowerMac
...
NIP .blk_validate_limits+0x2f8/0x35c
LR .blk_alloc_queue+0xc0/0x2f8
Call Trace:
.blk_alloc_queue+0xc0/0x2f8
.blk_mq_alloc_queue+0x60/0xf8
.scsi_alloc_sdev+0x208/0x3c0
.scsi_probe_and_add_lun+0x314/0x52c
.__scsi_add_device+0x170/0x1a4
.ata_scsi_scan_host+0x2bc/0x3e4
.async_port_probe+0x6c/0xa0
.async_run_entry_fn+0x60/0x1bc
.process_one_work+0x228/0x510
.worker_thread+0x360/0x530
.kthread+0x134/0x13c
.start_kernel_thread+0x10/0x14
...
scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
Although the hardware can't cope with a 64K segment, the driver
already deals with that internally by splitting large requests in
pata_macio_qc_prep(). That is how the driver has managed to function
until now on 64K kernels.
So fix the driver to advertise a max_segment_size of 64K, which avoids
the warning and keeps the SCSI core happy.
Fixes: afd53a3d8528 ("scsi: core: Initialize scsi midlayer limits before allocating the queue") Reported-by: Guenter Roeck <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Doru Iorgulescu <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218858 Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: John Garry <[email protected]> Signed-off-by: Niklas Cassel <[email protected]>
Note that with this series there is skb_queue_empty() left in
unix_dgram_disconnected() that needs to be changed to lockless
version, and unix_peer(other) access there should be protected
by unix_state_lock().
This will require some refactoring, so another series will follow.
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thues, unix_state_lock() does not protect qlen.
Let's use skb_queue_empty_lockless() in unix_release_sock().
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid data-race.
Now we remove unix_recvq_full() as no one uses it.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
calls unix_writable() which also reads sk->sk_state without holding
unix_state_lock().
Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().
While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as
that state does not exist for AF_UNIX socket since the code was added.
Fixes: 1586a5877db9 ("af_unix: do not report POLLOUT on listeners") Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
net: wwan: iosm: Fix tainted pointer delete is case of region creation fail
In case of region creation fail in ipc_devlink_create_region(), previously
created regions delete process starts from tainted pointer which actually
holds error code value.
Fix this bug by decreasing region index before delete.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Ian Forbes [Thu, 28 Mar 2024 19:07:16 +0000 (14:07 -0500)]
drm/vmwgfx: Don't memcmp equivalent pointers
These pointers are frequently the same and memcmp does not compare the
pointers before comparing their contents so this was wasting cycles
comparing 16 KiB of memory which will always be equal.
====================
Intel Wired LAN Driver Updates 2024-05-29 (ice, igc)
This series includes fixes for the ice driver as well as a fix for the igc
driver.
Jacob fixes two issues in the ice driver with reading the NVM for providing
firmware data via devlink info. First, fix an off-by-one error when reading
the Preserved Fields Area, resolving an infinite loop triggered on some
NVMs which lack certain data in the NVM. Second, fix the reading of the NVM
Shadow RAM on newer E830 and E825-C devices which have a variable sized CSS
header rather than assuming this header is always the same fixed size as in
the E810 devices.
Larysa fixes three issues with the ice driver XDP logic that could occur if
the number of queues is changed after enabling an XDP program. First, the
af_xdp_zc_qps bitmap is removed and replaced by simpler logic to track
whether queues are in zero-copy mode. Second, the reset and .ndo_bpf flows
are distinguished to avoid potential races with a PF reset occuring
simultaneously to .ndo_bpf callback from userspace. Third, the logic for
mapping XDP queues to vectors is fixed so that XDP state is restored for
XDP queues after a reconfiguration.
Sasha fixes reporting of Energy Efficient Ethernet support via ethtool in
the igc driver.
Sasha Neftin [Mon, 3 Jun 2024 21:42:35 +0000 (14:42 -0700)]
igc: Fix Energy Efficient Ethernet support declaration
The commit 01cf893bf0f4 ("net: intel: i40e/igc: Remove setting Autoneg in
EEE capabilities") removed SUPPORTED_Autoneg field but left inappropriate
ethtool_keee structure initialization. When "ethtool --show <device>"
(get_eee) invoke, the 'ethtool_keee' structure was accidentally overridden.
Remove the 'ethtool_keee' overriding and add EEE declaration as per IEEE
specification that allows reporting Energy Efficient Ethernet capabilities.
Examples:
Before fix:
ethtool --show-eee enp174s0
EEE settings for enp174s0:
EEE status: not supported
After fix:
EEE settings for enp174s0:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes: 100baseT/Full
1000baseT/Full
2500baseT/Full
Larysa Zaremba [Mon, 3 Jun 2024 21:42:34 +0000 (14:42 -0700)]
ice: map XDP queues to vectors in ice_vsi_map_rings_to_vectors()
ice_pf_dcb_recfg() re-maps queues to vectors with
ice_vsi_map_rings_to_vectors(), which does not restore the previous
state for XDP queues. This leads to no AF_XDP traffic after rebuild.
Map XDP queues to vectors in ice_vsi_map_rings_to_vectors().
Also, move the code around, so XDP queues are mapped independently only
through .ndo_bpf().
Larysa Zaremba [Mon, 3 Jun 2024 21:42:33 +0000 (14:42 -0700)]
ice: add flag to distinguish reset from .ndo_bpf in XDP rings config
Commit 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
has placed ice_vsi_free_q_vectors() after ice_destroy_xdp_rings() in
the rebuild process. The behaviour of the XDP rings config functions is
context-dependent, so the change of order has led to
ice_destroy_xdp_rings() doing additional work and removing XDP prog, when
it was supposed to be preserved.
Also, dependency on the PF state reset flags creates an additional,
fortunately less common problem:
* PFR is requested e.g. by tx_timeout handler
* .ndo_bpf() is asked to delete the program, calls ice_destroy_xdp_rings(),
but reset flag is set, so rings are destroyed without deleting the
program
* ice_vsi_rebuild tries to delete non-existent XDP rings, because the
program is still on the VSI
* system crashes
With a similar race, when requested to attach a program,
ice_prepare_xdp_rings() can actually skip setting the program in the VSI
and nevertheless report success.
Instead of reverting to the old order of function calls, add an enum
argument to both ice_prepare_xdp_rings() and ice_destroy_xdp_rings() in
order to distinguish between calls from rebuild and .ndo_bpf().
Larysa Zaremba [Mon, 3 Jun 2024 21:42:32 +0000 (14:42 -0700)]
ice: remove af_xdp_zc_qps bitmap
Referenced commit has introduced a bitmap to distinguish between ZC and
copy-mode AF_XDP queues, because xsk_get_pool_from_qid() does not do this
for us.
The bitmap would be especially useful when restoring previous state after
rebuild, if only it was not reallocated in the process. This leads to e.g.
xdpsock dying after changing number of queues.
Instead of preserving the bitmap during the rebuild, remove it completely
and distinguish between ZC and copy-mode queues based on the presence of
a device associated with the pool.
Jacob Keller [Mon, 3 Jun 2024 21:42:31 +0000 (14:42 -0700)]
ice: fix reads from NVM Shadow RAM on E830 and E825-C devices
The ice driver reads data from the Shadow RAM portion of the NVM during
initialization, including data used to identify the NVM image and device,
such as the ETRACK ID used to populate devlink dev info fw.bundle.
Currently it is using a fixed offset defined by ICE_CSS_HEADER_LENGTH to
compute the appropriate offset. This worked fine for E810 and E822 devices
which both have CSS header length of 330 words.
Other devices, including both E825-C and E830 devices have different sizes
for their CSS header. The use of a hard coded value results in the driver
reading from the wrong block in the NVM when attempting to access the
Shadow RAM copy. This results in the driver reporting the fw.bundle as 0x0
in both the devlink dev info and ethtool -i output.
The first E830 support was introduced by commit ba20ecb1d1bb ("ice: Hook up
4 E830 devices by adding their IDs") and the first E825-C support was
introducted by commit f64e18944233 ("ice: introduce new E825C devices
family")
The NVM actually contains the CSS header length embedded in it. Remove the
hard coded value and replace it with logic to read the length from the NVM
directly. This is more resilient against all existing and future hardware,
vs looking up the expected values from a table. It ensures the driver will
read from the appropriate place when determining the ETRACK ID value used
for populating the fw.bundle_id and for reporting in ethtool -i.
The CSS header length for both the active and inactive flash bank is stored
in the ice_bank_info structure to avoid unnecessary duplicate work when
accessing multiple words of the Shadow RAM. Both banks are read in the
unlikely event that the header length is different for the NVM in the
inactive bank, rather than being different only by the overall device
family.
Jacob Keller [Mon, 3 Jun 2024 21:42:30 +0000 (14:42 -0700)]
ice: fix iteration of TLVs in Preserved Fields Area
The ice_get_pfa_module_tlv() function iterates over the Type-Length-Value
structures in the Preserved Fields Area (PFA) of the NVM. This is used by
the driver to access data such as the Part Board Assembly identifier.
The function uses simple logic to iterate over the PFA. First, the pointer
to the PFA in the NVM is read. Then the total length of the PFA is read
from the first word.
A pointer to the first TLV is initialized, and a simple loop iterates over
each TLV. The pointer is moved forward through the NVM until it exceeds the
PFA area.
The logic seems sound, but it is missing a key detail. The Preserved
Fields Area length includes one additional final word. This is documented
in the device data sheet as a dummy word which contains 0xFFFF. All NVMs
have this extra word.
If the driver tries to scan for a TLV that is not in the PFA, it will read
past the size of the PFA. It reads and interprets the last dummy word of
the PFA as a TLV with type 0xFFFF. It then reads the word following the PFA
as a length.
The PFA resides within the Shadow RAM portion of the NVM, which is
relatively small. All of its offsets are within a 16-bit size. The PFA
pointer and TLV pointer are stored by the driver as 16-bit values.
In almost all cases, the word following the PFA will be such that
interpreting it as a length will result in 16-bit arithmetic overflow. Once
overflowed, the new next_tlv value is now below the maximum offset of the
PFA. Thus, the driver will continue to iterate the data as TLVs. In the
worst case, the driver hits on a sequence of reads which loop back to
reading the same offsets in an endless loop.
To fix this, we need to correct the loop iteration check to account for
this extra word at the end of the PFA. This alone is sufficient to resolve
the known cases of this issue in the field. However, it is plausible that
an NVM could be misconfigured or have corrupt data which results in the
same kind of overflow. Protect against this by using check_add_overflow
when calculating both the maximum offset of the TLVs, and when calculating
the next_tlv offset at the end of each loop iteration. This ensures that
the driver will not get stuck in an infinite loop when scanning the PFA.
Ryusuke Konishi [Tue, 4 Jun 2024 13:42:55 +0000 (22:42 +0900)]
nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errors
The error handling in nilfs_empty_dir() when a directory folio/page read
fails is incorrect, as in the old ext2 implementation, and if the
folio/page cannot be read or nilfs_check_folio() fails, it will falsely
determine the directory as empty and corrupt the file system.
In addition, since nilfs_empty_dir() does not immediately return on a
failed folio/page read, but continues to loop, this can cause a long loop
with I/O if i_size of the directory's inode is also corrupted, causing the
log writer thread to wait and hang, as reported by syzbot.
Fix these issues by making nilfs_empty_dir() immediately return a false
value (0) if it fails to get a directory folio/page.
Grepping /proc/allocinfo for "noprof" reveals several xyz_noprof
functions, which means internally they are calling profiled functions.
This should never happen as such calls move allocation charge from a
higher level location where it should be accounted for into these lower
level helpers. Fix this by replacing profiled function calls with noprof
ones.
This is due to a race when two objects are allocated from the same slab,
which did not have an obj_exts allocated for.
In such a case, the two threads will notice the NULL obj_exts and after
one assigns slab->obj_exts, the second one will happily do the exchange if
it reads this new assigned value.
In order to avoid that, verify that the read obj_exts does not point to an
allocated obj_exts before doing the exchange.
Oscar Salvador [Tue, 28 May 2024 20:53:23 +0000 (22:53 +0200)]
mm/hugetlb: do not call vma_add_reservation upon ENOMEM
sysbot reported a splat [1] on __unmap_hugepage_range(). This is because
vma_needs_reservation() can return -ENOMEM if
allocate_file_region_entries() fails to allocate the file_region struct
for the reservation.
Check for that and do not call vma_add_reservation() if that is the case,
otherwise region_abort() and region_del() will see that we do not have any
file_regions.
If we detect that vma_needs_reservation() returned -ENOMEM, we clear the
hugetlb_restore_reserve flag as if this reservation was still consumed, so
free_huge_folio() will not increment the resv count.
Chengming Zhou [Tue, 28 May 2024 05:15:22 +0000 (13:15 +0800)]
mm/ksm: fix ksm_zero_pages accounting
We normally ksm_zero_pages++ in ksmd when page is merged with zero page,
but ksm_zero_pages-- is done from page tables side, where there is no any
accessing protection of ksm_zero_pages.
So we can read very exceptional value of ksm_zero_pages in rare cases,
such as -1, which is very confusing to users.
Fix it by changing to use atomic_long_t, and the same case with the
mm->ksm_zero_pages.
Chengming Zhou [Tue, 28 May 2024 05:15:21 +0000 (13:15 +0800)]
mm/ksm: fix ksm_pages_scanned accounting
Patch series "mm/ksm: fix some accounting problems", v3.
We encountered some abnormal ksm_pages_scanned and ksm_zero_pages during
some random tests.
1. ksm_pages_scanned unchanged even ksmd scanning has progress.
2. ksm_zero_pages maybe -1 in some rare cases.
This patch (of 2):
During testing, I found ksm_pages_scanned is unchanged although the
scan_get_next_rmap_item() did return valid rmap_item that is not NULL.
The reason is the scan_get_next_rmap_item() will return NULL after a full
scan, so ksm_do_scan() just return without accounting of the
ksm_pages_scanned.
Fix it by just putting ksm_pages_scanned accounting in that loop, and it
will be accounted more timely if that loop would last for a long time.
Cong Wang [Tue, 28 May 2024 16:08:38 +0000 (09:08 -0700)]
vmalloc: check CONFIG_EXECMEM in is_vmalloc_or_module_addr()
After commit 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on
CONFIG_MODULES of") CONFIG_BPF_JIT does not depend on CONFIG_MODULES any
more and bpf jit also uses the [MODULES_VADDR, MODULES_END] memory region.
But is_vmalloc_or_module_addr() still checks CONFIG_MODULES, which then
returns false for a bpf jit memory region when CONFIG_MODULES is not
defined. It leads to the following kernel BUG:
While trying to service a movable allocation (page type 1), the page
allocator runs into a two-pageblock buddy on the movable freelist whose
second block is typed as highatomic (page type 3).
This inconsistency is caused by the highatomic reservation system
operating on single pageblocks, while MAX_ORDER can be bigger than that -
in this configuration, pageblock_order is 9 while MAX_PAGE_ORDER is 10.
The test case is observed to make several adjacent order-3 requests with
__GFP_DIRECT_RECLAIM cleared, which marks the surrounding block as
highatomic. Upon freeing, the blocks merge into an order-10 buddy. When
the highatomic pool is drained later on, this order-10 buddy gets moved
back to the movable list, but only the first pageblock is marked movable
again. A subsequent expand() of this buddy warns about the tail being of
a different type.
This is a long-standing bug that's surfaced by the recent block type
warnings added to the allocator. The consequences seem mostly benign, it
just results in odd behavior: the highatomic tail blocks are not properly
drained, instead they end up on the movable list first, then go back to
the highatomic list after an alloc-free cycle.
To fix this, make the highatomic reservation code aware that
allocations/buddies can be larger than a pageblock.
While it's an old quirk, the recently added type consistency warnings seem
to be the most prominent consequence of it. Set the Fixes: tag
accordingly to highlight this backporting dependency.
Ryusuke Konishi [Thu, 30 May 2024 14:15:56 +0000 (23:15 +0900)]
nilfs2: fix potential kernel bug due to lack of writeback flag waiting
Destructive writes to a block device on which nilfs2 is mounted can cause
a kernel bug in the folio/page writeback start routine or writeback end
routine (__folio_start_writeback in the log below):
This is because when the log writer starts a writeback for segment summary
blocks or a super root block that use the backing device's page cache, it
does not wait for the ongoing folio/page writeback, resulting in an
inconsistent writeback state.
Fix this issue by waiting for ongoing writebacks when putting
folios/pages on the backing device into writeback state.
memcg: remove the lockdep assert from __mod_objcg_mlstate()
The assert was introduced in the commit cited below as an insurance that
the semantic is the same after the local_irq_save() has been removed and
the function has been made static.
The original requirement to disable interrupt was due the modification
of per-CPU counters which require interrupts to be disabled because the
counter update operation is not atomic and some of the counters are
updated from interrupt context.
All callers of __mod_objcg_mlstate() acquire a lock
(memcg_stock.stock_lock) which disables interrupts on !PREEMPT_RT and
the lockdep assert is satisfied. On PREEMPT_RT the interrupts are not
disabled and the assert triggers.
The safety of the counter update is already ensured by
VM_WARN_ON_IRQS_ENABLED() which is part of __mod_memcg_lruvec_state() and
does not require yet another check.
Remove the lockdep assert from __mod_objcg_mlstate().
Barry Song [Fri, 24 May 2024 00:54:44 +0000 (12:54 +1200)]
mm: arm64: fix the out-of-bounds issue in contpte_clear_young_dirty_ptes
We are passing a huge nr to __clear_young_dirty_ptes() right now. While
we should pass the number of pages, we are actually passing CONT_PTE_SIZE.
This is causing lots of crashes of MADV_FREE, panic oops could vary
everytime.
Barry Song [Thu, 23 May 2024 20:50:48 +0000 (08:50 +1200)]
mm: huge_mm: fix undefined reference to `mthp_stats' for CONFIG_SYSFS=n
if CONFIG_SYSFS is not enabled in config, we get the below error,
All errors (new ones prefixed by >>):
s390-linux-ld: mm/memory.o: in function `count_mthp_stat':
>> include/linux/huge_mm.h:285:(.text+0x191c): undefined reference to `mthp_stats'
s390-linux-ld: mm/huge_memory.o:(.rodata+0x10): undefined reference to `mthp_stats'
Baolin Wang [Thu, 23 May 2024 02:36:39 +0000 (10:36 +0800)]
mm: drop the 'anon_' prefix for swap-out mTHP counters
The mTHP swap related counters: 'anon_swpout' and 'anon_swpout_fallback'
are confusing with an 'anon_' prefix, since the shmem can swap out
non-anonymous pages. So drop the 'anon_' prefix to keep consistent with
the old swap counter names.
This is needed in 6.10-rcX to avoid having an inconsistent ABI out in the
field.
Ian Forbes [Fri, 31 May 2024 20:33:58 +0000 (15:33 -0500)]
drm/vmwgfx: Don't destroy Screen Target when CRTC is enabled but inactive
drm_crtc_helper_funcs::atomic_disable can be called even when the CRTC is
still enabled. This can occur when the mode changes or the CRTC is set as
inactive.
In the case where the CRTC is being set as inactive we only want to
blank the screen. The Screen Target should remain intact as long as the
mode has not changed and CRTC is enabled.
This fixes a bug with GDM where locking the screen results in a permanent
black screen because the Screen Target is no longer defined.
Ian Forbes [Tue, 21 May 2024 18:47:18 +0000 (13:47 -0500)]
drm/vmwgfx: 3D disabled should not effect STDU memory limits
This limit became a hard cap starting with the change referenced below.
Surface creation on the device will fail if the requested size is larger
than this limit so altering the value arbitrarily will expose modes that
are too large for the device's hard limits.
Ian Forbes [Tue, 21 May 2024 18:47:17 +0000 (13:47 -0500)]
drm/vmwgfx: Filter modes which exceed graphics memory
SVGA requires individual surfaces to fit within graphics memory
(max_mob_pages) which means that modes with a final buffer size that would
exceed graphics memory must be pruned otherwise creation will fail.
Additionally llvmpipe requires its buffer height and width to be a multiple
of its tile size which is 64. As a result we have to anticipate that
llvmpipe will round up the mode size passed to it by the compositor when
it creates buffers and filter modes where this rounding exceeds graphics
memory.
This fixes an issue where VMs with low graphics memory (< 64MiB) configured
with high resolution mode boot to a black screen because surface creation
fails.
Jakub Kicinski [Thu, 6 Jun 2024 02:03:07 +0000 (19:03 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-06-05
We've added 8 non-merge commits during the last 6 day(s) which contain
a total of 9 files changed, 34 insertions(+), 35 deletions(-).
The main changes are:
1) Fix a potential use-after-free in bpf_link_free when the link uses
dealloc_deferred to free the link object but later still tests for
presence of link->ops->dealloc, from Cong Wang.
2) Fix BPF test infra to set the run context for rawtp test_run callback
where syzbot reported a crash, from Jiri Olsa.
3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude
it for the case of !CONFIG_FPROBE, also from Jiri Olsa.
4) Fix a Coverity static analysis report to not close() a link_fd of -1
in the multi-uprobe feature detector, from Andrii Nakryiko.
5) Revert support for redirect to any xsk socket bound to the same umem
as it can result in corrupted ring state which can lead to a crash when
flushing rings. A different approach will be pursued for bpf-next to
address it safely, from Magnus Karlsson.
6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused
BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko.
7) Fix a coccicheck warning in BPF devmap that iterator variable cannot
be NULL, from Thorsten Blum.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
Revert "xsk: Document ability to redirect to any socket bound to the same umem"
Revert "xsk: Support redirect to any socket bound to the same umem"
bpf: Set run context for rawtp test_run callback
bpf: Fix a potential use-after-free in bpf_link_free()
bpf, devmap: Remove unnecessary if check in for loop
libbpf: don't close(-1) in multi-uprobe feature detector
bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c
====================
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
taprio_parse_mqprio_opt() must validate it, or userspace
can inject arbitrary data to the kernel, the second time
taprio_change() is called.
First call (with valid attributes) sets dev->num_tc
to a non zero value.
Second call (with arbitrary mqprio attributes)
returns early from taprio_parse_mqprio_opt()
and bad things can happen.
Linus Torvalds [Wed, 5 Jun 2024 22:28:20 +0000 (15:28 -0700)]
Merge tag 'thermal-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Fix issues related to the handling of invalid trip points in the
thermal core and in the thermal debug code that have been overlooked
by some recent thermal control core changes"
* tag 'thermal-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: trip: Trigger trip down notifications when trips involved in mitigation become invalid
thermal: core: Introduce thermal_trip_crossed()
thermal/debugfs: Allow tze_seq_show() to print statistics for invalid trips
thermal/debugfs: Print initial trip temperature and hysteresis in tze_seq_show()
Linus Torvalds [Wed, 5 Jun 2024 22:19:15 +0000 (15:19 -0700)]
Merge tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix the ACPI EC and AC drivers, the ACPI APEI error injection
driver and build issues related to the dev_is_pnp() macro referring to
pnp_bus_type that is not exported to modules.
Specifics:
- Fix error handling during EC operation region accesses in the ACPI
EC driver (Armin Wolf)
- Fix a memory leak in the APEI error injection driver introduced
during its converion to a platform driver (Dan Williams)
- Fix build failures related to the dev_is_pnp() macro by redefining
it as a proper function and exporting it to modules as appropriate
and unexport pnp_bus_type which need not be exported any more (Andy
Shevchenko)
- Update the ACPI AC driver to use power_supply_changed() to let the
power supply core handle configuration changes properly (Thomas
Weißschuh)"
* tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: AC: Properly notify powermanagement core about changes
PNP: Hide pnp_bus_type from the non-PNP code
PNP: Make dev_is_pnp() to be a function and export it for modules
ACPI: EC: Avoid returning AE_OK on errors in address space handler
ACPI: EC: Abort address space access upon error
ACPI: APEI: EINJ: Fix einj_dev release leak
Linus Torvalds [Wed, 5 Jun 2024 22:12:35 +0000 (15:12 -0700)]
Merge tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix the intel_pstate and amd-pstate cpufreq drivers and the
cpupower utility.
Specifics:
- Fix a recently introduced unchecked HWP MSR access in the
intel_pstate driver (Srinivas Pandruvada)
- Add missing conversion from MHz to KHz to amd_pstate_set_boost() to
address sysfs inteface inconsistency and fix P-state frequency
reporting on AMD Family 1Ah CPUs in the cpupower utility (Dhananjay
Ugwekar)
- Get rid of an excess global header file used by the amd-pstate
cpufreq driver (Arnd Bergmann)"
* tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Fix unchecked HWP MSR access
cpufreq: amd-pstate: Fix the inconsistency in max frequency units
cpufreq: amd-pstate: remove global header file
tools/power/cpupower: Fix Pstate frequency reporting on AMD Family 1Ah CPUs
net/mlx5: Fix tainted pointer delete is case of flow rules creation fail
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(),
instead of previously created rules, the tainted pointer is deleted
deveral times.
Fix this bug by using correct flow rules pointers.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.
SMN accesses are done indirectly through an index/data pair in PCI
config space. The PCI config access may fail and return an error code.
This would prevent the "read" value from being updated.
However, the PCI config access may succeed, but the return value may be
invalid. This is in similar fashion to PCI bad reads, i.e. return all
bits set.
Most systems will return 0 for SMN addresses that are not accessible.
This is in line with AMD convention that unavailable registers are
Read-as-Zero/Writes-Ignored.
However, some systems will return a "PCI Error Response" instead. This
value, along with an error code of 0 from the PCI config access, will
confuse callers of the amd_smn_read() function.
Check for this condition, clear the return value, and set a proper error
code.
Linus Torvalds [Wed, 5 Jun 2024 18:28:25 +0000 (11:28 -0700)]
Merge tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for fast fsync that needs to handle errors during writes after
some COW failure so it does not lead to an inconsistent state"
* tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ensure fast fsync waits for ordered extents after a write failure
Linus Torvalds [Wed, 5 Jun 2024 18:25:41 +0000 (11:25 -0700)]
Merge tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Just a few small fixes"
* tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix trans->locked assert
bcachefs: Rereplicate now moves data off of durability=0 devices
bcachefs: Fix GFP_KERNEL allocation in break_cycle()
Jens Axboe [Wed, 5 Jun 2024 18:13:00 +0000 (12:13 -0600)]
Merge tag 'nvme-6.10-2024-06-05' of git://git.infradead.org/nvme into block-6.10
Pull NVMe fixes from Keith:
"nvme fixes Linux 6.10
- Use reserved tags for special fabrics operations (Chunguang)
- Persistent Reservation status masking fix (Weiwen)"
* tag 'nvme-6.10-2024-06-05' of git://git.infradead.org/nvme:
nvme: fix nvme_pr_* status code parsing
nvme-fabrics: use reserved tag for reg read/write command
drm/amd: Fix shutdown (again) on some SMU v13.0.4/11 platforms
commit cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for
some SMU 13.0.4/13.0.11 users") attempted to fix shutdown issues
that were reported since commit 31729e8c21ec ("drm/amd/pm: fixes a
random hang in S4 for SMU v13.0.4/11") but caused issues for some
people.
Adjust the workaround flow to properly only apply in the S4 case:
-> For shutdown go through SMU_MSG_PrepareMp1ForUnload
-> For S4 go through SMU_MSG_GfxDeviceDriverReset and
SMU_MSG_PrepareMp1ForUnload
Reported-and-tested-by: lectrode <[email protected]> Closes: https://github.com/void-linux/void-packages/issues/50417 Cc: [email protected] Fixes: cd94d1b182d2 ("dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users") Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Linus Torvalds [Wed, 5 Jun 2024 17:32:20 +0000 (10:32 -0700)]
Merge tag 'i2c-for-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"This should have been my second pull request during the merge window
but one dependency in the drm subsystem fell through the cracks and
was only applied for rc2.
Now we can finally remove I2C_CLASS_SPD"
* tag 'i2c-for-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: Remove I2C_CLASS_SPD
i2c: synquacer: Remove a clk reference from struct synquacer_i2c
Linus Torvalds [Wed, 5 Jun 2024 17:29:13 +0000 (10:29 -0700)]
Merge tag 'tpmdd-next-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"The bug fix for tpm_tis_core_init() is not that critical but still
makes sense to get into release for the sake of better quality.
I included the Intel CPU model define change mainly to help Tony just
a bit, as for this subsystem it cannot realistically speaking cause
any possible harm"
* tag 'tpmdd-next-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: Switch to new Intel CPU model defines
tpm_tis: Do *not* flush uninitialized work
Filipe Manana [Mon, 3 Jun 2024 11:49:08 +0000 (12:49 +0100)]
btrfs: fix leak of qgroup extent records after transaction abort
Qgroup extent records are created when delayed ref heads are created and
then released after accounting extents at btrfs_qgroup_account_extents(),
called during the transaction commit path.
If a transaction is aborted we free the qgroup records by calling
btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(),
unless we don't have delayed references. We are incorrectly assuming
that no delayed references means we don't have qgroup extents records.
We can currently have no delayed references because we ran them all
during a transaction commit and the transaction was aborted after that
due to some error in the commit path.
So fix this by ensuring we btrfs_qgroup_destroy_extent_records() at
btrfs_destroy_delayed_refs() even if we don't have any delayed references.
So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().
This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:
>>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
leaf 33439744 flags 0x100000000000000
fs uuid e5bd3946-400c-4223-8923-190ef1f18677
chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 7 transid 9 size 8192 nbytes 8473563889606862198
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 204 flags 0x10(PREALLOC)
atime 1716417703.220000000 (2024-05-22 15:41:43)
ctime 1716417704.983333333 (2024-05-22 15:41:44)
mtime 1716417704.983333333 (2024-05-22 15:41:44)
otime 17592186044416.000000000 (559444-03-08 01:40:16)
item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
index 195 namelen 3 name: 193
item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
location key (0 UNKNOWN.0 0) type XATTR
transid 7 data_len 1 name_len 6
name: user.a
data a
item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 303144960 nr 12288
extent data offset 0 nr 4096 ram 12288
extent compression 0 (none)
item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 4096 nr 8192
item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 8192 nr 4096
...
So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.
Here is the state of the filesystem tree at the time of the crash:
Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.
btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.
If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.
This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:
- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
to the log tree.
- An xattr is set on the file, which sets the
BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
calls copy_inode_items_to_log(), which calls
btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
filesystem tree. Since it starts before i_size, it skips it. Since it
is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
the prealloc extent to written and inserts the remaining prealloc part
from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
the log tree. Note that it overlaps with the 4k-12k prealloc extent
that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
adjusting the start of the 4k-12k prealloc extent in the log tree to
8k.
- btrfs_set_item_key_safe() sees that there is already an extent
starting at 8k in the log tree and calls BUG().
Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.
Linus Torvalds [Wed, 5 Jun 2024 15:43:41 +0000 (08:43 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"This is dominated by a couple large series for ARM and x86
respectively, but apart from that things are calm.
ARM:
- Large set of FP/SVE fixes for pKVM, addressing the fallout from the
per-CPU data rework and making sure that the host is not involved
in the FP/SVE switching any more
- Allow FEAT_BTI to be enabled with NV now that FEAT_PAUTH is
completely supported
- Fix for the respective priorities of Failed PAC, Illegal Execution
state and Instruction Abort exceptions
- Fix the handling of AArch32 instruction traps failing their
condition code, which was broken by the introduction of
ESR_EL2.ISS2
- Allow vcpus running in AArch32 state to be restored in System mode
- Fix AArch32 GPR restore that would lose the 64 bit state under some
conditions
RISC-V:
- No need to use mask when hart-index-bits is 0
- Fix incorrect reg_subtype labels in
kvm_riscv_vcpu_set_reg_isa_ext()
x86:
- Fixes and debugging help for the #VE sanity check.
Also disable it by default, even for CONFIG_DEBUG_KERNEL, because
it was found to trigger spuriously (most likely a processor erratum
as the exact symptoms vary by generation).
- Avoid WARN() when two NMIs arrive simultaneously during an
NMI-disabled situation (GIF=0 or interrupt shadow) when the
processor supports virtual NMI.
While generally KVM will not request an NMI window when virtual
NMIs are supported, in this case it *does* have to single-step over
the interrupt shadow or enable the STGI intercept, in order to
deliver the latched second NMI.
- Drop support for hand tuning APIC timer advancement from userspace.
Since we have adaptive tuning, and it has proved to work well, drop
the module parameter for manual configuration and with it a few
stupid bugs that it had"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (32 commits)
KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr
KVM: arm64: Ensure that SME controls are disabled in protected mode
KVM: arm64: Refactor CPACR trap bit setting/clearing to use ELx format
KVM: arm64: Consolidate initializing the host data's fpsimd_state/sve in pKVM
KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM
KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM
KVM: arm64: Specialize handling of host fpsimd state on trap
KVM: arm64: Abstract set/clear of CPTR_EL2 bits behind helper
KVM: arm64: Fix prototype for __sve_save_state/__sve_restore_state
KVM: arm64: Reintroduce __sve_save_state
KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
KVM: SEV-ES: Delegate LBR virtualization to the processor
KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absent
KVM: SEV-ES: Prevent MSR access post VMSA encryption
RISC-V: KVM: Fix incorrect reg_subtype labels in kvm_riscv_vcpu_set_reg_isa_ext function
RISC-V: KVM: No need to use mask when hart-index-bit is 0
KVM: arm64: nv: Expose BTI and CSV_frac to a guest hypervisor
KVM: arm64: nv: Fix relative priorities of exceptions generated by ERETAx
KVM: arm64: AArch32: Fix spurious trapping of conditional instructions
KVM: arm64: Allow AArch32 PSTATE.M to be restored as System mode
...
- Fix a recently introduced unchecked HWP MSR access in the
intel_pstate driver (Srinivas Pandruvada).
- Add missing conversion from MHz to KHz to amd_pstate_set_boost()
to address sysfs inteface inconsistency (Dhananjay Ugwekar).
- Get rid of an excess global header file used by the amd-pstate
cpufreq driver (Arnd Bergmann).
* pm-cpufreq:
cpufreq: intel_pstate: Fix unchecked HWP MSR access
cpufreq: amd-pstate: Fix the inconsistency in max frequency units
cpufreq: amd-pstate: remove global header file