Linus Torvalds [Thu, 10 Oct 2024 19:25:32 +0000 (12:25 -0700)]
Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
callbacks
When a ring buffer is mapped to memory assigned at boot, it also
splits it up evenly between the possible CPUs. But the allocation code
still attached a CPU notifier callback to this ring buffer. When a CPU
is added, the callback will happen and another per-cpu buffer is
created for the ring buffer.
But for boot mapped buffers, there is no room to add another one (as
they were all created already). The result of calling the CPU hotplug
notifier on a boot mapped ring buffer is unpredictable and could lead
to a system crash.
If the ring buffer is boot mapped simply do not attach the CPU
notifier to it"
* tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
Linus Torvalds [Thu, 10 Oct 2024 17:02:59 +0000 (10:02 -0700)]
Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- update fstrim loop and add more cancellation points, fix reported
delayed or blocked suspend if there's a huge chunk queued
- fix error handling in recent qgroup xarray conversion
- in zoned mode, fix warning printing device path without RCU
protection
- again fix invalid extent xarray state (6252690f7e1b), lost due to
refactoring
* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
btrfs: zoned: fix missing RCU locking in error message when loading zone info
btrfs: fix missing error handling when adding delayed ref with qgroups enabled
btrfs: add cancellation points to trim loops
btrfs: split remaining space to discard in chunks
Linus Torvalds [Thu, 10 Oct 2024 16:52:49 +0000 (09:52 -0700)]
Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix NFSD bring-up / shutdown
- Fix a UAF when releasing a stateid
* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix possible badness in FREE_STATEID
nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
NFSD: Mark filecache "down" if init fails
Linus Torvalds [Thu, 10 Oct 2024 16:45:45 +0000 (09:45 -0700)]
Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- A few small typo fixes
- fstests xfs/538 DEBUG-only fix
- Performance fix on blockgc on COW'ed files, by skipping trims on
cowblock inodes currently opened for write
- Prevent cowblocks to be freed under dirty pagecache during unshare
- Update MAINTAINERS file to quote the new maintainer
* tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a typo
xfs: don't free cowblocks from under dirty pagecache on unshare
xfs: skip background cowblock trims on inodes open for write
xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
xfs: don't ifdef around the exact minlen allocations
xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs: return bool from xfs_attr3_leaf_add
xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
xfs: scrub: convert comma to semicolon
xfs: Remove empty declartion in header file
MAINTAINERS: add Carlos Maiolino as XFS release manager
Linus Torvalds [Wed, 9 Oct 2024 23:01:40 +0000 (16:01 -0700)]
Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes, 5 of which are c:stable. All singletons, about half of
which are MM"
* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
CREDITS: sort alphabetically by name
secretmem: disable memfd_secret() if arch cannot set direct map
.mailmap: update Fangrui's email
mm/huge_memory: check pmd_special() only after pmd_present()
resource, kunit: fix user-after-free in resource_test_region_intersects()
fs/proc/kcore.c: allow translation of physical memory addresses
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
device-dax: correct pgoff align in dax_set_mapping()
kthread: unpark only parked kthread
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
bcachefs: do not use PF_MEMALLOC_NORECLAIM
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
Made a minor edit in the comments for 'struct zswap_entry' to delete the
description of the 'value' member that was deleted in commit 20a5532ffa53
("mm: remove code to handle same filled pages").
Patrick Roy [Tue, 1 Oct 2024 08:00:41 +0000 (09:00 +0100)]
secretmem: disable memfd_secret() if arch cannot set direct map
Return -ENOSYS from memfd_secret() syscall if !can_set_direct_map(). This
is the case for example on some arm64 configurations, where marking 4k
PTEs in the direct map not present can only be done if the direct map is
set up at 4k granularity in the first place (as ARM's break-before-make
semantics do not easily allow breaking apart large/gigantic pages).
More precisely, on arm64 systems with !can_set_direct_map(),
set_direct_map_invalid_noflush() is a no-op, however it returns success
(0) instead of an error. This means that memfd_secret will seemingly
"work" (e.g. syscall succeeds, you can mmap the fd and fault in pages),
but it does not actually achieve its goal of removing its memory from the
direct map.
Note that with this patch, memfd_secret() will start erroring on systems
where can_set_direct_map() returns false (arm64 with
CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and
CONFIG_KFENCE=n), but that still seems better than the current silent
failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most
arm64 systems actually have a working memfd_secret() and aren't be
affected.
From going through the iterations of the original memfd_secret patch
series, it seems that disabling the syscall in these scenarios was the
intended behavior [1] (preferred over having
set_direct_map_invalid_noflush return an error as that would result in
SIGBUSes at page-fault time), however the check for it got dropped between
v16 [2] and v17 [3], when secretmem moved away from CMA allocations.
mm/huge_memory: check pmd_special() only after pmd_present()
We should only check for pmd_special() after we made sure that we have a
present PMD. For example, if we have a migration PMD, pmd_special() might
indicate that we have a special PMD although we really don't.
This fixes confusing migration entries as PFN mappings, and not doing what
we are supposed to do in the "is_swap_pmd()" case further down in the
function -- including messing up COW, page table handling and accounting.
resource, kunit: fix user-after-free in resource_test_region_intersects()
In resource_test_insert_resource(), the pointer is used in error message
after kfree(). This is user-after-free. To fix this, we need to call
kunit_add_action_or_reset() to schedule memory freeing after usage. But
kunit_add_action_or_reset() itself may fail and free the memory. So, its
return value should be checked and abort the test for failure. Then, we
found that other usage of kunit_add_action_or_reset() in
resource_test_region_intersects() needs to be fixed too. We fix all these
user-after-free bugs in this patch.
fs/proc/kcore.c: allow translation of physical memory addresses
When /proc/kcore is read an attempt to read the first two pages results in
HW-specific page swap on s390 and another (so called prefix) pages are
accessed instead. That leads to a wrong read.
Allow architecture-specific translation of memory addresses using
kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily
to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That
way an architecture can deal with specific physical memory ranges.
Re-use the existing /dev/mem callback implementation on s390, which
handles the described prefix pages swapping correctly.
For other architectures the default callback is basically NOP. It is
expected the condition (vaddr == __va(__pa(vaddr))) always holds true for
KCORE_RAM memory type.
Donet Tom [Fri, 27 Sep 2024 05:07:52 +0000 (00:07 -0500)]
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
The hmm2 double_map test was failing due to an incorrect buffer->mirror
size. The buffer->mirror size was 6, while buffer->ptr size was 6 *
PAGE_SIZE. The test failed because the kernel's copy_to_user function was
attempting to copy a 6 * PAGE_SIZE buffer to buffer->mirror. Since the
size of buffer->mirror was incorrect, copy_to_user failed.
This patch corrects the buffer->mirror size to 6 * PAGE_SIZE.
Test Result without this patch
==============================
# RUN hmm2.hmm2_device_private.double_map ...
# hmm-tests.c:1680:double_map:Expected ret (-14) == 0 (0)
# double_map: Test terminated by assertion
# FAIL hmm2.hmm2_device_private.double_map
not ok 53 hmm2.hmm2_device_private.double_map
Test Result with this patch
===========================
# RUN hmm2.hmm2_device_private.double_map ...
# OK hmm2.hmm2_device_private.double_map
ok 53 hmm2.hmm2_device_private.double_map
device-dax: correct pgoff align in dax_set_mapping()
pgoff should be aligned using ALIGN_DOWN() instead of ALIGN(). Otherwise,
vmf->address not aligned to fault_size will be aligned to the next
alignment, that can result in memory failure getting the wrong address.
It's a subtle situation that only can be observed in
page_mapped_in_vma() after the page is page fault handled by
dev_dax_huge_fault. Generally, there is little chance to perform
page_mapped_in_vma in dev-dax's page unless in specific error injection
to the dax device to trigger an MCE - memory-failure. In that case,
page_mapped_in_vma() will be triggered to determine which task is
accessing the failure address and kill that task in the end.
We used self-developed dax device (which is 2M aligned mapping) , to
perform error injection to random address. It turned out that error
injected to non-2M-aligned address was causing endless MCE until panic.
Because page_mapped_in_vma() kept resulting wrong address and the task
accessing the failure address was never killed properly:
[ 3783.719419] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3784.049006] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3784.049190] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3784.448042] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3784.448186] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3784.792026] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3784.792179] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3785.162502] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3785.162633] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3785.461116] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3785.461247] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3785.764730] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3785.764859] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3786.042128] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3786.042259] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3786.464293] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3786.464423] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3786.818090] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3786.818217] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
[ 3787.085297] mce: Uncorrected hardware memory error in user-access at 200c9742380
[ 3787.085424] Memory failure: 0x200c9742: recovery action for dax page:
Recovered
It took us several weeks to pinpoint this problem, but we eventually
used bpftrace to trace the page fault and mce address and successfully
identified the issue.
Joao added:
; Likely we never reproduce in production because we always pin
: device-dax regions in the region align they provide (Qemu does
: similarly with prealloc in hugetlb/file backed memory). I think this
: bug requires that we touch *unpinned* device-dax regions unaligned to
: the device-dax selected alignment (page size i.e. 4K/2M/1G)
Calling into kthread unparking unconditionally is mostly harmless when
the kthread is already unparked. The wake up is then simply ignored
because the target is not in TASK_PARKED state.
However if the kthread is per CPU, the wake up is preceded by a call
to kthread_bind() which expects the task to be inactive and in
TASK_PARKED state, which obviously isn't the case if it is unparked.
As a result, calling kthread_stop() on an unparked per-cpu kthread
triggers such a warning:
There is no existing user of those flags. PF_MEMALLOC_NOWARN is dangerous
because a nested allocation context can use GFP_NOFAIL which could cause
unexpected failure. Such a code would be hard to maintain because it
could be deeper in the call chain.
PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] that
such a allocation contex is inherently unsafe if the context doesn't fully
control all allocations called from this context.
While PF_MEMALLOC_NOWARN is not dangerous the way PF_MEMALLOC_NORECLAIM is
it doesn't have any user and as Matthew has pointed out we are running out
of those flags so better reclaim it without any real users.
Michal Hocko [Thu, 26 Sep 2024 17:11:50 +0000 (19:11 +0200)]
bcachefs: do not use PF_MEMALLOC_NORECLAIM
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.
This patch (of 2):
bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.
We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.
While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.
misc: sgi-gru: Don't disable preemption in GRU driver
Disabling preemption in the GRU driver is unnecessary, and clashes with
sleeping locks in several code paths. Remove preempt_disable and
preempt_enable from the GRU driver.
Steven Rostedt [Tue, 8 Oct 2024 18:32:42 +0000 (14:32 -0400)]
ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.
Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.
Naohiro Aota [Fri, 4 Oct 2024 04:53:35 +0000 (13:53 +0900)]
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that
xarray DIRTY tag is cleared.
With a refactoring commit 8189197425e7 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission"), it screwed
up and the order is reversed and causing the same hang. Fix the ordering
now in submit_one_sector().
Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
Filipe Manana [Wed, 2 Oct 2024 14:02:56 +0000 (15:02 +0100)]
btrfs: zoned: fix missing RCU locking in error message when loading zone info
At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.
Fix this by using btrfs_err_in_rcu() instead of btrfs_err().
The problem is there since commit 08e11a3db098 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").
Brian Foster [Fri, 6 Sep 2024 11:40:51 +0000 (07:40 -0400)]
xfs: don't free cowblocks from under dirty pagecache on unshare
fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.
For example, consider the following command sequence:
This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have written back.
This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:
$ xfs_io -c "fiemap -v" <file>
<file>:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
...
1: [8..511]: hole 504
...
$ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........
To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.
Fixes: 46afb0628b86347 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
Linus Torvalds [Tue, 8 Oct 2024 19:54:04 +0000 (12:54 -0700)]
Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
and thus ops.select() were skipped. Some schedulers were incorrectly
using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
it.
- Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()
- Fix error information clobbering during load
- Add missing __weak markers to BPF helper declarations
- Doc update
* tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Documentation: Update instructions for running example schedulers
sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
sched_ext: Improve error reporting during loading
sched_ext: Add __weak markers to BPF helper function decalarations
Linus Torvalds [Tue, 8 Oct 2024 17:53:06 +0000 (10:53 -0700)]
Merge tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"New:
- implement fallocate for compressed files
- add support for the compression attribute
- optimize large writes to sparse files
Fixes:
- fix several potential deadlock scenarios
- fix various internal bugs detected by syzbot
- add checks before accessing NTFS structures during parsing
- correct the format of output messages
Refactoring:
- replace fsparam_flag_no with fsparam_flag in options parser
- remove unused functions and macros"
* tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits)
fs/ntfs3: Format output messages like others fs in kernel
fs/ntfs3: Additional check in ntfs_file_release
fs/ntfs3: Fix general protection fault in run_is_mapped_full
fs/ntfs3: Sequential field availability check in mi_enum_attr()
fs/ntfs3: Additional check in ni_clear()
fs/ntfs3: Fix possible deadlock in mi_read
ntfs3: Change to non-blocking allocation in ntfs_d_hash
fs/ntfs3: Remove unused al_delete_le
fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr
fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag
fs/ntfs3: Add support for the compression attribute
fs/ntfs3: Implement fallocate for compressed files
fs/ntfs3: Make checks in run_unpack more clear
fs/ntfs3: Add rough attr alloc_size check
fs/ntfs3: Stale inode instead of bad
fs/ntfs3: Refactor enum_rstbl to suppress static checker
fs/ntfs3: Fix sparse warning in ni_fiemap
fs/ntfs3: Fix warning possible deadlock in ntfs_set_state
fs/ntfs3: Fix sparse warning for bigendian
fs/ntfs3: Separete common code for file_read/write iter/splice
...
Linus Torvalds [Tue, 8 Oct 2024 17:43:22 +0000 (10:43 -0700)]
Merge tag 'perf-tools-fixes-for-v6.12-1-2024-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix an assert() to handle captured and unprocessed ARM CoreSight CPU
traces
- Fix static build compilation error when libdw isn't installed or is
too old
- Add missing include when building with
!HAVE_DWARF_GETLOCATIONS_SUPPORT
- Add missing refcount put on 32-bit DSOs
- Fix disassembly of user space binaries by setting the binary_type of
DSO when loading
- Update headers with the kernel sources, including asound.h, sched.h,
fcntl, msr-index.h, irq_vectors.h, socket.h, list_sort.c and arm64's
cputype.h
* tag 'perf-tools-fixes-for-v6.12-1-2024-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf cs-etm: Fix the assert() to handle captured and unprocessed cpu trace
perf build: Fix build feature-dwarf_getlocations fail for old libdw
perf build: Fix static compilation error when libdw is not installed
perf dwarf-aux: Fix build with !HAVE_DWARF_GETLOCATIONS_SUPPORT
tools headers arm64: Sync arm64's cputype.h with the kernel sources
perf tools: Cope with differences for lib/list_sort.c copy from the kernel
tools check_headers.sh: Add check variant that excludes some hunks
perf beauty: Update copy of linux/socket.h with the kernel sources
tools headers UAPI: Sync the linux/in.h with the kernel sources
perf trace beauty: Update the arch/x86/include/asm/irq_vectors.h copy with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
tools include UAPI: Sync linux/fcntl.h copy with the kernel sources
tools include UAPI: Sync linux/sched.h copy with the kernel sources
tools include UAPI: Sync sound/asound.h copy with the kernel sources
perf vdso: Missed put on 32-bit dsos
perf symbol: Set binary_type of dso when loading
Filipe Manana [Tue, 24 Sep 2024 13:39:19 +0000 (14:39 +0100)]
btrfs: fix missing error handling when adding delayed ref with qgroups enabled
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.
Effectively we end up ignoring that we are lacking qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.
Fix this by checking for errors and return them to the callers.
Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).
Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a large delay.
Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation of introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().
sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
tell whether ops.select_cpu() was called. This is incorrect as
ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
incorrectly skipping direct dispatch for tasks that are bound to a single
CPU.
sched core has been updated to specify ENQUEUE_RQ_SELECTED if
->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
scx_qmap to test it instead of SCX_ENQ_WAKEUP.
sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
migration is disabled. sched_ext schedulers may perform operations such as
direct dispatch from ->select_task_rq() path and it is useful for them to
know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
and end up assuming incorrectly that ->select_task_rq() was called for tasks
that are bound to a single CPU or migration disabled.
Make select_task_rq() indicate whether ->select_task_rq() was called by
setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
to ENQUEUE_RQ_SELECTED for ->enqueue_task().
This will be used by sched_ext to fix ->select_task_rq() skip detection.
sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
This will be used to allow select_task_rq() to indicate whether
->select_task_rq() was called by modifying *wake_flags.
This makes try_to_wake_up() call all functions that take wake_flags with
WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
more consistent, and, as the flag is only tested by ->select_task_rq()
implementations, it doesn't cause any behavior differences.
Linus Torvalds [Mon, 7 Oct 2024 18:33:26 +0000 (11:33 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Several small bugfixes all over the place.
Most notably, fixes the vsock allocation with GFP_KERNEL in atomic
context, which has been triggering warnings for lots of testers"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()
vsock/virtio: use GFP_ATOMIC under RCU read lock
virtio_console: fix misc probe bugs
virtio_ring: tag event_triggered as racy for KCSAN
vdpa/octeon_ep: Fix format specifier for pointers in debug messages
Haoran Zhang [Tue, 1 Oct 2024 20:14:15 +0000 (15:14 -0500)]
vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()
Since commit 3f8ca2e115e5 ("vhost/scsi: Extract common handling code
from control queue handler") a null pointer dereference bug can be
triggered when guest sends an SCSI AN request.
In vhost_scsi_ctl_handle_vq(), `vc.target` is assigned with
`&v_req.tmf.lun[1]` within a switch-case block and is then passed to
vhost_scsi_get_req() which extracts `vc->req` and `tpg`. However, for
a `VIRTIO_SCSI_T_AN_*` request, tpg is not required, so `vc.target` is
set to NULL in this branch. Later, in vhost_scsi_get_req(),
`vc->target` is dereferenced without being checked, leading to a null
pointer dereference bug. This bug can be triggered from guest.
When this bug occurs, the vhost_worker process is killed while holding
`vq->mutex` and the corresponding tpg will remain occupied
indefinitely.
virtio_transport_send_pkt in now called on transport fast path,
under RCU read lock. In that case, we have a bug: virtio_add_sgs
is called with GFP_KERNEL, and might sleep.
Pass the gfp flags as an argument, and use GFP_ATOMIC on
the fast path.
Brian Foster [Tue, 3 Sep 2024 12:47:13 +0000 (08:47 -0400)]
xfs: skip background cowblock trims on inodes open for write
The background blockgc scanner runs on a 5m interval by default and
trims preallocation (post-eof and cow fork) from inodes that are
otherwise idle. Idle effectively means that iolock can be acquired
without blocking and that the inode has no dirty pagecache or I/O in
flight.
This simple mechanism and heuristic has worked fairly well for
post-eof speculative preallocations. Support for reflink and COW
fork preallocations came sometime later and plugged into the same
mechanism, with similar heuristics. Some recent testing has shown
that COW fork preallocation may be notably more sensitive to blockgc
processing than post-eof preallocation, however.
For example, consider an 8GB reflinked file with a COW extent size
hint of 1MB. A worst case fully randomized overwrite of this file
results in ~8k extents of an average size of ~1MB. If the same
workload is interrupted a couple times for blockgc processing
(assuming the file goes idle), the resulting extent count explodes
to over 100k extents with an average size <100kB. This is
significantly worse than ideal and essentially defeats the COW
extent size hint mechanism.
While this particular test is instrumented, it reflects a fairly
reasonable pattern in practice where random I/Os might spread out
over a large period of time with varying periods of (in)activity.
For example, consider a cloned disk image file for a VM or container
with long uptime and variable and bursty usage. A background blockgc
scan that races and processes the image file when it happens to be
clean and idle can have a significant effect on the future
fragmentation level of the file, even when still in use.
To help combat this, update the heuristic to skip cowblocks inodes
that are currently opened for write access during non-sync blockgc
scans. This allows COW fork preallocations to persist for as long as
possible unless otherwise needed for functional purposes (i.e. a
sync scan), the file is idle and closed, or the inode is being
evicted from cache. While here, update the comments to help
distinguish performance oriented heuristics from the logic that
exists to maintain functional correctness.
xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.
Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.
xfs: don't ifdef around the exact minlen allocations
Exact minlen allocations only exist as an error injection tool for debug
builds. Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test
coverage by always building the code and use the compilers' dead code
elimination to remove it from the generated binary instead.
The only downside is that the alloc_minlen_only field is unconditionally
added to struct xfs_alloc_args now, but by moving it around and packing
it tightly this doesn't actually increase the size of the structure.
xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC both for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork. Use magic
1 to ask for the conversion here as well.
Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.
xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to it's only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.
Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.
This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device. In this case every about 5-10th run of
xfs/538 runs into the following assert:
ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);
in xfs_attr3_leaf_split caused by an allocation failure. Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.
xfs_attr3_leaf_add only has two potential return values, indicating if the
entry could be added or not. Replace the errno return with a bool so that
ENOSPC from it can't easily be confused with a real ENOSPC.
Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always return 0.
xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
Use !try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
xlog_cil_insert_pcp_aggregate(). x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.
Note that the value from *ptr should be read using READ_ONCE to
prevent the compiler from merging, refetching or reordering the read.
The definition of xfs_attr_use_log_assist() has been removed since
commit d9c61ccb3b09 ("xfs: move xfs_attr_use_log_assist out of xfs_log.c").
So, Remove the empty declartion in header files.
MAINTAINERS: add Carlos Maiolino as XFS release manager
I nominate Carlos Maiolino to take over linux-xfs tree maintainer role for
upstream kernel's XFS code. He has enough experience in Linux kernel and he's
been maintaining xfsprogs and xfsdump trees for a few years now, so he has
sufficient experience with xfs workflow to take over this role.
Linus Torvalds [Sun, 6 Oct 2024 18:34:55 +0000 (11:34 -0700)]
Merge tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move non-boot built-in DTBs to the .rodata section
- Fix Kconfig bugs
- Fix maint scripts in the linux-image Debian package
- Import some list macros to scripts/include/
* tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: deb-pkg: Remove blank first line from maint scripts
kbuild: fix a typo dt_binding_schema -> dt_binding_schemas
scripts: import more list macros
kconfig: qconf: fix buffer overflow in debug links
kconfig: qconf: move conf_read() before drawing tree pain
kconfig: clear expr::val_is_valid when allocated
kconfig: fix infinite loop in sym_calc_choice()
kbuild: move non-boot built-in DTBs to .rodata section
Linus Torvalds [Sun, 6 Oct 2024 18:11:01 +0000 (11:11 -0700)]
Merge tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- Intel PMC fix for suspend/resume issues on some Sky and Kaby Lake
laptops
- Intel Diamond Rapids hw-id additions
- Documentation and MAINTAINERS fixes
- Some other small fixes
* tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
platform/x86: wmi: Update WMI driver API documentation
platform/x86: dell-ddv: Fix typo in documentation
platform/x86: dell-sysman: add support for alienware products
platform/x86/intel: power-domains: Add Diamond Rapids support
platform/x86: ISST: Add Diamond Rapids to support list
platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
MAINTAINERS: Update Intel In Field Scan(IFS) entry
platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
Linus Torvalds [Sun, 6 Oct 2024 17:53:28 +0000 (10:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix pKVM error path on init, making sure we do not change critical
system registers as we're about to fail
- Make sure that the host's vector length is at capped by a value
common to all CPUs
- Fix kvm_has_feat*() handling of "negative" features, as the current
code is pretty broken
- Promote Joey to the status of official reviewer, while James steps
down -- hopefully only temporarly
x86:
- Fix compilation with KVM_INTEL=KVM_AMD=n
- Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use
Selftests:
- Fix compilation on non-x86 architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86/reboot: emergency callbacks are now registered by common KVM code
KVM: x86: leave kvm.ko out of the build if no vendor module is requested
KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
KVM: arm64: Fix kvm_has_feat*() handling of negative features
KVM: selftests: Fix build on architectures other than x86_64
KVM: arm64: Another reviewer reshuffle
KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path
Aaron Thompson [Fri, 4 Oct 2024 07:52:45 +0000 (07:52 +0000)]
kbuild: deb-pkg: Remove blank first line from maint scripts
The blank line causes execve() to fail:
# strace ./postinst
execve("./postinst", ...) = -1 ENOEXEC (Exec format error)
strace: exec: Exec format error
+++ exited with 1 +++
However running the scripts via shell does work (at least with bash)
because the shell attempts to execute the file as a shell script when
execve() fails.
Fixes: b611daae5efc ("kbuild: deb-pkg: split image and debug objects staging out into functions") Signed-off-by: Aaron Thompson <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Reviewed-by: Nicolas Schier <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
Hans de Goede [Sat, 5 Oct 2024 13:05:45 +0000 (15:05 +0200)]
platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
x86_android_tablet_remove() frees the pdevs[] array, so it should not
be used after calling x86_android_tablet_remove().
When platform_device_register() fails, store the pdevs[x] PTR_ERR() value
into the local ret variable before calling x86_android_tablet_remove()
to avoid using pdevs[] after it has been freed.
Fixes: 5eba0141206e ("platform/x86: x86-android-tablets: Add support for instantiating platform-devs") Fixes: e2200d3f26da ("platform/x86: x86-android-tablets: Add gpio_keys support to x86_android_tablet_init()") Cc: [email protected] Reported-by: Aleksandr Burakov <[email protected]> Closes: https://lore.kernel.org/platform-driver-x86/[email protected]/ Signed-off-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected]
Hans de Goede [Thu, 3 Oct 2024 20:26:13 +0000 (22:26 +0200)]
platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
There have been multiple reports that the ACPI PM Timer disabling is
causing Sky and Kaby Lake systems to hang on all suspend (s2idle, s3,
hibernate) methods.
Remove the acpi_pm_tmr_ctl_offset and acpi_pm_tmr_disable_bit settings from
spt_reg_map to disable the ACPI PM Timer disabling on Sky and Kaby Lake to
fix the hang on suspend.
Armin Wolf [Tue, 1 Oct 2024 21:28:35 +0000 (23:28 +0200)]
platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
If the battery hook encounters a unsupported battery, it will
return an error. This in turn will cause the battery driver to
automatically unregister the battery hook.
On machines with multiple batteries however, this will prevent
the battery hook from handling the primary battery, since it will
always get unregistered upon encountering one of the unsupported
batteries.
Fix this by simply ignoring unsupported batteries.
Jithu Joseph [Tue, 1 Oct 2024 17:08:08 +0000 (10:08 -0700)]
MAINTAINERS: Update Intel In Field Scan(IFS) entry
Ashok is no longer with Intel and his e-mail address will start bouncing
soon. Update his email address to the new one he provided to ensure
correct contact details in the MAINTAINERS file.
Paolo Bonzini [Tue, 1 Oct 2024 14:34:58 +0000 (10:34 -0400)]
x86/reboot: emergency callbacks are now registered by common KVM code
Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor modules.
In practice this has no functional change, because CONFIG_KVM_X86_COMMON
is set if and only if at least one vendor-specific module is being built.
However, it is cleaner to specify CONFIG_KVM_X86_COMMON for functions that
are used in kvm.ko.
Reported-by: Linus Torvalds <[email protected]> Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Fixes: 6d55a94222db ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef") Signed-off-by: Paolo Bonzini <[email protected]>
Paolo Bonzini [Tue, 1 Oct 2024 14:15:01 +0000 (10:15 -0400)]
KVM: x86: leave kvm.ko out of the build if no vendor module is requested
kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko.
It provides no functionality on its own and it is unnecessary unless one
of the vendor-specific module is compiled. In particular, /dev/kvm is
not created until one of kvm-intel.ko or kvm-amd.ko is loaded.
Use CONFIG_KVM to decide if it is built-in or a module, but use the
vendor-specific modules for the actual decision on whether to build it.
This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD
are both disabled. The cpu_emergency_register_virt_callback() function
is called from kvm.ko, but it is only defined if at least one of
CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided.
Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Signed-off-by: Paolo Bonzini <[email protected]>
Linus Torvalds [Sat, 5 Oct 2024 22:18:04 +0000 (15:18 -0700)]
Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead
of going into an infinite loop)
- Various small filesystem connectivitity repair fixes"
* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
bcachefs: Rework logged op error handling
bcachefs: Add warn param to subvol_get_snapshot, peek_inode
bcachefs: Kill snapshot arg to fsck_write_inode()
bcachefs: Check for unlinked, non-empty dirs in check_inode()
bcachefs: Check for unlinked inodes with dirents
bcachefs: Check for directories with no backpointers
bcachefs: Kill alloc_v4.fragmentation_lru
bcachefs: minor lru fsck fixes
bcachefs: Mark more errors AUTOFIX
bcachefs: Make sure we print error that causes fsck to bail out
bcachefs: bkey errors are only AUTOFIX during read
bcachefs: Create lost+found in correct snapshot
bcachefs: Fix reattach_inode()
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
bcachefs: Fix trans_commit disk accounting revert
bcachefs: Fix bch2_inode_is_open() check
bcachefs: Fix return type of dirent_points_to_inode_nowarn()
bcachefs: Fix bad shift in bch2_read_flag_list()
When multiple FREE_STATEIDs are sent for the same delegation stateid,
it can lead to a possible either use-after-free or counter refcount
underflow errors.
In nfsd4_free_stateid() under the client lock we find a delegation
stateid, however the code drops the lock before calling nfs4_put_stid(),
that allows another FREE_STATE to find the stateid again. The first one
will proceed to then free the stateid which leads to either
use-after-free or decrementing already zeroed counter.
Linus Torvalds [Sat, 5 Oct 2024 17:47:00 +0000 (10:47 -0700)]
Merge tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some ext4 bugs and regressions relating to oneline resize and fast
commits"
* tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix off by one issue in alloc_flex_gd()
ext4: mark fc as ineligible using an handle in ext4_xattr_set()
ext4: use handle to mark fc as ineligible in __track_dentry_update()
Linus Torvalds [Sat, 5 Oct 2024 17:31:04 +0000 (10:31 -0700)]
Merge tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
- Fix potential deadlock during runtime suspend and resume (stm32f7)
* tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: stm32f7: Do not prepare/unprepare clock during runtime suspend/resume
Linus Torvalds [Sat, 5 Oct 2024 17:25:04 +0000 (10:25 -0700)]
Merge tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small set of driver specific fixes that came in since the merge
window, about half of which is fixes for correctness in the use of the
runtime PM APIs done as part of a broader cleanup"
* tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: s3c64xx: fix timeout counters in flush_fifo
spi: atmel-quadspi: Fix wrong register value written to MR
spi: spi-cadence: Fix missing spi_controller_is_target() check
spi: spi-cadence: Fix pm_runtime_set_suspended() with runtime pm enabled
spi: spi-imx: Fix pm_runtime_set_suspended() with runtime pm enabled
Linus Torvalds [Sat, 5 Oct 2024 17:19:14 +0000 (10:19 -0700)]
Merge tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- gcc plugins: Avoid Kconfig warnings with randstruct (Nathan
Chancellor)
- MAINTAINERS: Add security/Kconfig.hardening to hardening section
(Nathan Chancellor)
- MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list
* tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
MAINTAINERS: Add security/Kconfig.hardening to hardening section
hardening: Adjust dependencies in selection of MODVERSIONS
MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list
Linus Torvalds [Sat, 5 Oct 2024 17:10:45 +0000 (10:10 -0700)]
Merge tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm revert from Paul Moore:
"Here is the CONFIG_SECURITY_TOMOYO_LKM revert that we've been
discussing this week. With near unanimous agreement that the original
TOMOYO patches were not the right way to solve the distro problem
Tetsuo is trying the solve, reverting is our best option at this time"
* tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
tomoyo: revert CONFIG_SECURITY_TOMOYO_LKM support
platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
Attaching SST PCI device to VM causes "BUG: KASAN: slab-out-of-bounds".
kasan report:
[ 19.411889] ==================================================================
[ 19.413702] BUG: KASAN: slab-out-of-bounds in _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.415634] Read of size 8 at addr ffff888829e65200 by task cpuhp/16/113
[ 19.417368]
[ 19.418627] CPU: 16 PID: 113 Comm: cpuhp/16 Tainted: G E 6.9.0 #10
[ 19.420435] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022
[ 19.422687] Call Trace:
[ 19.424091] <TASK>
[ 19.425448] dump_stack_lvl+0x5d/0x80
[ 19.426963] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.428694] print_report+0x19d/0x52e
[ 19.430206] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 19.431837] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.433539] kasan_report+0xf0/0x170
[ 19.435019] ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.436709] _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[ 19.438379] ? __pfx_sched_clock_cpu+0x10/0x10
[ 19.439910] isst_if_cpu_online+0x406/0x58f [isst_if_common]
[ 19.441573] ? __pfx_isst_if_cpu_online+0x10/0x10 [isst_if_common]
[ 19.443263] ? ttwu_queue_wakelist+0x2c1/0x360
[ 19.444797] cpuhp_invoke_callback+0x221/0xec0
[ 19.446337] cpuhp_thread_fun+0x21b/0x610
[ 19.447814] ? __pfx_cpuhp_thread_fun+0x10/0x10
[ 19.449354] smpboot_thread_fn+0x2e7/0x6e0
[ 19.450859] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 19.452405] kthread+0x29c/0x350
[ 19.453817] ? __pfx_kthread+0x10/0x10
[ 19.455253] ret_from_fork+0x31/0x70
[ 19.456685] ? __pfx_kthread+0x10/0x10
[ 19.458114] ret_from_fork_asm+0x1a/0x30
[ 19.459573] </TASK>
[ 19.460853]
[ 19.462055] Allocated by task 1198:
[ 19.463410] kasan_save_stack+0x30/0x50
[ 19.464788] kasan_save_track+0x14/0x30
[ 19.466139] __kasan_kmalloc+0xaa/0xb0
[ 19.467465] __kmalloc+0x1cd/0x470
[ 19.468748] isst_if_cdev_register+0x1da/0x350 [isst_if_common]
[ 19.470233] isst_if_mbox_init+0x108/0xff0 [isst_if_mbox_msr]
[ 19.471670] do_one_initcall+0xa4/0x380
[ 19.472903] do_init_module+0x238/0x760
[ 19.474105] load_module+0x5239/0x6f00
[ 19.475285] init_module_from_file+0xd1/0x130
[ 19.476506] idempotent_init_module+0x23b/0x650
[ 19.477725] __x64_sys_finit_module+0xbe/0x130
[ 19.476506] idempotent_init_module+0x23b/0x650
[ 19.477725] __x64_sys_finit_module+0xbe/0x130
[ 19.478920] do_syscall_64+0x82/0x160
[ 19.480036] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 19.481292]
[ 19.482205] The buggy address belongs to the object at ffff888829e65000
which belongs to the cache kmalloc-512 of size 512
[ 19.484818] The buggy address is located 0 bytes to the right of
allocated 512-byte region [ffff888829e65000, ffff888829e65200)
[ 19.487447]
[ 19.488328] The buggy address belongs to the physical page:
[ 19.489569] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888829e60c00 pfn:0x829e60
[ 19.491140] head: order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 19.492466] anon flags: 0x57ffffc0000840(slab|head|node=1|zone=2|lastcpupid=0x1fffff)
[ 19.493914] page_type: 0xffffffff()
[ 19.494988] raw: 0057ffffc0000840ffff88810004cc8000000000000000000000000000000001
[ 19.496451] raw: ffff888829e60c00000000008020001800000001ffffffff0000000000000000
[ 19.497906] head: 0057ffffc0000840ffff88810004cc8000000000000000000000000000000001
[ 19.499379] head: ffff888829e60c00000000008020001800000001ffffffff0000000000000000
[ 19.500844] head: 0057ffffc0000003ffffea0020a79801ffffea0020a7984800000000ffffffff
[ 19.502316] head: 0000000800000000000000000000000000000000ffffffff0000000000000000
[ 19.503784] page dumped because: kasan: bad access detected
[ 19.505058]
[ 19.505970] Memory state around the buggy address:
[ 19.507172] ffff888829e65100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 19.508599] ffff888829e65180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 19.510013] >ffff888829e65200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.510014] ^
[ 19.510016] ffff888829e65280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.510018] ffff888829e65300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.515367] ==================================================================
The reason for this error is physical_package_ids assigned by VMware VMM
are not continuous and have gaps. This will cause value returned by
topology_physical_package_id() to be more than topology_max_packages().
Here the allocation uses topology_max_packages(). The call to
topology_max_packages() returns maximum logical package ID not physical
ID. Hence use topology_logical_package_id() instead of
topology_physical_package_id().
Linus Torvalds [Sat, 5 Oct 2024 00:30:59 +0000 (17:30 -0700)]
Merge tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"Fixes to build warnings, install scripts, run-time error path, and git
status cleanups to tests:
- devices/probe: fix for Python3 regex string syntax warnings
- clone3: removing unused macro from clone3_cap_checkpoint_restore()
- vDSO: fix to align getrandom states to cache line
- core and exec: add missing executables to .gitignore files
- rtc: change to skip test if /dev/rtc0 can't be accessed
- timers/posix: fix warn_unused_result result in __fatal_error()
- breakpoints: fix to detect suspend successful condition correctly
- hid: fix to install required dependencies to run the test"
* tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: breakpoints: use remaining time to check if suspend succeed
kselftest/devices/probe: Fix SyntaxWarning in regex strings for Python3
selftest: hid: add missing run-hid-tools-tests.sh
selftests: vDSO: align getrandom states to cache line
selftests: exec: update gitignore for load_address
selftests: core: add unshare_test to gitignore
clone3: clone3_cap_checkpoint_restore: remove unused MAX_PID_NS_LEVEL macro
selftests:timers: posix_timers: Fix warn_unused_result in __fatal_error()
selftest: rtc: Check if could access /dev/rtc0 before testing
Kent Overstreet [Tue, 24 Sep 2024 02:06:58 +0000 (22:06 -0400)]
bcachefs: Rework logged op error handling
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.
Kent Overstreet [Mon, 30 Sep 2024 04:00:33 +0000 (00:00 -0400)]
bcachefs: Kill snapshot arg to fsck_write_inode()
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.
This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.
Kent Overstreet [Mon, 30 Sep 2024 03:38:37 +0000 (23:38 -0400)]
bcachefs: Check for unlinked, non-empty dirs in check_inode()
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.
Kent Overstreet [Mon, 30 Sep 2024 02:38:04 +0000 (22:38 -0400)]
bcachefs: Check for unlinked inodes with dirents
link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.
That means we need a few different checks instead; in particular, we
don't want fsck to delet a file that has a dirent pointing to it.
Kent Overstreet [Sat, 28 Sep 2024 19:27:37 +0000 (15:27 -0400)]
bcachefs: Check for directories with no backpointers
It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.
Kent Overstreet [Tue, 1 Oct 2024 23:08:37 +0000 (19:08 -0400)]
bcachefs: Kill alloc_v4.fragmentation_lru
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.
It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.
This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.
Kent Overstreet [Tue, 1 Oct 2024 20:40:33 +0000 (16:40 -0400)]
bcachefs: minor lru fsck fixes
check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.
And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().
Kent Overstreet [Sat, 28 Sep 2024 06:44:12 +0000 (02:44 -0400)]
bcachefs: Fix reattach_inode()
Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.
Kent Overstreet [Fri, 4 Oct 2024 23:44:32 +0000 (19:44 -0400)]
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
This fixes two different bugs:
- Looser locking with the rhashtable means we need to recheck if the
inode is still hashed after prepare_to_wait(), and add a corresponding
wakeup after removing from the hash table.
- da18ecbf0fb6 ("fs: add i_state helpers") changed the bit waitqueues
used for inodes, and bcachefs wasn't updated and thus broke; this
updates bcachefs to the new helper.
Fixes: 112d21fd1a12 ("bcachefs: switch to rhashtable for vfs inodes hash") Signed-off-by: Kent Overstreet <[email protected]>
Baokun Li [Fri, 27 Sep 2024 13:33:29 +0000 (21:33 +0800)]
ext4: fix off by one issue in alloc_flex_gd()
Wesley reported an issue:
==================================================================
EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks
------------[ cut here ]------------
kernel BUG at fs/ext4/resize.c:324!
CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27
RIP: 0010:ext4_resize_fs+0x1212/0x12d0
Call Trace:
__ext4_ioctl+0x4e0/0x1800
ext4_ioctl+0x12/0x20
__x64_sys_ioctl+0x99/0xd0
x64_sys_call+0x1206/0x20d0
do_syscall_64+0x72/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================
While reviewing the patch, Honza found that when adjusting resize_bg in
alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than
flexbg_size.
The reproduction of the problem requires the following:
ext4: mark fc as ineligible using an handle in ext4_xattr_set()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch moves the call to this function so that an handle
can be used. If a transaction fails to start, then there's not point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.
ext4: use handle to mark fc as ineligible in __track_dentry_update()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().
Tejun Heo [Wed, 2 Oct 2024 20:34:38 +0000 (10:34 -1000)]
sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
after scx_cgroup_init() finished successfully. This isn't true.
scx_cgroup_exit() can be called without scx_cgroup_init() being called at
all or after scx_cgroup_init() failed in the middle.
As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
clean up in all cases. Remove the incorrect WARN_ON_ONCE().
Signed-off-by: Tejun Heo <[email protected]> Fixes: 568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")