Yu Zhao [Thu, 27 Jun 2024 22:27:05 +0000 (16:27 -0600)]
mm/hugetlb_vmemmap: fix race with speculative PFN walkers
While investigating HVO for THPs [1], it turns out that speculative PFN
walkers like compaction can race with vmemmap modifications, e.g.,
CPU 1 (vmemmap modifier) CPU 2 (speculative PFN walker)
------------------------------- ------------------------------
Allocates an LRU folio page1
Sees page1
Frees page1
Allocates a hugeTLB folio page2
(page1 being a tail of page2)
Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
still try to modify this read-only field, resulting in a crash.
An independent report [2] confirmed this race.
There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
page->_refcount through the old vmemmap or non-zero page->_refcount
through the new one.
The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
similar races in other places, e.g., arch_remove_memory() on x86
[3], which frees vmemmap mapping offlined struct page[].
While adding synchronize_rcu(), the goal is to be surgical, rather than
optimized. Specifically, calls to synchronize_rcu() on the error handling
paths can be coalesced, but it is not done for the sake of Simplicity:
noticeably, this fix removes ~50% more lines than it adds.
According to the hugetlb_optimize_vmemmap section in
Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
freeing hugeTLB pages "~2x slower than before". Having synchronize_rcu()
on top makes those operations even worse, and this also affects the user
interface /proc/sys/vm/nr_overcommit_hugepages.
This is *very* hard to trigger:
1. Most hugeTLB use cases I know of are static, i.e., reserved at
boot time, because allocating at runtime is not reliable at all.
2. On top of that, someone has to be very unlucky to get tripped
over above, because the race window is so small -- I wasn't able to
trigger it with a stress testing that does nothing but that (with
THPs though).
Nhat Pham [Thu, 27 Jun 2024 20:17:37 +0000 (13:17 -0700)]
cachestat: do not flush stats in recency check
syzbot detects that cachestat() is flushing stats, which can sleep, in its
RCU read section (see [1]). This is done in the workingset_test_recent()
step (which checks if the folio's eviction is recent).
Move the stat flushing step to before the RCU read section of cachestat,
and skip stat flushing during the recency check.
Gavin Shan [Thu, 27 Jun 2024 00:39:52 +0000 (10:39 +1000)]
mm/shmem: disable PMD-sized page cache if needed
For shmem files, it's possible that PMD-sized page cache can't be
supported by xarray. For example, 512MB page cache on ARM64 when the base
page size is 64KB can't be supported by xarray. It leads to errors as the
following messages indicate when this sort of xarray entry is split.
Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is larger
than MAX_PAGECACHE_ORDER. As Matthew Wilcox pointed, the page cache in a
shmem file isn't represented by a multi-index entry and doesn't have this
limitation when the xarry entry is split until commit 6b24ca4a1a8d ("mm:
Use multi-index entries in the page cache").
Gavin Shan [Thu, 27 Jun 2024 00:39:51 +0000 (10:39 +1000)]
mm/filemap: skip to create PMD-sized page cache if needed
On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB. The
PMD-sized page cache can't be supported by xarray as the following error
messages indicate.
Fix it by skipping to allocate PMD-sized page cache when its size is
larger than MAX_PAGECACHE_ORDER. For this specific case, we will fall to
regular path where the readahead window is determined by BDI's sysfs file
(read_ahead_kb).
Gavin Shan [Thu, 27 Jun 2024 00:39:50 +0000 (10:39 +1000)]
mm/readahead: limit page cache size in page_cache_ra_order()
In page_cache_ra_order(), the maximal order of the page cache to be
allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise, it's
possible the large page cache can't be supported by xarray when the
corresponding xarray entry is split.
For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is
64KB. The PMD-sized page cache can't be supported by xarray.
Gavin Shan [Thu, 27 Jun 2024 00:39:49 +0000 (10:39 +1000)]
mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray
Patch series "mm/filemap: Limit page cache size to that supported by
xarray", v2.
Currently, xarray can't support arbitrary page cache size. More details
can be found from the WARN_ON() statement in xas_split_alloc(). In our
test whose code is attached below, we hit the WARN_ON() on ARM64 system
where the base page size is 64KB and huge page size is 512MB. The issue
was reported long time ago and some discussions on it can be found here
[1].
In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
supported by xarray and avoid PMD-sized page cache if needed. The code
changes are suggested by David Hildenbrand.
PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
PATCH[4] avoids PMD-sized page cache for shmem files if needed
The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64 with
64KB base page size. The xarray entry with this order can't be split as
the following error messages indicate.
Fix it by decreasing MAX_PAGECACHE_ORDER to the largest supported order
by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from
13 to 11 when CONFIG_BASE_SMALL is disabled.
SeongJae Park [Mon, 24 Jun 2024 17:58:14 +0000 (10:58 -0700)]
mm/damon/core: merge regions aggressively when max_nr_regions is unmet
DAMON keeps the number of regions under max_nr_regions by skipping regions
split operations when doing so can make the number higher than the limit.
It works well for preventing violation of the limit. But, if somehow the
violation happens, it cannot recovery well depending on the situation. In
detail, if the real number of regions having different access pattern is
higher than the limit, the mechanism cannot reduce the number below the
limit. In such a case, the system could suffer from high monitoring
overhead of DAMON.
The violation can actually happen. For an example, the user could reduce
max_nr_regions while DAMON is running, to be lower than the current number
of regions. Fix the problem by repeating the merge operations with
increasing aggressiveness in kdamond_merge_regions() for the case, until
the limit is met.
Audra Mitchell [Wed, 26 Jun 2024 13:05:11 +0000 (09:05 -0400)]
Fix userfaultfd_api to return EINVAL as expected
Currently if we request a feature that is not set in the Kernel config we
fail silently and return all the available features. However, the man
page indicates we should return an EINVAL.
We need to fix this issue since we can end up with a Kernel warning should
a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with
the config not set with this feature.
mm: vmalloc: check if a hash-index is in cpu_possible_mask
The problem is that there are systems where cpu_possible_mask has gaps
between set CPUs, for example SPARC. In this scenario addr_to_vb_xa()
hash function can return an index which accesses to not-possible and not
setup CPU area using per_cpu() macro. This results in an oops on SPARC.
A per-cpu vmap_block_queue is also used as hash table, incorrectly
assuming the cpu_possible_mask has no gaps. Fix it by adjusting an index
to a next possible CPU.
Waiman Long [Wed, 26 Jun 2024 00:16:39 +0000 (20:16 -0400)]
mm: prevent derefencing NULL ptr in pfn_section_valid()
Commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") changed pfn_section_valid() to add a READ_ONCE()
call around "ms->usage" to fix a race with section_deactivate() where
ms->usage can be cleared. The READ_ONCE() call, by itself, is not enough
to prevent NULL pointer dereference. We need to check its value before
dereferencing it.
This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in
folio_ref_try_add_rcu() for non-SMP kernel.
The process_vm_readv() calls GUP to pin the THP. An optimization for
pinning THP instroduced by commit 57edfcfd3419 ("mm/gup: accelerate thp
gup even for "pages != NULL"") calls try_grab_folio() to pin the THP,
but try_grab_folio() is supposed to be called in atomic context for
non-SMP kernel, for example, irq disabled or preemption disabled, due to
the optimization introduced by commit e286781d5f2e ("mm: speculative
page references").
The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") is not actually the root cause although it was bisected to.
It just makes the problem exposed more likely.
The follow up discussion suggested the optimization for non-SMP kernel
may be out-dated and not worth it anymore [1]. So removing the
optimization to silence the BUG.
However calling try_grab_folio() in GUP slow path actually is
unnecessary, so the following patch will clean this up.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:35 +0000 (14:11 +0900)]
nilfs2: fix incorrect inode allocation from reserved inodes
If the bitmap block that manages the inode allocation status is corrupted,
nilfs_ifile_create_inode() may allocate a new inode from the reserved
inode area where it should not be allocated.
Previous fix commit d325dc6eb763 ("nilfs2: fix use-after-free bug of
struct nilfs_root"), fixed the problem that reserved inodes with inode
numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
bitmap corruption, but since the start number of non-reserved inodes is
read from the super block and may change, in which case inode allocation
may occur from the extended reserved inode area.
If that happens, access to that inode will cause an IO error, causing the
file system to degrade to an error state.
Fix this potential issue by adding a wraparound option to the common
metadata object allocation routine and by modifying
nilfs_ifile_create_inode() to disable the option so that it only allocates
inodes with inode numbers greater than or equal to the inode number read
in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
inodes.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:34 +0000 (14:11 +0900)]
nilfs2: add missing check for inode numbers on directory entries
Syzbot reported that mounting and unmounting a specific pattern of
corrupted nilfs2 filesystem images causes a use-after-free of metadata
file inodes, which triggers a kernel bug in lru_add_fn().
As Jan Kara pointed out, this is because the link count of a metadata file
gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
tries to delete that inode (ifile inode in this case).
The inconsistency occurs because directories containing the inode numbers
of these metadata files that should not be visible in the namespace are
read without checking.
Fix this issue by treating the inode numbers of these internal files as
errors in the sanity check helper when reading directory folios/pages.
Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
analysis.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:33 +0000 (14:11 +0900)]
nilfs2: fix inode number range checks
Patch series "nilfs2: fix potential issues related to reserved inodes".
This series fixes one use-after-free issue reported by syzbot, caused by
nilfs2's internal inode being exposed in the namespace on a corrupted
filesystem, and a couple of flaws that cause problems if the starting
number of non-reserved inodes written in the on-disk super block is
intentionally (or corruptly) changed from its default value.
This patch (of 3):
In the current implementation of nilfs2, "nilfs->ns_first_ino", which
gives the first non-reserved inode number, is read from the superblock,
but its lower limit is not checked.
As a result, if a number that overlaps with the inode number range of
reserved inodes such as the root directory or metadata files is set in the
super block parameter, the inode number test macros (NILFS_MDT_INODE and
NILFS_VALID_INODE) will not function properly.
In addition, these test macros use left bit-shift calculations using with
the inode number as the shift count via the BIT macro, but the result of a
shift calculation that exceeds the bit width of an integer is undefined in
the C specification, so if "ns_first_ino" is set to a large value other
than the default value NILFS_USER_INO (=11), the macros may potentially
malfunction depending on the environment.
Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
by preventing bit shifts equal to or greater than the NILFS_USER_INO
constant in the inode number test macros.
Also, change the type of "ns_first_ino" from signed integer to unsigned
integer to avoid the need for type casting in comparisons such as the
lower bound check introduced this time.
Jan Kara [Fri, 21 Jun 2024 14:42:38 +0000 (16:42 +0200)]
mm: avoid overflows in dirty throttling logic
The dirty throttling logic is interspersed with assumptions that dirty
limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
fit into 64-bits). If limits end up being larger, we will hit overflows,
possible divisions by 0 etc. Fix these problems by never allowing so
large dirty limits as they have dubious practical value anyway. For
dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
so large limits. For dirty_ratio / dirty_background_ratio it isn't so
simple as the dirty limit is computed from the amount of available memory
which can change due to memory hotplug etc. So when converting dirty
limits from ratios to numbers of pages, we just don't allow the result to
exceed UINT_MAX.
This is root-only triggerable problem which occurs when the operator
sets dirty limits to >16 TB.
Jan Kara [Fri, 21 Jun 2024 14:42:37 +0000 (16:42 +0200)]
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
Patch series "mm: Avoid possible overflows in dirty throttling".
Dirty throttling logic assumes dirty limits in page units fit into
32-bits. This patch series makes sure this is true (see patch 2/2 for
more details).
The commit is broken in several ways. Firstly, the removed (u64) cast
from the multiplication will introduce a multiplication overflow on 32-bit
archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
default settings with 4GB of RAM will trigger this). Secondly, the
div64_u64() is unnecessarily expensive on 32-bit archs. We have
div64_ul() in case we want to be safe & cheap. Thirdly, if dirty
thresholds are larger than 1<<32 pages, then dirty balancing is going to
blow up in many other spectacular ways anyway so trying to fix one
possible overflow is just moot.
Jinliang Zheng [Thu, 20 Jun 2024 12:21:24 +0000 (20:21 +0800)]
mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Linus Torvalds [Sun, 30 Jun 2024 21:32:24 +0000 (14:32 -0700)]
Merge tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Niklas Cassel:
- Add NOLPM quirk for for all Crucial BX SSD1 models.
Considering that we now have had bug reports for 3 different BX SSD1
variants from Crucial with the same product name, make the quirk more
inclusive, to catch more device models from the same generation.
- Fix a trivial NULL pointer dereference in the error path for
ata_host_release().
- Create a ata_port_free(), so that we don't miss freeing ata_port
struct members when freeing a struct ata_port.
- Fix a trivial double free in the error path for ata_host_alloc().
- Ensure that we remove the libata "remapped NVMe device count" sysfs
entry on .probe() error.
* tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: Clean up sysfs file on error
ata: libata-core: Fix double free on error
ata,scsi: libata-core: Do not leak memory for ata_port struct members
ata: libata-core: Fix null pointer dereference on error
ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
Niklas Cassel [Sat, 29 Jun 2024 12:42:14 +0000 (14:42 +0200)]
ata: ahci: Clean up sysfs file on error
.probe() (ahci_init_one()) calls sysfs_add_file_to_group(), however,
if probe() fails after this call, we currently never call
sysfs_remove_file_from_group().
(The sysfs_remove_file_from_group() call in .remove() (ahci_remove_one())
does not help, as .remove() is not called on .probe() error.)
Thus, if probe() fails after the sysfs_add_file_to_group() call, the next
time we insmod the module we will get:
Niklas Cassel [Sat, 29 Jun 2024 12:42:13 +0000 (14:42 +0200)]
ata: libata-core: Fix double free on error
If e.g. the ata_port_alloc() call in ata_host_alloc() fails, we will jump
to the err_out label, which will call devres_release_group().
devres_release_group() will trigger a call to ata_host_release().
ata_host_release() calls kfree(host), so executing the kfree(host) in
ata_host_alloc() will lead to a double free:
Niklas Cassel [Sat, 29 Jun 2024 12:42:12 +0000 (14:42 +0200)]
ata,scsi: libata-core: Do not leak memory for ata_port struct members
libsas is currently not freeing all the struct ata_port struct members,
e.g. ncq_sense_buf for a driver supporting Command Duration Limits (CDL).
Add a function, ata_port_free(), that is used to free a ata_port,
including its struct members. It makes sense to keep the code related to
freeing a ata_port in its own function, which will also free all the
struct members of struct ata_port.
Linus Torvalds [Sun, 30 Jun 2024 17:00:01 +0000 (10:00 -0700)]
Merge tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Remove the executable bit from installed DTB files
- Escape $ in subshell execution in the debian-orig target
- Fix RPM builds with CONFIG_MODULES=n
- Fix xconfig with the O= option
- Fix scripts_gdb with the O= option
* tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: scripts/gdb: bring the "abspath" back
kbuild: Use $(obj)/%.cc to fix host C++ module builds
kbuild: rpm-pkg: fix build error with CONFIG_MODULES=n
kbuild: Fix build target deb-pkg: ln: failed to create hard link
kbuild: doc: Update default INSTALL_MOD_DIR from extra to updates
kbuild: Install dtb files as 0644 in Makefile.dtbinst
Linus Torvalds [Wed, 26 Jun 2024 00:50:04 +0000 (17:50 -0700)]
x86-32: fix cmpxchg8b_emu build error with clang
The kernel test robot reported that clang no longer compiles the 32-bit
x86 kernel in some configurations due to commit 95ece48165c1
("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
functions").
The build fails with
arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
and the reason seems to be that not only does the cmpxchg8b instruction
need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
fallback the inline asm also wants a fifth fixed register for the
address (it uses %esi for that, but that's just a software convention
with cmpxchg8b_emu).
Avoiding using another pointer input to the asm (and just forcing it to
use the "0(%esi)" addressing that we end up requiring for the sw
fallback) seems to fix the issue.
Linus Torvalds [Sun, 30 Jun 2024 16:11:59 +0000 (09:11 -0700)]
Merge tag 'staging-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are two small staging driver fixes for 6.10-rc6, both for the
vc04_services drivers:
- build fix if CONFIG_DEBUGFS was not set
- initialization check fix that was much reported.
Both of these have been in linux-next this week with no reported
issues"
* tag 'staging-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: vchiq_debugfs: Fix build if CONFIG_DEBUG_FS is not set
staging: vc04_services: vchiq_arm: Fix initialisation check
Linus Torvalds [Sun, 30 Jun 2024 15:57:43 +0000 (08:57 -0700)]
Merge tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial / console fixes from Greg KH:
"Here are a bunch of fixes/reverts for 6.10-rc6. Include in here are:
- revert the bunch of tty/serial/console changes that landed in -rc1
that didn't quite work properly yet.
Everyone agreed to just revert them for now and will work on making
them better for a future release instead of trying to quick fix the
existing changes this late in the release cycle
- 8250 driver port count bugfix
- Other tiny serial port bugfixes for reported issues
All of these have been in linux-next this week with no reported
issues"
* tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "printk: Save console options for add_preferred_console_match()"
Revert "printk: Don't try to parse DEVNAME:0.0 console options"
Revert "printk: Flag register_console() if console is set on command line"
Revert "serial: core: Add support for DEVNAME:0.0 style naming for kernel console"
Revert "serial: core: Handle serial console options"
Revert "serial: 8250: Add preferred console in serial8250_isa_init_ports()"
Revert "Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial ports"
Revert "serial: 8250: Fix add preferred console for serial8250_isa_init_ports()"
Revert "serial: core: Fix ifdef for serial base console functions"
serial: bcm63xx-uart: fix tx after conversion to uart_port_tx_limited()
serial: core: introduce uart_port_tx_limited_flags()
Revert "serial: core: only stop transmit when HW fifo is empty"
serial: imx: set receiver level before starting uart
tty: mcf: MCF54418 has 10 UARTS
serial: 8250_omap: Implementation of Errata i2310
tty: serial: 8250: Fix port count mismatch with the device
Linus Torvalds [Sun, 30 Jun 2024 15:41:42 +0000 (08:41 -0700)]
Merge tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixes from Borislav Petkov:
- Fix "nosmp" and "maxcpus=0" after the parallel CPU bringup work went
in and broke them
- Make sure CPU hotplug dynamic prepare states are actually executed
* tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Fix broken cmdline "nosmp" and "maxcpus=0"
cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked()
Linus Torvalds [Sun, 30 Jun 2024 15:36:13 +0000 (08:36 -0700)]
Merge tag 'irq_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Make sure multi-bridge machines get all eiointc interrupt controllers
initialized even if the number of CPUs has been limited by a cmdline
param
- Make sure interrupt lines on liointc hw are configured properly even
when interrupt routing changes
- Avoid use-after-free in the error path of the MSI init code
* tag 'irq_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
PCI/MSI: Fix UAF in msi_capability_init
irqchip/loongson-liointc: Set different ISRs for different cores
irqchip/loongson-eiointc: Use early_cpu_to_node() instead of cpu_to_node()
Linus Torvalds [Sun, 30 Jun 2024 15:31:08 +0000 (08:31 -0700)]
Merge tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Warn when an hrtimer doesn't get a callback supplied
* tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Prevent queuing of hrtimer without a function callback
Linus Torvalds [Sat, 29 Jun 2024 16:21:40 +0000 (09:21 -0700)]
Merge tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Always free only post-EOF delayed allocations for files with the
XFS_DIFLAG_PREALLOC or APPEND flags set.
- Do not align cow fork delalloc to cowextsz hint when running low on
space.
- Allow zero-size symlinks and directories as long as the link count is
zero.
- Change XFS_IOC_EXCHANGE_RANGE to be a _IOW only ioctl. This was ioctl
was introduced during v6.10 developement cycle.
- xfs_init_new_inode() now creates an attribute fork on a newly created
inode even if ATTR feature flag is not enabled.
* tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
xfs: allow unlinked symlinks and dirs with zero size
xfs: restrict when we try to align cow fork delalloc to cowextsz hints
xfs: fix freeing speculative preallocations for preallocated files
Linus Torvalds [Sat, 29 Jun 2024 16:12:53 +0000 (09:12 -0700)]
Merge tag 'i2c-for-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two fixes for the testunit and and a fixup for the code reorganization
of the previous wmt-driver"
* tag 'i2c-for-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: testunit: discard write requests while old command is running
i2c: testunit: don't erase registers after STOP
i2c: viai2c: turn common code into a proper module
- sdhci-brcmstb: Fix support for erase/trim/discard
* tag 'mmc-v6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci: Do not lock spinlock around mmc_gpio_get_ro()
mmc: sdhci: Do not invert write-protect twice
Revert "mmc: moxart-mmc: Use sg_miter for PIO"
mmc: sdhci-brcmstb: check R1_STATUS for erase/trim/discard
mmc: sdhci-pci-o2micro: Convert PCIBIOS_* return codes to errnos
mmc: sdhci-pci: Convert PCIBIOS_* return codes to errnos
Linus Torvalds [Fri, 28 Jun 2024 23:14:59 +0000 (16:14 -0700)]
Merge tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for vector load/store instruction decoding, which could result
in reserved vector element length encodings decoding as valid vector
instructions.
- Instruction patching now aggressively flushes the local instruction
cache, to avoid situations where patching functions on the flush path
results in torn instructions being fetched.
- A fix to prevent the stack walker from showing up as part of traces.
* tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: stacktrace: convert arch_stack_walk() to noinstr
riscv: patch: Flush the icache right after patching to avoid illegal insns
RISC-V: fix vector insn load/store width mask
Linus Torvalds [Fri, 28 Jun 2024 21:27:22 +0000 (14:27 -0700)]
x86: stop playing stack games in profile_pc()
The 'profile_pc()' function is used for timer-based profiling, which
isn't really all that relevant any more to begin with, but it also ends
up making assumptions based on the stack layout that aren't necessarily
valid.
Basically, the code tries to account the time spent in spinlocks to the
caller rather than the spinlock, and while I support that as a concept,
it's not worth the code complexity or the KASAN warnings when no serious
profiling is done using timers anyway these days.
And the code really does depend on stack layout that is only true in the
simplest of cases. We've lost the comment at some point (I think when
the 32-bit and 64-bit code was unified), but it used to say:
Assume the lock function has either no stack frame or a copy
of eflags from PUSHF.
which explains why it just blindly loads a word or two straight off the
stack pointer and then takes a minimal look at the values to just check
if they might be eflags or the return pc:
Eflags always has bits 22 and up cleared unlike kernel addresses
but that basic stack layout assumption assumes that there isn't any lock
debugging etc going on that would complicate the code and cause a stack
frame.
It causes KASAN unhappiness reported for years by syzkaller [1] and
others [2].
With no real practical reason for this any more, just remove the code.
Just for historical interest, here's some background commits relating to
this code from 2006:
0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels") 31679f38d886 ("Simplify profile_pc on x86-64")
Wolfram Sang [Thu, 27 Jun 2024 11:14:48 +0000 (13:14 +0200)]
i2c: testunit: discard write requests while old command is running
When clearing registers on new write requests was added, the protection
for currently running commands was missed leading to concurrent access
to the testunit registers. Check the flag beforehand.
Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <[email protected]> Reviewed-by: Andi Shyti <[email protected]>
Wolfram Sang [Thu, 27 Jun 2024 11:14:47 +0000 (13:14 +0200)]
i2c: testunit: don't erase registers after STOP
STOP fallsthrough to WRITE_REQUESTED but this became problematic when
clearing the testunit registers was added to the latter. Actually, there
is no reason to clear the testunit state after STOP. Doing it when a new
WRITE_REQUESTED arrives is enough. So, no need to fallthrough, at all.
Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <[email protected]> Reviewed-by: Andi Shyti <[email protected]>
Wolfram Sang [Fri, 28 Jun 2024 18:38:20 +0000 (20:38 +0200)]
Merge tag 'i2c-host-fixes-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
Fixed a build error following the major refactoring involving the
VIA-I2C modules. Originally, the code was split to group together
parts that would be used by different drivers. This caused build
issues when two modules linked to the same code.
Linus Torvalds [Fri, 28 Jun 2024 16:32:33 +0000 (09:32 -0700)]
Merge tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Due to a late review, revert and re-fix a recent crasher fix
* tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "nfsd: fix oops when reading pool_stats before server is started"
nfsd: initialise nfsd_info.mutex early.
Linus Torvalds [Fri, 28 Jun 2024 16:25:21 +0000 (09:25 -0700)]
Merge tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Simple stuff:
- NULL ptr/err ptr deref fixes
- fix for getting wedged on shutdown after journal error
- fix missing recalc_capacity() call, capacity now changes correctly
after a device goes read only
however: our capacity calculation still doesn't take into account
when we have mixed ro/rw devices and the ro devices have data on
them, that's going to be a more involved fix to separate accounting
for "capacity used on ro devices" and "capacity used on rw devices"
- boring syzbot stuff
Slightly more involved:
- discard, invalidate workers are now per device
this has the effect of simplifying how we take device refs in these
paths, and the device ref cleanup fixes a longstanding race between
the device removal path and the discard path
- fixes for how the debugfs code takes refs on btree_trans objects we
have debugfs code that prints in use btree_trans objects.
It uses closure_get() on trans->ref, which is mainly for the cycle
detector, but the debugfs code was using it on a closure that may
have hit 0, which is not allowed; for performance reasons we cannot
avoid having not-in-use transactions on the global list.
Introduce some new primitives to fix this and make the
synchronization here a whole lot saner"
* tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix kmalloc bug in __snapshot_t_mut
bcachefs: Discard, invalidate workers are now per device
bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
bcachefs: Add missing bch2_journal_do_writes() call
bcachefs: Fix null ptr deref in journal_pins_to_text()
bcachefs: Add missing recalc_capacity() call
bcachefs: Fix btree_trans list ordering
bcachefs: Fix race between trans_put() and btree_transactions_read()
closures: closure_get_not_zero(), closure_return_sync()
bcachefs: Make btree_deadlock_to_text() clearer
bcachefs: fix seqmutex_relock()
bcachefs: Fix freeing of error pointers
Linus Torvalds [Fri, 28 Jun 2024 16:21:27 +0000 (09:21 -0700)]
Merge tag 'block-6.10-20240628' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"NVMe fixes via Keith:
- Fabrics fixes (Hannes)
- Missing module description (Jeff)
- Clang warning fix (Nathan)"
* tag 'block-6.10-20240628' of git://git.kernel.dk/linux:
nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]
nvmet: make 'tsas' attribute idempotent for RDMA
nvme: fixup comment for nvme RDMA Provider Type
nvme-apple: add missing MODULE_DESCRIPTION()
nvmet: do not return 'reserved' for empty TSAS values
nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.
Linus Torvalds [Fri, 28 Jun 2024 16:18:01 +0000 (09:18 -0700)]
Merge tag 'iommu-fixes-v6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Two cache flushing fixes for Intel and AMD drivers
- AMD guest translation enabling fix
- Update IOMMU tree location in MAINTAINERS file
* tag 'iommu-fixes-v6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
MAINTAINERS: Update IOMMU tree location
iommu/amd: Fix GT feature enablement again
iommu/vt-d: Fix missed device TLB cache tag
iommu/amd: Invalidate cache before removing device from domain list
Linus Torvalds [Fri, 28 Jun 2024 16:15:13 +0000 (09:15 -0700)]
Merge tag 'gpio-fixes-for-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"An assortment of driver fixes and two commits addressing a bad
behavior of the GPIO uAPI when reconfiguring requested lines.
- fix a race condition in i2c transfers by adding a missing i2c lock
section in gpio-pca953x
- validate the number of obtained interrupts in gpio-davinci
- add missing raw_spinlock_init() in gpio-graniterapids
- fix bad character device behavior: disallow GPIO line
reconfiguration without set direction both in v1 and v2 uAPI"
* tag 'gpio-fixes-for-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: cdev: Ignore reconfiguration without direction
gpiolib: cdev: Disallow reconfiguration without direction (uAPI v1)
gpio: graniterapids: Add missing raw_spinlock_init()
gpio: davinci: Validate the obtained number of IRQs
gpio: pca953x: fix pca953x_irq_bus_sync_unlock race
Linus Torvalds [Fri, 28 Jun 2024 16:10:01 +0000 (09:10 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"A pair of small arm64 fixes for -rc6.
One is a fix for the recently merged uffd-wp support (which was
triggering a spurious warning) and the other is a fix to the clearing
of the initial idmap pgd in some configurations
Summary:
- Fix spurious page-table warning when clearing PTE_UFFD_WP in a live
pte
- Fix clearing of the idmap pgd when using large addressing modes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Clear the initial ID map correctly before remapping
arm64: mm: Permit PTE SW bits to change in live mappings
Linus Torvalds [Fri, 28 Jun 2024 16:04:33 +0000 (09:04 -0700)]
Merge tag 'v6.10-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat fixes from Len Brown:
"Fix three recent minor turbostat regressions"
* tag 'v6.10-rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: Add local build_bug.h header for snapshot target
tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
tools/power turbostat: option '-n' is ambiguous
tty: mxser: Remove __counted_by from mxser_board.ports[]
Work for __counted_by on generic pointers in structures (not just
flexible array members) has started landing in Clang 19 (current tip of
tree). During the development of this feature, a restriction was added
to __counted_by to prevent the flexible array member's element type from
including a flexible array member itself such as:
struct foo {
int count;
char buf[];
};
struct bar {
int count;
struct foo data[] __counted_by(count);
};
because the size of data cannot be calculated with the standard array
size formula:
sizeof(struct foo) * count
This restriction was downgraded to a warning but due to CONFIG_WERROR,
it can still break the build. The application of __counted_by on the
ports member of 'struct mxser_board' triggers this restriction,
resulting in:
drivers/tty/mxser.c:291:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct mxser_port' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size]
291 | struct mxser_port ports[] __counted_by(nports);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Remove this use of __counted_by to fix the warning/error. However,
rather than remove it altogether, leave it commented, as it may be
possible to support this in future compiler releases.
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.
The prior intent of the per-architecture limits were:
arm64: capped at 0x1FF (9 bits), 5 bits effective
powerpc: uncapped (10 bits), 6 or 7 bits effective
riscv: uncapped (10 bits), 6 bits effective
x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
s390: capped at 0xFF (8 bits), undocumented effective entropy
Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/string_kunit.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/string_helpers_kunit.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Niklas Cassel [Thu, 27 Jun 2024 10:55:52 +0000 (12:55 +0200)]
ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
We got another report that CT1000BX500SSD1 does not work with LPM.
If you look in libata-core.c, we have six different Crucial devices that
are marked with ATA_HORKAGE_NOLPM. This model would have been the seventh.
(This quirk is used on Crucial models starting with both CT* and
Crucial_CT*)
It is obvious that this vendor does not have a great history of supporting
LPM properly, therefore, add the ATA_HORKAGE_NOLPM quirk for all Crucial
BX SSD1 models.
Adam Hawley [Wed, 22 May 2024 13:27:21 +0000 (16:27 +0300)]
tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
Commit 78464d7681f7 ("tools/power turbostat: Add columns for clustered
uncore frequency") introduced 'probe_intel_uncore_frequency_cluster()'
in a way which prevents printing uncore frequency columns if either of
the '-q' or '-l' options are used. Systems which do not have multiple
uncore frequencies per package are unaffected by this regression.
Fix the function so that uncore frequency columns are shown when either
the '-l' or '-q' option is used by checking if 'quiet' is true after
adding counters for the uncore frequency columns.
Fixes: 78464d7681f7 ("tools/power turbostat: Add columns for clustered uncore frequency") Signed-off-by: Adam Hawley <[email protected]> Signed-off-by: Len Brown <[email protected]>
David Arcari [Mon, 20 May 2024 18:57:49 +0000 (14:57 -0400)]
tools/power turbostat: option '-n' is ambiguous
In some cases specifying the '-n' command line argument will cause
turbostat to fail. For instance 'turbostat -n 1' works fine; however,
'turbostat -n 1 -d' will fail. This is the result of the first call
to getopt_long_only() where "MP" is specified as the optstring. This can
be easily fixed by changing the optstring from "MP" to "MPn:" to remove
ambiguity between the arguments.
tools/power turbostat: option '-n' is ambiguous; possibilities: '-num_iterations' '-no-msr' '-no-perf'
Fixes: a0e86c90b83c ("tools/power turbostat: Add --no-perf option") Signed-off-by: David Arcari <[email protected]> Signed-off-by: Len Brown <[email protected]>
Linus Torvalds [Fri, 28 Jun 2024 00:24:34 +0000 (17:24 -0700)]
Merge tag 'drm-fixes-2024-06-28' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular fixes, mostly amdgpu with some minor fixes in other places,
along with a fix for a very narrow UAF race in the pid handover code.
core:
- fix refcounting race on pid handover
fbdev:
- Fix fb_info when vmalloc is used, regression from
CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM.
i915:
- Fix potential UAF due to race on fence register revocation
nouveau
- nouveau tv mode fixes
panel:
- Add KOE TX26D202VM0BWA timings"
* tag 'drm-fixes-2024-06-28' of https://gitlab.freedesktop.org/drm/kernel:
drm/drm_file: Fix pid refcounting race
drm/nouveau/dispnv04: fix null pointer dereference in nv17_tv_get_ld_modes
drm/nouveau/dispnv04: fix null pointer dereference in nv17_tv_get_hd_modes
drm/amdgpu: Don't show false warning for reg list
drm/amdgpu: avoid using null object of framebuffer
drm/amd/display: Send DP_TOTAL_LTTPR_CNT during detection if LTTPR is present
drm/amdgpu: Fix pci state save during mode-1 reset
drm/amdgpu/atomfirmware: fix parsing of vram_info
drm/amd/swsmu: add MALL init support workaround for smu_v14_0_1
drm/i915/gt: Fix potential UAF by revoke of fence registers
drm/panel: simple: Add missing display timing flags for KOE TX26D202VM0BWA
drm/fbdev-dma: Only set smem_start is enable per module option
filp->pid is supposed to be a refcounted pointer; however, before this
patch, drm_file_update_pid() only increments the refcount of a struct
pid after storing a pointer to it in filp->pid and dropping the
dev->filelist_mutex, making the following race possible:
process A process B
========= =========
begin drm_file_update_pid
mutex_lock(&dev->filelist_mutex)
rcu_replace_pointer(filp->pid, <pid B>, 1)
mutex_unlock(&dev->filelist_mutex)
begin drm_file_update_pid
mutex_lock(&dev->filelist_mutex)
rcu_replace_pointer(filp->pid, <pid A>, 1)
mutex_unlock(&dev->filelist_mutex)
get_pid(<pid A>)
synchronize_rcu()
put_pid(<pid B>) *** pid B reaches refcount 0 and is freed here ***
get_pid(<pid B>) *** UAF ***
synchronize_rcu()
put_pid(<pid A>)
As far as I know, this race can only occur with CONFIG_PREEMPT_RCU=y
because it requires RCU to detect a quiescent state in code that is not
explicitly calling into the scheduler.
This race leads to use-after-free of a "struct pid".
It is probably somewhat hard to hit because process A has to pass
through a synchronize_rcu() operation while process B is between
mutex_unlock() and get_pid().
Fix it by ensuring that by the time a pointer to the current task's pid
is stored in the file, an extra reference to the pid has been taken.
This fix also removes the condition for synchronize_rcu(); I think
that optimization is unnecessary complexity, since in that case we
would usually have bailed out on the lockless check above.
Linus Torvalds [Thu, 27 Jun 2024 19:35:30 +0000 (12:35 -0700)]
Merge tag 'pm-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Modify the intel_pstate driver to use HWP to initialize the ITMT
scheduler extension if ACPI CPPC cannot be used for that, which is the
case on some hybrid x86 systems (Rafael Wysocki)"
* tag 'pm-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Use HWP to initialize ITMT if CPPC is missing
Linus Torvalds [Thu, 27 Jun 2024 19:32:56 +0000 (12:32 -0700)]
Merge tag 'thermal-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Replace an earlier fix for a recent regression in the Step-Wise
thermal governor that was not effective in all of the relevant cases
(Rafael Wysocki)"
* tag 'thermal-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: gov_step_wise: Go straight to instance->lower when mitigation is over
Linus Torvalds [Thu, 27 Jun 2024 19:23:52 +0000 (12:23 -0700)]
Merge tag 'io_uring-6.10-20240627' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Removal of a struct member that's unused since the 6.10 merge window,
and a fix for a regression in SQPOLL wakeups, bringing it back to how
it worked before the SQPOLL local task_work"
* tag 'io_uring-6.10-20240627' of git://git.kernel.dk/linux:
io_uring: signal SQPOLL task_work with TWA_SIGNAL_NO_IPI
io_uring: remove dead struct io_submit_state member
* tag 'nvme-6.10-2024-06-27' of git://git.infradead.org/nvme:
nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]
nvmet: make 'tsas' attribute idempotent for RDMA
nvme: fixup comment for nvme RDMA Provider Type
nvme-apple: add missing MODULE_DESCRIPTION()
nvmet: do not return 'reserved' for empty TSAS values
nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.
Linus Torvalds [Thu, 27 Jun 2024 18:09:03 +0000 (11:09 -0700)]
Merge tag 's390-6.10-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Add missing virt_to_phys() conversion for directed interrupt bit
vectors
- Fix broken configuration change notifications for virtio-ccw
- Fix sclp_init() cleanup path on failure and as result - fix a list
double add warning
- Fix unconditional adjusting of GOT entries containing undefined weak
symbols that resolve to zero
* tag 's390-6.10-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/boot: Do not adjust GOT entries for undef weak sym
s390/sclp: Fix sclp_init() cleanup on failure
s390/virtio_ccw: Fix config change notifications
s390/pci: Add missing virt_to_phys() for directed DIBV
Linus Torvalds [Thu, 27 Jun 2024 17:53:52 +0000 (10:53 -0700)]
Merge tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic fixes from Arnd Bergmann:
"These are some bugfixes for system call ABI issues I found while
working on a cleanup series. None of these are urgent since these bugs
have gone unnoticed for many years, but I think we probably want to
backport them all to stable kernels, so it makes sense to have the
fixes included as early as possible.
One more fix addresses a compile-time warning in kallsyms that was
uncovered by a patch I did to enable additional warnings in 6.10. I
had mistakenly thought that this fix was already merged through the
module tree, but as Geert pointed out it was still missing"
* tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
kallsyms: rework symbol lookup return codes
linux/syscalls.h: add missing __user annotations
syscalls: mmap(): use unsigned offset type consistently
s390: remove native mmap2() syscall
hexagon: fix fadvise64_64 calling conventions
csky, hexagon: fix broken sys_sync_file_range
sh: rework sync_file_range ABI
powerpc: restore some missing spu syscalls
parisc: use generic sys_fanotify_mark implementation
parisc: use correct compat recv/recvfrom syscalls
sparc: fix compat recv/recvfrom syscalls
sparc: fix old compat_sys_select()
syscalls: fix compat_sys_io_pgetevents_time64 usage
ftruncate: pass a signed offset
Linus Torvalds [Thu, 27 Jun 2024 17:26:16 +0000 (10:26 -0700)]
Merge tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix quota root leak after quota disable failure
- fix condition when checking if a zone can be added as free
- allocate inode in NOFS context during logging or tree-log replay
- handle raid-stripe-tree lookup correctly during scrub
* tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: fix quota root leak after quota disable failure
btrfs: scrub: handle RST lookup error correctly
btrfs: zoned: fix initial free space detection
btrfs: use NOFS context when getting inodes during logging and log replay
Linus Torvalds [Thu, 27 Jun 2024 17:05:35 +0000 (10:05 -0700)]
Merge tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, bpf and netfilter.
There are a bunch of regressions addressed here, but hopefully nothing
spectacular. We are still waiting the driver fix from Intel, mentioned
by Jakub in the previous networking pull.
Current release - regressions:
- core: add softirq safety to netdev_rename_lock
- tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed
TFO
- batman-adv: fix RCU race at module unload time
Previous releases - regressions:
- openvswitch: get related ct labels from its master if it is not
confirmed
- eth: mlxsw: fix memory corruptions on spectrum-4 systems
- eth: ionic: use dev_consume_skb_any outside of napi
Previous releases - always broken:
- netfilter: fully validate NFT_DATA_VALUE on store to data registers
- unix: several fixes for OoB data
- tcp: fix race for duplicate reqsk on identical SYN
- bpf:
- fix may_goto with negative offset
- fix the corner case with may_goto and jump to the 1st insn
- fix overrunning reservations in ringbuf
- can:
- j1939: recover socket queue on CAN bus error during BAM
transmission
- mcp251xfd: fix infinite loop when xmit fails
- dsa: microchip: monitor potential faults in half-duplex mode
- eth: vxlan: pull inner IP header in vxlan_xmit_one()
- eth: ionic: fix kernel panic due to multi-buffer handling
Misc:
- selftest: unix tests refactor and a lot of new cases added"
* tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
net: mana: Fix possible double free in error handling path
selftest: af_unix: Check SIOCATMARK after every send()/recv() in msg_oob.c.
af_unix: Fix wrong ioctl(SIOCATMARK) when consumed OOB skb is at the head.
selftest: af_unix: Check EPOLLPRI after every send()/recv() in msg_oob.c
selftest: af_unix: Check SIGURG after every send() in msg_oob.c
selftest: af_unix: Add SO_OOBINLINE test cases in msg_oob.c
af_unix: Don't stop recv() at consumed ex-OOB skb.
selftest: af_unix: Add non-TCP-compliant test cases in msg_oob.c.
af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.
af_unix: Stop recv(MSG_PEEK) at consumed OOB skb.
selftest: af_unix: Add msg_oob.c.
selftest: af_unix: Remove test_unix_oob.c.
tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()
netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
net: usb: qmi_wwan: add Telit FN912 compositions
tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
ionic: use dev_consume_skb_any outside of napi
net: dsa: microchip: fix wrong register write when masking interrupt
Fix race for duplicate reqsk on identical SYN
ibmvnic: Add tx check to prevent skb leak
...
Linus Torvalds [Thu, 27 Jun 2024 16:34:09 +0000 (09:34 -0700)]
Merge tag 'sound-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became bigger than usual, as it receives a pile of pending ASoC
fixes. Most of changes are for device-specific issues while there are
a few core fixes that are all rather trivial:
- DMA-engine sync fixes
- Continued MIDI2 conversion fixes
- Various ASoC Intel SOF fixes
- A series of ASoC topology fixes for memory handling
- AMD ACP fix, curing a recent regression, too
- Platform / codec-specific fixes for mediatek, atmel, realtek, etc"
* tag 'sound-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (40 commits)
ASoC: rt5645: fix issue of random interrupt from push-button
ALSA: seq: Fix missing MSB in MIDI2 SPP conversion
ASoC: amd: yc: Fix non-functional mic on ASUS M5602RA
ALSA: hda/realtek: fix mute/micmute LEDs don't work for EliteBook 645/665 G11.
ALSA: hda/realtek: Fix conflicting quirk for PCI SSID 17aa:3820
ALSA: dmaengine_pcm: terminate dmaengine before synchronize
ALSA: hda/relatek: Enable Mute LED on HP Laptop 15-gw0xxx
ALSA: PCM: Allow resume only for suspended streams
ALSA: seq: Fix missing channel at encoding RPN/NRPN MIDI2 messages
ASoC: mediatek: mt8195: Add platform entry for ETDM1_OUT_BE dai link
ASoC: fsl-asoc-card: set priv->pdev before using it
ASoC: amd: acp: move chip->flag variable assignment
ASoC: amd: acp: remove i2s configuration check in acp_i2s_probe()
ASoC: amd: acp: add a null check for chip_pdev structure
ASoC: Intel: soc-acpi: mtl: fix speaker no sound on Dell SKU 0C64
ASoC: q6apm-lpass-dai: close graph on prepare errors
ASoC: cs35l56: Disconnect ASP1 TX sources when ASP1 DAI is hooked up
ASoC: topology: Fix route memory corruption
ASoC: rt722-sdca-sdw: add debounce time for type detection
ASoC: SOF: sof-audio: Skip unprepare for in-use widgets on error rollback
...
Building with W=1 in some configurations produces a false positive
warning for kallsyms:
kernel/kallsyms.c: In function '__sprint_symbol.isra':
kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict]
503 | strcpy(buffer, name);
| ^~~~~~~~~~~~~~~~~~~~
This originally showed up while building with -O3, but later started
happening in other configurations as well, depending on inlining
decisions. The underlying issue is that the local 'name' variable is
always initialized to the be the same as 'buffer' in the called functions
that fill the buffer, which gcc notices while inlining, though it could
see that the address check always skips the copy.
The calling conventions here are rather unusual, as all of the internal
lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
ftrace_func_address_lookup, module_address_lookup and
kallsyms_lookup_buildid) already use the provided buffer and either return
the address of that buffer to indicate success, or NULL for failure,
but the callers are written to also expect an arbitrary other buffer
to be returned.
Rework the calling conventions to return the length of the filled buffer
instead of its address, which is simpler and easier to follow as well
as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
unchanged, since that is called from 16 different functions and
adapting this would be a much bigger change.
Kent Gibson [Wed, 26 Jun 2024 05:29:23 +0000 (13:29 +0800)]
gpiolib: cdev: Ignore reconfiguration without direction
linereq_set_config() behaves badly when direction is not set.
The configuration validation is borrowed from linereq_create(), where,
to verify the intent of the user, the direction must be set to in order to
effect a change to the electrical configuration of a line. But, when
applied to reconfiguration, that validation does not allow for the unset
direction case, making it possible to clear flags set previously without
specifying the line direction.
Adding to the inconsistency, those changes are not immediately applied by
linereq_set_config(), but will take effect when the line value is next get
or set.
For example, by requesting a configuration with no flags set, an output
line with GPIO_V2_LINE_FLAG_ACTIVE_LOW and GPIO_V2_LINE_FLAG_OPEN_DRAIN
set could have those flags cleared, inverting the sense of the line and
changing the line drive to push-pull on the next line value set.
Skip the reconfiguration of lines for which the direction is not set, and
only reconfigure the lines for which direction is set.
Kent Gibson [Wed, 26 Jun 2024 05:29:22 +0000 (13:29 +0800)]
gpiolib: cdev: Disallow reconfiguration without direction (uAPI v1)
linehandle_set_config() behaves badly when direction is not set.
The configuration validation is borrowed from linehandle_create(), where,
to verify the intent of the user, the direction must be set to in order
to effect a change to the electrical configuration of a line. But, when
applied to reconfiguration, that validation does not allow for the unset
direction case, making it possible to clear flags set previously without
specifying the line direction.
Adding to the inconsistency, those changes are not immediately applied by
linehandle_set_config(), but will take effect when the line value is next
get or set.
For example, by requesting a configuration with no flags set, an output
line with GPIOHANDLE_REQUEST_ACTIVE_LOW and GPIOHANDLE_REQUEST_OPEN_DRAIN
requested could have those flags cleared, inverting the sense of the line
and changing the line drive to push-pull on the next line value set.
Ensure the intent of the user by disallowing configurations which do not
have direction set, returning an error to userspace to indicate that the
configuration is invalid.
And, for clarity, use lflags, a local copy of gcnf.flags, throughout when
dealing with the requested flags, rather than a mixture of both.
Jos Wang [Wed, 19 Jun 2024 11:45:29 +0000 (19:45 +0800)]
usb: dwc3: core: Workaround for CSR read timeout
This is a workaround for STAR 4846132, which only affects
DWC_usb31 version2.00a operating in host mode.
There is a problem in DWC_usb31 version 2.00a operating
in host mode that would cause a CSR read timeout When CSR
read coincides with RAM Clock Gating Entry. By disable
Clock Gating, sacrificing power consumption for normal
operation.
This commit breaks u_ether on some setups (at least Merrifield). The fix
"usb: gadget: u_ether: Re-attach netif device to mirror detachment" party
restores u-ether. However the netif usb: remains up even usb is switched
from device to host mode. This creates problems for user space as the
interface remains in the routing table while not realy present and network
managers (connman) not detecting a network change.
Various attempts to find the root cause were unsuccesful up to now. Therefore
revert until a solution is found.
staging: vchiq_debugfs: Fix build if CONFIG_DEBUG_FS is not set
Commit 42a2f6664e18 ("staging: vc04_services: Move global g_state to
vchiq_state") adds a parameter to vchiq_debugfs_init, but leaves the
dummy implementation in the !CONFIG_DEBUG_FS case untouched, causing a
compile time error.
Paolo Abeni [Thu, 27 Jun 2024 11:00:50 +0000 (13:00 +0200)]
Merge tag 'nf-24-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two Netfilter fixes for net:
Patch #1 fixes CONFIG_SYSCTL=n for a patch coming in the previous PR
to move the sysctl toggle to enable SRv6 netfilter hooks from
nf_conntrack to the core, from Jianguo Wu.
Patch #2 fixes a possible pointer leak to userspace due to insufficient
validation of NFT_DATA_VALUE.
Linus found this pointer leak to userspace via zdi-disclosures@ and
forwarded the notice to Netfilter maintainers, he appears as reporter
because whoever found this issue never approached Netfilter
maintainers neither via security@ nor in private.
netfilter pull request 24-06-27
* tag 'nf-24-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
netfilter: fix undefined reference to 'netfilter_lwtunnel_*' when CONFIG_SYSCTL=n
====================
Ma Ke [Tue, 25 Jun 2024 13:03:14 +0000 (21:03 +0800)]
net: mana: Fix possible double free in error handling path
When auxiliary_device_add() returns error and then calls
auxiliary_device_uninit(), callback function adev_release
calls kfree(madev). We shouldn't call kfree(madev) again
in the error handling path. Set 'madev' to NULL.
Vasant Hegde [Fri, 21 Jun 2024 10:15:33 +0000 (10:15 +0000)]
iommu/amd: Fix GT feature enablement again
Current code configures GCR3 even when device is attached to identity
domain. So that we can support SVA with identity domain. This means in
attach device path it updates Guest Translation related bits in DTE.
Commit de111f6b4f6a ("iommu/amd: Enable Guest Translation after reading
IOMMU feature register") missed to enable Control[GT] bit in resume
path. Its causing certain laptop to fail to resume after suspend.
This is because we have inconsistency between between control register
(GT is disabled) and DTE (where we have enabled guest translation related
bits) in resume path. And IOMMU hardware throws ILLEGAL_DEV_TABLE_ENTRY.
Lu Baolu [Thu, 20 Jun 2024 06:29:40 +0000 (14:29 +0800)]
iommu/vt-d: Fix missed device TLB cache tag
When a domain is attached to a device, the required cache tags are
assigned to the domain so that the related caches can be flushed
whenever it is needed. The device TLB cache tag is created based
on whether the ats_enabled field of the device's iommu data is set.
This creates an ordered dependency between cache tag assignment and
ATS enabling.
The device TLB cache tag would not be created if device's ATS is
enabled after the cache tag assignment. This causes devices with PCI
ATS support to malfunction.
The ATS control is exclusively owned by the iommu driver. Hence, move
cache_tag_assign_domain() after PCI ATS enabling to make sure that the
device TLB cache tag is created for the domain.
Vasant Hegde [Thu, 20 Jun 2024 06:05:52 +0000 (06:05 +0000)]
iommu/amd: Invalidate cache before removing device from domain list
Commit 87a6f1f22c97 ("iommu/amd: Introduce per-device domain ID to fix
potential TLB aliasing issue") introduced per device domain ID when
domain is configured with v2 page table. And in invalidation path, it
uses per device structure (dev_data->gcr3_info.domid) to get the domain ID.
In detach_device() path, current code tries to invalidate IOMMU cache
after removing dev_data from domain device list. This means when domain
is configured with v2 page table, amd_iommu_domain_flush_all() will not be
able to invalidate cache as device is already removed from domain device
list.
This is causing change domain tests (changing domain type from identity to DMA)
to fail with IO_PAGE_FAULT issue.
Hence invalidate cache and update DTE before updating data structures.
af_unix: Fix wrong ioctl(SIOCATMARK) when consumed OOB skb is at the head.
Even if OOB data is recv()ed, ioctl(SIOCATMARK) must return 1 when the
OOB skb is at the head of the receive queue and no new OOB data is queued.
Without fix:
# RUN msg_oob.no_peek.oob ...
# msg_oob.c:305:oob:Expected answ[0] (0) == oob_head (1)
# oob: Test terminated by assertion
# FAIL msg_oob.no_peek.oob
not ok 2 msg_oob.no_peek.oob
With fix:
# RUN msg_oob.no_peek.oob ...
# OK msg_oob.no_peek.oob
ok 2 msg_oob.no_peek.oob
selftest: af_unix: Check SIGURG after every send() in msg_oob.c
When data is sent with MSG_OOB, SIGURG is sent to a process if the
receiver socket has set its owner to the process by ioctl(FIOSETOWN)
or fcntl(F_SETOWN).
This patch adds SIGURG check after every send(MSG_OOB) call.
Now the consumed OOB skb stays at the head of recvq to return a correct
value for ioctl(SIOCATMARK), which is broken now and fixed by a later
patch.
Then, if peer issues recv() with MSG_DONTWAIT, manage_oob() returns NULL,
so recv() ends up with -EAGAIN.
>>> c2.setblocking(False) # This causes -EAGAIN even with available data
>>> c2.recv(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
However, next recv() will return the following available data, "world".
>>> c2.recv(5)
b'world'
When the consumed OOB skb is at the head of the queue, we need to fetch
the next skb to fix the weird behaviour.
Note that the issue does not happen without MSG_DONTWAIT because we can
retry after manage_oob().
This patch also adds a test case that covers the issue.
Without fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# msg_oob.c:134:ex_oob_break:AF_UNIX :Resource temporarily unavailable
# msg_oob.c:135:ex_oob_break:Expected:ld
# msg_oob.c:137:ex_oob_break:Expected ret[0] (-1) == expected_len (2)
# ex_oob_break: Test terminated by assertion
# FAIL msg_oob.no_peek.ex_oob_break
not ok 8 msg_oob.no_peek.ex_oob_break
With fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# OK msg_oob.no_peek.ex_oob_break
ok 8 msg_oob.no_peek.ex_oob_break
Yunseong Kim [Mon, 24 Jun 2024 17:33:23 +0000 (02:33 +0900)]
tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()
In the TRACE_EVENT(qdisc_reset) NULL dereference occurred from
qdisc->dev_queue->dev <NULL> ->name
This situation simulated from bunch of veths and Bluetooth disconnection
and reconnection.
During qdisc initialization, qdisc was being set to noop_queue.
In veth_init_queue, the initial tx_num was reduced back to one,
causing the qdisc reset to be called with noop, which led to the kernel
panic.
I've attached the GitHub gist link that C converted syz-execprogram
source code and 3 log of reproduced vmcore-dmesg.
However, without our patch applied, I tested upstream 6.10.0-rc3 kernel
using the qdisc_reset event and the ip command on my qemu virtual machine.
This 2 commands makes always kernel panic.
Linux version: 6.10.0-rc3
[ 0.000000] Linux version 6.10.0-rc3-00164-g44ef20baed8e-dirty
(paran@fedora) (gcc (GCC) 14.1.1 20240522 (Red Hat 14.1.1-4), GNU ld
version 2.41-34.fc40) #20 SMP PREEMPT Sat Jun 15 16:51:25 KST 2024
Yeoreum and I don't know if the patch we wrote will fix the underlying
cause, but we think that priority is to prevent kernel panic happening.
So, we're sending this patch.
Dave Airlie [Thu, 27 Jun 2024 07:22:13 +0000 (17:22 +1000)]
Merge tag 'drm-misc-fixes-2024-06-26' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.10-rc6:
- nouveau tv mode fixes.
- Add KOE TX26D202VM0BWA timings.
- Fix fb_info when vmalloc is used, regression from
CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM.
Linus Torvalds [Thu, 27 Jun 2024 00:51:39 +0000 (17:51 -0700)]
Merge tag 'mm-hotfixes-stable-2024-06-26-17-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes, 7 are cc:stable.
All are MM related apart from a MAINTAINERS update. There is no
identifiable theme here - just singleton patches in various places"
* tag 'mm-hotfixes-stable-2024-06-26-17-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory: don't require head page for do_set_pmd()
mm/page_alloc: Separate THP PCP into movable and non-movable categories
nfs: drop the incorrect assertion in nfs_swap_rw()
mm/migrate: make migrate_pages_batch() stats consistent
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
selftests/mm:fix test_prctl_fork_exec return failure
mm: convert page type macros to enum
ocfs2: fix DIO failure due to insufficient transaction credits
kasan: fix bad call to unpoison_slab_object
mm: handle profiling for fake memory allocations during compaction
mm/slab: fix 'variable obj_exts set but not used' warning
/proc/pid/smaps: add mseal info for vma
mm: fix incorrect vbq reference in purge_fragmented_block