Wentong Wu [Tue, 25 Jun 2024 08:10:44 +0000 (16:10 +0800)]
mei: vsc: Prevent timeout error with added delay post-firmware download
After completing the firmware download, the firmware requires some
time to become functional. This change introduces additional sleep
time before the first read operation to prevent a confusing timeout
error in vsc_tp_xfer().
Wentong Wu [Tue, 25 Jun 2024 08:10:43 +0000 (16:10 +0800)]
mei: vsc: Enhance IVSC chipset stability during warm reboot
During system shutdown, incorporate reset logic to ensure the IVSC
chipset remains in a valid state. This adjustment guarantees that
the IVSC chipset operates in a known state following a warm reboot.
tcp: Don't flag tcp_sk(sk)->rx_opt.saw_unknown for TCP AO.
When we process segments with TCP AO, we don't check it in
tcp_parse_options(). Thus, opt_rx->saw_unknown is set to 1,
which unconditionally triggers the BPF TCP option parser.
Matt Roper [Wed, 26 Jun 2024 21:05:37 +0000 (14:05 -0700)]
drm/xe/mcr: Avoid clobbering DSS steering
A couple copy/paste mistakes in the code that selects steering targets
for OADDRM and INSTANCE0 unintentionally clobbered the steering target
for DSS ranges in some cases.
The OADDRM/INSTANCE0 values were also not assigned as intended, although
that mistake wound up being harmless since the desired values for those
specific ranges were '0' which the kzalloc of the GT structure should
have already taken care of implicitly.
Thomas Hellström [Fri, 28 Jun 2024 15:38:48 +0000 (17:38 +0200)]
drm/ttm: Always take the bo delayed cleanup path for imported bos
Bos can be put with multiple unrelated dma-resv locks held. But
imported bos attempt to grab the bo dma-resv during dma-buf detach
that typically happens during cleanup. That leads to lockde splats
similar to the below and a potential ABBA deadlock.
Fix this by always taking the delayed workqueue cleanup path for
imported bos.
Requesting stable fixes from when the Xe driver was introduced,
since its usage of drm_exec and wide vm dma_resvs appear to be
the first reliable trigger of this.
[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ #10 Tainted: G U W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438]
but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442]
other info that might help us debug this:
[22982.116442] Possible unsafe locking scenario:
Ryusuke Konishi [Fri, 28 Jun 2024 16:51:07 +0000 (01:51 +0900)]
nilfs2: fix kernel bug on rename operation of broken directory
Syzbot reported that in rename directory operation on broken directory on
nilfs2, __block_write_begin_int() called to prepare block write may fail
BUG_ON check for access exceeding the folio/page size.
This is because nilfs_dotdot(), which gets parent directory reference
entry ("..") of the directory to be moved or renamed, does not check
consistency enough, and may return location exceeding folio/page size for
broken directories.
Fix this issue by checking required directory entries ("." and "..") in
the first chunk of the directory in nilfs_dotdot().
Yu Zhao [Thu, 27 Jun 2024 22:27:05 +0000 (16:27 -0600)]
mm/hugetlb_vmemmap: fix race with speculative PFN walkers
While investigating HVO for THPs [1], it turns out that speculative PFN
walkers like compaction can race with vmemmap modifications, e.g.,
CPU 1 (vmemmap modifier) CPU 2 (speculative PFN walker)
------------------------------- ------------------------------
Allocates an LRU folio page1
Sees page1
Frees page1
Allocates a hugeTLB folio page2
(page1 being a tail of page2)
Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
still try to modify this read-only field, resulting in a crash.
An independent report [2] confirmed this race.
There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
page->_refcount through the old vmemmap or non-zero page->_refcount
through the new one.
The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
similar races in other places, e.g., arch_remove_memory() on x86
[3], which frees vmemmap mapping offlined struct page[].
While adding synchronize_rcu(), the goal is to be surgical, rather than
optimized. Specifically, calls to synchronize_rcu() on the error handling
paths can be coalesced, but it is not done for the sake of Simplicity:
noticeably, this fix removes ~50% more lines than it adds.
According to the hugetlb_optimize_vmemmap section in
Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
freeing hugeTLB pages "~2x slower than before". Having synchronize_rcu()
on top makes those operations even worse, and this also affects the user
interface /proc/sys/vm/nr_overcommit_hugepages.
This is *very* hard to trigger:
1. Most hugeTLB use cases I know of are static, i.e., reserved at
boot time, because allocating at runtime is not reliable at all.
2. On top of that, someone has to be very unlucky to get tripped
over above, because the race window is so small -- I wasn't able to
trigger it with a stress testing that does nothing but that (with
THPs though).
Nhat Pham [Thu, 27 Jun 2024 20:17:37 +0000 (13:17 -0700)]
cachestat: do not flush stats in recency check
syzbot detects that cachestat() is flushing stats, which can sleep, in its
RCU read section (see [1]). This is done in the workingset_test_recent()
step (which checks if the folio's eviction is recent).
Move the stat flushing step to before the RCU read section of cachestat,
and skip stat flushing during the recency check.
Gavin Shan [Thu, 27 Jun 2024 00:39:52 +0000 (10:39 +1000)]
mm/shmem: disable PMD-sized page cache if needed
For shmem files, it's possible that PMD-sized page cache can't be
supported by xarray. For example, 512MB page cache on ARM64 when the base
page size is 64KB can't be supported by xarray. It leads to errors as the
following messages indicate when this sort of xarray entry is split.
Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is larger
than MAX_PAGECACHE_ORDER. As Matthew Wilcox pointed, the page cache in a
shmem file isn't represented by a multi-index entry and doesn't have this
limitation when the xarry entry is split until commit 6b24ca4a1a8d ("mm:
Use multi-index entries in the page cache").
Gavin Shan [Thu, 27 Jun 2024 00:39:51 +0000 (10:39 +1000)]
mm/filemap: skip to create PMD-sized page cache if needed
On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB. The
PMD-sized page cache can't be supported by xarray as the following error
messages indicate.
Fix it by skipping to allocate PMD-sized page cache when its size is
larger than MAX_PAGECACHE_ORDER. For this specific case, we will fall to
regular path where the readahead window is determined by BDI's sysfs file
(read_ahead_kb).
Gavin Shan [Thu, 27 Jun 2024 00:39:50 +0000 (10:39 +1000)]
mm/readahead: limit page cache size in page_cache_ra_order()
In page_cache_ra_order(), the maximal order of the page cache to be
allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise, it's
possible the large page cache can't be supported by xarray when the
corresponding xarray entry is split.
For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is
64KB. The PMD-sized page cache can't be supported by xarray.
Gavin Shan [Thu, 27 Jun 2024 00:39:49 +0000 (10:39 +1000)]
mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray
Patch series "mm/filemap: Limit page cache size to that supported by
xarray", v2.
Currently, xarray can't support arbitrary page cache size. More details
can be found from the WARN_ON() statement in xas_split_alloc(). In our
test whose code is attached below, we hit the WARN_ON() on ARM64 system
where the base page size is 64KB and huge page size is 512MB. The issue
was reported long time ago and some discussions on it can be found here
[1].
In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
supported by xarray and avoid PMD-sized page cache if needed. The code
changes are suggested by David Hildenbrand.
PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
PATCH[4] avoids PMD-sized page cache for shmem files if needed
The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64 with
64KB base page size. The xarray entry with this order can't be split as
the following error messages indicate.
Fix it by decreasing MAX_PAGECACHE_ORDER to the largest supported order
by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from
13 to 11 when CONFIG_BASE_SMALL is disabled.
SeongJae Park [Mon, 24 Jun 2024 17:58:14 +0000 (10:58 -0700)]
mm/damon/core: merge regions aggressively when max_nr_regions is unmet
DAMON keeps the number of regions under max_nr_regions by skipping regions
split operations when doing so can make the number higher than the limit.
It works well for preventing violation of the limit. But, if somehow the
violation happens, it cannot recovery well depending on the situation. In
detail, if the real number of regions having different access pattern is
higher than the limit, the mechanism cannot reduce the number below the
limit. In such a case, the system could suffer from high monitoring
overhead of DAMON.
The violation can actually happen. For an example, the user could reduce
max_nr_regions while DAMON is running, to be lower than the current number
of regions. Fix the problem by repeating the merge operations with
increasing aggressiveness in kdamond_merge_regions() for the case, until
the limit is met.
Audra Mitchell [Wed, 26 Jun 2024 13:05:11 +0000 (09:05 -0400)]
Fix userfaultfd_api to return EINVAL as expected
Currently if we request a feature that is not set in the Kernel config we
fail silently and return all the available features. However, the man
page indicates we should return an EINVAL.
We need to fix this issue since we can end up with a Kernel warning should
a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with
the config not set with this feature.
mm: vmalloc: check if a hash-index is in cpu_possible_mask
The problem is that there are systems where cpu_possible_mask has gaps
between set CPUs, for example SPARC. In this scenario addr_to_vb_xa()
hash function can return an index which accesses to not-possible and not
setup CPU area using per_cpu() macro. This results in an oops on SPARC.
A per-cpu vmap_block_queue is also used as hash table, incorrectly
assuming the cpu_possible_mask has no gaps. Fix it by adjusting an index
to a next possible CPU.
Waiman Long [Wed, 26 Jun 2024 00:16:39 +0000 (20:16 -0400)]
mm: prevent derefencing NULL ptr in pfn_section_valid()
Commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") changed pfn_section_valid() to add a READ_ONCE()
call around "ms->usage" to fix a race with section_deactivate() where
ms->usage can be cleared. The READ_ONCE() call, by itself, is not enough
to prevent NULL pointer dereference. We need to check its value before
dereferencing it.
This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in
folio_ref_try_add_rcu() for non-SMP kernel.
The process_vm_readv() calls GUP to pin the THP. An optimization for
pinning THP instroduced by commit 57edfcfd3419 ("mm/gup: accelerate thp
gup even for "pages != NULL"") calls try_grab_folio() to pin the THP,
but try_grab_folio() is supposed to be called in atomic context for
non-SMP kernel, for example, irq disabled or preemption disabled, due to
the optimization introduced by commit e286781d5f2e ("mm: speculative
page references").
The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") is not actually the root cause although it was bisected to.
It just makes the problem exposed more likely.
The follow up discussion suggested the optimization for non-SMP kernel
may be out-dated and not worth it anymore [1]. So removing the
optimization to silence the BUG.
However calling try_grab_folio() in GUP slow path actually is
unnecessary, so the following patch will clean this up.
Namjae Jeon [Sun, 23 Jun 2024 23:39:23 +0000 (08:39 +0900)]
ksmbd: return FILE_DEVICE_DISK instead of super magic
MS-SMB2 specification describes setting ->DeviceType to FILE_DEVICE_DISK
or FILE_DEVICE_CD_ROM. Set FILE_DEVICE_DISK instead of super magic in
FS_DEVICE_INFORMATION. And Set FILE_READ_ONLY_DEVICE for read-only share.
====================
fix OOM and order check in msg_zerocopy selftest
In selftests/net/msg_zerocopy.c, it has a while loop keeps calling sendmsg
on a socket with MSG_ZEROCOPY flag, and it will recv the notifications
until the socket is not writable. Typically, it will start the receiving
process after around 30+ sendmsgs. However, as the introduction of commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is
always writable and does not get any chance to run recv notifications.
The selftest always exits with OUT_OF_MEMORY because the memory used by
opt_skb exceeds the net.core.optmem_max. Meanwhile, it could be set to a
different value to trigger OOM on older kernels too.
Thus, we introduce "cfg_notification_limit" to force sender to receive
notifications after some number of sendmsgs.
And, we find that when lock debugging is on, notifications may not come in
order. Thus, we have order checking outputs managed by cfg_verbose, to
avoid too many outputs in this case.
====================
selftests: make order checking verbose in msg_zerocopy selftest
We find that when lock debugging is on, notifications may not come in
order. Thus, we have order checking outputs managed by cfg_verbose, to
avoid too many outputs in this case.
In selftests/net/msg_zerocopy.c, it has a while loop keeps calling sendmsg
on a socket with MSG_ZEROCOPY flag, and it will recv the notifications
until the socket is not writable. Typically, it will start the receiving
process after around 30+ sendmsgs. However, as the introduction of commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is
always writable and does not get any chance to run recv notifications.
The selftest always exits with OUT_OF_MEMORY because the memory used by
opt_skb exceeds the net.core.optmem_max. Meanwhile, it could be set to a
different value to trigger OOM on older kernels too.
Thus, we introduce "cfg_notification_limit" to force sender to receive
notifications after some number of sendmsgs.
====================
Intel Wired LAN Driver Updates 2024-06-25 (ice)
This series contains updates to ice driver only.
Milena adds disabling of extts events when PTP is disabled.
Jake prevents possible NULL pointer by checking that timestamps are
ready before processing extts events and adds checks for unsupported
PTP pin configuration.
Petr Oros replaces _test_bit() with the correct test_bit() macro.
v1: https://lore.kernel.org/netdev/20240625170248[email protected]/
====================
Petr Oros [Tue, 2 Jul 2024 17:14:57 +0000 (10:14 -0700)]
ice: use proper macro for testing bit
Do not use _test_bit() macro for testing bit. The proper macro for this
is one without underline.
_test_bit() is what test_bit() was prior to const-optimization. It
directly calls arch_test_bit(), i.e. the arch-specific implementation
(or the generic one). It's strictly _internal_ and shouldn't be used
anywhere outside the actual test_bit() macro.
test_bit() is a wrapper which checks whether the bitmap and the bit
number are compile-time constants and if so, it calls the optimized
function which evaluates this call to a compile-time constant as well.
If either of them is not a compile-time constant, it just calls _test_bit().
test_bit() is the actual function to use anywhere in the kernel.
The sensors is not a compile-time constant, thus most probably there
are no object code changes before and after the patch.
But anyway, we shouldn't call internal wrappers instead of
the actual API.
Jacob Keller [Tue, 2 Jul 2024 17:14:56 +0000 (10:14 -0700)]
ice: Reject pin requests with unsupported flags
The driver receives requests for configuring pins via the .enable
callback of the PTP clock object. These requests come into the driver
with flags which modify the requested behavior from userspace. Current
implementation in ice does not reject flags that it doesn't support.
This causes the driver to incorrectly apply requests with such flags as
PTP_PEROUT_DUTY_CYCLE, or any future flags added by the kernel which it
is not yet aware of.
Fix this by properly validating flags in both ice_ptp_cfg_perout and
ice_ptp_cfg_extts. Ensure that we check by bit-wise negating supported
flags rather than just checking and rejecting known un-supported flags.
This is preferable, as it ensures better compatibility with future
kernels.
Jacob Keller [Tue, 2 Jul 2024 17:14:55 +0000 (10:14 -0700)]
ice: Don't process extts if PTP is disabled
The ice_ptp_extts_event() function can race with ice_ptp_release() and
result in a NULL pointer dereference which leads to a kernel panic.
Panic occurs because the ice_ptp_extts_event() function calls
ptp_clock_event() with a NULL pointer. The ice driver has already
released the PTP clock by the time the interrupt for the next external
timestamp event occurs.
To fix this, modify the ice_ptp_extts_event() function to check the
PTP state and bail early if PTP is not ready.
Extts events are disabled and enabled by the application ts2phc.
However, in case where the driver is removed when the application is
running, a specific extts event remains enabled and can cause a kernel
crash.
As a side effect, when the driver is reloaded and application is started
again, remaining extts event for the channel from a previous run will
keep firing and the message "extts on unexpected channel" might be
printed to the user.
To avoid that, extts events shall be disabled when PTP is released.
Shigeru Yoshida [Tue, 2 Jul 2024 16:04:27 +0000 (01:04 +0900)]
af_unix: Fix uninit-value in __unix_walk_scc()
KMSAN reported uninit-value access in __unix_walk_scc() [1].
In the list_for_each_entry_reverse() loop, when the vertex's index
equals it's scc_index, the loop uses the variable vertex as a
temporary variable that points to a vertex in scc. And when the loop
is finished, the variable vertex points to the list head, in this case
scc, which is a local variable on the stack (more precisely, it's not
even scc and might underflow the call stack of __unix_walk_scc():
container_of(&scc, struct unix_vertex, scc_entry)).
However, the variable vertex is used under the label prev_vertex. So
if the edge_stack is not empty and the function jumps to the
prev_vertex label, the function will access invalid data on the
stack. This causes the uninit-value access issue.
Fix this by introducing a new temporary variable for the loop.
Sam Sun [Tue, 2 Jul 2024 13:55:55 +0000 (14:55 +0100)]
bonding: Fix out-of-bounds read in bond_option_arp_ip_targets_set()
In function bond_option_arp_ip_targets_set(), if newval->string is an
empty string, newval->string+1 will point to the byte after the
string, causing an out-of-bound read.
Radu Rendec [Tue, 2 Jul 2024 21:08:37 +0000 (17:08 -0400)]
net: rswitch: Avoid use-after-free in rswitch_poll()
The use-after-free is actually in rswitch_tx_free(), which is inlined in
rswitch_poll(). Since `skb` and `gq->skbs[gq->dirty]` are in fact the
same pointer, the skb is first freed using dev_kfree_skb_any(), then the
value in skb->len is used to update the interface statistics.
Let's move around the instructions to use skb->len before the skb is
freed.
This bug is trivial to reproduce using KFENCE. It will trigger a splat
every few packets. A simple ARP request or ICMP echo request is enough.
Boris Burkov [Tue, 2 Jul 2024 14:31:14 +0000 (07:31 -0700)]
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Boris Burkov [Tue, 2 Jul 2024 14:31:13 +0000 (07:31 -0700)]
btrfs: fix folio refcount in btrfs_do_encoded_write()
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff00000000000000000000000000dead0000000001220000000000000000
raw: 0000000000061be5000000000000000000000001ffffffff0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
netfilter: nf_tables: unconditionally flush pending work before notifier
syzbot reports:
KASAN: slab-uaf in nft_ctx_update include/net/netfilter/nf_tables.h:1831
KASAN: slab-uaf in nft_commit_release net/netfilter/nf_tables_api.c:9530
KASAN: slab-uaf int nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597
Read of size 2 at addr ffff88802b0051c4 by task kworker/1:1/45
[..]
Workqueue: events nf_tables_trans_destroy_work
Call Trace:
nft_ctx_update include/net/netfilter/nf_tables.h:1831 [inline]
nft_commit_release net/netfilter/nf_tables_api.c:9530 [inline]
nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597
Problem is that the notifier does a conditional flush, but its possible
that the table-to-be-removed is still referenced by transactions being
processed by the worker, so we need to flush unconditionally.
We could make the flush_work depend on whether we found a table to delete
in nf-next to avoid the flush for most cases.
AFAICS this problem is only exposed in nf-next, with
commit e169285f8c56 ("netfilter: nf_tables: do not store nft_ctx in transaction objects"),
with this commit applied there is an unconditional fetch of
table->family which is whats triggering the above splat.
Fixes: 2c9f0293280e ("netfilter: nf_tables: flush pending destroy work before netlink notifier") Reported-and-tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=4fd66a69358fc15ae2ad Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isr
When del_timer_sync() is called in an interrupt context it throws a warning
because of potential deadlock. The timer is used only to exit from
wait_for_completion() after a timeout so replacing the call with
wait_for_completion_timeout() allows to remove the problematic timer and
its related functions altogether.
Fixes: 41561f28e76a ("i2c: New Philips PNX bus driver") Signed-off-by: Piotr Wojtaszczyk <[email protected]> Signed-off-by: Andi Shyti <[email protected]>
Since this ioctl hasn't been in a full release yet, change it from
"T", 0x1 to "R" 0x20, and also reserve 0x20-0x2F for future ioctl
commands, as some more are being worked on for the future"
* tag 'trace-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have memmapped ring buffer use ioctl of "R" range 0x20-2F
tracing: Have memmapped ring buffer use ioctl of "R" range 0x20-2F
To prevent conflicts with other ioctl numbers to allow strace to have an
idea of what is happening, add the range of ioctls for the trace buffer
mapping from _IO("T", 0x1) to the range of "R" 0x20 - 0x2F.
Song Shuai [Wed, 26 Jun 2024 02:33:16 +0000 (10:33 +0800)]
riscv: kexec: Avoid deadlock in kexec crash path
If the kexec crash code is called in the interrupt context, the
machine_kexec_mask_interrupts() function will trigger a deadlock while
trying to acquire the irqdesc spinlock and then deactivate irqchip in
irq_set_irqchip_state() function.
Unlike arm64, riscv only requires irq_eoi handler to complete EOI and
keeping irq_set_irqchip_state() will only leave this possible deadlock
without any use. So we simply remove it.
Puranjay Mohan [Tue, 18 Jun 2024 14:58:20 +0000 (14:58 +0000)]
riscv: stacktrace: fix usage of ftrace_graph_ret_addr()
ftrace_graph_ret_addr() takes an `idx` integer pointer that is used to
optimize the stack unwinding. Pass it a valid pointer to utilize the
optimizations that might be available in the future.
The commit is making riscv's usage of ftrace_graph_ret_addr() match
x86_64.
This series contains 3 fixes out of which the first one is a new fix
for invalid event data reported in lkml[2]. The last two are v3 of Samuel's
patch[1]. I added the RB/TB/Fixes tag and moved 1 unrelated change
to its own patch. I also changed an error message in kvm vcpu_pmu from
pr_err to pr_debug to avoid redundant failure error messages generated
due to the boot time quering of events implemented in the patch[1]
Here is the original cover letter for the patch[1]
Before this patch:
$ perf list hw
List of pre-defined events (to be used in -e or -M):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
* b4-shazam-merge:
perf: RISC-V: Check standard event availability
drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus
drivers/perf: riscv: Do not update the event data if uptodate
Samuel Holland [Fri, 28 Jun 2024 07:51:43 +0000 (00:51 -0700)]
perf: RISC-V: Check standard event availability
The RISC-V SBI PMU specification defines several standard hardware and
cache events. Currently, all of these events are exposed to userspace,
even when not actually implemented. They appear in the `perf list`
output, and commands like `perf stat` try to use them.
This is more than just a cosmetic issue, because the PMU driver's .add
function fails for these events, which causes pmu_groups_sched_in() to
prematurely stop scheduling in other (possibly valid) hardware events.
Add logic to check which events are supported by the hardware (i.e. can
be mapped to some counter), so only usable events are reported to
userspace. Since the kernel does not know the mapping between events and
possible counters, this check must happen during boot, when no counters
are in use. Make the check asynchronous to minimize impact on boot time.
Samuel Holland [Fri, 28 Jun 2024 07:51:42 +0000 (00:51 -0700)]
drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus
Currently, we stop all the counters while a new cpu is brought online.
However, the hpmevent to counter mappings are not reset. The firmware may
have some stale encoding in their mapping structure which may lead to
undesirable results. We have not encountered such scenario though.
Atish Patra [Fri, 28 Jun 2024 07:51:41 +0000 (00:51 -0700)]
drivers/perf: riscv: Do not update the event data if uptodate
In case of an counter overflow, the event data may get corrupted
if called from an external overflow handler. This happens because
we can't update the counter without starting it when SBI PMU
extension is in use. However, the prev_count has been already
updated at the first pass while the counter value is still the
old one.
The solution is simple where we don't need to update it again
if it is already updated which can be detected using hwc state.
The event state in the overflow handler is updated in the following
patch. Thus, this fix can't be backported to kernel version where
overflow support was added.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:35 +0000 (14:11 +0900)]
nilfs2: fix incorrect inode allocation from reserved inodes
If the bitmap block that manages the inode allocation status is corrupted,
nilfs_ifile_create_inode() may allocate a new inode from the reserved
inode area where it should not be allocated.
Previous fix commit d325dc6eb763 ("nilfs2: fix use-after-free bug of
struct nilfs_root"), fixed the problem that reserved inodes with inode
numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
bitmap corruption, but since the start number of non-reserved inodes is
read from the super block and may change, in which case inode allocation
may occur from the extended reserved inode area.
If that happens, access to that inode will cause an IO error, causing the
file system to degrade to an error state.
Fix this potential issue by adding a wraparound option to the common
metadata object allocation routine and by modifying
nilfs_ifile_create_inode() to disable the option so that it only allocates
inodes with inode numbers greater than or equal to the inode number read
in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
inodes.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:34 +0000 (14:11 +0900)]
nilfs2: add missing check for inode numbers on directory entries
Syzbot reported that mounting and unmounting a specific pattern of
corrupted nilfs2 filesystem images causes a use-after-free of metadata
file inodes, which triggers a kernel bug in lru_add_fn().
As Jan Kara pointed out, this is because the link count of a metadata file
gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
tries to delete that inode (ifile inode in this case).
The inconsistency occurs because directories containing the inode numbers
of these metadata files that should not be visible in the namespace are
read without checking.
Fix this issue by treating the inode numbers of these internal files as
errors in the sanity check helper when reading directory folios/pages.
Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
analysis.
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:33 +0000 (14:11 +0900)]
nilfs2: fix inode number range checks
Patch series "nilfs2: fix potential issues related to reserved inodes".
This series fixes one use-after-free issue reported by syzbot, caused by
nilfs2's internal inode being exposed in the namespace on a corrupted
filesystem, and a couple of flaws that cause problems if the starting
number of non-reserved inodes written in the on-disk super block is
intentionally (or corruptly) changed from its default value.
This patch (of 3):
In the current implementation of nilfs2, "nilfs->ns_first_ino", which
gives the first non-reserved inode number, is read from the superblock,
but its lower limit is not checked.
As a result, if a number that overlaps with the inode number range of
reserved inodes such as the root directory or metadata files is set in the
super block parameter, the inode number test macros (NILFS_MDT_INODE and
NILFS_VALID_INODE) will not function properly.
In addition, these test macros use left bit-shift calculations using with
the inode number as the shift count via the BIT macro, but the result of a
shift calculation that exceeds the bit width of an integer is undefined in
the C specification, so if "ns_first_ino" is set to a large value other
than the default value NILFS_USER_INO (=11), the macros may potentially
malfunction depending on the environment.
Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
by preventing bit shifts equal to or greater than the NILFS_USER_INO
constant in the inode number test macros.
Also, change the type of "ns_first_ino" from signed integer to unsigned
integer to avoid the need for type casting in comparisons such as the
lower bound check introduced this time.
Jan Kara [Fri, 21 Jun 2024 14:42:38 +0000 (16:42 +0200)]
mm: avoid overflows in dirty throttling logic
The dirty throttling logic is interspersed with assumptions that dirty
limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
fit into 64-bits). If limits end up being larger, we will hit overflows,
possible divisions by 0 etc. Fix these problems by never allowing so
large dirty limits as they have dubious practical value anyway. For
dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
so large limits. For dirty_ratio / dirty_background_ratio it isn't so
simple as the dirty limit is computed from the amount of available memory
which can change due to memory hotplug etc. So when converting dirty
limits from ratios to numbers of pages, we just don't allow the result to
exceed UINT_MAX.
This is root-only triggerable problem which occurs when the operator
sets dirty limits to >16 TB.
Jan Kara [Fri, 21 Jun 2024 14:42:37 +0000 (16:42 +0200)]
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
Patch series "mm: Avoid possible overflows in dirty throttling".
Dirty throttling logic assumes dirty limits in page units fit into
32-bits. This patch series makes sure this is true (see patch 2/2 for
more details).
The commit is broken in several ways. Firstly, the removed (u64) cast
from the multiplication will introduce a multiplication overflow on 32-bit
archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
default settings with 4GB of RAM will trigger this). Secondly, the
div64_u64() is unnecessarily expensive on 32-bit archs. We have
div64_ul() in case we want to be safe & cheap. Thirdly, if dirty
thresholds are larger than 1<<32 pages, then dirty balancing is going to
blow up in many other spectacular ways anyway so trying to fix one
possible overflow is just moot.
Jinliang Zheng [Thu, 20 Jun 2024 12:21:24 +0000 (20:21 +0800)]
mm: optimize the redundant loop of mm_update_owner_next()
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Merge tag 'io_uring-6.10-20240703' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
"A fix for a feature that went into the 6.10 merge window actually
ended up causing a regression in building bundles for receives.
Fix that up by ensuring we don't overwrite msg_inq before we use
it in the loop"
* tag 'io_uring-6.10-20240703' of git://git.kernel.dk/linux:
io_uring/net: don't clear msg_inq before io_recv_buf_select() needs it
Merge tag 'media/v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"Some fixes related to the IPU6 driver"
* tag 'media/v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: ivsc: Depend on IPU_BRIDGE or not IPU_BRIDGE
media: intel/ipu6: Fix a null pointer dereference in ipu6_isys_query_stream_by_source
media: ipu6: Use the ISYS auxdev device as the V4L2 device's device
Thomas Weißschuh [Fri, 28 Jun 2024 11:37:04 +0000 (12:37 +0100)]
nvmem: core: limit cell sysfs permissions to main attribute ones
The cell sysfs attribute should not provide more access to the nvmem
data than the main attribute itself.
For example if nvme_config::root_only was set, the cell attribute
would still provide read access to everybody.
Mask out permissions not available on the main attribute.
Thomas Weißschuh [Fri, 28 Jun 2024 11:37:03 +0000 (12:37 +0100)]
nvmem: core: only change name to fram for current attribute
bin_attr_nvmem_eeprom_compat is the template from which all future
compat attributes are created.
Changing it means to change all subsquent compat attributes, too.
Instead only use the "fram" name for the currently registered attribute.
Joy Chakraborty [Fri, 28 Jun 2024 11:37:02 +0000 (12:37 +0100)]
nvmem: meson-efuse: Fix return value of nvmem callbacks
Read/write callbacks registered with nvmem core expect 0 to be returned
on success and a negative value to be returned on failure.
meson_efuse_read() and meson_efuse_write() call into
meson_sm_call_read() and meson_sm_call_write() respectively which return
the number of bytes read or written on success as per their api
description.
Fix to return error if meson_sm_call_read()/meson_sm_call_write()
returns an error else return 0.
Joy Chakraborty [Fri, 28 Jun 2024 11:37:01 +0000 (12:37 +0100)]
nvmem: rmem: Fix return value of rmem_read()
reg_read() callback registered with nvmem core expects 0 on success and
a negative value on error but rmem_read() returns the number of bytes
read which is treated as an error at the nvmem core.
This does not break when rmem is accessed using sysfs via
bin_attr_nvmem_read()/write() but causes an error when accessed from
places like nvmem_access_with_keepouts(), etc.
Change to return 0 on success and error in case
memory_read_from_buffer() returns an error or -EIO if bytes read do not
match what was requested.
Joy Chakraborty [Wed, 12 Jun 2024 07:00:30 +0000 (07:00 +0000)]
misc: microchip: pci1xxxx: Fix return value of nvmem callbacks
Read/write callbacks registered with nvmem core expect 0 to be returned
on success and a negative value to be returned on failure.
Currently pci1xxxx_otp_read()/pci1xxxx_otp_write() and
pci1xxxx_eeprom_read()/pci1xxxx_eeprom_write() return the number of
bytes read/written on success.
Fix to return 0 on success.
Fixes: 9ab5465349c0 ("misc: microchip: pci1xxxx: Add support to read and write into PCI1XXXX EEPROM via NVMEM sysfs") Fixes: 0969001569e4 ("misc: microchip: pci1xxxx: Add support to read and write into PCI1XXXX OTP via NVMEM sysfs") Cc: [email protected] Signed-off-by: Joy Chakraborty <[email protected]> Reviewed-by: Dan Carpenter <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
s390/dasd: Fix invalid dereferencing of indirect CCW data pointer
Fix invalid dereferencing of indirect CCW data pointer in
dasd_eckd_dump_sense() that leads to a kernel panic in error cases.
When using indirect addressing for DASD CCWs (IDAW) the CCW CDA pointer
does not contain the data address itself but a pointer to the IDAL.
This needs to be translated from physical to virtual as well before
using it.
This dereferencing is also used for dasd_page_cache and also fixed
although it is very unlikely that this code path ever gets used.
Ekansh Gupta [Fri, 28 Jun 2024 11:45:01 +0000 (12:45 +0100)]
misc: fastrpc: Restrict untrusted app to attach to privileged PD
Untrusted application with access to only non-secure fastrpc device
node can attach to root_pd or static PDs if it can make the respective
init request. This can cause problems as the untrusted application
can send bad requests to root_pd or static PDs. Add changes to reject
attach to privileged PDs if the request is being made using non-secure
fastrpc device node.
Ekansh Gupta [Fri, 28 Jun 2024 11:45:00 +0000 (12:45 +0100)]
misc: fastrpc: Fix ownership reassignment of remote heap
Audio PD daemon will allocate memory for audio PD dynamic loading
usage when it is attaching for the first time to audio PD. As
part of this, the memory ownership is moved to the VM where
audio PD can use it. In case daemon process is killed without any
impact to DSP audio PD, the daemon process will retry to attach to
audio PD and in this case memory won't be reallocated. If the invoke
fails due to any reason, as part of err_invoke, the memory ownership
is getting reassigned to HLOS even when the memory was not allocated.
At this time the audio PD might still be using the memory and an
attemp of ownership reassignment would result in memory issue.
Ekansh Gupta [Fri, 28 Jun 2024 11:44:59 +0000 (12:44 +0100)]
misc: fastrpc: Fix memory leak in audio daemon attach operation
Audio PD daemon send the name as part of the init IOCTL call. This
name needs to be copied to kernel for which memory is allocated.
This memory is never freed which might result in memory leak. Free
the memory when it is not needed.
Ekansh Gupta [Fri, 28 Jun 2024 11:44:58 +0000 (12:44 +0100)]
misc: fastrpc: Avoid updating PD type for capability request
When user is requesting for DSP capability, the process pd type is
getting updated to USER_PD which is incorrect as DSP will assume the
process which is making the request is a user PD and this will never
get updated back to the original value. The actual PD type should not
be updated for capability request and it should be serviced by the
respective PD on DSP side. Don't change process's PD type for DSP
capability request.
Ekansh Gupta [Fri, 28 Jun 2024 11:44:57 +0000 (12:44 +0100)]
misc: fastrpc: Copy the complete capability structure to user
User is passing capability ioctl structure(argp) to get DSP
capabilities. This argp is copied to a local structure to get domain
and attribute_id information. After getting the capability, only
capability value is getting copied to user argp which will not be
useful if the use is trying to get the capability by checking the
capability member of fastrpc_ioctl_capability structure. Copy the
complete capability structure so that user can get the capability
value from the expected member of the structure.
Ekansh Gupta [Fri, 28 Jun 2024 11:44:56 +0000 (12:44 +0100)]
misc: fastrpc: Fix DSP capabilities request
The DSP capability request call expects 2 arguments. First is the
information about the total number of attributes to be copied from
DSP and second is the information about the buffer where the DSP
needs to copy the information. The current design is passing the
information about the size to be copied from DSP which would be
considered as a bad argument to the call by DSP causing a failure
suggesting the same. The second argument carries the information
about the buffer where the DSP needs to copy the capability
information and the size to be copied. As the first entry of
capability attribute is getting skipped, same should also be
considered while sending the information to DSP. Add changes to
pass proper arguments to DSP.
Merge tag 'iio-fixes-for-6.10c' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-linus
Jonathan writes:
IIO: 3rd round of fixes for 6.10
core:
- Trigger check on on whether a device was using own trigger was inverted.
avago,apds9306
- Checking wrong variable in an error check.
* tag 'iio-fixes-for-6.10c' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio:
iio: light: apds9306: Fix error handing
iio: trigger: Fix condition for own trigger
Rasmus Villemoes [Tue, 25 Jun 2024 18:42:05 +0000 (20:42 +0200)]
serial: imx: ensure RTS signal is not left active after shutdown
If a process is killed while writing to a /dev/ttymxc* device in RS485
mode, we observe that the RTS signal is left high, thus making it
impossible for other devices to transmit anything.
Moreover, the ->tx_state variable is left in state SEND, which means
that when one next opens the device and configures baud rate etc., the
initialization code in imx_uart_set_termios dutifully ensures the RTS
pin is pulled down, but since ->tx_state is already SEND, the logic in
imx_uart_start_tx() does not in fact pull the pin high before
transmitting, so nothing actually gets on the wire on the other side
of the transceiver. Only when that transmission is allowed to complete
is the state machine then back in a consistent state.
This is completely reproducible by doing something as simple as
seq 10000 > /dev/ttymxc0
and hitting ctrl-C, and watching with a logic analyzer.
Udit Kumar [Tue, 25 Jun 2024 16:07:25 +0000 (21:37 +0530)]
serial: 8250_omap: Fix Errata i2310 with RX FIFO level check
Errata i2310[0] says, Erroneous timeout can be triggered,
if this Erroneous interrupt is not cleared then it may leads
to storm of interrupts.
Commit 9d141c1e6157 ("serial: 8250_omap: Implementation of Errata i2310")
which added the workaround but missed ensuring RX FIFO is really empty
before applying the errata workaround as recommended in the errata text.
Fix this by adding back check for UART_OMAP_RX_LVL to be 0 for
workaround to take effect.
wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference
iwl_mvm_get_bss_vif might return a NULL or ERR_PTR. Some of the callers
check only the NULL case, and some doesn't check at all.
Some of the callers even have a pointer to the mvmvif of the bss vif,
so we don't even need to call this function, and can simply get the vif
from mvmvif. Do it for those cases, and for the others - properly check
if IS_ERR_OR_NULL
Johannes Berg [Wed, 3 Jul 2024 03:43:14 +0000 (06:43 +0300)]
wifi: iwlwifi: mvm: avoid link lookup in statistics
We already iterate the link bss_conf/link_info and have the
pointer, or know that deflink/bss_conf is used, so avoid an
extra lookup and just pass the pointer. This may also avoid
a crash when this is processed during restart, where the FW
to link conf array (link_id_to_link_conf) may be NULLed out.
wifi: iwlwifi: mvm: don't wake up rx_sync_waitq upon RFKILL
Since we now want to sync the queues even when we're in RFKILL, we
shouldn't wake up the wait queue since we still expect to get all the
notifications from the firmware.
Daniel Gabay [Wed, 3 Jul 2024 03:43:13 +0000 (06:43 +0300)]
wifi: iwlwifi: properly set WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK
The WIPHY_FLAG_SUPPORTS_EXT_KEK_KCK should be set based on the
WOWLAN_KEK_KCK_MATERIAL command version. Currently, the command
version in the firmware has advanced to 4, which prevents the
flag from being set correctly, fix that.
Javier Carrasco [Mon, 24 Jun 2024 21:10:06 +0000 (23:10 +0200)]
usb: core: add missing of_node_put() in usb_of_has_devices_or_graph
The for_each_child_of_node() macro requires an explicit call to
of_node_put() on early exits to decrement the child refcount and avoid a
memory leak.
The child node is not required outsie the loop, and the resource must be
released before the function returns.
Alan Stern [Thu, 27 Jun 2024 19:56:18 +0000 (15:56 -0400)]
USB: core: Fix duplicate endpoint bug by clearing reserved bits in the descriptor
Syzbot has identified a bug in usbcore (see the Closes: tag below)
caused by our assumption that the reserved bits in an endpoint
descriptor's bEndpointAddress field will always be 0. As a result of
the bug, the endpoint_is_duplicate() routine in config.c (and possibly
other routines as well) may believe that two descriptors are for
distinct endpoints, even though they have the same direction and
endpoint number. This can lead to confusion, including the bug
identified by syzbot (two descriptors with matching endpoint numbers
and directions, where one was interrupt and the other was bulk).
To fix the bug, we will clear the reserved bits in bEndpointAddress
when we parse the descriptor. (Note that both the USB-2.0 and USB-3.1
specs say these bits are "Reserved, reset to zero".) This requires us
to make a copy of the descriptor earlier in usb_parse_endpoint() and
use the copy instead of the original when checking for duplicates.
Mathias Nyman [Thu, 27 Jun 2024 14:55:23 +0000 (17:55 +0300)]
xhci: always resume roothubs if xHC was reset during resume
Usb device connect may not be detected after runtime resume if
xHC is reset during resume.
In runtime resume cases xhci_resume() will only resume roothubs if there
are pending port events. If the xHC host is reset during runtime resume
due to a Save/Restore Error (SRE) then these pending port events won't be
detected as PORTSC change bits are not immediately set by host after reset.
Unconditionally resume roothubs if xHC is reset during resume to ensure
device connections are detected.
Also return early with error code if starting xHC fails after reset.
Issue was debugged and a similar solution suggested by Remi Pommarel.
Using this instead as it simplifies future refactoring.
serial: imx: only set receiver level if it is zero
With commit a81dbd0463ec ("serial: imx: set receiver level before
starting uart") we set the receiver level to its default value. This
caused a regression when using SDMA, where the receiver level is 9
instead of 8 (default). This change will first check if the receiver
level is zero and only then set it to the default. This still avoids the
interrupt storm when the receiver level is zero.
Jozef Hopko [Mon, 1 Jul 2024 16:23:20 +0000 (18:23 +0200)]
wifi: wilc1000: fix ies_len type in connect path
Commit 205c50306acf ("wifi: wilc1000: fix RCU usage in connect path")
made sure that the IEs data was manipulated under the relevant RCU section.
Unfortunately, while doing so, the commit brought a faulty implicit cast
from int to u8 on the ies_len variable, making the parsing fail to be
performed correctly if the IEs block is larger than 255 bytes. This failure
can be observed with Access Points appending a lot of IEs TLVs in their
beacon frames (reproduced with a Pixel phone acting as an Access Point,
which brough 273 bytes of IE data in my testing environment).
Fix IEs parsing by removing this undesired implicit cast.
Shiji Yang [Tue, 25 Jun 2024 01:19:49 +0000 (09:19 +0800)]
gpio: mmio: do not calculate bgpio_bits via "ngpios"
bgpio_bits must be aligned with the data bus width. For example, on a
32 bit big endian system and we only have 16 GPIOs. If we only assume
bgpio_bits=16 we can never control the GPIO because the base address
is the lowest address.
low address high address
-------------------------------------------------
| byte3 | byte2 | byte1 | byte0 |
-------------------------------------------------
| NaN | NaN | gpio8-15 | gpio0-7 |
-------------------------------------------------
Eric Farman [Fri, 28 Jun 2024 16:37:38 +0000 (18:37 +0200)]
s390/vfio_ccw: Fix target addresses of TIC CCWs
The processing of a Transfer-In-Channel (TIC) CCW requires locating
the target of the CCW in the channel program, and updating the
address to reflect what will actually be sent to hardware.
An error exists where the 64-bit virtual address is truncated to
32-bits (variable "cda") when performing this math. Since s390
addresses of that size are 31-bits, this leaves that additional
bit enabled such that the resulting I/O triggers a channel
program check. This shows up occasionally when booting a KVM
guest from a passthrough DASD device:
..snip...
Interrupt Response Block Data:
: 0x0000000000003990
Function Ctrl : [Start]
Activity Ctrl :
Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
Device Status :
Channel Status : [Program-Check]
cpa=: 0x00000000008d0018
prev_ccw=: 0x0000000000000000
this_ccw=: 0x0000000000000000
...snip...
dasd-ipl: Failed to run IPL1 channel program
The channel program address of "0x008d0018" in the IRB doesn't
look wrong, but tracing the CCWs shows the offending bit enabled:
Jingbo Xu [Fri, 28 Jun 2024 06:29:30 +0000 (14:29 +0800)]
cachefiles: add missing lock protection when polling
Add missing lock protection in poll routine when iterating xarray,
otherwise:
Even with RCU read lock held, only the slot of the radix tree is
ensured to be pinned there, while the data structure (e.g. struct
cachefiles_req) stored in the slot has no such guarantee. The poll
routine will iterate the radix tree and dereference cachefiles_req
accordingly. Thus RCU read lock is not adequate in this case and
spinlock is needed here.
Baokun Li [Fri, 28 Jun 2024 06:29:29 +0000 (14:29 +0800)]
cachefiles: cyclic allocation of msg_id to avoid reuse
Reusing the msg_id after a maliciously completed reopen request may cause
a read request to remain unprocessed and result in a hung, as shown below:
t1 | t2 | t3
-------------------------------------------------
cachefiles_ondemand_select_req
cachefiles_ondemand_object_is_close(A)
cachefiles_ondemand_set_object_reopening(A)
queue_work(fscache_object_wq, &info->work)
ondemand_object_worker
cachefiles_ondemand_init_object(A)
cachefiles_ondemand_send_req(OPEN)
// get msg_id 6
wait_for_completion(&req_A->done)
cachefiles_ondemand_daemon_read
// read msg_id 6 req_A
cachefiles_ondemand_get_fd
copy_to_user
// Malicious completion msg_id 6
copen 6,-1
cachefiles_ondemand_copen
complete(&req_A->done)
// will not set the object to close
// because ondemand_id && fd is valid.
// ondemand_object_worker() is done
// but the object is still reopening.
// new open req_B
cachefiles_ondemand_init_object(B)
cachefiles_ondemand_send_req(OPEN)
// reuse msg_id 6
process_open_req
copen 6,A.size
// The expected failed copen was executed successfully
Expect copen to fail, and when it does, it closes fd, which sets the
object to close, and then close triggers reopen again. However, due to
msg_id reuse resulting in a successful copen, the anonymous fd is not
closed until the daemon exits. Therefore read requests waiting for reopen
to complete may trigger hung task.
To avoid this issue, allocate the msg_id cyclically to avoid reusing the
msg_id for a very short duration of time.
Hou Tao [Fri, 28 Jun 2024 06:29:28 +0000 (14:29 +0800)]
cachefiles: wait for ondemand_object_worker to finish when dropping object
When queuing ondemand_object_worker() to re-open the object,
cachefiles_object is not pinned. The cachefiles_object may be freed when
the pending read request is completed intentionally and the related
erofs is umounted. If ondemand_object_worker() runs after the object is
freed, it will incur use-after-free problem as shown below.
process A processs B process C process D
cachefiles_ondemand_send_req()
// send a read req X
// wait for its completion
// close ondemand fd
cachefiles_ondemand_fd_release()
// set object as CLOSE
cachefiles_ondemand_daemon_read()
// set object as REOPENING
queue_work(fscache_wq, &info->ondemand_work)
// close /dev/cachefiles
cachefiles_daemon_release
cachefiles_flush_reqs
complete(&req->done)
// read req X is completed
// umount the erofs fs
cachefiles_put_object()
// object will be freed
cachefiles_ondemand_deinit_obj_info()
kmem_cache_free(object)
// both info and object are freed
ondemand_object_worker()
When dropping an object, it is no longer necessary to reopen the object,
so use cancel_work_sync() to cancel or wait for ondemand_object_worker()
to finish.
Baokun Li [Fri, 28 Jun 2024 06:29:27 +0000 (14:29 +0800)]
cachefiles: cancel all requests for the object that is being dropped
Because after an object is dropped, requests for that object are useless,
cancel them to avoid causing other problems.
This prepares for the later addition of cancel_work_sync(). After the
reopen requests is generated, cancel it to avoid cancel_work_sync()
blocking by waiting for daemon to complete the reopen requests.
Baokun Li [Fri, 28 Jun 2024 06:29:26 +0000 (14:29 +0800)]
cachefiles: stop sending new request when dropping object
Added CACHEFILES_ONDEMAND_OBJSTATE_DROPPING indicates that the cachefiles
object is being dropped, and is set after the close request for the dropped
object completes, and no new requests are allowed to be sent after this
state.
This prepares for the later addition of cancel_work_sync(). It prevents
leftover reopen requests from being sent, to avoid processing unnecessary
requests and to avoid cancel_work_sync() blocking by waiting for daemon to
complete the reopen requests.
Baokun Li [Fri, 28 Jun 2024 06:29:25 +0000 (14:29 +0800)]
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
In cachefiles_check_volume_xattr(), the error returned by vfs_getxattr()
is not passed to ret, so it ends up returning -ESTALE, which leads to an
endless loop as follows:
cachefiles_acquire_volume
retry:
ret = cachefiles_check_volume_xattr
ret = -ESTALE
xlen = vfs_getxattr // return -EIO
// The ret is not updated when xlen < 0, so -ESTALE is returned.
return ret
// Supposed to jump out of the loop at this judgement.
if (ret != -ESTALE)
goto error_dir;
cachefiles_bury_object
// EIO causes rename failure
goto retry;
Hence propagate the error returned by vfs_getxattr() to avoid the above
issue. Do the same in cachefiles_check_auxdata().
Baokun Li [Fri, 28 Jun 2024 06:29:24 +0000 (14:29 +0800)]
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
We got the following issue in our fault injection stress test:
==================================================================
BUG: KASAN: slab-use-after-free in cachefiles_withdraw_cookie+0x4d9/0x600
Read of size 8 at addr ffff888118efc000 by task kworker/u78:0/109
After setting FSCACHE_CACHE_IS_WITHDRAWN, wait for all the cookie lookups
to complete first, and then wait for fscache_cache->object_count == 0 to
avoid the cookie exiting after the volume has been freed and triggering
the above issue. Therefore call fscache_withdraw_volume() before calling
cachefiles_withdraw_objects().
This way, after setting FSCACHE_CACHE_IS_WITHDRAWN, only the following two
cases will occur:
1) fscache_begin_lookup fails in fscache_begin_volume_access().
2) fscache_withdraw_volume() will ensure that fscache_count_object() has
been executed before calling fscache_wait_for_objects().
Baokun Li [Fri, 28 Jun 2024 06:29:23 +0000 (14:29 +0800)]
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
We got the following issue in our fault injection stress test:
==================================================================
BUG: KASAN: slab-use-after-free in fscache_withdraw_volume+0x2e1/0x370
Read of size 4 at addr ffff88810680be08 by task ondemand-04-dae/5798
The fscache_volume in cache->volumes must not have been freed yet, but its
reference count may be 0. So use the new fscache_try_get_volume() helper
function try to get its reference count.
If the reference count of fscache_volume is 0, fscache_put_volume() is
freeing it, so wait for it to be removed from cache->volumes.
If its reference count is not 0, call cachefiles_withdraw_volume() with
reference count protection to avoid the above issue.
net: stmmac: enable HW-accelerated VLAN stripping for gmac4 only
Commit 750011e239a5 ("net: stmmac: Add support for HW-accelerated VLAN
stripping") enables MAC level VLAN tag stripping for all MAC cores, but
leaves set_hw_vlan_mode() and rx_hw_vlan() un-implemented for both gmac
and xgmac.
On gmac and xgmac, ethtool reports rx-vlan-offload is on, both MAC and
driver do nothing about VLAN packets actually, although VLAN works well.
Driver level stripping should be used on gmac and xgmac for now.
Fixes: 750011e239a5 ("net: stmmac: Add support for HW-accelerated VLAN stripping") Signed-off-by: Furong Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Thomas Huth [Thu, 27 Jun 2024 17:35:30 +0000 (19:35 +0200)]
drm/fbdev-generic: Fix framebuffer on big endian devices
Starting with kernel 6.7, the framebuffer text console is not working
anymore with the virtio-gpu device on s390x hosts. Such big endian fb
devices are usinga different pixel ordering than little endian devices,
e.g. DRM_FORMAT_BGRX8888 instead of DRM_FORMAT_XRGB8888.
This used to work fine as long as drm_client_buffer_addfb() was still
calling drm_mode_addfb() which called drm_driver_legacy_fb_format()
internally to get the right format. But drm_client_buffer_addfb() has
recently been reworked to call drm_mode_addfb2() instead with the
format value that has been passed to it as a parameter (see commit 6ae2ff23aa43 ("drm/client: Convert drm_client_buffer_addfb() to drm_mode_addfb2()").
That format parameter is determined in drm_fbdev_generic_helper_fb_probe()
via the drm_mode_legacy_fb_format() function - which only generates
formats suitable for little endian devices. So to fix this issue
switch to drm_driver_legacy_fb_format() here instead to take the
device endianness into consideration.
Boris Brezillon [Wed, 3 Jul 2024 07:16:40 +0000 (09:16 +0200)]
drm/panthor: Fix sync-only jobs
A sync-only job is meant to provide a synchronization point on a
queue, so we can't return a NULL fence there, we have to add a signal
operation to the command stream which executes after all other
previously submitted jobs are done.
David Howells [Tue, 2 Jul 2024 14:50:09 +0000 (15:50 +0100)]
cifs: Fix read-performance regression by dropping readahead expansion
cifs_expand_read() is causing a performance regression of around 30% by
causing extra pagecache to be allocated for an inode in the readahead path
before we begin actually dispatching RPC requests, thereby delaying the
actual I/O. The expansion is sized according to the rsize parameter, which
seems to be 4MiB on my test system; this is a big step up from the first
requests made by the fio test program.
Fix this by removing cifs_expand_readahead(). Readahead expansion is
mostly useful for when we're using the local cache if the local cache has a
block size greater than PAGE_SIZE, so we can dispense with it when not
caching.
The cause is due to the idxd driver interrupt completion handler uses
threaded interrupt and the threaded handler is not hard or soft interrupt
context. However __netif_rx() can only be called from interrupt context.
Change the call to netif_rx() in order to allow completion via normal
context for dmaengine drivers that utilize threaded irq handling.
While the following commit changed from netif_rx() to __netif_rx(), baebdf48c360 ("net: dev: Makes sure netif_rx() can be invoked in any context."),
the change should've been a noop instead. However, the code precedes this
fix should've been using netif_rx_ni() or netif_rx_any_context().