Pratyush Yadav [Mon, 22 May 2023 15:30:20 +0000 (17:30 +0200)]
net: fix skb leak in __skb_tstamp_tx()
Commit 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with
TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with
zerocopy skbs. But it ended up adding a leak of its own. When
skb_orphan_frags_rx() fails, the function just returns, leaking the skb
it just cloned. Free it before returning.
This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.
r8169: Use a raw_spinlock_t for the register locks.
The driver's interrupt service routine is requested with the
IRQF_NO_THREAD if MSI is available. This means that the routine is
invoked in hardirq context even on PREEMPT_RT. The routine itself is
relatively short and schedules a worker, performs register access and
schedules NAPI. On PREEMPT_RT, scheduling NAPI from hardirq results in
waking ksoftirqd for further processing so using NAPI threads with this
driver is highly recommended since it NULL routes the threaded-IRQ
efforts.
Adding rtl_hw_aspm_clkreq_enable() to the ISR is problematic on
PREEMPT_RT because the function uses spinlock_t locks which become
sleeping locks on PREEMPT_RT. The locks are only used to protect
register access and don't nest into other functions or locks. They are
also not used for unbounded period of time. Therefore it looks okay to
convert them to raw_spinlock_t.
Convert the three locks which are used from the interrupt service
routine to raw_spinlock_t.
Yunsheng Lin [Mon, 22 May 2023 03:17:14 +0000 (11:17 +0800)]
page_pool: fix inconsistency for page_pool_ring_[un]lock()
page_pool_ring_[un]lock() use in_softirq() to decide which
spin lock variant to use, and when they are called in the
context with in_softirq() being false, spin_lock_bh() is
called in page_pool_ring_lock() while spin_unlock() is
called in page_pool_ring_unlock(), because spin_lock_bh()
has disabled the softirq in page_pool_ring_lock(), which
causes inconsistency for spin lock pair calling.
This patch fixes it by returning in_softirq state from
page_pool_producer_lock(), and use it to decide which
spin lock variant to use in page_pool_producer_unlock().
As pool->ring has both producer and consumer lock, so
rename it to page_pool_producer_[un]lock() to reflect
the actual usage. Also move them to page_pool.c as they
are only used there, and remove the 'inline' as the
compiler may have better idea to do inlining or not.
Peter Ujfalusi [Wed, 17 May 2023 12:29:31 +0000 (15:29 +0300)]
tpm: tpm_tis: Disable interrupts for AEON UPX-i11
Interrupts got recently enabled for tpm_tis.
The interrupts initially works on the device but they will stop arriving
after circa ~200 interrupts. On system reboot/shutdown this will cause a
long wait (120000 jiffies).
[[email protected]: fix a merge conflict and adjust the commit message] Fixes: e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") Signed-off-by: Peter Ujfalusi <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Jarkko Sakkinen <[email protected]>
It happens because of play_dma_data/capture_dma_data pointers are NULL.
Current implementation assigns these pointers at snd_soc_dai_driver
startup() callback and reset them back to NULL at shutdown(). But
soc_pcm_open() sequence uses DMA pointers in dmaengine_pcm_open()
before snd_soc_dai_driver startup().
Most generic DMA capable I2S drivers use snd_soc_dai_driver probe()
callback to init DMA pointers only once at probe. So move DMA init
to dw_i2s_dai_probe and drop shutdown() and startup() callbacks.
Yan Zhao [Fri, 19 May 2023 06:58:43 +0000 (14:58 +0800)]
vfio/type1: check pfn valid before converting to struct page
Check physical PFN is valid before converting the PFN to a struct page
pointer to be returned to caller of vfio_pin_pages().
vfio_pin_pages() pins user pages with contiguous IOVA.
If the IOVA of a user page to be pinned belongs to vma of vm_flags
VM_PFNMAP, pin_user_pages_remote() will return -EFAULT without returning
struct page address for this PFN. This is because usually this kind of PFN
(e.g. MMIO PFN) has no valid struct page address associated.
Upon this error, vaddr_get_pfns() will obtain the physical PFN directly.
While previously vfio_pin_pages() returns to caller PFN arrays directly,
after commit 34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()"),
PFNs will be converted to "struct page *" unconditionally and therefore
the returned "struct page *" array may contain invalid struct page
addresses.
Given current in-tree users of vfio_pin_pages() only expect "struct page *
returned, check PFN validity and return -EINVAL to let the caller be
aware of IOVAs to be pinned containing PFN not able to be returned in
"struct page *" array. So that, the caller will not consume the returned
pointer (e.g. test PageReserved()) and avoid error like "supervisor read
access in kernel mode".
Linus Torvalds [Tue, 23 May 2023 17:47:32 +0000 (10:47 -0700)]
Merge tag 'erofs-for-6.4-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"One patch addresses a null-ptr-deref issue reported by syzbot weeks
ago, which is caused by the new long xattr name prefix feature and
needs to be fixed.
The remaining two patches are minor cleanups to avoid unnecessary
compilation and adjust per-cpu kworker configuration.
Summary:
- Fix null-ptr-deref related to long xattr name prefixes
- Avoid pcpubuf compilation if CONFIG_EROFS_FS_ZIP is off
- Use high priority kthreads by default if per-cpu kthread workers
are enabled"
* tag 'erofs-for-6.4-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use HIPRI by default if per-cpu kthreads are enabled
erofs: avoid pcpubuf.c inclusion if CONFIG_EROFS_FS_ZIP is off
erofs: fix null-ptr-deref caused by erofs_xattr_prefixes_init
Anuj Gupta [Tue, 23 May 2023 11:17:09 +0000 (16:47 +0530)]
block: fix bio-cache for passthru IO
commit <8af870aa5b847> ("block: enable bio caching use for passthru IO")
introduced bio-cache for passthru IO. In case when nr_vecs are greater
than BIO_INLINE_VECS, bio and bvecs are allocated from mempool (instead
of percpu cache) and REQ_ALLOC_CACHE is cleared. This causes the side
effect of not freeing bio/bvecs into mempool on completion.
This patch lets the passthru IO fallback to allocation using bio_kmalloc
when nr_vecs are greater than BIO_INLINE_VECS. The corresponding bio
is freed during call to blk_mq_map_bio_put during completion.
Tian Lan [Mon, 22 May 2023 21:05:55 +0000 (17:05 -0400)]
blk-mq: fix race condition in active queue accounting
If multiple CPUs are sharing the same hardware queue, it can
cause leak in the active queue counter tracking when __blk_mq_tag_busy()
is executed simultaneously.
Yu Kuai [Mon, 22 May 2023 12:18:54 +0000 (20:18 +0800)]
blk-wbt: fix that wbt can't be disabled by default
commit b11d31ae01e6 ("blk-wbt: remove unnecessary check in
wbt_enable_default()") removes the checking of CONFIG_BLK_WBT_MQ by
mistake, which is used to control enable or disable wbt by default.
Fix the problem by adding back the checking. This patch also do a litter
cleanup to make related code more readable.
Helge Deller [Fri, 19 May 2023 10:12:06 +0000 (12:12 +0200)]
parisc: Use num_present_cpus() in alternative patching code
When patching the kernel code some alternatives depend on SMP vs. !SMP.
Use the value of num_present_cpus() instead of num_online_cpus() to
decide, otherwise we may run into issues if and additional CPU is
enabled after having loaded a module while only one CPU was enabled.
Jeffrey Hugo [Wed, 17 May 2023 19:35:40 +0000 (13:35 -0600)]
accel/qaic: Fix NNC message corruption
If msg_xfer() is unable to queue part of a NNC message because the MHI ring
is full, it will attempt to give the QSM some time to drain the queue.
However, if QSM fails to make any room, msg_xfer() will fail and tell the
caller to try again. This is problematic because part of the message may
have been committed to the ring and there is no mechanism to revoke that
content. This will cause QSM to receive a corrupt message.
The better way to do this is to check if the ring has enough space for the
entire message before committing any of the message. Since msg_xfer() is
under the cntl_mutex no one else can come in and consume the space.
accel/qaic: Grab ch_lock during QAIC_ATTACH_SLICE_BO
During QAIC_ATTACH_SLICE_BO, we associate a BO to its DBC. We need to
grab the dbc->ch_lock to make sure that DBC does not goes away while
QAIC_ATTACH_SLICE_BO is still running.
Before calling synchronize_srcu() we clear the transfer list, this is to
allow all the QAIC_WAIT_BO callers to exit otherwise the system could
deadlock. There could be a corner case where more elements get added to
transfer list after we have flushed it. Re-flush the transfer list once
all the holders of dbc->ch_lock have completed execution i.e.
synchronize_srcu() is complete.
Tom Rix [Wed, 17 May 2023 16:56:05 +0000 (10:56 -0600)]
accel/qaic: initialize ret variable to 0
clang static analysis reports
drivers/accel/qaic/qaic_data.c:610:2: warning: Undefined or garbage
value returned to caller [core.uninitialized.UndefReturn]
return ret;
^~~~~~~~~~
From a code analysis of the function, the ret variable is only set some
of the time but is always returned. This suggests ret can return
uninitialized garbage. However BO allocation will ensure ret is always
set in reality.
John Fastabend [Tue, 23 May 2023 02:56:18 +0000 (19:56 -0700)]
bpf, sockmap: Test progs verifier error with latest clang
With a relatively recent clang (7090c10273119) and with this commit
to fix warnings in selftests (c8ed668593972) that uses __sink(err)
to resolve unused variables. We get the following verifier error.
To fix simply remove the err value because its not actually used anywhere
in the testing. We can investigate the root cause later. Future patch should
probably actually test the err value as well. Although if the map updates
fail they will get caught eventually by userspace.
John Fastabend [Tue, 23 May 2023 02:56:17 +0000 (19:56 -0700)]
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
When BPF program drops pkts the sockmap logic 'eats' the packet and
updates copied_seq. In the PASS case where the sk_buff is accepted
we update copied_seq from recvmsg path so we need a new test to
handle the drop case.
John Fastabend [Tue, 23 May 2023 02:56:16 +0000 (19:56 -0700)]
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
A bug was reported where ioctl(FIONREAD) returned zero even though the
socket with a SK_SKB verdict program attached had bytes in the msg
queue. The result is programs may hang or more likely try to recover,
but use suboptimal buffer sizes.
Add a test to check that ioctl(FIONREAD) returns the correct number of
bytes.
John Fastabend [Tue, 23 May 2023 02:56:14 +0000 (19:56 -0700)]
bpf, sockmap: Build helper to create connected socket pair
A common operation for testing is to spin up a pair of sockets that are
connected. Then we can use these to run specific tests that need to
send data, check BPF programs and so on.
The sockmap_listen programs already have this logic lets move it into
the new sockmap_helpers header file for general use.
John Fastabend [Tue, 23 May 2023 02:56:13 +0000 (19:56 -0700)]
bpf, sockmap: Pull socket helpers out of listen test for general use
No functional change here we merely pull the helpers in sockmap_listen.c
into a header file so we can use these in other programs. The tests we
are about to add aren't really _listen tests so doesn't make sense
to add them here.
John Fastabend [Tue, 23 May 2023 02:56:12 +0000 (19:56 -0700)]
bpf, sockmap: Incorrectly handling copied_seq
The read_skb() logic is incrementing the tcp->copied_seq which is used for
among other things calculating how many outstanding bytes can be read by
the application. This results in application errors, if the application
does an ioctl(FIONREAD) we return zero because this is calculated from
the copied_seq value.
To fix this we move tcp->copied_seq accounting into the recv handler so
that we update these when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before we were calling the
tcp_rcv_space_adjust() which would update 'number of bytes copied to
user in last RTT' which is wrong for programs returning SK_PASS. The
bytes are only copied to the user when recvmsg is handled.
Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single reccv socket is
redirecting to many sockets (for example for lb) the receiver sk will be
hung even though we might expect it to continue. The hang comes from
not updating the copied_seq numbers and memory pressure resulting from
that.
We have a slight layer problem of calling tcp_eat_skb even if its not
a TCP socket. To fix we could refactor and create per type receiver
handlers. I decided this is more work than we want in the fix and we
already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use and then we can gate the
seq_copied updates on this.
John Fastabend [Tue, 23 May 2023 02:56:11 +0000 (19:56 -0700)]
bpf, sockmap: Wake up polling after data copy
When TCP stack has data ready to read sk_data_ready() is called. Sockmap
overwrites this with its own handler to call into BPF verdict program.
But, the original TCP socket had sock_def_readable that would additionally
wake up any user space waiters with sk_wake_async().
Sockmap saved the callback when the socket was created so call the saved
data ready callback and then we can wake up any epoll() logic waiting
on the read.
Note we call on 'copied >= 0' to account for returning 0 when a FIN is
received because we need to wake up user for this as well so they
can do the recvmsg() -> 0 and detect the shutdown.
John Fastabend [Tue, 23 May 2023 02:56:10 +0000 (19:56 -0700)]
bpf, sockmap: TCP data stall on recv before accept
A common mechanism to put a TCP socket into the sockmap is to hook the
BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
that can map the socket info to the correct BPF verdict parser. When
the user adds the socket to the map the psock is created and the new
ops are assigned to ensure the verdict program will 'see' the sk_buffs
as they arrive.
Part of this process hooks the sk_data_ready op with a BPF specific
handler to wake up the BPF verdict program when data is ready to read.
The logic is simple enough (posted here for easy reading)
if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
return;
sock->ops->read_skb(sk, sk_psock_verdict_recv);
}
The oversight here is sk->sk_socket is not assigned until the application
accepts() the new socket. However, its entirely ok for the peer application
to do a connect() followed immediately by sends. The socket on the receiver
is sitting on the backlog queue of the listening socket until its accepted
and the data is queued up. If the peer never accepts the socket or is slow
it will eventually hit data limits and rate limit the session. But,
important for BPF sockmap hooks when this data is received TCP stack does
the sk_data_ready() call but the read_skb() for this data is never called
because sk_socket is missing. The data sits on the sk_receive_queue.
Then once the socket is accepted if we never receive more data from the
peer there will be no further sk_data_ready calls and all the data
is still on the sk_receive_queue(). Then user calls recvmsg after accept()
and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
The handler checks for data in the sk_msg ingress queue expecting that
the BPF program has already run from the sk_data_ready hook and enqueued
the data as needed. So we are stuck.
To fix do an unlikely check in recvmsg handler for data on the
sk_receive_queue and if it exists wake up data_ready. We have the sock
locked in both read_skb and recvmsg so should avoid having multiple
runners.
John Fastabend [Tue, 23 May 2023 02:56:09 +0000 (19:56 -0700)]
bpf, sockmap: Handle fin correctly
The sockmap code is returning EAGAIN after a FIN packet is received and no
more data is on the receive queue. Correct behavior is to return 0 to the
user and the user can then close the socket. The EAGAIN causes many apps
to retry which masks the problem. Eventually the socket is evicted from
the sockmap because its released from sockmap sock free handling. The
issue creates a delay and can cause some errors on application side.
To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
is set then set return to zero. A selftest will be added to check this
condition.
John Fastabend [Tue, 23 May 2023 02:56:08 +0000 (19:56 -0700)]
bpf, sockmap: Improved check for empty queue
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.
But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.
Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.
To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.
To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.
John Fastabend [Tue, 23 May 2023 02:56:07 +0000 (19:56 -0700)]
bpf, sockmap: Reschedule is now done through backlog
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.
Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple limitations, first it required user space to
kick it introducing an unnecessary EBUSY and retry. Second it only
handled the ingress case and egress redirects would still be hung.
With the correct fix, pushing the reschedule logic down to where the
enomem error occurs we can drop this fix.
John Fastabend [Tue, 23 May 2023 02:56:06 +0000 (19:56 -0700)]
bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.
The flow for Cilium's common use case with a stream parser is,
tcp_read_sock()
sk_psock_verdict_recv
ret = bpf_prog_run_pin_on_cpu()
sk_psock_verdict_apply(sock, skb, ret)
// if system is under memory pressure or app is slow we may
// need to queue skb. Do this queuing through ingress_skb and
// then kick timer to wake up handler
skb_queue_tail(ingress_skb, skb)
schedule_work(work);
The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.
Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.
To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.
To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.
>From on list discussion. This commit
bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")
was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.
John Fastabend [Tue, 23 May 2023 02:56:05 +0000 (19:56 -0700)]
bpf, sockmap: Pass skb ownership through read_skb
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.
To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
Nicolas Dichtel [Mon, 22 May 2023 12:08:20 +0000 (14:08 +0200)]
ipv{4,6}/raw: fix output xfrm lookup wrt protocol
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.
For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.
For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.
Horatiu Vultur [Mon, 22 May 2023 12:00:38 +0000 (14:00 +0200)]
lan966x: Fix unloading/loading of the driver
It was noticing that after a while when unloading/loading the driver and
sending traffic through the switch, it would stop working. It would stop
forwarding any traffic and the only way to get out of this was to do a
power cycle of the board. The root cause seems to be that the switch
core is initialized twice. Apparently initializing twice the switch core
disturbs the pointers in the queue systems in the HW, so after a while
it would stop sending the traffic.
Unfortunetly, it is not possible to use a reset of the switch here,
because the reset line is connected to multiple devices like MDIO,
SGPIO, FAN, etc. So then all the devices will get reseted when the
network driver will be loaded.
So the fix is to check if the core is initialized already and if that is
the case don't initialize it again.
Steve Wahl [Fri, 19 May 2023 16:04:20 +0000 (11:04 -0500)]
platform/x86: ISST: Remove 8 socket limit
Stop restricting the PCI search to a range of PCI domains fed to
pci_get_domain_bus_and_slot(). Instead, use for_each_pci_dev() and
look at all PCI domains in one pass.
On systems with more than 8 sockets, this avoids error messages like
"Information: Invalid level, Can't get TDP control information at
specified levels on cpu 480" from the intel speed select utility.
Liviu Dudau [Tue, 9 May 2023 17:29:21 +0000 (18:29 +0100)]
mips: Move initrd_start check after initrd address sanitisation.
PAGE_OFFSET is technically a virtual address so when checking the value of
initrd_start against it we should make sure that it has been sanitised from
the values passed by the bootloader. Without this change, even with a bootloader
that passes correct addresses for an initrd, we are failing to load it on MT7621
boards, for example.
Manuel Lauss [Thu, 11 May 2023 15:30:10 +0000 (17:30 +0200)]
MIPS: Alchemy: fix dbdma2
Various fixes for the Au1200/Au1550/Au1300 DBDMA2 code:
- skip cache invalidation if chip has working coherency circuitry.
- invalidate KSEG0-portion of the (physical) data address.
- force the dma channel doorbell write out to bus immediately with
a sync.
Manuel Lauss [Wed, 10 May 2023 10:33:23 +0000 (12:33 +0200)]
MIPS: Restore Au1300 support
The Au1300, at least the one I have to test, uses the NetLogic vendor
ID, but commit 95b8a5e0111a ("MIPS: Remove NETLOGIC support") also
dropped Au1300 detection. Restore Au1300 detection.
Jingbo Xu [Mon, 15 May 2023 10:39:41 +0000 (18:39 +0800)]
erofs: fix null-ptr-deref caused by erofs_xattr_prefixes_init
Fragments and dedupe share one feature bit, and thus packed inode may not
exist when fragment feature bit (dedupe feature bit exactly) is set, e.g.
when deduplication feature is in use while fragments feature is not. In
this case, sbi->packed_inode could be NULL while fragments feature bit
is set.
Fix this by accessing packed inode only when it exists.
gpio-f7188x: fix chip name and pin count on Nuvoton chip
In fact the device with chip id 0xD283 is called NCT6126D, and that is
the chip id the Nuvoton code was written for. Correct that name to avoid
confusion, because a NCT6116D in fact exists as well but has another
chip id, and is currently not supported.
The look at the spec also revealed that GPIO group7 in fact has 8 pins,
so correct the pin count in that group as well.
Like Xu [Wed, 17 May 2023 13:38:08 +0000 (21:38 +0800)]
perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS
After commit b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing
PEBS_DATA_CFG"), the cpuc->pebs_data_cfg may save some bits that are not
supported by real hardware, such as PEBS_UPDATE_DS_SW. This would cause
the VMX hardware MSR switching mechanism to save/restore invalid values
for PEBS_DATA_CFG MSR, thus crashing the host when PEBS is used for guest.
Fix it by using the active host value from cpuc->active_pebs_data_cfg.
After the cited patch, mlx5_irq xarray index can be different then
mlx5_irq MSIX table index.
Fix it by storing both mlx5_irq xarray index and MSIX table index.
The cited patch deny the user of changing the affinity of mlx5 irqs,
which break backward compatibility.
Hence, allow the user to change the affinity of mlx5 irqs.
Whenever a shutdown is invoked, free irqs only and keep mlx5_irq
synthetic wrapper intact in order to avoid use-after-free on
system shutdown.
for example:
==================================================================
BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608
The buggy address belongs to the object at ffff88823fc0d300
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 24 bytes inside of
192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)
Shay Drory [Tue, 2 May 2023 10:36:42 +0000 (13:36 +0300)]
net/mlx5: Devcom, serialize devcom registration
From one hand, mlx5 driver is allowing to probe PFs in parallel.
From the other hand, devcom, which is a share resource between PFs, is
registered without any lock. This might resulted in memory problems.
Hence, use the global mlx5_dev_list_lock in order to serialize devcom
registration.
Shay Drory [Tue, 2 May 2023 10:35:11 +0000 (13:35 +0300)]
net/mlx5: Devcom, fix error flow in mlx5_devcom_register_device
In case devcom allocation is failed, mlx5 is always freeing the priv.
However, this priv might have been allocated by a different thread,
and freeing it might lead to use-after-free bugs.
Fix it by freeing the priv only in case it was allocated by the
running thread.
Shay Drory [Mon, 6 Feb 2023 09:52:02 +0000 (11:52 +0200)]
net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register
devcom events are sent to all registered component. Following the
cited patch, it is possible for two components, e.g.: two eswitches,
to send devcom events, while both components are registered. This
means eswitch layer will do double un/pairing, which is double
allocation and free of resources, even though only one un/pairing is
needed. flow example:
[ 826.534521] The buggy address belongs to the object at ffff888194485800
which belongs to the cache kmalloc-512 of size 512
[ 826.535506] The buggy address is located 48 bytes inside of
freed 512-byte region [ffff888194485800, ffff888194485a00)
[ 826.541095] Memory state around the buggy address:
[ 826.541519] ffff888194485700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 826.542149] ffff888194485780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 826.542773] >ffff888194485800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 826.543400] ^
[ 826.543822] ffff888194485880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 826.544452] ffff888194485900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 826.545079] ==================================================================
Fixes: 6702782845a5 ("net/mlx5e: TC, Set CT miss to the specific ct action instance") Signed-off-by: Paul Blakey <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Rahul Rameshbabu [Wed, 22 Feb 2023 00:18:48 +0000 (16:18 -0800)]
net/mlx5e: Fix SQ wake logic in ptp napi_poll context
Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken
up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb
fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up.
Vlad Buslov [Fri, 31 Mar 2023 12:20:51 +0000 (14:20 +0200)]
net/mlx5e: Fix deadlock in tc route query code
Cited commit causes ABBA deadlock[0] when peer flows are created while
holding the devcom rw semaphore. Due to peer flows offload implementation
the lock is taken much higher up the call chain and there is no obvious way
to easily fix the deadlock. Instead, since tc route query code needs the
peer eswitch structure only to perform a lookup in xarray and doesn't
perform any sleeping operations with it, refactor the code for lockless
execution in following ways:
- RCUify the devcom 'data' pointer. When resetting the pointer
synchronously wait for RCU grace period before returning. This is fine
since devcom is currently only used for synchronization of
pairing/unpairing of eswitches which is rare and already expensive as-is.
- Wrap all usages of 'paired' boolean in {READ|WRITE}_ONCE(). The flag has
already been used in some unlocked contexts without proper
annotations (e.g. users of mlx5_devcom_is_paired() function), but it wasn't
an issue since all relevant code paths checked it again after obtaining the
devcom semaphore. Now it is also used by mlx5_devcom_get_peer_data_rcu() as
"best effort" check to return NULL when devcom is being unpaired. Note that
while RCU read lock doesn't prevent the unpaired flag from being changed
concurrently it still guarantees that reader can continue to use 'data'.
- Refactor mlx5e_tc_query_route_vport() function to use new
mlx5_devcom_get_peer_data_rcu() API which fixes the deadlock.
[0]:
[ 164.599612] ======================================================
[ 164.600142] WARNING: possible circular locking dependency detected
[ 164.600667] 6.3.0-rc3+ #1 Not tainted
[ 164.601021] ------------------------------------------------------
[ 164.601557] handler1/3456 is trying to acquire lock:
[ 164.601998] ffff88811f1714b0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}, at: mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
[ 164.603078]
but task is already holding lock:
[ 164.603617] ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
[ 164.604459]
which lock already depends on the new lock.
net/mlx5e: Use correct encap attribute during invalidation
With introduction of post action infrastructure most of the users of encap
attribute had been modified in order to obtain the correct attribute by
calling mlx5e_tc_get_encap_attr() helper instead of assuming encap action
is always on default attribute. However, the cited commit didn't modify
mlx5e_invalidate_encap() which prevents it from destroying correct modify
header action which leads to a warning [0]. Fix the issue by using correct
attribute.
[0]:
Feb 21 09:47:35 c-237-177-40-045 kernel: WARNING: CPU: 17 PID: 654 at drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:684 mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: RIP: 0010:mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: Call Trace:
Feb 21 09:47:35 c-237-177-40-045 kernel: <TASK>
Feb 21 09:47:35 c-237-177-40-045 kernel: mlx5e_tc_fib_event_work+0x8e3/0x1f60 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: ? mlx5e_take_all_encap_flows+0xe0/0xe0 [mlx5_core]
Feb 21 09:47:35 c-237-177-40-045 kernel: ? lock_downgrade+0x6d0/0x6d0
Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x273/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x273/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel: process_one_work+0x7c2/0x1310
Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
Feb 21 09:47:35 c-237-177-40-045 kernel: ? pwq_dec_nr_in_flight+0x230/0x230
Feb 21 09:47:35 c-237-177-40-045 kernel: ? rwlock_bug.part.0+0x90/0x90
Feb 21 09:47:35 c-237-177-40-045 kernel: worker_thread+0x59d/0xec0
Feb 21 09:47:35 c-237-177-40-045 kernel: ? __kthread_parkme+0xd9/0x1d0
net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE
SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB
(loopback), and FL (force-loopback) QP is preferred for performance. FL is
available when RoCE is enabled or disabled based on RoCE caps.
This patch adds reading of FL capability from HCA caps in addition to the
existing reading from RoCE caps, thus fixing the case where we didn't
have loopback enabled when RoCE was disabled.
Erez Shitrit [Thu, 9 Mar 2023 14:43:15 +0000 (16:43 +0200)]
net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUs
When calculating crc for hash index we use the function crc32 that
calculates for little-endian (LE) arch.
Then we convert it to network endianness using htonl(), but it's wrong
to do the conversion in BE archs since the crc32 value is already LE.
The solution is to switch the bytes from the crc result for all types
of arc.
Fixes: 40416d8ede65 ("net/mlx5: DR, Replace CRC32 implementation to use kernel lib") Signed-off-by: Erez Shitrit <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Shay Drory [Mon, 20 Mar 2023 11:07:53 +0000 (13:07 +0200)]
net/mlx5: Handle pairing of E-switch via uplink un/load APIs
In case user switch a device from switchdev mode to legacy mode, mlx5
first unpair the E-switch and afterwards unload the uplink vport.
From the other hand, in case user remove or reload a device, mlx5
first unload the uplink vport and afterwards unpair the E-switch.
The latter is causing a bug[1], hence, handle pairing of E-switch as
part of uplink un/load APIs.
[1]
In case VF_LAG is used, every tc fdb flow is duplicated to the peer
esw. However, the original esw keeps a pointer to this duplicated
flow, not the peer esw.
e.g.: if user create tc fdb flow over esw0, the flow is duplicated
over esw1, in FW/HW, but in SW, esw0 keeps a pointer to the duplicated
flow.
During module unload while a peer tc fdb flow is still offloaded, in
case the first device to be removed is the peer device (esw1 in the
example above), the peer net-dev is destroyed, and so the mlx5e_priv
is memset to 0.
Afterwards, the peer device is trying to unpair himself from the
original device (esw0 in the example above). Unpair API invoke the
original device to clear peer flow from its eswitch (esw0), but the
peer flow, which is stored over the original eswitch (esw0), is
trying to use the peer mlx5e_priv, which is memset to 0 and result in
bellow kernel-oops.
Shay Drory [Tue, 2 May 2023 08:03:53 +0000 (11:03 +0300)]
net/mlx5: Collect command failures data only for known commands
DEVX can issue a general command, which is not used by mlx5 driver.
In case such command is failed, mlx5 is trying to collect the failure
data, However, mlx5 doesn't create a storage for this command, since
mlx5 doesn't use it. This lead to array-index-out-of-bounds error.
Fix it by checking whether the command is known before collecting the
failure data.
Fixes: 34f46ae0d4b3 ("net/mlx5: Add command failures data to debugfs") Signed-off-by: Shay Drory <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
sock_alloc_file() calls release_sock() on error but the left hand
side of the assignment dereferences "sock". This isn't the bug and
I didn't report this earlier because there is an assert that it
doesn't fail.
Chuck Lever [Fri, 19 May 2023 17:12:50 +0000 (13:12 -0400)]
net/handshake: Squelch allocation warning during Kunit test
The "handshake_req_alloc excessive privsize" kunit test is intended
to check what happens when the maximum privsize is exceeded. The
WARN_ON_ONCE_GFP at mm/page_alloc.c:4744 can be disabled safely for
this test.
Linus Torvalds [Mon, 22 May 2023 23:31:28 +0000 (16:31 -0700)]
Merge tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Dave Hansen:
"This works around and issue where the INVLPG instruction may miss
invalidating kernel TLB entries in recent hybrid CPUs.
I do expect an eventual microcode fix for this. When the microcode
version numbers are known, we can circle back around and add them the
model table to disable this workaround"
* tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Avoid incomplete Global INVLPG flushes
Linus Torvalds [Mon, 22 May 2023 22:35:53 +0000 (15:35 -0700)]
Merge tag 'modules-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module fix from Luis Chamberlain:
"Only one fix has trickled through. Harshit Mogalapalli found a
use-after-free bug through static analysis with smatch"
* tag 'modules-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: Fix use-after-free bug in read_file_mod_stats()
Linus Torvalds [Mon, 22 May 2023 19:01:13 +0000 (12:01 -0700)]
Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Stable Fix:
- Don't change task->tk_status after the call to rpc_exit_task
Other Bugfixes:
- Convert kmap_atomic() to kmap_local_folio()
- Fix a potential double free with READ_PLUS"
* tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.2: Fix a potential double free with READ_PLUS
SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
NFS: Convert kmap_atomic() to kmap_local_folio()
Anton Protopopov [Mon, 22 May 2023 15:45:58 +0000 (15:45 +0000)]
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
Imre Deak [Wed, 10 May 2023 10:31:18 +0000 (13:31 +0300)]
drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration
For a bigjoiner configuration display->crtc_disable() will be called
first for the slave CRTCs and then for the master CRTC. However slave
CRTCs will be actually disabled only after the master CRTC is disabled
(from the encoder disable hooks called with the master CRTC state).
Hence the slave PIPEDMCs can be disabled only after the master CRTC is
disabled, make this so.
intel_encoders_post_pll_disable() must be called only for the master
CRTC, as for the other two encoder disable hooks. While at it fix this
up as well. This didn't cause a problem, since
intel_encoders_post_pll_disable() will call the corresponding hook only
for an encoder/connector connected to the given CRTC, however slave
CRTCs will have no associated encoder/connector.
Tetsuo Handa [Thu, 11 May 2023 13:47:32 +0000 (22:47 +0900)]
debugobjects: Don't wake up kswapd from fill_pool()
syzbot is reporting a lockdep warning in fill_pool() because the allocation
from debugobjects is using GFP_ATOMIC, which is (__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
and therefore tries to wake up kswapd, which acquires kswapd_wait::lock.
Since fill_pool() might be called with arbitrary locks held, fill_pool()
should not assume that acquiring kswapd_wait::lock is safe.
Use __GFP_HIGH instead and remove __GFP_NORETRY as it is pointless for
!__GFP_DIRECT_RECLAIM allocation.
Takashi Iwai [Thu, 18 May 2023 11:35:20 +0000 (13:35 +0200)]
ALSA: hda: Fix unhandled register update during auto-suspend period
It's reported that the recording started right after the driver probe
doesn't work properly, and it turned out that this is related with the
codec auto-suspend. Namely, after the probe phase, the usage count
goes zero, and the auto-suspend is programmed, but the codec is kept
still active until the auto-suspend expiration. When an application
(e.g. alsactl) updates the mixer values at this moment, the values are
cached but not actually written. Then, starting arecord thereafter
also results in the silence because of the missing unmute.
The root cause is the handling of "lazy update" mode; when a mixer
value is updated *after* the suspend, it should update only the cache
and exits. At the resume, the cached value is written to the device,
in turn. The problem is that the current code misinterprets the state
of auto-suspend as if it were already suspended.
Although we can add the check of the actual device state after
pm_runtime_get_if_in_use() for catching the missing state, this won't
suffice; the second call of regmap_update_bits_check() will skip
writing the register because the cache has been already updated by the
first call. So we'd need fixes in two different places.
OTOH, a simpler fix is to replace pm_runtime_get_if_in_use() with
pm_runtime_get_if_active() (with ign_usage_count=true). This change
implies that the driver takes the pm refcount if the device is still
in ACTIVE state and continues the processing. A small caveat is that
this will leave the auto-suspend timer. But, since the timer callback
itself checks the device state and aborts gracefully when it's active,
this won't be any substantial problem.
Long story short: we address the missing register-write problem just
by replacing the pm_runtime_*() call in snd_hda_keep_power_up().
Finn Thain [Sat, 6 May 2023 09:38:12 +0000 (19:38 +1000)]
m68k: Move signal frame following exception on 68020/030
On 68030/020, an instruction such as, moveml %a2-%a3/%a5,%sp@- may cause
a stack page fault during instruction execution (i.e. not at an
instruction boundary) and produce a format 0xB exception frame.
In this situation, the value of USP will be unreliable. If a signal is
to be delivered following the exception, this USP value is used to
calculate the location for a signal frame. This can result in a
corrupted user stack.
The corruption was detected in dash (actually in glibc) where it showed
up as an intermittent "stack smashing detected" message and crash
following signal delivery for SIGCHLD.
It was hard to reproduce that failure because delivery of the signal
raced with the page fault and because the kernel places an unpredictable
gap of up to 7 bytes between the USP and the signal frame.
A format 0xB exception frame can be produced by a bus error or an
address error. The 68030 Users Manual says that address errors occur
immediately upon detection during instruction prefetch. The instruction
pipeline allows prefetch to overlap with other instructions, which means
an address error can arise during the execution of a different
instruction. So it seems likely that this patch may help in the address
error case also.
Charles Keepax [Thu, 18 May 2023 09:39:26 +0000 (10:39 +0100)]
spi: spi-cadence: Interleave write of TX and read of RX FIFO
When working in slave mode it seems the timing is exceedingly tight.
The TX FIFO can never empty, because the master is driving the clock so
zeros would be sent for those bytes where the FIFO is empty.
Return to interleaving the writing of the TX FIFO and the reading
of the RX FIFO to try to ensure the data is available when required.
Matthew Auld [Fri, 19 May 2023 09:07:33 +0000 (10:07 +0100)]
drm: fix drmm_mutex_init()
In mutex_init() lockdep identifies a lock by defining a special static
key for each lock class. However if we wrap the macro in a function,
like in drmm_mutex_init(), we end up generating:
The static __key here is what lockdep uses to identify the lock class,
however since this is just a normal function the key here will be
created once, where all callers then use the same key. In effect the
mutex->depmap.key will be the same pointer for different
drmm_mutex_init() callers. This then results in impossible lockdep
splats since lockdep thinks completely unrelated locks are the same lock
class.
To fix this turn drmm_mutex_init() into a macro such that it generates a
different "static struct lock_class_key __key" for each invocation,
which looks to be inline with what mutex_init() wants.
v2:
- Revamp the commit message with clearer explanation of the issue.
- Rather export __drmm_mutex_release() than static inline.
All IPCs using instance_id use 8 bit value. Original commit used 16 bit
value because FW reports possible max value in 16 bit field, but in
practice FW limits the value to 8 bits.
Cezary Rojewski [Fri, 19 May 2023 20:17:09 +0000 (22:17 +0200)]
ASoC: Intel: avs: Account for UID of ACPI device
Configurations with multiple codecs attached to the platform are
supported but only if each from the set is different. Add new field
representing the 'Unique ID' so that codecs that share Vendor and Part
IDs can be differentiated and thus enabling support for such
configurations.
When changing value of kcontrol, FW module to which data should be send
needs to be found. Currently it is done in improper way, fix it. Change
function name to indicate that it looks only for volume module.
This allows to change volume during runtime, instead of only changing
init value.
Xin Long [Thu, 18 May 2023 20:03:00 +0000 (16:03 -0400)]
sctp: fix an issue that plpmtu can never go to complete state
When doing plpmtu probe, the probe size is growing every time when it
receives the ACK during the Search state until the probe fails. When
the failure occurs, pl.probe_high is set and it goes to the Complete
state.
However, if the link pmtu is huge, like 65535 in loopback_dev, the probe
eventually keeps using SCTP_MAX_PLPMTU as the probe size and never fails.
Because of that, pl.probe_high can not be set, and the plpmtu probe can
never go to the Complete state.
Fix it by setting pl.probe_high to SCTP_MAX_PLPMTU when the probe size
grows to SCTP_MAX_PLPMTU in sctp_transport_pl_recv(). Also, not allow
the probe size greater than SCTP_MAX_PLPMTU in the Complete state.
Fixes: b87641aff9e7 ("sctp: do state transition when a probe succeeds on HB ACK recv path") Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Thomas Gleixner [Mon, 22 May 2023 06:11:01 +0000 (08:11 +0200)]
Merge tag 'irqchip-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip fixes from Marc Zyngier:
- MIPS GIC fixes for issues that could result in either
loss of state in the interrupt controller, or a deadlock
- Workaround for Mediatek Chromebooks that only save/restore
partial state when turning the GIC redistributors off,
resulting if fireworks if Linux uses interrupt priorities
for pseudo-NMIs
- Fix the MBIGEN error handling on init
- Mark meson-gpio OF data structures as __maybe_unused,
avoiding compilation warnings on non-OF setups
Linus Torvalds [Sun, 21 May 2023 20:58:37 +0000 (13:58 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Plug a race in the stage-2 mapping code where the IPA and the PA
would end up being out of sync
- Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
- FP/SVE/SME documentation update, in the hope that this field
becomes clearer...
- Add workaround for Apple SEIS brokenness to a new SoC
- Random comment fixes
x86:
- add MSR_IA32_TSX_CTRL into msrs_to_save
- fixes for XCR0 handling in SGX enclaves
Generic:
- Fix vcpu_array[0] races
- Fix race between starting a VM and 'reboot -f'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save
KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)
KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATE
KVM: Fix vcpu_array[0] races
KVM: VMX: Fix header file dependency of asm/vmx.h
KVM: Don't enable hardware after a restart/shutdown is initiated
KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown
KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
KVM: arm64: Clarify host SME state management
KVM: arm64: Restructure check for SVE support in FP trap handler
KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
KVM: arm64: Fix repeated words in comments
KVM: arm64: Constify start/end/phys fields of the pgtable walker data
KVM: arm64: Infer PA offset from VA in hyp map walker
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
KVM: arm64: Use the bitmap API to allocate bitmaps
KVM: arm64: Slightly optimize flush_context()
* tag 'perf-tools-fixes-for-v6.4-1-2023-05-20' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (33 commits)
perf bench syscall: Fix __NR_execve undeclared build error
perf test attr: Fix python SafeConfigParser() deprecation warning
perf test attr: Update no event/metric expectations
tools headers disabled-features: Sync with the kernel sources
tools headers UAPI: Sync arch prctl headers with the kernel sources
tools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench'
tools headers x86 cpufeatures: Sync with the kernel sources
tools headers UAPI: Sync s390 syscall table file that wires up the memfd_secret syscall
tools headers UAPI: Sync linux/prctl.h with the kernel sources
perf metrics: Avoid segv with --topdown for metrics without a group
perf lock contention: Add empty 'struct rq' to satisfy libbpf 'runqueue' type verification
perf cs-etm: Fix contextid validation
perf arm64: Fix build with refcount checking
perf test: Add stat test for record and script
perf script: Skip aggregation for stat events
perf build: Add system include paths to BPF builds
perf bpf skels: Make vmlinux.h use bpf.h and perf_event.h in source directory
perf parse-events: Do not break up AUX event group
perf test test_intel_pt.sh: Test sample mode with event with PMU name
perf evsel: Modify group pmu name for software events
...
Linus Torvalds [Sun, 21 May 2023 18:53:52 +0000 (11:53 -0700)]
Merge tag 'powerpc-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix broken soft dirty tracking when using the Radix MMU (>= P9)
- Fix ISA mapping when "ranges" property is not present, for PASemi
Nemo boards
- Fix a possible WARN_ON_ONCE hitting in BPF extable handling
- Fix incorrect DMA address handling when using 2MB TCEs
- Fix a bug in IOMMU table handling for SR-IOV devices
- Fix the recent rework of IOMMU handling which left arch code calling
clean up routines that are handled by the IOMMU core
- A few assorted build fixes
Thanks to Christian Zigotzky, Dan Horák, Gaurav Batra, Hari Bathini,
Jason Gunthorpe, Nathan Chancellor, Naveen N. Rao, Nicholas Piggin, Pali
Rohár, Randy Dunlap, and Rob Herring.
* tag 'powerpc-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/iommu: Incorrect DDW Table is referenced for SR-IOV device
powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs
powerpc/iommu: Remove iommu_del_device()
powerpc/crypto: Fix aes-gcm-p10 build when VSX=n
powerpc/bpf: populate extable entries only during the last pass
powerpc/boot: Disable power10 features after BOOTAFLAGS assignment
powerpc/64s/radix: Fix soft dirty tracking
powerpc/fsl_uli1575: fix kconfig warnings and build errors
powerpc/isa-bridge: Fix ISA mapping when "ranges" is not present