scsi: megaraid_sas: Target with invalid LUN ID is deleted during scan
The megaraid_sas driver supports single LUN for RAID devices. That is LUN
0. All other LUNs are unsupported. When a device scan on a logical target
with invalid LUN number is invoked through sysfs, that target ends up
getting removed.
Add LUN ID validation in the slave destroy function to avoid the target
deletion.
Xiaomeng Tong [Sun, 20 Mar 2022 15:07:33 +0000 (23:07 +0800)]
scsi: ufs: ufshpb: Fix a NULL check on list iterator
The list iterator is always non-NULL so the check 'if (!rgn)' is always
false and the dev_err() is never called. Move the check outside the loop
and determine if 'victim_rgn' is NULL, to fix this bug.
scsi: sd: Clean up gendisk if device_add_disk() failed
We forgot to call blk_cleanup_disk() when device_add_disk() failed. This
would cause a memory leak of gendisk and sched_tags allocated in
elevator_init_mq()
Variable dmp is being assigned a value that is never read, the variable is
redundant and can be removed.
Cleans up clang scan build warning:
drivers/message/fusion/mptbase.c:6667:39: warning: Although
the value stored to 'dmp' is used in the enclosing expression,
the value is never actually read from 'dmp' [deadcode.DeadStores]
Alexey Galakhov [Wed, 9 Mar 2022 21:25:35 +0000 (22:25 +0100)]
scsi: mvsas: Add PCI ID of RocketRaid 2640
The HighPoint RocketRaid 2640 is a low-cost SAS controller based on Marvell
chip. The chip in question was already supported by the kernel, just the
PCI ID of this particular board was missing.
scsi: mpt3sas: Fail reset operation if config request timed out
As part of controller reset operation the driver issues a config request
command. If this command gets times out, then fail the controller reset
operation instead of retrying it.
Finn Thain [Wed, 6 Apr 2022 09:05:39 +0000 (19:05 +1000)]
scsi: sym53c500_cs: Stop using struct scsi_pointer
This driver doesn't use SCp.ptr to save a SCSI command data pointer which
means "scsi pointer" is a complete misnomer here. Only a few members of
struct scsi_pointer are needed so move those to private command data.
The start_addres argument of mpt3sas_check_same_4gb_region() was misnamed
in the function kdoc comment, resulting in the following warning when
compiling with W=1.
drivers/scsi/mpt3sas/mpt3sas_base.c:5728: warning: Function parameter or
member 'start_address' not described in 'mpt3sas_check_same_4gb_region'
drivers/scsi/mpt3sas/mpt3sas_base.c:5728: warning: Excess function
parameter 'reply_pool_start_address' description in
'mpt3sas_check_same_4gb_region'
Fix the argument name in the function kdoc comment to avoid it. While at
it, remove a useless blank line between the kdoc and function code.
Damien Le Moal [Mon, 4 Apr 2022 04:55:47 +0000 (13:55 +0900)]
scsi: scsi_debug: Fix sdebug_blk_mq_poll() in_use_bm bitmap use
The in_use_bm bitmap of struct sdebug_queue should be accessed under
protection of the qc_lock spinlock. Make sure that this lock is taken
before calling find_first_bit() at the beginning of the function
sdebug_blk_mq_poll().
Dave Airlie [Thu, 7 Apr 2022 00:23:03 +0000 (10:23 +1000)]
Merge tag 'imx-drm-fixes-2022-04-06' of git://git.pengutronix.de/pza/linux into drm-fixes
drm/imx: error handling and debug output fixes
Catch an EDID allocation failure in imx-ldb, fix a leaked drm display
mode on DT parsing error in parallel-display, properly remove the
dw_hdmi bridge in case the component_add fails in dw_hdmi-imx, and
fix the IPU clock frequency debug printout in ipu-di.
random: check for signals every PAGE_SIZE chunk of /dev/[u]random
In 1448769c9cdb ("random: check for signal_pending() outside of
need_resched() check"), Jann pointed out that we previously were only
checking the TIF_NOTIFY_SIGNAL and TIF_SIGPENDING flags if the process
had TIF_NEED_RESCHED set, which meant in practice, super long reads to
/dev/[u]random would delay signal handling by a long time. I tried this
using the below program, and indeed I wasn't able to interrupt a
/dev/urandom read until after several megabytes had been read. The bug
he fixed has always been there, and so code that reads from /dev/urandom
without checking the return value of read() has mostly worked for a long
time, for most sizes, not just for <= 256.
Maybe it makes sense to keep that code working. The reason it was so
small prior, ignoring the fact that it didn't work anyway, was likely
because /dev/random used to block, and that could happen for pretty
large lengths of time while entropy was gathered. But now, it's just a
chacha20 call, which is extremely fast and is just operating on pure
data, without having to wait for some external event. In that sense,
/dev/[u]random is a lot more like /dev/zero.
Taking a page out of /dev/zero's read_zero() function, it always returns
at least one chunk, and then checks for signals after each chunk. Chunk
sizes there are of length PAGE_SIZE. Let's just copy the same thing for
/dev/[u]random, and check for signals and cond_resched() for every
PAGE_SIZE amount of data. This makes the behavior more consistent with
expectations, and should mitigate the impact of Jann's fix for the
age-old signal check bug.
Kefeng Wang [Wed, 6 Apr 2022 14:57:57 +0000 (00:57 +1000)]
powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
mpe: On 64-bit Book3E vmalloc space starts at 0x8000000000000000.
Because of the way __pa() works we have:
__pa(0x8000000000000000) == 0, and therefore
virt_to_pfn(0x8000000000000000) == 0, and therefore
virt_addr_valid(0x8000000000000000) == true
Which is wrong, virt_addr_valid() should be false for vmalloc space.
In fact all vmalloc addresses that alias with a valid PFN will return
true from virt_addr_valid(). That can cause bugs with hardened usercopy
as described below by Kefeng Wang:
When running ethtool eth0 on 64-bit Book3E, a BUG occurred:
usercopy: Kernel memory exposure attempt detected from SLUB object not in SLUB page?! (offset 0, size 1048)!
kernel BUG at mm/usercopy.c:99
...
usercopy_abort+0x64/0xa0 (unreliable)
__check_heap_object+0x168/0x190
__check_object_size+0x1a0/0x200
dev_ethtool+0x2494/0x2b20
dev_ioctl+0x5d0/0x770
sock_do_ioctl+0xf0/0x1d0
sock_ioctl+0x3ec/0x5a0
__se_sys_ioctl+0xf0/0x160
system_call_exception+0xfc/0x1f0
system_call_common+0xf8/0x200
The code shows below,
data = vzalloc(array_size(gstrings.len, ETH_GSTRING_LEN));
copy_to_user(useraddr, data, gstrings.len * ETH_GSTRING_LEN))
The data is alloced by vmalloc(), virt_addr_valid(ptr) will return true
on 64-bit Book3E, which leads to the panic.
As commit 4dd7554a6456 ("powerpc/64: Add VIRTUAL_BUG_ON checks for __va
and __pa addresses") does, make sure the virt addr above PAGE_OFFSET in
the virt_addr_valid() for 64-bit, also add upper limit check to make
sure the virt is below high_memory.
Meanwhile, for 32-bit PAGE_OFFSET is the virtual address of the start
of lowmem, high_memory is the upper low virtual address, the check is
suitable for 32-bit, this will fix the issue mentioned in commit 602946ec2f90 ("powerpc: Set max_mapnr correctly") too.
On 32-bit there is a similar problem with high memory, that was fixed in
commit 602946ec2f90 ("powerpc: Set max_mapnr correctly"), but that
commit breaks highmem and needs to be reverted.
We can't easily fix __pa(), we have code that relies on its current
behaviour. So for now add extra checks to virt_addr_valid().
For 64-bit Book3S the extra checks are not necessary, the combination of
virt_to_pfn() and pfn_valid() should yield the correct result, but they
are harmless.
fbdev: Fix unregistering of framebuffers without device
OF framebuffers do not have an underlying device in the Linux
device hierarchy. Do a regular unregister call instead of hot
unplugging such a non-existing device. Fixes a NULL dereference.
An example error message on ppc64le is shown below.
The bug [1] was introduced by commit 27599aacbaef ("fbdev: Hot-unplug
firmware fb devices on forced removal"). Most firmware framebuffers
have an underlying platform device, which can be hot-unplugged
before loading the native graphics driver. OF framebuffers do not
(yet) have that device. Fix the code by unregistering the framebuffer
as before without a hot unplug.
drbd: fix an invalid memory access caused by incorrect use of list iterator
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() don't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to a invalid memory access (including a possible invalid free
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines have already done the work. So just remove
this line and the unneeded label, to fix this bug.
drbd: Fix five use after free bugs in get_initial_state
In get_initial_state, it calls notify_initial_state_done(skb,..) if
cb->args[5]==1. If genlmsg_put() failed in notify_initial_state_done(),
the skb will be freed by nlmsg_free(skb).
Then get_initial_state will goto out and the freed skb will be used by
return value skb->len, which is a uaf bug.
What's worse, the same problem goes even further: skb can also be
freed in the notify_*_state_change -> notify_*_state calls below.
Thus 4 additional uaf bugs happened.
My patch lets the problem callee functions: notify_initial_state_done
and notify_*_state_change return an error code if errors happen.
So that the error codes could be propagated and the uaf bugs can be avoid.
v2 reports a compilation warning. This v3 fixed this warning and built
successfully in my local environment with no additional warnings.
v2: https://lore.kernel.org/patchwork/patch/1435218/
bpf: Adjust bpf_tcp_check_syncookie selftest to test dual-stack sockets
The previous commit fixed support for dual-stack sockets in
bpf_tcp_check_syncookie. This commit adjusts the selftest to verify the
fixed functionality.
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
bpf_tcp_gen_syncookie looks at the IP version in the IP header and
validates the address family of the socket. It supports IPv4 packets in
AF_INET6 dual-stack sockets.
On the other hand, bpf_tcp_check_syncookie looks only at the address
family of the socket, ignoring the real IP version in headers, and
validates only the packet size. This implementation has some drawbacks:
1. Packets are not validated properly, allowing a BPF program to trick
bpf_tcp_check_syncookie into handling an IPv6 packet on an IPv4
socket.
2. Dual-stack sockets fail the checks on IPv4 packets. IPv4 clients end
up receiving a SYNACK with the cookie, but the following ACK gets
dropped.
This patch fixes these issues by changing the checks in
bpf_tcp_check_syncookie to match the ones in bpf_tcp_gen_syncookie. IP
version from the header is taken into account, and it is validated
properly with address family.
Alex Deucher [Fri, 1 Apr 2022 15:08:48 +0000 (11:08 -0400)]
drm/amdgpu/smu10: fix SoC/fclk units in auto mode
SMU takes clock limits in Mhz units. socclk and fclk were
using 10 khz units in some cases. Switch to Mhz units.
Fixes higher than required SoC clocks.
drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw
[Why]
Below general protection fault observed when WebGL Aquarium is run for
longer duration. If drm debug logs are enabled and set to 0x1f then the
issue is observed within 10 minutes of run.
[How]
It calles populate_dml_pipes which uses doubles to initialize.
Adding FPU protection avoids context switch and probable loss of vba context
as there is potential contention while drm debug logs are enabled.
Shirish S [Fri, 11 Mar 2022 15:00:17 +0000 (20:30 +0530)]
amd/display: set backlight only if required
[Why]
comparing pwm bl values (coverted) with user brightness(converted)
levels in commit_tail leads to continuous setting of backlight via dmub
as they don't to match.
This leads overdrive in queuing of commands to DMCU that sometimes lead
to depending on load on DMCU fw:
"[drm:dc_dmub_srv_wait_idle] *ERROR* Error waiting for DMUB idle: status=3"
[How]
Store last successfully set backlight value and compare with it instead
of pwm reads which is not what we should compare with.
Roman Li [Thu, 17 Mar 2022 23:55:05 +0000 (19:55 -0400)]
drm/amd/display: Fix allocate_mst_payload assert on resume
[Why]
On resume we do link detection for all non-MST connectors.
MST is handled separately. However the condition for telling
if connector is on mst branch is not enough for mst hub case.
Link detection for mst branch link leads to mst topology reset.
That causes assert in dc_link_allocate_mst_payload()
Roman Li [Tue, 15 Mar 2022 20:31:14 +0000 (16:31 -0400)]
drm/amd/display: Enable power gating before init_pipes
[Why]
In init_hw() we call init_pipes() before enabling power gating.
init_pipes() tries to power gate dsc but it may fail because
required force-ons are not released yet.
As a result with dsc config the following errors observed on resume:
"REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control"
"REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control"
"REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control"
[How]
Move enable_power_gating_plane() before init_pipes() in init_hw()
Roman Li [Tue, 15 Mar 2022 18:57:34 +0000 (14:57 -0400)]
drm/amd/display: Remove redundant dsc power gating from init_hw
[Why]
DSC Power down code has been moved from dcn31_init_hw into init_pipes()
Need to remove it from dcn10_init_hw() as well to avoid duplicated action
on dcn1.x/2.x
[How]
Remove DSC power down code from dcn10_init_hw()
Chris Park [Tue, 15 Mar 2022 16:21:43 +0000 (12:21 -0400)]
drm/amd/display: Correct Slice reset calculation
[Why]
Once DSC slice cannot fit pixel clock, we incorrectly
reset min slices to 0 and allow max slice to operate,
even when max slice itself cannot fit the pixel clock
properly.
[How]
Change the sequence such that we correctly determine
DSC is not possible when both min slices and max
slices cannot fit pixel clock per slice.
spi: cadence-quadspi: fix protocol setup for non-1-1-X operations
cqspi_set_protocol() only set the data width, but ignored the command
and address width (except for 8-8-8 DTR ops), leading to corruption of
all transfers using 1-X-X or X-X-X ops. Fix by setting the other two
widths as well.
While we're at it, simplify the code a bit by replacing the
CQSPI_INST_TYPE_* constants with ilog2().
Tested on a TI AM64x with a Macronix MX25U51245G QSPI flash with 1-4-4
read and write operations.
Commit b470e10eb43f ("spi: core: add dma_map_dev for dma device") added
dma_map_dev for _spi_map_msg() but missed to add for unmap routine,
__spi_unmap_msg(), so add it now.
Enze Li [Fri, 1 Apr 2022 21:18:42 +0000 (22:18 +0100)]
cdrom: remove unused variable
The clang static analyzer reports the following warning,
File: drivers/cdrom/cdrom.c
Warning: line 1380, column 7
Although the value stored to 'status' is used in enclosing
expression, the value is never actually read from 'status'
Remove the unused variable to eliminate the warning.
net: usb: aqc111: Fix out-of-bounds accesses in RX fixup
aqc111_rx_fixup() contains several out-of-bounds accesses that can be
triggered by a malicious (or defective) USB device, in particular:
- The metadata array (desc_offset..desc_offset+2*pkt_count) can be out of bounds,
causing OOB reads and (on big-endian systems) OOB endianness flips.
- A packet can overlap the metadata array, causing a later OOB
endianness flip to corrupt data used by a cloned SKB that has already
been handed off into the network stack.
- A packet SKB can be constructed whose tail is far beyond its end,
causing out-of-bounds heap data to be considered part of the SKB's
data.
Found doing variant analysis. Tested it with another driver (ax88179_178a), since
I don't have a aqc111 device to test it, but the code looks very similar.
qede_build_skb() assumes build_skb() always works and goes straight
to skb_reserve(). However, build_skb() can fail under memory pressure.
This results in a kernel panic because the skb to reserve is NULL.
Add a check in case build_skb() failed to allocate and return NULL.
The NULL return is handled correctly in callers to qede_build_skb().
Fixes: 8a8633978b842 ("qede: Add build_skb() support.") Signed-off-by: Jamie Bainbridge <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Move it to the CONFIG_IPV6_PIMSM_V2 scope where its used.
Fixes: 4b340a5a726d ("net: ip6mr: add support for passing full packet on wrong mif") Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 6 Apr 2022 14:03:50 +0000 (15:03 +0100)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-04-05
Maciej Fijalkowski says:
We were solving issues around AF_XDP busy poll's not-so-usual scenarios,
such as very big busy poll budgets applied to very small HW rings. This
set carries the things that were found during that work that apply to
net tree.
One thing that was fixed for all in-tree ZC drivers was missing on ice
side all the time - it's about syncing RCU before destroying XDP
resources. Next one fixes the bit that is checked in ice_xsk_wakeup and
third one avoids false setting of DD bits on Tx descriptors.
====================
Boqun Feng [Fri, 25 Mar 2022 02:32:12 +0000 (10:32 +0800)]
Drivers: hv: balloon: Disable balloon and hot-add accordingly
Currently there are known potential issues for balloon and hot-add on
ARM64:
* Unballoon requests from Hyper-V should only unballoon ranges
that are guest page size aligned, otherwise guests cannot handle
because it's impossible to partially free a page. This is a
problem when guest page size > 4096 bytes.
* Memory hot-add requests from Hyper-V should provide the NUMA
node id of the added ranges or ARM64 should have a functional
memory_add_physaddr_to_nid(), otherwise the node id is missing
for add_memory().
These issues require discussions on design and implementation. In the
meanwhile, post_status() is working and essential to guest monitoring.
Therefore instead of disabling the entire hv_balloon driver, the
ballooning (when page size > 4096 bytes) and hot-add are disabled
accordingly for now. Once the issues are fixed, they can be re-enable in
these cases.
Boqun Feng [Fri, 25 Mar 2022 02:32:11 +0000 (10:32 +0800)]
Drivers: hv: balloon: Support status report for larger page sizes
DM_STATUS_REPORT expects the numbers of pages in the unit of 4k pages
(HV_HYP_PAGE) instead of guest pages, so to make it work when guest page
sizes are larger than 4k, convert the numbers of guest pages into the
numbers of HV_HYP_PAGEs.
Note that the numbers of guest pages are still used for tracing because
tracing is internal to the guest kernel.
random: check for signal_pending() outside of need_resched() check
signal_pending() checks TIF_NOTIFY_SIGNAL and TIF_SIGPENDING, which
signal that the task should bail out of the syscall when possible. This
is a separate concept from need_resched(), which checks
TIF_NEED_RESCHED, signaling that the task should preempt.
In particular, with the current code, the signal_pending() bailout
probably won't work reliably.
Change this to look like other functions that read lots of data, such as
read_zero().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
random: do not allow user to keep crng key around on stack
The fast key erasure RNG design relies on the key that's used to be used
and then discarded. We do this, making judicious use of
memzero_explicit(). However, reads to /dev/urandom and calls to
getrandom() involve a copy_to_user(), and userspace can use FUSE or
userfaultfd, or make a massive call, dynamically remap memory addresses
as it goes, and set the process priority to idle, in order to keep a
kernel stack alive indefinitely. By probing
/proc/sys/kernel/random/entropy_avail to learn when the crng key is
refreshed, a malicious userspace could mount this attack every 5 minutes
thereafter, breaking the crng's forward secrecy.
In order to fix this, we just overwrite the stack's key with the first
32 bytes of the "free" fast key erasure output. If we're returning <= 32
bytes to the user, then we can still return those bytes directly, so
that short reads don't become slower. And for long reads, the difference
is hopefully lost in the amortization, so it doesn't change much, with
that amortization helping variously for medium reads.
We don't need to do this for get_random_bytes() and the various
kernel-space callers, and later, if we ever switch to always batching,
this won't be necessary either, so there's no need to change the API of
these functions.
Cc: Theodore Ts'o <[email protected]> Reviewed-by: Jann Horn <[email protected]> Fixes: c92e040d575a ("random: add backtracking protection to the CRNG") Fixes: 186873c549df ("random: use simpler fast key erasure flow on per-cpu keys") Signed-off-by: Jason A. Donenfeld <[email protected]>
The driver doesn't support clause 45 register access yet, but doesn't
check if the access is a c45 one either. This leads to spurious register
reads and writes. Add the check.
David S. Miller [Wed, 6 Apr 2022 12:54:52 +0000 (13:54 +0100)]
Merge branch 'axienet-broken-link'
Andy Chiu says:
====================
Fix broken link on Xilinx's AXI Ethernet in SGMII mode
The Ethernet driver use phy-handle to reference the PCS/PMA PHY. This
could be a problem if one wants to configure an external PHY via phylink,
since it use the same phandle to get the PHY. To fix this, introduce a
dedicated pcs-handle to point to the PCS/PMA PHY and deprecate the use
of pointing it with phy-handle. A similar use case of pcs-handle can be
seen on dpaa2 as well.
--- patch v5 ---
- Re-apply the v4 patch on the net tree.
- Describe the pcs-handle DT binding at ethernet-controller level.
--- patch v6 ---
- Remove "preferrably" to clearify usage of pcs_handle.
--- patch v7 ---
- Rebase the patch on latest net/master
--- patch v8 ---
- Rebase the patch on net-next/master
- Add "reviewed-by" tag in PATCH 3/4: dt-bindings: net: add pcs-handle
attribute
- Remove "fix" tag in last commit message since this is not a critical
bug and will not be back ported to stable.
====================
Andy Chiu [Tue, 5 Apr 2022 09:19:29 +0000 (17:19 +0800)]
net: axiemac: use a phandle to reference pcs_phy
In some SGMII use cases where both a fixed link external PHY and the
internal PCS/PMA PHY need to be configured, we should explicitly use a
phandle "pcs-phy" to get the reference to the PCS/PMA PHY. Otherwise, the
driver would use "phy-handle" in the DT as the reference to both the
external and the internal PCS/PMA PHY.
In other cases where the core is connected to a SFP cage, we could still
point phy-handle to the intenal PCS/PMA PHY, and let the driver connect
to the SFP module, if exist, via phylink.
Andy Chiu [Tue, 5 Apr 2022 09:19:28 +0000 (17:19 +0800)]
dt-bindings: net: add pcs-handle attribute
Document the new pcs-handle attribute to support connecting to an
external PHY. For Xilinx's AXI Ethernet, this is used when the core
operates in SGMII or 1000Base-X modes and links through the internal
PCS/PMA PHY.
Andy Chiu [Tue, 5 Apr 2022 09:19:27 +0000 (17:19 +0800)]
net: axienet: factor out phy_node in struct axienet_local
the struct member `phy_node` of struct axienet_local is not used by the
driver anymore after initialization. It might be a remnent of old code
and could be removed.
Andy Chiu [Tue, 5 Apr 2022 09:19:26 +0000 (17:19 +0800)]
net: axienet: setup mdio unconditionally
The call to axienet_mdio_setup should not depend on whether "phy-node"
pressents on the DT. Besides, since `lp->phy_node` is used if PHY is in
SGMII or 100Base-X modes, move it into the if statement. And the next patch
will remove `lp->phy_node` from driver's private structure and do an
of_node_put on it right away after use since it is not used elsewhere.
In some cases, xdp tx_queue can get used before initialization.
1. interface up/down
2. ring buffer size change
When CPU cores are lower than maximum number of channels of sfc driver,
it creates new channels only for XDP.
When an interface is up or ring buffer size is changed, all channels
are initialized.
But xdp channels are always initialized later.
So, the below scenario is possible.
Packets are received to rx queue of normal channels and it is acted
XDP_TX and tx_queue of xdp channels get used.
But these tx_queues are not initialized yet.
If so, TX DMA or queue error occurs.
In order to avoid this problem.
1. initializes xdp tx_queues earlier than other rx_queue in
efx_start_channels().
2. checks whether tx_queue is initialized or not in efx_xdp_tx_buffers().
Splat looks like:
sfc 0000:08:00.1 enp8s0f1np1: TX queue 10 spurious TX completion id 250
sfc 0000:08:00.1 enp8s0f1np1: resetting (RECOVER_OR_ALL)
sfc 0000:08:00.1 enp8s0f1np1: MC command 0x80 inlen 100 failed rc=-22
(raw=22) arg=789
sfc 0000:08:00.1 enp8s0f1np1: has been disabled
Fixes: f28100cb9c96 ("sfc: fix lack of XDP TX queues - error XDP TX failed (-22)") Acked-by: Martin Habets <[email protected]> Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
While parsing user-provided actions, openvswitch module may dynamically
allocate memory and store pointers in the internal copy of the actions.
So this memory has to be freed while destroying the actions.
Currently there are only two such actions: ct() and set(). However,
there are many actions that can hold nested lists of actions and
ovs_nla_free_flow_actions() just jumps over them leaking the memory.
For example, removal of the flow with the following actions will lead
to a leak of the memory allocated by nf_ct_tmpl_alloc():
actions:clone(ct(commit),0)
Non-freed set() action may also leak the 'dst' structure for the
tunnel info including device references.
Under certain conditions with a high rate of flow rotation that may
cause significant memory leak problem (2MB per second in reporter's
case). The problem is also hard to mitigate, because the user doesn't
have direct control over the datapath flows generated by OVS.
Fix that by iterating over all the nested actions and freeing
everything that needs to be freed recursively.
New build time assertion should protect us from this problem if new
actions will be added in the future.
Unfortunately, openvswitch module doesn't use NLA_F_NESTED, so all
attributes has to be explicitly checked. sample() and clone() actions
are mixing extra attributes into the user-provided action list. That
prevents some code generalization too.
Steve Capper [Wed, 30 Mar 2022 11:25:43 +0000 (12:25 +0100)]
tlb: hugetlb: Add more sizes to tlb_remove_huge_tlb_entry
tlb_remove_huge_tlb_entry only considers PMD_SIZE and PUD_SIZE when
updating the mmu_gather structure.
Unfortunately on arm64 there are two additional huge page sizes that
need to be covered: CONT_PTE_SIZE and CONT_PMD_SIZE. Where an end-user
attempts to employ contiguous huge pages, a VM_BUG_ON can be experienced
due to the fact that the tlb structure hasn't been correctly updated by
the relevant tlb_flush_p.._range() call from tlb_remove_huge_tlb_entry.
This patch adds inequality logic to the generic implementation of
tlb_remove_huge_tlb_entry s.t. CONT_PTE_SIZE and CONT_PMD_SIZE are
effectively covered on arm64. Also, as well as ptes, pmds and puds;
p4ds are now considered too.
arm64: alternatives: mark patch_alternative() as `noinstr`
The alternatives code must be `noinstr` such that it does not patch itself,
as the cache invalidation is only performed after all the alternatives have
been applied.
Mark patch_alternative() as `noinstr`. Mark branch_insn_requires_update()
and get_alt_insn() with `__always_inline` since they are both only called
through patch_alternative().
Booting a kernel in QEMU TCG with KCSAN=y and ARM64_USE_LSE_ATOMICS=y caused
a boot hang:
[ 0.241121] CPU: All CPU(s) started at EL2
The alternatives code was patching the atomics in __tsan_read4() from LL/SC
atomics to LSE atomics.
The following fragment is using LL/SC atomics in the .text section:
| <__tsan_unaligned_read4+304>: ldxr x6, [x2]
| <__tsan_unaligned_read4+308>: add x6, x6, x5
| <__tsan_unaligned_read4+312>: stxr w7, x6, [x2]
| <__tsan_unaligned_read4+316>: cbnz w7, <__tsan_unaligned_read4+304>
This LL/SC atomic sequence was to be replaced with LSE atomics. However since
the alternatives code was instrumentable, __tsan_read4() was being called after
only the first instruction was replaced, which led to the following code in memory:
| <__tsan_unaligned_read4+304>: ldadd x5, x6, [x2]
| <__tsan_unaligned_read4+308>: add x6, x6, x5
| <__tsan_unaligned_read4+312>: stxr w7, x6, [x2]
| <__tsan_unaligned_read4+316>: cbnz w7, <__tsan_unaligned_read4+304>
This caused an infinite loop as the `stxr` instruction never completed successfully,
so `w7` was always 0.
ata: ahci: Rename CONFIG_SATA_LPM_POLICY configuration item back
CONFIG_SATA_LPM_MOBILE_POLICY was renamed to CONFIG_SATA_LPM_POLICY in
commit 4dd4d3deb502 ("ata: ahci: Rename CONFIG_SATA_LPM_MOBILE_POLICY
configuration item").
This can potentially cause problems as users would invisibly lose
configuration policy defaults when they built the new kernel. To
avoid such problems, switch back to the old name (even if it's wrong).
Andrew Lunn [Tue, 5 Apr 2022 00:04:04 +0000 (02:04 +0200)]
net: ethernet: mv643xx: Fix over zealous checking of_get_mac_address()
There is often not a MAC address available in an EEPROM accessible by
Linux with Marvell devices. Instead the bootload has the MAC address
and directly programs it into the hardware. So don't consider an error
from of_get_mac_address() has fatal. However, the check was added for
the case where there is a MAC address in an the EEPROM, but the EEPROM
has not probed yet, and -EPROBE_DEFER is returned. In that case the
error should be returned. So make the check specific to this error
code.
net: openvswitch: don't send internal clone attribute to the userspace.
'OVS_CLONE_ATTR_EXEC' is an internal attribute that is used for
performance optimization inside the kernel. It's added by the kernel
while parsing user-provided actions and should not be sent during the
flow dump as it's not part of the uAPI.
The issue doesn't cause any significant problems to the ovs-vswitchd
process, because reported actions are not really used in the
application lifecycle and only supposed to be shown to a human via
ovs-dpctl flow dump. However, the action list is still incorrect
and causes the following error if the user wants to look at the
datapath flows:
Replace the last instance of acpi_bus_get_device(), added recently
by commit 87e59b36e5e2 ("spi: Support selection of the index of the
ACPI Spi Resource before alloc"), with acpi_fetch_acpi_dev() and
finally drop acpi_bus_get_device() that has no more users.
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Fixes and cleanups:
- A couple of mlx5 fixes related to cvq
- A couple of reverts dropping useless code (code that used it got
reverted earlier)"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa: mlx5: synchronize driver status with CVQ
vdpa: mlx5: prevent cvq work from hogging CPU
Revert "virtio_config: introduce a new .enable_cbs method"
Revert "virtio: use virtio_device_ready() in virtio_device_restore()"
x86/speculation: Restore speculation related MSRs during S3 resume
After resuming from suspend-to-RAM, the MSRs that control CPU's
speculative execution behavior are not being restored on the boot CPU.
These MSRs are used to mitigate speculative execution vulnerabilities.
Not restoring them correctly may leave the CPU vulnerable. Secondary
CPU's MSRs are correctly being restored at S3 resume by
identify_secondary_cpu().
During S3 resume, restore these MSRs for boot CPU when restoring its
processor state.
x86/pm: Save the MSR validity status at context setup
The mechanism to save/restore MSRs during S3 suspend/resume checks for
the MSR validity during suspend, and only restores the MSR if its a
valid MSR. This is not optimal, as an invalid MSR will unnecessarily
throw an exception for every suspend cycle. The more invalid MSRs,
higher the impact will be.
Check and save the MSR validity at setup. This ensures that only valid
MSRs that are guaranteed to not throw an exception will be attempted
during suspend.
firewire: add kernel API to access packet structure in request structure for AR context
In 1394 OHCI specification, descriptor of Asynchronous Receive DMA context
has timeStamp field in its trailer quadlet. The field is written by
the host controller for the time to receive asynchronous request
subaction in isochronous cycle time.
In Linux FireWire subsystem, the value of field is stored to fw_packet
structure and copied to fw_request structure as the part. The fw_request
structure is hidden from unit driver and passed as opaque pointer when
calling registered handler. It's inconvenient to the unit driver which
needs timestamp of packet.
This commit adds kernel API to pick up timestamp from opaque pointer to
fw_request structure.
firewire: add kernel API to access CYCLE_TIME register
1394 OHCI specification defined Isochronous Cycle Timer Register to get
value of CYCLE_TIME register defined by IEEE 1394 for CSR architecture
defined by ISO/IEC 13213. Unit driver can calculate packet time by
compute with the value of CYCLE_TIME and timeStamp field in descriptor
of each isochronous and asynchronous context. The resolution of CYCLE_TIME
is 49.576 MHz, while the one of timeStamp is 8,000 Hz.
Current implementation of Linux FireWire subsystem allows the driver to
get the value of CYCLE_TIMER CSR register by transaction service. The
transaction service has overhead in regard of access to MMIO register.
This commit adds kernel API for unit driver to access the register
directly.
Hector Martin [Tue, 5 Apr 2022 07:22:19 +0000 (16:22 +0900)]
firewire: Add dummy read_csr/write_csr functions
(Hector Martin wrote)
This fixes segfaults when a card gets yanked off of the PCIe bus while
busy, e.g. with a userspace app trying to get the cycle time:
(Takashi Sakamoto wrote)
As long as reading commit 20802224298c ("firewire: core: add forgotten
dummy driver methods, remove unused ones"), three functions are not
implemeted in dummy driver for reason; .read_csr, .write_csr, and
.set_config_rom.
In core of Linux FireWire subsystem, the callback of .set_config_rom is
under acquisition of mutual exclusive for local list of card. The
acquision is also done in process for removal of card, therefore it's
safe for missing implementation of .set_config_rom.
On the other hand, no lock primitive accompanies any call of .read_csr and
.write_csr. For userspace client, check of node shutdown is done in the
beginning of dispatch of ioctl request, while node shifts to shutdown
state in workqueue context enough after card shifts to dummy driver. It's
probable that these two functions are called for the dummy driver by the
code of userspace client. In-kernel unit driver has similar situation.
It's better to add implementation of the two functions for dummy driver.
ALSA: usb-audio: Fix undefined behavior due to shift overflowing the constant
Fix:
sound/usb/midi.c: In function ‘snd_usbmidi_out_endpoint_create’:
sound/usb/midi.c:1389:2: error: case label does not reduce to an integer constant
case USB_ID(0xfc08, 0x0101): /* Unknown vendor Cable */
^~~~
See https://lore.kernel.org/r/YkwQ6%[email protected] for the gory
details as to why it triggers with older gccs only.
[ A slight correction with parentheses around the argument by tiwai ]
Colin Ian King [Tue, 5 Apr 2022 13:54:12 +0000 (14:54 +0100)]
ALSA: echoaudio: remove redundant assignment to variable i
The variable i is being assigned a value that is never read, it
is being re-assigned in the following for-loop. The assignment is
redundant and can be removed.
Kai Vehmanen [Tue, 5 Apr 2022 12:36:22 +0000 (15:36 +0300)]
ALSA: hda/i915 - skip acomp init if no matching display
In systems with only a discrete i915 GPU, the acomp init will
always timeout for the PCH HDA controller instance.
Avoid the timeout by checking the PCI device hierarchy
whether any display class PCI device can be found on the system,
and at the same level as the HDA PCI device. If found, proceed
with the acomp init, which will wait until i915 probe is complete
and component binding can proceed. If no matching display
device is found, the audio component bind can be safely skipped.
The bind timeout will still be hit if the display is present
in the system, but i915 driver does not bind to it by configuration
choice or probe error. In this case the 60sec timeout will be
hit.
Robin Murphy [Tue, 5 Apr 2022 13:27:54 +0000 (14:27 +0100)]
ALSA: emu10k1: Stop using iommu_present()
iommu_get_domain_for_dev() is already perfectly happy to return NULL
if the given device has no IOMMU. Drop the unnecessary check in favour
of just handling that condition appropriately.
Currently when XDP rings are created, each descriptor gets its DD bit
set, which turns out to be the wrong approach as it can lead to a
situation where more descriptors get cleaned than it was supposed to,
e.g. when AF_XDP busy poll is run with a large batch size. In this
situation, the driver would request for more buffers than it is able to
handle.
Fix this by not setting the DD bits in ice_xdp_alloc_setup_rings(). They
should be initialized to zero instead.
Unfortunately, the ice driver doesn't respect the RCU critical section that
XSK wakeup is surrounded with. To fix this, add synchronize_rcu() calls to
paths that destroy resources that might be in use.
This was addressed in other AF_XDP ZC enabled drivers, for reference see
for example commit b3873a5be757 ("net/i40e: Fix concurrency issues
between config flow and XSK")
Fixes: efc2214b6047 ("ice: Add support for XDP") Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Signed-off-by: Maciej Fijalkowski <[email protected]> Tested-by: Shwetha Nagaraju <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Merge tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- prevent deleting subvolume with active swapfile
- fix qgroup reserve limit calculation overflow
- remove device count in superblock and its item in one transaction so
they cant't get out of sync
- skip defragmenting an isolated sector, this could cause some extra IO
- unify handling of mtime/permissions in hole punch with fallocate
- zoned mode fixes:
- remove assert checking for only single mode, we have the
DUP mode implemented
- fix potential lockdep warning while traversing devices
when checking for zone activation
* tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent subvol with swapfile from being deleted
btrfs: do not warn for free space inode in cow_file_range
btrfs: avoid defragging extents whose next extents are not targets
btrfs: fix fallocate to use file_modified to update permissions consistently
btrfs: remove device item and update super block in the same transaction
btrfs: fix qgroup reserve overflow the qgroup limit
btrfs: zoned: remove left over ASSERT checking for single profile
btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone
Andre Przywara [Mon, 4 Apr 2022 11:08:42 +0000 (12:08 +0100)]
irqchip/gic, gic-v3: Prevent GSI to SGI translations
At the moment the GIC IRQ domain translation routine happily converts
ACPI table GSI numbers below 16 to GIC SGIs (Software Generated
Interrupts aka IPIs). On the Devicetree side we explicitly forbid this
translation, actually the function will never return HWIRQs below 16 when
using a DT based domain translation.
We expect SGIs to be handled in the first part of the function, and any
further occurrence should be treated as a firmware bug, so add a check
and print to report this explicitly and avoid lengthy debug sessions.
Marc Zyngier [Tue, 15 Mar 2022 16:50:32 +0000 (16:50 +0000)]
irqchip/gic-v3: Fix GICR_CTLR.RWP polling
It turns out that our polling of RWP is totally wrong when checking
for it in the redistributors, as we test the *distributor* bit index,
whereas it is a different bit number in the RDs... Oopsie boo.
This is embarassing. Not only because it is wrong, but also because
it took *8 years* to notice the blunder...
Marc Zyngier [Thu, 17 Mar 2022 09:49:02 +0000 (09:49 +0000)]
irqchip/gic-v4: Wait for GICR_VPENDBASER.Dirty to clear before descheduling
The way KVM drives GICv4.{0,1} is as follows:
- vcpu_load() makes the VPE resident, instructing the RD to start
scanning for interrupts
- just before entering the guest, we check that the RD has finished
scanning and that we can start running the vcpu
- on preemption, we deschedule the VPE by making it invalid on
the RD
However, we are preemptible between the first two steps. If it so
happens *and* that the RD was still scanning, we nonetheless write
to the GICR_VPENDBASER register while Dirty is set, and bad things
happen (we're in UNPRED land).
This affects both the 4.0 and 4.1 implementations.
Make sure Dirty is cleared before performing the deschedule,
meaning that its_clear_vpend_valid() becomes a sort of full VPE
residency barrier.
YueHaibing [Thu, 17 Mar 2022 13:19:56 +0000 (21:19 +0800)]
irq/qcom-mpm: Fix build error without MAILBOX
If MAILBOX is n, building fails:
drivers/irqchip/irq-qcom-mpm.o: In function `mpm_pd_power_off':
irq-qcom-mpm.c:(.text+0x174): undefined reference to `mbox_send_message'
irq-qcom-mpm.c:(.text+0x174): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `mbox_send_message'
random: opportunistically initialize on /dev/urandom reads
In 6f98a4bfee72 ("random: block in /dev/urandom"), we tried to make a
successful try_to_generate_entropy() call *required* if the RNG was not
already initialized. Unfortunately, weird architectures and old
userspaces combined in TCG test harnesses, making that change still not
realistic, so it was reverted in 0313bc278dac ("Revert "random: block in
/dev/urandom"").
However, rather than making a successful try_to_generate_entropy() call
*required*, we can instead make it *best-effort*.
If try_to_generate_entropy() fails, it fails, and nothing changes from
the current behavior. If it succeeds, then /dev/urandom becomes safe to
use for free. This way, we don't risk the regression potential that led
to us reverting the required-try_to_generate_entropy() call before.
Practically speaking, this means that at least on x86, /dev/urandom
becomes safe. Probably other architectures with working cycle counters
will also become safe. And architectures with slow or broken cycle
counters at least won't be affected at all by this change.
So it may not be the glorious "all things are unified!" change we were
hoping for initially, but practically speaking, it makes a positive
impact.
Now that all in-kernel users of default_attrs for the kobj_type are gone
and converted to properly use the default_groups pointer instead, it can
be safely removed.
There is one standard way to create sysfs files in a kobj_type, and not
two like before, causing confusion as to which should be used.
powerpc/pseries/vas: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the pseries vas sysfs code to use default_groups field
which has been the preferred way since aa30f47cf666 ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.
David Ahern [Mon, 4 Apr 2022 15:09:08 +0000 (09:09 -0600)]
ipv6: Fix stats accounting in ip6_pkt_drop
VRF devices are the loopbacks for VRFs, and a loopback can not be
assigned to a VRF. Accordingly, the condition in ip6_pkt_drop should
be '||' not '&&'.