- Revert "PCI: dwc: Wait for link up only if link is started" to fix a
regression on Qualcomm platforms that don't reach interconnect sync
state if the slot is empty (Johan Hovold)
- Revert "PCI: mvebu: Mark driver as BROKEN" so people can use
pci-mvebu even though some others report problems (Bjorn Helgaas)
- Avoid a NULL pointer dereference when using acpiphp for root bus
hotplug to fix a regression added in v6.5-rc1 (Igor Mammedov)
* tag 'pci-v6.5-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: acpiphp: Use pci_assign_unassigned_bridge_resources() only for non-root bus
Revert "PCI: mvebu: Mark driver as BROKEN"
Revert "PCI: dwc: Wait for link up only if link is started"
MAINTAINERS: Add Manivannan Sadhasivam as DesignWare PCIe driver maintainer
Linus Torvalds [Fri, 11 Aug 2023 16:12:44 +0000 (09:12 -0700)]
Merge tag 'riscv-for-linus-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- Fixes for a pair of kexec_file_load() failures
- A fix to ensure the direct mapping is PMD-aligned
- A fix for CPU feature detection on SMP=n
- The MMIO ordering fences have been strengthened to ensure ordering
WRT delay()
- Fixes for a pair of -Wmissing-variable-declarations warnings
- A fix to avoid PUD mappings in vmap on sv39
- flush_cache_vmap() now flushes the TLB to avoid issues on systems
that cache invalid mappings
* tag 'riscv-for-linus-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Implement flush_cache_vmap()
riscv: Do not allow vmap pud mappings for 3-level page table
riscv: mm: fix 2 instances of -Wmissing-variable-declarations
riscv,mmio: Fix readX()-to-delay() ordering
riscv: Fix CPU feature detection with SMP disabled
riscv: Start of DRAM should at least be aligned on PMD size for the direct mapping
riscv/kexec: load initrd high in available memory
riscv/kexec: handle R_RISCV_CALL_PLT relocation type
Linus Torvalds [Fri, 11 Aug 2023 16:07:23 +0000 (09:07 -0700)]
Merge tag 'parisc-for-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"A bugfix in the LWS code, which used different lock words than the
parisc lightweight spinlock checks. This inconsistency triggered false
positives when the lightweight spinlock checks checked the locks of
mutexes.
The other patches are trivial cleanups and most of them fix sparse
warnings.
Summary:
- Fix LWS code to use same lock words as for the parisc lightweight
spinlocks
- Use PTR_ERR_OR_ZERO() in pdt init code
- Fix lots of sparse warnings"
* tag 'parisc-for-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: perf: Make cpu_device variable static
parisc: ftrace: Add declaration for ftrace_function_trampoline()
parisc: boot: Nuke some sparse warnings in decompressor
parisc: processor: Include asm/smp.h for init_per_cpu()
parisc: unaligned: Include linux/sysctl.h for unaligned_enabled
parisc: Move proc_mckinley_root and proc_runway_root to sba_iommu
parisc: dma: Add prototype for pcxl_dma_start
parisc: parisc_ksyms: Include libgcc.h for libgcc prototypes
parisc: ucmpdi2: Fix no previous prototype for '__ucmpdi2' warning
parisc: firmware: Mark pdc_result buffers local
parisc: firmware: Fix sparse context imbalance warnings
parisc: signal: Fix sparse incorrect type in assignment warning
parisc: ioremap: Fix sparse warnings
parisc: fault: Use C99 arrary initializers
parisc: pdt: Use PTR_ERR_OR_ZERO() to simplify code
parisc: Fix lightweight spinlock checks to not break futexes
Linus Torvalds [Fri, 11 Aug 2023 16:00:31 +0000 (09:00 -0700)]
Merge tag 'cpuidle-psci-v6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull cpuidle psci fixes from Ulf Hansson:
"A couple of cpuidle-psci fixes. Usually, this is managed by arm-soc
maintainers or Rafael, although due to a busy period I have stepped in
to help out:
- Fix the error path to prevent reverting from OSI back to PC mode"
* tag 'cpuidle-psci-v6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
cpuidle: psci: Move enabling OSI mode after power domains creation
cpuidle: dt_idle_genpd: Add helper function to remove genpd topology
Linus Torvalds [Fri, 11 Aug 2023 15:53:58 +0000 (08:53 -0700)]
Merge tag 'drm-fixes-2023-08-11' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This week's fixes, as expected amdgpu is probably a little larger
since it skipped a week, but otherwise a few nouveau fixes, a couple
of bridge, rockchip and ivpu fixes.
amdgpu:
- S/G display workaround for platforms with >= 64G of memory
- S0i3 fix
- SMU 13.0.0 fixes
- Disable SMU 13.x OD features temporarily while the interface is
reworked to enable additional functionality
- Fix cursor gamma issues on DCN3+
- SMU 13.0.6 fixes
- Fix possible UAF in CS IOCTL
- Polaris display regression fix
- Only enable CP GFX shadowing on SR-IOV
amdkfd:
- Raven/Picasso KFD regression fix
bridge:
- it6505: runtime PM fix
- lt9611: revert Do not generate HFP/HBP/HSA and EOT packet
nouveau:
- enable global memory loads for helper invocations for userspace
driver
- dp 1.3 dpcd+ workaround fix
- remove unused function
- revert incorrect NULL check
accel/ivpu:
- Add set_pages_array_wc/uc for internal buffers
rockchip:
- Don't spam logs in atomic check"
* tag 'drm-fixes-2023-08-11' of git://anongit.freedesktop.org/drm/drm: (23 commits)
drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()
drm/amdkfd: disable IOMMUv2 support for Raven
drm/amdkfd: disable IOMMUv2 support for KV/CZ
drm/amdkfd: ignore crat by default
drm/amdgpu/gfx11: only enable CP GFX shadowing on SR-IOV
drm/amd/display: Fix a regression on Polaris cards
drm/amdgpu: fix possible UAF in amdgpu_cs_pass1()
drm/amd/pm: Fix SMU v13.0.6 energy reporting
drm/amd/display: check attr flag before set cursor degamma on DCN3+
drm/amd/pm: disable the SMU13 OD feature support temporarily
drm/amd/pm: correct the pcie width for smu 13.0.0
drm/amd/display: Don't show stack trace for missing eDP
drm/amdgpu: Match against exact bootloader status
drm/amd/pm: skip the RLC stop when S0i3 suspend for SMU v13.0.4/11
drm/amd: Disable S/G for APUs when 64GB or more host memory
drm/rockchip: Don't spam logs in atomic check
accel/ivpu: Add set_pages_array_wc/uc for internal buffers
drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes
Revert "drm/bridge: lt9611: Do not generate HFP/HBP/HSA and EOT packet"
drm/nouveau: remove unused tu102_gr_load() function
...
Ming Lei [Wed, 9 Aug 2023 02:04:40 +0000 (10:04 +0800)]
nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
Now nvme_ns_chr_uring_cmd_iopoll() has switched to request based io
polling, and the associated NS is guaranteed to be live in case of
io polling, so request is guaranteed to be valid because blk-mq uses
pre-allocated request pool.
Remove the rcu read lock in nvme_ns_chr_uring_cmd_iopoll(), which
isn't needed any more after switching to request based io polling.
Fix "BUG: sleeping function called from invalid context" because
set_page_dirty_lock() from blk_rq_unmap_user() may sleep.
Xiang Yang [Thu, 10 Aug 2023 14:06:39 +0000 (22:06 +0800)]
net: pcs: Add missing put_device call in miic_create
The reference of pdev->dev is taken by of_find_device_by_node, so
it should be released when not need anymore.
Fixes: 7dc54d3b8d91 ("net: pcs: add Renesas MII converter driver") Signed-off-by: Xiang Yang <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jason Wang [Thu, 10 Aug 2023 03:12:56 +0000 (23:12 -0400)]
virtio-net: set queues after driver_ok
Commit 25266128fe16 ("virtio-net: fix race between set queues and
probe") tries to fix the race between set queues and probe by calling
_virtnet_set_queues() before DRIVER_OK is set. This violates virtio
spec. Fixing this by setting queues after virtio_device_ready().
Note that rtnl needs to be held for userspace requests to change the
number of queues. So we are serialized in this way.
Dave Airlie [Fri, 11 Aug 2023 04:49:17 +0000 (14:49 +1000)]
Merge tag 'amd-drm-fixes-6.5-2023-08-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.5-2023-08-09:
amdgpu:
- S/G display workaround for platforms with >= 64G of memory
- S0i3 fix
- SMU 13.0.0 fixes
- Disable SMU 13.x OD features temporarily while the interface is reworked
to enable additional functionality
- Fix cursor gamma issues on DCN3+
- SMU 13.0.6 fixes
- Fix possible UAF in CS IOCTL
- Polaris display regression fix
- Only enable CP GFX shadowing on SR-IOV
Dave Airlie [Fri, 11 Aug 2023 03:56:20 +0000 (13:56 +1000)]
Merge tag 'drm-misc-fixes-2023-08-10' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Multiple fixes for nouveau around memory safety and DisplayPort, one fix
to reduce the log level of rockchip, a power state fix for the it6505
bridge, a timing fix for the lt9611 bridge, a cache maintenance fix for
ivpu and one to reset vma->vm_ops on mmap for shmem-helper.
Steve French [Thu, 10 Aug 2023 20:34:21 +0000 (15:34 -0500)]
cifs: fix potential oops in cifs_oplock_break
With deferred close we can have closes that race with lease breaks,
and so with the current checks for whether to send the lease response,
oplock_response(), this can mean that an unmount (kill_sb) can occur
just before we were checking if the tcon->ses is valid. See below:
To fix this change the ordering of the checks before sending the oplock_response
to first check if the openFileList is empty.
Fixes: da787d5b7498 ("SMB3: Do not send lease break acknowledgment if all file handles have been closed") Suggested-by: Bharath SM <[email protected]> Reviewed-by: Bharath SM <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
virtio-mem: check if the config changed before fake offlining memory
If we repeatedly fail to fake offline memory to unplug it, we won't be
sending any unplug requests to the device. However, we only check if the
config changed when sending such (un)plug requests.
We could end up trying for a long time to unplug memory, even though
the config changed already and we're not supposed to unplug memory
anymore. For example, the hypervisor might detect a low-memory situation
while unplugging memory and decide to replug some memory. Continuing
trying to unplug memory in that case can be problematic.
virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block Mode (SBM)
In case offline_and_remove_memory() fails in SBM, we leave a completely
unplugged Linux memory block stick around until we try plugging memory
again. We won't try removing that memory block again.
offline_and_remove_memory() may, for example, fail if we're racing with
another alloc_contig_range() user, if allocating temporary memory fails,
or if some memory notifier rejected the offlining request.
Let's handle that case better, by simple retrying to offline and remove
such memory.
virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY
Just like we do with alloc_contig_range(), let's convert all unknown
errors to -EBUSY, but WARN so we can look into the issue. For example,
offline_pages() could fail with -EINTR, which would be unexpected in our
case.
virtio-mem: remove unsafe unplug in Big Block Mode (BBM)
When "unsafe unplug" is enabled, we don't fake-offline all memory ahead of
actual memory offlining using alloc_contig_range(). Instead, we rely on
offline_pages() to also perform actual page migration, which might fail
or take a very long time.
In that case, it's possible to easily run into endless loops that cannot be
aborted anymore (as offlining is triggered by a workqueue then): For
example, a single (accidentally) permanently unmovable page in
ZONE_MOVABLE results in an endless loop. For ZONE_NORMAL, races between
isolating the pageblock (and checking for unmovable pages) and
concurrent page allocation are possible and similarly result in endless
loops.
The idea of the unsafe unplug mode was to make it possible to more
reliably unplug large memory blocks. However, (a) we really should be
tackling that differently, by extending the alloc_contig_range()-based
mechanism; and (b) this mode is not the default and as far as I know,
it's unused either way.
Shannon Nelson [Tue, 11 Jul 2023 04:24:37 +0000 (21:24 -0700)]
pds_vdpa: fix up debugfs feature bit printing
Make clearer in debugfs output the difference between the hw
feature bits, the features supported through the driver, and
the features that have been negotiated.
Allen Hubbe [Tue, 11 Jul 2023 04:24:36 +0000 (21:24 -0700)]
pds_vdpa: alloc irq vectors on DRIVER_OK
We were allocating irq vectors at the time the aux dev was probed,
but that is before the PCI VF is assigned to a separate iommu domain
by vhost_vdpa. Because vhost_vdpa later changes the iommu domain the
interrupts do not work.
Instead, we can allocate the irq vectors later when we see DRIVER_OK and
know that the reassignment of the PCI VF to an iommu domain has already
happened.
Shannon Nelson [Tue, 11 Jul 2023 04:24:34 +0000 (21:24 -0700)]
pds_vdpa: always allow offering VIRTIO_NET_F_MAC
Our driver sets a mac if the HW is 00:..:00 so we need to be sure to
advertise VIRTIO_NET_F_MAC even if the HW doesn't. We also need to be
sure that virtio_net sees the VIRTIO_NET_F_MAC and doesn't rewrite the
mac address that a user may have set with the vdpa utility.
After reading the hw_feature bits, add the VIRTIO_NET_F_MAC to the driver's
supported_features and use that for reporting what is available. If the
HW is not advertising it, be sure to strip the VIRTIO_NET_F_MAC before
finishing the feature negotiation. If the user specifies a device_features
bitpattern in the vdpa utility without the VIRTIO_NET_F_MAC set, then
don't set the mac.
Hawkins Jiawei [Thu, 10 Aug 2023 11:04:05 +0000 (19:04 +0800)]
virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case
Kernel uses `struct virtio_net_ctrl_rss` to save command-specific-data
for both the VIRTIO_NET_CTRL_MQ_HASH_CONFIG and
VIRTIO_NET_CTRL_MQ_RSS_CONFIG commands.
According to the VirtIO standard, "Field reserved MUST contain zeroes.
It is defined to make the structure to match the layout of
virtio_net_rss_config structure, defined in 5.1.6.5.7.".
Yet for the VIRTIO_NET_CTRL_MQ_HASH_CONFIG command case, the `max_tx_vq`
field in struct virtio_net_ctrl_rss, which corresponds to the
`reserved` field in struct virtio_net_hash_config, is not zeroed,
thereby violating the VirtIO standard.
This patch solves this problem by zeroing this field in
virtnet_init_default_rss().
Linus Torvalds [Thu, 10 Aug 2023 19:37:24 +0000 (12:37 -0700)]
Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bpf.
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae137 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around even
if they are disabled / not going to be probed to be able to apply
quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
upcoming IETF RFC"
* tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: hns3: fix strscpy causing content truncation issue
net: tls: set MSG_SPLICE_PAGES consistently
ibmvnic: Ensure login failure recovery is safe from other resets
ibmvnic: Do partial reset on login failure
ibmvnic: Handle DMA unmapping of login buffs in release functions
ibmvnic: Unmap DMA login rsp buffer on send login fail
ibmvnic: Enforce stronger sanity checks on login response
net: mana: Fix MANA VF unload when hardware is unresponsive
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
ipv6: adjust ndisc_is_useropt() to also return true for PIO
selftests: forwarding: bridge_mdb: Make test more robust
selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
...
vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary
mlx5_vdpa_destroy_mr can be called from .set_map with data ASID after
the control virtqueue ASID iotlb has been populated. The control vq
iotlb must not be cleared, since it will not be populated again.
So call the ASID aware destroy function which makes sure that the
right vq resource is destroyed.
Dragos Tatulea [Wed, 2 Aug 2023 17:12:18 +0000 (20:12 +0300)]
vdpa/mlx5: Fix mr->initialized semantics
The mr->initialized flag is shared between the control vq and data vq
part of the mr init/uninit. But if the control vq and data vq get placed
in different ASIDs, it can happen that initializing the control vq will
prevent the data vq mr from being initialized.
This patch consolidates the control and data vq init parts into their
own init functions. The mr->initialized will now be used for the data vq
only. The control vq currently doesn't need a flag.
The uninitializing part is also taken care of: mlx5_vdpa_destroy_mr got
split into data and control vq functions which are now also ASID aware.
Lin Ma [Thu, 27 Jul 2023 17:57:52 +0000 (20:57 +0300)]
vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check
The vdpa_nl_policy structure is used to validate the nlattr when parsing
the incoming nlmsg. It will ensure the attribute being described produces
a valid nlattr pointer in info->attrs before entering into each handler
in vdpa_nl_ops.
That is to say, the missing part in vdpa_nl_policy may lead to illegal
nlattr after parsing, which could lead to OOB read just like CVE-2023-3773.
This patch adds the missing nla_policy for vdpa max vqp attr to avoid
such bugs.
Lin Ma [Thu, 27 Jul 2023 17:57:50 +0000 (20:57 +0300)]
vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check
The vdpa_nl_policy structure is used to validate the nlattr when parsing
the incoming nlmsg. It will ensure the attribute being described produces
a valid nlattr pointer in info->attrs before entering into each handler
in vdpa_nl_ops.
That is to say, the missing part in vdpa_nl_policy may lead to illegal
nlattr after parsing, which could lead to OOB read just like CVE-2023-3773.
This patch adds the missing nla_policy for vdpa queue index attr to avoid
such bugs.
Lin Ma [Thu, 27 Jul 2023 17:57:48 +0000 (20:57 +0300)]
vdpa: Add features attr to vdpa_nl_policy for nlattr length check
The vdpa_nl_policy structure is used to validate the nlattr when parsing
the incoming nlmsg. It will ensure the attribute being described produces
a valid nlattr pointer in info->attrs before entering into each handler
in vdpa_nl_ops.
That is to say, the missing part in vdpa_nl_policy may lead to illegal
nlattr after parsing, which could lead to OOB read just like CVE-2023-3773.
This patch adds the missing nla_policy for vdpa features attr to avoid
such bugs.
Feng Liu [Wed, 19 Jul 2023 15:45:50 +0000 (11:45 -0400)]
virtio-pci: Fix legacy device flag setting error in probe
The 'is_legacy' flag is used to differentiate between legacy vs modern
device. Currently, it is based on the value of vp_dev->ldev.ioaddr.
However, due to the shared memory of the union between struct
virtio_pci_legacy_device and struct virtio_pci_modern_device, when
virtio_pci_modern_probe modifies the content of struct
virtio_pci_modern_device, it affects the content of struct
virtio_pci_legacy_device, and ldev.ioaddr is no longer zero, causing
the 'is_legacy' flag to be set as true. To resolve issue, when legacy
device is probed, mark 'is_legacy' as true, when modern device is
probed, keep 'is_legacy' as false.
Mike Christie [Sun, 9 Jul 2023 20:28:58 +0000 (15:28 -0500)]
vhost-scsi: Fix alignment handling with windows
The linux block layer requires bios/requests to have lengths with a 512
byte alignment. Some drivers/layers like dm-crypt and the directi IO code
will test for it and just fail. Other drivers like SCSI just assume the
requirement is met and will end up in infinte retry loops. The problem
for drivers like SCSI is that it uses functions like blk_rq_cur_sectors
and blk_rq_sectors which divide the request's length by 512. If there's
lefovers then it just gets dropped. But other code in the block/scsi
layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is
still data left and try to retry the cmd. We can then end up getting
stuck in retry loops where part of the block/scsi thinks there is data
left, but other parts think we want to do IOs of zero length.
Linux will always check for alignment, but windows will not. When
vhost-scsi then translates the iovec it gets from a windows guest to a
scatterlist, we can end up with sg items where the sg->length is not
divisible by 512 due to the misaligned offset:
When the lio backends then convert the SG to bios or other iovecs, we
end up sending them with the same misaligned values and can hit the
issues above.
This just has us drop down to allocating a temp page and copying the data
when we detect a misaligned buffer and the IO is large enough that it
will get split into multiple bad IOs.
Shannon Nelson [Thu, 6 Jul 2023 23:17:18 +0000 (16:17 -0700)]
pds_vdpa: protect Makefile from unconfigured debugfs
debugfs.h protects itself from an undefined DEBUG_FS, so it is
not necessary to check it in the driver code or the Makefile.
The driver code had been updated for this, but the Makefile had
missed the update.
Wolfram Sang [Thu, 29 Jun 2023 12:05:26 +0000 (14:05 +0200)]
virtio-mmio: don't break lifecycle of vm_dev
vm_dev has a separate lifecycle because it has a 'struct device'
embedded. Thus, having a release callback for it is correct.
Allocating the vm_dev struct with devres totally breaks this protection,
though. Instead of waiting for the vm_dev release callback, the memory
is freed when the platform_device is removed. Resulting in a
use-after-free when finally the callback is to be called.
To easily see the problem, compile the kernel with
CONFIG_DEBUG_KOBJECT_RELEASE and unbind with sysfs.
The fix is easy, don't use devres in this case.
Found during my research about object lifetime problems.
hns3_dbg_fill_content()/hclge_dbg_fill_content() is aim to integrate some
items to a string for content, and we add '\n' and '\0' in the last
two bytes of content.
strscpy() will add '\0' in the last byte of destination buffer(one of
items), it result in finishing content print ahead of schedule and some
dump content truncation.
One Error log shows as below:
cat mac_list/uc
UC MAC_LIST:
Expected:
UC MAC_LIST:
FUNC_ID MAC_ADDR STATE
pf 00:2b:19:05:03:00 ACTIVE
The destination buffer is length-bounded and not required to be
NUL-terminated, so just change strscpy() to memcpy() to fix it.
Jakub Kicinski [Tue, 8 Aug 2023 18:09:17 +0000 (11:09 -0700)]
net: tls: set MSG_SPLICE_PAGES consistently
We used to change the flags for the last segment, because
non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
That flag is no longer a thing so remove the setting.
Since flags most likely don't have MSG_SPLICE_PAGES set
this avoids passing parts of the sg as splice and parts
as non-splice. Before commit under Fixes we'd have called
tcp_sendpage() which would add the MSG_SPLICE_PAGES.
Why this leads to trouble remains unclear but Tariq
reports hitting the WARN_ON(!sendpage_ok()) due to
page refcount of 0.
Linus Torvalds [Thu, 10 Aug 2023 18:32:26 +0000 (11:32 -0700)]
Merge tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine fixes from Vinod Koul:
- HAS_IOMEM fixes for fsl edma and intel idma
- return-value fix, interrupt vector setting and typo fix for xilinx
xdma
- email updates for codeaurora email domain move
- correct pause status for pl330 driver
- idxd clear flag on disable fix
- function documentation fix for owl dma
- potential un-allocated memory fix for mcf driver
* tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
dmaengine: xilinx: xdma: Fix typo
dmaengine: xilinx: xdma: Fix interrupt vector setting
dmaengine: owl-dma: Modify mismatched function name
dmaengine: idxd: Clear PRS disable flag when disabling IDXD device
dmaengine: pl330: Return DMA_PAUSED when transaction is paused
dmaengine: qcom_hidma: Update codeaurora email domain
dmaengine: mcf-edma: Fix a potential un-allocated memory access
dmaengine: xilinx: xdma: Fix Judgment of the return value
idmaengine: make FSL_EDMA and INTEL_IDMA64 depends on HAS_IOMEM
Jakub Kicinski [Thu, 10 Aug 2023 17:47:07 +0000 (10:47 -0700)]
Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The existing attempt to resolve races between control plane and GC work
is error prone, as reported by Bien Pham <[email protected]>, some places
forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
of elements.
This series contains the following patches:
1) Do not skip expired elements during walk otherwise elements might
never decrement the reference counter on data, leading to memleak.
2) Add a GC transaction API to replace the former attempt to deal with
races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT
on elements and it creates a GC transaction to remove the expired
elements, GC transaction could abort in case of interference with
control plane and retried later (GC async). Set backends such as
rbtree and pipapo also perform GC from control plane (GC sync), in
such case, element deactivation and removal is safe because mutex
is held then collected elements are released via call_rcu().
3) Adapt existing set backends to use the GC transaction API.
4) Update rhash set backend to set on _DEAD bit to report deleted
elements from datapath for GC.
5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.
* tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
netfilter: nf_tables: don't skip expired elements during walk
====================
Jakub Kicinski [Thu, 10 Aug 2023 17:41:36 +0000 (10:41 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Martin KaFai Lau says:
====================
pull-request: bpf 2023-08-09
We've added 5 non-merge commits during the last 7 day(s) which contain
a total of 6 files changed, 102 insertions(+), 8 deletions(-).
The main changes are:
1) A bpf sockmap memleak fix and a fix in accessing the programs of
a sockmap under the incorrect map type from Xu Kuohai.
2) A refcount underflow fix in xsk from Magnus Karlsson.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
====================
Nick Child [Wed, 9 Aug 2023 22:10:38 +0000 (17:10 -0500)]
ibmvnic: Ensure login failure recovery is safe from other resets
If a login request fails, the recovery process should be protected
against parallel resets. It is a known issue that freeing and
registering CRQ's in quick succession can result in a failover CRQ from
the VIOS. Processing a failover during login recovery is dangerous for
two reasons:
1. This will result in two parallel initialization processes, this can
cause serious issues during login.
2. It is possible that the failover CRQ is received but never executed.
We get notified of a pending failover through a transport event CRQ.
The reset is not performed until a INIT CRQ request is received.
Previously, if CRQ init fails during login recovery, then the ibmvnic
irq is freed and the login process returned error. If failover_pending
is true (a transport event was received), then the ibmvnic device
would never be able to process the reset since it cannot receive the
CRQ_INIT request due to the irq being freed. This leaved the device
in a inoperable state.
Therefore, the login failure recovery process must be hardened against
these possible issues. Possible failovers (due to quick CRQ free and
init) must be avoided and any issues during re-initialization should be
dealt with instead of being propagated up the stack. This logic is
similar to that of ibmvnic_probe().
Nick Child [Wed, 9 Aug 2023 22:10:37 +0000 (17:10 -0500)]
ibmvnic: Do partial reset on login failure
Perform a partial reset before sending a login request if any of the
following are true:
1. If a previous request times out. This can be dangerous because the
VIOS could still receive the old login request at any point after
the timeout. Therefore, it is best to re-register the CRQ's and
sub-CRQ's before retrying.
2. If the previous request returns an error that is not described in
PAPR. PAPR provides procedures if the login returns with partial
success or aborted return codes (section L.5.1) but other values
do not have a defined procedure. Previously, these conditions
just returned error from the login function rather than trying
to resolve the issue.
This can cause further issues since most callers of the login
function are not prepared to handle an error when logging in. This
improper cleanup can lead to the device being permanently DOWN'd.
For example, if the VIOS believes that the device is already logged
in then it will return INVALID_STATE (-7). If we never re-register
CRQ's then it will always think that the device is already logged
in. This leaves the device inoperable.
The partial reset involves freeing the sub-CRQs, freeing the CRQ then
registering and initializing a new CRQ and sub-CRQs. This essentially
restarts all communication with VIOS to allow for a fresh login attempt
that will be unhindered by any previous failed attempts.
Nick Child [Wed, 9 Aug 2023 22:10:36 +0000 (17:10 -0500)]
ibmvnic: Handle DMA unmapping of login buffs in release functions
Rather than leaving the DMA unmapping of the login buffers to the
login response handler, move this work into the login release functions.
Previously, these functions were only used for freeing the allocated
buffers. This could lead to issues if there are more than one
outstanding login buffer requests, which is possible if a login request
times out.
If a login request times out, then there is another call to send login.
The send login function makes a call to the login buffer release
function. In the past, this freed the buffers but did not DMA unmap.
Therefore, the VIOS could still write to the old login (now freed)
buffer. It is for this reason that it is a good idea to leave the DMA
unmap call to the login buffers release function.
Since the login buffer release functions now handle DMA unmapping,
remove the duplicate DMA unmapping in handle_login_rsp().
Nick Child [Wed, 9 Aug 2023 22:10:35 +0000 (17:10 -0500)]
ibmvnic: Unmap DMA login rsp buffer on send login fail
If the LOGIN CRQ fails to send then we must DMA unmap the response
buffer. Previously, if the CRQ failed then the memory was freed without
DMA unmapping.
Nick Child [Wed, 9 Aug 2023 22:10:34 +0000 (17:10 -0500)]
ibmvnic: Enforce stronger sanity checks on login response
Ensure that all offsets in a login response buffer are within the size
of the allocated response buffer. Any offsets or lengths that surpass
the allocation are likely the result of an incomplete response buffer.
In these cases, a full reset is necessary.
When attempting to login, the ibmvnic device will allocate a response
buffer and pass a reference to the VIOS. The VIOS will then send the
ibmvnic device a LOGIN_RSP CRQ to signal that the buffer has been filled
with data. If the ibmvnic device does not get a response in 20 seconds,
the old buffer is freed and a new login request is sent. With 2
outstanding requests, any LOGIN_RSP CRQ's could be for the older
login request. If this is the case then the login response buffer (which
is for the newer login request) could be incomplete and contain invalid
data. Therefore, we must enforce strict sanity checks on the response
buffer values.
Testing has shown that the `off_rxadd_buff_size` value is filled in last
by the VIOS and will be the smoking gun for these circumstances.
Until VIOS can implement a mechanism for tracking outstanding response
buffers and a method for mapping a LOGIN_RSP CRQ to a particular login
response buffer, the best ibmvnic can do in this situation is perform a
full reset.
net: mana: Fix MANA VF unload when hardware is unresponsive
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
hardware to complete any inflight packets and set the pending send count
to zero. But if the hardware has failed, mana_dealloc_queues()
could wait forever.
Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
which is a somewhat arbitrary value that is more than long enough for
functional hardware to complete any sends.
The RISC-V kernel needs a sfence.vma after a page table modification: we
used to rely on the vmalloc fault handling to emit an sfence.vma, but
commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
need to explicitly emit a sfence.vma in flush_cache_vmap().
Note that we don't need to implement flush_cache_vunmap() as the generic
code should emit a flush tlb after unmapping a vmalloc region.
Alexandre Ghiti [Tue, 8 Aug 2023 13:07:09 +0000 (15:07 +0200)]
riscv: Do not allow vmap pud mappings for 3-level page table
The vmalloc_fault() path was removed and to avoid syncing the vmalloc PGD
mappings, they are now preallocated. But if the kernel can use a PUD
mapping (which in sv39 is actually a PGD mapping) for large vmalloc
allocation, it will free the current unused preallocated PGD mapping and
install a new leaf one. Since there is no sync anymore, some page tables
lack this new mapping and that triggers a panic.
Helge Deller [Wed, 9 Aug 2023 07:21:58 +0000 (09:21 +0200)]
parisc: Fix lightweight spinlock checks to not break futexes
The lightweight spinlock checks verify that a spinlock has either value
0 (spinlock locked) and that not any other bits than in
__ARCH_SPIN_LOCK_UNLOCKED_VAL is set.
This breaks the current LWS code, which writes the address of the lock
into the lock word to unlock it, which was an optimization to save one
assembler instruction.
Fix it by making spinlock_types.h accessible for asm code, change the
LWS spinlock-unlocking code to write __ARCH_SPIN_LOCK_UNLOCKED_VAL into
the lock word, and add some missing lightweight spinlock checks to the
LWS path. Finally, make the spinlock checks dependend on DEBUG_KERNEL.
Josef Bacik [Wed, 2 Aug 2023 13:20:24 +0000 (09:20 -0400)]
btrfs: set cache_block_group_error if we find an error
We set cache_block_group_error if btrfs_cache_block_group() returns an
error, this is because we could end up not finding space to allocate and
mistakenly return -ENOSPC, and which could then abort the transaction
with the incorrect errno, and in the case of ENOSPC result in a
WARN_ON() that will trip up tests like generic/475.
However there's the case where multiple threads can be racing, one
thread gets the proper error, and the other thread doesn't actually call
btrfs_cache_block_group(), it instead sees ->cached ==
BTRFS_CACHE_ERROR. Again the result is the same, we fail to allocate
our space and return -ENOSPC. Instead we need to set
cache_block_group_error to -EIO in this case to make sure that if we do
not make our allocation we get the appropriate error returned back to
the caller.
Qu Wenruo [Thu, 3 Aug 2023 09:20:43 +0000 (17:20 +0800)]
btrfs: reject invalid reloc tree root keys with stack dump
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its
subvolume tree.
[CAUSE]
After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume
trees, we just COW the involved tree block, no need to create a reloc
tree since those tree blocks won't be shared with other trees.
Only subvolumes tree can share tree blocks with other trees (thus they
have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted
on-disk data can trigger that ASSERT().
[FIX]
Besides the dedicated fix and the graceful exit, also let tree-checker to
check such root keys, to make sure reloc trees can only exist for subvolumes.
Qu Wenruo [Thu, 3 Aug 2023 09:20:42 +0000 (17:20 +0800)]
btrfs: exit gracefully if reloc roots don't match
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
[CAUSE]
The root cause of the triggered ASSERT() is we can have a race between
quota tree creation and relocation.
This leads us to create a duplicated quota tree in the
btrfs_read_fs_root() path, and since it's treated as fs tree, it would
have ROOT_SHAREABLE flag, causing us to create a reloc tree for it.
The bug itself is fixed by a dedicated patch for it, but this already
taught us the ASSERT() is not something straightforward for
developers.
[ENHANCEMENT]
Instead of using an ASSERT(), let's handle it gracefully and output
extra info about the mismatch reloc roots to help debug.
Also with the above ASSERT() removed, we can trigger ASSERT(0)s inside
merge_reloc_roots() later.
Also replace those ASSERT(0)s with WARN_ON()s.
[CAUSE]
With extra debugging, the offending reloc_root is for quota tree (rootid 8).
Normally we should not use the reloc tree for quota root at all, as reloc
trees are only for subvolume trees.
But there is a race between quota enabling and relocation, this happens
after commit 85724171b302 ("btrfs: fix the btrfs_get_global_root return value").
Before that commit, for quota and free space tree, we exit immediately
if we cannot grab it from fs_info.
But now we would try to read it from disk, just as if they are fs trees,
this sets ROOT_SHAREABLE flags in such race:
Thread A | Thread B
---------------------------------+------------------------------
btrfs_quota_enable() |
| | btrfs_get_root_ref()
| | |- btrfs_get_global_root()
| | | Returned NULL
| | |- btrfs_lookup_fs_root()
| | | Returned NULL
|- btrfs_create_tree() | |
| Now quota root item is | |
| inserted | |- btrfs_read_tree_root()
| | | Got the newly inserted quota root
| | |- btrfs_init_fs_root()
| | | Set ROOT_SHAREABLE flag
[FIX]
Get back to the old behavior by returning PTR_ERR(-ENOENT) if the target
objectid is not a subvolume tree or data reloc tree.
btrfs: properly clear end of the unreserved range in cow_file_range
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false. The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end. But we've
already added cur_alloc_size to start before the jump, so there might no
range be left from the newly incremented start to end. Move the check for
'start < end' so that it is reached by also for the !extent_reserved case.
btrfs: don't wait for writeback on clean pages in extent_write_cache_pages
__extent_writepage could have started on more pages than the one it was
called for. This happens regularly for zoned file systems, and in theory
could happen for compressed I/O if the worker thread was executed very
quickly. For such pages extent_write_cache_pages waits for writeback
to complete before moving on to the next page, which is highly inefficient
as it blocks the flusher thread.
Port over the PageDirty check that was added to write_cache_pages in
commit 515f4a037fb ("mm: write_cache_pages optimise page cleaning") to
fix this.
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").
Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.
Hans de Goede [Thu, 10 Aug 2023 09:00:11 +0000 (11:00 +0200)]
ACPI: resource: Add IRQ override quirk for PCSpecialist Elimina Pro 16 M
The PCSpecialist Elimina Pro 16 M laptop model is a Zen laptop which
needs to use the MADT IRQ settings override and which does not have
an INT_SRC_OVR entry for IRQ 1 in its MADT.
So this model needs a DMI quirk to enable the MADT IRQ settings override
to fix its keyboard not working.
Josef Bacik [Fri, 21 Jul 2023 20:09:43 +0000 (16:09 -0400)]
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
The assertion added to verify the difference in bits set of the
addresses of srso_untrain_ret_alias() and srso_safe_ret_alias() would fail
to link in LLVM's ld.lld linker with the following error:
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:210: at least one side of
the expression must be absolute
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:211: at least one side of
the expression must be absolute
Use ABSOLUTE to evaluate the expression referring to at least one of the
symbols so that LLD can evaluate the linker script.
Also, add linker version info to the comment about XOR being unsupported
in either ld.bfd or ld.lld until somewhat recently.
Ninad Naik [Wed, 9 Aug 2023 10:06:34 +0000 (15:36 +0530)]
pinctrl: qcom: Add intr_target_width field to support increased number of interrupt targets
SA8775 and newer target have added support for an increased number of
interrupt targets. To implement this change, the intr_target field, which
is used to configure the interrupt target in the interrupt configuration
register is increased from 3 bits to 4 bits.
In accordance to these updates, a new intr_target_width member is
introduced in msm_pingroup structure. This member stores the value of
width of intr_target field in the interrupt configuration register. This
value is used to dynamically calculate and generate mask for setting the
intr_target field. By default, this mask is set to 3 bit wide, to ensure
backward compatibility with the older targets.
Boris Brezillon [Mon, 24 Jul 2023 11:26:10 +0000 (13:26 +0200)]
drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()
The dma-buf backend is supposed to provide its own vm_ops, but some
implementation just have nothing special to do and leave vm_ops
untouched, probably expecting this field to be zero initialized (this
is the case with the system_heap implementation for instance).
Let's reset vma->vm_ops to NULL to keep things working with these
implementations.
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
Set on the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of
performing element removal which might race with an ongoing transaction.
Enable gc when dynamic flag is set on since dynset deletion requires
garbage collection after this patch.
Fixes: d0a8d877da97 ("netfilter: nft_dynset: support for element deletion") Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: nf_tables: adapt set backend to use GC transaction API
Use the GC transaction API to replace the old and buggy gc API and the
busy mark approach.
No set elements are removed from async garbage collection anymore,
instead the _DEAD bit is set on so the set element is not visible from
lookup path anymore. Async GC enqueues transaction work that might be
aborted and retried later.
rbtree and pipapo set backends does not set on the _DEAD bit from the
sync GC path since this runs in control plane path where mutex is held.
In this case, set elements are deactivated, removed and then released
via RCU callback, sync GC never fails.
Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: nf_tables: GC transaction API to avoid race with control plane
The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.
The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.
We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get following deadlock:
cpu 1 cpu2
GC work transaction comes in , lock nft mutex
`acquire nft mutex // BLOCKS
transaction asks to remove the set
set destruction calls cancel_work_sync()
cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.
This patch adds a new API that deals with garbage collection in two
steps:
1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
so they are not visible via lookup. Annotate current GC sequence in
the GC transaction. Enqueue GC transaction work as soon as it is
full. If ruleset is updated, then GC transaction is aborted and
retried later.
2) GC work grabs the mutex. If GC sequence has changed then this GC
transaction lost race with control plane, abort it as it contains
stale references to objects and let GC try again later. If the
ruleset is intact, then this GC transaction deactivates and removes
the elements and it uses call_rcu() to destroy elements.
Note that no elements are removed from GC lockless path, the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore too. There is a new set->dead flag that is set on to
abort the GC transaction to deal with set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.
To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.
Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.
We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
its called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.
This gives us the choice of ABBA deadlock or UaF.
To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.
Set backends are adapted to use the GC transaction API in a follow up
patch entitled:
("netfilter: nf_tables: use gc transaction API in set backends")
This is joint work with Florian Westphal.
Fixes: cfed7e1b1f8e ("netfilter: nf_tables: add set garbage collection helpers") Signed-off-by: Pablo Neira Ayuso <[email protected]>
Linus Torvalds [Thu, 10 Aug 2023 04:12:56 +0000 (21:12 -0700)]
Merge tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Two ksmbd server fixes, both also for stable:
- improve buffer validation when multiple EAs returned
- missing check for command payload size"
* tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd:
ksmbd: fix wrong next length validation of ea buffer in smb2_set_ea()
ksmbd: validate command request size
Aleksa Savic [Mon, 7 Aug 2023 17:20:03 +0000 (19:20 +0200)]
hwmon: (aquacomputer_d5next) Add selective 200ms delay after sending ctrl report
Add a 200ms delay after sending a ctrl report to Quadro,
Octo, D5 Next and Aquaero to give them enough time to
process the request and save the data to memory. Otherwise,
under heavier userspace loads where multiple sysfs entries
are usually set in quick succession, a new ctrl report could
be requested from the device while it's still processing the
previous one and fail with -EPIPE. The delay is only applied
if two ctrl report operations are near each other in time.
Reported by a user on Github [1] and tested by both of us.
Linus Torvalds [Thu, 10 Aug 2023 04:06:14 +0000 (21:06 -0700)]
Merge tag 'perf-tools-fixes-for-v6.5-3-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Revert a patch that unconditionally resolved addresses to inlines in
callchains, something that was done before when DWARF mode was asked
for, but could as well be done when just frame pointers (the default)
was selected.
This enriches the callchains with inlines but the way to resolve it
is gross right now, relying on addr2line, and even if we come up with
an efficient way of processing all the associated DWARF info for a
big file as vmlinux is, this has to be something people opt-in, as it
will still result in overheads, so revert it until we get this done
in a saner way.
- Update the x86 msr-index.h header with the kernel original, no change
in tooling output, just addresses a tools/perf build warning.
- Resolve a regression where special "tool events", such as
"duration_time" were being presented for all CPUs, when it only
makes sense to show it for the workload, that is, just once.
* tag 'perf-tools-fixes-for-v6.5-3-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf stat: Don't display zero tool counts
tools arch x86: Sync the msr-index.h copy with the kernel sources
Revert "perf report: Append inlines to non-DWARF callchains"
Damien Le Moal [Mon, 7 Aug 2023 04:11:48 +0000 (13:11 +0900)]
zonefs: fix synchronous direct writes to sequential files
Commit 16d7fd3cfa72 ("zonefs: use iomap for synchronous direct writes")
changes zonefs code from a self-built zone append BIO to using iomap for
synchronous direct writes. This change relies on iomap submit BIO
callback to change the write BIO built by iomap to a zone append BIO.
However, this change overlooked the fact that a write BIO may be very
large as it is split when issued. The change from a regular write to a
zone append operation for the built BIO can result in a block layer
warning as zone append BIO are not allowed to be split.
This error depends on the hardware used, specifically on the max zone
append bytes and max_[hw_]sectors limits. Tests using AMD Epyc machines
that have low limits did not reveal this issue while runs on Intel Xeon
machines with larger limits trigger it.
Manually splitting the zone append BIO using bio_split_rw() can solve
this issue but also requires issuing the fragment BIOs synchronously
with submit_bio_wait(), to avoid potential reordering of the zone append
BIO fragments, which would lead to data corruption. That is, this
solution is not better than using regular write BIOs which are subject
to serialization using zone write locking at the IO scheduler level.
Given this, fix the issue by removing zone append support and using
regular write BIOs for synchronous direct writes. This allows preseving
the use of iomap and having identical synchronous and asynchronous
sequential file write path. Zone append support will be reintroduced
later through io_uring commands to ensure that the needed special
handling is done correctly.