vdpa_sim_blk: add support for discard and write-zeroes
Expose VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_WRITE_ZEROES features
to the drivers and handle VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES requests checking ranges and flags.
The simulator behaves like a ramdisk, so for VIRTIO_BLK_F_DISCARD
does nothing, while for VIRTIO_BLK_T_WRITE_ZEROES sets to 0 the
specified region.
The simulator behaves like a ramdisk, so we don't have to do
anything when a VIRTIO_BLK_T_FLUSH request is received, but it
could be useful to test driver behavior.
Let's expose the VIRTIO_BLK_F_FLUSH feature to inform the driver
that we support the flush command.
vdpa_sim_blk: make vdpasim_blk_check_range usable by other requests
Next patches will add handling of other requests, where will be
useful to reuse vdpasim_blk_check_range().
So let's make it more generic by adding the `max_sectors` parameter,
since different requests allow different numbers of maximum sectors.
Let's also print the messages directly in vdpasim_blk_check_range()
to avoid duplicate prints.
vdpa_sim_blk: check if sector is 0 for commands other than read or write
VIRTIO spec states: "The sector number indicates the offset
(multiplied by 512) where the read or write is to occur. This field is
unused and set to 0 for commands other than read or write."
Eugenio Pérez [Wed, 10 Aug 2022 17:15:12 +0000 (19:15 +0200)]
vdpa_sim: Implement suspend vdpa op
Implement suspend operation for vdpa_sim devices, so vhost-vdpa will
offer that backend feature and userspace can effectively suspend the
device.
This is a must before get virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them. There are
individual ways to perform that action for some devices
(VHOST_NET_SET_BACKEND, VHOST_VSOCK_SET_RUNNING, ...) but there was no
way to perform it for any vhost device (and, in particular, vhost-vdpa).
Eugenio Pérez [Wed, 10 Aug 2022 17:15:11 +0000 (19:15 +0200)]
vhost-vdpa: uAPI to suspend the device
The ioctl adds support for suspending the device from userspace.
This is a must before getting virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them. There are
individual ways to perform that action for some devices
(VHOST_NET_SET_BACKEND, VHOST_VSOCK_SET_RUNNING, ...) but there was no
way to perform it for any vhost device (and, in particular, vhost-vdpa).
After a successful return of the ioctl call the device must not process
more virtqueue descriptors. The device can answer to read or writes of
config fields as if it were not suspended. In particular, writing to
"queue_enable" with a value of 1 will not make the device start
processing buffers of the virtqueue.
Eugenio Pérez [Wed, 10 Aug 2022 17:15:10 +0000 (19:15 +0200)]
vhost-vdpa: introduce SUSPEND backend feature bit
Userland knows if it can suspend the device or not by checking this feature
bit.
It's only offered if the vdpa driver backend implements the suspend()
operation callback, and to offer it or userland to ack it if the backend
does not offer that callback is an error.
Shigeru Yoshida [Wed, 10 Aug 2022 16:09:48 +0000 (01:09 +0900)]
virtio-blk: Avoid use-after-free on suspend/resume
hctx->user_data is set to vq in virtblk_init_hctx(). However, vq is
freed on suspend and reallocated on resume. So, hctx->user_data is
invalid after resume, and it will cause use-after-free accessing which
will result in the kernel crash something like below:
vDPA: !FEATURES_OK should not block querying device config space
Users may want to query the config space of a vDPA device,
to choose a appropriate one for a certain guest. This means the
users need to read the config space before FEATURES_OK, and
the existence of config space contents does not depend on
FEATURES_OK.
The spec says:
The device MUST allow reading of any device-specific configuration
field before FEATURES_OK is set by the driver. This includes
fields which are conditional on feature bits, as long as those
feature bits are offered by the device.
vDPA/ifcvf: support userspace to query features and MQ of a management device
Adapting to current netlink interfaces, this commit allows userspace
to query feature bits and MQ capability of a management device.
Currently both the vDPA device and the management device are the VF itself,
thus this ifcvf should initialize the virtio capabilities in probe() before
setting up the struct vdpa_mgmt_dev.
vDPA/ifcvf: get_config_size should return a value no greater than dev implementation
Drivers must not access a BAR outside the capability length,
and for a virtio device, ifcvf driver should not report any non-standard
capability contents to the upper layers.
Function ifcvf_get_config_size() is introduced here to return a safe value
of the device config capability size.
Mike Christie [Fri, 8 Jul 2022 03:05:25 +0000 (22:05 -0500)]
vhost scsi: Allow user to control num virtqueues
We are currently hard coded to always create 128 IO virtqueues, so this
adds a modparam to control it. For large systems where we are ok with
using memory for virtqueues it allows us to add up to 1024. This limit
was just selected becuase that's qemu's limit.
Mike Christie [Fri, 8 Jul 2022 03:05:24 +0000 (22:05 -0500)]
vhost-scsi: Fix max number of virtqueues
Qemu takes it's num_queues limit then adds the fixed queues (control and
event) to the total it will request from the kernel. So when a user
requests 128 (or qemu does it's num_queues calculation based on vCPUS
and other system limits), we hit errors due to userspace trying to setup
130 queues when vhost-scsi has a hard coded limit of 128.
This has vhost-scsi adjust it's max so we can do a total of 130 virtqueues
(128 IO and 2 fixed). For the case where the user has 128 vCPUs the guest
OS can then nicely map each IO virtqueue to a vCPU and not have the odd case
where 2 vCPUs share a virtqueue.
Eli Cohen [Thu, 14 Jul 2022 11:39:26 +0000 (14:39 +0300)]
vdpa/mlx5: Implement susupend virtqueue callback
Implement the suspend callback allowing to suspend the virtqueues so
they stop processing descriptors. This is required to allow to query a
consistent state of the virtqueue while live migration is taking place.
Xie Yongji [Wed, 3 Aug 2022 04:55:21 +0000 (12:55 +0800)]
vduse: Support using userspace pages as bounce buffer
Introduce two APIs: vduse_domain_add_user_bounce_pages()
and vduse_domain_remove_user_bounce_pages() to support
adding and removing userspace pages for bounce buffers.
During adding and removing, the DMA data would be copied
from the kernel bounce pages to the userspace bounce pages
and back.
Xie Yongji [Wed, 3 Aug 2022 04:55:19 +0000 (12:55 +0800)]
vduse: Remove unnecessary spin lock protection
Now we use domain->iotlb_lock to protect two different
variables: domain->bounce_maps->bounce_page and
domain->iotlb. But for domain->bounce_maps->bounce_page,
we actually don't need any synchronization between
vduse_domain_get_bounce_page() and vduse_domain_free_bounce_pages()
since vduse_domain_get_bounce_page() will only be called in
page fault handler and vduse_domain_free_bounce_pages() will
be called during file release.
So let's remove the unnecessary spin lock protection in
vduse_domain_get_bounce_page(). Then the usage of
domain->iotlb_lock could be more clear: the lock will be
only used to protect the domain->iotlb.
New VirtIO network feature: VIRTIO_NET_F_NOTF_COAL.
Control a Virtio network device notifications coalescing parameters
using the control virtqueue.
A device that supports this fetature can receive
VIRTIO_NET_CTRL_NOTF_COAL control commands.
- VIRTIO_NET_CTRL_NOTF_COAL_TX_SET:
Ask the network device to change the following parameters:
- tx_usecs: Maximum number of usecs to delay a TX notification.
- tx_max_packets: Maximum number of packets to send before a
TX notification.
- VIRTIO_NET_CTRL_NOTF_COAL_RX_SET:
Ask the network device to change the following parameters:
- rx_usecs: Maximum number of usecs to delay a RX notification.
- rx_max_packets: Maximum number of packets to receive before a
RX notification.
Fix the build caused by the following changes:
- phys_addr_t is now defined in tools/include/linux/types.h
- dev_warn_once() is used in drivers/virtio/virtio_ring.c
- linux/uio.h included by vringh.h use INT_MAX defined in limits.h
vdpa_sim: use max_iotlb_entries as a limit in vhost_iotlb_init
Commit bda324fd037a ("vdpasim: control virtqueue support") changed
the allocation of iotlbs calling vhost_iotlb_init() for each address
space, instead of vhost_iotlb_alloc().
With this change we forgot to use the limit we had introduced with
the `max_iotlb_entries` module parameter.
vdpa_sim_blk: set number of address spaces and virtqueue groups
Commit bda324fd037a ("vdpasim: control virtqueue support") added two
new fields (nas, ngroups) to vdpasim_dev_attr, but we forgot to
initialize them for vdpa_sim_blk.
When creating a new vdpa_sim_blk device this causes the kernel
to panic in this way:
$ vdpa dev add mgmtdev vdpasim_blk name blk0
BUG: kernel NULL pointer dereference, address: 0000000000000030
...
RIP: 0010:vhost_iotlb_add_range_ctx+0x41/0x220 [vhost_iotlb]
...
Call Trace:
<TASK>
vhost_iotlb_add_range+0x11/0x800 [vhost_iotlb]
vdpasim_map_range+0x91/0xd0 [vdpa_sim]
vdpasim_alloc_coherent+0x56/0x90 [vdpa_sim]
...
This happens because vdpasim->iommu[0] is not initialized when
dev_attr.nas is 0.
Let's fix this issue by initializing both (nas, ngroups) to 1 for
vdpa_sim_blk.
vdpa_sim_blk: call vringh_complete_iotlb() also in the error path
Call vringh_complete_iotlb() even when we encounter a serious error
that prevents us from writing the status in the "in" header
(e.g. the header length is incorrect, etc.).
The guest is misbehaving, so maybe the ring is in a bad state, but
let's avoid making things worse.
Use dev_dbg() instead of dev_err()/dev_warn() to avoid flooding the
host with prints, when the guest driver is misbehaving.
In this way, prints can be dynamically enabled when the vDPA block
simulator is used to validate a driver.
Xuan Zhuo [Mon, 1 Aug 2022 06:39:01 +0000 (14:39 +0800)]
virtio_net: support tx queue resize
This patch implements the resize function of the tx queues.
Based on this function, it is possible to modify the ring num of the
queue.
Inludes fixup:
virtio_net: fix for stuck when change tx ring size with dev down
When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.
And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.
Xuan Zhuo [Mon, 1 Aug 2022 06:39:00 +0000 (14:39 +0800)]
virtio_net: support rx queue resize
This patch implements the resize function of the rx queues.
Based on this function, it is possible to modify the ring num of the
queue.
Includes fixup:
virtio_net: fix for stuck when change rx ring size with dev down
When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.
And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:53 +0000 (14:38 +0800)]
virtio: find_vqs() add arg sizes
find_vqs() adds a new parameter sizes to specify the size of each vq
vring.
NULL as sizes means that all queues in find_vqs() use the maximum size.
A value in the array is 0, which means that the corresponding queue uses
the maximum size.
In the split scenario, the meaning of size is the largest size, because
it may be limited by memory, the virtio core will try a smaller size.
And the size is power of 2.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:52 +0000 (14:38 +0800)]
virtio_pci: support VIRTIO_F_RING_RESET
This patch implements virtio pci support for QUEUE RESET.
Performing reset on a queue is divided into these steps:
1. notify the device to reset the queue
2. recycle the buffer submitted
3. reset the vring (may re-alloc)
4. mmap vring to device, and enable the queue
This patch implements virtio_reset_vq(), virtio_enable_resetq() in the
pci scenario.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:44 +0000 (14:38 +0800)]
virtio_ring: introduce virtqueue_resize()
Introduce virtqueue_resize() to implement the resize of vring.
Based on these, the driver can dynamically adjust the size of the vring.
For example: ethtool -G.
virtqueue_resize() implements resize based on the vq reset function. In
case of failure to allocate a new vring, it will give up resize and use
the original vring.
During this process, if the re-enable reset vq fails, the vq can no
longer be used. Although the probability of this situation is not high.
The parameter recycle is used to recycle the buffer that is no longer
used.
Only after the new vring is successfully allocated based on the new num,
we will release the old vring. In any case, an error is returned,
indicating that the vring still points to the old vring.
In the case of an error, re-initialize(by virtqueue_reinit_packed()) the
virtqueue to ensure that the vring can be used.
Only after the new vring is successfully allocated based on the new num,
we will release the old vring. In any case, an error is returned,
indicating that the vring still points to the old vring.
In the case of an error, re-initialize(virtqueue_reinit_split()) the
virtqueue to ensure that the vring can be used.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:33 +0000 (14:38 +0800)]
virtio_ring: split: extract the logic of attach vring
Separate the logic of attach vring, subsequent patches will call it
separately.
virtqueue_vring_init_split() completes the initialization of other
variables of vring split. We can directly use
vq->split = *vring_split to complete attach.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:27 +0000 (14:38 +0800)]
virtio_ring: split: stop __vring_new_virtqueue as export symbol
There is currently only one place to reference __vring_new_virtqueue()
directly from the outside of virtio core. And here vring_new_virtqueue()
can be used instead.
Subsequent patches will modify __vring_new_virtqueue, so stop it as an
export symbol for now.
Xuan Zhuo [Mon, 1 Aug 2022 06:38:22 +0000 (14:38 +0800)]
virtio: struct virtio_config_ops add callbacks for queue_reset
reset can be divided into the following four steps (example):
1. transport: notify the device to reset the queue
2. vring: recycle the buffer submitted
3. vring: reset/resize the vring (may re-alloc)
4. transport: mmap vring to device, and enable the queue
In order to support queue reset, add two callbacks in struct
virtio_config_ops to implement steps 1 and 4.
Some systems want to set the interrupt of virtio_mmio device
as a wakeup source. On such systems, we'll use the existence
of the "wakeup-source" property as a signal of requirement.
This option doesn't really work and breaks too many drivers.
Not yet sure what's the right thing to do, for now
let's make sure randconfig isn't broken by this.
Fixes: c346dae4f3fb ("virtio: disable notification hardening by default") Cc: "Jason Wang" <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]>
Jason Wang [Tue, 28 Jun 2022 08:34:30 +0000 (16:34 +0800)]
virtio_pmem: set device ready in probe()
The NVDIMM region could be available before the virtio_device_ready()
that is called by virtio_dev_probe(). This means the driver tries to
use device before DRIVER_OK which violates the spec, fixing this by
set device ready before the nvdimm_pmem_region_create().
Note that this means the virtio_pmem_host_ack() could be triggered
before the creation of the nd region, this is safe since the pmem_lock
has been initialized and whether or not any available buffer is added
before is validated by virtio_pmem_host_ack().
Jason Wang [Tue, 28 Jun 2022 08:34:29 +0000 (16:34 +0800)]
virtio_pmem: initialize provider_data through nd_region_desc
We used to initialize the provider_data manually after
nvdimm_pemm_region_create(). This seems to be racy if the flush is
issued before the initialization of provider_data[1]. Fixing this by
initializing the provider_data through nd_region_desc to make sure the
provider_data is ready after the pmem is created.
vringh: iterate on iotlb_translate to handle large translations
iotlb_translate() can return -ENOBUFS if the bio_vec is not big enough
to contain all the ranges for translation.
This can happen for example if the VMM maps a large bounce buffer,
without using hugepages, that requires more than 16 ranges to translate
the addresses.
To handle this case, let's extend iotlb_translate() to also return the
number of bytes successfully translated.
In copy_from_iotlb()/copy_to_iotlb() loops by calling iotlb_translate()
several times until we complete the translation.
Xuan Zhuo [Fri, 24 Jun 2022 02:55:41 +0000 (10:55 +0800)]
remoteproc: rename len of rpoc_vring to num
Rename the member len in the structure rpoc_vring to num. And remove 'in
bytes' from the comment of it. This is misleading. Because this actually
refers to the size of the virtio vring to be created. The unit is not
bytes.
Shut up this warning:
kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.
Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.
Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.
The transition to RCU exposes existing issues, fixed by this commit:
1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.
2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.
Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload") Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
Jakub Kicinski [Tue, 9 Aug 2022 17:55:44 +0000 (10:55 -0700)]
tls: rx: device: don't try to copy too much on detach
Another device offload bug, we use the length of the output
skb as an indication of how much data to copy. But that skb
is sized to offset + record length, and we start from offset.
So we end up double-counting the offset which leads to
skb_copy_bits() returning -EFAULT.
Jakub Kicinski [Tue, 9 Aug 2022 17:55:43 +0000 (10:55 -0700)]
tls: rx: device: bound the frag walk
We can't do skb_walk_frags() on the input skbs, because
the input skbs is really just a pointer to the tcp read
queue. We need to bound the "is decrypted" check by the
amount of data in the message.
Note that the walk in tls_device_reencrypt() is after a
CoW so the skb there is safe to walk. Actually in the
current implementation it can't have frags at all, but
whatever, maybe one day it will.
net_sched: cls_route: remove from list when handle is 0
When a route filter is replaced and the old filter has a 0 handle, the old
one won't be removed from the hashtable, while it will still be freed.
The test was there since before commit 1109c00547fc ("net: sched: RCU
cls_route"), when a new filter was not allocated when there was an old one.
The old filter was reused and the reinserting would only be necessary if an
old filter was replaced. That was still wrong for the same case where the
old handle was 0.
Remove the old filter from the list independently from its handle value.
This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.
Mohan Kumar [Thu, 11 Aug 2022 05:27:04 +0000 (10:57 +0530)]
ALSA: hda: Fix crash due to jack poll in suspend
With jackpoll_in_suspend flag set, there is a possibility that
jack poll worker thread will run even after system suspend was
completed. Any register access after system pm callback flow
will result in kernel crash as still jack poll worker thread
tries to access registers.
To fix the crash issue during system flow, cancel the jack poll
worker thread during system pm prepare callback and cancel the
worker thread at start of runtime suspend callback and re-schedule
at last to avoid any unwarranted access of register by worker thread
during suspend flow.
Allen Ballway [Wed, 10 Aug 2022 15:27:22 +0000 (15:27 +0000)]
ALSA: hda/cirrus - support for iMac 12,1 model
The 12,1 model requires the same configuration as the 12,2 model
to enable headphones but has a different codec SSID. Adds
12,1 SSID for matching quirk.
Ido Schimmel [Tue, 9 Aug 2022 11:33:20 +0000 (14:33 +0300)]
selftests: forwarding: Fix failing tests with old libnet
The custom multipath hash tests use mausezahn in order to test how
changes in various packet fields affect the packet distribution across
the available nexthops.
The tool uses the libnet library for various low-level packet
construction and injection. The library started using the
"SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and
for IPv4 sockets in version 1.2.
When the option is not set, packets are not routed according to the
table associated with the VRF master device and tests fail.
Fix this by prefixing the command with "ip vrf exec", which will cause
the route lookup to occur in the VRF routing table. This makes the tests
pass regardless of the libnet library version.
Fixes: 511e8db54036 ("selftests: forwarding: Add test for custom multipath hash") Fixes: 185b0c190bb6 ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE") Fixes: b7715acba4d3 ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE") Reported-by: Ivan Vecera <[email protected]> Tested-by: Ivan Vecera <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).
The main changes are:
1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.
2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.
3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.
4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.
5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.
6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.
7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.
8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.
9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
offset buffer, from Aijun Sun.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
selftests/bpf: Ensure sleepable program is rejected by hash map iter
selftests/bpf: Add write tests for sk local storage map iterator
selftests/bpf: Add tests for reading a dangling map iter fd
bpf: Only allow sleepable program for resched-able iterator
bpf: Check the validity of max_rdwr_access for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for hash map iterator
bpf: Acquire map uref in .init_seq_private for array map iterator
bpf: Disallow bpf programs call prog_run command.
bpf, arm64: Fix bpf trampoline instruction endianness
selftests/bpf: Add test for prealloc_lru_pop bug
bpf: Don't reinit map value in prealloc_lru_pop
bpf: Allow calling bpf_prog_test kfuncs in tracing programs
bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
bpf: Use proper target btf when exporting attach_btf_obj_id
mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
bpf: Cleanup ftrace hash in bpf_trampoline_put
BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
...
====================