Git Repo - linux.git/log

vdpa_sim_blk: add support for discard and write-zeroes

Expose VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_WRITE_ZEROES features
to the drivers and handle VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES requests checking ranges and flags.

The simulator behaves like a ramdisk, so for VIRTIO_BLK_F_DISCARD
does nothing, while for VIRTIO_BLK_T_WRITE_ZEROES sets to 0 the
specified region.

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220811083632 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim_blk: add support for VIRTIO_BLK_T_FLUSH

The simulator behaves like a ramdisk, so we don't have to do
anything when a VIRTIO_BLK_T_FLUSH request is received, but it
could be useful to test driver behavior.

Let's expose the VIRTIO_BLK_F_FLUSH feature to inform the driver
that we support the flush command.

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220811083632 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim_blk: make vdpasim_blk_check_range usable by other requests

Next patches will add handling of other requests, where will be
useful to reuse vdpasim_blk_check_range().
So let's make it more generic by adding the `max_sectors` parameter,
since different requests allow different numbers of maximum sectors.

Let's also print the messages directly in vdpasim_blk_check_range()
to avoid duplicate prints.

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220811083632 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim_blk: check if sector is 0 for commands other than read or write

VIRTIO spec states: "The sector number indicates the offset
(multiplied by 512) where the read or write is to occur. This field is
unused and set to 0 for commands other than read or write."

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220811083632 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim: Implement suspend vdpa op

Implement suspend operation for vdpa_sim devices, so vhost-vdpa will
offer that backend feature and userspace can effectively suspend the
device.

This is a must before get virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them. There are
individual ways to perform that action for some devices
(VHOST_NET_SET_BACKEND, VHOST_VSOCK_SET_RUNNING, ...) but there was no
way to perform it for any vhost device (and, in particular, vhost-vdpa).

Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Eugenio Pérez <[email protected]>
Message-Id: <20220810171512.2343333 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost-vdpa: uAPI to suspend the device

The ioctl adds support for suspending the device from userspace.

This is a must before getting virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them. There are
individual ways to perform that action for some devices
(VHOST_NET_SET_BACKEND, VHOST_VSOCK_SET_RUNNING, ...) but there was no
way to perform it for any vhost device (and, in particular, vhost-vdpa).

After a successful return of the ioctl call the device must not process
more virtqueue descriptors. The device can answer to read or writes of
config fields as if it were not suspended. In particular, writing to
"queue_enable" with a value of 1 will not make the device start
processing buffers of the virtqueue.

Signed-off-by: Eugenio Pérez <[email protected]>
Message-Id: <20220810171512.2343333 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost-vdpa: introduce SUSPEND backend feature bit

Userland knows if it can suspend the device or not by checking this feature
bit.

It's only offered if the vdpa driver backend implements the suspend()
operation callback, and to offer it or userland to ack it if the backend
does not offer that callback is an error.

Signed-off-by: Eugenio Pérez <[email protected]>
Message-Id: <20220810171512.2343333 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa: Add suspend operation

This operation is optional: It it's not implemented, backend feature bit
will not be exposed.

Signed-off-by: Eugenio Pérez <[email protected]>
Message-Id: <20220810171512.2343333 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio-blk: Avoid use-after-free on suspend/resume

hctx->user_data is set to vq in virtblk_init_hctx().  However, vq is
freed on suspend and reallocated on resume.  So, hctx->user_data is
invalid after resume, and it will cause use-after-free accessing which
will result in the kernel crash something like below:

[   22.428391] Call Trace:
[   22.428899]  <TASK>
[   22.429339]  virtqueue_add_split+0x3eb/0x620
[   22.430035]  ? __blk_mq_alloc_requests+0x17f/0x2d0
[   22.430789]  ? kvm_clock_get_cycles+0x14/0x30
[   22.431496]  virtqueue_add_sgs+0xad/0xd0
[   22.432108]  virtblk_add_req+0xe8/0x150
[   22.432692]  virtio_queue_rqs+0xeb/0x210
[   22.433330]  blk_mq_flush_plug_list+0x1b8/0x280
[   22.434059]  __blk_flush_plug+0xe1/0x140
[   22.434853]  blk_finish_plug+0x20/0x40
[   22.435512]  read_pages+0x20a/0x2e0
[   22.436063]  ? folio_add_lru+0x62/0xa0
[   22.436652]  page_cache_ra_unbounded+0x112/0x160
[   22.437365]  filemap_get_pages+0xe1/0x5b0
[   22.437964]  ? context_to_sid+0x70/0x100
[   22.438580]  ? sidtab_context_to_sid+0x32/0x400
[   22.439979]  filemap_read+0xcd/0x3d0
[   22.440917]  xfs_file_buffered_read+0x4a/0xc0
[   22.441984]  xfs_file_read_iter+0x65/0xd0
[   22.442970]  __kernel_read+0x160/0x2e0
[   22.443921]  bprm_execve+0x21b/0x640
[   22.444809]  do_execveat_common.isra.0+0x1a8/0x220
[   22.446008]  __x64_sys_execve+0x2d/0x40
[   22.446920]  do_syscall_64+0x37/0x90
[   22.447773]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

This patch fixes this issue by getting vq from vblk, and removes
virtblk_init_hctx().

Fixes: 4e0400525691 ("virtio-blk: support polling I/O")
Cc: "Suwan Kim" <[email protected]>
Signed-off-by: Shigeru Yoshida <[email protected]>
Message-Id: <20220810160948 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_vdpa: support the arg sizes of find_vqs()

Virtio vdpa support the new parameter sizes of find_vqs().

Signed-off-by: Bo Liu <[email protected]>
Message-Id: <20220810085151 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost-vdpa: Call ida_simple_remove() when failed

In function vhost_vdpa_probe(), when code execution fails, we should
call ida_simple_remove() to free ida.

Signed-off-by: Bo Liu <[email protected]>
Message-Id: <20220805091254 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: fix 'cast to restricted le16' warnings in vdpa.c

This commit fixes spars warnings: cast to restricted __le16
in function vdpa_dev_net_config_fill() and
vdpa_fill_stats_rec()

Signed-off-by: Zhu Lingshan <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Message-Id: <20220722115309 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: !FEATURES_OK should not block querying device config space

Users may want to query the config space of a vDPA device,
to choose a appropriate one for a certain guest. This means the
users need to read the config space before FEATURES_OK, and
the existence of config space contents does not depend on
FEATURES_OK.

The spec says:
The device MUST allow reading of any device-specific configuration
field before FEATURES_OK is set by the driver. This includes
fields which are conditional on feature bits, as long as those
feature bits are offered by the device.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20220722115309 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA/ifcvf: support userspace to query features and MQ of a management device

Adapting to current netlink interfaces, this commit allows userspace
to query feature bits and MQ capability of a management device.

Currently both the vDPA device and the management device are the VF itself,
thus this ifcvf should initialize the virtio capabilities in probe() before
setting up the struct vdpa_mgmt_dev.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20220722115309 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA/ifcvf: get_config_size should return a value no greater than dev implementation

Drivers must not access a BAR outside the capability length,
and for a virtio device, ifcvf driver should not report any non-standard
capability contents to the upper layers.

Function ifcvf_get_config_size() is introduced here to return a safe value
of the device config capability size.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20220722115309 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost scsi: Allow user to control num virtqueues

We are currently hard coded to always create 128 IO virtqueues, so this
adds a modparam to control it. For large systems where we are ok with
using memory for virtqueues it allows us to add up to 1024. This limit
was just selected becuase that's qemu's limit.

Signed-off-by: Mike Christie <[email protected]>
Message-Id: <20220708030525 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vhost-scsi: Fix max number of virtqueues

Qemu takes it's num_queues limit then adds the fixed queues (control and
event) to the total it will request from the kernel. So when a user
requests 128 (or qemu does it's num_queues calculation based on vCPUS
and other system limits), we hit errors due to userspace trying to setup
130 queues when vhost-scsi has a hard coded limit of 128.

This has vhost-scsi adjust it's max so we can do a total of 130 virtqueues
(128 IO and 2 fixed). For the case where the user has 128 vCPUs the guest
OS can then nicely map each IO virtqueue to a vCPU and not have the odd case
where 2 vCPUs share a virtqueue.

Signed-off-by: Mike Christie <[email protected]>
Message-Id: <20220708030525 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa/mlx5: Support different address spaces for control and data

Partition virtqueues to two different address spaces: one for control
virtqueue which is implemented in software, and one for data virtqueues.

Based-on: <20220526124338 [email protected]>
Signed-off-by: Eli Cohen <[email protected]>
Message-Id: <20220714113927 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa/mlx5: Implement susupend virtqueue callback

Implement the suspend callback allowing to suspend the virtqueues so
they stop processing descriptors. This is required to allow to query a
consistent state of the virtqueue while live migration is taking place.

Signed-off-by: Eli Cohen <[email protected]>
Message-Id: <20220714113927 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: Support querying information of IOVA regions

This introduces a new ioctl: VDUSE_IOTLB_GET_INFO to
support querying some information of IOVA regions.

Now it can be used to query whether the IOVA region
supports userspace memory registration.

Signed-off-by: Xie Yongji <[email protected]>
Message-Id: <20220803045523 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vduse: Support registering userspace memory for IOVA regions

Introduce two ioctls: VDUSE_IOTLB_REG_UMEM and
VDUSE_IOTLB_DEREG_UMEM to support registering
and de-registering userspace memory for IOVA
regions.

Now it only supports registering userspace memory
for bounce buffer region in virtio-vdpa case.

Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220803045523 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: Support using userspace pages as bounce buffer

Introduce two APIs: vduse_domain_add_user_bounce_pages()
and vduse_domain_remove_user_bounce_pages() to support
adding and removing userspace pages for bounce buffers.
During adding and removing, the DMA data would be copied
from the kernel bounce pages to the userspace bounce pages
and back.

Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220803045523 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: Use memcpy_{to,from}_page() in do_bounce()

kmap_atomic() is being deprecated in favor of kmap_local_page().

The use of kmap_atomic() in do_bounce() is all thread local therefore
kmap_local_page() is a sufficient replacement.

Convert to kmap_local_page() but, instead of open coding it,
use the helpers memcpy_to_page() and memcpy_from_page().

Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Message-Id: <20220803045523 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: Remove unnecessary spin lock protection

Now we use domain->iotlb_lock to protect two different
variables: domain->bounce_maps->bounce_page and
domain->iotlb. But for domain->bounce_maps->bounce_page,
we actually don't need any synchronization between
vduse_domain_get_bounce_page() and vduse_domain_free_bounce_pages()
since vduse_domain_get_bounce_page() will only be called in
page fault handler and vduse_domain_free_bounce_pages() will
be called during file release.

So let's remove the unnecessary spin lock protection in
vduse_domain_get_bounce_page(). Then the usage of
domain->iotlb_lock could be more clear: the lock will be
only used to protect the domain->iotlb.

Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220803045523 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

net: virtio_net: notifications coalescing support

New VirtIO network feature: VIRTIO_NET_F_NOTF_COAL.

Control a Virtio network device notifications coalescing parameters
using the control virtqueue.

A device that supports this fetature can receive
VIRTIO_NET_CTRL_NOTF_COAL control commands.

- VIRTIO_NET_CTRL_NOTF_COAL_TX_SET:
  Ask the network device to change the following parameters:
  - tx_usecs: Maximum number of usecs to delay a TX notification.
  - tx_max_packets: Maximum number of packets to send before a
    TX notification.

- VIRTIO_NET_CTRL_NOTF_COAL_RX_SET:
  Ask the network device to change the following parameters:
  - rx_usecs: Maximum number of usecs to delay a RX notification.
  - rx_max_packets: Maximum number of packets to receive before a
    RX notification.

VirtIO spec. patch:
https://lists.oasis-open.org/archives/virtio-comment/202206/msg00100.html

Signed-off-by: Alvaro Karsz <[email protected]>
Message-Id: <20220718091102 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Acked-by: Jason Wang <[email protected]>

virtio: Check dev_set_name() return value

It's possible that dev_set_name() returns -ENOMEM, catch and handle this.

Signed-off-by: Bo Liu <[email protected]>
Message-Id: <20220707031751 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

tools/virtio: fix build

Fix the build caused by the following changes:
- phys_addr_t is now defined in tools/include/linux/types.h
- dev_warn_once() is used in drivers/virtio/virtio_ring.c
- linux/uio.h included by vringh.h use INT_MAX defined in limits.h

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220705072249 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Signed-off-by: Peng Fan <[email protected]>
Acked-by: Jason Wang <[email protected]>

vDPA/ifcvf: remove duplicated assignment to pointer cfg

The assignment to pointer cfg is duplicated, the second assignment
is redundant and can be removed.

Signed-off-by: Colin Ian King <[email protected]>
Message-Id: <20220704190456 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa: ifcvf: Fix spelling mistake in comments

There is a typo(does't) in comments.
It maybe 'doesn't' instead of 'does't'.

Signed-off-by: Zhang Jiaming <[email protected]>
Message-Id: <20220704024104 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa/mlx5: Use eth_broadcast_addr() to assign broadcast address

Using eth_broadcast_addr() to assign broadcast address instead
of memset().

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Xu Qiang <[email protected]>
Message-Id: <20220704021405 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa_sim: use max_iotlb_entries as a limit in vhost_iotlb_init

Commit bda324fd037a ("vdpasim: control virtqueue support") changed
the allocation of iotlbs calling vhost_iotlb_init() for each address
space, instead of vhost_iotlb_alloc().

With this change we forgot to use the limit we had introduced with
the `max_iotlb_entries` module parameter.

Fixes: bda324fd037a ("vdpasim: control virtqueue support")
Cc: [email protected]
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220621151208 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>

vdpa_sim_blk: set number of address spaces and virtqueue groups

Commit bda324fd037a ("vdpasim: control virtqueue support") added two
new fields (nas, ngroups) to vdpasim_dev_attr, but we forgot to
initialize them for vdpa_sim_blk.

When creating a new vdpa_sim_blk device this causes the kernel
to panic in this way:
    $ vdpa dev add mgmtdev vdpasim_blk name blk0
    BUG: kernel NULL pointer dereference, address: 0000000000000030
    ...
    RIP: 0010:vhost_iotlb_add_range_ctx+0x41/0x220 [vhost_iotlb]
    ...
    Call Trace:
     <TASK>
     vhost_iotlb_add_range+0x11/0x800 [vhost_iotlb]
     vdpasim_map_range+0x91/0xd0 [vdpa_sim]
     vdpasim_alloc_coherent+0x56/0x90 [vdpa_sim]
     ...

This happens because vdpasim->iommu[0] is not initialized when
dev_attr.nas is 0.

Let's fix this issue by initializing both (nas, ngroups) to 1 for
vdpa_sim_blk.

Fixes: bda324fd037a ("vdpasim: control virtqueue support")
Cc: [email protected]
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220621151323 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>

vdpa_sim_blk: call vringh_complete_iotlb() also in the error path

Call vringh_complete_iotlb() even when we encounter a serious error
that prevents us from writing the status in the "in" header
(e.g. the header length is incorrect, etc.).

The guest is misbehaving, so maybe the ring is in a bad state, but
let's avoid making things worse.

Acked-by: Jason Wang <[email protected]>
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220630153221 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim_blk: limit the number of request handled per batch

Limit the number of requests (4 per queue as for vdpa_sim_net) handled
in a batch to prevent the worker from using the CPU for too long.

Suggested-by: Eugenio Pérez <[email protected]>
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220630153221 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa_sim_blk: use dev_dbg() to print errors

Use dev_dbg() instead of dev_err()/dev_warn() to avoid flooding the
host with prints, when the guest driver is misbehaving.
In this way, prints can be dynamically enabled when the vDPA block
simulator is used to validate a driver.

Suggested-by: Jason Wang <[email protected]>
Acked-by: Jason Wang <[email protected]>
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220630153221 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: support set_ringparam

Support set_ringparam based on virtio queue reset.

Users can use ethtool -G eth0 <ring_num> to modify the ring size of
virtio-net.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: support tx queue resize

This patch implements the resize function of the tx queues.
Based on this function, it is possible to modify the ring num of the
queue.

Inludes fixup:

virtio_net: fix for stuck when change tx ring size with dev down

When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.

And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.

Message-Id: <20220801063902 [email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220811080258 [email protected]>
Reported-by: Kangjie Xu <[email protected]>
Signed-off-by: Xuan Zhuo <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: support rx queue resize

This patch implements the resize function of the rx queues.
Based on this function, it is possible to modify the ring num of the
queue.

Includes fixup:

virtio_net: fix for stuck when change rx ring size with dev down

When dev is set to DOWN state, napi has been disabled, if we modify the
ring size at this time, we should not call napi_disable() again, which
will cause stuck.

And all operations are under the protection of rtnl_lock, so there is no
need to consider concurrency issues.

Message-Id: <20220801063902 [email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220811080258 [email protected]>
Reported-by: Kangjie Xu <[email protected]>
Signed-off-by: Xuan Zhuo <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: split free_unused_bufs()

This patch separates two functions for freeing sq buf and rq buf from
free_unused_bufs().

When supporting the enable/disable tx/rq queue in the future, it is
necessary to support separate recovery of a sq buf or a rq buf.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: get ringparam by virtqueue_get_vring_max_size()

Use virtqueue_get_vring_max_size() in virtnet_get_ringparam() to set
tx,rx_max_pending.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: set the default max ring size by find_vqs()

Use virtio_find_vqs_ctx_size() to specify the maximum ring size of tx,
rx at the same time.

                         | rx/tx ring size
-------------------------------------------
speed == UNKNOWN or < 10G| 1024
speed < 40G              | 4096
speed >= 40G             | 8192

Call virtnet_update_settings() once before calling init_vqs() to update
speed.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: add helper virtio_find_vqs_ctx_size()

Introduce helper virtio_find_vqs_ctx_size() to call find_vqs and specify
the maximum size of each vq ring.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_mmio: support the arg sizes of find_vqs()

Virtio MMIO support the new parameter sizes of find_vqs().

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: support the arg sizes of find_vqs()

Virtio PCI supports new parameter sizes of find_vqs().

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: find_vqs() add arg sizes

find_vqs() adds a new parameter sizes to specify the size of each vq
vring.

NULL as sizes means that all queues in find_vqs() use the maximum size.
A value in the array is 0, which means that the corresponding queue uses
the maximum size.

In the split scenario, the meaning of size is the largest size, because
it may be limited by memory, the virtio core will try a smaller size.
And the size is power of 2.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Hans de Goede <[email protected]>
Reviewed-by: Mathieu Poirier <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: support VIRTIO_F_RING_RESET

This patch implements virtio pci support for QUEUE RESET.

Performing reset on a queue is divided into these steps:

1. notify the device to reset the queue
2. recycle the buffer submitted
3. reset the vring (may re-alloc)
4. mmap vring to device, and enable the queue

This patch implements virtio_reset_vq(), virtio_enable_resetq() in the
pci scenario.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: extract the logic of active vq for modern pci

Introduce vp_active_vq() to configure vring to backend after vq attach
vring. And configure vq vector if necessary.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: introduce helper to get/set queue reset

Introduce new helpers to implement queue reset and get queue reset
status.

https://github.com/oasis-tcs/virtio-spec/issues/124
https://github.com/oasis-tcs/virtio-spec/issues/139

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: struct virtio_pci_common_cfg add queue_reset

Add queue_reset in virtio_pci_modern_common_cfg.

https://github.com/oasis-tcs/virtio-spec/issues/124
https://github.com/oasis-tcs/virtio-spec/issues/139

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: struct virtqueue introduce reset

Introduce a new member reset to the structure virtqueue to determine
whether the current vq is in the reset state. Subsequent patches will
use it.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: queue_reset: add VIRTIO_F_RING_RESET

Added VIRTIO_F_RING_RESET, it came from here

https://github.com/oasis-tcs/virtio-spec/issues/124
https://github.com/oasis-tcs/virtio-spec/issues/139

This feature indicates that the driver can reset a queue individually.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: allow to unbreak/break virtqueue individually

This patch allows the new introduced
__virtqueue_break()/__virtqueue_unbreak() to break/unbreak the
virtqueue.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

Add queue_notify_data in struct virtio_pci_common_cfg, which comes from
here https://github.com/oasis-tcs/virtio-spec/issues/89

In order not to affect the API, add a dedicated structure struct
virtio_pci_modern_common_cfg to virtio_pci_modern.h.

Since I want to add queue_reset after queue_notify_data, I submitted
this patch first.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce virtqueue_resize()

Introduce virtqueue_resize() to implement the resize of vring.
Based on these, the driver can dynamically adjust the size of the vring.
For example: ethtool -G.

virtqueue_resize() implements resize based on the vq reset function. In
case of failure to allocate a new vring, it will give up resize and use
the original vring.

During this process, if the re-enable reset vq fails, the vq can no
longer be used. Although the probability of this situation is not high.

The parameter recycle is used to recycle the buffer that is no longer
used.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: introduce virtqueue_resize_packed()

virtio ring packed supports resize.

Only after the new vring is successfully allocated based on the new num,
we will release the old vring. In any case, an error is returned,
indicating that the vring still points to the old vring.

In the case of an error, re-initialize(by virtqueue_reinit_packed()) the
virtqueue to ensure that the vring can be used.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: introduce virtqueue_reinit_packed()

Introduce a function to initialize vq without allocating new ring,
desc_state, desc_extra.

Subsequent patches will call this function after reset vq to
reinitialize vq.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: extract the logic of attach vring

Separate the logic of attach vring, the subsequent patch will call it
separately.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: extract the logic of vring init

Separate the logic of initializing vring, and subsequent patches will
call it separately.

This function completes the variable initialization of packed vring. It
together with the logic of atatch constitutes the initialization of
vring.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: extract the logic of alloc state and extra

Separate the logic for alloc desc_state and desc_extra, which will
be called separately by subsequent patches.

Use struct vring_packed to pass desc_state, desc_extra.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: extract the logic of alloc queue

Separate the logic of packed to create vring queue.

This feature is required for subsequent virtuqueue reset vring.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: packed: introduce vring_free_packed

Free the structure struct vring_vritqueue_packed.

Subsequent patches require it.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: introduce virtqueue_resize_split()

virtio ring split supports resize.

Only after the new vring is successfully allocated based on the new num,
we will release the old vring. In any case, an error is returned,
indicating that the vring still points to the old vring.

In the case of an error, re-initialize(virtqueue_reinit_split()) the
virtqueue to ensure that the vring can be used.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: reserve vring_align, may_reduce_num

In vring_alloc_queue_split() save vring_align, may_reduce_num to
structure vring_virtqueue_split. Used to create a new vring when
implementing resize.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: introduce virtqueue_reinit_split()

Introduce a function to initialize vq without allocating new ring,
desc_state, desc_extra.

Subsequent patches will call this function after reset vq to
reinitialize vq.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: extract the logic of attach vring

Separate the logic of attach vring, subsequent patches will call it
separately.

virtqueue_vring_init_split() completes the initialization of other
variables of vring split. We can directly use
vq->split = *vring_split to complete attach.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: extract the logic of vring init

Separate the logic of initializing vring, and subsequent patches will
call it separately.

This function completes the variable initialization of split vring. It
together with the logic of atatch constitutes the initialization of
vring.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: extract the logic of alloc state and extra

Separate the logic of creating desc_state, desc_extra, and subsequent
patches will call it independently.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: extract the logic of alloc queue

Separate the logic of split to create vring queue.

This feature is required for subsequent virtuqueue reset vring.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: introduce vring_free_split()

Free the structure struct vring_vritqueue_split.

Subsequent patches require it.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: __vring_new_virtqueue() accept struct vring_virtqueue_split

__vring_new_virtqueue() instead accepts struct vring_virtqueue_split.

The purpose of this is to pass more information into
__vring_new_virtqueue() to make the code simpler and the structure
cleaner.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split: stop __vring_new_virtqueue as export symbol

There is currently only one place to reference __vring_new_virtqueue()
directly from the outside of virtio core. And here vring_new_virtqueue()
can be used instead.

Subsequent patches will modify __vring_new_virtqueue, so stop it as an
export symbol for now.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce virtqueue_init()

Separate the logic of virtqueue initialization. These variables should
be reset during reset.

This logic can be called independently when implementing resize/reset
later.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: split vring_virtqueue

Separate the two inline structures(split and packed) from the structure
vring_virtqueue.

In this way, we can use these two structures later to pass parameters
and retain temporary variables.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: extract the logic of freeing vring

Introduce vring_free() to free the vring of vq.

Subsequent patches will use vring_free() alone.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: update the document of the virtqueue_detach_unused_buf for queue reset

Added documentation for virtqueue_detach_unused_buf, allowing it to be
called on queue reset.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: struct virtio_config_ops add callbacks for queue_reset

reset can be divided into the following four steps (example):
1. transport: notify the device to reset the queue
2. vring: recycle the buffer submitted
3. vring: reset/resize the vring (may re-alloc)
4. transport: mmap vring to device, and enable the queue

In order to support queue reset, add two callbacks in struct
virtio_config_ops to implement steps 1 and 4.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: record the maximum queue num supported by the device.

virtio-net can display the maximum (supported by hardware) ring size in
ethtool -g eth0.

When the subsequent patch implements vring reset, it can judge whether
the ring size passed by the driver is legal based on this.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220801063902 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

drivers/virtio: Clarify CONFIG_VIRTIO_MEM for unsupported architectures

Let's make it clearer that simply unlocking CONFIG_VIRTIO_MEM on an
architecture is most probably not sufficient to have it working as
expected.

Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gavin Shan <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Message-Id: <20220610094737 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_mmio: add support to set IRQ of a virtio device as wakeup source

According to virtio_mmio wakeup flag in device trees, set its IRQ
as wakeup source in virtqueue initialization.

Signed-off-by: Minghao Xue <[email protected]>
Message-Id: <1654851507 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

dt-bindings: virtio: mmio: add optional wakeup-source property

Some systems want to set the interrupt of virtio_mmio device
as a wakeup source. On such systems, we'll use the existence
of the "wakeup-source" property as a signal of requirement.

Signed-off-by: Minghao Xue <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Message-Id: <1654851507 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa: Use device_iommu_capable()

Use the new interface to check the capability for our device
specifically.

Signed-off-by: Robin Murphy <[email protected]>
Message-Id: <548e316fa282ce513fabb991a4c4d92258062eb5.1654688822 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

virtio: VIRTIO_HARDEN_NOTIFICATION is broken

This option doesn't really work and breaks too many drivers.
Not yet sure what's the right thing to do, for now
let's make sure randconfig isn't broken by this.

Fixes: c346dae4f3fb ("virtio: disable notification hardening by default")
Cc: "Jason Wang" <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

virtio_pmem: set device ready in probe()

The NVDIMM region could be available before the virtio_device_ready()
that is called by virtio_dev_probe(). This means the driver tries to
use device before DRIVER_OK which violates the spec, fixing this by
set device ready before the nvdimm_pmem_region_create().

Note that this means the virtio_pmem_host_ack() could be triggered
before the creation of the nd region, this is safe since the pmem_lock
has been initialized and whether or not any available buffer is added
before is validated by virtio_pmem_host_ack().

Fixes 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Acked-by: Pankaj Gupta <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Message-Id: <20220628083430 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_pmem: initialize provider_data through nd_region_desc

We used to initialize the provider_data manually after
nvdimm_pemm_region_create(). This seems to be racy if the flush is
issued before the initialization of provider_data[1]. Fixing this by
initializing the provider_data through nd_region_desc to make sure the
provider_data is ready after the pmem is created.

[1]:

[   80.152281] nd_pmem namespace0.0: unable to guarantee persistence of writes
[   92.393956] BUG: kernel NULL pointer dereference, address: 0000000000000318
[   92.394551] #PF: supervisor read access in kernel mode
[   92.394955] #PF: error_code(0x0000) - not-present page
[   92.395365] PGD 0 P4D 0
[   92.395566] Oops: 0000 [#1] PREEMPT SMP PTI
[   92.395867] CPU: 2 PID: 506 Comm: mkfs.ext4 Not tainted 5.19.0-rc1+ #453
[   92.396365] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   92.397178] RIP: 0010:virtio_pmem_flush+0x2f/0x1f0
[   92.397521] Code: 55 41 54 55 53 48 81 ec a0 00 00 00 65 48 8b 04
25 28 00 00 00 48 89 84 24 98 00 00 00 31 c0 48 8b 87 78 03 00 00 48
89 04 24 <48> 8b 98 18 03 00 00 e8 85 bf 6b 00 ba 58 00 00 00 be c0 0c
00 00
[   92.398982] RSP: 0018:ffff9a7380aefc88 EFLAGS: 00010246
[   92.399349] RAX: 0000000000000000 RBX: ffff8e77c3f86f00 RCX: 0000000000000000
[   92.399833] RDX: ffffffffad4ea720 RSI: ffff8e77c41e39c0 RDI: ffff8e77c41c5c00
[   92.400388] RBP: ffff8e77c41e39c0 R08: ffff8e77c19f0600 R09: 0000000000000000
[   92.400874] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e77c0814e28
[   92.401364] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8e77c41e39c0
[   92.401849] FS:  00007f3cd75b2780(0000) GS:ffff8e7937d00000(0000)
knlGS:0000000000000000
[   92.402423] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   92.402821] CR2: 0000000000000318 CR3: 0000000103c80002 CR4: 0000000000370ee0
[   92.403307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   92.403793] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   92.404278] Call Trace:
[   92.404481]  <TASK>
[   92.404654]  ? mempool_alloc+0x5d/0x160
[   92.404939]  ? terminate_walk+0x5f/0xf0
[   92.405226]  ? bio_alloc_bioset+0xbb/0x3f0
[   92.405525]  async_pmem_flush+0x17/0x80
[   92.405806]  nvdimm_flush+0x11/0x30
[   92.406067]  pmem_submit_bio+0x1e9/0x200
[   92.406354]  __submit_bio+0x80/0x120
[   92.406621]  submit_bio_noacct_nocheck+0xdc/0x2a0
[   92.406958]  submit_bio_wait+0x4e/0x80
[   92.407234]  blkdev_issue_flush+0x31/0x50
[   92.407526]  ? punt_bios_to_rescuer+0x230/0x230
[   92.407852]  blkdev_fsync+0x1e/0x30
[   92.408112]  do_fsync+0x33/0x70
[   92.408354]  __x64_sys_fsync+0xb/0x10
[   92.408625]  do_syscall_64+0x43/0x90
[   92.408895]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   92.409257] RIP: 0033:0x7f3cd76c6c44

Fixes 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Acked-by: Pankaj Gupta <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Message-Id: <20220628083430 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vringh: iterate on iotlb_translate to handle large translations

iotlb_translate() can return -ENOBUFS if the bio_vec is not big enough
to contain all the ranges for translation.
This can happen for example if the VMM maps a large bounce buffer,
without using hugepages, that requires more than 16 ranges to translate
the addresses.

To handle this case, let's extend iotlb_translate() to also return the
number of bytes successfully translated.
In copy_from_iotlb()/copy_to_iotlb() loops by calling iotlb_translate()
several times until we complete the translation.

Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20220624075656 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: remove the arg vq of vring_alloc_desc_extra()

The parameter vq of vring_alloc_desc_extra() is useless. This patch
removes this parameter.

Subsequent patches will call this function to avoid passing useless
arguments.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20220624025621 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

remoteproc: rename len of rpoc_vring to num

Rename the member len in the structure rpoc_vring to num. And remove 'in
bytes' from the comment of it. This is misleading. Because this actually
refers to the size of the virtio vring to be created. The unit is not
bytes.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20220624025621 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

bpf: Shut up kern_sys_bpf warning.

Shut up this warning:
kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)

Reported-by: Jakub Kicinski <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table

There is unexpected warning on KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability
table, which cause the table to be rendered as paragraph text instead.

The warning is due to missing colon at capability name and returns keyword,
as well as improper alignment on multi-line returns field.

Fix the warning by adding missing colons and aligning the field.

Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 084cc29f8bbb03 ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Bagas Sanjaya <[email protected]>
Message-Id: <20220627095151 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline

Extend heading underline for KVM_CAP_VM_DISABLE_NX_HUGE_PAGE to match
the heading text length.

Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 084cc29f8bbb03 ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Bagas Sanjaya <[email protected]>
Message-Id: <20220627095151 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

net/tls: Use RCU API to access tls_ctx->netdev

Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.

Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.

Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.

The transition to RCU exposes existing issues, fixed by this commit:

1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.

2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.

Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tls: rx: device: don't try to copy too much on detach

Another device offload bug, we use the length of the output
skb as an indication of how much data to copy. But that skb
is sized to offset + record length, and we start from offset.
So we end up double-counting the offset which leads to
skb_copy_bits() returning -EFAULT.

Reported-by: Tariq Toukan <[email protected]>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tls: rx: device: bound the frag walk

We can't do skb_walk_frags() on the input skbs, because
the input skbs is really just a pointer to the tcp read
queue. We need to bound the "is decrypted" check by the
amount of data in the message.

Note that the walk in tls_device_reencrypt() is after a
CoW so the skb there is safe to walk. Actually in the
current implementation it can't have frags at all, but
whatever, maybe one day it will.

Reported-by: Tariq Toukan <[email protected]>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net_sched: cls_route: remove from list when handle is 0

When a route filter is replaced and the old filter has a 0 handle, the old
one won't be removed from the hashtable, while it will still be freed.

The test was there since before commit 1109c00547fc ("net: sched: RCU
cls_route"), when a new filter was not allocated when there was an old one.
The old filter was reused and the reinserting would only be necessary if an
old filter was replaced. That was still wrong for the same case where the
old handle was 0.

Remove the old filter from the list independently from its handle value.

This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.

Reported-by: Zhenpeng Lin <[email protected]>
Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]>
Reviewed-by: Kamal Mostafa <[email protected]>
Cc: <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ALSA: hda: Fix crash due to jack poll in suspend

With jackpoll_in_suspend flag set, there is a possibility that
jack poll worker thread will run even after system suspend was
completed. Any register access after system pm callback flow
will result in kernel crash as still jack poll worker thread
tries to access registers.

To fix the crash issue during system flow, cancel the jack poll
worker thread during system pm prepare callback and cancel the
worker thread at start of runtime suspend callback and re-schedule
at last to avoid any unwarranted access of register by worker thread
during suspend flow.

Signed-off-by: Mohan Kumar <[email protected]>
Fixes: b33115bd05af ("ALSA: hda: Jack detection poll in suspend state")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/cirrus - support for iMac 12,1 model

The 12,1 model requires the same configuration as the 12,2 model
to enable headphones but has a different codec SSID. Adds
12,1 SSID for matching quirk.

[ re-sorted in SSID order by tiwai ]

Signed-off-by: Allen Ballway <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/20220810152701.1.I902c2e591bbf8de9acb649d1322fa1f291849266@changeid
Signed-off-by: Takashi Iwai <[email protected]>

selftests: forwarding: Fix failing tests with old libnet

The custom multipath hash tests use mausezahn in order to test how
changes in various packet fields affect the packet distribution across
the available nexthops.

The tool uses the libnet library for various low-level packet
construction and injection. The library started using the
"SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and
for IPv4 sockets in version 1.2.

When the option is not set, packets are not routed according to the
table associated with the VRF master device and tests fail.

Fix this by prefixing the command with "ip vrf exec", which will cause
the route lookup to occur in the VRF routing table. This makes the tests
pass regardless of the libnet library version.

Fixes: 511e8db54036 ("selftests: forwarding: Add test for custom multipath hash")
Fixes: 185b0c190bb6 ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE")
Fixes: b7715acba4d3 ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE")
Reported-by: Ivan Vecera <[email protected]>
Tested-by: Ivan Vecera <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
bpf 2022-08-10

We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).

The main changes are:

1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.

2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.

3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.

4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.

5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.

6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.

7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.

8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.

9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
   offset buffer, from Aijun Sun.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  selftests/bpf: Ensure sleepable program is rejected by hash map iter
  selftests/bpf: Add write tests for sk local storage map iterator
  selftests/bpf: Add tests for reading a dangling map iter fd
  bpf: Only allow sleepable program for resched-able iterator
  bpf: Check the validity of max_rdwr_access for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
  bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for hash map iterator
  bpf: Acquire map uref in .init_seq_private for array map iterator
  bpf: Disallow bpf programs call prog_run command.
  bpf, arm64: Fix bpf trampoline instruction endianness
  selftests/bpf: Add test for prealloc_lru_pop bug
  bpf: Don't reinit map value in prealloc_lru_pop
  bpf: Allow calling bpf_prog_test kfuncs in tracing programs
  bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
  selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
  bpf: Use proper target btf when exporting attach_btf_obj_id
  mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
  bpf: Cleanup ftrace hash in bpf_trampoline_put
  BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-enhancements-to-sk_user_data-field'

Hawkins Jiawei says:

====================
net: enhancements to sk_user_data field

This patchset fixes refcount bug by adding SK_USER_DATA_PSOCK flag bit in
sk_user_data field. The bug cause following info:

WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
<TASK>
__refcount_add_not_zero include/linux/refcount.h:163 [inline]
__refcount_inc_not_zero include/linux/refcount.h:227 [inline]
refcount_inc_not_zero include/linux/refcount.h:245 [inline]
sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
sk_backlog_rcv include/net/sock.h:1061 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2849
release_sock+0x54/0x1b0 net/core/sock.c:3404
inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
__sys_shutdown_sock net/socket.c:2331 [inline]
__sys_shutdown_sock net/socket.c:2325 [inline]
__sys_shutdown+0xf1/0x1b0 net/socket.c:2343
__do_sys_shutdown net/socket.c:2351 [inline]
__se_sys_shutdown net/socket.c:2349 [inline]
__x64_sys_shutdown+0x50/0x70 net/socket.c:2349
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>

To improve code maintainability, this patchset refactors sk_user_data
flags code to be more generic.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: refactor bpf_sk_reuseport_detach()

Refactor sk_user_data dereference using more generic function
__rcu_dereference_sk_user_data_with_flags(), which improve its
maintainability

Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Hawkins Jiawei <[email protected]>
Reviewed-by: Jakub Sitnicki <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>