Commit | Line | Data |
---|---|---|
1 | =================== |
2 | Vhost-user Protocol | |
3 | =================== | |
4 | :Copyright: 2014 Virtual Open Systems Sarl. | |
5 | :Licence: This work is licensed under the terms of the GNU GPL, | |
6 | version 2 or later. See the COPYING file in the top-level | |
7 | directory. | |
8 | ||
9 | .. contents:: Table of Contents | |
10 | ||
11 | Introduction | |
12 | ============ | |
13 | ||
14 | This protocol aims to complement the ``ioctl`` interface used to | |
15 | control the vhost implementation in the Linux kernel. It implements | |
16 | the control plane needed to establish virtqueue sharing with a user | |
17 | space process on the same host. It uses communication over a Unix | |
18 | domain socket to share file descriptors in the ancillary data of the | |
19 | message. | |
20 | ||
21 | The protocol defines 2 sides of the communication, *master* and | |
22 | *slave*. *Master* is the application that shares its virtqueues, in | |
23 | our case QEMU. *Slave* is the consumer of the virtqueues. | |
24 | ||
25 | In the current implementation QEMU is the *master*, and the *slave* is | |
26 | the external process consuming the virtio queues, for example a | |
27 | software Ethernet switch running in user space, such as Snabbswitch, | |
28 | or a block device backend processing reads and writes to a virtual | |
29 | disk. In order to facilitate interoperability between various backend | |
30 | implementations, it is recommended to follow the :ref:`Backend program | |
31 | conventions <backend_conventions>`. | |
32 | ||
33 | *Master* and *slave* can be either a client (i.e. connecting) or | |
34 | server (listening) in the socket communication. | |
35 | ||
36 | Message Specification | |
37 | ===================== | |
38 | ||
39 | .. Note:: All numbers are in the machine native byte order. | |
40 | ||
41 | A vhost-user message consists of 3 header fields and a payload. | |
42 | ||
43 | +---------+-------+------+---------+ | |
44 | | request | flags | size | payload | | |
45 | +---------+-------+------+---------+ | |
46 | ||
47 | Header | |
48 | ------ | |
49 | ||
50 | :request: 32-bit type of the request | |
51 | ||
52 | :flags: 32-bit bit field | |
53 | ||
54 | - Lower 2 bits are the version (currently 0x01) | |
55 | - Bit 2 is the reply flag; it must be set on each reply from the slave | |
56 | - Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for | |
57 | details. | |
58 | ||
59 | :size: 32-bit size of the payload | |
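The header layout above can be sketched in C as follows. This is a minimal illustration only: the struct name ``msg_header``, the ``HDR_*`` macros, and the helper functions are hypothetical names chosen for this example and do not come from QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag layout as described in the header section; names are illustrative. */
#define HDR_VERSION_MASK 0x3u       /* lower 2 bits: protocol version */
#define HDR_VERSION      0x1u       /* current version */
#define HDR_REPLY_FLAG   (1u << 2)  /* bit 2: set on every reply from the slave */
#define HDR_NEED_REPLY   (1u << 3)  /* bit 3: need_reply, see REPLY_ACK */

/* The three header fields that precede the payload (native byte order). */
struct msg_header {
    uint32_t request;  /* type of the request */
    uint32_t flags;    /* version and flag bits */
    uint32_t size;     /* size of the payload in bytes */
};

static inline uint32_t hdr_version(const struct msg_header *h)
{
    return h->flags & HDR_VERSION_MASK;
}

static inline bool hdr_is_reply(const struct msg_header *h)
{
    return (h->flags & HDR_REPLY_FLAG) != 0;
}

static inline bool hdr_needs_reply(const struct msg_header *h)
{
    return (h->flags & HDR_NEED_REPLY) != 0;
}

/* Flags word for a reply: current version plus the reply flag. */
static inline uint32_t hdr_reply_flags(void)
{
    return HDR_VERSION | HDR_REPLY_FLAG;
}
```

A receiver would typically check ``hdr_version()`` before interpreting the payload, and a slave would use ``hdr_reply_flags()`` when answering a request.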
60 | ||
61 | Payload | |
62 | ------- | |
63 | ||
64 | Depending on the request type, **payload** can be: | |
65 | ||
66 | A single 64-bit integer | |
67 | ^^^^^^^^^^^^^^^^^^^^^^^ | |
68 | ||
69 | +-----+ | |
70 | | u64 | | |
71 | +-----+ | |
72 | ||
73 | :u64: a 64-bit unsigned integer | |
74 | ||
75 | A vring state description | |
76 | ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
77 | ||
78 | +-------+-----+ | |
79 | | index | num | | |
80 | +-------+-----+ | |
81 | ||
82 | :index: a 32-bit index | |
83 | ||
84 | :num: a 32-bit number | |
85 | ||
86 | A vring address description | |
87 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
88 | ||
89 | +-------+-------+------+------------+------+-----------+-----+ | |
90 | | index | flags | size | descriptor | used | available | log | | |
91 | +-------+-------+------+------------+------+-----------+-----+ | |
92 | ||
93 | :index: a 32-bit vring index | |
94 | ||
95 | :flags: a 32-bit vring flags | |
96 | ||
97 | :descriptor: a 64-bit ring address of the vring descriptor table | |
98 | ||
99 | :used: a 64-bit ring address of the vring used ring | |
100 | ||
101 | :available: a 64-bit ring address of the vring available ring | |
102 | ||
103 | :log: a 64-bit guest address for logging | |
104 | ||
105 | Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has | |
106 | been negotiated. Otherwise it is a user address. | |
107 | ||
108 | Memory regions description | |
109 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
110 | ||
111 | +-------------+---------+---------+-----+---------+ | |
112 | | num regions | padding | region0 | ... | region7 | | |
113 | +-------------+---------+---------+-----+---------+ | |
114 | ||
115 | :num regions: a 32-bit number of regions | |
116 | ||
117 | :padding: 32-bit | |
118 | ||
119 | A region is: | |
120 | ||
121 | +---------------+------+--------------+-------------+ | |
122 | | guest address | size | user address | mmap offset | | |
123 | +---------------+------+--------------+-------------+ | |
124 | ||
125 | :guest address: a 64-bit guest address of the region | |
126 | ||
127 | :size: a 64-bit size | |
128 | ||
129 | :user address: a 64-bit user address | |
130 | ||
131 | :mmap offset: 64-bit offset where region starts in the mapped memory | |
132 | ||
133 | Log description | |
134 | ^^^^^^^^^^^^^^^ | |
135 | ||
136 | +----------+------------+ | |
137 | | log size | log offset | | |
138 | +----------+------------+ | |
139 | ||
140 | :log size: size of area used for logging | |
141 | ||
142 | :log offset: offset from start of supplied file descriptor where | |
143 | logging starts (i.e. where guest address 0 would be | |
144 | logged) | |
145 | ||
146 | An IOTLB message | |
147 | ^^^^^^^^^^^^^^^^ | |
148 | ||
149 | +------+------+--------------+-------------------+------+ | |
150 | | iova | size | user address | permissions flags | type | | |
151 | +------+------+--------------+-------------------+------+ | |
152 | ||
153 | :iova: a 64-bit I/O virtual address programmed by the guest | |
154 | ||
155 | :size: a 64-bit size | |
156 | ||
157 | :user address: a 64-bit user address | |
158 | ||
159 | :permissions flags: an 8-bit value: | |
160 | - 0: No access | |
161 | - 1: Read access | |
162 | - 2: Write access | |
163 | - 3: Read/Write access | |
164 | ||
165 | :type: an 8-bit IOTLB message type: | |
166 | - 1: IOTLB miss | |
167 | - 2: IOTLB update | |
168 | - 3: IOTLB invalidate | |
169 | - 4: IOTLB access fail | |
170 | ||
171 | Virtio device config space | |
172 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
173 | ||
174 | +--------+------+-------+---------+ | |
175 | | offset | size | flags | payload | | |
176 | +--------+------+-------+---------+ | |
177 | ||
178 | :offset: a 32-bit offset of virtio device's configuration space | |
179 | ||
180 | :size: a 32-bit configuration space access size in bytes | |
181 | ||
182 | :flags: a 32-bit value: | |
183 | - 0: Vhost master messages used for writeable fields | |
184 | - 1: Vhost master messages used for live migration | |
185 | ||
186 | :payload: an array of ``size`` bytes holding the contents of the virtio | |
187 | device's configuration space | |
188 | ||
189 | Vring area description | |
190 | ^^^^^^^^^^^^^^^^^^^^^^ | |
191 | ||
192 | +-----+------+--------+ | |
193 | | u64 | size | offset | | |
194 | +-----+------+--------+ | |
195 | ||
196 | :u64: a 64-bit integer containing the vring index and flags | |
197 | ||
198 | :size: a 64-bit size of this area | |
199 | ||
200 | :offset: a 64-bit offset of this area from the start of the | |
201 | supplied file descriptor | |
202 | ||
203 | Inflight description | |
204 | ^^^^^^^^^^^^^^^^^^^^ | |
205 | ||
206 | +-----------+-------------+------------+------------+ | |
207 | | mmap size | mmap offset | num queues | queue size | | |
208 | +-----------+-------------+------------+------------+ | |
209 | ||
210 | :mmap size: a 64-bit size of area to track inflight I/O | |
211 | ||
212 | :mmap offset: a 64-bit offset of this area from the start | |
213 | of the supplied file descriptor | |
214 | ||
215 | :num queues: a 16-bit number of virtqueues | |
216 | ||
217 | :queue size: a 16-bit size of virtqueues | |
218 | ||
219 | C structure | |
220 | ----------- | |
221 | ||
222 | In QEMU the vhost-user message is implemented with the following struct: | |
223 | ||
224 | .. code:: c | |
225 | ||
226 |   typedef struct VhostUserMsg { | |
227 |       VhostUserRequest request; | |
228 |       uint32_t flags; | |
229 |       uint32_t size; | |
230 |       union { | |
231 |           uint64_t u64; | |
232 |           struct vhost_vring_state state; | |
233 |           struct vhost_vring_addr addr; | |
234 |           VhostUserMemory memory; | |
235 |           VhostUserLog log; | |
236 |           struct vhost_iotlb_msg iotlb; | |
237 |           VhostUserConfig config; | |
238 |           VhostUserVringArea area; | |
239 |           VhostUserInflight inflight; | |
240 |       }; | |
241 |   } QEMU_PACKED VhostUserMsg; | |
242 | ||
243 | Communication | |
244 | ============= | |
245 | ||
246 | The protocol for vhost-user is based on the existing implementation of | |
247 | vhost for the Linux Kernel. Most messages that can be sent via the | |
248 | Unix domain socket implementing vhost-user have an equivalent ioctl to | |
249 | the kernel implementation. | |
250 | ||
251 | The communication consists of *master* sending message requests and | |
252 | *slave* sending message replies. Most of the requests don't require | |
253 | replies. Here is a list of the ones that do: | |
254 | ||
255 | * ``VHOST_USER_GET_FEATURES`` | |
256 | * ``VHOST_USER_GET_PROTOCOL_FEATURES`` | |
257 | * ``VHOST_USER_GET_VRING_BASE`` | |
258 | * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) | |
259 | * ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) | |
260 | ||
261 | .. seealso:: | |
262 | ||
263 | :ref:`REPLY_ACK <reply_ack>` | |
264 | The section on ``REPLY_ACK`` protocol extension. | |
265 | ||
266 | There are several messages that the master sends with file descriptors passed | |
267 | in the ancillary data: | |
268 | ||
269 | * ``VHOST_USER_SET_MEM_TABLE`` | |
270 | * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) | |
271 | * ``VHOST_USER_SET_LOG_FD`` | |
272 | * ``VHOST_USER_SET_VRING_KICK`` | |
273 | * ``VHOST_USER_SET_VRING_CALL`` | |
274 | * ``VHOST_USER_SET_VRING_ERR`` | |
275 | * ``VHOST_USER_SET_SLAVE_REQ_FD`` | |
276 | * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) | |
277 | ||
278 | If *master* is unable to send the full message or receives a wrong | |
279 | reply it will close the connection. An optional reconnection mechanism | |
280 | can be implemented. | |
281 | ||
282 | Any protocol extensions are gated by protocol feature bits, which | |
283 | allows full backwards compatibility on both master and slave. As | |
284 | older slaves don't support negotiating protocol features, a feature | |
285 | bit was dedicated for this purpose:: | |
286 | ||
287 | #define VHOST_USER_F_PROTOCOL_FEATURES 30 | |
288 | ||
289 | Starting and stopping rings | |
290 | --------------------------- | |
291 | ||
292 | The client must only process each ring when it is started. | |
293 | ||
294 | The client must only pass data between the ring and the backend when the | |
295 | ring is enabled. | |
296 | ||
297 | If a ring is started but disabled, the client must process the ring without | |
298 | talking to the backend. | |
299 | ||
300 | For example, for a networking device, in the disabled state client | |
301 | must not supply any new RX packets, but must process and discard any | |
302 | TX packets. | |
303 | ||
304 | If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the | |
305 | ring is initialized in an enabled state. | |
306 | ||
307 | If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is | |
308 | initialized in a disabled state. Client must not pass data to/from the | |
309 | backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with | |
310 | parameter 1, or after it has been disabled by | |
311 | ``VHOST_USER_SET_VRING_ENABLE`` with parameter 0. | |
312 | ||
313 | Each ring is initialized in a stopped state; the client must not process | |
314 | it until the ring is started, or after it has been stopped. | |
315 | ||
316 | Client must start ring upon receiving a kick (that is, detecting that | |
317 | file descriptor is readable) on the descriptor specified by | |
318 | ``VHOST_USER_SET_VRING_KICK``, and stop ring upon receiving | |
319 | ``VHOST_USER_GET_VRING_BASE``. | |
320 | ||
321 | While processing the rings (whether they are enabled or not), client | |
322 | must support changing some configuration aspects on the fly. | |
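The started/stopped and enabled/disabled rules above can be modelled as two independent booleans per ring. The sketch below is one hypothetical way to express them; ``struct ring_state``, ``enum ring_action``, and the ``ring_*`` functions are names invented for this example.

```c
#include <stdbool.h>

struct ring_state {
    bool started;  /* set on a kick, cleared on VHOST_USER_GET_VRING_BASE */
    bool enabled;  /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

enum ring_action {
    RING_IGNORE,   /* stopped: the ring must not be processed */
    RING_DISCARD,  /* started but disabled: process without the backend */
    RING_FORWARD   /* started and enabled: pass data to/from the backend */
};

/* Every ring starts stopped; it starts enabled only when
 * VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated. */
static inline void ring_init(struct ring_state *r, bool protocol_features)
{
    r->started = false;
    r->enabled = !protocol_features;
}

/* Kick fd became readable. */
static inline void ring_on_kick(struct ring_state *r)
{
    r->started = true;
}

/* VHOST_USER_GET_VRING_BASE received. */
static inline void ring_on_get_vring_base(struct ring_state *r)
{
    r->started = false;
}

/* VHOST_USER_SET_VRING_ENABLE received with parameter 0 or 1. */
static inline void ring_on_set_vring_enable(struct ring_state *r, unsigned param)
{
    r->enabled = (param != 0);
}

static inline enum ring_action ring_next_action(const struct ring_state *r)
{
    if (!r->started)
        return RING_IGNORE;
    return r->enabled ? RING_FORWARD : RING_DISCARD;
}
```

For a networking device, ``RING_DISCARD`` corresponds to dropping TX packets and supplying no RX packets.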
323 | ||
324 | Multiple queue support | |
325 | ---------------------- | |
326 | ||
327 | Multiple queue is treated as a protocol extension, hence the slave has | |
328 | to implement protocol features first. The multiple queues feature is | |
329 | supported only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` | |
330 | (bit 0) is set. | |
331 | ||
332 | The max number of queue pairs the slave supports can be queried with | |
333 | message ``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the | |
334 | number of requested queues is bigger than that. | |
335 | ||
336 | As all queues share one connection, the master uses a unique index for each | |
337 | queue in the sent message to identify a specified queue. One queue pair | |
338 | is enabled initially. More queues are enabled dynamically, by sending | |
339 | message ``VHOST_USER_SET_VRING_ENABLE``. | |
340 | ||
341 | Migration | |
342 | --------- | |
343 | ||
344 | During live migration, the master may need to track the modifications | |
345 | the slave makes to the memory mapped regions. The client should mark | |
346 | the dirty pages in a log. Once it complies with this logging, it may | |
347 | declare the ``VHOST_F_LOG_ALL`` vhost feature. | |
348 | ||
349 | To start/stop logging of data/used ring writes, server may send | |
350 | messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and | |
351 | ``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's | |
352 | flags set to 1/0, respectively. | |
353 | ||
354 | All the modifications to memory pointed by vring "descriptor" should | |
355 | be marked. Modifications to "used" vring should be marked if | |
356 | ``VHOST_VRING_F_LOG`` is part of ring's flags. | |
357 | ||
358 | Dirty pages are of size:: | |
359 | ||
360 | #define VHOST_LOG_PAGE 0x1000 | |
361 | ||
362 | The log memory fd is provided in the ancillary data of | |
363 | ``VHOST_USER_SET_LOG_BASE`` message when the slave has | |
364 | ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature. | |
365 | ||
366 | The size of the log is supplied as part of ``VhostUserMsg`` which | |
367 | should be large enough to cover all known guest addresses. Log starts | |
368 | at the supplied offset in the supplied file descriptor. The log | |
369 | covers from address 0 to the maximum of guest regions. In pseudo-code, | |
370 | to mark page at ``addr`` as dirty:: | |
371 | ||
372 | page = addr / VHOST_LOG_PAGE | |
373 | log[page / 8] |= 1 << page % 8 | |
374 | ||
375 | Where ``addr`` is the guest physical address. | |
376 | ||
377 | Use atomic operations, as the log may be concurrently manipulated. | |
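As an illustration of the atomic update, the pseudo-code above could be written with C11 atomics as sketched below. The function names and the choice to treat the mapped log as an array of ``_Atomic uint8_t`` are assumptions made for this example; a real backend may use compiler builtins or wider words instead.

```c
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Atomically set the dirty bit for the page containing guest
 * physical address `addr`. */
static inline void log_mark_dirty(_Atomic uint8_t *dirty_log, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;
    atomic_fetch_or(&dirty_log[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page touched by a write of `len` bytes starting at `addr`. */
static inline void log_mark_range(_Atomic uint8_t *dirty_log,
                                  uint64_t addr, uint64_t len)
{
    if (len == 0)
        return;

    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++)
        atomic_fetch_or(&dirty_log[page / 8], (uint8_t)(1u << (page % 8)));
}
```

Using a fetch-or rather than a plain read-modify-write avoids losing bits when another thread or process updates the same log byte concurrently.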
378 | ||
379 | Note that when logging modifications to the used ring (when | |
380 | ``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should | |
381 | be used to calculate the log offset: the write to first byte of the | |
382 | used ring is logged at this offset from log start. Also note that this | |
383 | value might be outside the legal guest physical address range | |
384 | (i.e. does not have to be covered by the ``VhostUserMemory`` table), but | |
385 | the bit offset of the last byte of the ring must fall within the size | |
386 | supplied by ``VhostUserLog``. | |
387 | ||
388 | ``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in | |
389 | ancillary data; it may be used to inform the master that the log has | |
390 | been modified. | |
391 | ||
392 | Once the source has finished migration, rings will be stopped by the | |
393 | source. No further update must be done before rings are restarted. | |
394 | ||
395 | In postcopy migration the slave is started before all the memory has | |
396 | been received from the source host, and care must be taken to avoid | |
397 | accessing pages that have yet to be received. The slave opens a | |
398 | 'userfault'-fd and registers the memory with it; this fd is then | |
399 | passed back over to the master. The master services requests on the | |
400 | userfaultfd for pages that are accessed and when the page is available | |
401 | it performs WAKE ioctl's on the userfaultfd to wake the stalled | |
402 | slave. The client indicates support for this via the | |
403 | ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. | |
404 | ||
405 | Memory access | |
406 | ------------- | |
407 | ||
408 | The master sends a list of vhost memory regions to the slave using the | |
409 | ``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base | |
410 | addresses: a guest address and a user address. | |
411 | ||
412 | Messages contain guest addresses and/or user addresses to reference locations | |
413 | within the shared memory. The mapping of these addresses works as follows. | |
414 | ||
415 | User addresses map to the vhost memory region containing that user address. | |
416 | ||
417 | When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated: | |
418 | ||
419 | * Guest addresses map to the vhost memory region containing that guest | |
420 | address. | |
421 | ||
422 | When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated: | |
423 | ||
424 | * Guest addresses are also called I/O virtual addresses (IOVAs). They are | |
425 | translated to user addresses via the IOTLB. | |
426 | ||
427 | * The vhost memory region guest address is not used. | |
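The address mapping rules above can be sketched as a lookup over the region table. This is a simplified illustration: ``struct mem_region`` and both functions are hypothetical, and ``local_base`` is assumed to be the pointer at which the slave has already mapped the start of the region (that is, after applying the mmap offset). The guest address lookup applies only when ``VIRTIO_F_IOMMU_PLATFORM`` has not been negotiated.

```c
#include <stddef.h>
#include <stdint.h>

/* One region from VHOST_USER_SET_MEM_TABLE plus the slave's local mapping. */
struct mem_region {
    uint64_t guest_addr;  /* guest address of the region */
    uint64_t size;        /* size of the region in bytes */
    uint64_t user_addr;   /* user address in the master's address space */
    uint8_t *local_base;  /* assumption: slave's pointer to the region start */
};

/* Translate a guest address to a local pointer, or NULL if unmapped. */
static inline void *guest_to_local(const struct mem_region *regions, size_t n,
                                   uint64_t guest_addr)
{
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        if (guest_addr >= r->guest_addr &&
            guest_addr - r->guest_addr < r->size)
            return r->local_base + (guest_addr - r->guest_addr);
    }
    return NULL;
}

/* Translate a master user address to a local pointer, or NULL if unmapped. */
static inline void *user_to_local(const struct mem_region *regions, size_t n,
                                  uint64_t user_addr)
{
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        if (user_addr >= r->user_addr &&
            user_addr - r->user_addr < r->size)
            return r->local_base + (user_addr - r->user_addr);
    }
    return NULL;
}
```

The range checks subtract before comparing against ``size`` so that a region ending near the top of the 64-bit address space does not overflow.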
428 | ||
429 | IOMMU support | |
430 | ------------- | |
431 | ||
432 | When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the | |
433 | master sends IOTLB entry updates and invalidations by sending | |
434 | ``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct | |
435 | vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload | |
436 | has to be filled with the update message type (2), the I/O virtual | |
437 | address, the size, the user virtual address, and the permissions | |
438 | flags. Addresses and size must be within vhost memory regions set via | |
439 | the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the | |
440 | ``iotlb`` payload has to be filled with the invalidation message type | |
441 | (3), the I/O virtual address and the size. On success, the slave is | |
442 | expected to reply with a zero payload, non-zero otherwise. | |
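The fields each IOTLB event must fill in can be summarised with a small sketch. The struct below mirrors the IOTLB message layout given in the payload section, but ``struct iotlb_msg``, the enum constants, and the constructor functions are hypothetical names for this example and are not the kernel's ``struct vhost_iotlb_msg`` definition.

```c
#include <stdint.h>

/* Permission flags and message types, using the values listed in the
 * IOTLB message description. */
enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1, IOTLB_PERM_WO = 2, IOTLB_PERM_RW = 3 };
enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2, IOTLB_INVALIDATE = 3, IOTLB_ACCESS_FAIL = 4 };

struct iotlb_msg {
    uint64_t iova;   /* I/O virtual address programmed by the guest */
    uint64_t size;
    uint64_t uaddr;  /* user address */
    uint8_t perm;    /* permissions flags */
    uint8_t type;    /* IOTLB message type */
};

/* Master -> slave: update needs iova, size, user address and permissions. */
static inline struct iotlb_msg iotlb_update(uint64_t iova, uint64_t size,
                                            uint64_t uaddr, uint8_t perm)
{
    struct iotlb_msg m = { .iova = iova, .size = size, .uaddr = uaddr,
                           .perm = perm, .type = IOTLB_UPDATE };
    return m;
}

/* Master -> slave: invalidation needs only iova and size. */
static inline struct iotlb_msg iotlb_invalidate(uint64_t iova, uint64_t size)
{
    struct iotlb_msg m = { .iova = iova, .size = size,
                           .type = IOTLB_INVALIDATE };
    return m;
}

/* Slave -> master: a miss carries the iova and the wanted permissions. */
static inline struct iotlb_msg iotlb_miss(uint64_t iova, uint8_t perm)
{
    struct iotlb_msg m = { .iova = iova, .perm = perm, .type = IOTLB_MISS };
    return m;
}
```

Fields not named for a given event are left as zero here; the specification does not say what they must contain.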
443 | ||
444 | The slave relies on the slave communication channel (see :ref:`Slave | |
445 | communication <slave_communication>` section below) to send IOTLB miss | |
446 | and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG`` | |
447 | requests to the master with a ``struct vhost_iotlb_msg`` as | |
448 | payload. For miss events, the iotlb payload has to be filled with the | |
449 | miss message type (1), the I/O virtual address and the permissions | |
450 | flags. For access failure event, the iotlb payload has to be filled | |
451 | with the access failure message type (4), the I/O virtual address and | |
452 | the permissions flags. For synchronization purposes, the slave may | |
453 | rely on the reply-ack feature, so the master may send a reply when | |
454 | the operation is completed if the reply-ack feature is negotiated and | |
455 | the slave requests a reply. For miss events, a completed operation means | |
456 | either master sent an update message containing the IOTLB entry | |
457 | containing requested address and permission, or master sent nothing if | |
458 | the IOTLB miss message is invalid (invalid IOVA or permission). | |
459 | ||
460 | The master isn't expected to take the initiative to send IOTLB update | |
461 | messages, as the slave sends IOTLB miss messages for the guest virtual | |
462 | memory areas it needs to access. | |
463 | ||
464 | .. _slave_communication: | |
465 | ||
466 | Slave communication | |
467 | ------------------- | |
468 | ||
469 | An optional communication channel is provided if the slave declares | |
470 | ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the | |
471 | slave to make requests to the master. | |
472 | ||
473 | The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data. | |
474 | ||
475 | A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master | |
476 | using this fd communication channel. | |
477 | ||
478 | If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is | |
479 | negotiated, slave can send file descriptors (at most 8 descriptors in | |
480 | each message) to master via ancillary data using this fd communication | |
481 | channel. | |
482 | ||
483 | Inflight I/O tracking | |
484 | --------------------- | |
485 | ||
486 | To support reconnecting after restart or crash, slave may need to | |
487 | resubmit inflight I/Os. If virtqueue is processed in order, we can | |
488 | easily achieve that by getting the inflight descriptors from | |
489 | descriptor table (split virtqueue) or descriptor ring (packed | |
490 | virtqueue). However, it can't work when we process descriptors | |
491 | out-of-order because some entries which store the information of | |
492 | inflight descriptors in available ring (split virtqueue) or descriptor | |
493 | ring (packed virtqueue) might be overwritten by new entries. To solve | |
494 | this problem, the slave needs to allocate an extra buffer to store this | |
495 | information about inflight descriptors and share it with the master for | |
496 | persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and | |
497 | ``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer | |
498 | between master and slave. And the format of this buffer is described | |
499 | below: | |
500 | ||
501 | +---------------+---------------+-----+---------------+ | |
502 | | queue0 region | queue1 region | ... | queueN region | | |
503 | +---------------+---------------+-----+---------------+ | |
504 | ||
505 | N is the number of available virtqueues. The slave can get it from the num | |
506 | queues field of ``VhostUserInflight``. | |
507 | ||
508 | For split virtqueue, queue region can be implemented as: | |
509 | ||
510 | .. code:: c | |
511 | ||
512 | typedef struct DescStateSplit { | |
513 | /* Indicate whether this descriptor is inflight or not. | |
514 | * Only available for head-descriptor. */ | |
515 | uint8_t inflight; | |
516 | ||
517 | /* Padding */ | |
518 | uint8_t padding[5]; | |
519 | ||
520 | /* Maintain a list for the last batch of used descriptors. | |
521 | * Only available when batching is used for submitting */ | |
522 | uint16_t next; | |
523 | ||
524 | /* Used to preserve the order of fetching available descriptors. | |
525 | * Only available for head-descriptor. */ | |
526 | uint64_t counter; | |
527 | } DescStateSplit; | |
528 | ||
529 | typedef struct QueueRegionSplit { | |
530 | /* The feature flags of this region. Now it's initialized to 0. */ | |
531 | uint64_t features; | |
532 | ||
533 | /* The version of this region. It's 1 currently. | |
534 | * Zero value indicates an uninitialized buffer */ | |
535 | uint16_t version; | |
536 | ||
537 | /* The size of DescStateSplit array. It's equal to the virtqueue | |
538 | * size. Slave could get it from queue size field of VhostUserInflight. */ | |
539 | uint16_t desc_num; | |
540 | ||
541 | /* The head of list that track the last batch of used descriptors. */ | |
542 | uint16_t last_batch_head; | |
543 | ||
544 | /* Store the idx value of used ring */ | |
545 | uint16_t used_idx; | |
546 | ||
547 | /* Used to track the state of each descriptor in descriptor table */ | |
548 | DescStateSplit desc[0]; | |
549 | } QueueRegionSplit; | |
550 | ||
551 | To track inflight I/O, the queue region should be processed as follows: | |
552 | ||
553 | When receiving available buffers from the driver: | |
554 | ||
555 | #. Get the next available head-descriptor index from available ring, ``i`` | |
556 | ||
557 | #. Set ``desc[i].counter`` to the value of global counter | |
558 | ||
559 | #. Increase global counter by 1 | |
560 | ||
561 | #. Set ``desc[i].inflight`` to 1 | |
562 | ||
563 | When supplying used buffers to the driver: | |
564 | ||
565 | 1. Get corresponding used head-descriptor index, ``i`` | |
566 | ||
567 | 2. Set ``desc[i].next`` to ``last_batch_head`` | |
568 | ||
569 | 3. Set ``last_batch_head`` to ``i`` | |
570 | ||
571 | #. Steps 1,2,3 may be performed repeatedly if batching is possible | |
572 | ||
573 | #. Increase the ``idx`` value of used ring by the size of the batch | |
574 | ||
575 | #. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0 | |
576 | ||
577 | #. Set ``used_idx`` to the ``idx`` value of used ring | |
578 | ||
579 | When reconnecting: | |
580 | ||
581 | #. If the value of ``used_idx`` does not match the ``idx`` value of | |
582 | used ring (means the inflight field of ``DescStateSplit`` entries in | |
583 | last batch may be incorrect), | |
584 | ||
585 | a. Subtract the value of ``used_idx`` from the ``idx`` value of | |
586 | used ring to get last batch size of ``DescStateSplit`` entries | |
587 | ||
588 | #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch | |
589 | list which starts from ``last_batch_head`` | |
590 | ||
591 | #. Set ``used_idx`` to the ``idx`` value of used ring | |
592 | ||
593 | #. Resubmit inflight ``DescStateSplit`` entries in order of their | |
594 | counter value | |
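The split virtqueue steps above can be sketched in C as follows. This is a simplified illustration under several assumptions: the region uses a fixed ``SKETCH_QUEUE_SIZE`` array instead of a flexible array member, the used ring's ``idx`` is represented by a plain ``uint16_t`` passed by the caller, and the batch size is always one. The type and function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_QUEUE_SIZE 8  /* assumption: fixed size to keep the sketch short */

typedef struct {
    uint8_t inflight;
    uint8_t padding[5];
    uint16_t next;
    uint64_t counter;
} DescState;

typedef struct {
    uint64_t features;
    uint16_t version;
    uint16_t desc_num;
    uint16_t last_batch_head;
    uint16_t used_idx;
    DescState desc[SKETCH_QUEUE_SIZE];
} QueueRegion;

/* Receiving an available buffer whose head-descriptor index is i. */
static inline void inflight_get(QueueRegion *q, uint64_t *global_counter,
                                uint16_t i)
{
    q->desc[i].counter = (*global_counter)++;
    q->desc[i].inflight = 1;
}

/* Supplying one used buffer (batch of size 1) with head index i. */
static inline void inflight_put(QueueRegion *q, uint16_t i,
                                uint16_t *ring_used_idx)
{
    q->desc[i].next = q->last_batch_head;
    q->last_batch_head = i;
    (*ring_used_idx)++;            /* stands in for the used ring update */
    q->desc[i].inflight = 0;
    q->used_idx = *ring_used_idx;
}

/* Reconnecting: repair the last batch if needed, then write the inflight
 * head indexes to `out` in counter order. Returns how many were found. */
static inline uint16_t inflight_recover(QueueRegion *q, uint16_t ring_used_idx,
                                        uint16_t *out)
{
    if (q->used_idx != ring_used_idx) {
        uint16_t batch = (uint16_t)(ring_used_idx - q->used_idx);
        uint16_t i = q->last_batch_head;
        while (batch-- > 0 && i < q->desc_num) {
            q->desc[i].inflight = 0;
            i = q->desc[i].next;
        }
        q->used_idx = ring_used_idx;
    }

    uint16_t n = 0;
    for (uint16_t i = 0; i < q->desc_num; i++)
        if (q->desc[i].inflight)
            out[n++] = i;

    /* Insertion sort by counter preserves the original fetch order. */
    for (uint16_t a = 1; a < n; a++) {
        uint16_t v = out[a];
        uint16_t b = a;
        while (b > 0 && q->desc[out[b - 1]].counter > q->desc[v].counter) {
            out[b] = out[b - 1];
            b--;
        }
        out[b] = v;
    }
    return n;
}
```

The recovery path handles a crash that happened after the used ring index was advanced but before the ``inflight`` fields of the batch were cleared.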
595 | ||
596 | For packed virtqueue, queue region can be implemented as: | |
597 | ||
598 | .. code:: c | |
599 | ||
600 | typedef struct DescStatePacked { | |
601 | /* Indicate whether this descriptor is inflight or not. | |
602 | * Only available for head-descriptor. */ | |
603 | uint8_t inflight; | |
604 | ||
605 | /* Padding */ | |
606 | uint8_t padding; | |
607 | ||
608 | /* Link to the next free entry */ | |
609 | uint16_t next; | |
610 | ||
611 | /* Link to the last entry of descriptor list. | |
612 | * Only available for head-descriptor. */ | |
613 | uint16_t last; | |
614 | ||
615 | /* The length of descriptor list. | |
616 | * Only available for head-descriptor. */ | |
617 | uint16_t num; | |
618 | ||
619 | /* Used to preserve the order of fetching available descriptors. | |
620 | * Only available for head-descriptor. */ | |
621 | uint64_t counter; | |
622 | ||
623 | /* The buffer id */ | |
624 | uint16_t id; | |
625 | ||
626 | /* The descriptor flags */ | |
627 | uint16_t flags; | |
628 | ||
629 | /* The buffer length */ | |
630 | uint32_t len; | |
631 | ||
632 | /* The buffer address */ | |
633 | uint64_t addr; | |
634 | } DescStatePacked; | |
635 | ||
636 | typedef struct QueueRegionPacked { | |
637 | /* The feature flags of this region. Now it's initialized to 0. */ | |
638 | uint64_t features; | |
639 | ||
640 | /* The version of this region. It's 1 currently. | |
641 | * Zero value indicates an uninitialized buffer */ | |
642 | uint16_t version; | |
643 | ||
644 | /* The size of DescStatePacked array. It's equal to the virtqueue | |
645 | * size. Slave could get it from queue size field of VhostUserInflight. */ | |
646 | uint16_t desc_num; | |
647 | ||
648 | /* The head of free DescStatePacked entry list */ | |
649 | uint16_t free_head; | |
650 | ||
651 | /* The old head of free DescStatePacked entry list */ | |
652 | uint16_t old_free_head; | |
653 | ||
654 | /* The used index of descriptor ring */ | |
655 | uint16_t used_idx; | |
656 | ||
657 | /* The old used index of descriptor ring */ | |
658 | uint16_t old_used_idx; | |
659 | ||
660 | /* Device ring wrap counter */ | |
661 | uint8_t used_wrap_counter; | |
662 | ||
663 | /* The old device ring wrap counter */ | |
664 | uint8_t old_used_wrap_counter; | |
665 | ||
666 | /* Padding */ | |
667 | uint8_t padding[7]; | |
668 | ||
669 | /* Used to track the state of each descriptor fetched from descriptor ring */ | |
670 | DescStatePacked desc[0]; | |
671 | } QueueRegionPacked; | |
672 | ||
673 | To track inflight I/O, the queue region should be processed as follows: | |
674 | ||
675 | When receiving available buffers from the driver: | |
676 | ||
677 | #. Get the next available descriptor entry from descriptor ring, ``d`` | |
678 | ||
679 | #. If ``d`` is head descriptor, | |
680 | ||
681 | a. Set ``desc[old_free_head].num`` to 0 | |
682 | ||
683 | #. Set ``desc[old_free_head].counter`` to the value of global counter | |
684 | ||
685 | #. Increase global counter by 1 | |
686 | ||
687 | #. Set ``desc[old_free_head].inflight`` to 1 | |
688 | ||
689 | #. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to | |
690 | ``free_head`` | |
691 | ||
692 | #. Increase ``desc[old_free_head].num`` by 1 | |
693 | ||
694 | #. Set ``desc[free_head].addr``, ``desc[free_head].len``, | |
695 | ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``, | |
696 | ``d.len``, ``d.flags``, ``d.id`` | |
697 | ||
698 | #. Set ``free_head`` to ``desc[free_head].next`` | |
699 | ||
700 | #. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head`` | |
701 | ||
702 | When supplying used buffers to the driver: | |
703 | ||
704 | 1. Get corresponding used head-descriptor entry from descriptor ring, | |
705 | ``d`` | |
706 | ||
707 | 2. Get corresponding ``DescStatePacked`` entry, ``e`` | |
708 | ||
709 | 3. Set ``desc[e.last].next`` to ``free_head`` | |
710 | ||
711 | 4. Set ``free_head`` to the index of ``e`` | |
712 | ||
713 | #. Steps 1,2,3,4 may be performed repeatedly if batching is possible | |
714 | ||
715 | #. Increase ``used_idx`` by the size of the batch and update | |
716 | ``used_wrap_counter`` if needed | |
717 | ||
718 | #. Update ``d.flags`` | |
719 | ||
720 | #. Set the ``inflight`` field of each head ``DescStatePacked`` entry | |
721 | in the batch to 0 | |
722 | ||
723 | #. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter`` | |
724 | to ``free_head``, ``used_idx``, ``used_wrap_counter`` | |
725 | ||
726 | When reconnecting: | |
727 | ||
728 | #. If ``used_idx`` does not match ``old_used_idx`` (means the | |
729 | ``inflight`` field of ``DescStatePacked`` entries in last batch may | |
730 | be incorrect), | |
731 | ||
732 | a. Get the next descriptor ring entry through ``old_used_idx``, ``d`` | |
733 | ||
734 | #. Use ``old_used_wrap_counter`` to calculate the available flags | |
735 | ||
736 | #. If ``d.flags`` is not equal to the calculated flags value (means | |
737 | slave has submitted the buffer to guest driver before crash, so | |
738 | it has to commit the in-progress update), set ``old_free_head``, | |
739 | ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``, | |
740 | ``used_idx``, ``used_wrap_counter`` | |
741 | ||
742 | #. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to | |
743 | ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter`` | |
744 | (roll back any in-progress update) | |
745 | ||
746 | #. Set the ``inflight`` field of each ``DescStatePacked`` entry in | |
747 | free list to 0 | |
748 | ||
749 | #. Resubmit inflight ``DescStatePacked`` entries in order of their | |
750 | counter value | |
751 | ||
752 | Protocol features | |
753 | ----------------- | |
754 | ||
755 | .. code:: c | |
756 | ||
757 | #define VHOST_USER_PROTOCOL_F_MQ 0 | |
758 | #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 | |
759 | #define VHOST_USER_PROTOCOL_F_RARP 2 | |
760 | #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3 | |
761 | #define VHOST_USER_PROTOCOL_F_MTU 4 | |
762 | #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5 | |
763 | #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6 | |
764 | #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7 | |
765 | #define VHOST_USER_PROTOCOL_F_PAGEFAULT 8 | |
766 | #define VHOST_USER_PROTOCOL_F_CONFIG 9 | |
767 | #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10 | |
768 | #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11 | |
769 | #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12 | |
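The bit numbers above are positions in a 64-bit mask, not mask values. The sketch below shows one way to test them and to gate protocol features on ``VHOST_USER_F_PROTOCOL_FEATURES``. The helper names are hypothetical, and choosing the intersection of what both sides support is an assumption about master policy rather than something the specification mandates.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES  30
#define VHOST_USER_PROTOCOL_F_MQ        0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3

/* True if feature bit number `bit` is set in the 64-bit mask. */
static inline bool has_feature(uint64_t features, unsigned bit)
{
    return (features >> bit) & 1;
}

/* Protocol features may only be used when the slave advertised
 * VHOST_USER_F_PROTOCOL_FEATURES in VHOST_USER_GET_FEATURES.
 * Assumption: the master enables the bits both sides support. */
static inline uint64_t negotiate_protocol_features(uint64_t vhost_features,
                                                   uint64_t slave_supported,
                                                   uint64_t master_supported)
{
    if (!has_feature(vhost_features, VHOST_USER_F_PROTOCOL_FEATURES))
        return 0;
    return slave_supported & master_supported;
}
```

The result would be the payload of ``VHOST_USER_SET_PROTOCOL_FEATURES``; a result of 0 means no protocol extension may be used.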
770 | ||
771 | Master message types | |
772 | -------------------- | |
773 | ||
774 | ``VHOST_USER_GET_FEATURES`` | |
775 | :id: 1 | |
776 | :equivalent ioctl: ``VHOST_GET_FEATURES`` | |
777 | :master payload: N/A | |
778 | :slave payload: ``u64`` | |
779 | ||
780 | Get from the underlying vhost implementation the features bitmask. | |
781 | Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support | |
782 | for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and | |
783 | ``VHOST_USER_SET_PROTOCOL_FEATURES``. | |
784 | ||
785 | ``VHOST_USER_SET_FEATURES`` | |
786 | :id: 2 | |
787 | :equivalent ioctl: ``VHOST_SET_FEATURES`` | |
788 | :master payload: ``u64`` | |
789 | ||
790 | Enable features in the underlying vhost implementation using a | |
791 | bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals | |
792 | slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and | |
793 | ``VHOST_USER_SET_PROTOCOL_FEATURES``. | |
794 | ||
795 | ``VHOST_USER_GET_PROTOCOL_FEATURES`` | |
796 | :id: 15 | |
797 | :equivalent ioctl: ``VHOST_GET_FEATURES`` | |
798 | :master payload: N/A | |
799 | :slave payload: ``u64`` | |
800 | ||
801 | Get the protocol feature bitmask from the underlying vhost | |
802 | implementation. Only legal if feature bit | |
803 | ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in | |
804 | ``VHOST_USER_GET_FEATURES``. | |
805 | ||
806 | .. Note:: | |
807 | Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must | |
808 | support this message even before ``VHOST_USER_SET_FEATURES`` was | |
809 | called. | |
810 | ||
811 | ``VHOST_USER_SET_PROTOCOL_FEATURES`` | |
812 | :id: 16 | |
813 | :equivalent ioctl: ``VHOST_SET_FEATURES`` | |
814 | :master payload: ``u64`` | |
815 | ||
816 | Enable protocol features in the underlying vhost implementation. | |
817 | ||
818 | Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in | |
819 | ``VHOST_USER_GET_FEATURES``. | |
820 | ||
821 | .. Note:: | |
822 | Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support | |
823 | this message even before ``VHOST_USER_SET_FEATURES`` was called. | |
824 | ||
825 | ``VHOST_USER_SET_OWNER`` | |
826 | :id: 3 | |
827 | :equivalent ioctl: ``VHOST_SET_OWNER`` | |
828 | :master payload: N/A | |
829 | ||
830 | Issued when a new connection is established. It sets the current | |
831 | *master* as an owner of the session. This can be used on the *slave* | |
832 | as a "session start" flag. | |
833 | ||
834 | ``VHOST_USER_RESET_OWNER`` | |
835 | :id: 4 | |
836 | :master payload: N/A | |
837 | ||
838 | .. admonition:: Deprecated | |
839 | ||
840 | This is no longer used. Used to be sent to request disabling all | |
841 | rings, but some clients interpreted it to also discard connection | |
842 | state (this interpretation would lead to bugs). It is recommended | |
843 | that clients either ignore this message, or use it to disable all | |
844 | rings. | |
845 | ||
846 | ``VHOST_USER_SET_MEM_TABLE`` | |
847 | :id: 5 | |
848 | :equivalent ioctl: ``VHOST_SET_MEM_TABLE`` | |
849 | :master payload: memory regions description | |
850 | :slave payload: (postcopy only) memory regions description | |
851 | ||
852 | Sets the memory map regions on the slave so it can translate the | |
853 | vring addresses. In the ancillary data there is an array of file | |
854 | descriptors for each memory mapped region. The size and ordering of | |
855 | the fds matches the number and ordering of memory regions. | |
856 | ||
857 | When ``VHOST_USER_POSTCOPY_LISTEN`` has been received, | |
858 | ``SET_MEM_TABLE`` replies with the bases of the memory mapped | |
859 | regions to the master. The slave must have mmap'd the regions but | |
860 | not yet accessed them and should not yet generate a userfault | |
861 | event. | |
862 | ||
863 | .. Note:: | |
864 | ``NEED_REPLY_MASK`` is not set in this case. QEMU will then | |
865 | reply back to the list of mappings with an empty | |
866 | ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon | |
867 | reception of this message may the guest start accessing the memory | |
868 | and generating faults. | |
869 | ||
870 | ``VHOST_USER_SET_LOG_BASE`` | |
871 | :id: 6 | |
872 | :equivalent ioctl: ``VHOST_SET_LOG_BASE`` | |
873 | :master payload: u64 | |
874 | :slave payload: N/A | |
875 | ||
876 | Sets logging shared memory space. | |
877 | ||
878 | When slave has ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature, | |
879 | the log memory fd is provided in the ancillary data of | |
880 | ``VHOST_USER_SET_LOG_BASE`` message; the size and offset of the shared | |
881 | memory area are provided in the message. | |
882 | ||
883 | ``VHOST_USER_SET_LOG_FD`` | |
884 | :id: 7 | |
885 | :equivalent ioctl: ``VHOST_SET_LOG_FD`` | |
886 | :master payload: N/A | |
887 | ||
888 | Sets the logging file descriptor, which is passed as ancillary data. | |
889 | ||
890 | ``VHOST_USER_SET_VRING_NUM`` | |
891 | :id: 8 | |
892 | :equivalent ioctl: ``VHOST_SET_VRING_NUM`` | |
893 | :master payload: vring state description | |
894 | ||
895 | Set the size of the queue. | |
896 | ||
897 | ``VHOST_USER_SET_VRING_ADDR`` | |
898 | :id: 9 | |
899 | :equivalent ioctl: ``VHOST_SET_VRING_ADDR`` | |
900 | :master payload: vring address description | |
901 | :slave payload: N/A | |
902 | ||
903 | Sets the addresses of the different aspects of the vring. | |
904 | ||
905 | ``VHOST_USER_SET_VRING_BASE`` | |
906 | :id: 10 | |
907 | :equivalent ioctl: ``VHOST_SET_VRING_BASE`` | |
908 | :master payload: vring state description | |
909 | ||
910 | Sets the base offset in the available vring. | |
911 | ||
912 | ``VHOST_USER_GET_VRING_BASE`` | |
913 | :id: 11 | |
914 | :equivalent ioctl: ``VHOST_GET_VRING_BASE`` | |
915 | :master payload: vring state description | |
916 | :slave payload: vring state description | |
917 | ||
918 | Get the available vring base offset. | |
919 | ||
920 | ``VHOST_USER_SET_VRING_KICK`` | |
921 | :id: 12 | |
922 | :equivalent ioctl: ``VHOST_SET_VRING_KICK`` | |
923 | :master payload: ``u64`` | |
924 | ||
925 | Set the event file descriptor for adding buffers to the vring. It is | |
926 | passed in the ancillary data. | |
927 | ||
928 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
929 | invalid FD flag. This flag is set when there is no file descriptor | |
930 | in the ancillary data. This signals that polling should be used | |
931 | instead of waiting for a kick. | |
932 | ||
933 | ``VHOST_USER_SET_VRING_CALL`` | |
934 | :id: 13 | |
935 | :equivalent ioctl: ``VHOST_SET_VRING_CALL`` | |
936 | :master payload: ``u64`` | |
937 | ||
938 | Set the event file descriptor to signal when buffers are used. It is | |
939 | passed in the ancillary data. | |
940 | ||
941 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
942 | invalid FD flag. This flag is set when there is no file descriptor | |
943 | in the ancillary data. This signals that polling will be used | |
944 | instead of waiting for the call. | |
945 | ||
946 | ``VHOST_USER_SET_VRING_ERR`` | |
947 | :id: 14 | |
948 | :equivalent ioctl: ``VHOST_SET_VRING_ERR`` | |
949 | :master payload: ``u64`` | |
950 | ||
951 | Set the event file descriptor to signal when an error occurs. It is | |
952 | passed in the ancillary data. | |
953 | ||
954 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
955 | invalid FD flag. This flag is set when there is no file descriptor | |
956 | in the ancillary data. | |
957 | ||
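The ``u64`` payload shared by ``VHOST_USER_SET_VRING_KICK``,
``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR`` can be decoded
as sketched below. The macro and function names are hypothetical; only the bit
layout (bits 0-7 for the vring index, bit 8 for the invalid FD flag) comes
from the text above.

.. code:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define VRING_PAYLOAD_IDX_MASK  0xffu      /* bits 0-7: vring index */
   #define VRING_PAYLOAD_NOFD_MASK (1u << 8)  /* bit 8: invalid FD flag */

   /* Extract the vring index from the payload. */
   static unsigned int vring_payload_index(uint64_t payload)
   {
       return (unsigned int)(payload & VRING_PAYLOAD_IDX_MASK);
   }

   /* True if a file descriptor is expected in the ancillary data. */
   static bool vring_payload_has_fd(uint64_t payload)
   {
       return (payload & VRING_PAYLOAD_NOFD_MASK) == 0;
   }

For example, a payload of ``0x103`` names vring 3 and carries no file
descriptor, so a slave would fall back to polling for that ring.
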
958 | ``VHOST_USER_GET_QUEUE_NUM`` | |
959 | :id: 17 | |
960 | :equivalent ioctl: N/A | |
961 | :master payload: N/A | |
962 | :slave payload: u64 | |
963 | ||
964 | Query how many queues the backend supports. | |
965 | ||
966 | This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ`` | |
967 | is set in queried protocol features by | |
968 | ``VHOST_USER_GET_PROTOCOL_FEATURES``. | |
969 | ||
970 | ``VHOST_USER_SET_VRING_ENABLE`` | |
971 | :id: 18 | |
972 | :equivalent ioctl: N/A | |
973 | :master payload: vring state description | |
974 | ||
975 | Signal slave to enable or disable the corresponding vring. | |
976 | ||
977 | This request should be sent only when | |
978 | ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated. | |
979 | ||
980 | ``VHOST_USER_SEND_RARP`` | |
981 | :id: 19 | |
982 | :equivalent ioctl: N/A | |
983 | :master payload: ``u64`` | |
984 | ||
985 | Ask vhost user backend to broadcast a fake RARP to notify that the migration | |
986 | has terminated, for guests that do not support GUEST_ANNOUNCE. | |
987 | ||
988 | Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is | |
989 | present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit | |
990 | ``VHOST_USER_PROTOCOL_F_RARP`` is present in | |
991 | ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the | |
992 | payload contain the mac address of the guest to allow the vhost user | |
993 | backend to construct and broadcast the fake RARP. | |
994 | ||
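A minimal sketch of recovering the MAC address from the ``u64`` payload,
assuming "the first 6 bytes" means the first six bytes of the payload as it
sits in memory (the protocol uses native byte order). The function name
``rarp_payload_to_mac`` is hypothetical.

.. code:: c

   #include <stdint.h>
   #include <string.h>

   /* Copy the first 6 in-memory bytes of the payload into `mac`. */
   static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[6])
   {
       memcpy(mac, &payload, 6);
   }

Copying with ``memcpy`` rather than shifting avoids making the result depend
on how the integer value itself is interpreted.
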
995 | ``VHOST_USER_NET_SET_MTU`` | |
996 | :id: 20 | |
997 | :equivalent ioctl: N/A | |
998 | :master payload: ``u64`` | |
999 | ||
1000 | Set host MTU value exposed to the guest. | |
1001 | ||
1002 | This request should be sent only when ``VIRTIO_NET_F_MTU`` feature | |
1003 | has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES`` | |
1004 | is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit | |
1005 | ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in | |
1006 | ``VHOST_USER_GET_PROTOCOL_FEATURES``. | |
1007 | ||
1008 | If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must | |
1009 | respond with zero in case the specified MTU is valid, or non-zero | |
1010 | otherwise. | |
1011 | ||
1012 | ``VHOST_USER_SET_SLAVE_REQ_FD`` | |
1013 | :id: 21 | |
1014 | :equivalent ioctl: N/A | |
1015 | :master payload: N/A | |
1016 | ||
1017 | Set the socket file descriptor for slave initiated requests. It is passed | |
1018 | in the ancillary data. | |
1019 | ||
1020 | This request should be sent only when | |
1021 | ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol | |
1022 | feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in | |
1023 | ``VHOST_USER_GET_PROTOCOL_FEATURES``. If | |
1024 | ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must | |
1025 | respond with zero for success, non-zero otherwise. | |
1026 | ||
1027 | ``VHOST_USER_IOTLB_MSG`` | |
1028 | :id: 22 | |
1029 | :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type) | |
1030 | :master payload: ``struct vhost_iotlb_msg`` | |
1031 | :slave payload: ``u64`` | |
1032 | ||
1033 | Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload. | |
1034 | ||
1035 | Master sends such requests to update and invalidate entries in the | |
1036 | device IOTLB. The slave has to acknowledge the request by sending | |
1037 | zero as ``u64`` payload for success, non-zero otherwise. | |
1038 | ||
1039 | This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM`` | |
1040 | feature has been successfully negotiated. | |
1041 | ||
1042 | ``VHOST_USER_SET_VRING_ENDIAN`` | |
1043 | :id: 23 | |
1044 | :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN`` | |
1045 | :master payload: vring state description | |
1046 | ||
1047 | Set the endianness of a VQ for legacy devices. Little-endian is | |
1048 | indicated with state.num set to 0 and big-endian is indicated with | |
1049 | state.num set to 1. Other values are invalid. | |
1050 | ||
1051 | This request should be sent only when | |
1052 | ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated. | |
1053 | Backends that negotiated this feature should handle both | |
1054 | endiannesses and expect this message once (per VQ) during device | |
1055 | configuration (i.e. before the master starts the VQ). | |
1056 | ||
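The mapping from ``state.num`` to an endianness can be sketched as follows.
The enum and function names are hypothetical, and ``num`` is assumed to be the
32-bit field of the vring state description.

.. code:: c

   #include <stdint.h>

   enum vring_endian {
       VRING_ENDIAN_LITTLE = 0,
       VRING_ENDIAN_BIG = 1
   };

   /* Returns 0 and fills `out` for a valid value, -1 for any other value. */
   static int vring_endian_from_num(uint32_t num, enum vring_endian *out)
   {
       switch (num) {
       case 0:
           *out = VRING_ENDIAN_LITTLE;
           return 0;
       case 1:
           *out = VRING_ENDIAN_BIG;
           return 0;
       default:
           return -1;
       }
   }
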
1057 | ``VHOST_USER_GET_CONFIG`` | |
1058 | :id: 24 | |
1059 | :equivalent ioctl: N/A | |
1060 | :master payload: virtio device config space | |
1061 | :slave payload: virtio device config space | |
1062 | ||
1063 | When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is | |
1064 | submitted by the vhost-user master to fetch the contents of the | |
1065 | virtio device configuration space. The vhost-user slave's payload size | |
1066 | MUST match the master's request; the slave uses a zero-length | |
1067 | payload to indicate an error to the vhost-user master. The vhost-user | |
1068 | master may cache the contents to avoid repeated | |
1069 | ``VHOST_USER_GET_CONFIG`` calls. | |
1070 | ||
1071 | ``VHOST_USER_SET_CONFIG`` | |
1072 | :id: 25 | |
1073 | :equivalent ioctl: N/A | |
1074 | :master payload: virtio device config space | |
1075 | :slave payload: N/A | |
1076 | ||
1077 | When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is | |
1078 | submitted by the vhost-user master when the guest changes the virtio | |
1079 | device configuration space, and can also be used for live migration | |
1080 | on the destination host. The vhost-user slave must check the flags | |
1081 | field, and slaves MUST NOT accept SET_CONFIG for read-only | |
1082 | configuration space fields unless the live migration bit is set. | |
1083 | ||
1084 | ``VHOST_USER_CREATE_CRYPTO_SESSION`` | |
1085 | :id: 26 | |
1086 | :equivalent ioctl: N/A | |
1087 | :master payload: crypto session description | |
1088 | :slave payload: crypto session description | |
1089 | ||
1090 | Create a session for crypto operation. The server side must return | |
1091 | the session id, 0 or positive for success, negative for failure. | |
1092 | This request should be sent only when | |
1093 | ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been | |
1094 | successfully negotiated. It's a required feature for crypto | |
1095 | devices. | |
1096 | ||
1097 | ``VHOST_USER_CLOSE_CRYPTO_SESSION`` | |
1098 | :id: 27 | |
1099 | :equivalent ioctl: N/A | |
1100 | :master payload: ``u64`` | |
1101 | ||
1102 | Close a session for crypto operation which was previously | |
1103 | created by ``VHOST_USER_CREATE_CRYPTO_SESSION``. | |
1104 | ||
1105 | This request should be sent only when | |
1106 | ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been | |
1107 | successfully negotiated. It's a required feature for crypto | |
1108 | devices. | |
1109 | ||
1110 | ``VHOST_USER_POSTCOPY_ADVISE`` | |
1111 | :id: 28 | |
1112 | :master payload: N/A | |
1113 | :slave payload: userfault fd | |
1114 | ||
1115 | When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master | |
1116 | advises slave that a migration with postcopy enabled is underway; | |
1117 | the slave must open a userfaultfd for later use. Note that at this | |
1118 | stage the migration is still in precopy mode. | |
1119 | ||
1120 | ``VHOST_USER_POSTCOPY_LISTEN`` | |
1121 | :id: 29 | |
1122 | :master payload: N/A | |
1123 | ||
1124 | Master advises slave that a transition to postcopy mode has | |
1125 | happened. The slave must ensure that shared memory is registered | |
1126 | with userfaultfd to cause faulting of non-present pages. | |
1127 | ||
1128 | This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``, | |
1129 | and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported. | |
1130 | ||
1131 | ``VHOST_USER_POSTCOPY_END`` | |
1132 | :id: 30 | |
1133 | :slave payload: ``u64`` | |
1134 | ||
1135 | Master advises that postcopy migration has now completed. The slave | |
1136 | must disable the userfaultfd. The response is an acknowledgement | |
1137 | only. | |
1138 | ||
1139 | When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message | |
1140 | is sent at the end of the migration, after | |
1141 | ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent. | |
1142 | ||
1143 | The value returned is an error indication; 0 is success. | |
1144 | ||
1145 | ``VHOST_USER_GET_INFLIGHT_FD`` | |
1146 | :id: 31 | |
1147 | :equivalent ioctl: N/A | |
1148 | :master payload: inflight description | |
1149 | ||
1150 | When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has | |
1151 | been successfully negotiated, this message is submitted by master to | |
1152 | get a shared buffer from slave. The shared buffer will be used to | |
1153 | track inflight I/O by slave. QEMU should retrieve a new one when the VM is | |
1154 | reset. | |
1155 | ||
1156 | ``VHOST_USER_SET_INFLIGHT_FD`` | |
1157 | :id: 32 | |
1158 | :equivalent ioctl: N/A | |
1159 | :master payload: inflight description | |
1160 | ||
1161 | When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has | |
1162 | been successfully negotiated, this message is submitted by master to | |
1163 | send the shared inflight buffer back to slave so that slave can | |
1164 | get inflight I/O after a crash or restart. | |
1165 | ||
1166 | ``VHOST_USER_GPU_SET_SOCKET`` |
1167 | :id: 33 | |
1168 | :equivalent ioctl: N/A | |
1169 | :master payload: N/A | |
1170 | ||
1171 | Sets the GPU protocol socket file descriptor, which is passed as | |
1172 | ancillary data. The GPU protocol is used to inform the master of | |
1173 | rendering state and updates. See vhost-user-gpu.rst for details. | |
1174 | ||
1175 | Slave message types |
1176 | ------------------- | |
1177 | ||
1178 | ``VHOST_USER_SLAVE_IOTLB_MSG`` | |
1179 | :id: 1 | |
1180 | :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type) | |
1181 | :slave payload: ``struct vhost_iotlb_msg`` | |
1182 | :master payload: N/A | |
1183 | ||
1184 | Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload. | |
1185 | Slave sends such requests to notify of an IOTLB miss, or an IOTLB | |
1186 | access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is | |
1187 | negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master | |
1188 | must respond with zero when operation is successfully completed, or | |
1189 | non-zero otherwise. This request should be sent only when | |
1190 | ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully | |
1191 | negotiated. | |
1192 | ||
1193 | ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG`` | |
1194 | :id: 2 | |
1195 | :equivalent ioctl: N/A | |
1196 | :slave payload: N/A | |
1197 | :master payload: N/A | |
1198 | ||
1199 | When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, vhost-user | |
1200 | slave sends such messages to notify that the virtio device's | |
1201 | configuration space has changed. For host devices that | |
1202 | support this feature, the host driver can send a ``VHOST_USER_GET_CONFIG`` | |
1203 | message to the slave to get the latest content. If | |
1204 | ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and slave set the | |
1205 | ``VHOST_USER_NEED_REPLY`` flag, master must respond with zero when | |
1206 | operation is successfully completed, or non-zero otherwise. | |
1207 | ||
1208 | ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG`` | |
1209 | :id: 3 | |
1210 | :equivalent ioctl: N/A | |
1211 | :slave payload: vring area description | |
1212 | :master payload: N/A | |
1213 | ||
1214 | Sets host notifier for a specified queue. The queue index is | |
1215 | contained in the ``u64`` field of the vring area description. The | |
1216 | host notifier is described by the file descriptor (typically it's a | |
1217 | VFIO device fd) which is passed as ancillary data and the size | |
1218 | (which is mmap size and should be the same as host page size) and | |
1219 | offset (which is mmap offset) carried in the vring area | |
1220 | description. QEMU can mmap the file descriptor based on the size and | |
1221 | offset to get a memory range. Registering a host notifier means | |
1222 | mapping this memory range to the VM as the specified queue's notify | |
1223 | MMIO region. Slave sends this request to tell QEMU to de-register | |
1224 | the existing notifier if any and register the new notifier if the | |
1225 | request is sent with a file descriptor. | |
1226 | ||
1227 | This request should be sent only when | |
1228 | ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been | |
1229 | successfully negotiated. | |
1230 | ||
1231 | .. _reply_ack: | |
1232 | ||
1233 | VHOST_USER_PROTOCOL_F_REPLY_ACK | |
1234 | ------------------------------- | |
1235 | ||
1236 | The original vhost-user specification only demands replies for certain | |
1237 | commands. This differs from the vhost protocol implementation where | |
1238 | commands are sent over an ``ioctl()`` call and block until the client | |
1239 | has completed. | |
1240 | ||
1241 | With this protocol extension negotiated, the sender (QEMU) can set the | |
1242 | ``need_reply`` [Bit 3] flag to any command. This indicates that the | |
1243 | client MUST respond with a Payload ``VhostUserMsg`` indicating success | |
1244 | or failure. The payload should be set to zero on success or non-zero | |
1245 | on failure, unless the message already has an explicit reply body. | |
1246 | ||
1247 | The response payload gives QEMU a deterministic indication of the result | |
1248 | of the command. Today, QEMU is expected to terminate the main vhost-user | |
1249 | loop upon receiving such errors. In the future, QEMU could be taught to be more | |
1250 | resilient for selective requests. | |
1251 | ||
1252 | For the message types that already solicit a reply from the client, | |
1253 | the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit | |
1254 | being set brings no behavioural change. (See the Communication_ | |
1255 | section for details.) | |
1256 | ||
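On the receiving side, the decision to send an acknowledgement can be sketched
as below. This assumes the message flags are a 32-bit field and uses
hypothetical names; only the position of ``need_reply`` (bit 3) and the
zero/non-zero payload convention come from the text above.

.. code:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define NEED_REPLY_FLAG (1u << 3)  /* need_reply is bit 3 of the flags */

   /* True if the receiver must answer this message with an ack payload. */
   static bool msg_needs_reply(uint32_t flags, bool reply_ack_negotiated)
   {
       return reply_ack_negotiated && (flags & NEED_REPLY_FLAG) != 0;
   }

   /* Map a handler result (0 = success) to the u64 ack payload. */
   static uint64_t reply_ack_payload(int err)
   {
       return err == 0 ? 0 : 1;
   }

Messages that already define their own reply body are answered with that
body, so a receiver would apply this check only to the remaining message
types.
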
1257 | .. _backend_conventions: | |
1258 | ||
1259 | Backend program conventions | |
1260 | =========================== | |
1261 | ||
1262 | vhost-user backends can provide various devices & services and may | |
1263 | need to be configured manually depending on the use case. However, it | |
1264 | is a good idea to follow the conventions listed here when | |
1265 | possible. Users, QEMU or libvirt, can then rely on some common | |
1266 | behaviour to avoid heterogeneous configuration and management of the | |
1267 | backend programs and facilitate interoperability. | |
1268 | ||
1269 | Each backend installed on a host system should come with at least one | |
1270 | JSON file that conforms to the vhost-user.json schema. Each file | |
1271 | informs the management applications about the backend type, and binary | |
1272 | location. In addition, it defines rules for management apps for | |
1273 | picking the highest priority backend when multiple match the search | |
1274 | criteria (see ``@VhostUserBackend`` documentation in the schema file). | |
1275 | ||
1276 | If the backend is not capable of enabling a requested feature on the | |
1277 | host (such as 3D acceleration with virgl), or the initialization | |
1278 | failed, the backend should fail to start early and exit with a status | |
1279 | != 0. It may also print a message to stderr for further details. | |
1280 | ||
1281 | The backend program must not daemonize itself, but it may be | |
1282 | daemonized by the management layer. It may also have a restricted | |
1283 | access to the system. | |
1284 | ||
1285 | File descriptors 0, 1 and 2 will exist, and have regular | |
1286 | stdin/stdout/stderr usage (they may have been redirected to /dev/null | |
1287 | by the management layer, or to a log handler). | |
1288 | ||
1289 | The backend program must end (as quickly and cleanly as possible) when | |
1290 | the SIGTERM signal is received. Eventually, it may receive SIGKILL from | |
1291 | the management layer after a few seconds. | |
1292 | ||
1293 | The following command line options have an expected behaviour. They | |
1294 | are mandatory, unless explicitly said differently: | |
1295 | ||
1296 | --socket-path=PATH | |
1297 | ||
1298 | This option specifies the location of the vhost-user Unix domain socket. | |
1299 | It is incompatible with --fd. | |
1300 | ||
1301 | --fd=FDNUM | |
1302 | ||
1303 | When this argument is given, the backend program is started with the | |
1304 | vhost-user socket as file descriptor FDNUM. It is incompatible with | |
1305 | --socket-path. | |
1306 | ||
1307 | --print-capabilities | |
1308 | ||
1309 | Output to stdout the backend capabilities in JSON format, and then | |
1310 | exit successfully. Other options and arguments should be ignored, and | |
1311 | the backend program should not perform its normal function. The | |
1312 | capabilities can be reported dynamically depending on the host | |
1313 | capabilities. | |
1314 | ||
1315 | The JSON output is described in the ``vhost-user.json`` schema, by | |
1316 | ``@VHostUserBackendCapabilities``. Example: | |
1317 | ||
1318 | .. code:: json | |
1319 | ||
1320 | { | |
1321 | "type": "foo", | |
1322 | "features": [ | |
1323 | "feature-a", | |
1324 | "feature-b" | |
1325 | ] | |
1326 | } | |
1327 | ||
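The rules for the common options can be sketched as a small parser. This is
only an illustration: ``struct backend_opts`` and ``parse_backend_opts`` are
hypothetical names, unknown options are skipped on the assumption that the
backend handles its own options elsewhere, and treating a missing socket as an
error is a choice made for this sketch rather than something the conventions
state.

.. code:: c

   #include <limits.h>
   #include <stdbool.h>
   #include <stdlib.h>
   #include <string.h>

   struct backend_opts {
       const char *socket_path;  /* from --socket-path=PATH, or NULL */
       int fd;                   /* from --fd=FDNUM, or -1 */
       bool print_capabilities;  /* --print-capabilities was given */
   };

   /* Returns 0 on success, -1 on a malformed or conflicting command line. */
   static int parse_backend_opts(int argc, char **argv, struct backend_opts *o)
   {
       static const char sock_opt[] = "--socket-path=";
       static const char fd_opt[] = "--fd=";
       int i;

       o->socket_path = NULL;
       o->fd = -1;
       o->print_capabilities = false;

       /* --print-capabilities wins: all other options are ignored. */
       for (i = 1; i < argc; i++) {
           if (strcmp(argv[i], "--print-capabilities") == 0) {
               o->print_capabilities = true;
               return 0;
           }
       }

       for (i = 1; i < argc; i++) {
           if (strncmp(argv[i], sock_opt, sizeof(sock_opt) - 1) == 0) {
               o->socket_path = argv[i] + sizeof(sock_opt) - 1;
           } else if (strncmp(argv[i], fd_opt, sizeof(fd_opt) - 1) == 0) {
               const char *num = argv[i] + sizeof(fd_opt) - 1;
               char *end;
               long v = strtol(num, &end, 10);

               if (end == num || *end != '\0' || v < 0 || v > INT_MAX) {
                   return -1;
               }
               o->fd = (int)v;
           }
           /* Anything else is assumed to be a backend-specific option. */
       }

       /* --socket-path and --fd are mutually exclusive. */
       if (o->socket_path != NULL && o->fd >= 0) {
           return -1;
       }
       /* Sketch assumption: one of the two is needed to run normally. */
       if (o->socket_path == NULL && o->fd < 0) {
           return -1;
       }
       return 0;
   }

When ``print_capabilities`` is set, the caller would print the JSON document
to stdout and exit successfully without opening any socket.
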
1328 | vhost-user-input | |
1329 | ---------------- | |
1330 | ||
1331 | Command line options: | |
1332 | ||
1333 | --evdev-path=PATH | |
1334 | ||
1335 | Specify the Linux input device. | |
1336 | ||
1337 | (optional) | |
1338 | ||
1339 | --no-grab | |
1340 | ||
1341 | Do not request exclusive access to the input device. | |
1342 | ||
1343 | (optional) | |
1344 | ||
1345 | vhost-user-gpu | |
1346 | -------------- | |
1347 | ||
1348 | Command line options: | |
1349 | ||
1350 | --render-node=PATH | |
1351 | ||
1352 | Specify the GPU DRM render node. | |
1353 | ||
1354 | (optional) | |
1355 | ||
1356 | --virgl | |
1357 | ||
1358 | Enable virgl rendering support. | |
1359 | ||
1360 | (optional) |