Commit | Line | Data |
---|---|---|
5fc0e002 NN |
1 | Vhost-user Protocol |
2 | =================== | |
3 | ||
4 | Copyright (c) 2014 Virtual Open Systems Sarl. | |
5 | ||
6 | This work is licensed under the terms of the GNU GPL, version 2 or later. | |
7 | See the COPYING file in the top-level directory. | |
8 | =================== | |
9 | ||
10 | This protocol aims to complement the ioctl interface used to control the | |
11 | vhost implementation in the Linux kernel. It implements the control plane needed | |
12 | to establish virtqueue sharing with a user space process on the same host. It | |
13 | uses communication over a Unix domain socket to share file descriptors in the | |
14 | ancillary data of the message. | |
15 | ||
16 | The protocol defines 2 sides of the communication, master and slave. Master is | |
17 | the application that shares its virtqueues, in our case QEMU. Slave is the | |
18 | consumer of the virtqueues. | |
19 | ||
20 | In the current implementation QEMU is the Master, and the Slave is intended to | |
21 | be a software Ethernet switch running in user space, such as Snabbswitch. | |
22 | ||
23 | Master and slave can be either a client (i.e. connecting) or server (listening) | |
24 | in the socket communication. | |
25 | ||
26 | Message Specification | |
27 | --------------------- | |
28 | ||
29 | Note that all numbers are in the machine native byte order. A vhost-user message | |
30 | consists of 3 header fields and a payload: | |
31 | ||
32 | ------------------------------------ | |
33 | | request | flags | size | payload | | |
34 | ------------------------------------ | |
35 | ||
36 | * Request: 32-bit type of the request | |
37 | * Flags: 32-bit bit field: | |
38 | - Lower 2 bits are the version (currently 0x01) | |
39 | - Bit 2 is the reply flag; it must be set on each reply from the slave | |
40 | * Size: 32-bit size of the payload | |
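The header layout above can be sketched in C as follows. This is a minimal illustration, not QEMU's definition: the struct name, the macro names and the helper functions are assumptions made for this example, and only the field widths and bit positions are taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; only the bit positions come from the text above. */
#define HDR_VERSION_MASK 0x3u        /* lower 2 bits: protocol version */
#define HDR_REPLY_FLAG   (1u << 2)   /* bit 2: set on every slave reply */
#define HDR_VERSION      0x1u        /* current version */

/* The three fixed header fields, all in machine native byte order. */
struct vu_header {
    uint32_t request;
    uint32_t flags;
    uint32_t size;   /* number of payload bytes following the header */
};

static_assert(sizeof(struct vu_header) == 12, "header is three 32-bit fields");

static uint32_t vu_header_version(const struct vu_header *h)
{
    return h->flags & HDR_VERSION_MASK;
}

static bool vu_header_is_reply(const struct vu_header *h)
{
    return (h->flags & HDR_REPLY_FLAG) != 0;
}

/* Build the header of a reply to `req` that carries `payload_size` bytes. */
static struct vu_header vu_make_reply(const struct vu_header *req,
                                      uint32_t payload_size)
{
    struct vu_header r = { req->request, HDR_VERSION | HDR_REPLY_FLAG,
                           payload_size };
    return r;
}
```

A receiver would typically read these 12 bytes first, check the version, and only then read `size` bytes of payload.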
41 | ||
42 | ||
43 | Depending on the request type, payload can be: | |
44 | ||
45 | * A single 64-bit integer | |
46 | ------- | |
47 | | u64 | | |
48 | ------- | |
49 | ||
50 | u64: a 64-bit unsigned integer | |
51 | ||
52 | * A vring state description | |
53 | --------------- | |
54 | | index | num | | |
55 | --------------- | |
56 | ||
57 | Index: a 32-bit index | |
58 | Num: a 32-bit number | |
59 | ||
60 | * A vring address description | |
61 | -------------------------------------------------------------- | |
62 | | index | flags | size | descriptor | used | available | log | | |
63 | -------------------------------------------------------------- | |
64 | ||
65 | Index: a 32-bit vring index | |
66 | Flags: 32-bit vring flags | |
67 | Descriptor: a 64-bit user address of the vring descriptor table | |
68 | Used: a 64-bit user address of the vring used ring | |
69 | Available: a 64-bit user address of the vring available ring | |
70 | Log: a 64-bit guest address for logging | |
71 | ||
72 | * Memory regions description | |
73 | --------------------------------------------------- | |
74 | | num regions | padding | region0 | ... | region7 | | |
75 | --------------------------------------------------- | |
76 | ||
77 | Num regions: a 32-bit number of regions | |
78 | Padding: 32-bit | |
79 | ||
80 | A region is: | |
3fd74b84 DM |
81 | ----------------------------------------------------- |
82 | | guest address | size | user address | mmap offset | | |
83 | ----------------------------------------------------- | |
5fc0e002 NN |
84 | |
85 | Guest address: a 64-bit guest address of the region | |
86 | Size: a 64-bit size | |
87 | User address: a 64-bit user address | |
a628fc8d | 88 | mmap offset: 64-bit offset where region starts in the mapped memory |
5fc0e002 | 89 | |
a586e65b MT |
90 | * Log description |
91 | --------------------------- | |
92 | | log size | log offset | | |
93 | --------------------------- | |
94 | log size: size of area used for logging | |
95 | log offset: offset from start of supplied file descriptor | |
96 | where logging starts (i.e. where guest address 0 would be logged) | |
97 | ||
5fc0e002 NN |
98 | In QEMU the vhost-user message is implemented with the following struct: |
99 | ||
100 | typedef struct VhostUserMsg { | |
101 | VhostUserRequest request; | |
102 | uint32_t flags; | |
103 | uint32_t size; | |
104 | union { | |
105 | uint64_t u64; | |
106 | struct vhost_vring_state state; | |
107 | struct vhost_vring_addr addr; | |
108 | VhostUserMemory memory; | |
2b8819c6 | 109 | VhostUserLog log; |
5fc0e002 NN |
110 | }; |
111 | } QEMU_PACKED VhostUserMsg; | |
112 | ||
113 | Communication | |
114 | ------------- | |
115 | ||
116 | The protocol for vhost-user is based on the existing implementation of vhost | |
117 | for the Linux Kernel. Most messages that can be sent via the Unix domain socket | |
118 | implementing vhost-user have an equivalent ioctl to the kernel implementation. | |
119 | ||
120 | The communication consists of master sending message requests and slave sending | |
121 | message replies. Most of the requests don't require replies. Here is a list of | |
122 | the ones that do: | |
123 | ||
124 | * VHOST_USER_GET_FEATURES | |
dcb10c00 | 125 | * VHOST_USER_GET_PROTOCOL_FEATURES |
5fc0e002 | 126 | * VHOST_USER_GET_VRING_BASE |
c62b91e5 | 127 | * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) |
5fc0e002 NN |
128 | |
129 | There are several messages that the master sends with file descriptors passed | |
130 | in the ancillary data: | |
131 | ||
132 | * VHOST_USER_SET_MEM_TABLE | |
c62b91e5 | 133 | * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) |
5fc0e002 NN |
134 | * VHOST_USER_SET_LOG_FD |
135 | * VHOST_USER_SET_VRING_KICK | |
136 | * VHOST_USER_SET_VRING_CALL | |
137 | * VHOST_USER_SET_VRING_ERR | |
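Passing file descriptors in ancillary data, as the messages listed above do, relies on SCM_RIGHTS control messages over the Unix domain socket. The sketch below shows one possible pair of helpers built on sendmsg() and recvmsg(). The function names and the MAX_FDS limit are assumptions for illustration and do not come from QEMU.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 8   /* assumption: one fd per region in the 8-region layout */

/* Control buffer sized for MAX_FDS descriptors, aligned for struct cmsghdr. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
    struct cmsghdr align;
};

/* Send `len` bytes plus `nfds` descriptors as SCM_RIGHTS ancillary data. */
static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                             const int *fds, size_t nfds)
{
    union fd_ctrl ctrl;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    assert(nfds <= MAX_FDS);
    if (nfds > 0) {
        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
        memcpy(CMSG_DATA(c), fds, sizeof(int) * nfds);
    }
    return sendmsg(sock, &msg, 0);
}

/* Receive up to `len` bytes; any passed descriptors are stored in `fds`. */
static ssize_t recv_with_fds(int sock, void *buf, size_t len,
                             int *fds, size_t *nfds)
{
    union fd_ctrl ctrl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof(ctrl.buf) };
    ssize_t n = recvmsg(sock, &msg, 0);

    *nfds = 0;
    if (n < 0)
        return n;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            size_t count = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
            if (count > MAX_FDS)
                count = MAX_FDS;
            memcpy(fds, CMSG_DATA(c), count * sizeof(int));
            *nfds = count;
        }
    }
    return n;
}
```

The received descriptors are new descriptors in the receiving process that refer to the same open files, so the receiver is responsible for closing them.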
138 | ||
139 | If the master is unable to send the full message or receives a wrong reply, it will | |
140 | close the connection. An optional reconnection mechanism can be implemented. | |
141 | ||
dcb10c00 MT |
142 | Any protocol extensions are gated by protocol feature bits, |
143 | which allows full backwards compatibility on both master | |
144 | and slave. | |
145 | As older slaves don't support negotiating protocol features, | |
146 | a feature bit was dedicated for this purpose: | |
147 | #define VHOST_USER_F_PROTOCOL_FEATURES 30 | |
148 | ||
a586e65b MT |
149 | Starting and stopping rings |
150 | --------------------------- | |
c61f09ed MT |
151 | Client must only process each ring when it is started. |
152 | ||
153 | Client must only pass data between the ring and the | |
154 | backend, when the ring is enabled. | |
155 | ||
156 | If ring is started but disabled, client must process the | |
157 | ring without talking to the backend. | |
158 | ||
159 | For example, for a networking device, in the disabled state | |
160 | client must not supply any new RX packets, but must process | |
161 | and discard any TX packets. | |
7ebcfe56 MT |
162 | |
163 | If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized | |
164 | in an enabled state. | |
a586e65b | 165 | |
7ebcfe56 | 166 | If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized |
c61f09ed | 167 | in a disabled state. Client must not pass data to/from the backend until ring is enabled by |
7ebcfe56 MT |
168 | VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by |
169 | VHOST_USER_SET_VRING_ENABLE with parameter 0. | |
170 | ||
171 | Each ring is initialized in a stopped state; the client must not process it | |
172 | until the ring is started, or after it has been stopped. | |
a586e65b | 173 | |
7ebcfe56 MT |
174 | Client must start ring upon receiving a kick (that is, detecting that file |
175 | descriptor is readable) on the descriptor specified by | |
176 | VHOST_USER_SET_VRING_KICK, and stop ring upon receiving | |
177 | VHOST_USER_GET_VRING_BASE. | |
a586e65b | 178 | |
c61f09ed | 179 | While processing the rings (whether they are enabled or not), client must |
7ebcfe56 | 180 | support changing some configuration aspects on the fly. |
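The started/stopped and enabled/disabled rules in this section can be summarized as a small state machine. The sketch below is one possible way for a client to track them. The struct and function names are hypothetical, and treating any parameter other than 1 as "disable" is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-ring bookkeeping on the client side. */
struct ring_state {
    bool started;   /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled;   /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

/* Rings begin stopped. They begin enabled only when
 * VHOST_USER_F_PROTOCOL_FEATURES has NOT been negotiated. */
static void ring_init(struct ring_state *r, bool protocol_features_negotiated)
{
    assert(r != NULL);
    r->started = false;
    r->enabled = !protocol_features_negotiated;
}

/* Kick fd became readable: start the ring. */
static void ring_on_kick(struct ring_state *r)
{
    r->started = true;
}

/* VHOST_USER_GET_VRING_BASE received: stop the ring. */
static void ring_on_get_vring_base(struct ring_state *r)
{
    r->started = false;
}

/* VHOST_USER_SET_VRING_ENABLE: 1 enables, anything else disables here. */
static void ring_on_set_vring_enable(struct ring_state *r, unsigned num)
{
    r->enabled = (num == 1);
}

/* The ring may be processed only while it is started. */
static bool ring_may_process(const struct ring_state *r)
{
    return r->started;
}

/* Data may move between ring and backend only if started and enabled. */
static bool ring_may_use_backend(const struct ring_state *r)
{
    return r->started && r->enabled;
}
```

For the networking example above, a ring for which `ring_may_process()` is true but `ring_may_use_backend()` is false would have its TX packets consumed and discarded, and would receive no RX packets.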
a586e65b | 181 | |
b931bfbf CO |
182 | Multiple queue support |
183 | ---------------------- | |
184 | ||
185 | Multiple queue is treated as a protocol extension, hence the slave has to | |
186 | implement protocol features first. The multiple queues feature is supported | |
c62b91e5 | 187 | only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set. |
b931bfbf CO |
188 | |
189 | The max number of queues the slave supports can be queried with message | |
190 | VHOST_USER_GET_QUEUE_NUM. Master should stop when the number of | |
191 | requested queues is bigger than that. | |
192 | ||
193 | As all queues share one connection, the master uses a unique index for each | |
7263a0ad CO |
194 | queue in the sent message to identify a specified queue. One queue pair |
195 | is enabled initially. More queues are enabled dynamically, by sending | |
196 | message VHOST_USER_SET_VRING_ENABLE. | |
b931bfbf | 197 | |
c62b91e5 MAL |
198 | Migration |
199 | --------- | |
200 | ||
201 | During live migration, the master may need to track the modifications | |
202 | the slave makes to the memory mapped regions. The client should mark | |
203 | the dirty pages in a log. Once it complies with this logging, it may | |
204 | declare the VHOST_F_LOG_ALL vhost feature. | |
205 | ||
a586e65b MT |
206 | To start/stop logging of data/used ring writes, server may send messages |
207 | VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with | |
208 | VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively. | |
209 | ||
c62b91e5 MAL |
210 | All the modifications to memory pointed to by vring "descriptor" should |
211 | be marked. Modifications to "used" vring should be marked if | |
a586e65b | 212 | VHOST_VRING_F_LOG is part of ring's flags. |
c62b91e5 MAL |
213 | |
214 | Dirty pages are of size: | |
215 | #define VHOST_LOG_PAGE 0x1000 | |
216 | ||
217 | The log memory fd is provided in the ancillary data of | |
218 | VHOST_USER_SET_LOG_BASE message when the slave has | |
219 | VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature. | |
220 | ||
a586e65b MT |
221 | The size of the log is supplied as part of VhostUserMsg |
222 | which should be large enough to cover all known guest | |
223 | addresses. Log starts at the supplied offset in the | |
224 | supplied file descriptor. | |
225 | The log covers from address 0 to the maximum of guest | |
c62b91e5 MAL |
226 | regions. In pseudo-code, to mark page at "addr" as dirty: |
227 | ||
228 | page = addr / VHOST_LOG_PAGE | |
229 | log[page / 8] |= 1 << (page % 8) | |
230 | ||
a586e65b MT |
231 | Where addr is the guest physical address. |
232 | ||
c62b91e5 MAL |
233 | Use atomic operations, as the log may be concurrently manipulated. |
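The pseudo-code above, combined with the requirement to use atomic operations, might look like this in C11. This is a sketch under assumptions: the function names are invented for the example, and the log is modelled as an array of atomic bytes rather than a mapping of the log fd.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Atomically set the dirty bit of the page containing guest physical `addr`. */
static void log_mark_dirty(atomic_uchar *dirty_log, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    atomic_fetch_or(&dirty_log[page / 8], (unsigned char)(1u << (page % 8)));
}

/* Mark every page touched by the byte range [addr, addr + len).
 * `log_size` is the size of the log in bytes. */
static void log_mark_range(atomic_uchar *dirty_log, uint64_t log_size,
                           uint64_t addr, uint64_t len)
{
    if (len == 0)
        return;

    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    assert(last / 8 < log_size);   /* the range must be covered by the log */
    for (uint64_t page = first; page <= last; page++)
        log_mark_dirty(dirty_log, page * VHOST_LOG_PAGE);
}
```

An atomic OR is used instead of a plain read-modify-write so that bits set concurrently by another writer in the same byte are not lost.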
234 | ||
a586e65b MT |
235 | Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG |
236 | is set for this ring), log_guest_addr should be used to calculate the log | |
237 | offset: the write to first byte of the used ring is logged at this offset from | |
238 | log start. Also note that this value might be outside the legal guest physical | |
239 | address range (i.e. does not have to be covered by the VhostUserMemory table), | |
240 | but the bit offset of the last byte of the ring must fall within | |
241 | the size supplied by VhostUserLog. | |
242 | ||
c62b91e5 MAL |
243 | VHOST_USER_SET_LOG_FD is an optional message with an eventfd in |
244 | ancillary data; it may be used to inform the master that the log has | |
245 | been modified. | |
246 | ||
a586e65b MT |
247 | Once the source has finished migration, rings will be stopped by |
248 | the source. No further updates may be made until the rings are | |
249 | restarted. | |
c62b91e5 MAL |
250 | |
251 | Protocol features | |
252 | ----------------- | |
253 | ||
254 | #define VHOST_USER_PROTOCOL_F_MQ 0 | |
255 | #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 | |
3e866365 | 256 | #define VHOST_USER_PROTOCOL_F_RARP 2 |
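Because protocol features are themselves gated by feature bit VHOST_USER_F_PROTOCOL_FEATURES, a check for any protocol feature involves two bitmasks. The helpers below are a minimal sketch of that two-level test. The function names are assumptions, and the bit numbers are the ones defined in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES   30
#define VHOST_USER_PROTOCOL_F_MQ         0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD  1
#define VHOST_USER_PROTOCOL_F_RARP       2

/* Test a single bit of a 64-bit feature mask. */
static bool has_bit(uint64_t mask, unsigned bit)
{
    assert(bit < 64);
    return ((mask >> bit) & 1u) != 0;
}

/* True if `features` (the VHOST_USER_GET_FEATURES reply) allows the
 * GET/SET_PROTOCOL_FEATURES messages to be sent at all. */
static bool can_negotiate_protocol_features(uint64_t features)
{
    return has_bit(features, VHOST_USER_F_PROTOCOL_FEATURES);
}

/* A protocol feature is usable only if the gate bit is offered as well. */
static bool protocol_feature_usable(uint64_t features,
                                    uint64_t protocol_features, unsigned bit)
{
    return can_negotiate_protocol_features(features) &&
           has_bit(protocol_features, bit);
}
```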
c62b91e5 | 257 | |
5fc0e002 NN |
258 | Message types |
259 | ------------- | |
260 | ||
261 | * VHOST_USER_GET_FEATURES | |
262 | ||
46e797c4 | 263 | Id: 1 |
5fc0e002 NN |
264 | Equivalent ioctl: VHOST_GET_FEATURES |
265 | Master payload: N/A | |
266 | Slave payload: u64 | |
267 | ||
268 | Get from the underlying vhost implementation the features bitmask. | |
dcb10c00 MT |
269 | Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for |
270 | VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. | |
5fc0e002 NN |
271 | |
272 | * VHOST_USER_SET_FEATURES | |
273 | ||
46e797c4 | 274 | Id: 2 |
5fc0e002 NN |
275 | Equivalent ioctl: VHOST_SET_FEATURES |
276 | Master payload: u64 | |
277 | ||
278 | Enable features in the underlying vhost implementation using a bitmask. | |
dcb10c00 MT |
279 | Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for |
280 | VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. | |
281 | ||
282 | * VHOST_USER_GET_PROTOCOL_FEATURES | |
283 | ||
284 | Id: 15 | |
285 | Equivalent ioctl: VHOST_GET_FEATURES | |
286 | Master payload: N/A | |
287 | Slave payload: u64 | |
288 | ||
289 | Get the protocol feature bitmask from the underlying vhost implementation. | |
290 | Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in | |
291 | VHOST_USER_GET_FEATURES. | |
292 | Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support | |
293 | this message even before VHOST_USER_SET_FEATURES was called. | |
294 | ||
295 | * VHOST_USER_SET_PROTOCOL_FEATURES | |
296 | ||
297 | Id: 16 | |
298 | Equivalent ioctl: VHOST_SET_FEATURES | |
299 | Master payload: u64 | |
300 | ||
301 | Enable protocol features in the underlying vhost implementation. | |
302 | Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in | |
303 | VHOST_USER_GET_FEATURES. | |
304 | Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support | |
305 | this message even before VHOST_USER_SET_FEATURES was called. | |
5fc0e002 NN |
306 | |
307 | * VHOST_USER_SET_OWNER | |
308 | ||
46e797c4 | 309 | Id: 3 |
5fc0e002 NN |
310 | Equivalent ioctl: VHOST_SET_OWNER |
311 | Master payload: N/A | |
312 | ||
313 | Issued when a new connection is established. It sets the current Master | |
314 | as an owner of the session. This can be used on the Slave as a | |
315 | "session start" flag. | |
316 | ||
60915dc4 | 317 | * VHOST_USER_RESET_OWNER |
5fc0e002 | 318 | |
46e797c4 | 319 | Id: 4 |
5fc0e002 NN |
320 | Master payload: N/A |
321 | ||
c61f09ed | 322 | This is no longer used. Used to be sent to request disabling |
a586e65b MT |
323 | all rings, but some clients interpreted it to also discard |
324 | connection state (this interpretation would lead to bugs). | |
325 | It is recommended that clients either ignore this message, | |
c61f09ed | 326 | or use it to disable all rings. |
5fc0e002 NN |
327 | |
328 | * VHOST_USER_SET_MEM_TABLE | |
329 | ||
46e797c4 | 330 | Id: 5 |
5fc0e002 NN |
331 | Equivalent ioctl: VHOST_SET_MEM_TABLE |
332 | Master payload: memory regions description | |
333 | ||
334 | Sets the memory map regions on the slave so it can translate the vring | |
335 | addresses. In the ancillary data there is an array of file descriptors | |
336 | for each memory mapped region. The size and ordering of the fds matches | |
337 | the number and ordering of memory regions. | |
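To illustrate how a slave might use the memory table to translate vring addresses, the sketch below maps a master user address to a pointer inside the slave's own mapping. It rests on assumptions that are not stated in this document: that the slave has mapped each region's fd from file offset 0, so the region's data begins `mmap_offset` bytes into the mapping, and that the result is recorded in a hypothetical `mmap_base` field. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the memory regions description, plus a slave-local field. */
struct mem_region {
    uint64_t guest_addr;    /* guest address of the region */
    uint64_t size;          /* region size in bytes */
    uint64_t user_addr;     /* address in the master's address space */
    uint64_t mmap_offset;   /* where the region starts inside the mapping */
    uint8_t *mmap_base;     /* hypothetical: result of the slave's own mmap() */
};

/* Translate a master user address (as used in a vring address description)
 * to a local pointer. Returns NULL if no region covers the address. */
static void *user_to_local(const struct mem_region *regions, size_t n,
                           uint64_t uaddr)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];

        if (uaddr >= r->user_addr && uaddr - r->user_addr < r->size)
            return r->mmap_base + r->mmap_offset + (uaddr - r->user_addr);
    }
    return NULL;
}
```

The range check subtracts before comparing against `size` so that it cannot overflow for regions near the top of the address space.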
338 | ||
339 | * VHOST_USER_SET_LOG_BASE | |
340 | ||
46e797c4 | 341 | Id: 6 |
5fc0e002 NN |
342 | Equivalent ioctl: VHOST_SET_LOG_BASE |
343 | Master payload: u64 | |
c62b91e5 | 344 | Slave payload: N/A |
5fc0e002 | 345 | |
2b8819c6 VK |
346 | Sets the logging shared memory space. |
347 | When the slave has the VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol | |
348 | feature, the log memory fd is provided in the ancillary data of the | |
349 | VHOST_USER_SET_LOG_BASE message, and the size and offset of the shared | |
350 | memory area are provided in the message. | |
351 | ||
5fc0e002 NN |
352 | |
353 | * VHOST_USER_SET_LOG_FD | |
354 | ||
46e797c4 | 355 | Id: 7 |
5fc0e002 NN |
356 | Equivalent ioctl: VHOST_SET_LOG_FD |
357 | Master payload: N/A | |
358 | ||
359 | Sets the logging file descriptor, which is passed as ancillary data. | |
360 | ||
361 | * VHOST_USER_SET_VRING_NUM | |
362 | ||
46e797c4 | 363 | Id: 8 |
5fc0e002 NN |
364 | Equivalent ioctl: VHOST_SET_VRING_NUM |
365 | Master payload: vring state description | |
366 | ||
09230cb8 | 367 | Set the size of the queue. |
5fc0e002 NN |
368 | |
369 | * VHOST_USER_SET_VRING_ADDR | |
370 | ||
46e797c4 | 371 | Id: 9 |
5fc0e002 NN |
372 | Equivalent ioctl: VHOST_SET_VRING_ADDR |
373 | Master payload: vring address description | |
374 | Slave payload: N/A | |
375 | ||
376 | Sets the addresses of the different aspects of the vring. | |
377 | ||
378 | * VHOST_USER_SET_VRING_BASE | |
379 | ||
46e797c4 | 380 | Id: 10 |
5fc0e002 NN |
381 | Equivalent ioctl: VHOST_SET_VRING_BASE |
382 | Master payload: vring state description | |
383 | ||
384 | Sets the base offset in the available vring. | |
385 | ||
386 | * VHOST_USER_GET_VRING_BASE | |
387 | ||
46e797c4 | 388 | Id: 11 |
5fc0e002 NN |
389 | Equivalent ioctl: VHOST_GET_VRING_BASE |
390 | Master payload: vring state description | |
391 | Slave payload: vring state description | |
392 | ||
393 | Get the available vring base offset. | |
394 | ||
395 | * VHOST_USER_SET_VRING_KICK | |
396 | ||
46e797c4 | 397 | Id: 12 |
5fc0e002 NN |
398 | Equivalent ioctl: VHOST_SET_VRING_KICK |
399 | Master payload: u64 | |
400 | ||
401 | Set the event file descriptor for adding buffers to the vring. It | |
402 | is passed in the ancillary data. | |
403 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
404 | invalid FD flag. This flag is set when there is no file descriptor | |
405 | in the ancillary data. This signals that polling should be used | |
406 | instead of waiting for a kick. | |
407 | ||
408 | * VHOST_USER_SET_VRING_CALL | |
409 | ||
46e797c4 | 410 | Id: 13 |
5fc0e002 NN |
411 | Equivalent ioctl: VHOST_SET_VRING_CALL |
412 | Master payload: u64 | |
413 | ||
414 | Set the event file descriptor to signal when buffers are used. It | |
415 | is passed in the ancillary data. | |
416 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
417 | invalid FD flag. This flag is set when there is no file descriptor | |
418 | in the ancillary data. This signals that polling will be used | |
419 | instead of waiting for the call. | |
420 | ||
421 | * VHOST_USER_SET_VRING_ERR | |
422 | ||
46e797c4 | 423 | Id: 14 |
5fc0e002 NN |
424 | Equivalent ioctl: VHOST_SET_VRING_ERR |
425 | Master payload: u64 | |
426 | ||
427 | Set the event file descriptor to signal when error occurs. It | |
428 | is passed in the ancillary data. | |
429 | Bits (0-7) of the payload contain the vring index. Bit 8 is the | |
430 | invalid FD flag. This flag is set when there is no file descriptor | |
431 | in the ancillary data. | |
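The u64 payload shared by VHOST_USER_SET_VRING_KICK, VHOST_USER_SET_VRING_CALL and VHOST_USER_SET_VRING_ERR packs a vring index and an "invalid FD" flag. A minimal sketch of encoding and decoding it follows. The macro and function names are illustrative assumptions, and only the bit layout comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_IDX_MASK   0xffu       /* bits 0-7: vring index */
#define VRING_NOFD_FLAG  (1u << 8)   /* bit 8: no fd in the ancillary data */

/* Extract the vring index from the payload. */
static unsigned vring_fd_index(uint64_t payload)
{
    return (unsigned)(payload & VRING_IDX_MASK);
}

/* True when the sender attached no file descriptor. */
static bool vring_fd_missing(uint64_t payload)
{
    return (payload & VRING_NOFD_FLAG) != 0;
}

/* Build the payload for ring `index`; set the flag when no fd is sent. */
static uint64_t vring_fd_payload(unsigned index, bool have_fd)
{
    assert(index <= VRING_IDX_MASK);
    return (uint64_t)index | (have_fd ? 0u : VRING_NOFD_FLAG);
}
```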
e2051e9e YL |
432 | |
433 | * VHOST_USER_GET_QUEUE_NUM | |
434 | ||
435 | Id: 17 | |
436 | Equivalent ioctl: N/A | |
437 | Master payload: N/A | |
438 | Slave payload: u64 | |
439 | ||
440 | Query how many queues the backend supports. This request should be | |
c954f09e | 441 | sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol |
e2051e9e | 442 | features by VHOST_USER_GET_PROTOCOL_FEATURES. |
7263a0ad CO |
443 | |
444 | * VHOST_USER_SET_VRING_ENABLE | |
445 | ||
446 | Id: 18 | |
447 | Equivalent ioctl: N/A | |
448 | Master payload: vring state description | |
449 | ||
450 | Signal slave to enable or disable corresponding vring. | |
a586e65b MT |
451 | This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES |
452 | has been negotiated. | |
3e866365 TC |
453 | |
454 | * VHOST_USER_SEND_RARP | |
455 | ||
456 | Id: 19 | |
457 | Equivalent ioctl: N/A | |
458 | Master payload: u64 | |
459 | ||
460 | Ask the vhost-user backend to broadcast a fake RARP to notify that the | |
461 | migration has terminated, for guests that do not support GUEST_ANNOUNCE. | |
462 | Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in | |
463 | VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP | |
464 | is present in VHOST_USER_GET_PROTOCOL_FEATURES. | |
465 | The first 6 bytes of the payload contain the MAC address of the guest to | |
466 | allow the vhost user backend to construct and broadcast the fake RARP. |
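Since the MAC address occupies the first 6 bytes of the u64 payload in memory order, a byte-wise copy avoids any dependence on host endianness. The helpers below are a sketch with invented names, not code from QEMU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Copy the first 6 bytes of the payload, in memory order, into `mac`. */
static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[MAC_LEN])
{
    assert(mac != NULL);
    memcpy(mac, &payload, MAC_LEN);
}

/* Build a payload whose first 6 bytes are `mac`; the last 2 bytes are 0. */
static uint64_t rarp_mac_to_payload(const uint8_t mac[MAC_LEN])
{
    uint64_t payload = 0;

    assert(mac != NULL);
    memcpy(&payload, mac, MAC_LEN);
    return payload;
}
```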