Commit | Line | Data |
---|---|---|
fdee2025 MA |
1 | = Device Specification for Inter-VM shared memory device = |
2 | ||
3 | The Inter-VM shared memory device (ivshmem) is designed to share a | |
4 | memory region between multiple QEMU processes running different guests | |
5 | and the host. In order for all guests to be able to pick up the | |
6 | shared memory area, it is modeled by QEMU as a PCI device exposing | |
7 | said memory to the guest as a PCI BAR. | |
8 | ||
9 | The device can use a shared memory object on the host directly, or it | |
10 | can obtain one from an ivshmem server. | |
11 | ||
12 | In the latter case, the device can additionally interrupt its peers, and | |
13 | get interrupted by its peers. | |
14 | ||
15 | ||
16 | == Configuring the ivshmem PCI device == | |
17 | ||
18 | There are two basic configurations: | |
19 | ||
5400c02b | 20 | - Just shared memory: -device ivshmem-plain,memdev=HMB,... |
fdee2025 | 21 | |
5400c02b MA |
22 | This uses host memory backend HMB. It should have option "share" |
23 | set. | |
fdee2025 MA |
24 | |
25 | - Shared memory plus interrupts: -device ivshmem-doorbell,chardev=CHR,vectors=N,... | |
26 | ||
27 | An ivshmem server must already be running on the host. The device | |
28 | connects to the server's UNIX domain socket via character device | |
29 | CHR. | |
30 | ||
31 | Each peer gets assigned a unique ID by the server. IDs must be | |
32 | between 0 and 65535. | |
33 | ||
5400c02b MA |
34 | Interrupts are message-signaled (MSI-X). vectors=N configures the |
35 | number of vectors to use. | |
fdee2025 MA |
36 | |
37 | For more details on ivshmem device properties, see The QEMU Emulator | |
38 | User Documentation (qemu-doc.*). | |
39 | ||
40 | ||
41 | == The ivshmem PCI device's guest interface == | |
42 | ||
5400c02b MA |
43 | The device has vendor ID 1af4, device ID 1110, revision 1. Before |
44 | QEMU 2.6.0, it had revision 0. | |
fdee2025 MA |
45 | |
46 | === PCI BARs === | |
47 | ||
48 | The ivshmem PCI device has two or three BARs: | |
49 | ||
50 | - BAR0 holds device registers (256 Byte MMIO) | |
5400c02b | 51 | - BAR1 holds MSI-X table and PBA (only ivshmem-doorbell) |
fdee2025 MA |
52 | - BAR2 maps the shared memory object |
53 | ||
54 | There are two ways to use this device: | |
55 | ||
56 | - If you only need the shared memory part, BAR2 suffices. This way, | |
57 | you have access to the shared memory in the guest and can use it as | |
58 | you see fit. Memnic, for example, uses ivshmem this way from guest | |
59 | user space (see http://dpdk.org/browse/memnic). | |
60 | ||
61 | - If you additionally need the capability for peers to interrupt each | |
5400c02b MA |
62 | other, you need BAR0 and BAR1. You will most likely want to write a |
63 | kernel driver to handle interrupts. This requires the device to be | |
64 | configured for interrupts. | |
fdee2025 | 65 | |
1309cf44 MA |
66 | Before QEMU 2.6.0, BAR2 can initially be invalid if the device is |
67 | configured for interrupts. It becomes safely accessible only after | |
5400c02b MA |
68 | the ivshmem server has provided the shared memory. These devices have PCI |
69 | revision 0 rather than 1. Guest software should wait for the | |
70 | IVPosition register (described below) to become non-negative before | |
71 | accessing BAR2. | |
fdee2025 | 72 | |
5400c02b MA |
73 | Revision 0 of the device cannot tell guest software whether |
74 | it is configured for interrupts. | |
fdee2025 MA |
75 | |
76 | === PCI device registers === | |
77 | ||
78 | BAR 0 contains the following registers: | |
79 | ||
80 | Offset Size Access On reset Function | |
81 | 0 4 read/write 0 Interrupt Mask | |
5400c02b MA |
82 | bit 0: peer interrupt (rev 0) |
83 | reserved (rev 1) | |
fdee2025 MA |
84 | bit 1..31: reserved |
85 | 4 4 read/write 0 Interrupt Status | |
5400c02b MA |
86 | bit 0: peer interrupt (rev 0) |
87 | reserved (rev 1) | |
fdee2025 | 88 | bit 1..31: reserved |
1309cf44 | 89 | 8 4 read-only 0 or ID IVPosition |
fdee2025 MA |
90 | 12 4 write-only N/A Doorbell |
91 | bit 0..15: vector | |
92 | bit 16..31: peer ID | |
93 | 16 240 none N/A reserved | |
94 | ||
95 | Software should only access the registers as specified in column | |
96 | "Access". Reserved bits should be ignored on read, and preserved on | |
97 | write. | |
98 | ||
5400c02b MA |
99 | In revision 0 of the device, the Interrupt Status and Interrupt Mask |
100 | registers together control the legacy INTx interrupt when the device | |
101 | has no MSI-X capability: INTx is asserted when the bit-wise AND of | |
102 | Status and Mask is non-zero. Interrupt Status register bit 0 becomes | |
103 | 1 when an interrupt request from a peer is received. Reading the | |
104 | register clears it. | |
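The revision 0 INTx behaviour described above can be modelled in a few lines. This is only a toy model for illustration, not QEMU's implementation; the class and method names are hypothetical.

```python
class Rev0InterruptRegs:
    """Toy model of the rev 0 Interrupt Mask / Interrupt Status registers."""

    def __init__(self, has_msix=False):
        self.mask = 0        # Interrupt Mask register (offset 0), 0 on reset
        self.status = 0      # Interrupt Status register (offset 4), 0 on reset
        self.has_msix = has_msix

    def peer_interrupt(self):
        # A peer's interrupt request sets Interrupt Status bit 0.
        self.status |= 1

    def read_status(self):
        # Reading the Interrupt Status register clears it.
        value = self.status
        self.status = 0
        return value

    def intx_asserted(self):
        # INTx is only used without MSI-X, and only while
        # the bit-wise AND of Status and Mask is non-zero.
        return (not self.has_msix) and (self.status & self.mask) != 0
```

With the mask left at its reset value of 0, a peer interrupt sets the status bit but does not assert INTx.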
fdee2025 MA |
105 | |
106 | IVPosition Register: if the device is not configured for interrupts, | |
1309cf44 MA |
107 | this is zero. Else, it is the device's ID (between 0 and 65535). |
108 | ||
109 | Before QEMU 2.6.0, the register may read -1 for a short while after | |
5400c02b | 110 | reset. These devices have PCI revision 0 rather than 1. |
fdee2025 MA |
111 | |
112 | There is no good way for software to find out whether the device is | |
113 | configured for interrupts. A positive IVPosition means interrupts, | |
1309cf44 | 114 | but zero could be either. |
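A minimal sketch of how guest software might classify a raw 32-bit IVPosition read, following the rules above. The function name and the returned labels are hypothetical; only the three cases (negative, zero, positive) come from this specification.

```python
import struct


def interpret_ivposition(raw):
    """Classify a raw 32-bit IVPosition register value."""
    # Reinterpret the unsigned register value as a signed 32-bit integer.
    (value,) = struct.unpack("<i", struct.pack("<I", raw & 0xFFFFFFFF))
    if value < 0:
        # Revision 0 only: may read -1 for a short while after reset.
        return "not-ready"
    if value > 0:
        # A positive ID implies the device is configured for interrupts.
        return "interrupts"
    # Zero is either "no interrupts" or "interrupts, with ID 0".
    return "ambiguous"
```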
fdee2025 MA |
115 | |
116 | Doorbell Register: writing this register requests to interrupt a peer. | |
117 | The written value's high 16 bits are the ID of the peer to interrupt, | |
118 | and its low 16 bits select an interrupt vector. | |
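As an illustration, the value written to the Doorbell register can be composed as follows. The helper names are hypothetical; the bit layout is the one given in the register table.

```python
def doorbell_value(peer_id, vector):
    """Build the 32-bit value to write to the Doorbell register."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    # bit 16..31: peer ID, bit 0..15: vector
    return (peer_id << 16) | vector


def split_doorbell(value):
    """Return (peer_id, vector) encoded in a Doorbell value."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

For example, interrupting peer 3 on vector 1 means writing 0x00030001.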
119 | ||
120 | If the device is not configured for interrupts, the write is ignored. | |
121 | ||
122 | If interrupt setup hasn't completed, the write is ignored. The | |
123 | device cannot tell guest software whether setup is | |
124 | complete. Interrupts can regress to this state on migration. | |
125 | ||
126 | If the peer with the requested ID isn't connected, or the requested | |
127 | vector is beyond the vectors it has connected, the write is ignored. | |
128 | The device cannot tell guest software which peers are connected, or | |
129 | how many interrupt vectors are connected. | |
130 | ||
5400c02b MA |
131 | The peer's interrupt for this vector then becomes pending. There is |
132 | no way for software to clear the pending bit, and a polling mode of | |
133 | operation is therefore impossible. | |
fdee2025 | 134 | |
5400c02b MA |
135 | If the peer is a revision 0 device without MSI-X capability, its |
136 | Interrupt Status register is set to 1. This asserts INTx unless | |
137 | masked by the Interrupt Mask register. In that case, the device cannot | |
138 | communicate the interrupt vector to guest software. | |
fdee2025 MA |
139 | |
140 | With multiple MSI-X vectors, different vectors can be used to indicate | |
141 | different events have occurred. The semantics of interrupt vectors | |
142 | are left to the application. | |
143 | ||
144 | ||
145 | == Interrupt infrastructure == | |
146 | ||
147 | When configured for interrupts, the peers share eventfd objects in | |
148 | addition to shared memory. The shared resources are managed by an | |
149 | ivshmem server. | |
150 | ||
151 | === The ivshmem server === | |
152 | ||
153 | The server listens on a UNIX domain socket. | |
154 | ||
155 | For each new client that connects to the server, the server | |
156 | - picks an ID, | |
157 | - creates eventfd file descriptors for the interrupt vectors, | |
158 | - sends the ID and the file descriptor for the shared memory to the | |
159 | new client, | |
160 | - sends connect notifications for the new client to the other clients | |
161 | (these contain file descriptors for sending interrupts), | |
162 | - sends connect notifications for the other clients to the new client, | |
163 | and | |
164 | - sends interrupt setup messages to the new client (these contain file | |
165 | descriptors for receiving interrupts). | |
166 | ||
62a830b6 MA |
167 | The first client to connect to the server receives ID zero. |
168 | ||
fdee2025 MA |
169 | When a client disconnects from the server, the server sends disconnect |
170 | notifications to the other clients. | |
171 | ||
172 | The next section describes the protocol in detail. | |
173 | ||
174 | If the server terminates without sending disconnect notifications for | |
175 | its connected clients, the clients can elect to continue. They can | |
176 | communicate with each other normally, but won't receive disconnect | |
177 | notifications when a peer disconnects, and no new clients can connect. | |
178 | There is no way for the clients to connect to a restarted server. The | |
179 | device cannot tell guest software whether the server is still up. | |
180 | ||
181 | Example server code is in contrib/ivshmem-server/. It is not meant | |
182 | for production use. It assumes all clients use the same number of | |
183 | interrupt vectors. | |
184 | ||
185 | A standalone client is in contrib/ivshmem-client/. It can be useful | |
186 | for debugging. | |
187 | ||
188 | === The ivshmem Client-Server Protocol === | |
189 | ||
190 | An ivshmem device configured for interrupts connects to an ivshmem | |
191 | server. This section details the protocol between the two. | |
192 | ||
193 | The connection is one-way: the server sends messages to the client. | |
194 | Each message consists of a single 8-byte little-endian signed number, | |
195 | and may be accompanied by a file descriptor via SCM_RIGHTS. Both | |
196 | client and server close the connection on error. | |
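A minimal sketch of receiving one such message on the client side, assuming a connected UNIX domain stream socket. The function name is hypothetical and error handling is reduced to raising an exception; this is not QEMU's code.

```python
import socket
import struct


def recv_message(sock):
    """Receive one message; return (value, fd), fd is None if absent."""
    fd_size = struct.calcsize("i")
    # Ask for 8 bytes of payload and room for one SCM_RIGHTS descriptor.
    data, ancdata, _flags, _addr = sock.recvmsg(8, socket.CMSG_LEN(fd_size))
    if not data:
        raise ConnectionError("server closed the connection")
    while len(data) < 8:
        # Short read: fetch the remainder of the 8-byte number.
        chunk = sock.recv(8 - len(data))
        if not chunk:
            raise ConnectionError("truncated message")
        data += chunk
    fd = None
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            (fd,) = struct.unpack("i", cdata[:fd_size])
    # The payload is a little-endian signed 64-bit number.
    (value,) = struct.unpack("<q", data)
    return value, fd
```

A caller would typically invoke this in a loop and close the socket when it raises, in line with the rule that both sides close the connection on error.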
197 | ||
71c26581 MA |
198 | Note: QEMU currently doesn't close the connection immediately on error, |
199 | but only when the character device is destroyed. | |
200 | ||
fdee2025 MA |
201 | On connect, the server sends the following messages in order: |
202 | ||
203 | 1. The protocol version number, currently zero. The client should | |
204 | close the connection on receipt of versions it can't handle. | |
205 | ||
206 | 2. The client's ID. This is unique among all clients of this server. | |
207 | IDs must be between 0 and 65535, because the Doorbell register | |
208 | provides only 16 bits for them. | |
209 | ||
210 | 3. The number -1, accompanied by the file descriptor for the shared | |
211 | memory. | |
212 | ||
213 | 4. Connect notifications for existing other clients, if any. This is | |
214 | a peer ID (number between 0 and 65535 other than the client's ID), | |
215 | repeated N times. Each repetition is accompanied by one file | |
216 | descriptor. These are for interrupting the peer with that ID using | |
217 | vector 0,..,N-1, in order. If the client is configured for fewer | |
218 | vectors, it closes the extra file descriptors. If it is configured | |
219 | for more, the extra vectors remain unconnected. | |
220 | ||
221 | 5. Interrupt setup. This is the client's own ID, repeated N times. | |
222 | Each repetition is accompanied by one file descriptor. These are | |
223 | for receiving interrupts from peers using vector 0,..,N-1, in | |
224 | order. If the client is configured for fewer vectors, it closes | |
225 | the extra file descriptors. If it is configured for more, the | |
226 | extra vectors remain unconnected. | |
227 | ||
228 | From then on, the server sends these kinds of messages: | |
229 | ||
230 | 6. Connection / disconnection notification. This is a peer ID. | |
231 | ||
232 | - If the number comes with a file descriptor, it's a connection | |
233 | notification, exactly like in step 4. | |
234 | ||
235 | - Else, it's a disconnection notification for the peer with that ID. | |
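The message sequence above can be sketched as a small client-side state machine that consumes already decoded (number, file descriptor) pairs. All names are hypothetical. Rejecting a file descriptor on the version and ID messages is an assumption, not something this specification states, and the handling of a client configured for fewer vectors is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClientState:
    version: Optional[int] = None
    own_id: Optional[int] = None
    shm_fd: Optional[int] = None
    own_vectors: List[int] = field(default_factory=list)      # for receiving
    peers: Dict[int, List[int]] = field(default_factory=dict)  # for sending


def handle_message(state, value, fd):
    """Apply one decoded server message (value, fd or None) to state."""
    if state.version is None:
        # Step 1: protocol version, currently zero.
        if value != 0 or fd is not None:  # fd check is an assumption
            raise ValueError("unsupported protocol version")
        state.version = value
    elif state.own_id is None:
        # Step 2: the client's own ID.
        if not 0 <= value <= 65535 or fd is not None:  # fd check: assumption
            raise ValueError("invalid client ID")
        state.own_id = value
    elif state.shm_fd is None:
        # Step 3: -1 accompanied by the shared memory descriptor.
        if value != -1 or fd is None:
            raise ValueError("expected shared memory message")
        state.shm_fd = fd
    elif not 0 <= value <= 65535:
        raise ValueError("invalid peer ID")
    elif fd is None:
        # Step 6: disconnection notification for that peer.
        state.peers.pop(value, None)
    elif value == state.own_id:
        # Step 5: next vector for receiving interrupts.
        state.own_vectors.append(fd)
    else:
        # Steps 4 and 6: next vector for interrupting that peer.
        state.peers.setdefault(value, []).append(fd)
```

Because vectors arrive in order, the position of a descriptor in each list is its vector number.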
236 | ||
237 | Known bugs: | |
238 | ||
239 | * The protocol changed incompatibly in QEMU 2.5. Before, messages | |
240 | were native endian long, and there was no version number. | |
241 | ||
242 | * The protocol is poorly designed. | |
243 | ||
244 | === The ivshmem Client-Client Protocol === | |
245 | ||
246 | An ivshmem device configured for interrupts receives eventfd file | |
247 | descriptors for interrupting peers and getting interrupted by peers | |
248 | from the server, as explained in the previous section. | |
249 | ||
250 | To interrupt a peer, the device writes the 8-byte integer 1 in native | |
251 | byte order to the respective file descriptor. | |
252 | ||
253 | To receive an interrupt, the device reads and discards as many 8-byte | |
254 | integers as it can. |
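Both operations can be sketched on plain file descriptors. The function names are hypothetical, and the receiving descriptor is assumed to be in non-blocking mode so that "as many as it can" terminates.

```python
import os
import struct


def interrupt_peer(fd):
    """Signal a peer: write the 8-byte integer 1 in native byte order."""
    os.write(fd, struct.pack("=Q", 1))


def drain_interrupts(fd):
    """Read and discard 8-byte integers; return True if any was read.

    Assumes fd is non-blocking.
    """
    seen = False
    while True:
        try:
            data = os.read(fd, 8)
        except BlockingIOError:
            # Nothing more to read right now.
            return seen
        if not data:
            # End of file; cannot happen on an eventfd.
            return seen
        seen = True
```

On an eventfd in its default (non-semaphore) mode, a single read returns the accumulated counter and resets it, so the loop usually ends after one successful read.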