(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <[email protected]>

An *exhaustive* paper (2010), linked on the QEMU wiki above, shows
additional performance details.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load, because
of its significantly lower latency and higher throughput compared to TCP/IP.
This is because the RDMA I/O architecture reduces the number of interrupts
and data copies by bypassing the host networking stack. In particular, a
TCP-based migration, under certain types of memory-bound workloads, may take
an unpredictable amount of time to complete if the amount of memory tracked
during each live migration iteration round cannot keep pace with the rate
of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA over
Converged Ethernet) and Infiniband-based. This implementation of migration
using RDMA is capable of using both technologies because it uses the
OpenFabrics OFED software stack, which abstracts out the programming model
irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding of how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for QEMU to build and run successfully using RDMA migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide whether you want dynamic page registration or
whether all memory should be pinned up front. For example, if you have an
8GB RAM virtual machine, but only 1GB is in active use, then enabling
rdma-pin-all will cause all 8GB to be pinned and resident in memory. This
capability mostly affects the bulk-phase round of the migration and can be
enabled for extremely high-performance RDMA hardware using the following
command:

QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning *all*
of the memory of your virtual machine in the kernel is very expensive and
may extend the initial bulk iteration time by many seconds, thus extending
the total migration time. However, this will not affect the determinism or
predictability of your migration; you will still gain the benefits of
advanced pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA,
using a 40gbps infiniband link performing a worst-case stress test on an
8GB RAM virtual machine:

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, with the same 8GB RAM virtual machine, all 8GB of memory in
active use and the VM itself completely idle, using the same 40gbps
infiniband link:

1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during the bulk round
and does not need to be re-registered during the successive iteration rounds.

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

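A minimal sketch of this setup using the librdmacm API is shown below.
The function and variable names are illustrative only (they are not the
code in migration-rdma.c), and error handling plus event processing are
omitted for brevity:

    /* Illustrative connection-setup skeleton using librdmacm. */
    #include <rdma/rdma_cma.h>

    /* Receiver side: steps 2, 3 and 5 above. */
    void receiver_setup_sketch(struct sockaddr *listen_addr)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listen_id;

        rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
        rdma_bind_addr(listen_id, listen_addr);
        rdma_listen(listen_id, 1);      /* 3. Receiver does listen()      */
        /* ... wait for RDMA_CM_EVENT_CONNECT_REQUEST, create the QP,
         *     post two RQ work requests (step 2), then:
         *     rdma_accept(child_id, &conn_param);   5. Receiver accept() */
    }

    /* Sender side: steps 2 and 4 above. */
    void sender_setup_sketch(struct sockaddr *dest_addr)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct rdma_conn_param param = { 0 };

        rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);
        rdma_resolve_addr(id, NULL, dest_addr, 2000 /* ms */);
        /* ... wait for address/route resolution, create the QP,
         *     post two RQ work requests (step 2), then: */
        rdma_connect(id, &param);       /* 4. Sender does connect()       */
    }

Step 6, the version and capability check, rides on the 'private data'
area carried by this same connect/accept exchange and is described in
the "Versioning and Capabilities" section below.
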
At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).

Header:
    * Length (of the data portion, uint32, network byte order)
    * Type   (what command to perform, uint32, network byte order)
    * Repeat (number of commands in the data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself,
so that the protocol stays compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol check
this field, register all requests found in the array of commands located
in the data portion, and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.

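As an illustration only, this header could be laid out as the following C
structure (the names here are a sketch, not taken from the source):

    /* Sketch of the control-channel SEND message header described above.
     * All fields travel in network byte order (htonl()/ntohl()). */
    #include <stdint.h>

    struct rdma_control_header_sketch {
        uint32_t len;    /* length of the data portion, in bytes     */
        uint32_t type;   /* which command to perform (list below)    */
        uint32_t repeat; /* number of commands in the data portion   */
    };
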
The 'type' field has 12 different command values:
    1. Unused
    2. Error                  (sent to the source during bad things)
    3. Ready                  (control-channel is available)
    4. QEMU File              (for sending non-live device state)
    5. RAM Blocks request     (used right after connection setup)
    6. RAM Blocks result      (used right after connection setup)
    7. Compress page          (zap zero page and skip registration)
    8. Register request       (dynamic chunk registration)
    9. Register result        ('rkey' to be used by sender)
   10. Register finished      (registration for current iteration finished)
   11. Unregister request     (unpin previously registered memory)
   12. Unregister finished    (confirmation that unpin completed)

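For illustration, these command values could be written as an enumeration
(the identifiers below are made up for this document; the exact wire values
are defined by the implementation, not by this list):

    /* Illustrative names for the 12 command values listed above. */
    enum rdma_control_type_sketch {
        RDMA_CONTROL_NONE,                /*  1. Unused                */
        RDMA_CONTROL_ERROR,               /*  2. Error                 */
        RDMA_CONTROL_READY,               /*  3. Ready                 */
        RDMA_CONTROL_QEMU_FILE,           /*  4. QEMU File             */
        RDMA_CONTROL_RAM_BLOCKS_REQUEST,  /*  5. RAM Blocks request    */
        RDMA_CONTROL_RAM_BLOCKS_RESULT,   /*  6. RAM Blocks result     */
        RDMA_CONTROL_COMPRESS,            /*  7. Compress page         */
        RDMA_CONTROL_REGISTER_REQUEST,    /*  8. Register request      */
        RDMA_CONTROL_REGISTER_RESULT,     /*  9. Register result       */
        RDMA_CONTROL_REGISTER_FINISHED,   /* 10. Register finished     */
        RDMA_CONTROL_UNREGISTER_REQUEST,  /* 11. Unregister request    */
        RDMA_CONTROL_UNREGISTER_FINISHED  /* 12. Unregister finished   */
    };
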
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match what we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands back to the sender, which hold the
   rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection, as described below.)

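The flow of these two primitives can be condensed into the following
sketch. The helpers post_recv(), post_send(), wait_for_ready(),
wait_for_send() and check_header() are hypothetical stand-ins (not the
functions in the source), RDMA_CONTROL_READY is the illustrative value
from the earlier sketch, and error handling is omitted:

    #include <stdint.h>

    static void post_recv(void);                /* post one RQ work request */
    static void post_send(uint32_t type, const void *data, uint32_t len);
    static void wait_for_ready(void);           /* block on CQ for a READY  */
    static void wait_for_send(void);            /* block on CQ for a SEND   */
    static void check_header(uint32_t expected_type);

    void exchange_recv_sketch(uint32_t expected_type)
    {
        post_send(RDMA_CONTROL_READY, NULL, 0); /* 1. tell sender we're ready  */
        post_recv();                            /* 2. replace used RQ entry    */
        wait_for_send();                        /* 3-4. block until SEND lands */
        check_header(expected_type);            /* 5. verify type and version  */
    }

    void exchange_send_sketch(uint32_t type, const void *data, uint32_t len,
                              int expect_response)
    {
        wait_for_ready();                       /* 1. wait for receiver READY  */
        if (expect_response) {
            post_recv();                        /* 2. room for the response    */
        }
        post_recv();                            /* 3. replace used RQ entry    */
        post_send(type, data, len);             /* 4. transmit the command     */
        if (expect_response) {
            wait_for_send();                    /* 5. block for the response   */
        }
    }
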
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual
   addresses and lengths of each RAMBlock. This is used by the client to
   determine the start and stop locations of chunks and how to register them
   dynamically before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces (described below) also call these functions
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered,
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side
   (see the sketch after this list).

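A rough sketch of the zero-page decision in item 4 follows; is_all_zero(),
the chunk helpers and the send helpers are all hypothetical, and the case
where chunk registration is disabled entirely (unconditional writes) is
left out:

    #include <stddef.h>

    static int  is_all_zero(const void *buf, size_t len);
    static int  chunk_is_registered(const void *page);
    static void send_compress_command(const void *page);
    static void queue_rdma_write(const void *page);
    static const void *chunk_start(const void *page);
    static size_t chunk_length(const void *page);

    static void send_page_sketch(const void *page, size_t page_size)
    {
        if (!chunk_is_registered(page)) {
            /* Page not yet registered: a zero page can be "compressed",
             * i.e. zapped on the destination with no RDMA write at all. */
            if (is_all_zero(page, page_size)) {
                send_compress_command(page);
                return;
            }
        } else if (is_all_zero(chunk_start(page), chunk_length(page))) {
            /* Page already registered: only compress if the *entire*
             * chunk containing it is zero.                             */
            send_compress_command(page);
            return;
        }

        queue_rdma_write(page);    /* otherwise take the normal RDMA path */
    }
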
Versioning and Capabilities
===========================
The current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation. (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
              uint32, network byte order
    * Flags   (bitwise OR of each capability),
              uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

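For illustration, the private data header can be pictured as the structure
below (the field and flag names are a sketch, not the source's identifiers):

    /* Sketch of the 'private data' header exchanged at connection setup.
     * Both fields are in network byte order.  It is passed through
     * rdma_conn_param.private_data at connect/accept time, before any
     * memory registration has taken place. */
    #include <stdint.h>

    #define RDMA_CAPABILITY_PIN_ALL_SKETCH  0x01  /* illustrative flag bit */

    struct rdma_private_data_sketch {
        uint32_t version; /* protocol version, validated before send/recv   */
        uint32_t flags;   /* bitwise OR of requested/granted capabilities   */
    };
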
This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page
registration.

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.

QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

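A condensed sketch of that holding-area logic follows; the 'cache'
bookkeeping and request_qemu_file_send() are hypothetical, not the
actual QEMUFileOps implementation:

    #include <stddef.h>
    #include <string.h>

    /* Ask the peer for another "QEMU File" SEND and copy its bytes into
     * 'dst'; returns how many bytes arrived.  (Hypothetical helper.)    */
    static size_t request_qemu_file_send(unsigned char *dst, size_t max);

    static struct {
        unsigned char data[4096]; /* bytes copied out of the last SEND   */
        size_t        len;        /* how many of them are valid          */
        size_t        pos;        /* how many have been consumed so far  */
    } cache;

    static size_t get_buffer_sketch(unsigned char *buf, size_t want)
    {
        if (cache.pos == cache.len) {
            /* Holding area is empty: issue another "QEMU File" protocol
             * command and wait for a new SEND message to refill it.     */
            cache.len = request_qemu_file_send(cache.data, sizeof(cache.data));
            cache.pos = 0;
        }

        size_t avail = cache.len - cache.pos;
        size_t give  = want < avail ? want : avail;

        memcpy(buf, cache.data + cache.pos, give); /* hand back what was asked */
        cache.pos += give;                         /* the rest waits for the
                                                      next get_buffer() pass   */
        return give;
    }
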
Migration of VM's ram:
======================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, and virtual
addresses, and, in case dynamic page registration was disabled on the
server-side, pre-registered RDMA keys (otherwise no keys are included).

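One entry of that per-block description might look roughly like this
(a sketch with illustrative field names, not the structure in the source):

    /* Sketch of one RAMBlock description exchanged at the start of
     * the migration. */
    #include <stdint.h>

    struct remote_ram_block_sketch {
        uint64_t remote_host_addr; /* virtual address on the other side    */
        uint64_t offset;           /* offset of the block in the ram space */
        uint64_t length;           /* length of the block in bytes         */
        uint32_t remote_rkey;      /* pre-registered key, only meaningful
                                      when rdma-pin-all was negotiated;
                                      otherwise obtained later via the
                                      'Register request/result' commands   */
    };
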
Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: this means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

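As a rough sketch, the chunking and batch-signalling arithmetic looks
like this (the constants and helper names are illustrative only):

    #include <stdint.h>

    #define CHUNK_SIZE_SKETCH  (1024u * 1024u) /* 1 MB chunks, per the text    */
    #define BATCH_SIZE_SKETCH  64              /* ~64 chunks (64 MB) per batch */

    /* Which chunk of its RAMBlock does an address fall into? */
    static uint64_t chunk_index_sketch(uint64_t block_start, uint64_t addr)
    {
        return (addr - block_start) / CHUNK_SIZE_SKETCH;
    }

    /* Only the last RDMA Write of a batch asks the hardware to post a
     * completion; the earlier ones stay unsignalled so that the hardware
     * is kept busy and the migration stays asynchronous. */
    static int chunk_needs_signal_sketch(uint64_t chunks_queued_in_batch)
    {
        return (chunks_queued_in_batch % BATCH_SIZE_SKETCH) ==
               (BATCH_SIZE_SKETCH - 1);
    }
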
Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use for
RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way it would be if the TCP socket
were broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Some form of balloon-device usage tracking would also
   help alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.