(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <[email protected]>

An *exhaustive* paper (2010), linked on the QEMU wiki above, shows
additional performance details.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load, because
of its significantly lower latency and higher throughput compared to TCP/IP.
This is because the RDMA I/O architecture reduces the number of interrupts
and data copies by bypassing the host networking stack. In particular, a
TCP-based migration, under certain types of memory-bound workloads, may take
an unpredictable amount of time to complete if the amount of memory tracked
during each live migration iteration round cannot keep pace with the rate
of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA over
Converged Ethernet) and Infiniband-based. This implementation of migration
using RDMA is capable of using both technologies because it uses the
OpenFabrics OFED software stack, which abstracts out the programming model
irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding of how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for QEMU to build and run successfully using RDMA migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide whether you want dynamic page registration or
whether all memory should be pinned up front. For example, if you have an
8GB RAM virtual machine, but only 1GB is in active use, then enabling
rdma-pin-all will cause all 8GB to be pinned and resident in memory. This
capability mostly affects the bulk-phase round of the migration and can be
enabled for extremely high-performance RDMA hardware using the following
command:

QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning *all*
of the memory of your virtual machine in the kernel is very expensive and
may extend the initial bulk iteration time by many seconds, thus extending
the total migration time. However, this will not affect the determinism or
predictability of your migration; you will still gain the benefits of
advanced pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA,
using a 40gbps infiniband link performing a worst-case stress test on an
8GB RAM virtual machine:

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, with the same 8GB RAM virtual machine, all 8GB of memory in
active use and the VM itself completely idle, using the same 40gbps
infiniband link:

1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during the bulk round
and does not need to be re-registered during the successive iteration rounds.

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

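A minimal sketch of this setup using the librdmacm API is shown below.
The function and variable names are illustrative only (they are not the
code in migration-rdma.c), and error handling plus event processing are
omitted for brevity:

    /* Illustrative connection-setup skeleton using librdmacm. */
    #include <rdma/rdma_cma.h>

    /* Receiver side: steps 2, 3 and 5 above. */
    void receiver_setup_sketch(struct sockaddr *listen_addr)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listen_id;

        rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
        rdma_bind_addr(listen_id, listen_addr);
        rdma_listen(listen_id, 1);      /* 3. Receiver does listen()      */
        /* ... wait for RDMA_CM_EVENT_CONNECT_REQUEST, create the QP,
         *     post two RQ work requests (step 2), then:
         *     rdma_accept(child_id, &conn_param);   5. Receiver accept() */
    }

    /* Sender side: steps 2 and 4 above. */
    void sender_setup_sketch(struct sockaddr *dest_addr)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct rdma_conn_param param = { 0 };

        rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);
        rdma_resolve_addr(id, NULL, dest_addr, 2000 /* ms */);
        /* ... wait for address/route resolution, create the QP,
         *     post two RQ work requests (step 2), then: */
        rdma_connect(id, &param);       /* 4. Sender does connect()       */
    }

Step 6, the version and capability check, rides on the 'private data'
area carried by this same connect/accept exchange and is described in
the "Versioning and Capabilities" section below.
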
At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).

Header:
    * Length (of the data portion, uint32, network byte order)
    * Type   (what command to perform, uint32, network byte order)
    * Repeat (number of commands in the data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself,
so that the protocol stays compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol check
this field, register all requests found in the array of commands located
in the data portion, and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.

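As an illustration only, this header could be laid out as the following C
structure (the names here are a sketch, not taken from the source):

    /* Sketch of the control-channel SEND message header described above.
     * All fields travel in network byte order (htonl()/ntohl()). */
    #include <stdint.h>

    struct rdma_control_header_sketch {
        uint32_t len;    /* length of the data portion, in bytes     */
        uint32_t type;   /* which command to perform (list below)    */
        uint32_t repeat; /* number of commands in the data portion   */
    };
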
The 'type' field has 12 different command values:
    1. Unused
    2. Error                  (sent to the source during bad things)
    3. Ready                  (control-channel is available)
    4. QEMU File              (for sending non-live device state)
    5. RAM Blocks request     (used right after connection setup)
    6. RAM Blocks result      (used right after connection setup)
    7. Compress page          (zap zero page and skip registration)
    8. Register request       (dynamic chunk registration)
    9. Register result        ('rkey' to be used by sender)
   10. Register finished      (registration for current iteration finished)
   11. Unregister request     (unpin previously registered memory)
   12. Unregister finished    (confirmation that unpin completed)

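For illustration, these command values could be written as an enumeration
(the identifiers below are made up for this document; the exact wire values
are defined by the implementation, not by this list):

    /* Illustrative names for the 12 command values listed above. */
    enum rdma_control_type_sketch {
        RDMA_CONTROL_NONE,                /*  1. Unused                */
        RDMA_CONTROL_ERROR,               /*  2. Error                 */
        RDMA_CONTROL_READY,               /*  3. Ready                 */
        RDMA_CONTROL_QEMU_FILE,           /*  4. QEMU File             */
        RDMA_CONTROL_RAM_BLOCKS_REQUEST,  /*  5. RAM Blocks request    */
        RDMA_CONTROL_RAM_BLOCKS_RESULT,   /*  6. RAM Blocks result     */
        RDMA_CONTROL_COMPRESS,            /*  7. Compress page         */
        RDMA_CONTROL_REGISTER_REQUEST,    /*  8. Register request      */
        RDMA_CONTROL_REGISTER_RESULT,     /*  9. Register result       */
        RDMA_CONTROL_REGISTER_FINISHED,   /* 10. Register finished     */
        RDMA_CONTROL_UNREGISTER_REQUEST,  /* 11. Unregister request    */
        RDMA_CONTROL_UNREGISTER_FINISHED  /* 12. Unregister finished   */
    };
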
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match what we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands back to the sender, which hold the
   rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection, as described below.)

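The flow of these two primitives can be condensed into the following
sketch. The helpers post_recv(), post_send(), wait_for_ready(),
wait_for_send() and check_header() are hypothetical stand-ins (not the
functions in the source), RDMA_CONTROL_READY is the illustrative value
from the earlier sketch, and error handling is omitted:

    #include <stdint.h>

    static void post_recv(void);                /* post one RQ work request */
    static void post_send(uint32_t type, const void *data, uint32_t len);
    static void wait_for_ready(void);           /* block on CQ for a READY  */
    static void wait_for_send(void);            /* block on CQ for a SEND   */
    static void check_header(uint32_t expected_type);

    void exchange_recv_sketch(uint32_t expected_type)
    {
        post_send(RDMA_CONTROL_READY, NULL, 0); /* 1. tell sender we're ready  */
        post_recv();                            /* 2. replace used RQ entry    */
        wait_for_send();                        /* 3-4. block until SEND lands */
        check_header(expected_type);            /* 5. verify type and version  */
    }

    void exchange_send_sketch(uint32_t type, const void *data, uint32_t len,
                              int expect_response)
    {
        wait_for_ready();                       /* 1. wait for receiver READY  */
        if (expect_response) {
            post_recv();                        /* 2. room for the response    */
        }
        post_recv();                            /* 3. replace used RQ entry    */
        post_send(type, data, len);             /* 4. transmit the command     */
        if (expect_response) {
            wait_for_send();                    /* 5. block for the response   */
        }
    }
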
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual
   addresses and lengths of each RAMBlock. This is used by the client to
   determine the start and stop locations of chunks and how to register them
   dynamically before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces (described below) also call these functions
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered,
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side
   (see the sketch after this list).

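A rough sketch of the zero-page decision in item 4 follows; is_all_zero(),
the chunk helpers and the send helpers are all hypothetical, and the case
where chunk registration is disabled entirely (unconditional writes) is
left out:

    #include <stddef.h>

    static int  is_all_zero(const void *buf, size_t len);
    static int  chunk_is_registered(const void *page);
    static void send_compress_command(const void *page);
    static void queue_rdma_write(const void *page);
    static const void *chunk_start(const void *page);
    static size_t chunk_length(const void *page);

    static void send_page_sketch(const void *page, size_t page_size)
    {
        if (!chunk_is_registered(page)) {
            /* Page not yet registered: a zero page can be "compressed",
             * i.e. zapped on the destination with no RDMA write at all. */
            if (is_all_zero(page, page_size)) {
                send_compress_command(page);
                return;
            }
        } else if (is_all_zero(chunk_start(page), chunk_length(page))) {
            /* Page already registered: only compress if the *entire*
             * chunk containing it is zero.                             */
            send_compress_command(page);
            return;
        }

        queue_rdma_write(page);    /* otherwise take the normal RDMA path */
    }
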
Versioning and Capabilities
===========================
The current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation. (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
              uint32, network byte order
    * Flags   (bitwise OR of each capability),
              uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

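For illustration, the private data header can be pictured as the structure
below (the field and flag names are a sketch, not the source's identifiers):

    /* Sketch of the 'private data' header exchanged at connection setup.
     * Both fields are in network byte order.  It is passed through
     * rdma_conn_param.private_data at connect/accept time, before any
     * memory registration has taken place. */
    #include <stdint.h>

    #define RDMA_CAPABILITY_PIN_ALL_SKETCH  0x01  /* illustrative flag bit */

    struct rdma_private_data_sketch {
        uint32_t version; /* protocol version, validated before send/recv   */
        uint32_t flags;   /* bitwise OR of requested/granted capabilities   */
    };
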
This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page
registration.

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.

QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

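A condensed sketch of that holding-area logic follows; the 'cache'
bookkeeping and request_qemu_file_send() are hypothetical, not the
actual QEMUFileOps implementation:

    #include <stddef.h>
    #include <string.h>

    /* Ask the peer for another "QEMU File" SEND and copy its bytes into
     * 'dst'; returns how many bytes arrived.  (Hypothetical helper.)    */
    static size_t request_qemu_file_send(unsigned char *dst, size_t max);

    static struct {
        unsigned char data[4096]; /* bytes copied out of the last SEND   */
        size_t        len;        /* how many of them are valid          */
        size_t        pos;        /* how many have been consumed so far  */
    } cache;

    static size_t get_buffer_sketch(unsigned char *buf, size_t want)
    {
        if (cache.pos == cache.len) {
            /* Holding area is empty: issue another "QEMU File" protocol
             * command and wait for a new SEND message to refill it.     */
            cache.len = request_qemu_file_send(cache.data, sizeof(cache.data));
            cache.pos = 0;
        }

        size_t avail = cache.len - cache.pos;
        size_t give  = want < avail ? want : avail;

        memcpy(buf, cache.data + cache.pos, give); /* hand back what was asked */
        cache.pos += give;                         /* the rest waits for the
                                                      next get_buffer() pass   */
        return give;
    }
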
Migration of VM's ram:
======================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, and virtual
addresses, and, in case dynamic page registration was disabled on the
server-side, pre-registered RDMA keys (otherwise no keys are included).

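One entry of that per-block description might look roughly like this
(a sketch with illustrative field names, not the structure in the source):

    /* Sketch of one RAMBlock description exchanged at the start of
     * the migration. */
    #include <stdint.h>

    struct remote_ram_block_sketch {
        uint64_t remote_host_addr; /* virtual address on the other side    */
        uint64_t offset;           /* offset of the block in the ram space */
        uint64_t length;           /* length of the block in bytes         */
        uint32_t remote_rkey;      /* pre-registered key, only meaningful
                                      when rdma-pin-all was negotiated;
                                      otherwise obtained later via the
                                      'Register request/result' commands   */
    };
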
Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: this means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

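As a rough sketch, the chunking and batch-signalling arithmetic looks
like this (the constants and helper names are illustrative only):

    #include <stdint.h>

    #define CHUNK_SIZE_SKETCH  (1024u * 1024u) /* 1 MB chunks, per the text    */
    #define BATCH_SIZE_SKETCH  64              /* ~64 chunks (64 MB) per batch */

    /* Which chunk of its RAMBlock does an address fall into? */
    static uint64_t chunk_index_sketch(uint64_t block_start, uint64_t addr)
    {
        return (addr - block_start) / CHUNK_SIZE_SKETCH;
    }

    /* Only the last RDMA Write of a batch asks the hardware to post a
     * completion; the earlier ones stay unsignalled so that the hardware
     * is kept busy and the migration stays asynchronous. */
    static int chunk_needs_signal_sketch(uint64_t chunks_queued_in_batch)
    {
        return (chunks_queued_in_batch % BATCH_SIZE_SKETCH) ==
               (BATCH_SIZE_SKETCH - 1);
    }
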
Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use for
RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way it would be if the TCP socket
were broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Some form of balloon-device usage tracking would also
   help alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.