Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux Kernel driver AS IS; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit and, even if not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by User Level Software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
https://github.com/linux-rdma/rdma-core.git
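
For reference, a typical build looks like the following (a sketch; rdma-core
ships a build.sh helper which includes the vmw_pvrdma provider, but consult
its README for the authoritative steps):
   git clone https://github.com/linux-rdma/rdma-core.git
   cd rdma-core
   bash build.sh    # builds the libraries under ./build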


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices,
each one requiring a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
Kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux Kernel but not loaded by default.
Install the User Level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
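
To verify the backend, rxe_cfg and the standard libibverbs tools can be used
(a quick check, assuming the interface was named eth0 as above):
   rxe_cfg status    # eth0 should be listed with an rxe0 instance
   ibv_devices       # rxe0 should appear among the ibdevices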


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
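
One way to confirm that the ibdevice has an active port (mlx5_6 here is just
an example name) is with the standard ibv_devinfo tool:
   ibv_devinfo -d mlx5_6 | grep -E 'state|link_layer'
   # look for "state: PORT_ACTIVE"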


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.
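
For example, from the QEMU source tree:
   ./configure --enable-rdma
   make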



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface for VMs in
order to overcome the limitation where only a single entity can register
with the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run:
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded, otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so,
for example, for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.
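
For example, a hypothetical invocation serving port 2 of device mlx5_0
(after checking the modules, as described above) could look like:
   lsmod | grep -E 'ib_cm|rdma_cm'    # must print nothing
   rdmacm-mux -d mlx5_0 -p 2 &
   # the UNIX socket /var/run/rdmacm-mux-mlx5_0-2 is now available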

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also holds: whenever an
address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt to update the address of the
backend Ethernet device.
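
As an illustration of the address-to-GID mapping, adding an IP address
populates a GID entry that can be inspected through the standard ib_core
sysfs layout (the index 2 below assumes the MAC and IPv6 entries come first):
   ip addr add 192.168.1.10/24 dev eth0
   cat /sys/class/infiniband/rxe0/ports/1/gids/2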

pvrdma requires the libvirt service to be up.


3.4 PCI devices settings
========================
A RoCE device exposes two functions - an Ethernet one and an RDMA one.
To support this, the pvrdma device is composed of two PCI functions, an
Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
PCI function 1. The Ethernet function can be used for other Ethernet
purposes such as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA), this
  specifies the port to use. If not set, port 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp:      Maximum number of QPs.
- dev-caps-max-cq:      Maximum number of CQs.
- dev-caps-max-mr:      Maximum number of MRs.
- dev-caps-max-pd:      Maximum number of PDs.
- dev-caps-max-ah:      Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have their
  defaults.
- The dev-caps-prefixed parameters define the upper limits, but the final
  values are adjusted by the backend device limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)
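
For example, to find the netdev that backs a given ibdev (rxe0 here):
   ls /sys/class/infiniband/rxe0/device/net/
   # prints the associated Ethernet device name, e.g. eth0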


3.6 Example
===========
Define bridge device with vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>
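
For reference, an equivalent plain QEMU command line could look roughly like
this (a sketch; the tap networking backend wired to vmxnet3 is illustrative):
   qemu-system-x86_64 -m 1G \
     -object memory-backend-ram,id=mb1,size=1G,share \
     -numa node,memdev=mb1 \
     -netdev tap,id=net0 \
     -device vmxnet3,netdev=net0,addr=10.0,multifunction=on \
     -chardev socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads \
     -device pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads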



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On the configuration path:
- For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
  request a resource from the backend interface, maintaining a 1-1 mapping
  between the guest and the host.
On the data path:
- Every post_send/receive received from the guest will be converted into
  a post_send/receive for the backend. The buffers' data will not be touched
  or copied, resulting in near bare-metal performance for large enough
  buffers.
- Completions from the backend interface will result in completions for
  the pvrdma device.


4.2 PCI BARs
============
PCI Bars:
    BAR 0 - MSI-X
        MSI-X vectors:
            (0) Command - used when execution of a command is completed.
            (1) Async - not in use.
            (2) Completion - used when a completion event is placed in
                the device's CQ ring.
    BAR 1 - Registers
        --------------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR  |  MAC  |
        --------------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                - General info such as driver version
                - Address of 'command' and 'response'
                - Address of async ring
                - Address of device's CQ ring
                - Device capabilities
        CTL - Device control operations (activate, reset etc)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

    BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
        ---------------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates a CQ object and initializes the CQ ring based on pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - Send and recv rings based on pdir
    - Send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts the QEMU _cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the features of the VMware device API
  that the guest Linux driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates Guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device.)

All the above assumes no memory registration is done on the data path.