Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux Kernel driver AS IS, no need for any special guest
modifications.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines and does not require an RDMA HCA in the host;
it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit, and, even if not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.
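
To verify that the guest kernel includes the driver, check for the
vmw_pvrdma module from inside the guest; a quick sanity check, assuming
the upstream module name:
   modinfo vmw_pvrdma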

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git
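
A minimal build sketch, assuming the build dependencies listed in the
rdma-core README are already installed (the project's own instructions
take precedence):
   git clone https://github.com/linux-rdma/rdma-core.git
   cd rdma-core
   bash build.sh        # builds the libraries and tools under build/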


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SRIOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices,
each one requiring a separate instance (rxe or SRIOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
  https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend, as shown in the sketch below.
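
Putting the steps together, a minimal sketch for bringing up and
verifying an rxe backend (interface and device names are examples):
   modprobe rdma_rxe     # the module is not loaded by default
   rxe_cfg add eth0      # bind rxe to the Ethernet interface
   rxe_cfg status        # should show rxe0 attached to eth0
   ibv_devices           # rxe0 should be listed as an ibdevice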


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
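
To confirm that a candidate ibdevice has an active port, ibv_devinfo
can be used (mlx5_6 is just an example name):
   ibv_devinfo -d mlx5_6
   # look for a port with "state: PORT_ACTIVE"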


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.
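
A minimal configure-and-build sketch (the development packages providing
the RDMA libraries, e.g. libibverbs and librdmacm, vary per distribution):
   ./configure --enable-rdma
   make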



3. Usage
========
Currently the device works only with memory-backed RAM
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet3 Ethernet device which is redundant in the guest
   but is required to pass the ibdevice GID using its MAC.
   Examples:
     For an rxe backend using the eth0 interface it will use its MAC:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SRIOV VF, we take the Ethernet interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be rxe or an RDMA VF (e.g. mlx5_4)
Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
The rules of conversion are part of the RoCE spec, but since manual conversion
is not required, spotting problems is not hard:
    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
             MAC: 7c:fe:90:cb:74:3a
    Note the difference between the first byte of the MAC and the GID
    (the universally/locally administered bit is flipped by the EUI-64 mapping).
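
The GID table of the backend ibdevice is exposed through sysfs, so the
value at backend-gid-idx can be compared directly against the vmxnet3
MAC (paths assume gid index 0 on port 1; adjust to your setup):
   cat /sys/class/infiniband/rxe0/ports/1/gids/0
   ip link show eth0     # compare, keeping the flipped bit in mind

For reference, a complete invocation sketch combining the pieces above
(the slot, MAC and backend values are placeholders to be filled in):
   qemu-system-x86_64 -enable-kvm -m 1G \
     -object memory-backend-ram,id=mb1,size=1G,share \
     -numa node,memdev=mb1 \
     -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC> \
     -device pvrdma,addr=<slot>.1,backend-dev=rxe0,backend-gid-idx=0,backend-port=1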



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On the configuration path:
- For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
  request a resource from the backend interface, maintaining a 1-1 mapping
  between the guest and the host.
On the data path:
- Every post_send/receive received from the guest will be converted into
  a post_send/receive for the backend. The buffer data will not be touched
  or copied, resulting in near bare-metal performance for large enough buffers.
- Completions from the backend interface will result in completions for
  the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
  BAR 0 - MSI-X
      MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in
            the device's CQ ring.
  BAR 1 - Registers
      --------------------------------------------------------
      | VERSION | DSR | CTL | REQ | ERR | ICR | IMR  |  MAC  |
      --------------------------------------------------------
      DSR - Address of the driver/device shared memory used
            for the command channel, used for passing:
                - General info such as driver version
                - Address of 'command' and 'response'
                - Address of async ring
                - Address of device's CQ ring
                - Device capabilities
      CTL - Device control operations (activate, reset etc.)
      IMR - Interrupt mask
      REQ - Command execution register
      ERR - Operation status

  BAR 2 - UAR
      ---------------------------------------------------------
      | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
      ---------------------------------------------------------
      - Offset 0 used for QP operations (send and recv)
      - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
    - Allocates pages for the CQ ring
    - Creates a page directory (pdir) to hold the CQ ring's pages
    - Initializes the CQ ring
    - Initializes the 'Create CQ' command object (cqe, pdir etc.)
    - Copies the command to the 'command' address
    - Writes 0 into the REQ register
- Device
    - Reads the request object from the 'command' address
    - Allocates the CQ object and initializes the CQ ring based on the pdir
    - Creates the backend CQ
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
- Guest driver
    - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
    - Allocates pages for the send and receive rings
    - Creates a page directory (pdir) to hold the rings' pages
    - Initializes the 'Create QP' command object (max_send_wr,
      send_cq_handle, recv_cq_handle, pdir etc.)
    - Copies the object to the 'command' address
    - Writes 0 into the REQ register
- Device
    - Reads the request object from the 'command' address
    - Allocates the QP object and initializes
        - Send and recv rings based on the pdir
        - Send and recv ring state
    - Creates the backend QP
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
- Guest driver
    - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
    - Initializes a wqe and places it on the recv ring
    - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
    - Extracts the qpn from the UAR
    - Walks through the ring and does the following for each wqe
        - Prepares the backend CQE context to be used when
          receiving a completion from the backend (wr_id, op_code, emu_cq_num)
        - For each sge prepares a backend sge
        - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
    - Polls for completions
    - Extracts emu_cq_num, wr_id and op_code from the context
    - Writes the CQE to the CQ ring
    - Writes the CQ number to the device CQ
    - Sends a completion-interrupt to the guest
    - Deallocates the context
    - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the guest Linux driver's implementation
  of the VMware device API.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to initialize.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance. (Tested with 2 VMs, the pvrdma devices connected to 2 VFs of
the same device.)

All the above assumes no memory registration is done on the data path.
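
As a rough sketch, such measurements can be reproduced between two guests
with the perftest suite, assuming it is installed in both guests (the
guest ibdevice name and buffer size below are example values):
   # on the server guest:
   ib_write_bw -d vmw_pvrdma0 -s 1048576
   # on the client guest:
   ib_write_bw -d vmw_pvrdma0 -s 1048576 <server-ip>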