QEMU Virtual NVDIMM
===================

This document explains the usage of the virtual NVDIMM (vNVDIMM) feature,
which is available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by the memory
backend (i.e. memory-backend-file and memory-backend-ram). A vNVDIMM
device can be created at startup time via the following command line
options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory
   backend device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3, which indicates that the device is "unarmed" and
   cannot accept persistent writes. Linux guest drivers set the device to
   read-only when this bit is present. Set "unarmed" to "on" when the memdev
   has "readonly=on".
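
   For example, a read-only vNVDIMM backed by a read-only backend file could
   be described with the following minimal sketch, which reuses the
   placeholders from the options above and leaves all other backend options
   at their defaults:

    -object memory-backend-file,id=mem1,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=on
    -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=on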

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

For the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.
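
For example, with 4 GiB of normal RAM and a single 4 GiB vNVDIMM backed by
a plain file (the path below is only illustrative), the placeholders above
could be filled in as:

 -machine pc,nvdimm=on
 -m 4G,slots=2,maxmem=8G
 -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm.img,size=4G
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off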

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially in the
   shrink case.

   QEMU v2.8.0 and later check the backend file size against the "size"
   option. If they do not match, QEMU will report an error and abort in
   order to avoid data corruption (see the example after these notes).

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.
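
For example, when the backend file is created in advance, it can be sized to
match the "size" option exactly (the path and the 4G size are only examples;
4G also satisfies the 2MB alignment mentioned in note 2):

 truncate -s 4G /tmp/nvdimm.img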

Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of the backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM devices.
Similarly to RAM hotplug, vNVDIMM hotplug is accomplished by the two
monitor commands "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   a sufficient number of slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar requirement applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices
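
For example, following the two rules above, a guest started with 4 GiB of
normal RAM and one statically plugged 4 GiB vNVDIMM, which should later
accept one more hotplugged 4 GiB vNVDIMM, needs at least:

 -m 4G,slots=3,maxmem=12G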

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option
to memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM in the 'align=NUM' option
must be larger than or equal to the 'align' of the device dax.
One of the following commands can be used to show the 'align' of a device dax:

 ndctl list -X
 daxctl list -R

To query the proper 'align' of a device dax with these tools, the
'libdaxctl' library needs to be installed.

For example, if the device dax requires 2 MB alignment, the following QEMU
command line options can be used to make it (/dev/dax0.0) the backend of a
vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both the 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX, or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied, and currently no way is
provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.
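
For example, assuming /mnt is a file system mounted with the dax option (see
the "Backend File Setup Example" section below) and QEMU is built with
libpmem support (see the "NVDIMM Persistence" section), a backend with
guaranteed guest write persistence could look like:

 -object memory-backend-file,id=mem1,share=on,pmem=on,mem-path=/mnt/nvdimm.img,size=4G
 -device nvdimm,id=nvdimm1,memdev=mem1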

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates to
the guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.

Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

 ndctl create-namespace -f -e namespace0.0 -m devdax

The /dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First, an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

 ndctl create-namespace -f -e namespace0.0 -m fsdax
 (partition /dev/pmem0 with name pmem0p1)
 mount -o dax /dev/pmem0p1 /mnt
 (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
  in /mnt)

The new file in /mnt can then be used in the "mem-path" option.
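
For example, a 4 GiB raw image file (the name is arbitrary) could be created
in /mnt with qemu-img and then referenced by "mem-path", as in the backend
sketch shown in the "Guest Data Persistence" section above:

 qemu-img create -f raw /mnt/nvdimm.img 4G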

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence. Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

 -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed
following the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
suggested to set the 'pmem' option of memory-backend-file to 'on'. When
'pmem' is 'on' and QEMU is built with libpmem [2] support (configured with
--enable-libpmem), QEMU will take the necessary operations to guarantee the
persistence of its own writes to the vNVDIMM backend (e.g., in vNVDIMM label
emulation and live migration). If 'pmem' is 'on' while there is no libpmem
support, QEMU will exit and report a "lack of libpmem support" message, to
ensure that the expected persistence is available. For example, if we want
to ensure persistence for some backend file, use the QEMU command line:

 -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html