QEMU Virtual NVDIMM
===================

This document explains the usage of the virtual NVDIMM (vNVDIMM)
feature, which has been available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the
vNVDIMM device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by the memory
backend (i.e. memory-backend-file and memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is via the following
command line options:

 -machine pc,nvdimm
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
 -device nvdimm,id=nvdimm1,memdev=mem1

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
   creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
   accesses to the virtual NVDIMM device go to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with the option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

 - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
   device whose storage is provided by the above memory backend device.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.
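
The accounting above can be sketched as a shell snippet that builds a
command line for two vNVDIMM devices; the file names and sizes here
are hypothetical, not taken from this document:

```shell
# Hypothetical sizes in GB for one RAM region and two vNVDIMM backends.
RAM_SIZE_G=4
NVDIMM1_SIZE_G=4
NVDIMM2_SIZE_G=2

# slots >= number of RAM devices (1) + number of vNVDIMM devices (2)
N=3
# maxmem >= total size of normal RAM and all vNVDIMM backends
MAX_SIZE_G=$((RAM_SIZE_G + NVDIMM1_SIZE_G + NVDIMM2_SIZE_G))

# Each "-object"/"-device" pair adds one vNVDIMM device.
QEMU_ARGS="-machine pc,nvdimm \
 -m ${RAM_SIZE_G}G,slots=$N,maxmem=${MAX_SIZE_G}G \
 -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm1.img,size=${NVDIMM1_SIZE_G}G \
 -device nvdimm,id=nvdimm1,memdev=mem1 \
 -object memory-backend-file,id=mem2,share=on,mem-path=nvdimm2.img,size=${NVDIMM2_SIZE_G}G \
 -device nvdimm,id=nvdimm2,memdev=mem2"
echo "$QEMU_ARGS"
```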

For the above command line options, if the guest OS has a proper
NVDIMM driver, it should be able to detect an NVDIMM device in
persistent memory mode whose size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size"
   option, QEMU will truncate the backend file with ftruncate(2),
   which will corrupt the existing data in the backend file,
   especially in the shrink case.

   QEMU v2.8.0 and later check the backend file size against the
   "size" option. If they do not match, QEMU reports an error and
   aborts in order to avoid data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 adds an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.
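
One way to avoid the size-mismatch abort described in note 1 is to
pre-create the backend file with exactly the size that will be passed
to the "size" option; a sketch, using a hypothetical temporary file
in place of the document's $PATH placeholder:

```shell
# Hypothetical backend file; stands in for $PATH in the options above.
BACKEND=$(mktemp /tmp/nvdimm.XXXXXX)
# 4 GiB, matching a size=4G option.
NVDIMM_SIZE=$((4 * 1024 * 1024 * 1024))

# Create a sparse file of exactly the requested size.
truncate -s "$NVDIMM_SIZE" "$BACKEND"

# Confirm the on-disk size matches before handing the file to
# -object memory-backend-file,...,size=4G
ACTUAL=$(stat -c %s "$BACKEND")
[ "$ACTUAL" -eq "$NVDIMM_SIZE" ] && echo "size matches"
```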

Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimum label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the
   file will be inaccessible to the guest. If any useful data (e.g.
   the metadata of the file system) was stored there, the latter
   usage may result in guest data corruption (e.g. breakage of the
   guest file system).
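
Since labels are carved out of the end of the backend, the
guest-visible capacity shrinks by the label size; a sketch of the
arithmetic, with an illustrative 4G backend and the minimum label
size:

```shell
# Backend created with size=4G.
BACKEND_SIZE=$((4 * 1024 * 1024 * 1024))
# label-size=128K, the minimum allowed.
LABEL_SIZE=$((128 * 1024))

# The last LABEL_SIZE bytes hold labels and are not part of the
# guest-visible persistent memory region.
GUEST_VISIBLE=$((BACKEND_SIZE - LABEL_SIZE))
echo "$GUEST_VISIBLE"
```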

Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM devices.
Similarly to RAM hotplug, vNVDIMM hotplug is accomplished by the two
monitor commands "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   a large enough number of slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar requirement applies to the memory option "-m ...,maxmem=M",
   i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices
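
The two notes above can be sketched for the example in this section,
assuming (hypothetically) one 4G RAM region, one statically plugged
4G vNVDIMM, and the 4G vNVDIMM hotplugged above:

```shell
# One device of each kind in this illustrative configuration.
RAM_DEVICES=1
STATIC_NVDIMMS=1          # nvdimm1, plugged at startup
HOTPLUG_NVDIMMS=1         # nvdimm2, added via object_add/device_add

# Slots needed across all pluggable memory devices.
N=$((RAM_DEVICES + STATIC_NVDIMMS + HOTPLUG_NVDIMMS))

# Sizes in GB; maxmem must cover the sum.
RAM_SIZE_G=4
STATIC_NVDIMM_G=4
HOTPLUG_NVDIMM_G=4
M=$((RAM_SIZE_G + STATIC_NVDIMM_G + HOTPLUG_NVDIMM_G))

# So "-m ${RAM_SIZE_G}G,slots=$N,maxmem=${M}G" leaves room for the hotplug.
echo "slots=$N maxmem=${M}G"
```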

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option
to memory-backend-file to allow users to specify the proper alignment.

For example, device dax requires 2 MB alignment, so we can use the
following QEMU command line options to use it (/dev/dax0.0) as the
backend of vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1
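
A quick sanity check before passing align=2M is that the backend size
is itself a multiple of the required alignment; a sketch with the 4G
size used above:

```shell
# align=2M requirement, in bytes.
ALIGN=$((2 * 1024 * 1024))
# size=4G, in bytes.
SIZE=$((4 * 1024 * 1024 * 1024))

# A size that is not a multiple of the alignment cannot be mapped
# at that granularity.
if [ $((SIZE % ALIGN)) -eq 0 ]; then
    echo "size is 2M-aligned"
else
    echo "size is not 2M-aligned" >&2
fi
```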

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
currently the only one that can guarantee guest write persistence is
device DAX on a real NVDIMM device (e.g., /dev/dax0.0), to which
guest accesses do not involve any host-side kernel cache.

When using other types of backends, it's suggested to set the
'unarmed' option of '-device nvdimm' to 'on', which sets the unarmed
flag of the guest NVDIMM region mapping structure. This unarmed flag
indicates to guest software that this vNVDIMM device contains a
region that cannot accept persistent writes. As a result, for
example, the guest Linux NVDIMM driver marks such a vNVDIMM device
as read-only.
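
For example, a vNVDIMM backed by a regular file (which cannot
guarantee persistence) can be marked unarmed like this; the backend
path is illustrative:

 -object memory-backend-file,id=mem1,share=on,mem-path=nvdimm.img,size=4G
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=on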

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities
Structure which allows the platform to communicate what features it
supports related to NVDIMM data persistence. Users can provide a
persistence value to a guest via the optional "nvdimm-persistence"
machine command line option:

 -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu" - The platform supports flushing dirty data from the CPU cache to
        the NVDIMMs in the event of power loss. This implies that the
        platform also supports flushing dirty data through the memory
        controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be
accessed in the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM),
it's suggested to set the 'pmem' option of memory-backend-file to
'on'. When 'pmem' is 'on' and QEMU is built with libpmem [2] support
(configured with --enable-libpmem), QEMU will take the necessary
operations to guarantee the persistence of its own writes to the
vNVDIMM backend (e.g., in vNVDIMM label emulation and live
migration). If 'pmem' is 'on' while there is no libpmem support,
QEMU will exit and report a "lack of libpmem support" message rather
than run without the persistence guarantee. For example, if we want
to ensure persistence for some backend file, use the QEMU command
line:

 -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on
190 | References | |
191 | ---------- | |
192 | ||
193 | [1] NVM Programming Model (NPM) | |
194 | Version 1.2 | |
195 | https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf | |
196 | [2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page: | |
197 | http://pmem.io/pmdk/ |