Commit | Line | Data |
---|---|---|
03feae73 KW |
1 | == General == |
2 | ||
3 | A qcow2 image file is organized in units of constant size, which are called | |
4 | (host) clusters. A cluster is the unit in which all allocations are done, | |
5 | both for actual guest data and for image metadata. | |
6 | ||
7 | Likewise, the virtual disk as seen by the guest is divided into (guest) | |
8 | clusters of the same size. | |
9 | ||
10 | All numbers in qcow2 are stored in Big Endian byte order. | |
11 | ||
12 | ||
13 | == Header == | |
14 | ||
15 | The first cluster of a qcow2 image contains the file header: | |
16 | ||
17 | Byte 0 - 3: magic | |
18 | QCOW magic string ("QFI\xfb") | |
19 | ||
20 | 4 - 7: version | |
4fabffc1 | 21 | Version number (valid values are 2 and 3) |
03feae73 KW |
22 | |
23 | 8 - 15: backing_file_offset | |
24 | Offset into the image file at which the backing file name | |
25 | is stored (NB: The string is not null terminated). 0 if the | |
26 | image doesn't have a backing file. | |
27 | ||
a50c1f57 AG |
28 | Note: backing files are incompatible with raw external data |
29 | files (auto-clear feature bit 1). | |
30 | ||
03feae73 KW |
31 | 16 - 19: backing_file_size |
32 | Length of the backing file name in bytes. Must not be | |
33 | longer than 1023 bytes. Undefined if the image doesn't have | |
34 | a backing file. | |
35 | ||
36 | 20 - 23: cluster_bits | |
37 | Number of bits that are used for addressing an offset | |
38 | within a cluster (1 << cluster_bits is the cluster size). | |
39 | Must not be less than 9 (i.e. 512 byte clusters). | |
40 | ||
41 | Note: qemu as of today has an implementation limit of 2 MB | |
42 | as the maximum cluster size and won't be able to open images | |
43 | with larger cluster sizes. | |
44 | ||
30afc120 AG |
45 | Note: if the image has Extended L2 Entries then cluster_bits |
46 | must be at least 14 (i.e. 16384 byte clusters). | |
47 | ||
03feae73 | 48 | 24 - 31: size |
d3e1a7eb EB |
49 | Virtual disk size in bytes. |
50 | ||
51 | Note: qemu has an implementation limit of 32 MB as | |
52 | the maximum L1 table size. With a 2 MB cluster | |
53 | size, it is unable to populate a virtual cluster | |
54 | beyond 2 EB (61 bits); with a 512 byte cluster | |
55 | size, it is unable to populate a virtual size | |
56 | larger than 128 GB (37 bits). Meanwhile, L1/L2 | |
57 | table layouts limit an image to no more than 64 PB | |
58 | (56 bits) of populated clusters, and an image may | |
59 | hit other limits first (such as a file system's | |
60 | maximum size). | |
03feae73 KW |
61 | |
62 | 32 - 35: crypt_method | |
63 | 0 for no encryption | |
64 | 1 for AES encryption | |
7674b575 | 65 | 2 for LUKS encryption |
03feae73 KW |
66 | |
67 | 36 - 39: l1_size | |
68 | Number of entries in the active L1 table | |
69 | ||
70 | 40 - 47: l1_table_offset | |
71 | Offset into the image file at which the active L1 table | |
72 | starts. Must be aligned to a cluster boundary. | |
73 | ||
74 | 48 - 55: refcount_table_offset | |
75 | Offset into the image file at which the refcount table | |
76 | starts. Must be aligned to a cluster boundary. | |
77 | ||
78 | 56 - 59: refcount_table_clusters | |
79 | Number of clusters that the refcount table occupies | |
80 | ||
81 | 60 - 63: nb_snapshots | |
82 | Number of snapshots contained in the image | |
83 | ||
84 | 64 - 71: snapshots_offset | |
85 | Offset into the image file at which the snapshot table | |
86 | starts. Must be aligned to a cluster boundary. | |
87 | ||
3ae3fcfa VSO |
88 | For version 2, the header is exactly 72 bytes in length, and finishes here. |
89 | For version 3 or higher, the header length is at least 104 bytes, including | |
90 | the next fields through header_length. | |
4fabffc1 KW |
91 | |
92 | 72 - 79: incompatible_features | |
93 | Bitmask of incompatible features. An implementation must | |
94 | fail to open an image if an unknown bit is set. | |
95 | ||
0f6d767a SH |
96 | Bit 0: Dirty bit. If this bit is set then refcounts |
97 | may be inconsistent; make sure to scan L1/L2 | |
98 | tables to repair refcounts before accessing the | |
99 | image. | |
100 | ||
69c98726 HR |
101 | Bit 1: Corrupt bit. If this bit is set then any data |
102 | structure may be corrupt and the image must not | |
103 | be written to (except to regain | |
104 | consistency). | |
105 | ||
65a3d073 KW |
106 | Bit 2: External data file bit. If this bit is set, an |
107 | external data file is used. Guest clusters are | |
108 | then stored in the external data file. For such | |
109 | images, clusters in the external data file are | |
110 | not refcounted. The offset field in the | |
111 | Standard Cluster Descriptor must match the | |
112 | guest offset and neither compressed clusters | |
113 | nor internal snapshots are supported. | |
114 | ||
115 | An External Data File Name header extension may | |
116 | be present if this bit is set. | |
117 | ||
66fcbca5 VSO |
118 | Bit 3: Compression type bit. If this bit is set, |
119 | a non-default compression is used for compressed | |
120 | clusters. The compression_type field must be | |
121 | present and not zero. | |
122 | ||
30afc120 AG |
123 | Bit 4: Extended L2 Entries. If this bit is set then |
124 | L2 table entries use an extended format that | |
125 | allows subcluster-based allocation. See the | |
126 | Extended L2 Entries section for more details. | |
127 | ||
128 | Bits 5-63: Reserved (set to 0) | |
4fabffc1 KW |
129 | |
130 | 80 - 87: compatible_features | |
131 | Bitmask of compatible features. An implementation can | |
132 | safely ignore any unknown bits that are set. | |
133 | ||
dae8796d SH |
134 | Bit 0: Lazy refcounts bit. If this bit is set then |
135 | lazy refcount updates can be used. This means | |
136 | marking the image file dirty and postponing | |
137 | refcount metadata updates. | |
138 | ||
139 | Bits 1-63: Reserved (set to 0) | |
4fabffc1 KW |
140 | |
141 | 88 - 95: autoclear_features | |
142 | Bitmask of auto-clear features. An implementation may only | |
143 | write to an image with unknown auto-clear features if it | |
144 | clears the respective bits from this field first. | |
145 | ||
bca5a8f4 VSO |
146 | Bit 0: Bitmaps extension bit |
147 | This bit indicates consistency for the bitmaps | |
148 | extension data. | |
149 | ||
150 | It is an error if this bit is set without the | |
151 | bitmaps extension present. | |
152 | ||
153 | If the bitmaps extension is present but this | |
154 | bit is unset, the bitmaps extension data must be | |
155 | considered inconsistent. | |
156 | ||
bb40ebce EB |
157 | Bit 1: Raw external data bit |
158 | If this bit is set, the external data file can | |
65a3d073 KW |
159 | be read as a consistent standalone raw image |
160 | without looking at the qcow2 metadata. | |
161 | ||
162 | Setting this bit has a performance impact for | |
163 | some operations on the image (e.g. writing | |
164 | zeros requires writing to the data file instead | |
165 | of only setting the zero flag in the L2 table | |
166 | entry) and conflicts with backing files. | |
167 | ||
168 | This bit may only be set if the External Data | |
169 | File bit (incompatible feature bit 2) is also | |
170 | set. | |
171 | ||
172 | Bits 2-63: Reserved (set to 0) | |
4fabffc1 KW |
173 | |
174 | 96 - 99: refcount_order | |
175 | Describes the width of a reference count block entry (width | |
6815bce5 MK |
176 | in bits: refcount_bits = 1 << refcount_order). For version 2 |
177 | images, the order is always assumed to be 4 | |
178 | (i.e. refcount_bits = 16). | |
7f75a07d | 179 | This value may not exceed 6 (i.e. refcount_bits = 64). |
4fabffc1 KW |
180 | |
181 | 100 - 103: header_length | |
182 | Length of the header structure in bytes. For version 2 | |
183 | images, the length is always assumed to be 72 bytes. | |
3ae3fcfa VSO |
184 | For version 3 it's at least 104 bytes and must be a multiple |
185 | of 8. | |
186 | ||
187 | ||
188 | === Additional fields (version 3 and higher) === | |
189 | ||
190 | In general, these fields are optional and may be safely ignored by software. | |
191 | They may also be filled with zeros (which is equivalent to field absence) if software needs | |
192 | to set field B but does not care about field A, which precedes B. More | |
193 | formally, additional fields have the following compatibility rules: | |
194 | ||
195 | 1. If the value of the additional field must not be ignored for correct | |
196 | handling of the file, it will be accompanied by a corresponding incompatible | |
197 | feature bit. | |
198 | ||
199 | 2. If there are no unrecognized incompatible feature bits set, an unknown | |
200 | additional field may be safely ignored other than preserving its value when | |
201 | rewriting the image header. | |
202 | ||
203 | 3. An explicit value of 0 will have the same behavior as when the field is not | |
204 | present*, if not altered by a specific incompatible bit. | |
205 | ||
206 | *. A field is considered not present when header_length is less than or equal | |
207 | to the field's offset. Also, all additional fields are not present for | |
208 | version 2. | |
209 | ||
66fcbca5 VSO |
210 | 104: compression_type |
211 | ||
212 | Defines the compression method used for compressed clusters. | |
213 | All compressed clusters in an image use the same compression | |
214 | type. | |
215 | ||
216 | If the incompatible bit "Compression type" is set: the field | |
217 | must be present and non-zero (which means non-zlib | |
218 | compression type). Otherwise, this field must not be present | |
219 | or must be zero (which means zlib). | |
220 | ||
221 | Available compression type values: | |
222 | 0: zlib <https://www.zlib.net/> | |
d298ac10 | 223 | 1: zstd <http://github.com/facebook/zstd> |
3ae3fcfa VSO |
224 | |
225 | ||
226 | === Header padding === | |
227 | ||
228 | @header_length must be a multiple of 8, which means that if the end of the last | |
229 | additional field is not aligned, some padding is needed. This padding must be | |
230 | zeroed, so that if some existing (or future) additional field falls into | |
231 | the padding, it will be interpreted according to point [3.] of the previous | |
232 | section, i.e. in the same manner as when this field is not present. | |
233 | ||
234 | ||
235 | === Header extensions === | |
4fabffc1 | 236 | |
03feae73 KW |
237 | Directly after the image header, optional sections called header extensions can |
238 | be stored. Each extension has a structure like the following: | |
239 | ||
240 | Byte 0 - 3: Header extension type: | |
241 | 0x00000000 - End of the header extension area | |
8098969c | 242 | 0xe2792aca - Backing file format name string |
4fabffc1 | 243 | 0x6803f857 - Feature name table |
bca5a8f4 | 244 | 0x23852875 - Bitmaps extension |
7674b575 | 245 | 0x0537be77 - Full disk encryption header pointer |
e88153ea | 246 | 0x44415441 - External data file name string |
03feae73 KW |
247 | other - Unknown header extension, can be safely |
248 | ignored | |
249 | ||
250 | 4 - 7: Length of the header extension data | |
251 | ||
252 | 8 - n: Header extension data | |
253 | ||
254 | n - m: Padding to round up the header extension size to the next | |
255 | multiple of 8. | |
256 | ||
4fabffc1 KW |
257 | Unless stated otherwise, each header extension type shall appear at most once |
258 | in the same image. | |
259 | ||
8e436ec1 MK |
260 | If the image has a backing file then the backing file name should be stored in |
261 | the remaining space between the end of the header extension area and the end of | |
262 | the first cluster. It is not allowed to store other data here, so that an | |
263 | implementation can safely modify the header and add extensions without harming | |
264 | data of compatible features that it doesn't support. Compatible features that | |
265 | need space for additional data can use a header extension. | |
4fabffc1 KW |
266 | |
267 | ||
e88153ea KW |
268 | == String header extensions == |
269 | ||
270 | Some header extensions (such as the backing file format name and the external | |
271 | data file name) are just a single string. In this case, the header extension | |
272 | length is the string length and the string is not '\0' terminated. (The header | |
273 | extension padding can make it look like a string is '\0' terminated, but | |
274 | neither is padding always necessary nor is there a guarantee that zero bytes | |
275 | are used for padding.) | |
276 | ||
277 | ||
4fabffc1 KW |
278 | == Feature name table == |
279 | ||
280 | The feature name table is an optional header extension that contains the name | |
281 | for features used by the image. It can be used by applications that don't know | |
282 | the respective feature (e.g. because the feature was introduced only later) to | |
283 | display a useful error message. | |
284 | ||
285 | The number of entries in the feature name table is determined by the length of | |
286 | the header extension data. Each entry looks like this: | |
287 | ||
288 | Byte 0: Type of feature (select feature bitmap) | |
289 | 0: Incompatible feature | |
290 | 1: Compatible feature | |
291 | 2: Autoclear feature | |
292 | ||
293 | 1: Bit number within the selected feature bitmap (valid | |
294 | values: 0-63) | |
295 | ||
296 | 2 - 47: Feature name (padded with zeros, but not necessarily null | |
297 | terminated if it has full length) | |
03feae73 KW |
298 | |
299 | ||
bca5a8f4 VSO |
300 | == Bitmaps extension == |
301 | ||
302 | The bitmaps extension is an optional header extension. It provides the ability | |
303 | to store bitmaps related to a virtual disk. For now, there is only one bitmap | |
304 | type: the dirty tracking bitmap, which tracks virtual disk changes from some | |
305 | point in time. | |
306 | ||
307 | The data of the extension should be considered consistent only if the | |
308 | corresponding auto-clear feature bit is set, see autoclear_features above. | |
309 | ||
310 | The fields of the bitmaps extension are: | |
311 | ||
312 | Byte 0 - 3: nb_bitmaps | |
313 | The number of bitmaps contained in the image. Must be | |
314 | greater than or equal to 1. | |
315 | ||
316 | Note: QEMU currently only supports up to 65535 bitmaps per | |
317 | image. | |
318 | ||
319 | 4 - 7: Reserved, must be zero. | |
320 | ||
321 | 8 - 15: bitmap_directory_size | |
322 | Size of the bitmap directory in bytes. It is the cumulative | |
b348c262 | 323 | size of all (nb_bitmaps) bitmap directory entries. |
bca5a8f4 VSO |
324 | |
325 | 16 - 23: bitmap_directory_offset | |
326 | Offset into the image file at which the bitmap directory | |
327 | starts. Must be aligned to a cluster boundary. | |
328 | ||
7674b575 DB |
329 | == Full disk encryption header pointer == |
330 | ||
331 | The full disk encryption header must be present if, and only if, the | |
332 | 'crypt_method' header requires metadata. Currently this is only true | |
333 | of the 'LUKS' crypt method. The header extension must be absent for | |
334 | other methods. | |
335 | ||
336 | This header provides the offset at which the crypt method can store | |
337 | its additional data, as well as the length of such data. | |
338 | ||
339 | Byte 0 - 7: Offset into the image file at which the encryption | |
340 | header starts in bytes. Must be aligned to a cluster | |
341 | boundary. | |
342 | Byte 8 - 15: Length of the written encryption header in bytes. | |
343 | Note that the actual space allocated in the qcow2 file may | |
344 | be larger than this value, since it will be rounded up | |
345 | to a multiple of the cluster size. Any | |
346 | unused bytes in the allocated space will be initialized | |
347 | to 0. | |
348 | ||
349 | For the LUKS crypt method, the encryption header works as follows. | |
350 | ||
351 | The first 592 bytes of the header clusters will contain the LUKS | |
352 | partition header. This is then followed by the key material data areas. | |
353 | The size of the key material data areas is determined by the number of | |
354 | stripes in the key slot and key size. Refer to the LUKS format | |
355 | specification ('docs/on-disk-format.pdf' in the cryptsetup source | |
356 | package) for details of the LUKS partition header format. | |
357 | ||
358 | In the LUKS partition header, the "payload-offset" field will be | |
359 | calculated as normal for the LUKS spec, i.e. the size of the LUKS | |
360 | header, plus key material regions, plus padding, relative to the | |
361 | start of the LUKS header. This offset value is not required to be | |
362 | qcow2 cluster aligned. Its value is currently never used in the | |
363 | context of qcow2, since the qcow2 file format itself defines where | |
364 | the real payload offset is, but nonetheless a valid payload offset | |
365 | should always be present. | |
366 | ||
367 | In the LUKS key slots header, the "key-material-offset" is relative | |
368 | to the start of the LUKS header clusters in the qcow2 container, | |
369 | not the start of the qcow2 file. | |
370 | ||
371 | Logically the layout looks like: | |
372 | ||
373 | +-----------------------------+ | |
374 | | QCow2 header | | |
375 | | QCow2 header extension X | | |
376 | | QCow2 header extension FDE | | |
377 | | QCow2 header extension ... | | |
378 | | QCow2 header extension Z | | |
379 | +-----------------------------+ | |
380 | | ....other QCow2 tables.... | | |
381 | . . | |
382 | . . | |
383 | +-----------------------------+ | |
384 | | +-------------------------+ | | |
385 | | | LUKS partition header | | | |
386 | | +-------------------------+ | | |
387 | | | LUKS key material 1 | | | |
388 | | +-------------------------+ | | |
389 | | | LUKS key material 2 | | | |
390 | | +-------------------------+ | | |
391 | | | LUKS key material ... | | | |
392 | | +-------------------------+ | | |
393 | | | LUKS key material 8 | | | |
394 | | +-------------------------+ | | |
395 | +-----------------------------+ | |
396 | | QCow2 cluster payload | | |
397 | . . | |
398 | . . | |
399 | . . | |
400 | | | | |
401 | +-----------------------------+ | |
402 | ||
403 | == Data encryption == | |
404 | ||
405 | When an encryption method is requested in the header, the image payload | |
406 | data must be encrypted/decrypted on every write/read. The image headers | |
407 | and metadata are never encrypted. | |
408 | ||
409 | The algorithms used for encryption vary depending on the method: | |
410 | ||
411 | - AES: | |
412 | ||
413 | The AES cipher, in CBC mode, with 256 bit keys. | |
414 | ||
415 | Initialization vectors are generated using the plain64 method, with | |
416 | the virtual disk sector as the input tweak. | |
417 | ||
418 | This format is no longer supported in QEMU system emulators, due | |
419 | to a number of design flaws affecting its security. It is only | |
420 | supported in the command line tools for the sake of backward compatibility | |
421 | and data liberation. | |
422 | ||
423 | - LUKS: | |
424 | ||
425 | The algorithms are specified in the LUKS header. | |
426 | ||
427 | Initialization vectors are generated using the method specified | |
428 | in the LUKS header, with the physical disk sector as the | |
429 | input tweak. | |
bca5a8f4 | 430 | |
03feae73 KW |
431 | == Host cluster management == |
432 | ||
433 | qcow2 manages the allocation of host clusters by maintaining a reference count | |
434 | for each host cluster. A refcount of 0 means that the cluster is free, 1 means | |
435 | that it is used, and >= 2 means that it is used and any write access must | |
436 | perform a COW (copy on write) operation. | |
437 | ||
438 | The refcounts are managed in a two-level table. The first level is called the | |
439 | refcount table and has a variable size (which is stored in the header). The | |
440 | refcount table can cover multiple clusters; however, it needs to be contiguous | |
441 | in the image file. | |
442 | ||
443 | It contains pointers to the second level structures which are called refcount | |
444 | blocks and are exactly one cluster in size. | |
445 | ||
d3e1a7eb EB |
446 | Although a large enough refcount table can reserve clusters past 64 PB |
447 | (56 bits) (assuming the underlying protocol can even be sized that | |
448 | large), note that some qcow2 metadata such as L1/L2 tables must point | |
449 | to clusters prior to that point. | |
450 | ||
451 | Note: qemu has an implementation limit of 8 MB as the maximum refcount | |
452 | table size. With a 2 MB cluster size and a default refcount_order of | |
453 | 4, it is unable to reference host resources beyond 2 EB (61 bits); in | |
454 | the worst case, with a 512 byte cluster size and refcount_order of 6, it is | |
455 | unable to access beyond 32 GB (35 bits). | |
456 | ||
9277d81f VS |
457 | Given an offset into the image file, the refcount of its cluster can be |
458 | obtained as follows: | |
03feae73 | 459 | |
4b318d6c | 460 | refcount_block_entries = (cluster_size * 8 / refcount_bits) |
03feae73 | 461 | |
3789985f ZYW |
462 | refcount_block_index = (offset / cluster_size) % refcount_block_entries |
463 | refcount_table_index = (offset / cluster_size) / refcount_block_entries | |
03feae73 KW |
464 | |
465 | refcount_block = load_cluster(refcount_table[refcount_table_index]); | |
466 | return refcount_block[refcount_block_index]; | |
467 | ||
468 | Refcount table entry: | |
469 | ||
470 | Bit 0 - 8: Reserved (set to 0) | |
471 | ||
472 | 9 - 63: Bits 9-63 of the offset into the image file at which the | |
473 | refcount block starts. Must be aligned to a cluster | |
474 | boundary. | |
475 | ||
476 | If this is 0, the corresponding refcount block has not yet | |
477 | been allocated. All refcounts managed by this refcount block | |
478 | are 0. | |
479 | ||
4fabffc1 | 480 | Refcount block entry (x = refcount_bits - 1): |
03feae73 | 481 | |
4fabffc1 KW |
482 | Bit 0 - x: Reference count of the cluster. If refcount_bits implies a |
483 | sub-byte width, note that bit 0 means the least significant | |
484 | bit in this context. | |
03feae73 KW |
485 | |
486 | ||
487 | == Cluster mapping == | |
488 | ||
489 | Just as for refcounts, qcow2 uses a two-level structure for the mapping of | |
490 | guest clusters to host clusters. They are called L1 and L2 tables. | |
491 | ||
492 | The L1 table has a variable size (stored in the header) and may use multiple | |
493 | clusters; however, it must be contiguous in the image file. L2 tables are | |
494 | exactly one cluster in size. | |
495 | ||
d3e1a7eb EB |
496 | The L1 and L2 tables have implications on the maximum virtual file |
497 | size; for a given L1 table size, a larger cluster size is required for | |
498 | the guest to have access to more space. Furthermore, a virtual | |
499 | cluster must currently map to a host offset below 64 PB (56 bits) | |
500 | (although this limit could be relaxed by putting reserved bits into | |
501 | use). Additionally, as cluster size increases, the maximum host | |
502 | offset for a compressed cluster is reduced (a 2M cluster size requires | |
503 | compressed clusters to reside below 512 TB (49 bits), and this limit | |
504 | cannot be relaxed without an incompatible layout change). | |
505 | ||
9277d81f | 506 | Given an offset into the virtual disk, the offset into the image file can be |
03feae73 KW |
507 | obtained as follows: |
508 | ||
30afc120 | 509 | l2_entries = (cluster_size / sizeof(uint64_t)) [*] |
03feae73 KW |
510 | |
511 | l2_index = (offset / cluster_size) % l2_entries | |
512 | l1_index = (offset / cluster_size) / l2_entries | |
513 | ||
514 | l2_table = load_cluster(l1_table[l1_index]); | |
515 | cluster_offset = l2_table[l2_index]; | |
516 | ||
517 | return cluster_offset + (offset % cluster_size) | |
518 | ||
30afc120 AG |
519 | [*] this changes if Extended L2 Entries are enabled, see next section |
520 | ||
03feae73 KW |
521 | L1 table entry: |
522 | ||
523 | Bit 0 - 8: Reserved (set to 0) | |
524 | ||
525 | 9 - 55: Bits 9-55 of the offset into the image file at which the L2 | |
526 | table starts. Must be aligned to a cluster boundary. If the | |
527 | offset is 0, the L2 table and all clusters described by this | |
528 | L2 table are unallocated. | |
529 | ||
530 | 56 - 62: Reserved (set to 0) | |
531 | ||
532 | 63: 0 for an L2 table that is unused or requires COW, 1 if its | |
533 | refcount is exactly one. This information is only accurate | |
534 | in the active L1 table. | |
535 | ||
4fabffc1 | 536 | L2 table entry: |
03feae73 | 537 | |
4fabffc1 KW |
538 | Bit 0 - 61: Cluster descriptor |
539 | ||
540 | 62: 0 for standard clusters | |
541 | 1 for compressed clusters | |
542 | ||
3c7d14b2 AG |
543 | 63: 0 for clusters that are unused, compressed or require COW. |
544 | 1 for standard clusters whose refcount is exactly one. | |
545 | This information is only accurate in L2 tables | |
546 | that are reachable from the active L1 table. | |
4fabffc1 | 547 | |
65a3d073 KW |
548 | With external data files, all guest clusters have an |
549 | implicit refcount of 1 (because of the fixed host = guest | |
550 | mapping for guest cluster offsets), so this bit should be 1 | |
551 | for all allocated clusters. | |
552 | ||
4fabffc1 KW |
553 | Standard Cluster Descriptor: |
554 | ||
555 | Bit 0: If set to 1, the cluster reads as all zeros. The host | |
556 | cluster offset can be used to describe a preallocation, | |
557 | but it won't be used for reading data from this cluster, | |
558 | nor is data read from the backing file if the cluster is | |
559 | unallocated. | |
560 | ||
30afc120 AG |
561 | With version 2 or with extended L2 entries (see the next |
562 | section), this is always 0. | |
4fabffc1 KW |
563 | |
564 | 1 - 8: Reserved (set to 0) | |
03feae73 KW |
565 | |
566 | 9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a | |
65a3d073 KW |
567 | cluster boundary. If the offset is 0 and bit 63 is clear, |
568 | the cluster is unallocated. The offset may only be 0 with | |
569 | bit 63 set (indicating a host cluster offset of 0) when an | |
570 | external data file is used. | |
03feae73 KW |
571 | |
572 | 56 - 61: Reserved (set to 0) | |
573 | ||
03feae73 | 574 | |
bf3f363a | 575 | Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)): |
03feae73 | 576 | |
156b46de | 577 | Bit 0 - x-1: Host cluster offset. This is usually _not_ aligned to a |
d3e1a7eb EB |
578 | cluster or sector boundary! If cluster_bits is |
579 | small enough that this field includes bits beyond | |
580 | 55, those upper bits must be set to 0. | |
03feae73 | 581 | |
156b46de AG |
582 | x - 61: Number of additional 512-byte sectors used for the |
583 | compressed data, beyond the sector containing the offset | |
584 | in the previous field. Some of these sectors may reside | |
585 | in the next contiguous host cluster. | |
586 | ||
587 | Note that the compressed data does not necessarily occupy | |
588 | all of the bytes in the final sector; rather, decompression | |
589 | stops when it has produced a cluster of data. | |
590 | ||
591 | Another compressed cluster may map to the tail of the final | |
592 | sector used by this compressed cluster. | |
03feae73 | 593 | |
03feae73 | 594 | If a cluster is unallocated, read requests shall read the data from the backing |
4fabffc1 KW |
595 | file (except if bit 0 in the Standard Cluster Descriptor is set). If there is |
596 | no backing file or the backing file is smaller than the image, they shall read | |
597 | zeros for all parts that are not covered by the backing file. | |
03feae73 | 598 | |
30afc120 AG |
599 | == Extended L2 Entries == |
600 | ||
601 | An image uses Extended L2 Entries if bit 4 is set on the incompatible_features | |
602 | field of the header. | |
603 | ||
604 | In these images standard data clusters are divided into 32 subclusters of the | |
605 | same size. They are contiguous and start from the beginning of the cluster. | |
606 | Subclusters can be allocated independently and the L2 entry contains information | |
607 | indicating the status of each one of them. Compressed data clusters don't have | |
608 | subclusters so they are treated the same as in images without this feature. | |
609 | ||
610 | The size of an extended L2 entry is 128 bits so the number of entries per table | |
611 | is calculated using this formula: | |
612 | ||
613 | l2_entries = (cluster_size / (2 * sizeof(uint64_t))) | |
614 | ||
615 | The first 64 bits have the same format as the standard L2 table entry described | |
616 | in the previous section, with the exception of bit 0 of the standard cluster | |
617 | descriptor. | |
618 | ||
619 | The last 64 bits contain a subcluster allocation bitmap with this format: | |
620 | ||
621 | Subcluster Allocation Bitmap (for standard clusters): | |
622 | ||
623 | Bit 0 - 31: Allocation status (one bit per subcluster) | |
624 | ||
625 | 1: the subcluster is allocated. In this case the | |
626 | host cluster offset field must contain a valid | |
627 | offset. | |
628 | 0: the subcluster is not allocated. In this case | |
629 | read requests shall go to the backing file or | |
630 | return zeros if there is no backing file data. | |
631 | ||
632 | Bits are assigned starting from the least significant | |
633 | one (i.e. bit x is used for subcluster x). | |
634 | ||
635 | 32 - 63: Subcluster reads as zeros (one bit per subcluster) | |
636 | ||
637 | 1: the subcluster reads as zeros. In this case the | |
638 | allocation status bit must be unset. The host | |
639 | cluster offset field may or may not be set. | |
640 | 0: no effect. | |
641 | ||
642 | Bits are assigned starting from the least significant | |
643 | one (i.e. bit x is used for subcluster x - 32). | |
644 | ||
645 | Subcluster Allocation Bitmap (for compressed clusters): | |
646 | ||
647 | Bit 0 - 63: Reserved (set to 0) | |
648 | Compressed clusters don't have subclusters, | |
649 | so this field is not used. | |
03feae73 KW |
650 | |
651 | == Snapshots == | |
652 | ||
653 | qcow2 supports internal snapshots. Their basic principle of operation is to | |
654 | switch the active L1 table, so that a different set of host clusters is | |
655 | exposed to the guest. | |
656 | ||
657 | When creating a snapshot, the L1 table should be copied and the refcount of all | |
3789985f | 658 | L2 tables and clusters reachable from this L1 table must be increased, so that |
03feae73 KW |
659 | a write causes a COW and isn't visible in other snapshots. |
660 | ||
661 | When loading a snapshot, bit 63 of all entries in the new active L1 table and | |
662 | all L2 tables referenced by it must be reconstructed from the refcount table | |
663 | as it doesn't need to be accurate in inactive L1 tables. | |
664 | ||
665 | A directory of all snapshots is stored in the snapshot table, a contiguous area | |
666 | in the image file, whose starting offset and length are given by the header | |
667 | fields snapshots_offset and nb_snapshots. The entries of the snapshot table | |
668 | have variable length, depending on the length of ID, name and extra data. | |
669 | ||
670 | Snapshot table entry: | |
671 | ||
672 | Byte 0 - 7: Offset into the image file at which the L1 table for the | |
673 | snapshot starts. Must be aligned to a cluster boundary. | |
674 | ||
675 | Number of entries in the L1 table of the snapshot | |
676 | ||
677 | 12 - 13: Length of the unique ID string describing the snapshot | |
678 | ||
679 | 14 - 15: Length of the name of the snapshot | |
680 | ||
681 | 16 - 19: Time at which the snapshot was taken in seconds since the | |
682 | Epoch | |
683 | ||
684 | 20 - 23: Subsecond part of the time at which the snapshot was taken | |
685 | in nanoseconds | |
686 | ||
687 | 24 - 31: Time that the guest was running until the snapshot was | |
688 | taken in nanoseconds | |
689 | ||
690 | 32 - 35: Size of the VM state in bytes. 0 if no VM state is saved. | |
691 | If there is VM state, it starts at the first cluster | |
692 | described by the first L1 table entry that doesn't describe a | |
693 | regular guest cluster (i.e. VM state is stored like guest | |
694 | disk content, except that it is stored at offsets that are | |
695 | larger than the virtual disk presented to the guest) | |
696 | ||
697 | 36 - 39: Size of extra data in the table entry (used for future | |
698 | extensions of the format) | |
699 | ||
c2c9a466 KW |
700 | variable: Extra data for future extensions. Unknown fields must be |
701 | ignored. Currently defined are (offset relative to snapshot | |
702 | table entry): | |
703 | ||
704 | Byte 40 - 47: Size of the VM state in bytes. 0 if no VM | |
705 | state is saved. If this field is present, | |
706 | the 32-bit value in bytes 32-35 is ignored. | |
03feae73 | 707 | |
4fabffc1 KW |
708 | Byte 48 - 55: Virtual disk size of the snapshot in bytes |
709 | ||
bbacffc5 PD |
710 | Byte 56 - 63: icount value which corresponds to |
711 | the record/replay instruction count | |
712 | when the snapshot was taken. Set to -1 | |
713 | if icount was disabled | |
714 | ||
4fabffc1 KW |
715 | Version 3 images must include extra data at least up to |
716 | byte 55. | |
717 | ||
03feae73 KW |
718 | variable: Unique ID string for the snapshot (not null terminated) |
719 | ||
720 | variable: Name of the snapshot (not null terminated) | |
f2520804 HR |
721 | |
722 | variable: Padding to round up the snapshot table entry size to the | |
723 | next multiple of 8. | |
bca5a8f4 VSO |
724 | |
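The snapshot table entry layout above can be illustrated with a short parser. This is only a sketch under stated assumptions: the function name parse_snapshot_entry, the dictionary keys, and the choice to read the icount field as a signed integer (so that the value -1 reads back as -1) are illustrative and not part of the format.

```python
import struct

# Fixed 40-byte part of a snapshot table entry, all fields big endian:
# L1 table offset (u64), L1 entries (u32), ID length (u16), name length (u16),
# seconds (u32), nanoseconds (u32), guest run time in ns (u64),
# 32-bit VM state size (u32), extra data size (u32).
FIXED = struct.Struct(">QIHHIIQII")


def parse_snapshot_entry(buf, pos=0):
    """Parse the entry starting at buf[pos]; return (entry, next_pos)."""
    (l1_offset, l1_size, id_len, name_len, date_sec, date_nsec,
     vm_clock_nsec, vm_state_size, extra_size) = FIXED.unpack_from(buf, pos)
    cur = pos + FIXED.size
    extra = bytes(buf[cur:cur + extra_size])
    cur += extra_size

    entry = {
        "l1_offset": l1_offset,
        "l1_size": l1_size,
        "date_sec": date_sec,
        "date_nsec": date_nsec,
        "vm_clock_nsec": vm_clock_nsec,
        "vm_state_size": vm_state_size,
    }
    # Known extra data fields; anything beyond them is ignored.
    if extra_size >= 8:
        # The 64-bit VM state size overrides the 32-bit one in bytes 32 - 35.
        entry["vm_state_size"] = struct.unpack_from(">Q", extra, 0)[0]
    if extra_size >= 16:
        entry["disk_size"] = struct.unpack_from(">Q", extra, 8)[0]
    if extra_size >= 24:
        # Assumption of this sketch: read as signed so "disabled" is -1.
        entry["icount"] = struct.unpack_from(">q", extra, 16)[0]

    entry["id"] = bytes(buf[cur:cur + id_len])
    cur += id_len
    entry["name"] = bytes(buf[cur:cur + name_len])
    cur += name_len

    # The entry size is rounded up to the next multiple of 8.
    next_pos = pos + (cur - pos + 7) // 8 * 8
    return entry, next_pos
```

Walking the whole table would then mean calling the function nb_snapshots times, starting at snapshots_offset and feeding each returned next_pos back in.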
725 | ||
726 | == Bitmaps == | |
727 | ||
728 | As mentioned above, the bitmaps extension provides the ability to store bitmaps | |
729 | related to a virtual disk. This section describes how these bitmaps are stored. | |
730 | ||
731 | All stored bitmaps are related to the virtual disk stored in the same image, so | |
732 | each bitmap size is equal to the virtual disk size. | |
733 | ||
734 | Each bit of the bitmap is responsible for a strictly defined range of the virtual | |
735 | disk. For bit number bit_nr the corresponding range (in bytes) will be: | |
736 | ||
737 | [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1] | |
738 | ||
739 | Granularity is a property of the concrete bitmap, see below. | |
740 | ||
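As a small worked example of the range formula above, the following sketch returns the first and last byte covered by a given bit. The function name bit_range is hypothetical; bitmap_granularity is passed in as a plain number of bytes.

```python
def bit_range(bit_nr, bitmap_granularity):
    """Return (first_byte, last_byte) of the virtual disk range for bit_nr."""
    first = bit_nr * bitmap_granularity
    last = (bit_nr + 1) * bitmap_granularity - 1
    return first, last


# With a granularity of 64 KiB, bit 3 covers bytes 196608 .. 262143.
print(bit_range(3, 65536))
```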
741 | ||
742 | === Bitmap directory === | |
743 | ||
744 | Each bitmap saved in the image is described in a bitmap directory entry. The | |
745 | bitmap directory is a contiguous area in the image file, whose starting offset | |
746 | and length are given by the header extension fields bitmap_directory_offset and | |
747 | bitmap_directory_size. The entries of the bitmap directory have variable | |
b348c262 | 748 | length, depending on the lengths of the bitmap name and extra data. |
bca5a8f4 VSO |
749 | |
750 | Structure of a bitmap directory entry: | |
751 | ||
752 | Byte 0 - 7: bitmap_table_offset | |
753 | Offset into the image file at which the bitmap table | |
754 | (described below) for the bitmap starts. Must be aligned to | |
755 | a cluster boundary. | |
756 | ||
757 | 8 - 11: bitmap_table_size | |
758 | Number of entries in the bitmap table of the bitmap. | |
759 | ||
760 | 12 - 15: flags | |
761 | Bit | |
762 | 0: in_use | |
763 | The bitmap was not saved correctly and may be | |
2fd490c6 VSO |
764 | inconsistent. Although the bitmap metadata is still |
765 | well-formed from a qcow2 perspective, the metadata | |
766 | (such as the auto flag or bitmap size) or data | |
767 | contents may be outdated. | |
bca5a8f4 VSO |
768 | |
769 | 1: auto | |
770 | The bitmap must reflect all changes of the virtual | |
771 | disk by any application that would write to this qcow2 | |
772 | file (including writes, snapshot switching, etc.). The | |
773 | type of this bitmap must be 'dirty tracking bitmap'. | |
774 | ||
775 | 2: extra_data_compatible | |
776 | This flag is meaningful when the extra data is | |
777 | unknown to the software (currently any extra data is | |
778 | unknown to Qemu). | |
779 | If it is set, the bitmap may be used as expected, extra | |
780 | data must be left as is. | |
781 | If it is not set, the bitmap must not be used, but | |
782 | both it and its extra data must be left as is. | |
783 | ||
784 | Bits 3 - 31 are reserved and must be 0. | |
785 | ||
786 | 16: type | |
787 | This field describes the sort of the bitmap. | |
788 | Values: | |
789 | 1: Dirty tracking bitmap | |
790 | ||
791 | Values 0, 2 - 255 are reserved. | |
792 | ||
793 | 17: granularity_bits | |
794 | Granularity bits. Valid values: 0 - 63. | |
795 | ||
b5d1f154 | 796 | Note: Qemu currently supports only values 9 - 31. |
bca5a8f4 VSO |
797 | |
798 | Granularity is calculated as | |
799 | granularity = 1 << granularity_bits | |
800 | ||
801 | A bitmap's granularity is how many bytes of the image | |
802 | account for one bit of the bitmap. | |
803 | ||
804 | 18 - 19: name_size | |
805 | Size of the bitmap name. Must be non-zero. | |
806 | ||
807 | Note: Qemu currently doesn't support values greater than | |
808 | 1023. | |
809 | ||
810 | 20 - 23: extra_data_size | |
811 | Size of type-specific extra data. | |
812 | ||
813 | For now, as no extra data is defined, extra_data_size is | |
814 | reserved and should be zero. If it is non-zero the | |
815 | behavior is defined by extra_data_compatible flag. | |
816 | ||
817 | variable: extra_data | |
818 | Extra data for the bitmap, occupying extra_data_size bytes. | |
819 | Extra data must never contain references to clusters or in | |
820 | some other way allocate additional clusters. | |
821 | ||
822 | variable: name | |
823 | The name of the bitmap (not null terminated), occupying | |
824 | name_size bytes. Must be unique among all bitmap names | |
825 | within the bitmaps extension. | |
826 | ||
827 | variable: Padding to round up the bitmap directory entry size to the | |
828 | next multiple of 8. All bytes of the padding must be zero. | |
829 | ||
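A bitmap directory entry as described above could be parsed roughly as follows. This is a sketch, not a reference implementation: the function and constant names are illustrative, raising ValueError on reserved values is one possible policy, and the cluster alignment of bitmap_table_offset is not checked because the cluster size is not passed in.

```python
import struct

# Fixed 24-byte part of a bitmap directory entry, all fields big endian:
# bitmap_table_offset (u64), bitmap_table_size (u32), flags (u32),
# type (u8), granularity_bits (u8), name_size (u16), extra_data_size (u32).
DIR_FIXED = struct.Struct(">QIIBBHI")

FLAG_IN_USE = 1 << 0
FLAG_AUTO = 1 << 1
FLAG_EXTRA_DATA_COMPATIBLE = 1 << 2
RESERVED_FLAGS = 0xFFFFFFFF & ~0x7   # bits 3 - 31
TYPE_DIRTY_TRACKING = 1


def parse_bitmap_dir_entry(buf, pos=0):
    """Parse the entry starting at buf[pos]; return (entry, next_pos)."""
    (table_offset, table_size, flags, bitmap_type, granularity_bits,
     name_size, extra_size) = DIR_FIXED.unpack_from(buf, pos)

    if flags & RESERVED_FLAGS:
        raise ValueError("reserved flag bits are set")
    if bitmap_type != TYPE_DIRTY_TRACKING:
        raise ValueError("reserved bitmap type")
    if granularity_bits > 63:
        raise ValueError("granularity_bits out of range")
    if name_size == 0:
        raise ValueError("name_size must be non-zero")

    cur = pos + DIR_FIXED.size
    extra = bytes(buf[cur:cur + extra_size])
    cur += extra_size
    name = bytes(buf[cur:cur + name_size])
    cur += name_size

    # The entry size is rounded up to a multiple of 8 with zero padding.
    next_pos = pos + (cur - pos + 7) // 8 * 8
    if any(buf[cur:next_pos]):
        raise ValueError("padding bytes must be zero")

    entry = {
        "bitmap_table_offset": table_offset,
        "bitmap_table_size": table_size,
        "in_use": bool(flags & FLAG_IN_USE),
        "auto": bool(flags & FLAG_AUTO),
        # No extra data is defined, so any extra data is unknown: the
        # bitmap is usable only if the compatible flag allows it.
        "usable": extra_size == 0 or bool(flags & FLAG_EXTRA_DATA_COMPATIBLE),
        "granularity": 1 << granularity_bits,
        "extra_data": extra,
        "name": name,
    }
    return entry, next_pos
```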
830 | ||
831 | === Bitmap table === | |
832 | ||
833 | Each bitmap is stored using a one-level structure (as opposed to two-level | |
834 | structures like for refcounts and guest clusters mapping) for the mapping of | |
835 | bitmap data to host clusters. This structure is called the bitmap table. | |
836 | ||
837 | Each bitmap table has a variable size (stored in the bitmap directory entry) | |
838 | and may use multiple clusters, however, it must be contiguous in the image | |
839 | file. | |
840 | ||
841 | Structure of a bitmap table entry: | |
842 | ||
843 | Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero. | |
844 | If bits 9 - 55 are zero: | |
845 | 0: Cluster should be read as all zeros. | |
846 | 1: Cluster should be read as all ones. | |
847 | ||
848 | 1 - 8: Reserved and must be zero. | |
849 | ||
850 | 9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to | |
851 | a cluster boundary. If the offset is 0, the cluster is | |
852 | unallocated; in that case, bit 0 determines how this | |
853 | cluster should be treated during reads. | |
854 | ||
855 | 56 - 63: Reserved and must be zero. | |
856 | ||
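The bitmap table entry layout above can be decoded with two masks. The sketch below is illustrative only: the names and the returned tuples are assumptions, and rejecting malformed entries with ValueError is one possible policy.

```python
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)          # bits 9 - 55
RESERVED_MASK = ((1 << 64) - 1) & ~(OFFSET_MASK | 1)     # bits 1 - 8, 56 - 63


def decode_bitmap_table_entry(entry, cluster_size):
    """Classify a 64-bit bitmap table entry.

    Returns ("allocated", host_offset), ("all_zeros", None)
    or ("all_ones", None).
    """
    if entry & RESERVED_MASK:
        raise ValueError("reserved bits are set")
    offset = entry & OFFSET_MASK
    if offset:
        if entry & 1:
            raise ValueError("bit 0 must be zero for allocated clusters")
        if offset % cluster_size:
            raise ValueError("offset is not cluster aligned")
        return "allocated", offset
    # Unallocated cluster: bit 0 selects how it reads.
    return ("all_ones" if entry & 1 else "all_zeros"), None
```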
857 | ||
858 | === Bitmap data === | |
859 | ||
860 | As noted above, bitmap data is stored in separate clusters, described by the | |
861 | bitmap table. Given an offset (in bytes) into the bitmap data, the offset into | |
862 | the image file can be obtained as follows: | |
863 | ||
864 | image_offset(bitmap_data_offset) = | |
865 | bitmap_table[bitmap_data_offset / cluster_size] + | |
866 | (bitmap_data_offset % cluster_size) | |
867 | ||
868 | This offset is not defined if bits 9 - 55 of the bitmap table entry are zero (see | |
869 | above). | |
870 | ||
871 | Given an offset byte_nr into the virtual disk and the bitmap's granularity, the | |
872 | bit offset into the image file of the corresponding bit of the bitmap can be | |
873 | calculated like this: | |
874 | ||
875 | bit_offset(byte_nr) = | |
876 | image_offset(byte_nr / granularity / 8) * 8 + | |
877 | (byte_nr / granularity) % 8 | |
878 | ||
879 | If the size of the bitmap data is not a multiple of the cluster size then the | |
880 | last cluster of the bitmap data contains some unused tail bits. These bits must | |
881 | be zero. | |
882 | ||
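The two formulas above translate directly into code. In this sketch the bitmap table is assumed to be already loaded as a list of 64-bit integers, the function signatures are illustrative, and None is returned where the offset is not defined (bits 9 - 55 of the table entry are zero).

```python
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)   # bits 9 - 55 of an entry


def image_offset(bitmap_table, bitmap_data_offset, cluster_size):
    """Map a byte offset into the bitmap data to a byte offset in the image."""
    entry = bitmap_table[bitmap_data_offset // cluster_size]
    host_offset = entry & OFFSET_MASK
    if host_offset == 0:
        return None   # unallocated cluster: no image offset exists
    return host_offset + bitmap_data_offset % cluster_size


def bit_offset(bitmap_table, byte_nr, granularity, cluster_size):
    """Map a byte offset into the virtual disk to a bit offset in the image."""
    bit_nr = byte_nr // granularity
    offset = image_offset(bitmap_table, bit_nr // 8, cluster_size)
    if offset is None:
        return None
    return offset * 8 + bit_nr % 8
```

For example, with 64 KiB clusters and a 64 KiB granularity, one cluster of bitmap data holds 524288 bits, so bit number 524291 lives in the second bitmap table entry at bit 3 of its first byte.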
883 | ||
884 | === Dirty tracking bitmaps === | |
885 | ||
886 | Bitmaps with 'type' field equal to one are dirty tracking bitmaps. | |
887 | ||
888 | When the virtual disk is in use, a dirty tracking bitmap may be 'enabled' or | |
889 | 'disabled'. While the bitmap is 'enabled', all writes to the virtual disk | |
890 | should be reflected in the bitmap. A set bit in the bitmap means that the | |
891 | corresponding range of the virtual disk (see above) was written to while the | |
892 | bitmap was 'enabled'. An unset bit means that this range was not written to. | |
893 | ||
894 | The software doesn't have to sync the bitmap in the image file with its | |
2fd490c6 VSO |
895 | representation in RAM after each write or metadata change. Flag 'in_use' |
896 | should be set while the bitmap is not synced. | |
bca5a8f4 VSO |
897 | |
898 | In the image file the 'enabled' state is reflected by the 'auto' flag. If this | |
899 | flag is set, the software must consider the bitmap as 'enabled' and start | |
900 | tracking virtual disk changes to this bitmap from the first write to the | |
901 | virtual disk. If this flag is not set then the bitmap is disabled. |
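
The rules above can be summarized in a short sketch. Everything here is illustrative: the function names are hypothetical, treating an 'in_use' bitmap as a separate "inconsistent" state is one possible policy, and the order of bits within a byte of the in-RAM bitmap (least significant bit first) is an assumption of this example.

```python
FLAG_IN_USE = 1 << 0
FLAG_AUTO = 1 << 1


def state_on_open(flags):
    """Derive the initial state of a dirty tracking bitmap from its flags."""
    if flags & FLAG_IN_USE:
        # The bitmap was not saved correctly; its contents may be outdated.
        return "inconsistent"
    return "enabled" if flags & FLAG_AUTO else "disabled"


def mark_written(bitmap, offset, length, granularity):
    """Set the bits for a guest write of `length` bytes at `offset`.

    `bitmap` is a bytearray holding the in-RAM copy of the bitmap.
    """
    if length <= 0:
        return
    first = offset // granularity
    last = (offset + length - 1) // granularity
    for bit_nr in range(first, last + 1):
        bitmap[bit_nr // 8] |= 1 << (bit_nr % 8)
```

While the in-RAM copy differs from the copy in the image file, the 'in_use' flag in the directory entry would stay set; it would be cleared only after the bitmap data has been written back.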