Commit | Line | Data |
---|---|---|
68365a38 WC |
1 | Block replication |
2 | ---------------------------------------- | |
3 | Copyright Fujitsu, Corp. 2016 | |
4 | Copyright (c) 2016 Intel Corporation | |
5 | Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. | |
6 | ||
7 | This work is licensed under the terms of the GNU GPL, version 2 or later. | |
8 | See the COPYING file in the top-level directory. | |
9 | ||
10 | Block replication is used for continuous checkpoints. It is designed | |
11 | for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. | |
12 | It can also be applied to FT/HA (Fault-tolerance/High Availability) scenarios, | |
13 | where the Secondary VM is not running. | |
14 | ||
15 | This document gives an overview of block replication's design. | |
16 | ||
17 | == Background == | |
18 | High availability solutions such as micro checkpoint and COLO will do | |
19 | consecutive checkpoints. The VM state of the Primary and Secondary VM is | |
20 | identical right after a VM checkpoint, but becomes different as the VM | |
21 | executes till the next checkpoint. To support disk contents checkpoint, | |
22 | the modified disk contents in the Secondary VM must be buffered, and are | |
23 | only dropped at next checkpoint time. To reduce the network transportation | |
24 | effort during a vmstate checkpoint, the disk modification operations of | |
25 | the Primary disk are asynchronously forwarded to the Secondary node. | |
26 | ||
27 | == Workflow == | |
28 | The following diagram shows the block replication workflow: | |
29 | ||
30 | +----------------------+ +------------------------+ | |
31 | |Primary Write Requests| |Secondary Write Requests| | |
32 | +----------------------+ +------------------------+ | |
33 | | | | |
34 | | (4) | |
35 | | V | |
36 | | /-------------\ | |
37 | | Copy and Forward | | | |
38 | |---------(1)----------+ | Disk Buffer | | |
39 | | | | | | |
40 | | (3) \-------------/ | |
41 | | speculative ^ | |
42 | | write through (2) | |
43 | | | | | |
44 | V V | | |
45 | +--------------+ +----------------+ | |
46 | | Primary Disk | | Secondary Disk | | |
47 | +--------------+ +----------------+ | |
48 | ||
49 | 1) Primary write requests will be copied and forwarded to Secondary | |
50 | QEMU. | |
51 | 2) Before Primary write requests are written to Secondary disk, the | |
52 | original sector content will be read from Secondary disk and | |
53 | buffered in the Disk buffer, but it will not overwrite the existing | |
54 | sector content (it could be from either "Secondary Write Requests" or | |
55 | previous COW of "Primary Write Requests") in the Disk buffer. | |
56 | 3) Primary write requests will be written to Secondary disk. | |
57 | 4) Secondary write requests will be buffered in the Disk buffer and | |
58 | will overwrite the existing sector content in the buffer. | |
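
The four steps above can be modelled with a small in-memory sketch. This is
only an illustration of the buffering rules, not QEMU code: the class name
ReplicatedDisk, its methods, and the use of dictionaries for sectors are all
assumptions made for this example.

```python
class ReplicatedDisk:
    """Toy model of the Secondary side: a disk plus a per-sector Disk buffer."""

    def __init__(self, sectors):
        self.secondary_disk = dict(sectors)  # sector number -> content
        self.disk_buffer = {}                # sector number -> buffered content

    def primary_write(self, sector, data):
        # Step 2: save the original content first, but never overwrite
        # a sector that is already present in the buffer.
        if sector not in self.disk_buffer:
            self.disk_buffer[sector] = self.secondary_disk.get(sector)
        # Step 3: speculative write-through to the Secondary disk.
        self.secondary_disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes only touch the buffer and always overwrite.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # The Secondary VM sees buffered content first, then the disk.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.secondary_disk.get(sector)

    def checkpoint(self):
        # At checkpoint time the buffered content is dropped.
        self.disk_buffer.clear()
```

Between checkpoints the Secondary VM keeps seeing its own view of the disk
even though Primary writes already landed on the Secondary disk; after
checkpoint() both sides see the Primary's content.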
59 | ||
60 | == Architecture == | |
61 | Block replication is built from basic building blocks that already | |
62 | exist in QEMU. | |
63 | ||
64 | virtio-blk || | |
65 | ^ || .---------- | |
66 | | || | Secondary | |
67 | 1 Quorum || '---------- | |
90dfe59b LS |
68 | / \ || virtio-blk |
69 | / \ || ^ | |
70 | Primary 2 filter | | |
71 | disk ^ 7 Quorum | |
72 | | / | |
73 | 3 NBD -------> 3 NBD / | |
68365a38 WC |
74 | client || server 2 filter |
75 | || ^ ^ | |
76 | --------. || | | | |
77 | Primary | || Secondary disk <--------- hidden-disk 5 <--------- active-disk 4 | |
78 | --------' || | backing ^ backing | |
79 | || | | | |
80 | || | | | |
81 | || '-------------------------' | |
82 | || drive-backup sync=none 6 | |
83 | ||
84 | 1) The disk on the primary is represented by a block device with two | |
85 | children, providing replication between a primary disk and the host that | |
86 | runs the secondary VM. The read pattern (fifo) for quorum can be extended | |
87 | to make the primary always read from the local disk instead of going through | |
88 | NBD. | |
89 | ||
90 | 2) The new block filter (named replication) controls block | |
91 | replication. | |
92 | ||
93 | 3) The secondary disk receives writes from the primary VM through QEMU's | |
94 | embedded NBD server (speculative write-through). | |
95 | ||
96 | 4) The disk on the secondary is represented by a custom block device | |
97 | (called active-disk). It should start as an empty disk, and the format | |
98 | should support bdrv_make_empty() and backing file. | |
99 | ||
100 | 5) The hidden-disk is created automatically. It buffers the original content | |
101 | that is modified by the primary VM. It should also start as an empty disk, | |
102 | and its driver should support bdrv_make_empty() and backing file. | |
103 | ||
104 | 6) The drive-backup job (sync=none) is run to allow hidden-disk to buffer | |
105 | any state that would otherwise be lost by the speculative write-through | |
106 | of the NBD server into the secondary disk. So before block replication, | |
107 | the primary disk and secondary disk should contain the same data. | |
108 | ||
90dfe59b LS |
109 | 7) The secondary also has a quorum node, so after secondary failover it |
110 | can become the new primary and continue replication. | |
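
As a rough illustration of the backing chain on the secondary (active-disk
backed by hidden-disk backed by the secondary disk), a read can be thought
of as falling through the layers from top to bottom. The function name and
the dictionary representation below are assumptions for this sketch and do
not correspond to QEMU internals.

```python
def read_sector(sector, active_disk, hidden_disk, secondary_disk):
    """Resolve a Secondary VM read through the backing chain.

    Each layer is a dict mapping sector number -> content; a sector missing
    from a layer means "not allocated here, ask the backing file".
    """
    for layer in (active_disk, hidden_disk, secondary_disk):
        if sector in layer:
            return layer[sector]
    return None  # not allocated anywhere in the chain
```

For example, with active_disk = {1: "S1"}, hidden_disk = {0: "A"} and
secondary_disk = {0: "P0", 1: "P1", 2: "C"}, sector 1 comes from the
active-disk, sector 0 from the hidden-disk, and sector 2 from the secondary
disk.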
111 | ||
112 | ||
68365a38 WC |
113 | == Failure Handling == |
114 | Seven internal errors can occur while block replication is running: | |
115 | 1. I/O error on primary disk | |
116 | 2. Forwarding primary write requests failed | |
117 | 3. Backup failed | |
118 | 4. I/O error on secondary disk | |
119 | 5. I/O error on active disk | |
120 | 6. Making active disk or hidden disk empty failed | |
121 | 7. Doing failover failed | |
122 | In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3, | |
123 | 4 and 6, we just report block replication's error to the FT/HA manager (which | |
124 | decides when to do a new checkpoint and when to do failover). | |
125 | In case 7, if active commit failed, we use the replication failover failed | |
126 | state in Secondary's write operation (it decides which target to write to). | |
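
The mapping from error case to reaction can be summarised in a few lines.
This is a sketch only: the enum member names and the returned strings are
invented for illustration and are not identifiers from QEMU.

```python
from enum import Enum


class ReplicationError(Enum):
    # Numbering follows the list of internal errors above.
    PRIMARY_DISK_IO = 1
    FORWARD_PRIMARY_WRITE = 2
    BACKUP = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY = 6
    FAILOVER = 7


def reaction(error):
    """Return who handles the given error, as described in the text."""
    if error in (ReplicationError.PRIMARY_DISK_IO,
                 ReplicationError.ACTIVE_DISK_IO):
        return "report to disk layer"
    if error is ReplicationError.FAILOVER:
        return "use failover failed state in Secondary's write operation"
    # Cases 2, 3, 4 and 6.
    return "report to FT/HA manager"
```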
127 | ||
128 | == New block driver interface == | |
129 | We add four block driver interfaces to control block replication: | |
130 | a. replication_start_all() | |
131 | Start block replication, called in migration/checkpoint thread. | |
132 | We must call replication_start_all() in secondary QEMU before | |
133 | calling replication_start_all() in primary QEMU. The caller | |
134 | must hold the I/O mutex lock if it is in migration/checkpoint | |
135 | thread. | |
136 | b. replication_do_checkpoint_all() | |
137 | This interface is called after all VM state is transferred to | |
138 | Secondary QEMU. The Disk buffer will be dropped in this interface. | |
139 | The caller must hold the I/O mutex lock if it is in migration/checkpoint | |
140 | thread. | |
141 | c. replication_get_error_all() | |
142 | This interface is called to check if error happened in replication. | |
143 | The caller must hold the I/O mutex lock if it is in migration/checkpoint | |
144 | thread. | |
145 | d. replication_stop_all() | |
146 | It is called on failover. We will flush the Disk buffer into | |
147 | Secondary Disk and stop block replication. The VM should be stopped | |
148 | before calling it if you use this API to shut down the guest, or for | |
149 | anything other than failover. The caller must hold the I/O mutex lock if | |
150 | it is in migration/checkpoint thread. | |
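
The ordering constraints among these interfaces can be illustrated with a
small simulation. The Node class and run() helper are hypothetical stand-ins
that only record calls; the real interfaces are C functions inside QEMU.
Calling the checkpoint interface on both nodes, and doing so on the
secondary first, is an assumption of this sketch. What to do on error is
left to the FT/HA manager, so the sketch simply returns False.

```python
class Node:
    """Hypothetical stand-in for one QEMU instance; records calls in a log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.error = False

    def replication_start_all(self):
        self.log.append((self.name, "start"))

    def replication_get_error_all(self):
        self.log.append((self.name, "get_error"))
        return self.error

    def replication_do_checkpoint_all(self):
        self.log.append((self.name, "do_checkpoint"))

    def replication_stop_all(self):
        self.log.append((self.name, "stop"))


def run(primary, secondary, checkpoints):
    """Start replication and take checkpoints; True if all of them completed."""
    secondary.replication_start_all()  # the secondary must be started first
    primary.replication_start_all()
    for _ in range(checkpoints):
        if (primary.replication_get_error_all()
                or secondary.replication_get_error_all()):
            return False
        # (All VM state would be transferred to the secondary here.)
        secondary.replication_do_checkpoint_all()
        primary.replication_do_checkpoint_all()
    return True
```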
151 | ||
152 | == Usage == | |
153 | Primary: | |
154 | -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\ | |
155 | children.0.file.filename=1.raw,\ | |
156 | children.0.driver=raw | |
157 | ||
158 | Run qmp command in primary qemu: | |
159 | { 'execute': 'human-monitor-command', | |
160 | 'arguments': { | |
161 | 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1' | |
162 | } | |
163 | } | |
164 | { 'execute': 'x-blockdev-change', | |
165 | 'arguments': { | |
166 | 'parent': 'colo1', | |
167 | 'node': 'nbd_client1' | |
168 | } | |
169 | } | |
170 | Note: | |
171 | 1. There should be only one NBD Client for each primary disk. | |
172 | 2. host is the secondary physical machine's hostname or IP | |
173 | 3. Each disk must have its own export name. | |
174 | 4. It is all a single argument to -drive and you should ignore the | |
175 | leading whitespace. | |
176 | 5. The qmp commands must be run after the qmp commands in secondary | |
177 | qemu have been run. | |
90dfe59b | 178 | 6. After primary failover we need to remove children.1 (replication driver). |
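
The two primary-side QMP commands can also be built programmatically, which
avoids quoting mistakes in the long drive_add line. The helper below is a
sketch: the function name and its parameters are made up for this example,
and the host and port values passed in are placeholders for the elided xxxx
values above. It only builds the JSON; sending it over a QMP socket is not
shown.

```python
import json


def primary_qmp_commands(host, port, export="colo1",
                         parent="colo1", node_name="nbd_client1"):
    """Return the two QMP commands from the Primary usage section as dicts."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        "file.driver=nbd,file.host={host},file.port={port},"
        "file.export={export},node-name={node}"
    ).format(host=host, port=port, export=export, node=node_name)
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


def to_wire(commands):
    """Serialize each command as one line of strict JSON."""
    return [json.dumps(cmd) for cmd in commands]
```

Note that json.dumps() emits double-quoted strings, unlike the single-quoted
notation used in this document.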
68365a38 WC |
179 | |
180 | Secondary: | |
181 | -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \ | |
90dfe59b | 182 | -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=childs1,\ |
68365a38 WC |
183 | file.file.filename=active_disk.qcow2,\ |
184 | file.driver=qcow2,\ | |
185 | file.backing.file.filename=hidden_disk.qcow2,\ | |
186 | file.backing.driver=qcow2,\ | |
187 | file.backing.backing=colo1 \ | |
90dfe59b LS |
188 | -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\ |
189 | vote-threshold=1,children.0=childs1 | |
68365a38 WC |
190 | |
191 | Then run qmp command in secondary qemu: | |
192 | { 'execute': 'nbd-server-start', | |
193 | 'arguments': { | |
194 | 'addr': { | |
195 | 'type': 'inet', | |
196 | 'data': { | |
197 | 'host': 'xxx', | |
198 | 'port': 'xxx' | |
199 | } | |
200 | } | |
201 | } | |
202 | } | |
203 | { 'execute': 'nbd-server-add', | |
204 | 'arguments': { | |
205 | 'device': 'colo1', | |
206 | 'writable': true | |
207 | } | |
208 | } | |
209 | ||
210 | Note: | |
211 | 1. The export name in secondary QEMU command line is the secondary | |
212 | disk's id. | |
213 | 2. The export name for the same disk must be the same on primary and secondary. | |
214 | 3. The qmp command nbd-server-start and nbd-server-add must be run | |
215 | before running the qmp command migrate on primary QEMU | |
216 | 4. Active disk, hidden disk and nbd target should have the same | |
217 | length. | |
218 | 5. It is better to put active disk and hidden disk in ramdisk. | |
219 | 6. It is all a single argument to -drive, and you should ignore | |
220 | the leading whitespace. | |
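
Because each multi-line -drive option above is really one argument, it can
help to join the continuation lines before passing them to QEMU from a
script. The helper below is a minimal sketch; the function name is an
assumption of this example. It strips the leading whitespace and the
trailing backslash from each line and concatenates the pieces.

```python
def join_drive_argument(lines):
    """Join the continuation lines of one -drive option into a single string."""
    parts = []
    for line in lines:
        text = line.strip()            # ignore the leading whitespace
        if text.endswith("\\"):
            text = text[:-1].rstrip()  # drop the line-continuation backslash
        parts.append(text)
    return "".join(parts)


quorum_drive = join_drive_argument([
    "-drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\\",
    "       vote-threshold=1,children.0=childs1",
])
```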
221 | ||
222 | After Failover: | |
223 | Primary: | |
224 | The secondary host is down, so we should run the following qmp command | |
225 | to remove the nbd child from the quorum: | |
226 | { 'execute': 'x-blockdev-change', | |
227 | 'arguments': { | |
228 | 'parent': 'colo1', | |
229 | 'child': 'children.1' | |
230 | } | |
231 | } | |
232 | { 'execute': 'human-monitor-command', | |
233 | 'arguments': { | |
234 | 'command-line': 'drive_del xxxx' | |
235 | } | |
236 | } | |
237 | Note: there is no qmp command to remove the blockdev now | |
238 | ||
239 | Secondary: | |
240 | The primary host is down, so we should run the following qmp command: | |
241 | { 'execute': 'nbd-server-stop' } | |
242 | ||
90dfe59b LS |
243 | Promote Secondary to Primary: |
244 | see COLO-FT.txt | |
245 | ||
68365a38 | 246 | TODO: |
90dfe59b | 247 | 1. Shared disk |