Copyright (c) 2010-2015 Institute for System Programming
                        of the Russian Academy of Sciences.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Record/replay
-------------

Record/replay functions are used for reverse execution and deterministic
replay of QEMU execution. This implementation of deterministic replay can
be used for deterministic debugging of guest code through a gdb remote
interface.

Execution recording writes a log of non-deterministic events, which can later
be used to replay the execution anywhere and an unlimited number of times.
It also supports checkpointing for faster rewinding during reverse debugging.
Execution replaying reads the log and replays all non-deterministic events,
including external input, hardware clocks, and interrupts.

Deterministic replay has the following features:
 * Deterministically replays whole system execution and all contents of
   the memory, state of the hardware devices, clocks, and screen of the VM.
 * Writes the execution log into a file for later replaying multiple times
   on different machines.
 * Supports i386, x86_64, and ARM hardware platforms.
 * Performs deterministic replay of all operations with keyboard and mouse
   input devices.

Usage of the record/replay:
 * First, record the execution by adding the following arguments to the
   command line:
   '-icount shift=7,rr=record,rrfile=replay.bin -net none'.
   Block device images are not actually changed in recording mode,
   because all of the changes are written to a temporary overlay file.
 * Then replay it by using another command line option:
   '-icount shift=7,rr=replay,rrfile=replay.bin -net none'
 * The '-net none' option should also be specified if network replay patches
   are not applied.

Papers with description of deterministic replay implementation:
http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
http://dl.acm.org/citation.cfm?id=2786805.2803179

Modifications of QEMU include:
 * wrappers for clock and time functions to save their return values in the log
 * saving different asynchronous events (e.g. system shutdown) into the log
 * synchronization of the bottom halves execution
 * synchronization of the threads from thread pool
 * recording/replaying user input (mouse and keyboard)
 * adding internal checkpoints for cpu and io synchronization

Locking and thread synchronisation
----------------------------------

Previously the synchronisation of the main thread and the vCPU thread
was ensured by the holding of the BQL. However the trend has been to
reduce the time the BQL was held across the system including under TCG
system emulation. As it is important that batches of events are kept
in sequence (e.g. expiring timers and checkpoints in the main thread
while instruction checkpoints are written by the vCPU thread) we need
another lock to keep things in lock-step. This role is now handled by
the replay_mutex_lock. It used to be held only for each event being
written but now it is held for a whole execution period. This results
in a deterministic ping-pong between the two main threads.

As the BQL is now a finer grained lock than the replay_lock it is almost
certainly a bug, and a source of deadlocks, to take the
replay_mutex_lock while the BQL is held. This is enforced by an assert.
While the unlocks are usually in the reverse order, this is not
necessary; you can drop the replay_lock while holding the BQL, without
doing a more complicated unlock_iothread/replay_unlock/lock_iothread
sequence.
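
The ordering rule above can be illustrated with a small Python model. This
is only a sketch and not QEMU code: the helper names (bql_lock, bql_unlock,
replay_mutex_unlock) and the per-thread bookkeeping are assumptions made for
the example. It shows an assert rejecting "replay lock after BQL" while still
allowing the replay lock to be dropped with the BQL held.

```python
import threading

# Toy model of the two locks; names are illustrative, not QEMU's API.
bql = threading.Lock()
replay_mutex = threading.Lock()
_held = threading.local()  # per-thread record of which locks are held


def _is_held(name):
    return getattr(_held, name, False)


def replay_mutex_lock():
    # Ordering rule: the replay lock must never be taken while this
    # thread already holds the BQL.
    assert not _is_held("bql"), "replay mutex must be taken before the BQL"
    replay_mutex.acquire()
    _held.replay = True


def replay_mutex_unlock():
    # Allowed in any order relative to the BQL.
    _held.replay = False
    replay_mutex.release()


def bql_lock():
    bql.acquire()
    _held.bql = True


def bql_unlock():
    _held.bql = False
    bql.release()
```

With these helpers, the sequence replay_mutex_lock(), bql_lock(),
replay_mutex_unlock(), bql_unlock() succeeds, while calling
replay_mutex_lock() after bql_lock() trips the assert.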

Non-deterministic events
------------------------

Our record/replay system is based on saving and replaying non-deterministic
events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
from HDD or memory of the VM). Saving only non-deterministic events makes
the log file smaller and the simulation faster, and allows using reverse
debugging even for realtime applications.

The following non-deterministic data from peripheral devices is saved into
the log: mouse and keyboard input, network packets, audio controller input,
USB packets, serial port input, and hardware clocks (they are non-deterministic
too, because their values are taken from the host machine). Inputs from
simulated hardware, memory of the VM, software interrupts, and execution of
instructions are not saved into the log, because they are deterministic and
can be replayed by simulating the behavior of the virtual machine starting
from the initial state.

We had to solve three tasks to implement deterministic replay: recording
non-deterministic events, replaying non-deterministic events, and checking
that there is no divergence between record and replay modes.

We changed several parts of QEMU to support event log recording and replaying.
Device models that have non-deterministic input from external devices were
changed to write every external event into the execution log immediately.
E.g. network packets are written into the log when they arrive at the virtual
network adapter.

All non-deterministic events come from these devices. But to
replay them we need to know at which moments they occur. We specify
these moments by counting the number of instructions executed between
every pair of consecutive events.

Instruction counting
--------------------

QEMU must run in icount mode to use the record/replay feature. icount was
designed to allow deterministic execution in the absence of external inputs
to the virtual machine. We also use icount to control the occurrence of
non-deterministic events. The number of instructions elapsed since the last
event is written to the log while recording the execution. In replay mode we
can predict when to inject that event using the instruction counter.
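
The idea of logging instruction deltas can be sketched in a few lines of
Python. This is a simplified model, not the QEMU implementation: the
functions record and replay and the (icount, payload) tuples are
hypothetical, and "executing instructions" is reduced to adding to a counter.

```python
def record(events):
    """events: list of (absolute_icount, payload) in execution order.

    Returns log entries that store the number of instructions
    executed since the previous event, as described above.
    """
    log = []
    last = 0
    for icount, payload in events:
        log.append((icount - last, payload))
        last = icount
    return log


def replay(log):
    """Re-create the absolute injection points from the logged deltas."""
    icount = 0
    injected = []
    for delta, payload in log:
        icount += delta  # stands in for running `delta` guest instructions
        injected.append((icount, payload))
    return injected
```

Replaying a recorded log yields the events at the same instruction counts
at which they were originally observed.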

Timers
------

Timers are used to execute callbacks from different subsystems of QEMU
at the specified moments of time. There are several kinds of timers:
 * Real time clock. Based on host time and used only for callbacks that
   do not change the virtual machine state. For this reason the real time
   clock and its timers do not affect deterministic replay at all.
 * Virtual clock. These timers run only during the emulation. In icount
   mode the virtual clock value is calculated using the executed instructions
   counter. That is why it is completely deterministic and does not have to
   be recorded.
 * Host clock. This clock is used by device models that simulate real time
   sources (e.g. real time clock chip). The host clock is one of the sources
   of non-determinism. Host clock read operations should be logged to
   make the execution deterministic.
 * Virtual real time clock. This clock is similar to the real time clock but
   it is used only for increasing the virtual clock while the virtual machine
   is sleeping. Due to its nature it is also non-deterministic, like the host
   clock, and has to be logged too.
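
Logging of non-deterministic clock reads can be pictured with the following
Python sketch. The class ReplayClock and its read_host parameter are
assumptions made for illustration; QEMU's real clock wrappers are written in
C and are not shown here.

```python
import time
from collections import deque


class ReplayClock:
    """Toy wrapper around a non-deterministic clock source."""

    def __init__(self, mode, log=None, read_host=time.time_ns):
        assert mode in ("record", "replay")
        self.mode = mode
        self.log = deque(log or [])
        self.read_host = read_host

    def get(self):
        if self.mode == "record":
            # Read the real source and remember what the guest saw.
            value = self.read_host()
            self.log.append(value)
            return value
        # Replay: never touch the host, return the recorded value.
        return self.log.popleft()
```

In record mode every read is appended to the log; in replay mode the same
sequence of values is returned regardless of the host's current time.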

Checkpoints
-----------

Replaying the execution of a virtual machine is bound by sources of
non-determinism. These are inputs from clocks and peripheral devices,
and QEMU thread scheduling. Thread scheduling affects the processing of
events from timers, asynchronous input-output, and bottom halves.

Invocations of timers are coupled with clock reads and with changes to the
state of the virtual machine. Reads produce non-deterministic data taken from
the host clock, and VM state changes must preserve their order: their relative
order in replay mode must replicate the order of callbacks in record mode.
To preserve this order we use checkpoints. When a specific clock is processed
in record mode we save a special "checkpoint" event to the log.
Checkpoints here do not refer to virtual machine snapshots. They are just
record/replay events used for synchronization.

In replay mode QEMU may try to invoke timer processing at an arbitrary moment
of time. That is why we do not process a group of timers until the
corresponding checkpoint event is read from the log. Such an event allows
synchronizing CPU execution and timer events.
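
A minimal Python sketch of this gating follows. It is an illustration under
stated assumptions, not QEMU's code: the class CheckpointGate, the tuple
layout of log entries, and the checkpoint id string used in the usage note
are all hypothetical.

```python
from collections import deque


class CheckpointGate:
    """Decides whether a group of timers may be processed now."""

    def __init__(self, mode, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.log = deque(log or [])

    def checkpoint(self, checkpoint_id):
        """Return True if the caller may process its timers."""
        if self.mode == "record":
            # Remember that this clock was processed at this point.
            self.log.append(("checkpoint", checkpoint_id))
            return True
        # Replay: proceed only if the next logged event is this checkpoint.
        if self.log and self.log[0] == ("checkpoint", checkpoint_id):
            self.log.popleft()
            return True
        return False
```

In replay mode a call such as gate.checkpoint("clock_virtual") returns False
until all earlier log entries (for example instruction events) have been
consumed, so timer processing cannot run ahead of the recorded order.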

Two other checkpoints govern the "warping" of the virtual clock.
While the virtual machine is idle, the virtual clock increments at
1 ns per *real time* nanosecond. This is done by setting up a timer
(called the warp timer) on the virtual real time clock, so that the
timer fires at the next deadline of the virtual clock; the virtual clock
is then incremented (which is called "warping" the virtual clock) as
soon as the timer fires or the CPUs need to go out of the idle state.
Two functions are used for this purpose; because these actions change
virtual machine state and must be deterministic, each of them creates a
checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so
starts accounting real time to virtual clock. qemu_account_warp_timer
is called when the CPUs get an interrupt or when the warp timer fires,
and it warps the virtual clock by the amount of real time that has passed
since qemu_start_warp_timer.
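
The arithmetic of warping can be modelled as below. This is a rough Python
sketch, not the C implementation: the class WarpClock and its method
signatures are hypothetical, real time is passed in explicitly, and the
checkpoints and timer deadlines mentioned above are left out.

```python
class WarpClock:
    """Toy model of virtual clock warping while the CPUs are idle."""

    def __init__(self):
        self.virtual_ns = 0
        self.warp_start_ns = None  # real time when accounting started

    def start_warp_timer(self, cpus_idle, realtime_now_ns):
        # Begin accounting real time only if the CPUs are idle.
        if cpus_idle and self.warp_start_ns is None:
            self.warp_start_ns = realtime_now_ns

    def account_warp_timer(self, realtime_now_ns):
        # Warp the virtual clock by the real time that has passed
        # since accounting started; returns the amount warped.
        if self.warp_start_ns is None:
            return 0
        delta = realtime_now_ns - self.warp_start_ns
        self.virtual_ns += delta
        self.warp_start_ns = None
        return delta
```

Because the warped amount comes from real time, it differs between runs,
which is why both steps have to be synchronised through checkpoints.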

Bottom halves
-------------

Disk I/O events are completely deterministic in our model, because
in both record and replay modes we start the virtual machine from the same
disk state. But the callbacks that the virtual disk controller uses for
reading and writing the disk may occur at different moments of time in record
and replay modes.

Read and write requests are created by the CPU thread of QEMU. Later these
requests proceed to the block layer, which creates "bottom halves". A bottom
half consists of a callback and its parameters. Bottom halves are processed
when the main loop locks the global mutex. These locks are not synchronized
with the replaying process because the main loop also processes events that
do not affect the virtual machine state (like user interaction with the
monitor).

That is why we had to implement saving and replaying bottom half callbacks
synchronously to the CPU execution. When a callback is about to execute
it is added to the queue in the replay module. This queue is written to the
log when its callbacks are executed. In replay mode callbacks are not processed
until the corresponding event is read from the events log file.

Sometimes the block layer uses asynchronous callbacks for its internal purposes
(like reading or writing VM snapshots or disk image cluster tables). In this
case bottom halves are not marked as "replayable" and are not saved
into the log.
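
The queueing scheme can be sketched in Python as follows. This is an
illustrative model only: the class ReplayBHQueue, its schedule and flush
methods, and the representation of the log as a list of id batches are
assumptions and do not mirror QEMU's data structures.

```python
class ReplayBHQueue:
    """Toy queue of 'replayable' bottom-half callbacks."""

    def __init__(self, mode, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        # One batch of operation ids per flush that happened while recording.
        self.log = [list(batch) for batch in (log or [])]
        self.pending = {}  # operation id -> callback
        self.next_id = 0

    def schedule(self, callback, replayable=True):
        if not replayable:
            callback()  # internal work: runs directly and is never logged
            return None
        op_id = self.next_id
        self.next_id += 1
        self.pending[op_id] = callback
        return op_id

    def flush(self):
        """Run queued callbacks at a synchronisation point."""
        if self.mode == "record":
            batch = list(self.pending)  # everything queued so far
            self.log.append(batch)
        elif self.log:
            batch = self.log.pop(0)  # only what the log says ran here
        else:
            batch = []
        for op_id in batch:
            self.pending.pop(op_id)()
```

In replay mode a callback that was queued earlier than during recording
simply stays pending until the flush whose logged batch contains its id.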

Block devices
-------------

The block devices record/replay module intercepts calls of
bdrv coroutine functions at the top of the block driver stack.
To record and replay block operations the drive must be configured
as follows:
 -drive file=disk.qcow,if=none,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

The blkreplay driver should be inserted between the disk image and the virtual
driver controller. Therefore all disk requests may be recorded and replayed.

All block completion operations are added to the queue in the coroutines.
The queue is flushed at checkpoints and information about processed requests
is recorded to the log. In the replay phase the queue is matched with
events read from the log. Therefore block device requests are processed
deterministically.

Snapshotting
------------

New VM snapshots may be created in replay mode. They can be used later
to recover the desired VM state. All VM states created in replay mode
are associated with the moment of time in the replay scenario.
After recovering the VM state replay will start from that position.

The default starting snapshot name may be specified with the icount field
rrsnapshot as follows:
 -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name

This snapshot is created at the start of recording and restored at the start
of replaying. It can also be loaded while replaying to roll back
the execution.

Network devices
---------------

Record and replay for network interactions is performed with the network filter.
Each backend must have its own instance of the replay filter as follows:
 -netdev user,id=net1 -device rtl8139,netdev=net1
 -object filter-replay,id=replay,netdev=net1

The replay network filter is used to record and replay network packets. While
recording the virtual machine this filter puts all packets coming from
the outer world into the log. In replay mode packets from the log are
injected into the network device. All interactions with the network backend
in replay mode are disabled.
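
The behaviour of such a filter can be modelled with a short Python sketch.
The class ReplayNetFilter and its methods are hypothetical and exist only
for this example; in the real system the moment at which a logged packet is
injected is determined by its position in the replay log, which this model
leaves to the caller.

```python
class ReplayNetFilter:
    """Toy stand-in for a record/replay network filter."""

    def __init__(self, mode, deliver_to_guest, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.deliver_to_guest = deliver_to_guest
        self.log = list(log or [])

    def from_backend(self, packet):
        """Handle a packet arriving from the outer world."""
        if self.mode == "replay":
            return False  # backend traffic is ignored during replay
        self.log.append(bytes(packet))
        self.deliver_to_guest(packet)
        return True

    def inject_next(self):
        """Replay only: feed the next logged packet to the guest device."""
        assert self.mode == "replay"
        if not self.log:
            return False
        self.deliver_to_guest(self.log.pop(0))
        return True
```

During replay the guest sees exactly the packets that were logged, in the
same order, even if the host network delivers something different.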

Audio devices
-------------

Audio data is recorded and replayed automatically. The command line for
recording and replaying must contain identical specifications of audio
hardware, e.g.:
 -soundhw ac97

Replay log format
-----------------

Record/replay log consists of the header and the sequence of execution
events. The header includes 4-byte replay version id and 8-byte reserved
field. Version is updated every time replay log format changes to prevent
using replay log created by another build of qemu.

The sequence of the events describes virtual machine state changes.
It includes all non-deterministic inputs of VM, synchronization marks and
instruction counts used to correctly inject inputs at replay.

Synchronization marks (checkpoints) are used for synchronizing qemu threads
that perform operations with virtual hardware. These operations may change
system's state (e.g., change some register or generate interrupt) and
therefore should execute synchronously with CPU thread.

Every event in the log includes 1-byte event id and optional arguments.
When argument is an array, it is stored as 4-byte array length
and corresponding number of bytes with data.
Here is the list of events that are written into the log:

 - EVENT_INSTRUCTION. Instructions executed since last event.
   Argument: 4-byte number of executed instructions.
 - EVENT_INTERRUPT. Used to synchronize interrupt processing.
 - EVENT_EXCEPTION. Used to synchronize exception handling.
 - EVENT_ASYNC. This is a group of events. They are always processed
   together with checkpoints. When such an event is generated, it is
   stored in the queue and processed only when checkpoint occurs.
   Every such event is followed by 1-byte checkpoint id and 1-byte
   async event id from the following list:
     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
       callbacks that affect virtual machine state, but are normally called
       asynchronously.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
       parameters of keyboard and mouse input operations
       (key press/release, mouse pointer movement).
       Arguments: 9-16 bytes depending on input event.
     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
       initiated by the sender.
       Arguments: 1-byte character device id.
                  Array with bytes that were read.
     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
       operations with disk and flash drives with CPU.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_NET. Incoming network packet.
       Arguments: 1-byte network adapter id.
                  4-byte packet flags.
                  Array with packet bytes.
 - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
   e.g., by closing the window.
 - EVENT_CHAR_WRITE. Used to synchronize character output operations.
   Arguments: 4-byte output function return value.
              4-byte offset in the output array.
 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
   initiated by qemu.
   Argument: Array with bytes that were read.
 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
   initiated by qemu.
   Argument: 4-byte error code.
 - EVENT_CLOCK + clock_id. Group of events for host clock read operations.
   Argument: 8-byte clock value.
 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
   CPU, internal threads, and asynchronous input events. May be followed
   by one or more EVENT_ASYNC events.
 - EVENT_END. Last event in the log.
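
The field sizes listed above can be made concrete with Python's struct
module. This is a sketch under assumptions: the numeric event ids below are
made up for illustration, and big-endian byte order is assumed because this
document does not state the byte order of the log.

```python
import struct

# Hypothetical numeric ids; the real values are defined in the QEMU sources.
EVENT_INSTRUCTION = 0
EVENT_CLOCK = 16


def pack_header(version):
    # 4-byte replay version id followed by an 8-byte reserved field.
    return struct.pack(">IQ", version, 0)


def pack_array(data):
    # Arrays are stored as a 4-byte length followed by the raw bytes.
    return struct.pack(">I", len(data)) + bytes(data)


def pack_instruction(count):
    # 1-byte event id plus the 4-byte number of executed instructions.
    return struct.pack(">BI", EVENT_INSTRUCTION, count)


def pack_clock(clock_id, value):
    # Event id is EVENT_CLOCK + clock_id; argument is an 8-byte clock value.
    return struct.pack(">Bq", EVENT_CLOCK + clock_id, value)


def unpack_array(buf, offset=0):
    # Returns the array payload and the offset just past it.
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length
```

A reader of the log would consume the 12-byte header first and then
dispatch on each 1-byte event id to decide how many argument bytes follow.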