Commit | Line | Data |
---|---|---|
d02d8dde | 1 | Copyright (c) 2014-2017 Red Hat Inc. |
ef558696 SH |
2 | |
3 | This work is licensed under the terms of the GNU GPL, version 2 or later. See | |
4 | the COPYING file in the top-level directory. | |
5 | ||
6 | ||
7 | This document explains the IOThread feature and how to write code that runs | |
8 | outside the QEMU global mutex. | |
9 | ||
10 | The main loop and IOThreads | |
11 | --------------------------- | |
12 | QEMU is an event-driven program that can do several things at once using an | |
13 | event loop. The VNC server and the QMP monitor are both processed from the | |
14 | same event loop, which monitors their file descriptors until they become | |
15 | readable and then invokes a callback. | |
16 | ||
17 | The default event loop is called the main loop (see main-loop.c). It is | |
18 | possible to create additional event loop threads using -object | |
19 | iothread,id=my-iothread. | |
20 | ||
21 | Side note: The main loop and IOThread are both event loops but their code is | |
22 | not shared completely. Sometimes it is useful to remember that although they | |
23 | are conceptually similar they are currently not interchangeable. | |
24 | ||
25 | Why IOThreads are useful | |
26 | ------------------------ | |
27 | IOThreads allow the user to control the placement of work. The main loop is a | |
28 | scalability bottleneck on hosts with many CPUs. Work can be spread across | |
29 | several IOThreads instead of just one main loop. When set up correctly this | |
30 | can improve I/O latency and reduce jitter seen by the guest. | |
31 | ||
32 | The main loop is also deeply associated with the QEMU global mutex, which is a | |
33 | scalability bottleneck in itself. vCPU threads and the main loop use the QEMU | |
34 | global mutex to serialize execution of QEMU code. This mutex is necessary | |
35 | because a lot of QEMU's code historically was not thread-safe. | |
36 | ||
37 | The fact that all I/O processing is done in a single main loop and that the | |
38 | QEMU global mutex is contended by all vCPU threads and the main loop explains | |
39 | why it is desirable to place work into IOThreads. | |
40 | ||
41 | The experimental virtio-blk data-plane implementation has been benchmarked and | |
42 | shows these effects: | |
43 | ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf | |
44 | ||
45 | How to program for IOThreads | |
46 | ---------------------------- | |
47 | The main difference between legacy code and new code that can run in an | |
48 | IOThread is dealing explicitly with the event loop object, AioContext | |
49 | (see include/block/aio.h). Code that only works in the main loop | |
50 | implicitly uses the main loop's AioContext. Code that supports running | |
51 | in IOThreads must be aware of its AioContext. | |
52 | ||
53 | AioContext supports the following services: | |
54 | * File descriptor monitoring (read/write/error on POSIX hosts) | |
55 | * Event notifiers (inter-thread signalling) | |
56 | * Timers | |
57 | * Bottom Halves (BH) deferred callbacks | |
58 | ||
59 | There are several old APIs that use the main loop AioContext: | |
60 | * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor | |
61 | * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier | |
62 | * LEGACY timer_new_ms() - create a timer | |
63 | * LEGACY qemu_bh_new() - create a BH | |
64 | * LEGACY qemu_aio_wait() - run an event loop iteration | |
65 | ||
66 | Since they implicitly work on the main loop they cannot be used in code that | |
67 | runs in an IOThread. They might cause a crash or deadlock if called from an | |
68 | IOThread since the QEMU global mutex is not held. | |
69 | ||
70 | Instead, use the AioContext functions directly (see include/block/aio.h): | |
71 | * aio_set_fd_handler() - monitor a file descriptor | |
72 | * aio_set_event_notifier() - monitor an event notifier | |
73 | * aio_timer_new() - create a timer | |
74 | * aio_bh_new() - create a BH | |
75 | * aio_poll() - run an event loop iteration | |
76 | ||
77 | The AioContext can be obtained from the IOThread using | |
78 | iothread_get_aio_context() or for the main loop using qemu_get_aio_context(). | |
79 | Code that takes an AioContext argument works in both IOThreads and the main | |
80 | loop, depending on which AioContext instance the caller passes in. | |
81 | ||
82 | How to synchronize with an IOThread | |
83 | ----------------------------------- | |
84 | AioContext is not thread-safe so some rules must be followed when using file | |
85 | descriptors, event notifiers, timers, or BHs across threads: | |
86 | ||
7c690fd1 PB |
87 | 1. AioContext functions can always be called safely. They handle their |
88 | own locking internally. | |
ef558696 SH |
89 | |
90 | 2. Other threads wishing to access the AioContext must use | |
91 | aio_context_acquire()/aio_context_release() for mutual exclusion. Once the | |
92 | context is acquired no other thread can access it or run event loop iterations | |
93 | in this AioContext. | |
94 | ||
d02d8dde SH |
95 | Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls. |
96 | Do not use nesting anymore; it is incompatible with the BDRV_POLL_WHILE() macro | |
97 | used in the block layer and can lead to hangs. | |
ef558696 SH |
98 | |
99 | There is currently no lock ordering rule if a thread needs to acquire multiple | |
100 | AioContexts simultaneously. Therefore, it is only safe for code holding the | |
101 | QEMU global mutex to acquire other AioContexts. | |
102 | ||
7c690fd1 PB |
103 | Side note: the best way to schedule a function call across threads is to call |
104 | aio_bh_schedule_oneshot(). No acquire/release or locking is needed. | |
ef558696 | 105 | |
65c1b5b6 PB |
106 | AioContext and the block layer |
107 | ------------------------------ | |
108 | The AioContext originates from the QEMU block layer, even though nowadays | |
109 | AioContext is a generic event loop that can be used by any QEMU subsystem. | |
ef558696 SH |
110 | |
111 | The block layer has support for AioContext integrated. Each BlockDriverState | |
112 | is associated with an AioContext using bdrv_set_aio_context() and | |
113 | bdrv_get_aio_context(). This allows block layer code to process I/O inside the | |
114 | right AioContext. Other subsystems may wish to follow a similar approach. | |
115 | ||
116 | Block layer code must therefore expect to run in an IOThread and avoid using | |
117 | old APIs that implicitly use the main loop. See the "How to program for | |
118 | IOThreads" section above for information on how to do that. | |
119 | ||
65c1b5b6 PB |
120 | If main loop code such as a QMP function wishes to access a BlockDriverState |
121 | it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure | |
122 | that callbacks in the IOThread do not run in parallel. | |
123 | ||
124 | Code running in the monitor typically needs to ensure that past | |
125 | requests from the guest are completed. When a block device is running | |
126 | in an IOThread, the IOThread can also process requests from the guest | |
127 | (via ioeventfd). To achieve both objectives, wrap the code between | |
128 | bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained | |
129 | section". The functions must be called between aio_context_acquire() | |
130 | and aio_context_release(). You can freely release and re-acquire the | |
131 | AioContext within a drained section. | |
132 | ||
133 | Long-running jobs (usually in the form of coroutines) are best scheduled in | |
134 | the BlockDriverState's AioContext to avoid the need to acquire/release around | |
135 | each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier, | |
136 | or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends, | |
137 | can be used to get a notification whenever bdrv_set_aio_context() moves a | |
138 | BlockDriverState to a different AioContext. |
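
Appendix: a toy event loop iteration
------------------------------------
The event loop model described in "The main loop and IOThreads" (monitor file
descriptors until they become readable, then invoke a callback) can be
sketched with plain POSIX poll(2). This is only a minimal illustration and
not QEMU code: FdWatch, FdReadHandler, event_loop_iterate() and count_reads()
are hypothetical names invented for this example. In QEMU itself the
equivalent roles are played by AioContext, aio_set_fd_handler() and
aio_poll().

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_WATCHES 16

/* Hypothetical handler type: invoked when fd has become readable. */
typedef void FdReadHandler(int fd, void *opaque);

/* Hypothetical registration record: one watched fd and its callback. */
typedef struct {
    int fd;
    FdReadHandler *cb;
    void *opaque;
} FdWatch;

/*
 * Run one event loop iteration: wait up to timeout_ms for any watched fd
 * to become readable, then invoke the callback of each readable fd.
 * Returns the number of callbacks invoked, or -1 on error.
 */
int event_loop_iterate(FdWatch *watches, size_t n, int timeout_ms)
{
    struct pollfd pfds[MAX_WATCHES];
    int dispatched = 0;
    size_t i;

    if (n > MAX_WATCHES) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        pfds[i].fd = watches[i].fd;
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, n, timeout_ms) < 0) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        if (pfds[i].revents & POLLIN) {
            watches[i].cb(watches[i].fd, watches[i].opaque);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example callback: consume one byte and count the invocation. */
void count_reads(int fd, void *opaque)
{
    char byte;
    int *counter = opaque;

    if (read(fd, &byte, 1) == 1) {
        (*counter)++;
    }
}
```

A caller would fill in an FdWatch for each descriptor (for example the read
end of a pipe) and call event_loop_iterate() repeatedly. The watch array is
passed in explicitly rather than kept in a global, which loosely mirrors
the way IOThread-aware code takes an AioContext argument instead of assuming
the main loop.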