Copyright (c) 2014 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop. The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.

The default event loop is called the main loop (see main-loop.c). It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
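
For example, a guest might be started with a dedicated event loop thread like
this (a sketch: the id, drive layout, and the virtio-blk iothread device
property are illustrative and vary between QEMU versions):

```shell
# Create an extra event loop thread named my-iothread and bind a
# virtio-blk device's I/O processing to it (property support varies).
qemu-system-x86_64 \
    -object iothread,id=my-iothread \
    -drive if=none,id=drive0,file=disk.img,format=raw \
    -device virtio-blk-pci,drive=drive0,iothread=my-iothread
```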

Side note: The main loop and IOThread are both event loops but their code is
not shared completely. Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work. The main loop is a
scalability bottleneck on hosts with many CPUs. Work can be spread across
several IOThreads instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code. This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h). Code that only works in the main loop
implicitly uses the main loop's AioContext. Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread. They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_poll() - run an event loop iteration
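
As a sketch, monitoring a file descriptor and arming a timer on an explicit
AioContext might look like the fragment below (to be built inside the QEMU
tree; the callback and function names are hypothetical, and exact signatures
may differ between QEMU versions):

```c
#include "block/aio.h"
#include "qemu/timer.h"

/* Hypothetical callbacks; the names are illustrative, not part of QEMU. */
static void my_fd_read(void *opaque)
{
    /* fd became readable; read and process data here */
}

static void my_timer_cb(void *opaque)
{
    /* timer expired */
}

static void my_setup(AioContext *ctx, int fd)
{
    /* Monitor fd for readability in this AioContext (no write handler). */
    aio_set_fd_handler(ctx, fd, my_fd_read, NULL, NULL);

    /* Create a millisecond-scale timer bound to the same AioContext
     * and arm it to fire 100ms from now. */
    QEMUTimer *t = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_MS,
                                 my_timer_cb, NULL);
    timer_mod(t, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 100);
}
```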

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can be called safely from file descriptor, event
notifier, timer, or BH callbacks invoked by the AioContext. No locking is
necessary.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.

aio_context_acquire()/aio_context_release() calls may be nested. This
means you can call them if you're not sure whether #1 applies.

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously. Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.
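
A minimal sketch of rule #2, assuming the acquire/release API described above
(a fragment to be built inside the QEMU tree; the function name and the work
done in the critical section are illustrative):

```c
#include "block/aio.h"

/* Called from a thread that does NOT own this AioContext, e.g. the main
 * loop touching an IOThread's context. */
static void my_poke_context(AioContext *ctx)
{
    aio_context_acquire(ctx);
    /* Safe: no event loop iteration runs in ctx while we hold it,
     * so timers, BHs, or fd handlers can be set up on ctx here. */
    aio_context_release(ctx);
}
```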

Side note: the best way to schedule a function call across threads is to create
a BH in the target AioContext beforehand and then call qemu_bh_schedule(). No
acquire/release or locking is needed for the qemu_bh_schedule() call. But be
sure to acquire the AioContext for aio_bh_new() if necessary.
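
Sketched out (hypothetical names; a fragment for the QEMU tree), this pattern
looks like:

```c
#include "block/aio.h"

static QEMUBH *my_bh;  /* created once, up front */

static void my_bh_cb(void *opaque)
{
    /* Runs in the target AioContext's thread. */
}

/* Setup: acquire the context if we don't already own it (rule #2). */
static void my_setup_bh(AioContext *target_ctx)
{
    aio_context_acquire(target_ctx);
    my_bh = aio_bh_new(target_ctx, my_bh_cb, NULL);
    aio_context_release(target_ctx);
}

/* Later, from any thread, with no locking needed: */
static void my_kick(void)
{
    qemu_bh_schedule(my_bh);
}
```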

The relationship between AioContext and the block layer
-------------------------------------------------------
The AioContext originates from the QEMU block layer because it provides a
scoped way of running event loop iterations until all work is done. This
feature is used to complete all in-flight block I/O requests (see
bdrv_drain_all()). Nowadays AioContext is a generic event loop that can be
used by any QEMU subsystem.

The block layer has support for AioContext integrated. Each BlockDriverState
is associated with an AioContext using bdrv_set_aio_context() and
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
right AioContext. Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop. See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState it
must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
IOThread does not run in parallel.
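
For instance, a hypothetical QMP handler touching bs might follow this pattern
(a fragment for the QEMU tree; bdrv_flush() stands in for any bdrv_*()
operation):

```c
#include "block/aio.h"
#include "block/block.h"

/* Main loop code (QEMU global mutex held) accessing a BlockDriverState
 * that may belong to an IOThread's AioContext. */
static void my_qmp_touch_bs(BlockDriverState *bs)
{
    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    bdrv_flush(bs);  /* any bdrv_*() call; illustrative */
    aio_context_release(ctx);
}
```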

Long-running jobs (usually in the form of coroutines) are best scheduled in the
BlockDriverState's AioContext to avoid the need to acquire/release around each
bdrv_*() call. Be aware that there is currently no mechanism to get notified
when bdrv_set_aio_context() moves this BlockDriverState to a different
AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
may need to add this if you want to support long-running jobs.