Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures. RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone. Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task). In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL). Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
sections of code as critical sections, and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started _after_ the beginning
of the wait do not extend the wait, despite running concurrently with
the updater. This is the reason why RCU is more scalable than,
for example, reader-writer locks. It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible? The basic idea is to split updates in two phases,
"removal" and "reclamation". During removal, we ensure that subsequent
readers will not be able to get a reference to the old data. After
removal has completed, a critical section will not be able to access
the old data. Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
which can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture:

        thread 1                  thread 2                 thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                    enter RCU crit.sec.
    exit RCU crit.sec.            |                           |
                            complete wait                     |
                            begin reclamation phase           |
                                                       exit RCU crit.sec.


Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data. This is possible because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.


RCU API
=======

The core RCU API is small:

     void rcu_read_lock(void);

        Used by a reader to inform the reclaimer that the reader is
        entering an RCU read-side critical section.

     void rcu_read_unlock(void);

        Used by a reader to inform the reclaimer that the reader is
        exiting an RCU read-side critical section. Note that RCU
        read-side critical sections may be nested and/or overlapping.
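
        For instance, the following sketch is valid; the nested calls
        just extend the outermost critical section:

           rcu_read_lock();
           rcu_read_lock();      /* nested: still the same section */
           /* read RCU-protected data here */
           rcu_read_unlock();
           rcu_read_unlock();    /* the outermost unlock ends it */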

     void synchronize_rcu(void);

        Blocks until all pre-existing RCU read-side critical sections
        on all threads have completed. This marks the end of the removal
        phase and the beginning of the reclamation phase.

        Note that it would be valid for another update to come while
        synchronize_rcu is running. Because of this, it is better that
        the updater releases any locks it may hold before calling
        synchronize_rcu. If this is not possible (for example, because
        the updater is protected by the BQL), you can use call_rcu.

     void call_rcu1(struct rcu_head *head,
                    void (*func)(struct rcu_head *head));

        This function invokes func(head) after all pre-existing RCU
        read-side critical sections on all threads have completed. This
        marks the end of the removal phase, with func taking care of the
        reclamation phase asynchronously.

        The foo struct needs to have an rcu_head structure added,
        perhaps as follows:

           struct foo {
               struct rcu_head rcu;
               int a;
               char b;
               long c;
           };

        so that the reclaimer function can fetch the struct foo address
        and free it:

           call_rcu1(&foo.rcu, foo_reclaim);

           void foo_reclaim(struct rcu_head *rp)
           {
               struct foo *fp = container_of(rp, struct foo, rcu);
               g_free(fp);
           }

        For the common case where the rcu_head member is the first of
        the struct, you can use the following macros.

     void call_rcu(T *p,
                   void (*func)(T *p),
                   field-name);
     void g_free_rcu(T *p,
                     field-name);

        call_rcu1 is typically used through these macros, in the common
        case where the "struct rcu_head" is the first field in the
        struct. If the callback function is g_free, in particular,
        g_free_rcu can be used. In the above case, one could have
        written simply:

           g_free_rcu(&foo, rcu);

     typeof(*p) atomic_rcu_read(p);

        atomic_rcu_read() is similar to atomic_mb_read(), but it makes
        some assumptions on the code that calls it. This allows a more
        optimized implementation.

        atomic_rcu_read assumes that whenever a single RCU critical
        section reads multiple pieces of shared data, these reads are
        either data-dependent or need no ordering. This is almost always
        the case when using RCU, because read-side critical sections
        typically navigate one or more pointers (the pointers that are
        changed on every update) until reaching a data structure of
        interest, and then read from there.

        RCU read-side critical sections must use atomic_rcu_read() to
        read data, unless concurrent writes are prevented by another
        synchronization mechanism.

        Furthermore, RCU read-side critical sections should traverse the
        data structure in a single direction, opposite to the direction
        in which the updater initializes it.
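
        For example, a reader typically fetches a pointer with
        atomic_rcu_read() and then dereferences it; the dependent read
        needs no extra ordering. A minimal sketch, with hypothetical
        struct config and global_cfg names:

           struct config {
               int timeout;
           };
           struct config *global_cfg;

           int get_timeout(void)
           {
               int t;

               rcu_read_lock();
               /* Reading ->timeout depends on the pointer fetched
                * just above, so no extra ordering is needed. */
               t = atomic_rcu_read(&global_cfg)->timeout;
               rcu_read_unlock();
               return t;
           }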

     void atomic_rcu_set(p, typeof(*p) v);

        atomic_rcu_set() is also similar to atomic_mb_set(), and it also
        makes assumptions on the code that calls it in order to allow a
        more optimized implementation.

        In particular, atomic_rcu_set() suffices for synchronization
        with readers, if the updater never mutates a field within a
        data item that is already accessible to readers. This is the
        case when initializing a new copy of the RCU-protected data
        structure; just ensure that initialization of *p is carried out
        before atomic_rcu_set() makes the data item visible to readers.
        If this rule is observed, writes will happen in the opposite
        order to the reads in the RCU read-side critical sections (or
        there will be just one update), and there will be no need for
        another synchronization mechanism to coordinate the accesses.
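
        A minimal sketch, reusing struct foo from above and assuming a
        global "struct foo *foo" pointer to the current version:

           struct foo *new_foo = g_new0(struct foo, 1);

           new_foo->a = 1;                /* initialize the copy first... */
           new_foo->b = 'x';
           new_foo->c = 42;
           atomic_rcu_set(&foo, new_foo); /* ...then publish it to readers */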

The following APIs must be used before RCU is used in a thread:

     void rcu_register_thread(void);

        Mark a thread as taking part in the RCU mechanism. Such a thread
        will have to report quiescent points regularly, either manually
        or through the QemuCond/QemuSemaphore/QemuEvent APIs.

     void rcu_unregister_thread(void);

        Mark a thread as no longer taking part in the RCU mechanism.
        It is not a problem if such a thread reports quiescent points,
        either manually or by using the QemuCond/QemuSemaphore/QemuEvent
        APIs.

Note that these APIs are relatively heavyweight, and should _not_ be
nested.
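
A minimal sketch of a registered thread ("worker_fn" is a hypothetical
name; registration brackets all of the thread's RCU activity):

     static void *worker_fn(void *opaque)
     {
         rcu_register_thread();    /* before the first critical section */

         rcu_read_lock();
         /* read RCU-protected data here */
         rcu_read_unlock();

         rcu_unregister_thread();  /* before the thread exits */
         return NULL;
     }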


DIFFERENCES WITH LINUX
======================

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section. This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
  rcu_assign_pointer. They take a _pointer_ to the variable being accessed.

- call_rcu is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct rcu_head), and expects the
  type of the callback's argument to be the type of the first argument.
  call_rcu1 is the same as Linux's call_rcu. For example, with struct foo
  from the API section above, the two calls sketched below are equivalent.
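
  This is only a sketch; "fp" is assumed to point to a struct foo that
  has already been removed from view, and foo_free is a hypothetical
  typed callback:

     /* QEMU-style callback: receives the struct foo pointer directly. */
     void foo_free(struct foo *fp)
     {
         g_free(fp);
     }

     /* Linux-style: pass the rcu_head and an rcu_head callback. */
     call_rcu1(&fp->rcu, foo_reclaim);

     /* QEMU macro: pass the object, the typed callback, and the name
      * of the (first) rcu_head field. */
     call_rcu(fp, foo_free, rcu);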


RCU PATTERNS
============

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs. This may
sound like a very strict restriction; however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step". See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap. For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list, as
  sketched below.
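
  A minimal sketch, assuming a singly linked "struct node" with a
  "next" pointer and updates serialized by a lock elsewhere:

     /* Unlink "node": only prev->next changes; everything else is
      * shared between the old and the new version of the list. */
     atomic_rcu_set(&prev->next, node->next);   /* removal */
     synchronize_rcu();                         /* wait for readers */
     g_free(node);                              /* reclamation */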

Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
-------------------

TBD (not yet used in QEMU)


RCU reference counting
----------------------

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism. For example,
consider the following code fragment:

    rcu_read_lock();
    p = atomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of "p" remains
valid until after the rcu_read_unlock(). In some sense, it is acquiring
a reference to p that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking):

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count:

    rcu_read_lock();
    p = atomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with call_rcu:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal. Reclamation
happens when the last reference to a "foo" object is dropped.
Using synchronize_rcu() is undesirably expensive, because the
last reference may be dropped on the read side. Hence you can
use call_rcu() instead:

    void foo_unref(struct foo *p)
    {
        if (atomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }

Note that the same idioms would be possible with reader/writer
locks:

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

foo_unref could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.


RCU resizable arrays
--------------------

Resizable arrays can be used with RCU. The expensive RCU synchronization
(or call_rcu) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using realloc. Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct:

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T *data;
        ...
    };

Instead, we store the size of the array with the array itself:

    struct arr {
        int size;
        int alloc;
        T data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = atomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version. */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase. */
            old_array = global_array;
            atomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase. */
            g_free(old_array);
        }


SOURCES
=======

* Documentation/RCU/ from the Linux kernel