Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures.  RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone.  Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task).  In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL).  Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
sections of code as "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started _after_ the beginning
of the wait do not extend the wait, despite running concurrently with
the updater.  This is the reason why RCU is more scalable than,
for example, reader-writer locks.  It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible?  The basic idea is to split updates in two phases,
"removal" and "reclamation".  During removal, we ensure that subsequent
readers will not be able to get a reference to the old data.  After
removal has completed, a critical section will not be able to access
the old data.  Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and these can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture:

        thread 1                  thread 2                 thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                    enter RCU crit.sec.
    exit RCU crit.sec             |                           |
                            complete wait                     |
                            begin reclamation phase           |
                                                       exit RCU crit.sec.


Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data.  This is possible, because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.


RCU API
=======

The core RCU API is small:

     void rcu_read_lock(void);

        Used by a reader to inform the reclaimer that the reader is
        entering an RCU read-side critical section.

     void rcu_read_unlock(void);

        Used by a reader to inform the reclaimer that the reader is
        exiting an RCU read-side critical section.  Note that RCU
        read-side critical sections may be nested and/or overlapping.

     void synchronize_rcu(void);

        Blocks until all pre-existing RCU read-side critical sections
        on all threads have completed.  This marks the end of the removal
        phase and the beginning of the reclamation phase.

        Note that it would be valid for another update to come while
        synchronize_rcu is running.  Because of this, it is better that
        the updater releases any locks it may hold before calling
        synchronize_rcu.  If this is not possible (for example, because
        the updater is protected by the BQL), you can use call_rcu.

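        As a minimal sketch of the recommended ordering (foo, foo_mutex,
        old and new are placeholders, used the same way as in the
        reference-counting examples later in this document):

            qemu_mutex_lock(&foo_mutex);
            old = foo;
            qatomic_rcu_set(&foo, new);
            qemu_mutex_unlock(&foo_mutex);   /* drop the lock first...   */
            synchronize_rcu();               /* ...then wait for readers */
            g_free(old);                     /* reclamation phase        */
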
     void call_rcu1(struct rcu_head * head,
                    void (*func)(struct rcu_head *head));

        This function invokes func(head) after all pre-existing RCU
        read-side critical sections on all threads have completed.  This
        marks the end of the removal phase, with func taking care of the
        reclamation phase asynchronously.

        The foo struct needs to have an rcu_head structure added,
        perhaps as follows:

            struct foo {
                struct rcu_head rcu;
                int a;
                char b;
                long c;
            };

        so that the reclaimer function can fetch the struct foo address
        and free it:

            call_rcu1(&foo.rcu, foo_reclaim);

            void foo_reclaim(struct rcu_head *rp)
            {
                struct foo *fp = container_of(rp, struct foo, rcu);
                g_free(fp);
            }

        For the common case where the rcu_head member is the first field
        of the struct, you can use the following macros.

     void call_rcu(T *p,
                   void (*func)(T *p),
                   field-name);
     void g_free_rcu(T *p,
                     field-name);

        call_rcu1 is typically used through these macros, in the common
        case where the "struct rcu_head" is the first field in the struct.
        If the callback function is g_free, in particular, g_free_rcu can
        be used.  In the above case, one could have written simply:

            g_free_rcu(&foo, rcu);
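
        For illustration only, here is a sketch of the same reclamation
        requested through the call_rcu macro; foo_free is a hypothetical
        callback that is not defined elsewhere in this document (unlike
        call_rcu1, the macro passes the struct foo pointer itself to the
        callback):

            void foo_free(struct foo *fp)
            {
                g_free(fp);
            }

            call_rcu(&foo, foo_free, rcu);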

     typeof(*p) qatomic_rcu_read(p);

        qatomic_rcu_read() is similar to qatomic_load_acquire(), but it
        makes some assumptions on the code that calls it.  This allows a
        more optimized implementation.

        qatomic_rcu_read assumes that whenever a single RCU critical
        section reads multiple shared data, these reads are either
        data-dependent or need no ordering.  This is almost always the
        case when using RCU, because read-side critical sections typically
        navigate one or more pointers (the pointers that are changed on
        every update) until reaching a data structure of interest,
        and then read from there.

        RCU read-side critical sections must use qatomic_rcu_read() to
        read data, unless concurrent writes are prevented by another
        synchronization mechanism.

        Furthermore, RCU read-side critical sections should traverse the
        data structure in a single direction, opposite to the direction
        in which the updater initializes it.
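
        For example (a sketch only; global_dev, struct dev and struct conf
        are hypothetical names), a reader that follows a published pointer
        and then a data-dependent one needs no additional barriers:

            rcu_read_lock();
            /* Follow the pointer that the updater publishes...  */
            struct dev *d = qatomic_rcu_read(&global_dev);
            /* ...then a pointer inside it, in the opposite direction from
             * the one in which the updater initialized the structure.  */
            struct conf *c = qatomic_rcu_read(&d->conf);
            int val = c->value;
            rcu_read_unlock();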

     void qatomic_rcu_set(p, typeof(*p) v);

        qatomic_rcu_set() is similar to qatomic_store_release(), though it
        also makes assumptions on the code that calls it in order to allow
        a more optimized implementation.

        In particular, qatomic_rcu_set() suffices for synchronization
        with readers, if the updater never mutates a field within a
        data item that is already accessible to readers.  This is the
        case when initializing a new copy of the RCU-protected data
        structure; just ensure that initialization of *p is carried out
        before qatomic_rcu_set() makes the data item visible to readers.
        If this rule is observed, writes will happen in the opposite
        order as reads in the RCU read-side critical sections (or if
        there is just one update), and there will be no need for any
        other synchronization mechanism to coordinate the accesses.
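
        A minimal sketch of this pattern, reusing the hypothetical struct
        dev and struct conf names from the example above:

            /* Initialize the new version while it is still private...  */
            struct conf *new_conf = g_new0(struct conf, 1);
            new_conf->value = 42;

            /* ...then publish it; qatomic_rcu_set() orders the
             * initialization before the pointer becomes visible to
             * readers.  */
            qatomic_rcu_set(&d->conf, new_conf);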

The following APIs must be used before RCU is used in a thread:

     void rcu_register_thread(void);

        Mark a thread as taking part in the RCU mechanism.  Such a thread
        will have to report quiescent points regularly, either manually
        or through the QemuCond/QemuSemaphore/QemuEvent APIs.

     void rcu_unregister_thread(void);

        Mark a thread as not taking part anymore in the RCU mechanism.
        It is not a problem if such a thread reports quiescent points,
        either manually or by using the QemuCond/QemuSemaphore/QemuEvent
        APIs.

Note that these APIs are relatively heavyweight, and should _not_ be
nested.
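
A minimal sketch of how a thread function might bracket its work with
these calls (my_thread_fn and its exit condition are hypothetical):

    static void *my_thread_fn(void *opaque)
    {
        rcu_register_thread();
        while (!some_exit_condition(opaque)) {
            /* ... code that may enter RCU read-side critical sections ... */
        }
        rcu_unregister_thread();
        return NULL;
    }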

Convenience macros
==================

Two macros are provided that automatically release the read lock at the
end of the scope.

     RCU_READ_LOCK_GUARD()

        Takes the lock and will release it at the end of the block it's
        used in.

     WITH_RCU_READ_LOCK_GUARD() { code }

        Is used at the head of a block to protect the code within the block.

Note that 'goto'ing out of the guarded block will also drop the lock.
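
As an illustration (a sketch only; foo_ptr is a hypothetical pointer
published with qatomic_rcu_set(), using the struct foo shown earlier):

    static int foo_get_a(void)
    {
        RCU_READ_LOCK_GUARD();
        struct foo *p = qatomic_rcu_read(&foo_ptr);
        return p ? p->a : -1;    /* the guard drops the lock on return */
    }

    static int foo_sum_a(void)
    {
        int total = 0;
        WITH_RCU_READ_LOCK_GUARD() {
            struct foo *p = qatomic_rcu_read(&foo_ptr);
            total += p->a;
        }
        return total;
    }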

DIFFERENCES WITH LINUX
======================

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section.  This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- qatomic_rcu_read and qatomic_rcu_set replace rcu_dereference and
  rcu_assign_pointer.  They take a _pointer_ to the variable being accessed.

- call_rcu is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct rcu_head), and expects the
  type of the callback's argument to be the type of the first argument.
  call_rcu1 is the same as Linux's call_rcu.


RCU PATTERNS
============

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs.  This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step".  See the
  array example below;

- in some cases, creating a new version of a data structure may actually
  be very cheap.  For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list (see the
  sketch after this list).

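A minimal sketch of that "cheap new version" case, with hypothetical
struct node objects chained through a next pointer (inserting new after
prev; new is not yet visible to readers):

    new->next = prev->next;             /* still private to the updater  */
    qatomic_rcu_set(&prev->next, new);  /* publish: from now on, readers
                                           see the new version of the list */
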
Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
-------------------

TBD (not yet used in QEMU)


RCU reference counting
----------------------

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism.  For example,
consider the following code fragment:

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of "p" remains
valid until after the rcu_read_unlock().  In some sense, it is acquiring
a reference to p that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking):

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count:

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with call_rcu:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal.  Reclamation
happens when the last reference to a "foo" object is dropped.
Using synchronize_rcu() is undesirably expensive, because the
last reference may be dropped on the read side.  Hence you can
use call_rcu() instead:

    void foo_unref(struct foo *p) {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }
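
For completeness, the foo_destroy callback passed to call_rcu above is
not defined in this document; a minimal version could simply be:

    void foo_destroy(struct foo *p)
    {
        g_free(p);
    }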


Note that the same idioms would be possible with reader/writer
locks:

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

foo_unref could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.
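
A sketch of that idea (foo_free_bh is hypothetical, and the main
AioContext is used only as an example of where the bottom half could run):

    static void foo_free_bh(void *opaque)
    {
        g_free(opaque);
    }

    void foo_unref(struct foo *p) {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            /* Defer the actual deallocation to a bottom half.  */
            aio_bh_schedule_oneshot(qemu_get_aio_context(), foo_free_bh, p);
        }
    }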


RCU resizable arrays
--------------------

Resizable arrays can be used with RCU.  The expensive RCU synchronization
(or call_rcu) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using realloc.  Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct:

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T   *data;
        ...
    };

Instead, we store the size of the array with the array itself:

    struct arr {
        int size;
        int alloc;
        T   data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = qatomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version.  */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase.  */
            old_array = global_array;
            qatomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase.  */
            free(old_array);
        }


SOURCES
=======

* Documentation/RCU/ from the Linux kernel