CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like.  The two tools that
are provided by qemu/atomic.h are memory barriers and atomic operations.

Macros defined by qemu/atomic.h fall in three camps:

- compiler barriers: barrier();

- weak atomic access and manual memory barriers: atomic_read(),
  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(),
  smp_mb_release(), smp_read_barrier_depends();

- sequentially consistent atomic access: everything else.


COMPILER MEMORY BARRIER
=======================

barrier() prevents the compiler from moving memory accesses on either
side of it across to the other side.  The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

barrier() is mostly used within qemu/atomic.h itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, qemu/atomic.h will reduce stronger memory barriers to simple
compiler barriers.

Still, barrier() can be useful when writing code that can be interrupted
by signal handlers.
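
For instance, here is a minimal sketch (assuming only barrier() from
qemu/atomic.h, and that the handler interrupts the same CPU that runs
poll_signals(), so compiler ordering is all that matters):

    #include "qemu/atomic.h"

    static int sig_pending;     /* set by the signal handler         */
    static int sig_value;       /* data published by the handler     */

    static void handler(int sig)
    {
        sig_value = sig;        /* write the data first...           */
        barrier();              /* ...and only then raise the flag   */
        sig_pending = 1;
    }

    static int poll_signals(void)
    {
        if (sig_pending) {
            barrier();          /* don't hoist the read of sig_value */
            sig_pending = 0;
            return sig_value;
        }
        return -1;
    }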

SEQUENTIALLY CONSISTENT ATOMIC ACCESS
=====================================

Most of the operations in the qemu/atomic.h header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

qemu/atomic.h provides the following set of atomic read-modify-write
operations:

    void atomic_inc(ptr)
    void atomic_dec(ptr)
    void atomic_add(ptr, val)
    void atomic_sub(ptr, val)
    void atomic_and(ptr, val)
    void atomic_or(ptr, val)

    typeof(*ptr) atomic_fetch_inc(ptr)
    typeof(*ptr) atomic_fetch_dec(ptr)
    typeof(*ptr) atomic_fetch_add(ptr, val)
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)

The operations in the second group return the old value of *ptr; those
in the first group return void.  All of them are polymorphic; they
operate on any type that is as wide as an int.
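
As an illustration, here is a minimal sketch (not taken from the QEMU
sources; the Obj type is made up) of a reference count built on these
primitives.  obj_unref() relies on atomic_fetch_dec() returning the value
that was in the counter *before* the decrement:

    #include "qemu/atomic.h"

    typedef struct Obj {
        int refcount;
        /* ... payload ... */
    } Obj;

    static void obj_ref(Obj *obj)
    {
        atomic_inc(&obj->refcount);
    }

    static void obj_unref(Obj *obj)
    {
        /* atomic_fetch_dec() returns the old value, so "== 1" means
         * this call dropped the last reference. */
        if (atomic_fetch_dec(&obj->refcount) == 1) {
            g_free(obj);    /* assuming the object was g_malloc()ed */
        }
    }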
Sequentially consistent loads and stores can be done using:

    atomic_fetch_add(ptr, 0) for loads
    atomic_xchg(ptr, val) for stores

However, they are quite expensive on some platforms, notably POWER and
ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
weaker constraints:

    typeof(*ptr) atomic_mb_read(ptr)
    void atomic_mb_set(ptr, val)

The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).

As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.

However (and this is the important difference between
atomic_mb_read/atomic_mb_set and sequential consistency), it is
essential for both threads to access the same volatile variable.  It is
not the case that everything visible to thread A when it writes volatile
field f becomes visible to thread B after it reads volatile field g.
The store and load have to "match" (i.e., be performed on the same
volatile field) to achieve the right semantics.

atomic_mb_read() and atomic_mb_set() operate on any type that is as wide
as an int or smaller.
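
For example, here is a minimal sketch (hypothetical code, not a QEMU
API) of one thread publishing a value to another; both sides access the
same variable "ready", which is what makes the guarantee apply:

    #include "qemu/atomic.h"

    static int data;
    static int ready;

    /* writer thread */
    static void publish(int value)
    {
        data = value;               /* ordinary store                    */
        atomic_mb_set(&ready, 1);   /* cannot move before the data store */
    }

    /* reader thread */
    static int try_consume(int *value)
    {
        if (atomic_mb_read(&ready)) {  /* matches the store to "ready" */
            *value = data;             /* guaranteed to see the data   */
            return 1;
        }
        return 0;
    }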

WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
In general, if the algorithm you are writing includes both writes
and reads on the same side, it is simpler to use sequentially
consistent primitives.

When using this model, variables are accessed with atomic_read() and
atomic_set(), and restrictions on the ordering of accesses are enforced
using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(),
smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends().

atomic_read() and atomic_set() prevent the compiler from using
optimizations that might otherwise optimize accesses out of existence
on the one hand, or that might create unsolicited accesses on the other.
In general this should not have any effect, because the same compiler
barriers are already implied by memory barriers.  However, it is useful
to do so, because it tells readers which variables are shared with
other threads, and which are local to the current thread or protected
by other, more mundane means.

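As a small, hypothetical illustration of this convention, a simple
"keep running" flag needs no ordering at all; atomic_read() and
atomic_set() only mark the variable as shared and keep the compiler
from caching it (the do_work() helper is a placeholder):

    #include "qemu/atomic.h"

    static int running = 1;     /* shared with the worker thread */

    /* called from another thread */
    static void request_stop(void)
    {
        atomic_set(&running, 0);
    }

    static void worker_loop(void)
    {
        while (atomic_read(&running)) {
            do_work();          /* hypothetical unit of work */
        }
    }
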
Memory barriers control the order of references to shared memory.
They come in six kinds:

- smp_rmb() guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_rmb() puts a partial ordering on loads, but is not
  required to have any effect on stores.

- smp_wmb() guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_wmb() puts a partial ordering on stores, but is not
  required to have any effect on loads.

- smp_mb_acquire() guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- smp_mb_release() guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- smp_mb() guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  smp_mb() puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- smp_read_barrier_depends() is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors.  Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and smp_read_barrier_depends() can be used instead of smp_rmb().

  Note that the first load really has to have a _data_ dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a _control_
  dependency and a full read barrier or better is required.

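As a concrete illustration of the data-dependency case, here is a minimal
sketch (the Foo type and the helper functions are made up for the example,
not QEMU code) of publishing a newly initialized structure through a
shared pointer; the consumer's second load depends on the address loaded
first, so smp_read_barrier_depends() is enough on the read side:

    #include "qemu/atomic.h"

    typedef struct Foo {
        int val;
    } Foo;

    static Foo *global_foo;

    /* producer */
    static void publish_foo(Foo *f)
    {
        f->val = 42;
        smp_wmb();                    /* order the init before the pointer store */
        atomic_set(&global_foo, f);
    }

    /* consumer */
    static int read_foo(void)
    {
        Foo *f = atomic_read(&global_foo);
        if (!f) {
            return -1;
        }
        smp_read_barrier_depends();   /* data dependency: f -> f->val */
        return f->val;
    }
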

This is the set of barriers that is required *between* two atomic_read()
and atomic_set() operations to achieve sequential consistency:

                       |                 2nd operation                    |
                       |--------------------------------------------------|
     1st operation     | (after last)     | atomic_read   | atomic_set      |
    -------------------+------------------+---------------+-----------------|
     (before first)    |                  | none          | smp_mb_release()|
    -------------------+------------------+---------------+-----------------|
     atomic_read       | smp_mb_acquire() | smp_rmb()*    | **              |
    -------------------+------------------+---------------+-----------------|
     atomic_set        | none             | smp_mb()***   | smp_wmb()       |
    -------------------+------------------+---------------+-----------------|

       * Or smp_read_barrier_depends().

       ** This requires a load-store barrier.  This is achieved by
          either smp_mb_acquire() or smp_mb_release().

       *** This requires a store-load barrier.  On most machines, the only
           way to achieve this is a full barrier.


You can see that the two possible definitions of atomic_mb_read()
and atomic_mb_set() are the following:

    1) atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
       atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()

    2) atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_mb_acquire()
       atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);

Usually the former is used, because smp_mb() is expensive and a program
normally has more reads than writes.  Therefore it makes more sense to
make atomic_mb_set() the more expensive operation.

There are two common cases in which atomic_mb_read and atomic_mb_set
generate too many memory barriers, and thus it can be useful to manually
place barriers instead:

- when a data structure has one thread that is always a writer
  and one thread that is always a reader, manual placement of
  memory barriers makes the write side faster.  Furthermore,
  correctness is easy to check for in this case using the "pairing"
  trick that is explained below:

     thread 1                                thread 1
     -------------------------               ------------------------
     (other writes)
                                             smp_mb_release()
     atomic_mb_set(&a, x)                    atomic_set(&a, x)
                                             smp_wmb()
     atomic_mb_set(&b, y)                    atomic_set(&b, y)

                                       =>
     thread 2                                thread 2
     -------------------------               ------------------------
     y = atomic_mb_read(&b)                  y = atomic_read(&b)
                                             smp_rmb()
     x = atomic_mb_read(&a)                  x = atomic_read(&a)
                                             smp_mb_acquire()

  Note that the barrier between the stores in thread 1, and the one
  between the loads in thread 2, have been optimized here to a write
  and a read memory barrier respectively.  On some architectures,
  notably ARMv7, smp_mb_acquire and smp_mb_release are just as
  expensive as smp_mb, but smp_rmb and/or smp_wmb are more efficient.

- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables).  In this case, it is possible to "hoist" the implicit
  barriers provided by atomic_mb_read() and atomic_mb_set() outside
  a loop.  For example, the above definition of atomic_mb_read() gives
  the following transformation:

     n = 0;                                  n = 0;
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
                                             smp_mb_acquire();

  Similarly, atomic_mb_set() can be transformed as follows:

                                             smp_mb_release();
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
                                             smp_mb();


  The two tricks can be combined.  In this case, splitting a loop in
  two lets you hoist the barriers out of the loops _and_ eliminate the
  expensive smp_mb():

                                             smp_mb_release();
     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
       atomic_mb_set(&b[i], false);          smp_wmb();
     }                                       for (i = 0; i < 10; i++)
                                               atomic_set(&b[i], false);
                                             smp_mb();

  The other thread can still use atomic_mb_read()/atomic_mb_set().


Memory barrier pairing
----------------------

A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier.  In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!

For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
both count as read barriers.  A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier.  A full barrier can pair with anything.
For example:

        thread 1             thread 2
        ===============      ===============
        a = 1;
        smp_wmb();
        b = 2;               x = b;
                             smp_rmb();
                             y = a;

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread.  This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa.  The same is true for more than 2
accesses and for data dependency barriers:

        thread 1             thread 2
        ===============      ===============
        b[2] = 1;
        smp_wmb();
        x->i = 2;
        smp_wmb();
        a = x;               x = a;
                             smp_read_barrier_depends();
                             y = x->i;
                             smp_read_barrier_depends();
                             z = b[y];

smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire(),
and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release().


COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
============================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, atomic_read and atomic_set in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile
  semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE).

  QEMU's atomic_read/set implement, if the compiler supports it, C11
  atomic relaxed semantics, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.
  No barriers are implied by atomic_read/set in either Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

     atomic_OP          returns void
     atomic_OP_return   returns new value of the variable
     atomic_fetch_OP    returns the old value of the variable
     atomic_cmpxchg     returns the old value of the variable

  In QEMU, the second kind does not exist; a way to emulate it is
  sketched after this list.  Of the third kind, Linux currently has
  atomic_fetch_or only, while QEMU provides and, or, inc, dec, add
  and sub.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency, which means they imply full memory barriers
  before and after the operation.

- Linux does not have an equivalent of atomic_mb_set().  In particular,
  note that smp_store_mb() is a little weaker than atomic_mb_set().
  atomic_mb_read() compiles to the same instructions as Linux's
  smp_load_acquire(), but this should be treated as an implementation
  detail.  QEMU does have atomic_load_acquire() and atomic_store_release()
  macros, but for now they are only used within atomic.h.  This may
  change in the future.

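The following is a hedged sketch (the helper name is made up, not a QEMU
or Linux API) of how the missing "OP_return" kind can be emulated on top
of QEMU's fetch operations:

    #include "qemu/atomic.h"

    /* Hypothetical helper: return the value of the counter *after*
     * atomically adding 'val', like Linux's atomic_add_return(). */
    static inline int counter_add_return(int *counter, int val)
    {
        return atomic_fetch_add(counter, val) + val;
    }
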

SOURCES
=======

* Documentation/memory-barriers.txt from the Linux kernel

* "The JSR-133 Cookbook for Compiler Writers", available at
  http://g.oswego.edu/dl/jmm/cookbook.html