CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest). Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like. The two tools that
are provided by qemu/atomic.h are memory barriers and atomic operations.

Macros defined by qemu/atomic.h fall in three camps:

- compiler barriers: barrier();

- weak atomic access and manual memory barriers: atomic_read(),
  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();

- sequentially consistent atomic access: everything else.

22 | ||
23 | COMPILER MEMORY BARRIER | |
24 | ======================= | |
25 | ||
26 | barrier() prevents the compiler from moving the memory accesses either | |
27 | side of it to the other side. The compiler barrier has no direct effect | |
28 | on the CPU, which may then reorder things however it wishes. | |
29 | ||
30 | barrier() is mostly used within qemu/atomic.h itself. On some | |
31 | architectures, CPU guarantees are strong enough that blocking compiler | |
32 | optimizations already ensures the correct order of execution. In this | |
33 | case, qemu/atomic.h will reduce stronger memory barriers to simple | |
34 | compiler barriers. | |
35 | ||
36 | Still, barrier() can be useful when writing code that can be interrupted | |
37 | by signal handlers. | |
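
For instance, here is a minimal sketch of that pattern (not actual QEMU
code; the barrier() definition is copied inline so the example is
self-contained, and the variable names are illustrative):

    #include <signal.h>
    #include <unistd.h>

    #define barrier() __asm__ __volatile__("" ::: "memory")

    static int done;

    static void handler(int sig)
    {
        done = 1;
    }

    int main(void)
    {
        signal(SIGINT, handler);
        while (!done) {
            /* Without the compiler barrier, the compiler could cache
             * "done" in a register and turn this into an infinite loop.
             * No CPU barrier is needed: the handler interrupts this
             * code on the same CPU rather than running concurrently
             * on another one.
             */
            barrier();
            usleep(1000);
        }
        return 0;
    }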
38 | ||
39 | ||
40 | SEQUENTIALLY CONSISTENT ATOMIC ACCESS | |
41 | ===================================== | |
42 | ||
43 | Most of the operations in the qemu/atomic.h header ensure *sequential | |
44 | consistency*, where "the result of any execution is the same as if the | |
45 | operations of all the processors were executed in some sequential order, | |
46 | and the operations of each individual processor appear in this sequence | |
47 | in the order specified by its program". | |
48 | ||
49 | qemu/atomic.h provides the following set of atomic read-modify-write | |
50 | operations: | |
51 | ||
    void atomic_inc(ptr)
    void atomic_dec(ptr)
    void atomic_add(ptr, val)
    void atomic_sub(ptr, val)
    void atomic_and(ptr, val)
    void atomic_or(ptr, val)

    typeof(*ptr) atomic_fetch_inc(ptr)
    typeof(*ptr) atomic_fetch_dec(ptr)
    typeof(*ptr) atomic_fetch_add(ptr, val)
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)
67 | ||
68 | all of which return the old value of *ptr. These operations are | |
69 | polymorphic; they operate on any type that is as wide as an int. | |
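
As an illustrative sketch (assuming qemu/atomic.h is included; the
function names and the do_init() helper are hypothetical), a statistics
counter and a one-shot initialization flag could look like this:

    #include "qemu/atomic.h"

    extern void do_init(void);

    static unsigned int stats_count;
    static int initialized;

    void count_event(void)
    {
        atomic_inc(&stats_count);       /* sequentially consistent */
    }

    void init_once(void)
    {
        /* atomic_cmpxchg returns the old value, so exactly one thread
         * sees 0 and performs the initialization.
         */
        if (atomic_cmpxchg(&initialized, 0, 1) == 0) {
            do_init();
        }
    }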
70 | ||
71 | Sequentially consistent loads and stores can be done using: | |
72 | ||
73 | atomic_fetch_add(ptr, 0) for loads | |
74 | atomic_xchg(ptr, val) for stores | |
75 | ||
76 | However, they are quite expensive on some platforms, notably POWER and | |
77 | ARM. Therefore, qemu/atomic.h provides two primitives with slightly | |
78 | weaker constraints: | |
79 | ||
80 | typeof(*ptr) atomic_mb_read(ptr) | |
81 | void atomic_mb_set(ptr, val) | |
82 | ||
83 | The semantics of these primitives map to Java volatile variables, | |
84 | and are strongly related to memory barriers as used in the Linux | |
85 | kernel (see below). | |
86 | ||
87 | As long as you use atomic_mb_read and atomic_mb_set, accesses cannot | |
88 | be reordered with each other, and it is also not possible to reorder | |
89 | "normal" accesses around them. | |
90 | ||
91 | However, and this is the important difference between | |
92 | atomic_mb_read/atomic_mb_set and sequential consistency, it is important | |
93 | for both threads to access the same volatile variable. It is not the | |
94 | case that everything visible to thread A when it writes volatile field f | |
95 | becomes visible to thread B after it reads volatile field g. The store | |
96 | and load have to "match" (i.e., be performed on the same volatile | |
97 | field) to achieve the right semantics. | |
98 | ||
99 | ||
100 | These operations operate on any type that is as wide as an int or smaller. | |
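
For example, the canonical message-passing pattern makes both threads
use the same variable ("ready" below; the names are illustrative):

     thread 1                                thread 2
     -------------------------               ------------------------
     msg = 42;
     atomic_mb_set(&ready, 1);               if (atomic_mb_read(&ready)) {
                                                 /* msg is visible as 42 */
                                             }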
101 | ||
102 | ||
103 | WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS | |
104 | ============================================= | |
105 | ||
106 | Compared to sequentially consistent atomic access, programming with | |
107 | weaker consistency models can be considerably more complicated. | |
108 | In general, if the algorithm you are writing includes both writes | |
109 | and reads on the same side, it is generally simpler to use sequentially | |
110 | consistent primitives. | |
111 | ||
112 | When using this model, variables are accessed with atomic_read() and | |
113 | atomic_set(), and restrictions to the ordering of accesses is enforced | |
114 | using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends() | |
115 | memory barriers. | |
116 | ||
117 | atomic_read() and atomic_set() prevents the compiler from using | |
118 | optimizations that might otherwise optimize accesses out of existence | |
119 | on the one hand, or that might create unsolicited accesses on the other. | |
120 | In general this should not have any effect, because the same compiler | |
121 | barriers are already implied by memory barriers. However, it is useful | |
122 | to do so, because it tells readers which variables are shared with | |
123 | other threads, and which are local to the current thread or protected | |
124 | by other, more mundane means. | |
125 | ||
126 | Memory barriers control the order of references to shared memory. | |
127 | They come in four kinds: | |
128 | ||
129 | - smp_rmb() guarantees that all the LOAD operations specified before | |
130 | the barrier will appear to happen before all the LOAD operations | |
131 | specified after the barrier with respect to the other components of | |
132 | the system. | |
133 | ||
134 | In other words, smp_rmb() puts a partial ordering on loads, but is not | |
135 | required to have any effect on stores. | |
136 | ||
137 | - smp_wmb() guarantees that all the STORE operations specified before | |
138 | the barrier will appear to happen before all the STORE operations | |
139 | specified after the barrier with respect to the other components of | |
140 | the system. | |
141 | ||
142 | In other words, smp_wmb() puts a partial ordering on stores, but is not | |
143 | required to have any effect on loads. | |
144 | ||
145 | - smp_mb() guarantees that all the LOAD and STORE operations specified | |
146 | before the barrier will appear to happen before all the LOAD and | |
147 | STORE operations specified after the barrier with respect to the other | |
148 | components of the system. | |
149 | ||
150 | smp_mb() puts a partial ordering on both loads and stores. It is | |
151 | stronger than both a read and a write memory barrier; it implies both | |
152 | smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the | |
153 | barrier from overtaking LOADs coming after the barrier and vice versa. | |
154 | ||
155 | - smp_read_barrier_depends() is a weaker kind of read barrier. On | |
156 | most processors, whenever two loads are performed such that the | |
157 | second depends on the result of the first (e.g., the first load | |
158 | retrieves the address to which the second load will be directed), | |
159 | the processor will guarantee that the first LOAD will appear to happen | |
160 | before the second with respect to the other components of the system. | |
161 | However, this is not always true---for example, it was not true on | |
162 | Alpha processors. Whenever this kind of access happens to shared | |
163 | memory (that is not protected by a lock), a read barrier is needed, | |
164 | and smp_read_barrier_depends() can be used instead of smp_rmb(). | |
165 | ||
166 | Note that the first load really has to have a _data_ dependency and not | |
167 | a control dependency. If the address for the second load is dependent | |
168 | on the first load, but the dependency is through a conditional rather | |
169 | than actually loading the address itself, then it's a _control_ | |
170 | dependency and a full read barrier or better is required. | |
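
As a sketch of the most common use of these barriers (the variable names
and the compute()/use() helpers are illustrative), a producer can publish
data with smp_wmb() and a consumer can pick it up with smp_rmb():

     thread 1 (producer)                     thread 2 (consumer)
     -------------------------               ------------------------
     atomic_set(&data, compute());           while (!atomic_read(&ready)) {
     smp_wmb();                                  /* wait */
     atomic_set(&ready, 1);                  }
                                             smp_rmb();
                                             use(atomic_read(&data));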
171 | ||
172 | ||
173 | This is the set of barriers that is required *between* two atomic_read() | |
174 | and atomic_set() operations to achieve sequential consistency: | |
175 | ||
176 | | 2nd operation | | |
177 | |-----------------------------------------| | |
178 | 1st operation | (after last) | atomic_read | atomic_set | | |
179 | ---------------+--------------+-------------+------------| | |
180 | (before first) | | none | smp_wmb() | | |
181 | ---------------+--------------+-------------+------------| | |
182 | atomic_read | smp_rmb() | smp_rmb()* | ** | | |
183 | ---------------+--------------+-------------+------------| | |
184 | atomic_set | none | smp_mb()*** | smp_wmb() | | |
185 | ---------------+--------------+-------------+------------| | |
186 | ||
187 | * Or smp_read_barrier_depends(). | |
188 | ||
189 | ** This requires a load-store barrier. How to achieve this varies | |
190 | depending on the machine, but in practice smp_rmb()+smp_wmb() | |
191 | should have the desired effect. For example, on PowerPC the | |
192 | lwsync instruction is a combined load-load, load-store and | |
193 | store-store barrier. | |
194 | ||
195 | *** This requires a store-load barrier. On most machines, the only | |
196 | way to achieve this is a full barrier. | |
197 | ||
198 | ||
199 | You can see that the two possible definitions of atomic_mb_read() | |
200 | and atomic_mb_set() are the following: | |
201 | ||
202 | 1) atomic_mb_read(p) = atomic_read(p); smp_rmb() | |
203 | atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb() | |
204 | ||
205 | 2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_rmb() | |
206 | atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); | |
207 | ||
208 | Usually the former is used, because smp_mb() is expensive and a program | |
209 | normally has more reads than writes. Therefore it makes more sense to | |
210 | make atomic_mb_set() the more expensive operation. | |
211 | ||
212 | There are two common cases in which atomic_mb_read and atomic_mb_set | |
213 | generate too many memory barriers, and thus it can be useful to manually | |
214 | place barriers instead: | |
215 | ||
216 | - when a data structure has one thread that is always a writer | |
217 | and one thread that is always a reader, manual placement of | |
218 | memory barriers makes the write side faster. Furthermore, | |
219 | correctness is easy to check for in this case using the "pairing" | |
220 | trick that is explained below: | |
221 | ||
222 | thread 1 thread 1 | |
223 | ------------------------- ------------------------ | |
224 | (other writes) | |
225 | smp_wmb() | |
226 | atomic_mb_set(&a, x) atomic_set(&a, x) | |
227 | smp_wmb() | |
228 | atomic_mb_set(&b, y) atomic_set(&b, y) | |
229 | ||
230 | => | |
231 | thread 2 thread 2 | |
232 | ------------------------- ------------------------ | |
233 | y = atomic_mb_read(&b) y = atomic_read(&b) | |
234 | smp_rmb() | |
235 | x = atomic_mb_read(&a) x = atomic_read(&a) | |
236 | smp_rmb() | |
237 | ||
- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables). In this case, it is possible to "hoist" the implicit
  barriers provided by atomic_mb_read() and atomic_mb_set() outside
  a loop. For example, the above definition of atomic_mb_read() gives
  the following transformation:

     n = 0;                                  n = 0;
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
                                             smp_rmb();

  Similarly, atomic_mb_set() can be transformed as follows:

                                             smp_wmb();
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
                                             smp_mb();

259 | ||
260 | The two tricks can be combined. In this case, splitting a loop in | |
261 | two lets you hoist the barriers out of the loops _and_ eliminate the | |
262 | expensive smp_mb(): | |
263 | ||
264 | smp_wmb(); | |
265 | for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++) | |
266 | atomic_mb_set(&a[i], false); atomic_set(&a[i], false); | |
267 | atomic_mb_set(&b[i], false); smb_wmb(); | |
268 | } for (i = 0; i < 10; i++) | |
269 | atomic_set(&a[i], false); | |
270 | smp_mb(); | |
271 | ||
272 | The other thread can still use atomic_mb_read()/atomic_mb_set() | |
273 | ||
274 | ||
275 | Memory barrier pairing | |
276 | ---------------------- | |
277 | ||
278 | A useful rule of thumb is that memory barriers should always, or almost | |
279 | always, be paired with another barrier. In the case of QEMU, however, | |
280 | note that the other barrier may actually be in a driver that runs in | |
281 | the guest! | |
282 | ||
283 | For the purposes of pairing, smp_read_barrier_depends() and smp_rmb() | |
284 | both count as read barriers. A read barriers shall pair with a write | |
285 | barrier or a full barrier; a write barrier shall pair with a read | |
286 | barrier or a full barrier. A full barrier can pair with anything. | |
287 | For example: | |
288 | ||
289 | thread 1 thread 2 | |
290 | =============== =============== | |
291 | a = 1; | |
292 | smp_wmb(); | |
293 | b = 2; x = b; | |
294 | smp_rmb(); | |
295 | y = a; | |
296 | ||
297 | Note that the "writing" thread are accessing the variables in the | |
298 | opposite order as the "reading" thread. This is expected: stores | |
299 | before the write barrier will normally match the loads after the | |
300 | read barrier, and vice versa. The same is true for more than 2 | |
301 | access and for data dependency barriers: | |
302 | ||
303 | thread 1 thread 2 | |
304 | =============== =============== | |
305 | b[2] = 1; | |
306 | smp_wmb(); | |
307 | x->i = 2; | |
308 | smp_wmb(); | |
309 | a = x; x = a; | |
310 | smp_read_barrier_depends(); | |
311 | y = x->i; | |
312 | smp_read_barrier_depends(); | |
313 | z = b[y]; | |
314 | ||
315 | smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs | |
316 | with atomic_mb_set(). | |
317 | ||
318 | ||
319 | COMPARISON WITH LINUX KERNEL MEMORY BARRIERS | |
320 | ============================================ | |
321 | ||
322 | Here is a list of differences between Linux kernel atomic operations | |
323 | and memory barriers, and the equivalents in QEMU: | |
324 | ||
325 | - atomic operations in Linux are always on a 32-bit int type and | |
326 | use a boxed atomic_t type; atomic operations in QEMU are polymorphic | |
327 | and use normal C types. | |
328 | ||
329 | - atomic_read and atomic_set in Linux give no guarantee at all; | |
330 | atomic_read and atomic_set in QEMU include a compiler barrier | |
331 | (similar to the ACCESS_ONCE macro in Linux). | |
332 | ||
333 | - most atomic read-modify-write operations in Linux return void; | |
334 | in QEMU, all of them return the old value of the variable. | |
335 | ||
336 | - different atomic read-modify-write operations in Linux imply | |
337 | a different set of memory barriers; in QEMU, all of them enforce | |
338 | sequential consistency, which means they imply full memory barriers | |
339 | before and after the operation. | |
340 | ||
341 | - Linux does not have an equivalent of atomic_mb_read() and | |
342 | atomic_mb_set(). In particular, note that set_mb() is a little | |
343 | weaker than atomic_mb_set(). | |
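
For instance, the first difference is visible directly in the code one
writes (a sketch; "counter" is an illustrative name):

     Linux                                   QEMU
     -------------------------               ------------------------
     atomic_t counter = ATOMIC_INIT(0);      int counter;
     atomic_inc(&counter);                   atomic_inc(&counter);
     v = atomic_read(&counter);              v = atomic_read(&counter);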
344 | ||
345 | ||
346 | SOURCES | |
347 | ======= | |
348 | ||
349 | * Documentation/memory-barriers.txt from the Linux kernel | |
350 | ||
351 | * "The JSR-133 Cookbook for Compiler Writers", available at | |
352 | http://g.oswego.edu/dl/jmm/cookbook.html |