Commit | Line | Data |
---|---|---|
a1b2a555 CL |
1 | this_cpu operations |
2 | ------------------- | |
3 | ||
4 | this_cpu operations are a way of optimizing access to per cpu | |
5 | variables associated with the *currently* executing processor through | |
6 | the use of segment registers (or a dedicated register where the cpu | |
7 | permanently stores the beginning of the per cpu area for a specific | |
8 | processor). | |
9 | ||
10 | The this_cpu operations add a per cpu variable offset to the processor | |
11 | specific percpu base and encode that operation in the instruction | |
12 | operating on the per cpu variable. | |
13 | ||
14 | This means there are no atomicity issues between the calculation of | |
15 | the offset and the operation on the data. Therefore it is not | |
16 | necessary to disable preempt or interrupts to ensure that the | |
17 | processor is not changed between the calculation of the address and | |
18 | the operation on the data. | |
19 | ||
20 | Read-modify-write operations are of particular interest. Frequently | |
21 | processors have special lower latency instructions that can operate | |
22 | without the typical synchronization overhead but still provide some | |
23 | sort of relaxed atomicity guarantee. The x86 for example can execute | |
24 | RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the | |
25 | lock prefix and the associated latency penalty. | |
26 | ||
27 | Access to the variable without the lock prefix is not synchronized but | |
28 | synchronization is not necessary since we are dealing with per cpu | |
29 | data specific to the currently executing processor. Only the current | |
30 | processor should be accessing that variable and therefore there are no | |
31 | concurrency issues with other processors in the system. | |
32 | ||
33 | On x86 the fs: or the gs: segment registers contain the base of the | |
34 | per cpu area. It is then possible to simply use the segment override | |
35 | to relocate a per cpu relative address to the proper per cpu area for | |
36 | the processor. So the relocation to the per cpu base is encoded in the | |
37 | instruction via a segment register prefix. | |
38 | ||
39 | For example: | |
40 | ||
41 | DEFINE_PER_CPU(int, x); | |
42 | int z; | |
43 | ||
44 | z = this_cpu_read(x); | |
45 | ||
46 | results in a single instruction | |
47 | ||
48 | mov ax, gs:[x] | |
49 | ||
50 | instead of a sequence of calculation of the address and then a fetch | |
51 | from that address which occurs with the percpu operations. Before | |
52 | this_cpu_ops such a sequence also required preempt disable/enable to | |
53 | prevent the kernel from moving the thread to a different processor | |
54 | while the calculation is performed. | |
55 | ||
56 | The main use of the this_cpu operations has been to optimize counter | |
57 | operations. | |
58 | ||
59 | this_cpu_inc(x) | |
60 | ||
61 | results in the following single instruction (no lock prefix!) | |
62 | ||
63 | inc gs:[x] | |
64 | ||
65 | instead of the following operations required if there is no segment | |
66 | register. | |
67 | ||
68 | int *y; | |
69 | int cpu; | |
70 | ||
71 | cpu = get_cpu(); | |
72 | y = per_cpu_ptr(&x, cpu); | |
73 | (*y)++; | |
74 | put_cpu(); | |
75 | ||
76 | Note that these operations can only be used on percpu data that is | |
77 | reserved for a specific processor. Without disabling preemption in the | |
78 | surrounding code this_cpu_inc() will only guarantee that one of the | |
79 | percpu counters is correctly incremented. However, there is no | |
80 | guarantee that the OS will not move the process directly before or | |
81 | after the this_cpu instruction is executed. In general this means that | |
82 | the values of the individual counters for each processor are | |
83 | meaningless. The sum of all the per cpu counters is the only value | |
84 | that is of interest. | |
85 | ||
86 | Per cpu variables are used for performance reasons. Bouncing cache | |
87 | lines can be avoided if multiple processors concurrently go through | |
88 | the same code paths. Since each processor has its own per cpu | |
89 | variables no concurrent cacheline updates take place. The price that | |
90 | has to be paid for this optimization is the need to add up the per cpu | |
91 | counters when the value of the counter is needed. | |
92 | ||
93 | ||
94 | Special operations: | |
95 | ------------------- | |
96 | ||
97 | y = this_cpu_ptr(&x) | |
98 | ||
99 | Takes the offset of a per cpu variable (&x !) and returns the address | |
100 | of the per cpu variable that belongs to the currently executing | |
101 | processor. this_cpu_ptr avoids multiple steps that the common | |
102 | get_cpu/put_cpu sequence requires. No processor number is | |
103 | available. Instead the offset of the local per cpu area is simply | |
104 | added to the percpu offset. | |
105 | ||
106 | ||
107 | ||
108 | Per cpu variables and offsets | |
109 | ----------------------------- | |
110 | ||
111 | Per cpu variables have *offsets* to the beginning of the percpu | |
112 | area. They do not have addresses although they look like that in the | |
113 | code. Offsets cannot be directly dereferenced. The offset must be | |
114 | added to a base pointer of a percpu area of a processor in order to | |
115 | form a valid address. | |
116 | ||
117 | Therefore the use of x or &x outside of the context of per cpu | |
118 | operations is invalid and will generally be treated like a NULL | |
119 | pointer dereference. | |
120 | ||
121 | In the context of per cpu operations | |
122 | ||
123 | x is a per cpu variable. Most this_cpu operations take a per cpu | |
124 | variable. | |
125 | ||
126 | &x is the *offset* of a per cpu variable. this_cpu_ptr() takes | |
127 | the offset of a per cpu variable which makes this look a bit | |
128 | strange. | |
129 | ||
130 | ||
131 | ||
132 | Operations on a field of a per cpu structure | |
133 | -------------------------------------------- | |
134 | ||
135 | Let's say we have a percpu structure | |
136 | ||
137 | struct s { | |
138 | int n,m; | |
139 | }; | |
140 | ||
141 | DEFINE_PER_CPU(struct s, p); | |
142 | ||
143 | ||
144 | Operations on these fields are straightforward | |
145 | ||
146 | this_cpu_inc(p.m) | |
147 | ||
148 | z = this_cpu_cmpxchg(p.m, 0, 1); | |
149 | ||
150 | ||
151 | If we have an offset to struct s: | |
152 | ||
153 | struct s __percpu *ps = &p; | |
154 | ||
155 | this_cpu_dec(ps->m); | |
156 | ||
157 | z = this_cpu_inc_return(ps->n); | |
158 | ||
159 | ||
160 | The calculation of the pointer may require the use of this_cpu_ptr() | |
161 | if we do not make use of this_cpu ops later to manipulate fields: | |
162 | ||
163 | struct s *pp; | |
164 | ||
165 | pp = this_cpu_ptr(&p); | |
166 | ||
167 | pp->m--; | |
168 | ||
169 | z = pp->n++; | |
170 | ||
171 | ||
172 | Variants of this_cpu ops | |
173 | ------------------------- | |
174 | ||
175 | this_cpu ops are interrupt safe. Some architectures do not support | |
176 | these per cpu local operations. In that case the operation must be | |
177 | replaced by code that disables interrupts, then does the operations | |
178 | that are guaranteed to be atomic and then re-enables interrupts. Doing | |
179 | so is expensive. If there are other reasons why the scheduler cannot | |
180 | change the processor we are executing on then there is no reason to | |
181 | disable interrupts. For that purpose the __this_cpu operations are | |
182 | provided. For example: | |
183 | ||
184 | __this_cpu_inc(x); | |
185 | ||
186 | Will increment x and will not fall back to code that disables | |
187 | interrupts on platforms that cannot accomplish atomicity through | |
188 | address relocation and a Read-Modify-Write operation in the same | |
189 | instruction. | |
190 | ||
191 | ||
192 | ||
193 | &this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) | |
194 | -------------------------------------------- | |
195 | ||
196 | The first operation takes the offset and forms an address and then | |
197 | adds the offset of the n field. | |
198 | ||
199 | The second one first adds the two offsets and then does the | |
200 | relocation. IMHO the second form looks cleaner and has an easier time | |
201 | with (). The second form also is consistent with the way | |
202 | this_cpu_read() and friends are used. | |
203 | ||
204 | ||
205 | Christoph Lameter, April 3rd, 2013 |
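
The offset arithmetic described in the sections on per cpu offsets and on
&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) can be illustrated with a small
userspace model. This is only a sketch and not kernel code: struct model_area,
model_areas, model_cpu and all model_* functions are hypothetical names
invented for the illustration, and an ordinary array stands in for the real
per cpu areas. It assumes nothing beyond standard C.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4		/* hypothetical number of simulated processors */

struct s {
	int n, m;
};

/* Hypothetical layout of one processor's per cpu area. */
struct model_area {
	long other;		/* unrelated per cpu data placed before p */
	struct s p;
};

/* One private area per simulated processor. */
static struct model_area model_areas[MODEL_NR_CPUS];

/* Stand-in for "the processor we are currently executing on". */
static int model_cpu;

/* Relocation: add an offset to the base of the current processor's area. */
static void *model_this_cpu_ptr(size_t offset)
{
	return (unsigned char *)&model_areas[model_cpu] + offset;
}

/* First form: relocate the structure offset, then step to field n. */
static int *model_relocate_then_field(size_t ps)
{
	struct s *pp = model_this_cpu_ptr(ps);

	return &pp->n;
}

/* Second form: add the structure and field offsets, then relocate. */
static int *model_field_then_relocate(size_t ps)
{
	return model_this_cpu_ptr(ps + offsetof(struct s, n));
}
```

In this model the "per cpu variable" p is nothing more than
offsetof(struct model_area, p). Passing that offset to either helper yields
the same address inside the area selected by model_cpu, and a write through
that address leaves the areas of the other simulated processors untouched.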