=========================================================
Notes on Analysing Behaviour Using Events and Tracepoints
=========================================================
:Author: Mel Gorman (PCL information heavily based on email from Ingo Molnar)

1. Introduction
===============

Tracepoints (see Documentation/trace/tracepoints.rst) can be used without
creating custom kernel modules to register probe functions using the event
tracing infrastructure.

Simplistically, tracepoints represent important events that can be
taken in conjunction with other tracepoints to build a "Big Picture" of
what is going on within the system. There are a large number of methods for
gathering and interpreting these events. Lacking any current Best Practices,
this document describes some of the methods that can be used.

This document assumes that debugfs is mounted on /sys/kernel/debug and that
the appropriate tracing options have been configured into the kernel. It is
assumed that the PCL tool tools/perf has been installed and is in your path.

2. Listing Available Events
===========================

2.1 Standard Utilities
----------------------

All possible events are visible from /sys/kernel/debug/tracing/events. Simply
calling::

  $ find /sys/kernel/debug/tracing/events -type d

will give a fair indication of the number of events available.

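The directory listing above can be turned into a per-subsystem count. The
following is a minimal sketch, assuming the usual
events/<subsystem>/<event>/enable layout; the function name
``count_events_by_subsystem`` is hypothetical and is not part of any kernel
tool.

```shell
# Count tracepoint events per subsystem.
# Assumption: each event is a directory events/<subsystem>/<event>/
# that contains a file called "enable".
count_events_by_subsystem() {
    events_dir=${1:-/sys/kernel/debug/tracing/events}
    for f in "$events_dir"/*/*/enable; do
        [ -e "$f" ] || continue      # the glob matched nothing
        subsys=${f%/*/enable}        # strip "/<event>/enable"
        echo "${subsys##*/}"         # keep only "<subsystem>"
    done | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Called without arguments it reads the default debugfs location, which
normally requires root. Each output line is a subsystem name followed by
the number of events found under it.
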
2.2 PCL (Performance Counters for Linux)
----------------------------------------

Discovery and enumeration of all counters and events, including tracepoints,
are available with the perf tool. Getting a list of available events is a
simple case of::

  $ perf list 2>&1 | grep Tracepoint
  ext4:ext4_free_inode                     [Tracepoint event]
  ext4:ext4_request_inode                  [Tracepoint event]
  ext4:ext4_allocate_inode                 [Tracepoint event]
  ext4:ext4_write_begin                    [Tracepoint event]
  ext4:ext4_ordered_write_end              [Tracepoint event]
  [ .... remaining output snipped .... ]


3. Enabling Events
==================

3.1 System-Wide Event Enabling
------------------------------

See Documentation/trace/events.rst for a proper description of how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like::

  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done

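The one-liner above can also be written as a small reusable function. This
is only a sketch of one way to do it: the name ``enable_events`` is
hypothetical, and it matches on an event-name prefix with a shell glob
rather than using grep on the whole path, so it is slightly stricter than
the original loop.

```shell
# Write 1 to the "enable" file of every event whose name starts with
# the given prefix.
# Assumption: events live in events/<subsystem>/<event>/enable.
enable_events() {
    prefix=$1
    events_dir=${2:-/sys/kernel/debug/tracing/events}
    for f in "$events_dir"/*/"$prefix"*/enable; do
        [ -e "$f" ] || continue      # the glob matched nothing
        echo 1 > "$f" || echo "could not enable $f" >&2
    done
}
```

For example, ``enable_events mm_`` run as root would enable every event
whose name begins with mm\_ under the default debugfs location.
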
3.2 System-Wide Event Enabling with SystemTap
---------------------------------------------

In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds which processes
were allocating pages.
::

  global page_allocs

  probe kernel.trace("mm_page_alloc") {
          page_allocs[execname()]++
  }

  function print_count() {
          printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
          foreach (proc in page_allocs-)
                  printf("%-25d %s\n", page_allocs[proc], proc)
          printf ("\n")
          delete page_allocs
  }

  probe timer.s(5) {
          print_count()
  }

3.3 System-Wide Event Enabling with PCL
---------------------------------------

By specifying the -a switch and analysing sleep, the system-wide events
for a duration of time can be examined.
::

  $ perf stat -a \
        -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        sleep 10
  Performance counter stats for 'sleep 10':

           9630  kmem:mm_page_alloc
           2143  kmem:mm_page_free
           7424  kmem:mm_page_free_batched

   10.002577764  seconds time elapsed

Similarly, one could execute a shell and exit it as desired to get a report
at that point.

3.4 Local Event Enabling
------------------------

Documentation/trace/ftrace.rst describes how to enable events on a per-thread
basis using set_ftrace_pid.

3.5 Local Event Enablement with PCL
-----------------------------------

Events can be activated and tracked for the duration of a process on a local
basis using PCL as follows.
::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.909

    Performance counter stats for './hackbench 10':

          17803  kmem:mm_page_alloc
          12398  kmem:mm_page_free
           4827  kmem:mm_page_free_batched

    0.973913387  seconds time elapsed

4. Event Filtering
==================

Documentation/trace/ftrace.rst covers in-depth how to filter events in
ftrace. Obviously, using grep and awk on trace_pipe is an option, as is
any script that reads trace_pipe.

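As an illustration of the awk approach, the sketch below keeps only the
lines for a single event. It assumes that each trace line contains the
event name followed by a colon and surrounded by spaces, as in
"... 100.000001: mm_page_alloc: page=...". The function name
``filter_event`` is hypothetical.

```shell
# Print only the trace lines that belong to the named event.
# Reads trace text on stdin; the event name is the first argument.
# Matching on " <event>: " avoids also matching longer names such as
# mm_page_free_batched when mm_page_free is requested.
filter_event() {
    awk -v ev="$1" 'index($0, " " ev ": ") > 0'
}
```

It could be used on-line as
``filter_event mm_page_alloc < /sys/kernel/debug/tracing/trace_pipe`` or
on a saved copy of a trace.
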
5. Analysing Event Variances with PCL
=====================================

Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do by hand. In the event that the discrete event
occurrences are useful to the performance analyst, then perf can be used.
::

  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.890
  Time: 0.895
  Time: 0.915
  Time: 1.001
  Time: 0.899

   Performance counter stats for './hackbench 10' (5 runs):

          16630  kmem:mm_page_alloc         ( +-   3.542% )
          11486  kmem:mm_page_free          ( +-   4.771% )
           4730  kmem:mm_page_free_batched  ( +-   2.325% )

    0.982653002  seconds time elapsed   ( +-   1.448% )

In the event that some higher-level event is required that depends on some
aggregation of discrete events, then a script would need to be developed.

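Where the variance has to be worked out by hand, for example for a derived
value that perf does not report, a few lines of awk are enough. The sketch
below is an assumption about how such a helper might look, not something
shipped with perf; it reads one number per line and prints the mean and the
sample standard deviation (dividing by n - 1). The name ``mean_stddev`` is
hypothetical.

```shell
# Print the mean and sample standard deviation of numbers on stdin,
# one value per line.
mean_stddev() {
    awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) {
                print "need at least two samples" > "/dev/stderr"
                exit 1
            }
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            if (var < 0) var = 0     # guard against rounding error
            printf "mean=%.3f stddev=%.3f\n", mean, sqrt(var)
        }'
}
```

For instance, the per-run counts of an event collected from several
separate perf stat invocations could be piped into ``mean_stddev``.
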
Using --repeat, it is also possible to view how events are fluctuating over
time on a system-wide basis using -a and sleep.
::

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        -a --repeat 10 \
        sleep 1
  Performance counter stats for 'sleep 1' (10 runs):

           1066  kmem:mm_page_alloc         ( +-  26.148% )
            182  kmem:mm_page_free          ( +-   5.464% )
            890  kmem:mm_page_free_batched  ( +-  30.079% )

    1.002251757  seconds time elapsed   ( +-   0.005% )

6. Higher-Level Analysis with Helper Scripts
============================================

When events are enabled, those that trigger can be read from
/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
options exist as well. By post-processing the output, further information can
be gathered on-line as appropriate. Examples of post-processing might include:

- Reading information from /proc for the PID that triggered the event
- Deriving a higher-level event from a series of lower-level events
- Calculating latencies between two events

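As a rough illustration of calculating the time between two events, the
sketch below pairs each mm_page_alloc with the next mm_page_free of the same
page frame and prints how long the page was held. It makes several
assumptions about the trace text: the timestamp is a field of the form
"seconds.microseconds:", the event name follows it directly, and both events
carry a "pfn=" field. The function name ``page_hold_times`` is hypothetical.

```shell
# For each page frame, print the time between its allocation and the
# following free. Reads trace text on stdin.
page_hold_times() {
    awk '
    {
        ts = ""; ev = ""; pfn = ""
        # Find the timestamp field; the event name is the next field.
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                ts = substr($i, 1, length($i) - 1)
                ev = $(i + 1)
                break
            }
        if (ts == "") next
        # Pick the pfn out of the event arguments.
        for (j = i + 2; j <= NF; j++)
            if ($j ~ /^pfn=/) pfn = substr($j, 5)
        if (pfn == "") next
        if (ev == "mm_page_alloc:")
            start[pfn] = ts
        else if (ev == "mm_page_free:" && (pfn in start)) {
            printf "pfn=%s held_for=%.6f\n", pfn, ts - start[pfn]
            delete start[pfn]
        }
    }'
}
```

Frees of pages whose allocation was not seen in the trace are silently
skipped, so the output only covers pairs observed in full.
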
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
script that can read trace_pipe from STDIN or a copy of a trace. When used
on-line, it can be interrupted once to generate a report without exiting
and twice to exit.

Simplistically, the script just reads STDIN and counts up events but it
can also do more, such as:

- Derive high-level events from many low-level events. If a number of pages
  are freed to the main allocator from the per-CPU lists, it recognises
  that as one per-CPU drain even though there is no specific tracepoint
  for that event
- Aggregate based on PID or individual process number
- Report, in the event memory is getting externally fragmented, whether
  the fragmentation event was severe or moderate
- Record, when receiving an event about a PID, who the parent was so
  that if large numbers of events are coming from very short-lived
  processes, the parent process responsible for creating all the helpers
  can be identified

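The simplest of these ideas, counting events per task, can be sketched in a
few lines of shell. This is not how the Perl script is implemented; it is a
minimal stand-in that assumes the first whitespace-separated field of each
trace line is "<task>-<pid>" and that task names contain no spaces. The
function name ``count_by_task`` is hypothetical.

```shell
# Count occurrences of one event per task, busiest task first.
# Reads trace text on stdin; the event name is the first argument.
count_by_task() {
    awk -v ev="$1" '
        index($0, " " ev ": ") > 0 { count[$1]++ }
        END { for (t in count) printf "%d %s\n", count[t], t }
    ' | sort -k1,1nr -k2,2
}
```

Each output line is a count followed by the task, so the heaviest users of
an event appear at the top.
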
7. Lower-Level Analysis with PCL
================================

There may also be a requirement to identify what functions within a program
were generating events within the kernel. To begin this sort of analysis, the
data must be recorded. At the time of writing, this required root:
::

  $ perf record -c 1 \
        -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        ./hackbench 10
  Time: 0.894
  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]

Note the use of '-c 1' to set the sample period to one so that every event
is recorded. The default sample period is quite high to minimise overhead but
the information collected can be very coarse as a result.

This recording produced a file called perf.data which can be analysed using
perf report.
::

  $ perf report
  # Samples: 30922
  #
  # Overhead    Command                     Shared Object
  # ........  .........  ................................
  #
      87.27%  hackbench  [vdso]
       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
       2.62%  hackbench  /lib/ld-2.9.so
       1.52%       perf  [vdso]
       1.22%  hackbench  ./hackbench
       0.48%  hackbench  [kernel]
       0.02%       perf  /lib/i686/cmov/libc-2.9.so
       0.01%       perf  /usr/bin/perf
       0.01%       perf  /lib/ld-2.9.so
       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #

According to this, the vast majority of events were triggered within the
VDSO. With simple binaries, this will often be the case so let's
take a slightly different example. In the course of writing this, it was
noticed that X was generating an insane number of page allocations so let's
look at it:
::

  $ perf record -c 1 -f \
        -e kmem:mm_page_alloc -e kmem:mm_page_free \
        -e kmem:mm_page_free_batched \
        -p `pidof X`

This was interrupted after a few seconds and the recording examined:
::

  $ perf report
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object
  # ........  .......  .......................................
  #
      51.95%     Xorg  [vdso]
      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
       0.01%     Xorg  [kernel]
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #

So, almost half of the events are occurring in a library. To get an idea which
symbol:
::

  $ perf report --sort comm,dso,symbol
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object  Symbol
  # ........  .......  .......................................  ......
  #
      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
       0.01%     Xorg  [kernel]                                 [k] read_hpet
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack

To see where within the function pixmanFillsse2 things are going wrong:
::

  $ perf annotate pixmanFillsse2
  [ ... ]
    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
         :      }
         :
         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
         :        *__P = __B;
   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
    0.00 :         34ef5:       ff
   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
    0.00 :         34efd:       ff
   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
    0.00 :         34f05:       ff
   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
    0.00 :         34f0d:       ff
   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)

At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much but a starting point would be to take an
ancient build of libpixman out of the library path where it was totally
forgotten about from months ago!
338 | forgotten about from months ago! |