Commit | Line | Data |
---|---|---|
33c3fc71 VD |
1 | MOTIVATION |
2 | ||
3 | The idle page tracking feature allows tracking which memory pages are being | |
4 | accessed by a workload and which are idle. This information can be useful for | |
5 | estimating the workload's working set size, which, in turn, can be taken into | |
6 | account when configuring the workload parameters, setting memory cgroup limits, | |
7 | or deciding where to place the workload within a compute cluster. | |
8 | ||
9 | It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. | |
10 | ||
11 | USER API | |
12 | ||
13 | The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, | |
14 | it consists of a single read-write file, /sys/kernel/mm/page_idle/bitmap. | |
15 | ||
16 | The file implements a bitmap where each bit corresponds to a memory page. The | |
17 | bitmap is represented by an array of 8-byte integers, and the page at PFN #i is | |
18 | mapped to bit #i%64 of array element #i/64; byte order is native. When a bit is | |
19 | set, the corresponding page is idle. | |
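The PFN-to-bit mapping described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper names bitmap_location() and is_idle() are hypothetical and are not part of any kernel interface.

```python
import struct

WORD_BYTES = 8    # each bitmap element is an 8-byte integer
WORD_BITS = 64    # so each element covers 64 consecutive PFNs


def bitmap_location(pfn):
    """Return (byte offset of the array element, bit index inside it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def is_idle(word_bytes, bit):
    """Decode one 8-byte element in native byte order and test a single bit."""
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit) & 1)
```

For example, PFN 130 lands in array element 2 (byte offset 16) at bit 2.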
20 | ||
21 | A page is considered idle if it has not been accessed since it was marked idle | |
22 | (for more details on what "accessed" actually means see the IMPLEMENTATION | |
23 | DETAILS section). To mark a page idle one has to set the bit corresponding to | |
24 | the page by writing to the file. A value written to the file is OR-ed with the | |
25 | current bitmap value. | |
26 | ||
27 | Only accesses to user memory pages are tracked. These are pages mapped to a | |
28 | process address space, page cache and buffer pages, and swap cache pages. For other | |
29 | page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, | |
30 | and hence such pages are never reported idle. | |
31 | ||
32 | For huge pages the idle flag is set only on the head page, so one has to read | |
33 | /proc/kpageflags in order to correctly count idle huge pages. | |
34 | ||
35 | Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return | |
36 | -EINVAL if you are not starting the read/write on an 8-byte boundary, or | |
37 | if the size of the read/write is not a multiple of 8 bytes. Writing to | |
38 | this file beyond max PFN will return -ENXIO. | |
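One way to stay within the alignment rules above is to round every request out to whole 8-byte elements before touching the file. The sketch below assumes that approach. aligned_span() and read_words() are hypothetical helper names, and the path argument exists only so the code can be tried against an ordinary file.

```python
import struct

BITMAP_PATH = "/sys/kernel/mm/page_idle/bitmap"


def aligned_span(first_pfn, last_pfn):
    """Return (offset, size) in bytes of the smallest span of whole 8-byte
    elements that covers PFNs first_pfn..last_pfn inclusive."""
    start = (first_pfn // 64) * 8
    end = (last_pfn // 64 + 1) * 8
    return start, end - start


def read_words(first_pfn, last_pfn, path=BITMAP_PATH):
    """Read the covering elements and return them as a list of integers."""
    offset, size = aligned_span(first_pfn, last_pfn)
    # Unbuffered, so the request reaches the file with exactly this offset/size.
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        data = f.read(size)
    usable = len(data) - len(data) % 8   # tolerate a short read at the end
    return list(struct.unpack("=%dQ" % (usable // 8), data[:usable]))
```

Both the offset and the size produced this way are multiples of 8, which is the condition the file checks before returning -EINVAL.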
39 | ||
40 | That said, in order to estimate the number of pages that are not used by a | |
41 | workload one should: | |
42 | ||
43 | 1. Mark all the workload's pages as idle by setting corresponding bits in | |
44 | /sys/kernel/mm/page_idle/bitmap. The pages can be found by reading | |
45 | /proc/pid/pagemap if the workload is represented by a process, or by | |
46 | filtering out alien pages using /proc/kpagecgroup in case the workload is | |
47 | placed in a memory cgroup. | |
48 | ||
49 | 2. Wait until the workload accesses its working set. | |
50 | ||
51 | 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If | |
52 | one wants to ignore certain types of pages, e.g. mlocked pages since they | |
53 | are not reclaimable, one can filter them out using /proc/kpageflags. | |
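Steps 1 and 3 above reduce to bit manipulation on 8-byte words, which can be sketched as follows. The sketch assumes the list of PFNs has already been collected (for instance from /proc/pid/pagemap). words_to_write(), count_idle() and the read_word callback are hypothetical names used only for illustration.

```python
import struct
from collections import defaultdict


def words_to_write(pfns):
    """Step 1: build {byte offset: packed 8-byte word} with one bit set per PFN.
    Writes are OR-ed into the bitmap, so bits left at zero change nothing."""
    words = defaultdict(int)
    for pfn in pfns:
        words[(pfn // 64) * 8] |= 1 << (pfn % 64)
    return {off: struct.pack("=Q", w) for off, w in sorted(words.items())}


def count_idle(pfns, read_word):
    """Step 3: count the PFNs whose bit is still set.
    read_word(offset) must return the 8 bytes found at that bitmap offset."""
    cache = {}
    idle = 0
    for pfn in pfns:
        off = (pfn // 64) * 8
        if off not in cache:
            (cache[off],) = struct.unpack("=Q", read_word(off))
        idle += (cache[off] >> (pfn % 64)) & 1
    return idle
```

In real use, each entry returned by words_to_write() would be written at its offset in /sys/kernel/mm/page_idle/bitmap, and read_word would read 8 bytes back from the same file after the workload has run for a while.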
54 | ||
55 | See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, | |
56 | /proc/kpageflags, and /proc/kpagecgroup. | |
57 | ||
58 | IMPLEMENTATION DETAILS | |
59 | ||
60 | The kernel internally keeps track of accesses to user memory pages in order to | |
61 | reclaim unreferenced pages first on memory shortage conditions. A page is | |
62 | considered referenced if it has been recently accessed via a process address | |
63 | space, in which case one or more PTEs it is mapped to will have the Accessed bit | |
64 | set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The | |
65 | latter happens when: | |
66 | ||
67 | - a userspace process reads or writes a page using a system call (e.g. read(2) | |
68 | or write(2)) | |
69 | ||
70 | - a page that is used for storing filesystem buffers is read or written, | |
71 | because a process needs filesystem metadata stored in it (e.g. lists a | |
72 | directory tree) | |
73 | ||
74 | - a page is accessed by a device driver using get_user_pages() | |
75 | ||
76 | When a dirty page is written to swap or disk as a result of memory reclaim or | |
77 | exceeding the dirty memory limit, it is not marked referenced. | |
78 | ||
79 | The idle memory tracking feature adds a new page flag, the Idle flag. This flag | |
80 | is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API | |
81 | section), and cleared automatically whenever a page is referenced as defined | |
82 | above. | |
83 | ||
84 | When a page is marked idle, the Accessed bit must be cleared in all PTEs it is | |
85 | mapped to, otherwise we will not be able to detect accesses to the page coming | |
86 | from a process address space. To avoid interference with the reclaimer, which, | |
87 | as noted above, uses the Accessed bit to promote actively referenced pages, one | |
88 | more page flag is introduced, the Young flag. When the PTE Accessed bit is | |
89 | cleared as a result of setting or updating a page's Idle flag, the Young flag | |
90 | is set on the page. The reclaimer treats the Young flag as an extra PTE | |
91 | Accessed bit and therefore will consider such a page as referenced. | |
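The interplay between the PTE Accessed bit, the Idle flag and the Young flag can be illustrated with a toy model. This is not kernel code: the Page class and its method names are hypothetical, and it models a page with a single mapping only to show why the reclaimer does not lose reference information.

```python
class Page:
    """Toy model of one user page with a single PTE mapping."""

    def __init__(self):
        self.pte_accessed = False   # hardware-set Accessed bit in the PTE
        self.idle = False           # Idle page flag
        self.young = False          # Young page flag

    def touch(self):
        """The workload accesses the page through its mapping."""
        self.pte_accessed = True

    def _harvest_pte(self):
        # Clearing the Accessed bit on behalf of idle tracking: the page was
        # referenced, so it is no longer idle, and Young preserves that fact
        # for the reclaimer.
        if self.pte_accessed:
            self.pte_accessed = False
            self.idle = False
            self.young = True

    def mark_idle(self):
        """Model of setting the page's bit in the bitmap."""
        self._harvest_pte()
        self.idle = True

    def is_idle(self):
        """Model of reading the page's bit from the bitmap."""
        self._harvest_pte()
        return self.idle

    def reclaimer_sees_reference(self):
        """The reclaimer treats Young as an extra Accessed bit."""
        return self.pte_accessed or self.young
```

In this model, a page that is touched and then marked idle ends up with Idle and Young both set, so the reclaimer still sees it as referenced even though its Accessed bit was cleared.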
92 | ||
93 | Since the idle memory tracking feature is based on the memory reclaimer logic, | |
94 | it only works with pages that are on an LRU list; other pages are silently | |
95 | ignored. That means it will ignore a user memory page if it is isolated, but | |
96 | since there are usually not many of them, it should not affect the overall | |
97 | result noticeably. In order not to stall scanning of the idle page bitmap, | |
98 | locked pages may be skipped too. |