=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in two
places: one is in the file ``/proc/diskstats``, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

   2.4:
      3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

   2.6+ sysfs:
      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      35486 38030 38030 38030

   2.6+ diskstats:
      3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3 1 hda1 35486 38030 38030 38030

   4.18+ diskstats:
      3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

The sysfs choice works well if you are watching a known, small set of
disks. ``/proc/diskstats`` may be a better choice if you are watching a
large number of disks because you'll avoid the overhead of 50, 100, or
500 or more opens/closes with each snapshot of your disk statistics.

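As a rough illustration of the two 2.6+ access paths, here is a minimal
Python sketch. The function names (``parse_diskstats_line``,
``parse_sysfs_stat``, ``read_diskstats``) are made up for this example and
are not part of any kernel or library interface; the code only assumes the
whitespace-separated layouts shown in the examples above.

```python
def parse_diskstats_line(line):
    """Split one /proc/diskstats line into (major, minor, name, fields)."""
    parts = line.split()
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    # Everything after the device name is a statistics field.
    fields = [int(x) for x in parts[3:]]
    return major, minor, name, fields


def parse_sysfs_stat(text):
    """Parse the contents of /sys/block/<dev>/stat into a list of integers."""
    return [int(x) for x in text.split()]


def read_diskstats(path="/proc/diskstats"):
    """Read all devices with a single open/close, keyed by device name."""
    stats = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                _, _, name, fields = parse_diskstats_line(line)
                stats[name] = fields
    return stats
```

Reading ``/proc/diskstats`` this way costs one open/close per snapshot no
matter how many disks are present, whereas the sysfs route needs one open
per watched device.
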

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the eleven fields, beginning with 446216. If you look at
``/proc/diskstats``, the eleven fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
eleven fields of statistics, each meaning exactly the same things.
(The 4.18+ format appends four discard fields, fields 12 to 15, after
these eleven.) All fields except field 9 are cumulative since boot.
Field 9 should go to zero as I/Os complete; all others only increase
(unless they overflow and wrap). These are unsigned long numbers of the
native word size (32-bit or 64-bit), and on a very busy or long-lived
system they may wrap. Applications should be prepared to deal with that;
unless the interval between your observations is measured in large
numbers of minutes or hours, they should not wrap twice before you
notice them.

Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.

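The wraparound caveat above can be handled with modular arithmetic. The
following is a sketch only: ``counter_delta`` and ``snapshot_delta`` are
hypothetical helper names, and the ``bits`` parameter is an assumption the
caller must set to the kernel's native word size (32 or 64). It is correct
only if a counter wrapped at most once between the two snapshots.

```python
IN_FLIGHT_INDEX = 8  # field 9 (0-based index 8) is the only non-cumulative field


def counter_delta(old, new, bits=64):
    """Return new - old for a counter that wraps at 2**bits."""
    # Python's % always yields a non-negative result for a positive modulus,
    # so a single wrap between the two samples is handled transparently.
    return (new - old) % (1 << bits)


def snapshot_delta(old_fields, new_fields, bits=64):
    """Per-field change between two snapshots of the same device."""
    deltas = []
    for i, (old, new) in enumerate(zip(old_fields, new_fields)):
        if i == IN_FLIGHT_INDEX:
            # Field 9 is a gauge, not a counter: report its current value.
            deltas.append(new)
        else:
            deltas.append(counter_delta(old, new, bits))
    return deltas
```
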

Field  1 -- # of reads completed
    This is the total number of reads completed successfully.

Field  2 -- # of reads merged, field 6 -- # of writes merged
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field  3 -- # of sectors read
    This is the total number of sectors read successfully.

Field  4 -- # of milliseconds spent reading
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field  5 -- # of writes completed
    This is the total number of writes completed successfully.

Field  6 -- # of writes merged
    See the description of field 2.

Field  7 -- # of sectors written
    This is the total number of sectors written successfully.

Field  8 -- # of milliseconds spent writing
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field  9 -- # of I/Os currently in progress
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as they
    finish.

Field 10 -- # of milliseconds spent doing I/Os
    This field increases so long as field 9 is nonzero.

Field 11 -- weighted # of milliseconds spent doing I/Os
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

Field 12 -- # of discards completed
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged
    See the description of field 2.

Field 14 -- # of sectors discarded
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding
    This is the total number of milliseconds spent by all discards (as
    measured from __make_request() to end_that_request_last()).

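To show how the cumulative fields can be turned into per-interval figures,
here is a hedged Python sketch that compares two snapshots of one whole-disk
device. The function name ``derive_rates``, the result keys, and the ``bits``
parameter are assumptions for this example; the ratios are common
interpretations of fields 1, 4, 5, 8, 10 and 11 as described above, not
something the kernel itself reports. ``interval_ms`` is the wall-clock time
between the snapshots and must be greater than zero.

```python
def derive_rates(old, new, interval_ms, bits=64):
    """Derive per-interval figures from two snapshots of the same disk."""
    mod = 1 << bits

    def delta(index):
        # Wrap-safe difference for a cumulative counter (0-based index).
        return (new[index] - old[index]) % mod

    reads, read_ms = delta(0), delta(3)          # fields 1 and 4
    writes, write_ms = delta(4), delta(7)        # fields 5 and 8
    busy_ms, weighted_ms = delta(9), delta(10)   # fields 10 and 11

    return {
        # Fraction of the interval during which at least one I/O was in progress.
        "utilization": busy_ms / interval_ms,
        # Time-weighted average number of I/Os in progress.
        "avg_in_progress": weighted_ms / interval_ms,
        # Average milliseconds per completed read or write.
        "avg_read_ms": read_ms / reads if reads else 0.0,
        "avg_write_ms": write_ms / writes if writes else 0.0,
    }
```
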

To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistics information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
partitions from that for disks. There are only *four* fields available
for partitions on 2.6+ machines. This is reflected in the examples above.

Field  1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field  2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field  3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field  4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

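Because the number of fields differs between these formats, a tool has to
decide how to label what it reads. The sketch below picks a label set from
the field count alone; the names in the three lists and the ``label_fields``
function are invented for this example, and only the 4-, 11- and 15-field
layouts described in this document are recognized.

```python
DISK_FIELDS = [
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
]
DISCARD_FIELDS = [
    "discards_completed", "discards_merged", "sectors_discarded", "ms_discarding",
]
PARTITION_FIELDS = [
    "reads_issued", "sectors_read", "writes_issued", "sectors_written",
]


def label_fields(fields):
    """Map a list of statistics values to names, based on how many there are."""
    count = len(fields)
    if count == len(PARTITION_FIELDS):
        names = PARTITION_FIELDS            # four-field partition format
    elif count == len(DISK_FIELDS):
        names = DISK_FIELDS                 # eleven-field format
    elif count == len(DISK_FIELDS) + len(DISCARD_FIELDS):
        names = DISK_FIELDS + DISCARD_FIELDS  # 4.18+ format with discards
    else:
        raise ValueError("unrecognized number of statistics fields: %d" % count)
    return dict(zip(names, fields))
```
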

Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks. This is
a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent requests,
the number of reads/writes issued can be several times higher than the
number of reads/writes completed.

In 2.6.25, the full statistics set is again available for partitions and
disk and partition statistics are consistent again. Since we still don't
keep a record of the partition-relative address, an operation is attributed
to the partition which contains the first sector of the request after the
eventual merges. As requests can be merged across partitions, this could
lead to some (probably insignificant) inaccuracy.

Additional notes
----------------

In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

   none /sys sysfs defaults 0 0

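Since sysfs may be mounted somewhere other than ``/sys``, a tool may want
to look the mount point up rather than hard-code it. A minimal sketch,
assuming the usual whitespace-separated mount table layout (device, mount
point, filesystem type, ...) as found in ``/proc/mounts``; the function name
``find_sysfs_mount`` is hypothetical:

```python
def find_sysfs_mount(mounts_text):
    """Return the first sysfs mount point in a mount table, or None."""
    for line in mounts_text.splitlines():
        parts = line.split()
        # Column 3 is the filesystem type, column 2 the mount point.
        if len(parts) >= 3 and parts[2] == "sysfs":
            return parts[1]
    return None
```

A caller would typically pass in the text read from ``/proc/mounts`` and
fall back to ``/sys`` if nothing is found.
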

In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
``/proc/stat`` take a very different format from those in ``/proc/partitions``
(see proc(5), if your system has it).

-- [email protected]