Commit | Line | Data |
---|---|---|
898bd37a MCC |
1 | ============== |
2 | Data Integrity | |
3 | ============== | |
4 | ||
5 | 1. Introduction | |
6 | =============== | |
c1c72b59 MP |
7 | |
8 | Modern filesystems feature checksumming of data and metadata to | |
9 | protect against data corruption. However, the detection of the | |
10 | corruption is done at read time, which could potentially be months | |
11 | after the data was written. At that point the original data that the | |
12 | application tried to write is most likely lost. | |
13 | ||
14 | The solution is to ensure that the disk is actually storing what the | |
15 | application meant it to. Recent additions to both the SCSI family | |
16 | protocols (SBC Data Integrity Field, SCC protection proposal) as well | |
17 | as SATA/T13 (External Path Protection) try to remedy this by adding | |
18 | support for appending integrity metadata to an I/O. The integrity | |
19 | metadata (or protection information in SCSI terminology) includes a | |
20 | checksum for each sector as well as an incrementing counter that | |
21 | ensures the individual sectors are written in the right order. Some | |
22 | protection schemes also ensure that the I/O is written to the right | |
23 | place on disk. | |
24 | ||
25 | Current storage controllers and devices implement various protective | |
26 | measures, for instance checksumming and scrubbing. But these | |
27 | technologies are working in their own isolated domains or at best | |
28 | between adjacent nodes in the I/O path. The interesting thing about | |
29 | DIF and the other integrity extensions is that the protection format | |
30 | is well defined and every node in the I/O path can verify the | |
31 | integrity of the I/O and reject it if corruption is detected. This | |
32 | allows not only corruption prevention but also isolation of the point | |
33 | of failure. | |
34 | ||
898bd37a MCC |
35 | 2. The Data Integrity Extensions |
36 | ================================ | |
c1c72b59 MP |
37 | |
38 | As written, the protocol extensions only protect the path between | |
39 | controller and storage device. However, many controllers actually | |
40 | allow the operating system to interact with the integrity metadata | |
41 | (IMD). We have been working with several FC/SAS HBA vendors to enable | |
42 | the protection information to be transferred to and from their | |
43 | controllers. | |
44 | ||
45 | The SCSI Data Integrity Field works by appending 8 bytes of protection | |
46 | information to each sector. The data + integrity metadata is stored | |
47 | in 520 byte sectors on disk. Data + IMD are interleaved when | |
48 | transferred between the controller and target. The T13 proposal is | |
49 | similar. | |
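The on-disk layout described above can be sketched in user space. This is only an illustration, assuming 512-byte data sectors and the usual DIF split of the 8 bytes into a 2-byte guard (checksum), a 2-byte application tag and a 4-byte reference tag. The names `pi_tuple` and `interleave` are hypothetical and are not kernel or HBA interfaces.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512u  /* data bytes per sector */
#define PI_SIZE       8u  /* protection information bytes per sector */

/* Hypothetical view of one 8-byte tuple. The fields are kept as raw
 * bytes because they travel in big-endian order. */
struct pi_tuple {
    uint8_t guard_tag[2]; /* checksum of the sector data */
    uint8_t app_tag[2];   /* application tag */
    uint8_t ref_tag[4];   /* incrementing counter */
};

/* Interleave nsect data sectors with their tuples: data0, pi0, data1, ...
 * 'out' must hold nsect * (SECTOR_SIZE + PI_SIZE) bytes.
 * Returns the number of bytes written. */
size_t interleave(const uint8_t *data, const struct pi_tuple *pi,
                  size_t nsect, uint8_t *out)
{
    size_t i;

    for (i = 0; i < nsect; i++) {
        memcpy(out, data + i * SECTOR_SIZE, SECTOR_SIZE);
        out += SECTOR_SIZE;
        memcpy(out, &pi[i], PI_SIZE);
        out += PI_SIZE;
    }
    return nsect * (SECTOR_SIZE + PI_SIZE);
}
```

Each 512-byte sector therefore occupies 520 bytes once its tuple is appended, which is the sector size mentioned above.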
50 | ||
51 | Because it is highly inconvenient for operating systems to deal with | |
52 | 520 (and 4104) byte sectors, we approached several HBA vendors and | |
53 | encouraged them to allow separation of the data and integrity metadata | |
54 | scatter-gather lists. | |
55 | ||
56 | The controller will interleave the buffers on write and split them on | |
61fd2167 | 57 | read. This means that Linux can DMA the data buffers to and from |
c1c72b59 MP |
58 | host memory without changes to the page cache. |
59 | ||
60 | Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs | |
61 | is somewhat heavy to compute in software. Benchmarks found that | |
62 | calculating this checksum had a significant impact on system | |
63 | performance for a number of workloads. Some controllers allow a | |
64 | lighter-weight checksum to be used when interfacing with the operating | |
65 | system. Emulex, for instance, supports the TCP/IP checksum instead. | |
66 | The IP checksum received from the OS is converted to the 16-bit CRC | |
67 | when writing and vice versa. This allows the integrity metadata to be | |
68 | generated by Linux or the application at very low cost (comparable to | |
69 | software RAID5). | |
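To give a feel for the cost difference between the two checksums, here is a minimal user-space sketch of both. It assumes the T10 DIF CRC parameters (polynomial 0x8BB7, initial value 0, no reflection, no final XOR) and an RFC 1071 style Internet checksum. The function names are made up for this example, and the code is written for readability, not speed.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC16 using the T10 DIF polynomial 0x8BB7.
 * Eight shift/XOR steps are needed for every input byte. */
uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* RFC 1071 style checksum: one's complement of the one's complement
 * sum of 16-bit big-endian words. One addition per word.
 * Intended for sector-sized buffers; the 32-bit accumulator is not
 * protected against overflow on very large inputs. */
uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (len & 1)
        sum += (uint32_t)(buf[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The CRC loop does per-bit work while the IP checksum is a plain sum, which is the reason the latter is so much cheaper in software.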
70 | ||
71 | The IP checksum is weaker than the CRC in terms of detecting bit | |
72 | errors. However, the strength is really in the separation of the data | |
61fd2167 | 73 | buffers and the integrity metadata. These two distinct buffers must |
c1c72b59 MP |
74 | match up for an I/O to complete. |
75 | ||
76 | The separation of the data and integrity metadata buffers as well as | |
77 | the choice in checksums is referred to as the Data Integrity | |
78 | Extensions. As these extensions are outside the scope of the protocol | |
79 | bodies (T10, T13), Oracle and its partners are trying to standardize | |
80 | them within the Storage Networking Industry Association. | |
81 | ||
898bd37a MCC |
82 | 3. Kernel Changes |
83 | ================= | |
c1c72b59 MP |
84 | |
85 | The data integrity framework in Linux enables protection information | |
86 | to be pinned to I/Os and sent to/received from controllers that | |
87 | support it. | |
88 | ||
89 | The advantage to the integrity extensions in SCSI and SATA is that | |
90 | they enable us to protect the entire path from application to storage | |
91 | device. However, at the same time this is also the biggest | |
92 | disadvantage. It means that the protection information must be in a | |
93 | format that can be understood by the disk. | |
94 | ||
95 | Generally Linux/POSIX applications are agnostic to the intricacies of | |
96 | the storage devices they are accessing. The virtual filesystem switch | |
97 | and the block layer make things like hardware sector size and | |
98 | transport protocols completely transparent to the application. | |
99 | ||
100 | However, this level of detail is required when preparing the | |
101 | protection information to send to a disk. Consequently, the very | |
102 | concept of an end-to-end protection scheme is a layering violation. | |
103 | It is completely unreasonable for an application to be aware whether | |
104 | it is accessing a SCSI or SATA disk. | |
105 | ||
106 | The data integrity support implemented in Linux attempts to hide this | |
107 | from the application. As far as the application (and to some extent | |
108 | the kernel) is concerned, the integrity metadata is opaque information | |
109 | that's attached to the I/O. | |
110 | ||
111 | The current implementation allows the block layer to automatically | |
112 | generate the protection information for any I/O. Eventually the | |
113 | intent is to move the integrity metadata calculation to userspace for | |
114 | user data. Metadata and other I/O that originates within the kernel | |
115 | will still use the automatic generation interface. | |
116 | ||
117 | Some storage devices allow each hardware sector to be tagged with a | |
118 | 16-bit value. The owner of this tag space is the owner of the block | |
119 | device, i.e. the filesystem in most cases. The filesystem can use | |
120 | this extra space to tag sectors as it sees fit. Because the tag | |
121 | space is limited, the block interface allows tagging bigger chunks by | |
122 | way of interleaving. This way, 8*16 bits of information can be | |
123 | attached to a typical 4KB filesystem block. | |
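The interleaving arithmetic can be illustrated as follows: a 4KB block spans 8 sectors of 512 bytes, each contributing 2 tag bytes, for 16 bytes in total. The sketch below assumes 8-byte tuples with the 16-bit tag at byte offset 2. `set_tags` and `get_tags` are hypothetical helpers, not the block layer interface.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PI_SIZE     8u  /* bytes per tuple */
#define APP_TAG_OFF 2u  /* assumed offset of the 16-bit tag in a tuple */
#define TAG_SIZE    2u  /* tag bytes per hardware sector */

/* Scatter 'tag' (nsect * TAG_SIZE bytes) over the tag fields of
 * nsect consecutive tuples. */
void set_tags(uint8_t *pi, size_t nsect, const uint8_t *tag)
{
    size_t i;

    for (i = 0; i < nsect; i++)
        memcpy(pi + i * PI_SIZE + APP_TAG_OFF, tag + i * TAG_SIZE, TAG_SIZE);
}

/* Gather the tag fields of nsect tuples back into one buffer. */
void get_tags(const uint8_t *pi, size_t nsect, uint8_t *tag)
{
    size_t i;

    for (i = 0; i < nsect; i++)
        memcpy(tag + i * TAG_SIZE, pi + i * PI_SIZE + APP_TAG_OFF, TAG_SIZE);
}
```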
124 | ||
125 | This also means that applications such as fsck and mkfs will need | |
126 | access to manipulate the tags from user space. A passthrough | |
127 | interface for this is being worked on. | |
128 | ||
129 | ||
898bd37a MCC |
130 | 4. Block Layer Implementation Details |
131 | ===================================== | |
c1c72b59 | 132 | |
898bd37a MCC |
133 | 4.1 Bio |
134 | ------- | |
c1c72b59 MP |
135 | |
136 | The data integrity patches add a new field to struct bio when | |
180b2f95 MP |
137 | CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a |
138 | pointer to a struct bip which contains the bio integrity payload. | |
139 | Essentially a bip is a trimmed down struct bio which holds a bio_vec | |
140 | containing the integrity metadata and the required housekeeping | |
141 | information (bvec pool, vector count, etc.). | |
c1c72b59 MP |
142 | |
143 | A kernel subsystem can enable data integrity protection on a bio by | |
144 | calling bio_integrity_alloc(bio). This will allocate and attach the | |
145 | bip to the bio. | |
146 | ||
147 | Individual pages containing integrity metadata can subsequently be | |
148 | attached using bio_integrity_add_page(). | |
149 | ||
150 | bio_free() will automatically free the bip. | |
151 | ||
152 | ||
898bd37a MCC |
153 | 4.2 Block Device |
154 | ---------------- | |
c1c72b59 MP |
155 | |
156 | Because the format of the protection data is tied to the physical | |
157 | disk, each block device has been extended with a block integrity | |
158 | profile (struct blk_integrity). This optional profile is registered | |
159 | with the block layer using blk_integrity_register(). | |
160 | ||
161 | The profile contains callback functions for generating and verifying | |
162 | the protection data, as well as getting and setting application tags. | |
163 | The profile also contains a few constants to aid in completing, | |
164 | merging and splitting the integrity metadata. | |
165 | ||
166 | Layered block devices will need to pick a profile that's appropriate | |
167 | for all subdevices. blk_integrity_compare() can help with that. DM | |
168 | and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 | |
169 | will require extra work due to the application tag. | |
170 | ||
171 | ||
898bd37a MCC |
172 | 5. Block Layer Integrity API |
173 | ============================ | |
c1c72b59 | 174 | |
898bd37a MCC |
175 | 5.1 Normal Filesystem |
176 | --------------------- | |
c1c72b59 MP |
177 | |
178 | The normal filesystem is unaware that the underlying block device | |
179 | is capable of sending/receiving integrity metadata. The IMD will | |
180 | be automatically generated by the block layer at submit_bio() time | |
181 | in case of a WRITE. A READ request will cause the I/O integrity | |
182 | to be verified upon completion. | |
183 | ||
898bd37a | 184 | IMD generation and verification can be toggled using the:: |
c1c72b59 MP |
185 | |
186 | /sys/block/<bdev>/integrity/write_generate | |
187 | ||
898bd37a | 188 | and:: |
c1c72b59 MP |
189 | |
190 | /sys/block/<bdev>/integrity/read_verify | |
191 | ||
192 | flags. | |
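A user-space tool could inspect these flags roughly as sketched below. Only the two sysfs paths come from the text above. The helper names are hypothetical, and the sketch assumes each file contains a single integer.

```c
#include <stdio.h>

/* Build "/sys/block/<bdev>/integrity/<attr>" into buf.
 * Returns 0 on success, -1 if the buffer is too small. */
int integrity_attr_path(char *buf, size_t len,
                        const char *bdev, const char *attr)
{
    int n = snprintf(buf, len, "/sys/block/%s/integrity/%s", bdev, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read an integer flag such as read_verify or write_generate.
 * Returns the value, or -1 if the file cannot be read or parsed. */
int integrity_flag_read(const char *bdev, const char *attr)
{
    char path[256];
    int value;
    FILE *f;

    if (integrity_attr_path(path, sizeof(path), bdev, attr) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, `integrity_flag_read("sda", "read_verify")` would report whether verification on READ is currently enabled for a disk named sda, if such a device exists and exposes an integrity profile.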
193 | ||
194 | ||
898bd37a MCC |
195 | 5.2 Integrity-Aware Filesystem |
196 | ------------------------------ | |
c1c72b59 MP |
197 | |
198 | A filesystem that is integrity-aware can prepare I/Os with IMD | |
199 | attached. It can also use the application tag space if this is | |
200 | supported by the block device. | |
201 | ||
202 | ||
898bd37a | 203 | `bool bio_integrity_prep(bio);` |
c1c72b59 MP |
204 | |
205 | To generate IMD for WRITE and to set up buffers for READ, the | |
206 | filesystem must call bio_integrity_prep(bio). | |
207 | ||
208 | Prior to calling this function, the bio data direction and start | |
209 | sector must be set, and the bio should have all data pages | |
210 | added. It is up to the caller to ensure that the bio does not | |
211 | change while I/O is in progress. | |
e23947bd | 212 | The bio is completed with an error if preparation fails for some reason. |
c1c72b59 MP |
213 | |
214 | ||
898bd37a MCC |
215 | 5.3 Passing Existing Integrity Metadata |
216 | --------------------------------------- | |
c1c72b59 MP |
217 | |
218 | Filesystems that either generate their own integrity metadata or | |
219 | are capable of transferring IMD from user space can use the | |
220 | following calls: | |
221 | ||
222 | ||
898bd37a | 223 | `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);` |
c1c72b59 MP |
224 | |
225 | Allocates the bio integrity payload and hangs it off of the bio. | |
226 | nr_pages indicates how many pages of protection data need to be | |
227 | stored in the integrity bio_vec list (similar to bio_alloc()). | |
228 | ||
229 | The integrity payload will be freed at bio_free() time. | |
230 | ||
231 | ||
898bd37a | 232 | `int bio_integrity_add_page(bio, page, len, offset);` |
c1c72b59 MP |
233 | |
234 | Attaches a page containing integrity metadata to an existing | |
235 | bio. The bio must have an existing bip, | |
236 | i.e. bio_integrity_alloc() must have been called. For a WRITE, | |
237 | the integrity metadata in the pages must be in a format | |
238 | understood by the target device with the notable exception that | |
239 | the sector numbers will be remapped as the request traverses the | |
240 | I/O stack. This implies that the pages added using this call | |
241 | will be modified during I/O! The first reference tag in the | |
242 | integrity metadata must have a value of bip->bip_sector. | |
243 | ||
244 | Pages can be added using bio_integrity_add_page() as long as | |
245 | there is room in the bip bio_vec array (nr_pages). | |
246 | ||
247 | Upon completion of a READ operation, the attached pages will | |
248 | contain the integrity metadata received from the storage device. | |
249 | It is up to the receiver to process them and verify data | |
250 | integrity upon completion. | |
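The reference tag remapping mentioned above explains why the attached pages are modified during I/O. The following user-space sketch shows the idea under stated assumptions: 8-byte tuples with a 32-bit big-endian reference tag at byte offset 4, and a hypothetical helper `remap_ref_tags` that stands in for whatever the I/O stack actually does.

```c
#include <stdint.h>
#include <stddef.h>

#define PI_SIZE     8u  /* bytes per tuple */
#define REF_TAG_OFF 4u  /* assumed offset of the 32-bit reference tag */

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Rewrite every reference tag that matches the expected "virtual"
 * sector number to the corresponding "physical" one. Tags that do not
 * match are left alone so that a later check can flag them.
 * Returns the number of tuples that were remapped. */
size_t remap_ref_tags(uint8_t *pi, size_t nsect, uint32_t virt, uint32_t phys)
{
    size_t remapped = 0;
    size_t i;

    for (i = 0; i < nsect; i++, virt++, phys++) {
        uint8_t *ref = pi + i * PI_SIZE + REF_TAG_OFF;

        if (get_be32(ref) == virt) {
            put_be32(ref, phys);
            remapped++;
        }
    }
    return remapped;
}
```

Here `virt` plays the role of the first reference tag as seen by the submitter, and `phys` is the sector number the target device expects.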
251 | ||
252 | ||
898bd37a MCC |
253 | 5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata |
254 | -------------------------------------------------------------------------- | |
c1c72b59 MP |
255 | |
256 | To enable integrity exchange on a block device the gendisk must be | |
257 | registered as capable: | |
258 | ||
898bd37a | 259 | `int blk_integrity_register(gendisk, blk_integrity);` |
c1c72b59 MP |
260 | |
261 | The blk_integrity struct is a template and should contain the | |
898bd37a | 262 | following:: |
c1c72b59 MP |
263 | |
264 | static struct blk_integrity my_profile = { | |
265 | .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", | |
266 | .generate_fn = my_generate_fn, | |
898bd37a | 267 | .verify_fn = my_verify_fn, |
c1c72b59 MP |
268 | .tuple_size = sizeof(struct my_tuple_size), |
269 | .tag_size = <tag bytes per hw sector>, | |
270 | }; | |
271 | ||
272 | 'name' is a text string which will be visible in sysfs. This is | |
273 | part of the userland API so choose it carefully and never change | |
274 | it. The format is standards body-type-variant-checksum. | |
275 | E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. | |
276 | ||
277 | 'generate_fn' generates appropriate integrity metadata (for WRITE). | |
278 | ||
279 | 'verify_fn' verifies that the data buffer matches the integrity | |
280 | metadata. | |
281 | ||
282 | 'tuple_size' must be set to match the size of the integrity | |
283 | metadata per sector, i.e. 8 bytes for DIF and EPP. | |
284 | ||
285 | 'tag_size' must be set to identify how many bytes of tag space | |
286 | are available per hardware sector. For DIF this is either 2 or | |
287 | 0 depending on the value of the Control Mode Page ATO bit. | |
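To show what a generate/verify pair has to do, here is a user-space sketch that works on flat buffers. It is not the kernel callback interface: the function signatures, the `seed` parameter and the tuple layout (2-byte guard, 2-byte tag, 4-byte reference tag, all big-endian) are assumptions made for illustration, and the guard is a simple one's complement checksum.

```c
#include <stdint.h>
#include <stddef.h>

#define SECTOR_SIZE 512u
#define PI_SIZE       8u

/* One's complement checksum over 16-bit big-endian words (len even). */
static uint16_t guard_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill one tuple per sector: guard, zeroed tag, ref tag = seed + i. */
void my_generate(const uint8_t *data, uint8_t *pi, size_t nsect, uint32_t seed)
{
    size_t i;

    for (i = 0; i < nsect; i++, data += SECTOR_SIZE, pi += PI_SIZE) {
        uint16_t guard = guard_csum(data, SECTOR_SIZE);
        uint32_t ref = seed + (uint32_t)i;

        pi[0] = (uint8_t)(guard >> 8);
        pi[1] = (uint8_t)guard;
        pi[2] = 0;
        pi[3] = 0;
        pi[4] = (uint8_t)(ref >> 24);
        pi[5] = (uint8_t)(ref >> 16);
        pi[6] = (uint8_t)(ref >> 8);
        pi[7] = (uint8_t)ref;
    }
}

/* Check guard and reference tag of every sector.
 * Returns -1 if all match, else the index of the first bad sector. */
long my_verify(const uint8_t *data, const uint8_t *pi, size_t nsect,
               uint32_t seed)
{
    size_t i;

    for (i = 0; i < nsect; i++, data += SECTOR_SIZE, pi += PI_SIZE) {
        uint16_t guard = (uint16_t)((pi[0] << 8) | pi[1]);
        uint32_t ref = ((uint32_t)pi[4] << 24) | ((uint32_t)pi[5] << 16) |
                       ((uint32_t)pi[6] << 8) | (uint32_t)pi[7];

        if (guard != guard_csum(data, SECTOR_SIZE) ||
            ref != seed + (uint32_t)i)
            return (long)i;
    }
    return -1;
}
```

Returning the index of the first failing sector mirrors the goal stated in the introduction: besides detecting corruption, it helps isolate where the failure occurred.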
288 | ||
c1c72b59 | 289 | ---------------------------------------------------------------------- |
898bd37a | 290 | |
c1c72b59 | 291 | 2007-12-24 Martin K. Petersen <[email protected]> |