]>
Commit | Line | Data |
---|---|---|
bbb5bbb0 RD |
1 | <?xml version="1.0" encoding="UTF-8"?> |
2 | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" | |
3 | "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> | |
4 | ||
5 | <book id="Linux-filesystems-API"> | |
6 | <bookinfo> | |
7 | <title>Linux Filesystems API</title> | |
8 | ||
9 | <legalnotice> | |
10 | <para> | |
11 | This documentation is free software; you can redistribute | |
12 | it and/or modify it under the terms of the GNU General Public | |
13 | License as published by the Free Software Foundation; either | |
14 | version 2 of the License, or (at your option) any later | |
15 | version. | |
16 | </para> | |
17 | ||
18 | <para> | |
19 | This program is distributed in the hope that it will be | |
20 | useful, but WITHOUT ANY WARRANTY; without even the implied | |
21 | warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. | |
22 | See the GNU General Public License for more details. | |
23 | </para> | |
24 | ||
25 | <para> | |
26 | You should have received a copy of the GNU General Public | |
27 | License along with this program; if not, write to the Free | |
28 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, | |
29 | MA 02111-1307 USA | |
30 | </para> | |
31 | ||
32 | <para> | |
33 | For more details see the file COPYING in the source | |
34 | distribution of Linux. | |
35 | </para> | |
36 | </legalnotice> | |
37 | </bookinfo> | |
38 | ||
39 | <toc></toc> | |
40 | ||
41 | <chapter id="vfs"> | |
42 | <title>The Linux VFS</title> | |
5c3b4474 | 43 | <sect1 id="the_filesystem_types"><title>The Filesystem types</title> |
bbb5bbb0 RD |
44 | !Iinclude/linux/fs.h |
45 | </sect1> | |
5c3b4474 | 46 | <sect1 id="the_directory_cache"><title>The Directory Cache</title> |
bbb5bbb0 RD |
47 | !Efs/dcache.c |
48 | !Iinclude/linux/dcache.h | |
49 | </sect1> | |
5c3b4474 | 50 | <sect1 id="inode_handling"><title>Inode Handling</title> |
bbb5bbb0 RD |
51 | !Efs/inode.c |
52 | !Efs/bad_inode.c | |
53 | </sect1> | |
5c3b4474 | 54 | <sect1 id="registration_and_superblocks"><title>Registration and Superblocks</title> |
bbb5bbb0 RD |
55 | !Efs/super.c |
56 | </sect1> | |
5c3b4474 | 57 | <sect1 id="file_locks"><title>File Locks</title> |
bbb5bbb0 RD |
58 | !Efs/locks.c |
59 | !Ifs/locks.c | |
60 | </sect1> | |
5c3b4474 | 61 | <sect1 id="other_functions"><title>Other Functions</title> |
bbb5bbb0 RD |
62 | !Efs/mpage.c |
63 | !Efs/namei.c | |
64 | !Efs/buffer.c | |
64b14519 | 65 | !Eblock/bio.c |
bbb5bbb0 RD |
66 | !Efs/seq_file.c |
67 | !Efs/filesystems.c | |
68 | !Efs/fs-writeback.c | |
69 | !Efs/block_dev.c | |
70 | </sect1> | |
71 | </chapter> | |
72 | ||
73 | <chapter id="proc"> | |
74 | <title>The proc filesystem</title> | |
75 | ||
5c3b4474 | 76 | <sect1 id="sysctl_interface"><title>sysctl interface</title> |
bbb5bbb0 RD |
77 | !Ekernel/sysctl.c |
78 | </sect1> | |
79 | ||
5c3b4474 | 80 | <sect1 id="proc_filesystem_interface"><title>proc filesystem interface</title> |
bbb5bbb0 RD |
81 | !Ifs/proc/base.c |
82 | </sect1> | |
83 | </chapter> | |
84 | ||
36182185 RD |
85 | <chapter id="fs_events"> |
86 | <title>Events based on file descriptors</title> | |
87 | !Efs/eventfd.c | |
88 | </chapter> | |
89 | ||
bbb5bbb0 RD |
90 | <chapter id="sysfs"> |
91 | <title>The Filesystem for Exporting Kernel Objects</title> | |
92 | !Efs/sysfs/file.c | |
93 | !Efs/sysfs/symlink.c | |
bbb5bbb0 RD |
94 | </chapter> |
95 | ||
96 | <chapter id="debugfs"> | |
97 | <title>The debugfs filesystem</title> | |
98 | ||
5c3b4474 | 99 | <sect1 id="debugfs_interface"><title>debugfs interface</title> |
bbb5bbb0 RD |
100 | !Efs/debugfs/inode.c |
101 | !Efs/debugfs/file.c | |
102 | </sect1> | |
103 | </chapter> | |
104 | ||
733b72c3 RD |
105 | <chapter id="LinuxJDBAPI"> |
106 | <chapterinfo> | |
107 | <title>The Linux Journalling API</title> | |
108 | ||
109 | <authorgroup> | |
110 | <author> | |
111 | <firstname>Roger</firstname> | |
112 | <surname>Gammans</surname> | |
113 | <affiliation> | |
114 | <address> | |
115 | <email>[email protected]</email> | |
116 | </address> | |
117 | </affiliation> | |
118 | </author> | |
119 | </authorgroup> | |
120 | ||
121 | <authorgroup> | |
122 | <author> | |
123 | <firstname>Stephen</firstname> | |
124 | <surname>Tweedie</surname> | |
125 | <affiliation> | |
126 | <address> | |
127 | <email>[email protected]</email> | |
128 | </address> | |
129 | </affiliation> | |
130 | </author> | |
131 | </authorgroup> | |
132 | ||
133 | <copyright> | |
134 | <year>2002</year> | |
135 | <holder>Roger Gammans</holder> | |
136 | </copyright> | |
137 | </chapterinfo> | |
138 | ||
139 | <title>The Linux Journalling API</title> | |
140 | ||
5c3b4474 | 141 | <sect1 id="journaling_overview"> |
733b72c3 | 142 | <title>Overview</title> |
5c3b4474 | 143 | <sect2 id="journaling_details"> |
733b72c3 RD |
144 | <title>Details</title> |
145 | <para> | |
146 | The journalling layer is easy to use. You need to | |
147 | first of all create a journal_t data structure. There are | |
148 | two calls to do this dependent on how you decide to allocate the physical | |
82ff50b2 JK |
149 | media on which the journal resides. The jbd2_journal_init_inode() call |
150 | is for journals stored in filesystem inodes, or the jbd2_journal_init_dev() | |
151 | call can be used for a journal stored on a raw device (in a contiguous range | |
733b72c3 | 152 | of blocks). Both calls return a pointer to a journal_t, so when |
82ff50b2 | 153 | you are finally finished make sure you call jbd2_journal_destroy() on it |
733b72c3 RD |
154 | to free up any used kernel memory. |
155 | </para> | |
156 | ||
157 | <para> | |
158 | Once you have got your journal_t object you need to 'mount' or load the journal | |
82ff50b2 JK |
159 | file. The journalling layer expects that the space for the journal has already |
160 | been allocated and initialized properly by the userspace tools. When loading the | |
161 | journal you must call jbd2_journal_load() to process the journal contents. If the | |
162 | client file system detects that the journal contents do not need to be processed | |
163 | (or need not even be valid), it may call jbd2_journal_wipe() to | |
164 | clear the journal contents before calling jbd2_journal_load(). | |
733b72c3 RD |
165 | </para> |
166 | ||
167 | <para> | |
82ff50b2 JK |
168 | Note that jbd2_journal_wipe(..,0) calls jbd2_journal_skip_recovery() for you if |
169 | it detects any outstanding transactions in the journal and similarly | |
170 | jbd2_journal_load() will call jbd2_journal_recover() if necessary. I would | |
171 | advise reading ext4_load_journal() in fs/ext4/super.c for an example of this | |
172 | stage. | |
733b72c3 RD |
173 | </para> |
174 | ||
175 | <para> | |
176 | Now you can go ahead and start modifying the underlying | |
177 | filesystem. Almost. | |
178 | </para> | |
179 | ||
180 | <para> | |
181 | ||
182 | You still need to actually journal your filesystem changes; this | |
183 | is done by wrapping them into transactions. You | |
184 | also need to wrap the modification of each buffer | |
185 | with calls to the journal layer, so that it knows which modifications | |
82ff50b2 | 186 | you are actually making. To start a transaction use jbd2_journal_start(), which |
733b72c3 RD |
187 | returns a transaction handle. |
188 | </para> | |
189 | ||
190 | <para> | |
82ff50b2 JK |
191 | jbd2_journal_start() |
192 | and its counterpart jbd2_journal_stop(), which indicates the end of a | |
193 | transaction, are nestable calls, so you can re-enter a transaction if necessary, | |
194 | but remember you must call jbd2_journal_stop() the same number of times as | |
195 | jbd2_journal_start() before the transaction is completed (or more accurately | |
196 | leaves the update phase). Ext4/VFS makes use of this feature to simplify | |
197 | handling of inode dirtying, quota support, etc. | |
733b72c3 RD |
198 | </para> |
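<para>
The nesting rule above can be illustrated with a small user-space model. This
is only a sketch and not kernel code: the model_* names, the structures and the
counter are hypothetical and exist purely for illustration. It assumes, as the
text describes, that a nested start hands back the handle the task already
holds, and that only the outermost stop lets the task leave the update phase.
</para>

<programlisting><![CDATA[
```c
/* User-space model only: NOT kernel code and NOT the jbd2 implementation. */
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction handle. */
struct model_handle {
	int refs;			/* nesting depth of start/stop pairs */
};

/* Hypothetical stand-in for the per-task state. */
struct model_task {
	struct model_handle *current_handle;	/* handle the task holds, if any */
	struct model_handle storage;		/* backing storage for that handle */
	int updates_completed;			/* times the task left the update phase */
};

/* Nested starts return the existing handle and bump its count. */
static struct model_handle *model_start(struct model_task *t)
{
	if (t->current_handle) {
		t->current_handle->refs++;
		return t->current_handle;
	}
	t->storage.refs = 1;
	t->current_handle = &t->storage;
	return t->current_handle;
}

/* Only the outermost stop completes the task's part of the transaction. */
static void model_stop(struct model_task *t, struct model_handle *h)
{
	assert(h == t->current_handle && h->refs > 0);
	if (--h->refs > 0)
		return;		/* still inside an outer start/stop pair */
	t->current_handle = NULL;
	t->updates_completed++;
}

/* Returns 1 when nothing completed after the inner stop and exactly one
 * completion happened after the outer stop. */
static int model_demo(void)
{
	struct model_task task = { NULL, { 0 }, 0 };
	struct model_handle *outer = model_start(&task);
	struct model_handle *inner = model_start(&task);
	int after_inner;

	assert(inner == outer);		/* re-entered the same transaction */
	model_stop(&task, inner);
	after_inner = task.updates_completed;
	model_stop(&task, outer);
	return after_inner == 0 && task.updates_completed == 1;
}
```
]]></programlisting>

<para>
In this model an unbalanced stop trips the assertion, which mirrors the
requirement that stop be called exactly as many times as start.
</para>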
199 | ||
200 | <para> | |
201 | Inside each transaction you need to wrap the modifications to the | |
202 | individual buffers (blocks). Before you start to modify a buffer you | |
82ff50b2 | 203 | need to call jbd2_journal_get_{create,write,undo}_access() as appropriate; |
733b72c3 RD |
204 | this allows the journalling layer to copy the unmodified data if it |
205 | needs to. After all, the buffer may be part of a previous, not yet committed | |
206 | transaction. | |
207 | At this point you are at last ready to modify a buffer, and once | |
82ff50b2 | 208 | you have done so you need to call jbd2_journal_dirty_{meta,}data(). |
733b72c3 | 209 | Or, if you've asked for access to a buffer that you now know is no longer |
82ff50b2 | 210 | required to be pushed back on the device, you can call jbd2_journal_forget() |
733b72c3 RD |
211 | in much the same way as you might have used bforget() in the past. |
212 | </para> | |
213 | ||
214 | <para> | |
82ff50b2 | 215 | A jbd2_journal_flush() may be called at any time to commit and checkpoint |
733b72c3 RD |
216 | all your transactions. |
217 | </para> | |
218 | ||
219 | <para> | |
82ff50b2 | 220 | Then at umount time, in your put_super(), you can call jbd2_journal_destroy() |
34e5053f | 221 | to clean up your in-core journal object. |
733b72c3 RD |
222 | </para> |
223 | ||
224 | <para> | |
225 | Unfortunately there are a couple of ways the journal layer can cause a deadlock. | |
226 | The first thing to note is that each task can only have | |
227 | a single outstanding transaction at any one time; remember that nothing | |
82ff50b2 | 228 | commits until the outermost jbd2_journal_stop(). This means |
733b72c3 RD |
229 | you must complete the transaction at the end of each file/inode/address |
230 | etc. operation you perform, so that the journalling system isn't re-entered | |
231 | on another journal. This matters because transactions can't be nested/batched | |
232 | across differing journals, and a filesystem other than | |
82ff50b2 | 233 | yours (say ext4) may be modified in a later syscall. |
733b72c3 RD |
234 | </para> |
235 | ||
236 | <para> | |
82ff50b2 | 237 | The second case to bear in mind is that jbd2_journal_start() can |
733b72c3 RD |
238 | block if there isn't enough space in the journal for your transaction |
239 | (based on the passed nblocks param) - when it blocks it merely(!) needs to | |
240 | wait for transactions to complete and be committed from other tasks, | |
82ff50b2 JK |
241 | so essentially we are waiting for jbd2_journal_stop(). So to avoid |
242 | deadlocks you must treat jbd2_journal_start/stop() as if they | |
733b72c3 | 243 | were semaphores and include them in your semaphore ordering rules to prevent |
82ff50b2 JK |
244 | deadlocks. Note that jbd2_journal_extend() has similar blocking behaviour to |
245 | jbd2_journal_start() so you can deadlock here just as easily as on | |
246 | jbd2_journal_start(). | |
733b72c3 RD |
247 | </para> |
248 | ||
249 | <para> | |
250 | Try to reserve the right number of blocks the first time. ;-). This will | |
251 | be the maximum number of blocks you are going to touch in this transaction. | |
82ff50b2 JK |
252 | I advise having a look at ext4_jbd2.h, at least, to see the basis on which |
253 | ext4 makes these decisions. | |
733b72c3 RD |
254 | </para> |
255 | ||
256 | <para> | |
257 | Another wrinkle to watch out for is your on-disk block allocation strategy. | |
82ff50b2 JK |
258 | Why? Because, if you do a delete, you need to ensure you haven't reused any |
259 | of the freed blocks until the transaction freeing these blocks commits. If you | |
260 | reuse these blocks and a crash happens, there is no way to restore the contents | |
261 | the reallocated blocks had at the end of the last fully committed transaction. | |
262 | ||
263 | One simple way of doing this is to mark blocks as free in internal in-memory | |
264 | block allocation structures only after the transaction freeing them commits. | |
265 | Ext4 uses a journal commit callback for this purpose. | |
266 | </para> | |
267 | ||
268 | <para> | |
269 | With journal commit callbacks you can ask the journalling layer to call a | |
270 | callback function when the transaction is finally committed to disk, so that | |
271 | you can do some of your own management. You ask the journalling layer to | |
272 | call the callback by simply setting the journal->j_commit_callback function | |
273 | pointer; that function is then called after each transaction commit. You can also | |
274 | use transaction->t_private_list for attaching entries to a transaction that | |
275 | need processing when the transaction commits. | |
733b72c3 RD |
276 | </para> |
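<para>
The "free only after commit" idea can be sketched with a small user-space
model. This is not kernel code and not how ext4 implements it: every model_*
name, the fixed-size arrays and the way the callback is invoked are hypothetical
and chosen only for illustration. A freed block is queued on the running
transaction and becomes allocatable only when the commit callback runs.
</para>

<programlisting><![CDATA[
```c
/* User-space model only: NOT kernel code and NOT the ext4/jbd2 implementation. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NBLOCKS		8
#define MODEL_MAX_PENDING	8

/* Hypothetical per-transaction list of blocks freed by that transaction. */
struct model_txn {
	size_t pending[MODEL_MAX_PENDING];
	size_t npending;
};

struct model_fs {
	bool in_use[MODEL_NBLOCKS];	/* in-memory allocation map */
	struct model_txn running;	/* the currently running transaction */
};

/* Queue the free on the transaction; the block stays marked in use. */
static bool model_free_block(struct model_fs *fs, size_t blk)
{
	struct model_txn *t = &fs->running;

	if (blk >= MODEL_NBLOCKS || !fs->in_use[blk] ||
	    t->npending == MODEL_MAX_PENDING)
		return false;
	t->pending[t->npending++] = blk;
	return true;
}

/* Hand out the first block whose in-memory flag is clear, or -1 if none. */
static int model_alloc_block(struct model_fs *fs)
{
	for (size_t i = 0; i < MODEL_NBLOCKS; i++) {
		if (!fs->in_use[i]) {
			fs->in_use[i] = true;
			return (int)i;
		}
	}
	return -1;
}

/* Plays the role of the commit callback: only now do the blocks become free. */
static void model_commit_callback(struct model_fs *fs)
{
	struct model_txn *t = &fs->running;

	for (size_t i = 0; i < t->npending; i++)
		fs->in_use[t->pending[i]] = false;
	t->npending = 0;
}

/* Returns 1 if block 3 is unavailable before commit and reusable after it. */
static int model_deferred_free_demo(void)
{
	struct model_fs fs = { 0 };
	int before, after;

	for (size_t i = 0; i < MODEL_NBLOCKS; i++)
		fs.in_use[i] = true;		/* start with a full filesystem */

	model_free_block(&fs, 3);
	before = model_alloc_block(&fs);	/* free not yet committed */
	model_commit_callback(&fs);
	after = model_alloc_block(&fs);		/* block 3 is now reusable */
	return before == -1 && after == 3;
}
```
]]></programlisting>

<para>
The point of the design is that the allocator never sees a block as free while
the transaction that freed it could still be lost in a crash.
</para>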
277 | ||
278 | <para> | |
82ff50b2 JK |
279 | JBD2 also provides a way to block all transaction updates via |
280 | jbd2_journal_{un,}lock_updates(). Ext4 uses this when it wants a window with a | |
281 | clean and stable fs for a moment. E.g. | |
733b72c3 RD |
282 | </para> |
283 | ||
284 | <programlisting> | |
285 | ||
82ff50b2 JK |
286 | jbd2_journal_lock_updates() //stop new stuff happening.. |
287 | jbd2_journal_flush() // checkpoint everything. | |
733b72c3 | 288 | ..do stuff on stable fs |
82ff50b2 | 289 | jbd2_journal_unlock_updates() // carry on with filesystem use. |
733b72c3 RD |
290 | </programlisting> |
291 | ||
292 | <para> | |
293 | The opportunities for abuse and DoS attacks with this should be obvious, | |
294 | if you allow unprivileged userspace to trigger codepaths containing these | |
295 | calls. | |
733b72c3 RD |
296 | </para> |
297 | ||
298 | </sect2> | |
299 | ||
5c3b4474 | 300 | <sect2 id="jbd_summary"> |
733b72c3 RD |
301 | <title>Summary</title> |
302 | <para> | |
303 | Using the journal is a matter of wrapping the different context changes | |
304 | (each mount, each modification (transaction) and each changed buffer) | |
305 | so as to tell the journalling layer about them. | |
306 | </para> | |
307 | ||
733b72c3 RD |
308 | </sect2> |
309 | ||
310 | </sect1> | |
311 | ||
5c3b4474 | 312 | <sect1 id="data_types"> |
733b72c3 RD |
313 | <title>Data Types</title> |
314 | <para> | |
315 | The journalling layer uses typedefs to 'hide' the concrete definitions | |
82ff50b2 | 316 | of the structures used. As a client of the JBD2 layer you can |
733b72c3 RD |
317 | just rely on using the pointer as a magic cookie of some sort. |
318 | ||
319 | Obviously the hiding is not enforced as this is 'C'. | |
320 | </para> | |
5c3b4474 | 321 | <sect2 id="structures"><title>Structures</title> |
82ff50b2 | 322 | !Iinclude/linux/jbd2.h |
733b72c3 RD |
323 | </sect2> |
324 | </sect1> | |
325 | ||
5c3b4474 | 326 | <sect1 id="functions"> |
733b72c3 RD |
327 | <title>Functions</title> |
328 | <para> | |
329 | The functions here are split into two groups: those that | |
330 | affect a journal as a whole, and those which are used to | |
331 | manage transactions. | |
332 | </para> | |
5c3b4474 | 333 | <sect2 id="journal_level"><title>Journal Level</title> |
82ff50b2 JK |
334 | !Efs/jbd2/journal.c |
335 | !Ifs/jbd2/recovery.c | |
733b72c3 | 336 | </sect2> |
5c3b4474 | 337 | <sect2 id="transaction_level"><title>Transaction Level</title> |
82ff50b2 | 338 | !Efs/jbd2/transaction.c |
733b72c3 RD |
339 | </sect2> |
340 | </sect1> | |
5c3b4474 | 341 | <sect1 id="see_also"> |
733b72c3 RD |
342 | <title>See also</title> |
343 | <para> | |
344 | <citation> | |
96824f4b | 345 | <ulink url="http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz"> |
733b72c3 RD |
346 | Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie |
347 | </ulink> | |
348 | </citation> | |
349 | </para> | |
350 | <para> | |
351 | <citation> | |
352 | <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html"> | |
353 | Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie | |
354 | </ulink> | |
355 | </citation> | |
356 | </para> | |
357 | </sect1> | |
358 | ||
359 | </chapter> | |
360 | ||
073b86da RD |
361 | <chapter id="splice"> |
362 | <title>splice API</title> | |
363 | <para> | |
364 | splice is a method for moving blocks of data around inside the | |
365 | kernel, without continually transferring them between the kernel | |
366 | and user space. | |
367 | </para> | |
368 | !Ffs/splice.c | |
369 | </chapter> | |
370 | ||
371 | <chapter id="pipes"> | |
372 | <title>pipes API</title> | |
373 | <para> | |
374 | Pipe interfaces are all for in-kernel (builtin image) use. | |
375 | They are not exported for use by modules. | |
376 | </para> | |
377 | !Iinclude/linux/pipe_fs_i.h | |
378 | !Ffs/pipe.c | |
379 | </chapter> | |
380 | ||
bbb5bbb0 | 381 | </book> |