=======
Locking
=======

The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases in the end of this file.
Don't turn it into log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).

Thing currently missing here: socket operations. Alexey?

dentry_operations
=================

prototypes::

    int (*d_revalidate)(struct dentry *, unsigned int);
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *);
    int (*d_compare)(const struct dentry *,
            unsigned int, const char *, const struct qstr *);
    int (*d_delete)(struct dentry *);
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
    char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
    struct vfsmount *(*d_automount)(struct path *path);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, const struct inode *);

locking rules:

================== =========== ======== ============== ========
ops                rename_lock ->d_lock may block      rcu-walk
================== =========== ======== ============== ========
d_revalidate:      no          no       yes (ref-walk) maybe
d_weak_revalidate: no          no       yes            no
d_hash             no          no       no             maybe
d_compare:         yes         no       no             maybe
d_delete:          no          yes      no             no
d_init:            no          no       yes            no
d_release:         no          no       yes            no
d_prune:           no          yes      no             no
d_iput:            no          no       yes            no
d_dname:           no          no       no             no
d_automount:       no          no       yes            no
d_manage:          no          no       yes (ref-walk) maybe
d_real             no          no       yes            no
================== =========== ======== ============== ========

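The "yes (ref-walk)" entries mean that ->d_revalidate() and ->d_manage() may
block only when called in ref-walk mode; in rcu-walk mode they must not sleep.
The split can be sketched in plain userspace C. This is not kernel code:
``struct model_dentry``, ``MODEL_LOOKUP_RCU`` and ``model_d_revalidate()`` are
hypothetical stand-ins, and returning -ECHILD to request a retry in ref-walk
mode is an assumption borrowed from the general path-lookup convention, not
something this file states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lookup flag that marks rcu-walk mode. */
#define MODEL_LOOKUP_RCU 0x1u

/* Hypothetical dentry model holding only what this sketch needs. */
struct model_dentry {
	bool still_valid;      /* result the (possibly slow) check would give */
	bool check_must_sleep; /* true if validating requires blocking */
};

/*
 * Model of a ->d_revalidate() instance.
 * Returns 1 if the dentry is valid, 0 if it is stale, and -ECHILD if
 * the answer cannot be produced without blocking while in rcu-walk.
 */
int model_d_revalidate(const struct model_dentry *d, unsigned int flags)
{
	if (d->check_must_sleep && (flags & MODEL_LOOKUP_RCU))
		return -ECHILD; /* rcu-walk: must not block, ask for ref-walk */

	/* ref-walk (or a non-blocking check): safe to do the real work. */
	return d->still_valid ? 1 : 0;
}
```

The point of the sketch is only the ordering: the mode check comes before any
work that could sleep.
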
inode_operations
================

prototypes::

    int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
    int (*link) (struct dentry *,struct inode *,struct dentry *);
    int (*unlink) (struct inode *,struct dentry *);
    int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
    int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
    int (*rmdir) (struct inode *,struct dentry *);
    int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
    int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
            struct inode *, struct dentry *, unsigned int);
    int (*readlink) (struct dentry *, char __user *,int);
    const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
    void (*truncate) (struct inode *);
    int (*permission) (struct mnt_idmap *, struct inode *, int, unsigned int);
    struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
    int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
    int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
    void (*update_time)(struct inode *, struct timespec *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
            struct file *, unsigned open_flag,
            umode_t create_mode);
    int (*tmpfile) (struct mnt_idmap *, struct inode *,
            struct file *, umode_t);
    int (*fileattr_set)(struct mnt_idmap *idmap,
            struct dentry *dentry, struct fileattr *fa);
    int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
    struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
    struct offset_ctx *(*get_offset_ctx)(struct inode *inode);

locking rules:
    all may block

============== ==================================================
ops            i_rwsem(inode)
============== ==================================================
lookup:        shared
create:        exclusive
link:          exclusive (both)
mknod:         exclusive
symlink:       exclusive
mkdir:         exclusive
unlink:        exclusive (both)
rmdir:         exclusive (both)(see below)
rename:        exclusive (all) (see below)
readlink:      no
get_link:      no
setattr:       exclusive
permission:    no (may not block if called in rcu-walk mode)
get_inode_acl: no
get_acl:       no
getattr:       no
listxattr:     no
fiemap:        no
update_time:   no
atomic_open:   shared (exclusive if O_CREAT is set in open flags)
tmpfile:       no
fileattr_get:  no or exclusive
fileattr_set:  exclusive
get_offset_ctx no
============== ==================================================


    Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
    exclusive on victim.
    cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.

See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.

xattr_handler operations
========================

prototypes::

    bool (*list)(struct dentry *dentry);
    int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
            struct inode *inode, const char *name, void *buffer,
            size_t size);
    int (*set)(const struct xattr_handler *handler,
            struct mnt_idmap *idmap,
            struct dentry *dentry, struct inode *inode, const char *name,
            const void *buffer, size_t size, int flags);

locking rules:
    all may block

===== ==============
ops   i_rwsem(inode)
===== ==============
list: no
get:  no
set:  exclusive
===== ==============

super_operations
================

prototypes::

    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*free_inode)(struct inode *);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode) (struct inode *, int flags);
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    int (*drop_inode) (struct inode *);
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_fs) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*umount_begin) (struct super_block *);
    int (*show_options)(struct seq_file *, struct dentry *);
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);

locking rules:
    All may block [not true, see below]

====================== ============ ========================
ops                    s_umount     note
====================== ============ ========================
alloc_inode:
free_inode:                         called from RCU callback
destroy_inode:
dirty_inode:
write_inode:
drop_inode:                         !!!inode->i_lock!!!
evict_inode:
put_super:             write
sync_fs:               read
freeze_fs:             write
unfreeze_fs:           write
statfs:                maybe(read)  (see below)
remount_fs:            write
umount_begin:          no
show_options:          no           (namespace_sem)
quota_read:            no           (see below)
quota_write:           no           (see below)
====================== ============ ========================

->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
the superblock down when we only have dev_t given us by userland to
identify the superblock. Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - superblock is pinned down
by resolving the pathname passed to syscall.

->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.

file_system_type
================

prototypes::

    struct dentry *(*mount) (struct file_system_type *, int,
            const char *, void *);
    void (*kill_sb) (struct super_block *);

locking rules:

======= =========
ops     may block
======= =========
mount   yes
kill_sb yes
======= =========

->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return.

->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

address_space_operations
========================

prototypes::

    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*read_folio)(struct file *, struct folio *);
    int (*writepages)(struct address_space *, struct writeback_control *);
    bool (*dirty_folio)(struct address_space *, struct folio *folio);
    void (*readahead)(struct readahead_control *);
    int (*write_begin)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len,
            struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned copied,
            struct page *page, void *fsdata);
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidate_folio) (struct folio *, size_t start, size_t len);
    bool (*release_folio)(struct folio *, gfp_t);
    void (*free_folio)(struct folio *);
    int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
    int (*migrate_folio)(struct address_space *, struct folio *dst,
            struct folio *src, enum migrate_mode);
    int (*launder_folio)(struct folio *);
    bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
    int (*error_remove_page)(struct address_space *, struct page *);
    int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
    int (*swap_deactivate)(struct file *);
    int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

locking rules:
    All except dirty_folio and free_folio may block

====================== ======================== ========= ===============
ops                    folio locked             i_rwsem   invalidate_lock
====================== ======================== ========= ===============
writepage:             yes, unlocks (see below)
read_folio:            yes, unlocks                       shared
writepages:
dirty_folio:           maybe
readahead:             yes, unlocks                       shared
write_begin:           locks the page           exclusive
write_end:             yes, unlocks             exclusive
bmap:
invalidate_folio:      yes                                exclusive
release_folio:         yes
free_folio:            yes
direct_IO:
migrate_folio:         yes (both)
launder_folio:         yes
is_partially_uptodate: yes
error_remove_page:     yes
swap_activate:         no
swap_deactivate:       no
swap_rw:               yes, unlocks
====================== ======================== ========= ===============

->write_begin(), ->write_end() and ->read_folio() may be called from
the request handler (/dev/loop).

->read_folio() unlocks the folio, either synchronously or via I/O
completion.

->readahead() unlocks the folios that I/O is attempted on like ->read_folio().

->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.

If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.

If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.

If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page, the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.

If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns the special WRITEPAGE_ACTIVATE
value. WRITEPAGE_ACTIVATE means that the page cannot really be written out
currently, and the VM should stop calling ->writepage() on this page for some
time. The VM does this by moving the page to the head of the active list, hence
the name.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.

That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().

Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.

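The ->writepage() rules above can be condensed into a small state model. The
following is a userspace sketch only, under the assumption that the page flags
can be represented as booleans; ``struct model_page``, ``enum model_sync_mode``,
``model_writepage()`` and ``model_end_io()`` are hypothetical names, and the
kernel helpers they stand in for are named in the comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the two writeback modes discussed above. */
enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

/* Hypothetical page model: flags only, no data. */
struct model_page {
	bool locked;       /* page lock held */
	bool dirty;        /* needs (another) writeout */
	bool writeback;    /* between set/end_page_writeback */
	bool io_in_flight; /* earlier I/O on the page has not finished */
};

/* Stand-in for the write I/O completion handler. */
void model_end_io(struct model_page *page)
{
	page->writeback = false; /* end_page_writeback() */
}

/* Model of ->writepage(); the caller passes in a locked page. */
int model_writepage(struct model_page *page, enum model_sync_mode mode)
{
	assert(page->locked);

	if (page->io_in_flight) {
		if (mode == MODEL_SYNC_NONE) {
			/* Memory cleansing: do not block on old I/O. */
			page->dirty = true;   /* redirty_page_for_writepage() */
			page->locked = false; /* unlock_page() */
			return 0;
		}
		/* Sync: wait for the old I/O (modelled as instantaneous). */
		page->io_in_flight = false;
	}

	page->writeback = true; /* set_page_writeback() */
	page->locked = false;   /* unlock before returning */
	/* Real code would submit I/O here; completion calls model_end_io(). */
	return 0;
}
```

Every exit path leaves the page either redirtied or under writeback, which is
the invariant the last paragraph above warns about.
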
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.

writepages should _only_ write pages which are present on
mapping->io_pages.

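The ``*nr_to_write`` accounting can be illustrated with a short userspace
sketch. It follows the wording above literally, treating nr_to_write as a
pointer that may be NULL; ``model_writepages()`` and the boolean dirty array
are hypothetical, and "starting I/O" is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of ->writepages() budget handling.  dirty[i] says whether page i
 * needs writeout.  nr_to_write may be NULL, meaning "write everything".
 * Returns the number of pages written.
 */
long model_writepages(bool *dirty, size_t npages, long *nr_to_write)
{
	long written = 0;

	for (size_t i = 0; i < npages; i++) {
		if (nr_to_write && *nr_to_write <= 0)
			break;            /* budget used up */
		if (!dirty[i])
			continue;         /* only dirty pages are written */
		dirty[i] = false;         /* "start I/O" on the page */
		written++;
		if (nr_to_write)
			(*nr_to_write)--; /* one decrement per page written */
	}
	return written;
}
```

This sketch stops exactly at the budget; as noted above, a real implementation
may write somewhat more or less.
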
->dirty_folio() is called from various places in the kernel when
the target folio is marked as needing writeback. The folio cannot be
truncated because either the caller holds the folio lock, or the caller
has found the folio while holding the page table lock which will block
truncation.

->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.

->invalidate_folio() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated.
The filesystem must exclusively acquire
invalidate_lock before invalidating page cache in truncate / hole punch
path (and thus calling into ->invalidate_folio) to block races between page
cache invalidation and page cache filling functions (fault, read, ...).

->release_folio() is called when the MM wants to make a change to the
folio that would invalidate the filesystem's private data. For example,
it may be about to be removed from the address_space or split. The folio
is locked and not under writeback. It may be dirty. The gfp parameter
is not usually used for allocation, but rather to indicate what the
filesystem may do to attempt to free the private data. The filesystem may
return false to indicate that the folio's private data cannot be freed.
If it returns true, it should have already removed the private data from
the folio. If a filesystem does not provide a ->release_folio method,
the pagecache will assume that private data is buffer_heads and call
try_to_free_buffers().

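The ->release_folio() contract (return false and leave the private data alone,
or detach it and return true) can be sketched in userspace as follows.
Everything here is hypothetical: ``struct model_folio``, the ``private_busy``
flag, and the ``may_wait`` argument, which merely stands in for whatever the
gfp parameter would permit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical folio model. */
struct model_folio {
	bool locked;
	bool writeback;
	bool private_busy;  /* fs still needs its private data */
	void *private_data;
};

/*
 * Model of ->release_folio().  may_wait stands in for what the gfp
 * argument permits; this sketch only uses it to decide whether a busy
 * folio may be waited for (modelled as instantaneous).
 */
bool model_release_folio(struct model_folio *folio, bool may_wait)
{
	assert(folio->locked && !folio->writeback); /* caller guarantees */

	if (folio->private_busy) {
		if (!may_wait)
			return false;  /* cannot free the private data now */
		folio->private_busy = false;
	}
	folio->private_data = NULL;    /* detach before returning true */
	return true;
}
```
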
->free_folio() is called when the kernel has dropped the folio
from the page cache.

->launder_folio() may be called prior to releasing a folio if
it is still found to be dirty. It returns zero if the folio was successfully
cleaned, or an error value if not. Note that in order to prevent the folio
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

->swap_activate() will be called to prepare the given file for swap. It
should perform any validation and preparation necessary to ensure that
writes can be performed with minimal memory allocation. It should call
add_swap_extent(), or the helper iomap_swapfile_activate(), and return
the number of extents added. If IO should be submitted through
->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
directly to the block device ``sis->bdev``.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().

file_lock_operations
====================

prototypes::

    void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
    void (*fl_release_private)(struct file_lock *);


locking rules:

=================== ============= =========
ops                 inode->i_lock may block
=================== ============= =========
fl_copy_lock:       yes           no
fl_release_private: maybe         maybe[1]_
=================== ============= =========

.. [1]:
   ->fl_release_private for flock or POSIX locks is currently allowed
   to block. Leases however can still be freed while the i_lock is held and
   so fl_release_private called on a lease should not block.

lock_manager_operations
=======================

prototypes::

    void (*lm_notify)(struct file_lock *);  /* unblock callback */
    int (*lm_grant)(struct file_lock *, struct file_lock *, int);
    void (*lm_break)(struct file_lock *); /* break_lease callback */
    int (*lm_change)(struct file_lock **, int);
    bool (*lm_breaker_owns_lease)(struct file_lock *);
    bool (*lm_lock_expirable)(struct file_lock *);
    void (*lm_expire_lock)(void);

locking rules:

====================== ============= ================= =========
ops                    flc_lock      blocked_lock_lock may block
====================== ============= ================= =========
lm_notify:             no            yes               no
lm_grant:              no            no                no
lm_break:              yes           no                no
lm_change              yes           no                no
lm_breaker_owns_lease: yes           no                no
lm_lock_expirable      yes           no                no
lm_expire_lock         no            no                yes
====================== ============= ================= =========

buffer_head
===========

prototypes::

    void (*b_end_io)(struct buffer_head *bh, int uptodate);

locking rules:

Called from interrupts. In other words, extreme care is needed here.
bh is locked, but that is the only guarantee we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon the IO completion.

block_device_operations
=======================

prototypes::

    int (*open) (struct block_device *, fmode_t);
    int (*release) (struct gendisk *, fmode_t);
    int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
    int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
    int (*direct_access) (struct block_device *, sector_t, void **,
            unsigned long *);
    void (*unlock_native_capacity) (struct gendisk *);
    int (*getgeo)(struct block_device *, struct hd_geometry *);
    void (*swap_slot_free_notify) (struct block_device *, unsigned long);

locking rules:

======================= ===================
ops                     open_mutex
======================= ===================
open:                   yes
release:                yes
ioctl:                  no
compat_ioctl:           no
direct_access:          no
unlock_native_capacity: no
getgeo:                 no
swap_slot_free_notify:  no (see below)
======================= ===================

swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.


file_operations
===============

prototypes::

    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iopoll) (struct kiocb *kiocb, bool spin);
    int (*iterate) (struct file *, struct dir_context *);
    int (*iterate_shared) (struct file *, struct dir_context *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long,
            unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
            size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
            size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *, int, loff_t, loff_t);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
    unsigned (*mmap_capabilities)(struct file *);
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
            loff_t, size_t, unsigned int);
    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
            struct file *file_out, loff_t pos_out,
            loff_t len, unsigned int remap_flags);
    int (*fadvise)(struct file *, loff_t, loff_t, int);

locking rules:
    All may block.

->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something the userspace has to take care of.

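As an illustration of the "just use i_size_read()" option, the sketch below
takes one snapshot of the file size and computes the new position from it, so
that a concurrent size change cannot be observed halfway through. It is a
userspace model under stated assumptions: ``struct model_file`` and
``model_llseek()`` are hypothetical, the size snapshot merely stands in for
i_size_read(), and overflow checks are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h> /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical file model: current position and file size. */
struct model_file {
	long long pos;
	long long size;
};

/* Model of an ->llseek() instance; returns the new position or -EINVAL. */
long long model_llseek(struct model_file *f, long long offset, int whence)
{
	long long size = f->size; /* single snapshot, like i_size_read() */
	long long newpos;

	switch (whence) {
	case SEEK_SET:
		newpos = offset;
		break;
	case SEEK_CUR:
		newpos = f->pos + offset;
		break;
	case SEEK_END:
		newpos = size + offset;
		break;
	default:
		return -EINVAL;
	}
	if (newpos < 0)
		return -EINVAL;

	f->pos = newpos; /* not protected against concurrent seeks, as noted */
	return newpos;
}
```
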
->iterate_shared() is called with i_rwsem held for reading, and with the
file f_pos_lock held exclusively.

->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
not normally something one needs to worry about. Return values > 0 will be
mapped to zero in the VFS layer.

->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...

->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.

->setlease operations should call generic_setlease() before or after setting
the lease within the individual filesystem to record the result of the
operation.

->fallocate implementation must be really careful to maintain page cache
consistency when punching holes or performing other operations that invalidate
page cache contents. Usually the filesystem needs to call
truncate_inode_pages_range() to invalidate relevant range of the page cache.
However the filesystem usually also needs to update its internal (and on disk)
view of file offset -> disk block mapping. Until this update is finished, the
filesystem needs to block page faults and reads from reloading now-stale page
cache contents from the disk. Since VFS acquires mapping->invalidate_lock in
shared mode when loading pages from disk (filemap_fault(), filemap_read(),
readahead paths), the fallocate implementation must take the invalidate_lock to
prevent reloading.

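The ordering that matters for hole punching (take invalidate_lock exclusively,
drop the cached pages, update the block mapping, release the lock) can be
modelled in a single-threaded userspace sketch. All names are hypothetical:
``struct model_rwsem`` is a toy reader/writer lock whose trylock failures stand
in for "would sleep", and ``struct model_mapping`` reduces a file to one page
backed by one block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a reader/writer semaphore. */
struct model_rwsem {
	int readers;
	bool writer;
};

bool model_down_read_trylock(struct model_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

void model_up_read(struct model_rwsem *sem) { sem->readers--; }

bool model_down_write_trylock(struct model_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

void model_up_write(struct model_rwsem *sem) { sem->writer = false; }

/* Hypothetical mapping: one page backed by one disk block. */
struct model_mapping {
	struct model_rwsem invalidate_lock;
	bool block_mapped; /* fs view: offset -> block mapping exists */
	bool page_cached;  /* page cache holds the block's contents */
};

/* Fault/read path: fill the page cache under the shared lock. */
bool model_fill_page(struct model_mapping *m)
{
	if (!model_down_read_trylock(&m->invalidate_lock))
		return false; /* real code would sleep here */
	if (m->block_mapped)
		m->page_cached = true;
	model_up_read(&m->invalidate_lock);
	return true;
}

/* Hole punch, step 1: take the lock exclusively, drop cached pages. */
bool model_punch_begin(struct model_mapping *m)
{
	if (!model_down_write_trylock(&m->invalidate_lock))
		return false;
	m->page_cached = false; /* truncate_inode_pages_range() */
	return true;
}

/* Hole punch, step 2: update the block mapping, then release. */
void model_punch_end(struct model_mapping *m)
{
	m->block_mapped = false;
	model_up_write(&m->invalidate_lock);
}
```

Between the two punch steps the fill path cannot make progress, so it cannot
reload the block that is about to be unmapped.
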
->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running. For
blocking changes through write(2) and similar operations inode->i_rwsem can be
used. To block changes to file contents via a memory mapping during the
operation, the filesystem must take mapping->invalidate_lock to coordinate
with ->page_mkwrite.

dquot_operations
================

prototypes::

    int (*write_dquot) (struct dquot *);
    int (*acquire_dquot) (struct dquot *);
    int (*release_dquot) (struct dquot *);
    int (*mark_dirty) (struct dquot *);
    int (*write_info) (struct super_block *, int);

These operations are intended to be more or less wrapping functions that ensure
proper locking with respect to the filesystem and call the generic quota
operations.

What a filesystem should expect from the generic quota functions:

============== ============ =========================
ops            FS recursion Held locks when called
============== ============ =========================
write_dquot:   yes          dqonoff_sem or dqptr_sem
acquire_dquot: yes          dqonoff_sem or dqptr_sem
release_dquot: yes          dqonoff_sem or dqptr_sem
mark_dirty:    no           -
write_info:    yes          dqonoff_sem
============== ============ =========================

FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.

More details about quota locking can be found in fs/quota/dquot.c.

vm_operations_struct
====================

prototypes::

    void (*open)(struct vm_area_struct*);
    void (*close)(struct vm_area_struct*);
    vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
    vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
    vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
    int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

locking rules:

============= ========= ===========================
ops           mmap_lock PageLocked(page)
============= ========= ===========================
open:         yes
close:        yes
fault:        yes       can return with page locked
map_pages:    read
page_mkwrite: yes       can return with page locked
pfn_mkwrite:  yes
access:       yes
============= ========= ===========================

->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
"pgoff" in the vm_fault structure. If it is possible that the page may be
truncated and/or invalidated, then the filesystem must lock invalidate_lock,
then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"start_pgoff" till "end_pgoff". ->map_pages() is called with the RCU lock held
and must not block. If it's not possible to reach a page without blocking,
the filesystem should skip it. The filesystem should use do_set_pte() to set up
the page table entry. A pointer to the entry associated with the page is passed
in the "pte" field of the vm_fault structure. Pointers to entries for other
offsets should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is about to become
writeable. The filesystem again must ensure that there are no
truncate/invalidate races or races with operations such as ->remap_file_range
or ->copy_file_range, and then return with the page locked. Usually
mapping->invalidate_lock is suitable for proper serialization. If the page has
been truncated, the filesystem should not look up a new page like the ->fault()
handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
retry the fault.

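The two outcomes of ->page_mkwrite() described above can be sketched as a tiny
userspace model. ``struct model_page``, ``enum model_fault`` and
``model_page_mkwrite()`` are hypothetical names; serialization against
truncate (for example via mapping->invalidate_lock) is assumed to be handled
by the caller and is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirrors of the two outcomes described above. */
enum model_fault { MODEL_FAULT_NOPAGE, MODEL_FAULT_LOCKED };

struct model_page {
	bool locked;
	bool in_mapping; /* false once the page has been truncated */
};

/*
 * Model of ->page_mkwrite().  Serialization against truncate is
 * assumed to be held by the caller of this helper.
 */
enum model_fault model_page_mkwrite(struct model_page *page)
{
	page->locked = true;               /* lock_page() */
	if (!page->in_mapping) {
		page->locked = false;      /* unlock_page() */
		return MODEL_FAULT_NOPAGE; /* VM retries the fault */
	}
	return MODEL_FAULT_LOCKED;         /* return with the page locked */
}
```

Note that the truncated case does not look up a replacement page; it only
reports that the fault should be retried.
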
->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. The expected return value is
VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types. The default behavior
after this call is to make the pte read-write, unless pfn_mkwrite returns
an error.

->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs.

--------------------------------------------------------------------------------

    Dubious stuff

(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)