Git Repo - linux.git/log
5 months ago  btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
Naohiro Aota [Fri, 4 Oct 2024 04:53:35 +0000 (13:53 +0900)]
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()

This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that the
xarray DIRTY tag is cleared.

However, the refactoring commit 8189197425e7 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission") reversed the
order again, causing the same hang. Fix the ordering now in
submit_one_sector().

Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission")
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: zoned: fix missing RCU locking in error message when loading zone info
Filipe Manana [Wed, 2 Oct 2024 14:02:56 +0000 (15:02 +0100)]
btrfs: zoned: fix missing RCU locking in error message when loading zone info

At btrfs_load_zone_info() we have an error path that dereferences the
name of a device, which is an RCU string, but we are not holding an RCU
read lock, which is incorrect.

Fix this by using btrfs_err_in_rcu() instead of btrfs_err().

The problem is there since commit 08e11a3db098 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").

Fixes: 08e11a3db098 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Naohiro Aota <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: fix missing error handling when adding delayed ref with qgroups enabled
Filipe Manana [Tue, 24 Sep 2024 13:39:19 +0000 (14:39 +0100)]
btrfs: fix missing error handling when adding delayed ref with qgroups enabled

When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.

Effectively we end up ignoring that we are lacking a qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.

Fix this by checking for errors and returning them to the callers.
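
To illustrate the return convention described above, here is a minimal
user-space sketch (not the actual delayed-ref code; trace_extent_record()
and struct qrecord are made-up stand-ins) of how the three outcomes need
to be told apart:

  #include <stdio.h>
  #include <stdlib.h>

  struct qrecord { unsigned long long bytenr; };

  /* Stand-in for btrfs_qgroup_trace_extent_nolock(): per the text it returns
   * 1 when a record already exists, 0 when the given record was inserted
   * (and is now owned by the tracking structure), or a negative errno when
   * the insertion failed. */
  static int trace_extent_record(struct qrecord *rec)
  {
      return (rec->bytenr & 1) ? 1 : 0;    /* fake outcome for the demo */
  }

  /* The point of the fix: a negative value is an error to propagate, not the
   * same thing as "a record already existed". */
  static int add_record(struct qrecord *rec)
  {
      int ret = trace_extent_record(rec);

      if (ret < 0)
          return ret;        /* insertion failed: report it to the caller */
      if (ret > 0)
          free(rec);         /* already tracked: our copy is redundant */
      return 0;
  }

  int main(void)
  {
      struct qrecord *rec = malloc(sizeof(*rec));

      if (!rec)
          return 1;
      rec->bytenr = 4097;                  /* odd: takes the "exists" path */
      printf("add_record() = %d\n", add_record(rec));
      return 0;
  }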

Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: add cancellation points to trim loops
Luca Stefani [Tue, 17 Sep 2024 20:33:05 +0000 (22:33 +0200)]
btrfs: add cancellation points to trim loops

There are reports that the system cannot suspend due to a running trim,
because the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).

Since trim isn't a critical call it can be interrupted at any time, and
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
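
The idea of a cancellation point can be sketched in plain user-space C as
follows (a sketch only, not the kernel code; cancel_requested() stands in
for the pending-signal/freezer checks):

  #include <errno.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Stand-in for "a fatal signal is pending / the system wants to suspend". */
  static bool cancel_requested(uint64_t chunks_done)
  {
      return chunks_done >= 3;     /* pretend the request arrives mid-trim */
  }

  /* Trim loop with one cancellation point per iteration: on cancellation it
   * stops early, still reports the bytes discarded so far, and returns an
   * error. */
  static int trim_device(uint64_t chunks, uint64_t chunk_bytes, uint64_t *trimmed)
  {
      *trimmed = 0;
      for (uint64_t i = 0; i < chunks; i++) {
          if (cancel_requested(i))
              return -EINTR;       /* interrupted, partial progress reported */
          /* ... issue the discard for this chunk here ... */
          *trimmed += chunk_bytes;
      }
      return 0;
  }

  int main(void)
  {
      uint64_t trimmed;
      int ret = trim_device(8, 1ULL << 30, &trimmed);

      printf("ret=%d trimmed=%llu bytes\n", ret, (unsigned long long)trimmed);
      return 0;
  }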

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: [email protected] # 5.15+
Signed-off-by: Luca Stefani <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: split remaining space to discard in chunks
Luca Stefani [Tue, 17 Sep 2024 20:33:04 +0000 (22:33 +0200)]
btrfs: split remaining space to discard in chunks

Per Qu Wenruo: in case we have a very large disk, e.g. an 8TiB device that
is mostly empty, although we will do the split according to our super block
locations, the last super block ends at 256G, so we can submit a huge
discard for the range [256G, 8T), causing a large delay.

Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation for the introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().
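
The splitting itself is a simple bounded loop; a user-space sketch (the
chunk size constant here is a placeholder, not the real value of
BTRFS_MAX_DISCARD_CHUNK_SIZE, and the demo range is shortened):

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_DISCARD_CHUNK (4ULL << 30)   /* placeholder: 4 GiB per request */

  /* Split one large remaining range into bounded discard requests, so a
   * multi-terabyte tail is never submitted as a single huge discard. */
  static void discard_range(uint64_t start, uint64_t len)
  {
      while (len > 0) {
          uint64_t size = len < MAX_DISCARD_CHUNK ? len : MAX_DISCARD_CHUNK;

          printf("discard [%llu, %llu)\n",
                 (unsigned long long)start,
                 (unsigned long long)(start + size));
          start += size;
          len -= size;
      }
  }

  int main(void)
  {
      /* e.g. the tail after the last super block on a mostly empty device */
      discard_range(256ULL << 30, 10ULL << 30);
      return 0;
  }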

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: [email protected] # 5.15+
Signed-off-by: Luca Stefani <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: disable rate limiting when debug enabled
Leo Martins [Tue, 24 Sep 2024 23:42:29 +0000 (16:42 -0700)]
btrfs: disable rate limiting when debug enabled

Disable ratelimiting for btrfs_printk when CONFIG_BTRFS_DEBUG is
enabled. This allows for more verbose output which is often needed by
functions like btrfs_dump_space_info().

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Leo Martins <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: wait for fixup workers before stopping cleaner kthread during umount
Filipe Manana [Tue, 1 Oct 2024 10:06:52 +0000 (11:06 +0100)]
btrfs: wait for fixup workers before stopping cleaner kthread during umount

During unmount, at close_ctree(), we have the following steps in this order:

1) Park the cleaner kthread - this doesn't destroy the kthread, it basically
   halts its execution (wake-ups against it work but do nothing);

2) We stop the cleaner kthread - this results in freeing the respective
   struct task_struct;

3) We call btrfs_stop_all_workers() which waits for any jobs running in all
   the work queues and then frees the work queues.

Syzbot reported a case where a fixup worker resulted in a crash when doing
a delayed iput on its inode while attempting to wake up the cleaner at
btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread
was already freed. This can happen during unmount because we don't wait
for any fixup workers still running before we call kthread_stop() against
the cleaner kthread, which stops it and frees all its resources.

Fix this by waiting for any fixup workers at close_ctree() before we call
kthread_stop() against the cleaner and run pending delayed iputs.

The stack traces reported by syzbot were the following:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
  Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-fixup btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:377 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:488
   kasan_report+0x143/0x180 mm/kasan/report.c:601
   __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154
   btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842
   btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 61:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:230 [inline]
   slab_free_hook mm/slub.c:2343 [inline]
   slab_free mm/slub.c:4580 [inline]
   kmem_cache_free+0x1a2/0x420 mm/slub.c:4682
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:228
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
   __do_softirq kernel/softirq.c:588 [inline]
   invoke_softirq kernel/softirq.c:428 [inline]
   __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
   irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
   instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline]
   sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037
   asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5318 [inline]
   __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675
   schedule_idle+0x56/0x90 kernel/sched/core.c:6793
   do_idle+0x56a/0x5d0 kernel/sched/idle.c:354
   cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424
   start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314
   common_startup_64+0x13e/0x147

  The buggy address belongs to the object at ffff8880272a8000
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x120 mm/slub.c:2413
   allocate_slab+0x5a/0x2f0 mm/slub.c:2579
   new_slab mm/slub.c:2632 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819
   __slab_alloc+0x58/0xa0 mm/slub.c:3909
   __slab_alloc_node mm/slub.c:3962 [inline]
   slab_alloc_node mm/slub.c:4123 [inline]
   kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
  page last free pid 5230 tgid 5230 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638
   discard_slab mm/slub.c:2678 [inline]
   __put_partials+0xeb/0x130 mm/slub.c:3146
   put_cpu_partial+0x17c/0x250 mm/slub.c:3221
   __slab_free+0x2ea/0x3d0 mm/slub.c:4450
   qlink_free mm/kasan/quarantine.c:163 [inline]
   qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
   kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
   __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142
   getname_flags+0xb7/0x540 fs/namei.c:139
   do_sys_openat2+0xd2/0x1d0 fs/open.c:1409
   do_sys_open fs/open.c:1430 [inline]
   __do_sys_openat fs/open.c:1446 [inline]
   __se_sys_openat fs/open.c:1441 [inline]
   __x64_sys_openat+0x247/0x2a0 fs/open.c:1441
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Memory state around the buggy address:
   ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 4.19+
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: fix a NULL pointer dereference when failed to start a new transaction
Qu Wenruo [Fri, 27 Sep 2024 22:35:58 +0000 (08:05 +0930)]
btrfs: fix a NULL pointer dereference when failed to start a new transaction

[BUG]
Syzbot reported a NULL pointer dereference with the following crash:

  FAULT_INJECTION: forcing a failure.
   start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
   prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
   relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
  ...
  BTRFS info (device loop0): balance: ended with status: -12
  Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
  RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
  Call Trace:
   <TASK>
   commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
   btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
   del_balance_item fs/btrfs/volumes.c:3678 [inline]
   reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
   btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
   btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:907 [inline]
   __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which sets fs_info->reloc_ctl to NULL.

Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.

However, during the small window between set_reloc_control() and
unset_reloc_control(), we can have a subvolume tree update that creates a
reloc_root for that subvolume.

Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().

That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results in a NULL pointer
dereference.

[FIX]
Just add an extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.

That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during the merge stage, but since there is no reloc_ctl at all,
we do not need to bother with that.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 4.19+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: send: fix invalid clone operation for file that got its size decreased
Filipe Manana [Fri, 27 Sep 2024 09:50:12 +0000 (10:50 +0100)]
btrfs: send: fix invalid clone operation for file that got its size decreased

During an incremental send we may end up sending an invalid clone
operation, for the last extent of a file which ends at an unaligned offset
that matches the final i_size of the file in the send snapshot, in case
the file had its initial size (the size in the parent snapshot) decreased
in the send snapshot. In this case the destination will fail to apply the
clone operation because its end offset is not sector size aligned and it
ends before the current size of the file.

Sending the truncate operation always happens when we finish processing an
inode, after we process all its extents (and xattrs, names, etc). So fix
this by ensuring the file has a valid size before we send a clone
operation for an unaligned extent that ends at the final i_size of the
file. The size we truncate to matches the start offset of the clone range
but it could be any value between that start offset and the final size of
the file since the clone operation will expand the i_size if the current
size is smaller than the end offset. The start offset of the range was
chosen because it's always sector size aligned and avoids a truncation
into the middle of a page, which results in dirtying the page due to
filling part of it with zeroes and then making the clone operation at the
receiver trigger IO.

The following test reproduces the issue:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Create a file with a size of 256K + 5 bytes, having two extents, one
  # with a size of 128K and another one with a size of 128K + 5 bytes.
  last_ext_size=$((128 * 1024 + 5))
  xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
         -c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
         $MNT/foo

  # Another file which we will later clone foo into, but initially with
  # a larger size than foo.
  xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap1

  # Now resize bar and clone foo into it.
  xfs_io -c "truncate 0" \
         -c "reflink $MNT/foo" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap2

  rm -f /tmp/send-full /tmp/send-inc
  btrfs send -f /tmp/send-full $MNT/snap1
  btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -f /tmp/send-full $MNT
  btrfs receive -f /tmp/send-inc $MNT

  umount $MNT

Running it before this patch:

  $ ./test.sh
  (...)
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to bar: Invalid argument

A test case for fstests will be sent soon.

Reported-by: Ben Millwood <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
Fixes: 46a6e10a1ab1 ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
CC: [email protected] # 6.11
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
Filipe Manana [Wed, 25 Sep 2024 10:38:55 +0000 (11:38 +0100)]
btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class

While running checkpatch.pl against a patch that modifies the
btrfs_qgroup_extent event class, it complained about using a comma instead
of a semicolon:

  $ ./scripts/checkpatch.pl qgroups/0003-btrfs-qgroups-remove-bytenr-field-from-struct-btrfs_.patch
  WARNING: Possible comma where semicolon could be used
  #215: FILE: include/trace/events/btrfs.h:1720:
  + __entry->bytenr = bytenr,
__entry->num_bytes = rec->num_bytes;

  total: 0 errors, 1 warnings, 184 lines checked

So replace the comma with a semicolon to silence checkpatch and possibly
other tools. It also makes the code consistent with the rest.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: drop the backref cache during relocation if we commit
Josef Bacik [Tue, 24 Sep 2024 20:50:22 +0000 (16:50 -0400)]
btrfs: drop the backref cache during relocation if we commit

Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.

This works as long as we're only ever modifying blocks, not changing the
structure of the tree.

However relocation does in fact change the structure of the tree.  For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent.  We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.

But, a key feature of do_relocation() is the pending list.  This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.

The purpose of this is to make sure we don't break sharing unless we
absolutely have to.  Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path.  If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally.  To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.

Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.

The problem with this is that we will never reflect this in the backref
cache.  If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree.  This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.

In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.

Worst case you will double free some random block and re-use it when
there's still references to the block.

This is extremely subtle, and the consequences are quite bad.  There
isn't a way to make sure our backref cache is consistent between
transid's.

In order to fix this we need to simply evict the entire backref cache
anytime we cross transid's.  This reduces performance in that we have to
rebuild this backref cache every time we change transid's, but fixes the
bug.

This has existed since relocation was added, and is a pretty critical
bug.  There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.

Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.

We have a reproducer that reproduced the corruption within a few minutes
of running.  With this patch it survives several iterations/hours of
running the reproducer.

Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
CC: [email protected]
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: also add stripe entries for NOCOW writes
Johannes Thumshirn [Thu, 19 Sep 2024 10:16:38 +0000 (12:16 +0200)]
btrfs: also add stripe entries for NOCOW writes

NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind, and on a zoned filesystem we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also add stripe entries for NOCOW writes on RAID
stripe-tree filesystems.

Reviewed-by: Naohiro Aota <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: send: fix buffer overflow detection when copying path to cache entry
Filipe Manana [Thu, 19 Sep 2024 21:20:34 +0000 (22:20 +0100)]
btrfs: send: fix buffer overflow detection when copying path to cache entry

Starting with commit c0247d289e73 ("btrfs: send: annotate struct
name_cache_entry with __counted_by()") we annotated the variable length
array "name" from the name_cache_entry structure with __counted_by() to
improve overflow detection. However that alone was not correct, because
the length of that array does not match the "name_len" field - it matches
that plus 1 to include the NUL string terminator, so that makes a
fortified kernel think there's an overflow and report a splat like this:

  strcpy: detected buffer overflow: 20 byte write of buffer size 19
  WARNING: CPU: 3 PID: 3310 at __fortify_report+0x45/0x50
  CPU: 3 UID: 0 PID: 3310 Comm: btrfs Not tainted 6.11.0-prnet #1
  Hardware name: CompuLab Ltd.  sbc-ihsw/Intense-PC2 (IPC2), BIOS IPC2_3.330.7 X64 03/15/2018
  RIP: 0010:__fortify_report+0x45/0x50
  Code: 48 8b 34 (...)
  RSP: 0018:ffff97ebc0d6f650 EFLAGS: 00010246
  RAX: 7749924ef60fa600 RBX: ffff8bf5446a521a RCX: 0000000000000027
  RDX: 00000000ffffdfff RSI: ffff97ebc0d6f548 RDI: ffff8bf84e7a1cc8
  RBP: ffff8bf548574080 R08: ffffffffa8c40e10 R09: 0000000000005ffd
  R10: 0000000000000004 R11: ffffffffa8c70e10 R12: ffff8bf551eef400
  R13: 0000000000000000 R14: 0000000000000013 R15: 00000000000003a8
  FS:  00007fae144de8c0(0000) GS:ffff8bf84e780000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fae14691690 CR3: 00000001027a2003 CR4: 00000000001706f0
  Call Trace:
   <TASK>
   ? __warn+0x12a/0x1d0
   ? __fortify_report+0x45/0x50
   ? report_bug+0x154/0x1c0
   ? handle_bug+0x42/0x70
   ? exc_invalid_op+0x1a/0x50
   ? asm_exc_invalid_op+0x1a/0x20
   ? __fortify_report+0x45/0x50
   __fortify_panic+0x9/0x10
  __get_cur_name_and_parent+0x3bc/0x3c0
   get_cur_path+0x207/0x3b0
   send_extent_data+0x709/0x10d0
   ? find_parent_nodes+0x22df/0x25d0
   ? mas_nomem+0x13/0x90
   ? mtree_insert_range+0xa5/0x110
   ? btrfs_lru_cache_store+0x5f/0x1e0
   ? iterate_extent_inodes+0x52d/0x5a0
   process_extent+0xa96/0x11a0
   ? __pfx_lookup_backref_cache+0x10/0x10
   ? __pfx_store_backref_cache+0x10/0x10
   ? __pfx_iterate_backrefs+0x10/0x10
   ? __pfx_check_extent_item+0x10/0x10
   changed_cb+0x6fa/0x930
   ? tree_advance+0x362/0x390
   ? memcmp_extent_buffer+0xd7/0x160
   send_subvol+0xf0a/0x1520
   btrfs_ioctl_send+0x106b/0x11d0
   ? __pfx___clone_root_cmp_sort+0x10/0x10
   _btrfs_ioctl_send+0x1ac/0x240
   btrfs_ioctl+0x75b/0x850
   __se_sys_ioctl+0xca/0x150
   do_syscall_64+0x85/0x160
   ? __count_memcg_events+0x69/0x100
   ? handle_mm_fault+0x1327/0x15c0
   ? __se_sys_rt_sigprocmask+0xf1/0x180
   ? syscall_exit_to_user_mode+0x75/0xa0
   ? do_syscall_64+0x91/0x160
   ? do_user_addr_fault+0x21d/0x630
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fae145eeb4f
  Code: 00 48 89 (...)
  RSP: 002b:00007ffdf1cb09b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fae145eeb4f
  RDX: 00007ffdf1cb0ad0 RSI: 0000000040489426 RDI: 0000000000000004
  RBP: 00000000000078fe R08: 00007fae144006c0 R09: 00007ffdf1cb0927
  R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdf1cb1ce8
  R13: 0000000000000003 R14: 000055c499fab2e0 R15: 0000000000000004
   </TASK>

Fix this by not storing the NUL string terminator since we don't actually
need it for name cache entries; this way "name_len" corresponds to the
actual size of the "name" array. This requires marking the "name" array
field with __nonstring and using memcpy() instead of strcpy() as
recommended by the guidelines at:

   https://github.com/KSPP/linux/issues/90
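
As a user-space illustration of that approach (a sketch, not the send code;
struct ent and ent_new() are made up), the name is stored without a NUL
terminator so the length field matches the array's real size and memcpy()
replaces strcpy():

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Toy version of a name cache entry: "name" holds exactly name_len bytes,
   * with no NUL terminator, so the length field matches the allocation. */
  struct ent {
      int name_len;
      char name[];          /* flexible array member, not a C string */
  };

  static struct ent *ent_new(const char *src, int len)
  {
      struct ent *e = malloc(sizeof(*e) + len);

      if (!e)
          return NULL;
      e->name_len = len;
      memcpy(e->name, src, len);   /* memcpy, not strcpy: no +1 for the NUL */
      return e;
  }

  int main(void)
  {
      const char *path = "dir/file";
      struct ent *e = ent_new(path, (int)strlen(path));

      if (!e)
          return 1;
      printf("%.*s (%d bytes)\n", e->name_len, e->name, e->name_len);
      free(e);
      return 0;
  }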

Reported-by: David Arendt <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: c0247d289e73 ("btrfs: send: annotate struct name_cache_entry with __counted_by()")
CC: [email protected] # 6.11
Tested-by: David Arendt <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag
Filipe Manana [Sun, 15 Sep 2024 19:52:53 +0000 (20:52 +0100)]
btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag

When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
during remount and unmount, we are freeing every node from the rbtree
that tracks inodes for auto defrag using
rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
itself. So once we unlock the lock that protects the rbtree, we have a
tree pointing to a root that was freed (and a root pointing to freed
nodes, and their children pointing to other freed nodes, and so on).
This makes further access to the tree result in a use-after-free with
unpredictable results.

Fix this by initializing the rbtree to an empty root after the call to
rbtree_postorder_for_each_entry_safe() and before unlocking.
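
A minimal user-space analogy of the fix (a singly linked list stands in for
the rbtree; all names here are made up): free every node in place, then
reset the root to empty before the structure can be observed again:

  #include <stdlib.h>

  struct node {
      struct node *next;
  };

  struct tracker {
      struct node *head;    /* stands in for the rbtree root */
  };

  static void cleanup(struct tracker *t)
  {
      struct node *cur = t->head;

      /* Free every node without detaching them one by one... */
      while (cur) {
          struct node *next = cur->next;

          free(cur);
          cur = next;
      }
      /* ...so the root must be reset to empty before anyone looks at the
       * structure again, otherwise it still points at freed memory. */
      t->head = NULL;
  }

  int main(void)
  {
      struct tracker t = { .head = NULL };

      for (int i = 0; i < 3; i++) {
          struct node *n = calloc(1, sizeof(*n));

          if (!n)
              return 1;
          n->next = t.head;
          t.head = n;
      }
      cleanup(&t);
      cleanup(&t);   /* safe to call again because the root was reset */
      return 0;
  }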

Fixes: 276940915f23 ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: tree-checker: fix the wrong output of data backref objectid
Qu Wenruo [Tue, 10 Sep 2024 21:36:45 +0000 (07:06 +0930)]
btrfs: tree-checker: fix the wrong output of data backref objectid

[BUG]
There are some reports about invalid data backref objectids, the report
looks like this:

  BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543

The data ref objectid is the inode number inside the subvolume.

But in above case, the value is completely sane, not really showing the
problem.

[CAUSE]
The root cause of the problem is the deprecated feature, inode cache.

This feature results in a special inode number, -12ULL, and it's no longer
recognized by tree-checker, triggering the error.

The direct problem here is the output of data ref objectid. The value
shown is in fact the dref_root (subvolume id), not the dref_objectid
(inode number).

[FIX]
Fix the output to use dref_objectid instead.

Reported-by: Neil Parton <[email protected]>
Reported-by: Archange <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid")
CC: [email protected] # 6.11
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
5 months ago  btrfs: fix race setting file private on concurrent lseek using same fd
Filipe Manana [Tue, 3 Sep 2024 09:55:36 +0000 (10:55 +0100)]
btrfs: fix race setting file private on concurrent lseek using same fd

When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two
   threads (with the pthreads library for example), lets call them
   task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
   file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's
   private_data pointer into a local variable named 'private', which has
   a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
   in shared mode and enters file.c:find_desired_extent(), where it also
   extracts file->private_data into its local variable 'private', which
   has a NULL value;

5) Because it saw a NULL file private, task A allocates a private
   structure and assigns it to the file structure;

6) Task B also saw a NULL file private so it also allocates its own file
   private and then assigns it to the same file structure, since both
   tasks are using the same file descriptor.

   At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
ends up being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.

Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock, and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
use-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.
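
The allocation pattern of the fix can be sketched in user space like this
(a sketch with made-up types; a pthread mutex stands in for the inode
spinlock, and the task tracking mentioned above is omitted):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct file_private { int dummy; };

  struct file {
      pthread_mutex_t lock;              /* stands in for the inode spinlock */
      struct file_private *private_data;
  };

  /* Allocate outside the lock, then check and assign while holding it; the
   * loser of the race frees its own allocation instead of overwriting (and
   * leaking) the winner's. */
  static struct file_private *get_private(struct file *f)
  {
      struct file_private *prv = f->private_data;

      if (prv)
          return prv;
      prv = calloc(1, sizeof(*prv));
      if (!prv)
          return NULL;

      pthread_mutex_lock(&f->lock);
      if (!f->private_data) {
          f->private_data = prv;         /* we won the race */
      } else {
          free(prv);                     /* someone else won, use theirs */
          prv = f->private_data;
      }
      pthread_mutex_unlock(&f->lock);
      return prv;
  }

  int main(void)
  {
      struct file f = { .lock = PTHREAD_MUTEX_INITIALIZER, .private_data = NULL };

      printf("private=%p\n", (void *)get_private(&f));
      free(f.private_data);
      return 0;
  }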

Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: [email protected] # 6.6+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: only unlock the to-be-submitted ranges inside a folio
Qu Wenruo [Mon, 2 Sep 2024 04:59:06 +0000 (14:29 +0930)]
btrfs: only unlock the to-be-submitted ranges inside a folio

[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, where the page
dirty/writeback/unlock are all handled at a different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.

This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:

     0     16K     32K    48K     64K
     |/|   |///////|      |/|
       |                    |
       4K                   52K

Where |/| is the dirty range we need to submit.

In the above case, we need the following different handling for the 3
ranges:

- [0, 4K) needs to be submitted for regular write
  A single sector cannot be compressed.

- [16K, 32K) needs to be submitted for compressed write

- [48K, 52K) needs to be submitted for regular write.

Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 immediately, without submitting the remaining
[48K, 52K) range.

Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have its sector unlocked.

That's the reason why for now subpage is only allowed for full page
range.

[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
  This records which sectors will be submitted by extent_writepage_io().
  This allows us to track which sectors need to be submitted and thus
  later to be properly unlocked (illustrated in the sketch below).

  For asynchronously submitted range (inline/compression), the
  corresponding bits will be cleared from that bitmap.

- Only return 1 if no sector needs to be submitted in
  writepage_delalloc()

- Only submit sectors marked by submission bitmap inside
  extent_writepage_io()
  So we won't touch the asynchronously submitted part.

- Introduce btrfs_folio_end_writer_lock_bitmap() helper
  This will only unlock the involved sectors specified by @bitmap
  parameter, to avoid touching the range asynchronously submitted.

Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
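
A small user-space sketch of the bitmap idea (the numbers are taken from the
64K page example above; the bit operations are simplified and are not the
btrfs subpage helpers):

  #include <stdint.h>
  #include <stdio.h>

  #define SECTORS_PER_PAGE 16      /* 64K page with 4K sectors */

  int main(void)
  {
      uint16_t submit_bitmap = 0;

      /* Dirty ranges: [0,4K) -> bit 0, [16K,32K) -> bits 4-7,
       * [48K,52K) -> bit 12. */
      submit_bitmap |= 1u << 0;
      submit_bitmap |= 0x00F0;
      submit_bitmap |= 1u << 12;

      /* [16K,32K) is handed to compression: clear its bits, the async path
       * will unlock those sectors itself later. */
      submit_bitmap &= (uint16_t)~0x00F0u;

      /* Only submit and unlock what is still marked in the bitmap. */
      for (int i = 0; i < SECTORS_PER_PAGE; i++) {
          if (submit_bitmap & (1u << i))
              printf("submit + unlock sector %d ([%dK, %dK))\n",
                     i, i * 4, (i + 1) * 4);
      }
      return 0;
  }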

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()
Qu Wenruo [Mon, 2 Sep 2024 04:27:08 +0000 (13:57 +0930)]
btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()

The function btrfs_folio_unlock_writer() is already calling
btrfs_folio_end_writer_lock() to do the heavy lifting work, the only
difference being the missing 0 writer check.

Thus there is no need to keep two different functions: move the 0 writer
check into btrfs_folio_end_writer_lock(), and remove
btrfs_folio_unlock_writer().

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: BTRFS_PATH_AUTO_FREE in orphan.c
Leo Martins [Tue, 3 Sep 2024 18:19:07 +0000 (11:19 -0700)]
btrfs: BTRFS_PATH_AUTO_FREE in orphan.c

All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:

- btrfs_insert_orphan_item()
- btrfs_del_orphan_item()

Signed-off-by: Leo Martins <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: use btrfs_path auto free in zoned.c
Leo Martins [Tue, 3 Sep 2024 18:19:06 +0000 (11:19 -0700)]
btrfs: use btrfs_path auto free in zoned.c

All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:

- calculate_emulated_zone_size()
- calculate_alloc_pointer()

Signed-off-by: Leo Martins <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: DEFINE_FREE for struct btrfs_path
Leo Martins [Tue, 3 Sep 2024 18:19:05 +0000 (11:19 -0700)]
btrfs: DEFINE_FREE for struct btrfs_path

Add a DEFINE_FREE for struct btrfs_path. This defines a function that
can be called using the __free attribute. Define a macro
BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
very clear.

The intended use is to define the auto free of path in cases where the
path is allocated somewhere at the beginning and freed either on all
error paths or at the end of the function.

  int func() {
          BTRFS_PATH_AUTO_FREE(path);

          if (...)
                  return -ERROR;

          path = alloc_path();

          ...

          if (...)
                  return -ERROR;

          ...
          return 0;
  }

Signed-off-by: Leo Martins <[email protected]>
[ update changelog ]
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: remove btrfs_folio_end_all_writers()
Qu Wenruo [Fri, 30 Aug 2024 07:18:20 +0000 (16:48 +0930)]
btrfs: remove btrfs_folio_end_all_writers()

The function btrfs_folio_end_all_writers() is only utilized in
extent_writepage() as a way to unlock all subpage range (for both
successful submission and error handling).

Meanwhile we have a similar function, btrfs_folio_end_writer_lock().

The difference is, btrfs_folio_end_writer_lock() expects a range that is
a subset of the already locked range.

This limit on btrfs_folio_end_writer_lock() is a little overkill,
preventing it from being utilized for error paths.

So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
the locked range, and only end the locked subset.
This means we can replace btrfs_folio_end_all_writers() with
btrfs_folio_end_writer_lock() instead.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: constify more pointer parameters
David Sterba [Wed, 26 Jun 2024 21:39:11 +0000 (23:39 +0200)]
btrfs: constify more pointer parameters

Continue adding const to parameters.  This is for clarity and a minor
addition to safety. There are some minor effects in the assembly code
and the .ko, measured on a release config.

Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: rework BTRFS_I as macro to preserve parameter const
David Sterba [Wed, 5 Jun 2024 00:28:30 +0000 (02:28 +0200)]
btrfs: rework BTRFS_I as macro to preserve parameter const

Currently BTRFS_I is a static inline function that takes a const inode
and returns the btrfs inode, dropping the 'const' qualifier. This can break
assumptions of the compiler, though it seems there's no real case.

To make the parameter and return type consistent regarding const we can
use the container_of_const() that preserves it. However this would not
check the parameter type. To fix that use the same _Generic construct
but implement only the two expected types.

Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: add and use helper to verify the calling task has locked the inode
Filipe Manana [Thu, 29 Aug 2024 18:03:52 +0000 (19:03 +0100)]
btrfs: add and use helper to verify the calling task has locked the inode

We have a few places that check if we have the inode locked by doing:

    ASSERT(inode_is_locked(vfs_inode));

This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.

However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.

Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which is easy to get unnoticed by users
who may have lockdep enabled.

So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after, and use it everywhere we check the inode is locked.
Like this, if the lock is held by some other task we get the warning
in dmesg which is caught by fstests, very helpful during development,
and may also be occasionally noticed by users with lockdep enabled.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: always update fstrim_range on failure in FITRIM ioctl
Luca Stefani [Mon, 2 Sep 2024 11:10:53 +0000 (13:10 +0200)]
btrfs: always update fstrim_range on failure in FITRIM ioctl

Even in case of failure we could've discarded some data and userspace
should be made aware of it, so copy fstrim_range to userspace
regardless.

Also make sure to update the trimmed bytes amount even if
btrfs_trim_free_extents fails.

CC: [email protected] # 5.15+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Luca Stefani <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert copy_inline_to_page() to use folio
Li Zetao [Wed, 28 Aug 2024 18:29:08 +0000 (02:29 +0800)]
btrfs: convert copy_inline_to_page() to use folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover find_or_create_page() is a compatibility API, and it can
be replaced with __filemap_get_folio(). Some interfaces have been converted
to use folio before, so the conversion operation from page can be
eliminated here.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert btrfs_decompress() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:07 +0000 (02:29 +0800)]
btrfs: convert btrfs_decompress() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert zstd_decompress() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:06 +0000 (02:29 +0800)]
btrfs: convert zstd_decompress() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert lzo_decompress() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:05 +0000 (02:29 +0800)]
btrfs: convert lzo_decompress() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert zlib_decompress() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:04 +0000 (02:29 +0800)]
btrfs: convert zlib_decompress() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert try_release_extent_mapping() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:03 +0000 (02:29 +0800)]
btrfs: convert try_release_extent_mapping() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert try_release_extent_state() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:02 +0000 (02:29 +0800)]
btrfs: convert try_release_extent_state() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert submit_eb_page() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:01 +0000 (02:29 +0800)]
btrfs: convert submit_eb_page() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert submit_eb_subpage() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:29:00 +0000 (02:29 +0800)]
btrfs: convert submit_eb_subpage() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert read_key_bytes() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:28:59 +0000 (02:28 +0800)]
btrfs: convert read_key_bytes() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
which is more consistent with folio usage.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert try_release_extent_buffer() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:28:58 +0000 (02:28 +0800)]
btrfs: convert try_release_extent_buffer() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert try_release_subpage_extent_buffer() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:28:57 +0000 (02:28 +0800)]
btrfs: convert try_release_subpage_extent_buffer() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And use folio_pos instead of page_offset, which is more
consistent with folio usage. At the same time, folio_test_private() can
handle folio directly without converting from page to folio first.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert get_next_extent_buffer() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:28:56 +0000 (02:28 +0800)]
btrfs: convert get_next_extent_buffer() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Use folio_pos instead of page_offset, which is more
consistent with folio usage.

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: convert clear_page_extent_mapped() to take a folio
Li Zetao [Wed, 28 Aug 2024 18:28:55 +0000 (02:28 +0800)]
btrfs: convert clear_page_extent_mapped() to take a folio

The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().

Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: make compression path to be subpage compatible
Qu Wenruo [Wed, 29 May 2024 07:33:47 +0000 (17:03 +0930)]
btrfs: make compression path to be subpage compatible

Currently the btrfs compression path is not really subpage compatible,
everything is still done in page units.

That's fine for regular sector size and the subpage routine, as even for
the subpage routine compression is only enabled if the whole range is page
aligned, so reading the page cache in page units is totally fine.

However in preparation for the future subpage perfect compression
support, we need to change the compression routine to properly handle a
subpage range.

This patch prepares both zlib and zstd to only read the subpage
range for compression.
Lzo is already doing subpage-aware reads, as lzo's on-disk format is
already sectorsize dependent.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months ago  btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()
Qu Wenruo [Sat, 24 Aug 2024 10:06:43 +0000 (19:36 +0930)]
btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()

There are only two differences between the two functions:

- btrfs_orig_bbio_end_io() does extra error propagation
  This is mostly to allow tolerance for write errors.

- btrfs_orig_bbio_end_io() does extra pending_ios check
  This check can handle both the original bio, or the cloned one.
  (All accounting happens in the original one).

This makes btrfs_orig_bbio_end_io() a much safer call.
In fact we already had a double freeing error due to usage of
btrfs_bio_end_io() in the error path of btrfs_submit_chunk().

So just move the whole content of btrfs_orig_bbio_end_io() into
btrfs_bio_end_io().

For normal paths this brings no change, because they are already calling
btrfs_orig_bbio_end_io() in the first place.

For error paths (not only inside bio.c but also external callers), this
change will introduce extra checks, especially for external callers, as
they will error out without submitting the btrfs bio.

But considering it's already in the error path, such slower but much
safer checks are still an overall win.
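
A minimal sketch of the merged shape (helper and field names are
placeholders and differ from the actual fs/btrfs/bio.c code):

  void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
  {
          /* Placeholder: map a cloned bbio back to the original one. */
          struct btrfs_bio *orig = orig_bbio(bbio);

          /* Error propagation, formerly only in btrfs_orig_bbio_end_io(). */
          if (status && orig->status == BLK_STS_OK)
                  orig->status = status;

          /* pending_ios accounting always happens on the original bbio. */
          if (atomic_dec_and_test(&orig->pending_ios))
                  orig_bbio_complete(orig);       /* placeholder completion */
  }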

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: do not hold the extent lock for entire read
Josef Bacik [Fri, 16 Aug 2024 19:16:24 +0000 (15:16 -0400)]
btrfs: do not hold the extent lock for entire read

Historically we've held the extent lock throughout the entire read.
There have been a few reasons for this, but it's mostly just caused us
problems.  For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock.  This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.

On the buffered side we are protected by the page lock: as long as we're
reading, things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.

On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked prior to this patch, but only for direct
reads.  This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.

Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation.  Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map.  This portion was contributed by Goldwyn.
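
A simplified sketch of the narrowed locking (not the exact do_readpage()
code; submit_read_bios() is a placeholder):

  /* Before: the extent lock was held across the whole read and only
   * released from the read end_io handler. */

  /* After: take it only around the extent map lookup. */
  lock_extent(&inode->io_tree, start, end, &cached_state);
  em = btrfs_get_extent(inode, folio, start, len);
  unlock_extent(&inode->io_tree, start, end, &cached_state);
  submit_read_bios(em);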

Co-developed-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: take the dio extent lock during O_DIRECT operations
Josef Bacik [Mon, 19 Aug 2024 20:50:01 +0000 (16:50 -0400)]
btrfs: take the dio extent lock during O_DIRECT operations

Currently we hold the extent lock for the entire duration of a read.
This isn't really necessary in the buffered case, we're protected by the
page lock, however it's necessary for O_DIRECT.

For O_DIRECT reads, if we only locked the extent for the part where we
get the extent, we could potentially race with an O_DIRECT write in the
same region.  This isn't really a problem, unless the read is delayed so
much that the write does the COW, unpins the old extent, and some other
application re-allocates the extent before the read is actually able to
be submitted.  At that point at best we'd have a checksum mismatch, but
at worst we could read data that doesn't belong to us.

To address this potential race we need to make sure we don't have
overlapping, concurrent direct io reads and writes.

To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO
case in the same spot as the current extent lock.  The writes will take
this while they're creating the ordered extent, which is also used to
make sure concurrent buffered reads or concurrent direct reads are not
allowed to occur, and drop it after the ordered extent is taken.  For
reads it will act like the current read behavior for the EXTENT_LOCKED
bit: we set it when we're starting the read and clear it in the end_io
to allow other direct writes to continue.

This still has the drawback of disallowing concurrent overlapping direct
reads from occurring, but that exists with the current extent locking.
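
Schematically (helper names follow the EXTENT_DIO_LOCKED patches, the
surrounding code is simplified):

  /* Direct write: hold the DIO lock while creating the ordered extent. */
  lock_dio_extent(io_tree, start, end, &cached_state);
  /* ... create the ordered extent for the range ... */
  unlock_dio_extent(io_tree, start, end, &cached_state);

  /* Direct read: set at submission time, cleared from the read end_io. */
  lock_dio_extent(io_tree, start, end, NULL);
  /* ... submit the read bios, then in the end_io handler: */
  unlock_dio_extent(io_tree, start, end, NULL);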

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: introduce EXTENT_DIO_LOCKED
Josef Bacik [Mon, 19 Aug 2024 18:15:52 +0000 (14:15 -0400)]
btrfs: introduce EXTENT_DIO_LOCKED

In order to support dropping the extent lock during a read we need a way
to make sure that direct reads and direct writes for overlapping ranges
are protected from each other.  To accomplish this introduce another
lock bit specifically for direct io.  Subsequent patches will utilize
this to protect direct IO operations.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: always pass readahead state to defrag
David Sterba [Tue, 27 Aug 2024 02:26:51 +0000 (04:26 +0200)]
btrfs: always pass readahead state to defrag

Defrag ioctl passes readahead from the file, but autodefrag does not
have a file so the readahead state is allocated when needed.

The autodefrag loop in the cleaner thread iterates over inodes, so we can
simply provide an on-stack readahead state and not need to allocate it
in btrfs_defrag_file(). The size is 32 bytes, which is acceptable.
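
A minimal sketch of what the autodefrag path then does (shape assumed,
simplified from the real defrag code):

  struct file_ra_state ra = { 0 };

  file_ra_state_init(&ra, inode->i_mapping);
  btrfs_defrag_file(inode, &ra, &range, defrag->transid, BTRFS_DEFRAG_BATCH);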

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: drop transaction parameter from btrfs_add_inode_defrag()
David Sterba [Tue, 27 Aug 2024 02:13:44 +0000 (04:13 +0200)]
btrfs: drop transaction parameter from btrfs_add_inode_defrag()

There's only one caller, inode_should_defrag(), that passes NULL to
btrfs_add_inode_defrag(), so we can drop the parameter and simplify the
code.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: return void from btrfs_add_inode_defrag()
David Sterba [Tue, 27 Aug 2024 02:10:11 +0000 (04:10 +0200)]
btrfs: return void from btrfs_add_inode_defrag()

The potential memory allocation failure is not a fatal error, skipping
autodefrag is fine and the caller inode_should_defrag() does not care
about the errors.  Further writes can attempt to add the inode back to
the defragmentation list again.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()
David Sterba [Tue, 27 Aug 2024 02:05:48 +0000 (04:05 +0200)]
btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()

btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate the first node, remove it
and potentially rebalance the tree, until it's done. This does allow a
conditional reschedule, though.

For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but we can't reschedule and restart iteration because some of
the tree nodes would be already freed.

The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects so rescheduling should not be necessary.
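
The cleanup then boils down to something like this (field names assumed
from the surrounding defrag code):

  struct inode_defrag *defrag, *next;

  rbtree_postorder_for_each_entry_safe(defrag, next,
                                       &fs_info->defrag_inodes, rb_node)
          kmem_cache_free(btrfs_inode_defrag_cachep, defrag);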

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __btrfs_run_defrag_inode() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:51:38 +0000 (03:51 +0200)]
btrfs: rename __btrfs_run_defrag_inode() and drop double underscores

The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __btrfs_add_inode_defrag() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:50:43 +0000 (03:50 +0200)]
btrfs: rename __btrfs_add_inode_defrag() and drop double underscores

The function does not follow the pattern where the underscores would be
justified, so rename it.

Also update the misleading comment: the passed item is not freed, that
is done by the caller.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __need_auto_defrag() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:45:38 +0000 (03:45 +0200)]
btrfs: rename __need_auto_defrag() and drop double underscores

The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: constify arguments of compare_inode_defrag()
David Sterba [Tue, 27 Aug 2024 01:44:30 +0000 (03:44 +0200)]
btrfs: constify arguments of compare_inode_defrag()

A comparator function does not change its parameters, make them const.
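
For illustration, the comparator then takes const pointers (field names
and ordering assumed):

  static int compare_inode_defrag(const struct inode_defrag *defrag1,
                                  const struct inode_defrag *defrag2)
  {
          if (defrag1->ino < defrag2->ino)
                  return -1;
          if (defrag1->ino > defrag2->ino)
                  return 1;
          return 0;
  }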

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __compare_inode_defrag() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:44:15 +0000 (03:44 +0200)]
btrfs: rename __compare_inode_defrag() and drop double underscores

The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __extent_writepage() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:30:16 +0000 (03:30 +0200)]
btrfs: rename __extent_writepage() and drop double underscores

The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename __btrfs_submit_bio() and drop double underscores
David Sterba [Tue, 27 Aug 2024 01:41:45 +0000 (03:41 +0200)]
btrfs: rename __btrfs_submit_bio() and drop double underscores

The previous patch freed up the function name btrfs_submit_bio(), so we
can use it for a helper that submits a struct bio.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()
David Sterba [Tue, 27 Aug 2024 01:40:11 +0000 (03:40 +0200)]
btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()

The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: subpage: remove btrfs_fs_info::subpage_info member
Qu Wenruo [Mon, 26 Aug 2024 06:14:50 +0000 (15:44 +0930)]
btrfs: subpage: remove btrfs_fs_info::subpage_info member

The member btrfs_fs_info::subpage_info stores the cached bitmap start
position inside the merged bitmap.

However in reality there is only one thing depending on the sectorsize,
bitmap_nr_bits, which records the number of sectors that fit inside a
page.

The sequence of sub-bitmaps has a fixed order, thus it's just a quick
multiplication to calculate the start position of each sub-bitmap.
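
In other words, the cached offsets can be replaced by a computation along
these lines (a sketch with assumed names, not the exact subpage.c code):

  /* The sub-bitmaps always appear in this fixed order. */
  enum {
          btrfs_bitmap_nr_uptodate = 0,
          btrfs_bitmap_nr_dirty,
          btrfs_bitmap_nr_writeback,
          /* further sub-bitmaps follow in the same fixed order */
  };

  static unsigned int subpage_bitmap_start(const struct btrfs_fs_info *fs_info,
                                           unsigned int which)
  {
          /* bitmap_nr_bits: number of sectors that fit in one page. */
          const unsigned int bitmap_nr_bits =
                  PAGE_SIZE >> fs_info->sectorsize_bits;

          return which * bitmap_nr_bits;
  }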

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: remove the nr_ret parameter from __extent_writepage_io()
Qu Wenruo [Wed, 14 Aug 2024 01:20:21 +0000 (10:50 +0930)]
btrfs: remove the nr_ret parameter from __extent_writepage_io()

The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.

Callers then check the @nr_ret value to determine if we need to manually
clear PAGECACHE_TAG_DIRTY, because if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() call and
thus the PAGECACHE_TAG_DIRTY tag will not be cleared.

Remove this parameter by:

- Moving the btrfs_folio_clear_writeback() call into
  __extent_writepage_io()
  So that if we didn't submit any IO, then manually call
  btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
  the page is no longer dirty.

- Use a bool to record if we have submitted any sector
  Instead of an int.

- Use subpage compatible helpers to end folio writeback.
  This brings no change to the behavior, just for the sake of consistency.

  As for the call site inside __extent_writepage(), we're always called
  for the whole page, so the existing full page helper
  folio_(start|end)_writeback() is totally fine.

  For the call site inside extent_write_locked_range(), although we can
  have subpage range, folio_start_writeback() will only clear
  PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
  will still be dirty if there is any subpage dirty range.
  Only when the last dirty subpage sector is cleared will
  folio_start_writeback() clear PAGECACHE_TAG_DIRTY.

  So no matter if we call the full page or subpage helper, the result
  is still the same, then just use the subpage helpers for consistency.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: send: fix grammar in comments
Thorsten Blum [Wed, 14 Aug 2024 08:13:29 +0000 (10:13 +0200)]
btrfs: send: fix grammar in comments

Fix a few obvious grammar mistakes: a -> an, then -> than.

Signed-off-by: Thorsten Blum <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: qgroup: use xarray to track dirty extents in transaction
Junchao Sun [Fri, 7 Jun 2024 14:30:21 +0000 (22:30 +0800)]
btrfs: qgroup: use xarray to track dirty extents in transaction

Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes.  The xarray is
more cache line friendly and it also reduces the complexity of insertion
and search code compared to the rb tree.

Another change introduced is about error handling.  Before this patch,
the result of btrfs_qgroup_trace_extent_nolock() was always a success.
In this patch, because this function calls xa_store(), which can fail,
mark the qgroup as inconsistent if an error happens and then free the
preallocated memory. Also we preallocate the memory before taking the
spin_lock(); if the preallocation fails, the error handling is the same
as in the existing code.
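
A minimal sketch of the failure-aware insertion (field names and the
index choice are illustrative, the real qgroup code differs in detail):

  int ret;

  ret = xa_err(xa_store(&delayed_refs->dirty_extents, record->bytenr,
                        record, GFP_ATOMIC));
  if (ret) {
          /* xa_store() failed, qgroup numbers can no longer be trusted. */
          qgroup_mark_inconsistent(fs_info);
          return ret;             /* the caller frees the record */
  }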

Suggested-by: Qu Wenruo <[email protected]>
Signed-off-by: Junchao Sun <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: qgroup: use goto style to handle errors in add_delayed_ref()
Junchao Sun [Fri, 7 Jun 2024 14:30:20 +0000 (22:30 +0800)]
btrfs: qgroup: use goto style to handle errors in add_delayed_ref()

Clean up resources using goto to get rid of repeated code.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Junchao Sun <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Qu Wenruo [Wed, 7 Aug 2024 05:01:54 +0000 (14:31 +0930)]
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission

Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.

This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.

However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:

- Extract the sector submission code into submit_one_sector()

- Add the needed code to extract the dirty bitmap for subpage case
  There is a small pitfall for the non-subpage case, as we have already
  cleared the page dirty flag before starting writeback, so we have to
  manually set the default dirty_bitmap to 1 in that case.

- Use bitmap_and() to calculate the target sectors we need to submit
  This is done for both subpage and non-subpage cases, and will later
  be expanded to skip inline/compression ranges.

For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.

For larger page sizes, the overhead will be a little larger, as
previously we only needed to do one extent_map lookup per dirty range,
but now it will be one extent_map lookup per sector.

But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
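
Schematically, the submission loop then looks like this (a simplified
sketch; get_subpage_dirty_bitmap() is a placeholder and the parameters of
submit_one_sector() are assumed):

  /* One bit per sector in the page, a single bit for the non-subpage case. */
  unsigned long dirty_bitmap = 1;
  const unsigned int nbits = PAGE_SIZE >> fs_info->sectorsize_bits;
  unsigned int bit;
  int ret = 0;

  if (nbits > 1)
          get_subpage_dirty_bitmap(fs_info, folio, &dirty_bitmap);

  for_each_set_bit(bit, &dirty_bitmap, nbits) {
          const u64 cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);

          ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
          if (ret < 0)
                  break;
  }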

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: subpage: fix the bitmap dump which can cause bitmap corruption
Qu Wenruo [Fri, 30 Aug 2024 07:05:48 +0000 (16:35 +0930)]
btrfs: subpage: fix the bitmap dump which can cause bitmap corruption

In commit 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for
debug") an internal macro GET_SUBPAGE_BITMAP() is introduced to grab the
bitmap of each attribute.

But that commit is using bitmap_cut() which will do the left shift of
the larger bitmap, causing incorrect values.

Thankfully this bitmap_cut() is only called for debug usage, and so far
it's not yet causing problems.

Fix it to use bitmap_read() to only grab the desired sub-bitmap.
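
A sketch of the change (arguments simplified, the real GET_SUBPAGE_BITMAP()
macro carries more context):

  /* Buggy: bitmap_cut() also shifts the source bitmap it reads from. */
  bitmap_cut(dst, subpage->bitmaps, start_bit, nbits, total_nbits);

  /* Fixed: bitmap_read() only extracts the value, the source stays intact. */
  dst_val = bitmap_read(subpage->bitmaps, start_bit, nbits);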

Fixes: 75258f20fb70 ("btrfs: subpage: dump extra subpage bitmaps for debug")
CC: [email protected] # 6.6+
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: reduce chunk_map lookups in btrfs_map_block()
Johannes Thumshirn [Tue, 13 Aug 2024 11:36:40 +0000 (13:36 +0200)]
btrfs: reduce chunk_map lookups in btrfs_map_block()

Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in
btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup
to be able to calculate the number of copies.

So split out the code getting the number of copies from btrfs_num_copies()
into a helper called btrfs_chunk_map_num_copies() and directly call it
from btrfs_map_block() and btrfs_num_copies().

This saves us one rbtree lookup per btrfs_map_block() invocation.
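
On the caller side the change is roughly (simplified sketch):

  /* Before: two chunk map lookups per btrfs_map_block() call. */
  num_copies = btrfs_num_copies(fs_info, logical, *length);
  map = btrfs_get_chunk_map(fs_info, logical, *length);

  /* After: look the map up once and derive the copy count from it. */
  map = btrfs_get_chunk_map(fs_info, logical, *length);
  if (IS_ERR(map))
          return PTR_ERR(map);
  num_copies = btrfs_chunk_map_num_copies(map);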

Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: directly wake up cleaner kthread in the BTRFS_IOC_SYNC ioctl
Filipe Manana [Wed, 7 Aug 2024 16:13:58 +0000 (17:13 +0100)]
btrfs: directly wake up cleaner kthread in the BTRFS_IOC_SYNC ioctl

The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it
does any pending work (subvolume deletion, delayed iputs, etc), however
it is waking up the transaction kthread, which in turn wakes up the
cleaner. Since we don't have any transaction to commit, as any ongoing
transaction was already committed when it called btrfs_sync_fs() and
the goal is just to wake up the cleaner thread, directly wake up the
cleaner instead of the transaction kthread.
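
The change itself is essentially a one-liner (sketch):

  /* Before: wakes the transaction kthread, which then wakes the cleaner. */
  wake_up_process(fs_info->transaction_kthread);

  /* After: there is nothing to commit, wake the cleaner directly. */
  wake_up_process(fs_info->cleaner_kthread);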

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: make btrfs_is_subpage() to return false directly for 4K page size
Qu Wenruo [Mon, 5 Aug 2024 05:32:54 +0000 (15:02 +0930)]
btrfs: make btrfs_is_subpage() to return false directly for 4K page size

Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for
systems with 4K page size, there is no way the fs is subpage (sectorsize
< PAGE_SIZE).

So here we define btrfs_is_subpage() differently according to the
PAGE_SIZE:

- PAGE_SIZE > 4K
  We may hit real subpage cases, define btrfs_is_subpage() as a regular
  function and do the usual checks.

- PAGE_SIZE == 4K (no smaller PAGE_SIZE support AFAIK)
  There is no way the fs is subpage, so just define btrfs_is_subpage()
  as an inline function which always returns false.

This saves about 7K bytes for x86_64 debug builds:

            text    data    bss      dec     hex  filename
Before:  1484452  168693  25776  1678921  199e49  fs/btrfs/btrfs.ko
After:   1476605  168445  25776  1670826  197eaa  fs/btrfs/btrfs.ko
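
A sketch of the resulting split (the parameter list is assumed, the real
declaration lives in subpage.h):

  #if PAGE_SIZE > SZ_4K
  /* Real check: sectorsize may be smaller than the page size. */
  bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                        struct address_space *mapping);
  #else
  /* 4K pages: sectorsize can never be smaller, so never subpage. */
  static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                      struct address_space *mapping)
  {
          return false;
  }
  #endif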

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: change RST lookup error message level to debug
Johannes Thumshirn [Wed, 31 Jul 2024 20:43:07 +0000 (22:43 +0200)]
btrfs: change RST lookup error message level to debug

Now that RAID stripe-tree lookup failures are not treated as a fatal issue
any more, change the RAID stripe-tree lookup error message to debug level.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: don't readahead the relocation inode on RST
Johannes Thumshirn [Wed, 31 Jul 2024 20:43:06 +0000 (22:43 +0200)]
btrfs: don't readahead the relocation inode on RST

On relocation we're doing readahead on the relocation inode, but if the
filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to
preallocated extents not being mapped in the RST) from the lookup.

But readahead doesn't handle the error and submits invalid reads to the
device, causing an assertion in the scatter-gather list code:

  BTRFS info (device nvme1n1): balance: start -d -m -s
  BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0
  BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0
  ------------[ cut here ]------------
  kernel BUG at include/linux/scatterlist.h:115!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Call Trace:
   <TASK>
   ? __die_body.cold+0x14/0x25
   ? die+0x2e/0x50
   ? do_trap+0xca/0x110
   ? do_error_trap+0x65/0x80
   ? __blk_rq_map_sg+0x339/0x4a0
   ? exc_invalid_op+0x50/0x70
   ? __blk_rq_map_sg+0x339/0x4a0
   ? asm_exc_invalid_op+0x1a/0x20
   ? __blk_rq_map_sg+0x339/0x4a0
   nvme_prep_rq.part.0+0x9d/0x770
   nvme_queue_rq+0x7d/0x1e0
   __blk_mq_issue_directly+0x2a/0x90
   ? blk_mq_get_budget_and_tag+0x61/0x90
   blk_mq_try_issue_list_directly+0x56/0xf0
   blk_mq_flush_plug_list.part.0+0x52b/0x5d0
   __blk_flush_plug+0xc6/0x110
   blk_finish_plug+0x28/0x40
   read_pages+0x160/0x1c0
   page_cache_ra_unbounded+0x109/0x180
   relocate_file_extent_cluster+0x611/0x6a0
   ? btrfs_search_slot+0xba4/0xd20
   ? balance_dirty_pages_ratelimited_flags+0x26/0xb00
   relocate_data_extent.constprop.0+0x134/0x160
   relocate_block_group+0x3f2/0x500
   btrfs_relocate_block_group+0x250/0x430
   btrfs_relocate_chunk+0x3f/0x130
   btrfs_balance+0x71b/0xef0
   ? kmalloc_trace_noprof+0x13b/0x280
   btrfs_ioctl+0x2c2e/0x3030
   ? kvfree_call_rcu+0x1e6/0x340
   ? list_lru_add_obj+0x66/0x80
   ? mntput_no_expire+0x3a/0x220
   __x64_sys_ioctl+0x96/0xc0
   do_syscall_64+0x54/0x110
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fcc04514f9b
  Code: Unable to access opcode bytes at 0x7fcc04514f71.
  RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b
  RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001
  R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5
  R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception ]---

So in case of a relocation on a RAID stripe-tree based file system, skip
the readahead.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: set search_commit_root on stripe io in case of relocation
Johannes Thumshirn [Wed, 31 Jul 2024 20:43:05 +0000 (22:43 +0200)]
btrfs: set search_commit_root on stripe io in case of relocation

Set rst_search_commit_root in the btrfs_io_stripe we're passing to
btrfs_map_block() in case we're doing data relocation.
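
Schematically (the field name comes from the previous patch, the
surrounding code is assumed):

  struct btrfs_io_stripe smap = { 0 };

  if (btrfs_is_data_reloc_root(inode->root))
          smap.rst_search_commit_root = true;

  ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, &length,
                        &bioc, &smap, NULL);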

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: rename btrfs_io_stripe::is_scrub to rst_search_commit_root
Johannes Thumshirn [Wed, 31 Jul 2024 20:43:04 +0000 (22:43 +0200)]
btrfs: rename btrfs_io_stripe::is_scrub to rst_search_commit_root

Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While
'is_scrub' describes the state of the io_stripe (it is a stripe submitted
by scrub), it does not describe the purpose, namely looking at the commit
root when searching RAID stripe-tree entries.

Renaming the member to rst_search_commit_root describes this purpose.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: don't dump stripe-tree on lookup error
Johannes Thumshirn [Wed, 31 Jul 2024 20:43:03 +0000 (22:43 +0200)]
btrfs: don't dump stripe-tree on lookup error

This just creates unnecessary noise and doesn't provide any insights into
debugging RAID stripe-tree related issues.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: add comment about locking in cow_file_range_inline()
Boris Burkov [Wed, 31 Jul 2024 19:41:06 +0000 (12:41 -0700)]
btrfs: add comment about locking in cow_file_range_inline()

Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.

Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: more efficient chunk map iteration when device replace finishes
Filipe Manana [Thu, 25 Jul 2024 10:48:10 +0000 (11:48 +0100)]
btrfs: more efficient chunk map iteration when device replace finishes

When iterating the chunk maps when a device replace finishes we are doing
a full rbtree search for each chunk map, which is not the most efficient
thing to do, wasting CPU time. As we are holding a write lock on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.
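
The iteration then roughly follows this pattern (a sketch with placeholder
helpers, not the exact dev-replace.c code):

  struct rb_node *node;

  write_lock(&fs_info->mapping_tree_lock);
  node = rb_first_cached(&fs_info->mapping_tree);
  while (node) {
          struct btrfs_chunk_map *map;

          map = rb_entry(node, struct btrfs_chunk_map, rb_node);
          replace_stripes_in_chunk_map(map, srcdev, tgtdev);  /* placeholder */

          if (need_resched()) {
                  const u64 next_start = map->start + map->chunk_len;

                  write_unlock(&fs_info->mapping_tree_lock);
                  cond_resched();
                  write_lock(&fs_info->mapping_tree_lock);
                  /* The tree may have changed, fall back to a full search. */
                  node = first_chunk_map_node_from(fs_info, next_start);
                  continue;
          }
          node = rb_next(node);
  }
  write_unlock(&fs_info->mapping_tree_lock);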

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: reschedule when updating chunk maps at the end of a device replace
Filipe Manana [Thu, 25 Jul 2024 09:46:01 +0000 (10:46 +0100)]
btrfs: reschedule when updating chunk maps at the end of a device replace

At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).

So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert extent_range_clear_dirty_for_io() to use a folio
Josef Bacik [Thu, 25 Jul 2024 00:39:35 +0000 (20:39 -0400)]
btrfs: convert extent_range_clear_dirty_for_io() to use a folio

Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert insert_inline_extent() to use a folio
Josef Bacik [Thu, 25 Jul 2024 00:37:20 +0000 (20:37 -0400)]
btrfs: convert insert_inline_extent() to use a folio

We only use a page to copy in the data for the inline extent.  Use a
folio for this instead.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_set_range_writeback() to use a folio
Josef Bacik [Thu, 25 Jul 2024 00:22:32 +0000 (20:22 -0400)]
btrfs: convert btrfs_set_range_writeback() to use a folio

We already use a lot of functions here that use folios; update the
function to use __filemap_get_folio() instead of find_get_page() and
then use the folio directly.
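
A sketch of the lookup change (flags simplified):

  /* Before. */
  page = find_get_page(inode->vfs_inode.i_mapping, index);

  /* After: get the folio and work with it directly. */
  folio = __filemap_get_folio(inode->vfs_inode.i_mapping, index, 0, 0);
  if (IS_ERR(folio))
          return;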

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert wait_subpage_spinlock() to only use a folio
Josef Bacik [Thu, 25 Jul 2024 00:20:24 +0000 (20:20 -0400)]
btrfs: convert wait_subpage_spinlock() to only use a folio

Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert find_next_dirty_byte() to take a folio
Josef Bacik [Thu, 25 Jul 2024 00:17:26 +0000 (20:17 -0400)]
btrfs: convert find_next_dirty_byte() to take a folio

We already use a folio in some places in this function, so replace all
page usage with the folio and update the function to take the folio as
an argument.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert __get_extent_map() to take a folio
Josef Bacik [Wed, 24 Jul 2024 23:28:26 +0000 (19:28 -0400)]
btrfs: convert __get_extent_map() to take a folio

Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_get_extent() to take a folio
Josef Bacik [Wed, 24 Jul 2024 23:26:58 +0000 (19:26 -0400)]
btrfs: convert btrfs_get_extent() to take a folio

We only pass this into read_inline_extent, change it to take a folio and
update the callers.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert read_inline_extent() to use a folio
Josef Bacik [Wed, 24 Jul 2024 23:25:10 +0000 (19:25 -0400)]
btrfs: convert read_inline_extent() to use a folio

Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert uncompress_inline() to take a folio
Josef Bacik [Wed, 24 Jul 2024 21:38:46 +0000 (17:38 -0400)]
btrfs: convert uncompress_inline() to take a folio

Update uncompress_inline() to take a folio and update its usage
accordingly.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert struct btrfs_writepage_fixup to use a folio
Josef Bacik [Wed, 24 Jul 2024 21:28:05 +0000 (17:28 -0400)]
btrfs: convert struct btrfs_writepage_fixup to use a folio

Now the fixup creator and consumer use folios, change this to use a
folio as well.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_writepage_cow_fixup() to use folio
Josef Bacik [Wed, 24 Jul 2024 21:26:31 +0000 (17:26 -0400)]
btrfs: convert btrfs_writepage_cow_fixup() to use folio

Instead of a page, use a folio for btrfs_writepage_cow_fixup.  We
already have a folio at the only caller, and the fixup worker uses
folios.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_writepage_fixup_worker() to use a folio
Josef Bacik [Wed, 24 Jul 2024 21:21:25 +0000 (17:21 -0400)]
btrfs: convert btrfs_writepage_fixup_worker() to use a folio

This function heavily messes with pages, so update it to use a folio
instead.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert submit_uncompressed_range() to take a folio
Josef Bacik [Wed, 24 Jul 2024 21:13:17 +0000 (17:13 -0400)]
btrfs: convert submit_uncompressed_range() to take a folio

This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert struct async_chunk to hold a folio
Josef Bacik [Wed, 24 Jul 2024 21:09:33 +0000 (17:09 -0400)]
btrfs: convert struct async_chunk to hold a folio

Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_run_delalloc_range() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:56:32 +0000 (16:56 -0400)]
btrfs: convert btrfs_run_delalloc_range() to take a folio

Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert run_delalloc_compressed() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:52:57 +0000 (16:52 -0400)]
btrfs: convert run_delalloc_compressed() to take a folio

This just passes the page into the compressed machinery to keep track of
the locked page.  Update this to take a folio and convert it to a page
where appropriate.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_cleanup_ordered_extents() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:49:54 +0000 (16:49 -0400)]
btrfs: convert btrfs_cleanup_ordered_extents() to take a folio

Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and update the function
and the callers as appropriate.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert btrfs_cleanup_ordered_extents() to use folios
Josef Bacik [Wed, 24 Jul 2024 20:46:01 +0000 (16:46 -0400)]
btrfs: convert btrfs_cleanup_ordered_extents() to use folios

We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert run_delalloc_nocow() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:42:23 +0000 (16:42 -0400)]
btrfs: convert run_delalloc_nocow() to take a folio

Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert fallback_to_cow() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:40:34 +0000 (16:40 -0400)]
btrfs: convert fallback_to_cow() to take a folio

With this we can pass the folio directly into cow_file_range().

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert cow_file_range() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:37:29 +0000 (16:37 -0400)]
btrfs: convert cow_file_range() to take a folio

Convert this to take a folio and pass it into all of the various cleanup
functions.  Update the callers to pass in a folio instead.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert cow_file_range_inline() to take a folio
Josef Bacik [Fri, 26 Jul 2024 19:25:45 +0000 (15:25 -0400)]
btrfs: convert cow_file_range_inline() to take a folio

Now that we want the folio in this function, convert it to take a folio
directly and use that.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert run_delalloc_cow() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:34:15 +0000 (16:34 -0400)]
btrfs: convert run_delalloc_cow() to take a folio

We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert extent_write_locked_range() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:31:44 +0000 (16:31 -0400)]
btrfs: convert extent_write_locked_range() to take a folio

This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
6 months agobtrfs: convert extent_clear_unlock_delalloc() to take a folio
Josef Bacik [Wed, 24 Jul 2024 20:29:18 +0000 (16:29 -0400)]
btrfs: convert extent_clear_unlock_delalloc() to take a folio

Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>