Qu Wenruo [Fri, 24 Jul 2020 06:46:10 +0000 (14:46 +0800)]
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations

[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:

  generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)

And with the following metadata leak:

  BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
  Call Trace:
   btrfs_put_super+0x15/0x17 [btrfs]
   generic_shutdown_super+0x72/0x110
   kill_anon_super+0x18/0x30
   btrfs_kill_super+0x17/0x30 [btrfs]
   deactivate_locked_super+0x3b/0xa0
   deactivate_super+0x40/0x50
   cleanup_mnt+0x135/0x190
   __cleanup_mnt+0x12/0x20
   task_work_run+0x64/0xb0
   __prepare_exit_to_usermode+0x1bc/0x1c0
   __syscall_return_slowpath+0x47/0x230
   do_syscall_64+0x64/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace a6cfd45ba80e4e06 ]---
  BTRFS error (device dm-3): qgroup reserved space leaked
  BTRFS info (device dm-3): disk space caching is enabled
  BTRFS info (device dm-3): has skinny extents

[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:

  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072

It's pretty obvious that we reserve the qgroup meta rsv in
btrfs_subvolume_reserve_metadata() but don't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().

This leads to the leak.

[FIX]
To fix this bug, follow what we do in btrfs_delalloc_reserve_metadata(),
where we reserve qgroup space and add it to
block_rsv->qgroup_rsv_reserved, and then free the qgroup reserved
metadata space when releasing the block_rsv.

To do this, change btrfs_subvolume_release_metadata() to accept a
btrfs_root, record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
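
For illustration, a minimal userspace sketch of that bookkeeping pattern
follows; the struct and function names only echo the kernel ones
mentioned above and this is not the actual btrfs implementation:

  #include <stdio.h>

  /* Toy stand-ins for the kernel structures named above. */
  struct block_rsv {
          unsigned long long reserved;            /* metadata reservation */
          unsigned long long qgroup_rsv_reserved; /* qgroup meta rsv tracked with it */
  };

  static void subvolume_reserve(struct block_rsv *rsv, unsigned long long bytes)
  {
          rsv->reserved += bytes;
          /* remember the qgroup rsv so it can be converted on release */
          rsv->qgroup_rsv_reserved += bytes;
  }

  static unsigned long long subvolume_release(struct block_rsv *rsv)
  {
          unsigned long long qgroup_to_release = rsv->qgroup_rsv_reserved;

          rsv->reserved = 0;
          rsv->qgroup_rsv_reserved = 0;
          /* the caller must convert/free this amount; skipping it is the leak */
          return qgroup_to_release;
  }

  int main(void)
  {
          struct block_rsv rsv = { 0, 0 };

          subvolume_reserve(&rsv, 49152);
          printf("convert %llu bytes of qgroup meta rsv\n",
                 subvolume_release(&rsv));
          return 0;
  }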

Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: [email protected] # 4.19+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Qu Wenruo [Fri, 24 Jul 2020 06:46:09 +0000 (14:46 +0800)]
btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode

For the delayed inode facility, qgroup metadata is reserved for it and
later freed.

However we're freeing more bytes than we reserved.
In btrfs_delayed_inode_reserve_metadata():

  num_bytes = btrfs_calc_metadata_size(fs_info, 1);
  ...
  ret = btrfs_qgroup_reserve_meta_prealloc(root,
                                           fs_info->nodesize, true);
  ...
  if (!ret) {
          node->bytes_reserved = num_bytes;

But in btrfs_delayed_inode_release_metadata():

  if (qgroup_free)
          btrfs_qgroup_free_meta_prealloc(node->root,
                                          node->bytes_reserved);
  else
          btrfs_qgroup_convert_reserved_meta(node->root,
                                             node->bytes_reserved);

This means we're always releasing more qgroup metadata rsv than we
reserved.

This won't trigger a selftest warning, as the btrfs qgroup metadata rsv
has extra protection against cases like quota being enabled halfway
through.

But we still need to fix this problem anyway.

This patch uses the same num_bytes for the qgroup metadata rsv so we
can handle it correctly.
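
A toy illustration of the symmetry this restores (not the kernel code):
remember exactly the amount handed to the reserve side and release that
same amount, never a recomputed one:

  #include <stdio.h>

  struct delayed_node { unsigned long long bytes_reserved; };

  /* stand-in for the qgroup free/convert helpers named above */
  static void qgroup_release(unsigned long long bytes)
  {
          printf("releasing %llu bytes of qgroup meta rsv\n", bytes);
  }

  static void reserve_metadata(struct delayed_node *node,
                               unsigned long long num_bytes)
  {
          /* reserve num_bytes and remember exactly that amount */
          node->bytes_reserved = num_bytes;
  }

  static void release_metadata(struct delayed_node *node)
  {
          /* release the remembered amount, never a recomputed one */
          qgroup_release(node->bytes_reserved);
          node->bytes_reserved = 0;
  }

  int main(void)
  {
          struct delayed_node node = { 0 };

          reserve_metadata(&node, 131072);
          release_metadata(&node);
          return 0;
  }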

Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: [email protected] # 4.19+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Mon, 10 Aug 2020 15:42:28 +0000 (11:42 -0400)]
btrfs: do not hold device_list_mutex when closing devices

The following lockdep splat was seen in testing:

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
------------------------------------------------------
fsstress/8739 is trying to acquire lock:
ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70

but task is already holding lock:
ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #10 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x129/0x210
       btrfs_page_mkwrite+0x6a/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

-> #9 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       perf_read+0x141/0x2c0
       vfs_read+0xad/0x1b0
       ksys_read+0x5f/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #8 (&cpuctx_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x88/0x150
       perf_event_init+0x1db/0x20b
       start_kernel+0x3ae/0x53c
       secondary_startup_64+0xa4/0xb0

-> #7 (pmus_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x4f/0x150
       cpuhp_invoke_callback+0xb1/0x900
       _cpu_up.constprop.26+0x9f/0x130
       cpu_up+0x7b/0xc0
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x71
       kernel_init_freeable+0x110/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #6 (cpu_hotplug_lock){++++}-{0:0}:
       cpus_read_lock+0x39/0xb0
       kmem_cache_create_usercopy+0x28/0x230
       kmem_cache_create+0x12/0x20
       bioset_init+0x15e/0x2b0
       init_bio+0xa3/0xaa
       do_one_initcall+0x5a/0x2e0
       kernel_init_freeable+0x1f4/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #5 (bio_slab_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       bioset_init+0xbc/0x2b0
       __blk_alloc_queue+0x6f/0x2d0
       blk_mq_init_queue_data+0x1b/0x70
       loop_add+0x110/0x290 [loop]
       fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
       do_one_initcall+0x5a/0x2e0
       do_init_module+0x5a/0x220
       load_module+0x2459/0x26e0
       __do_sys_finit_module+0xba/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #4 (loop_ctl_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       lo_open+0x18/0x50 [loop]
       __blkdev_get+0xec/0x570
       blkdev_get+0xe8/0x150
       do_dentry_open+0x167/0x410
       path_openat+0x7c9/0xa80
       do_filp_open+0x93/0x100
       do_sys_openat2+0x22a/0x2e0
       do_sys_open+0x4b/0x80
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       blkdev_put+0x1d/0x120
       close_fs_devices.part.31+0x84/0x130
       btrfs_close_devices+0x44/0xb0
       close_ctree+0x296/0x2b2
       generic_shutdown_super+0x69/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_run_dev_stats+0x49/0x480
       commit_cowonly_roots+0xb5/0x2a0
       btrfs_commit_transaction+0x516/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_commit_transaction+0x4bb/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       __mutex_lock+0x9f/0x930
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_pagefaults);
                               lock(&mm->mmap_lock#2);
                               lock(sb_pagefaults);
  lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by fsstress/8739:
 #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0

stack backtrace:
CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 ? btrfs_get_alloc_profile+0x150/0x210
 lock_acquire+0x9e/0x360
 ? btrfs_record_root_in_trans+0x43/0x70
 __mutex_lock+0x9f/0x930
 ? btrfs_record_root_in_trans+0x43/0x70
 ? lock_acquire+0x9e/0x360
 ? join_transaction+0x5d/0x450
 ? find_held_lock+0x2d/0x90
 ? btrfs_record_root_in_trans+0x43/0x70
 ? join_transaction+0x3d5/0x450
 ? btrfs_record_root_in_trans+0x43/0x70
 btrfs_record_root_in_trans+0x43/0x70
 start_transaction+0xd1/0x5d0
 btrfs_dirty_inode+0x42/0xd0
 file_update_time+0xc8/0x110
 btrfs_page_mkwrite+0x10c/0x4a0
 ? handle_mm_fault+0x5e/0x1730
 do_page_mkwrite+0x4d/0xc0
 ? __do_fault+0x32/0x150
 handle_mm_fault+0x103c/0x1730
 exc_page_fault+0x340/0x660
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7faa6c9969c4

The fix is similar to that of

  btrfs: open device without device_list_mutex

where we were holding the device_list_mutex and then grabbing the
bd_mutex, which pulls in a bunch of dependencies under the bd_mutex.  We
only ever call btrfs_close_devices() on mount failure or unmount, so
we're safe to not hold the device_list_mutex here.  We're already
holding the uuid_mutex which keeps us safe from any external
modification of the fs_devices.

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Thu, 20 Aug 2020 15:18:27 +0000 (11:18 -0400)]
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks

When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 12 Aug 2020 13:26:46 +0000 (16:26 +0300)]
btrfs: remove alloc_list splice in btrfs_prepare_sprout

btrfs_prepare_sprout is called when the first rw device is added to a
seed filesystem. This means the filesystem can't have a non-empty
alloc_list, since seed filesystems are read-only. Simply remove the
code altogether.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 12 Aug 2020 14:04:36 +0000 (17:04 +0300)]
btrfs: document some invariants of seed code

Without a good understanding of how seed devices work, it's hard to
grok some of what the code in open_seed_devices or btrfs_prepare_sprout
does.

Add comments, hopefully reducing some of the cognitive load.

Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Thu, 16 Jul 2020 07:25:33 +0000 (10:25 +0300)]
btrfs: switch seed device to list api

While this patch touches a bunch of files the conversion is
straightforward. Instead of using the implicit linked list anchored at
btrfs_fs_devices::seed, the code is switched to using
list_for_each_entry.

Previous patches in the series already factored out code that processed
both main and seed devices, so in those cases the factored out functions
are called on the main fs_devices and then on every seed dev inside
list_for_each_entry.

Using the list API also allows us to simplify deletion from the seed
dev list, performed in btrfs_rm_device and
btrfs_rm_dev_replace_free_srcdev, by substituting a while() loop with a
simple list_del_init.
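
As a rough userspace illustration of the conversion, here is a
hand-rolled mini list standing in for the kernel's list.h; all names
are made up:

  #include <stddef.h>
  #include <stdio.h>

  /* A few lines of kernel-style intrusive list, just enough for the demo. */
  struct list_head { struct list_head *prev, *next; };

  static void list_init(struct list_head *h) { h->prev = h->next = h; }

  static void list_add_tail(struct list_head *n, struct list_head *h)
  {
          n->prev = h->prev;
          n->next = h;
          h->prev->next = n;
          h->prev = n;
  }

  static void list_del_init(struct list_head *n)
  {
          n->prev->next = n->next;
          n->next->prev = n->prev;
          list_init(n);
  }

  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  struct fs_devices {
          int id;
          struct list_head seed_list; /* replaces the implicit ->seed chain */
  };

  int main(void)
  {
          struct list_head seeds;
          struct fs_devices a = { 1, { NULL, NULL } };
          struct fs_devices b = { 2, { NULL, NULL } };
          struct list_head *pos;

          list_init(&seeds);
          list_add_tail(&a.seed_list, &seeds);
          list_add_tail(&b.seed_list, &seeds);

          /* iteration instead of walking an implicit ->seed pointer chain */
          for (pos = seeds.next; pos != &seeds; pos = pos->next)
                  printf("seed device %d\n",
                         container_of(pos, struct fs_devices, seed_list)->id);

          /* deletion: no while() loop hunting for the entry, just unlink */
          list_del_init(&a.seed_list);
          return 0;
  }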

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 15 Jul 2020 10:48:49 +0000 (13:48 +0300)]
btrfs: simplify setting/clearing fs_info to btrfs_fs_devices

It makes no sense to have sysfs-related routines be responsible for
properly initialising the fs_info pointer of struct btrfs_fs_device.
Instead this can be streamlined by making it the responsibility of
btrfs_init_devices_late to initialize it. That function already
initializes fs_info of every individual device in btrfs_fs_devices.

As far as clearing it is concerned it makes sense to move it to
close_fs_devices. That function is only called when struct
btrfs_fs_devices is no longer in use - either for holding seeds or
main devices for a mounted filesystem.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 15 Jul 2020 10:48:48 +0000 (13:48 +0300)]
btrfs: make close_fs_devices return void

The return value of this function conveys absolutely no information.
All callers already check the state of fs_devices->opened to decide how
to proceed. So convert the function to returning void. While at it make
btrfs_close_devices also return void.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Thu, 16 Jul 2020 07:17:04 +0000 (10:17 +0300)]
btrfs: factor out loop logic from btrfs_free_extra_devids

This prepares the code for switching seed devices to a proper list.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 15 Jul 2020 10:48:46 +0000 (13:48 +0300)]
btrfs: factor out reada loop in __reada_start_machine

This is in preparation for moving fs_devices to proper lists.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Mon, 3 Aug 2020 09:58:46 +0000 (12:58 +0300)]
btrfs: remove err variable from btrfs_get_extent

There's no practical reason to use 'err' as a variable to convey
errors. In fact its value is either set explicitly at the beginning of
the function or it simply takes the value of 'ret'. Not conforming to
the usual pattern of having 'ret' be the only variable used to convey
errors makes the code more prone to bugs. In fact one such bug was
introduced by 6bf9e4bd6a27 ("btrfs: inode: Verify inode mode to
avoid NULL pointer dereference") by assigning the error value to 'ret'
and not 'err'.

Let's fix that issue and make the function less tricky by leaving only
'ret' to convey error values.

Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Thu, 3 Sep 2020 15:16:51 +0000 (11:16 -0400)]
btrfs: dio iomap DSYNC workaround

iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
This is problematic for us because of 2 reasons:

1. we hold the inode_lock() during this operation, and we take it in
   generic_write_sync()
2. we hold a read lock on the dio_sem but take the write lock in fsync

Since we don't want to rip out this code right now, but reworking the
locking is a bit much to do at this point, work around this problem with
this masterpiece of a patch.

First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
that it needs to handle the sync.  We save this fact in
current->journal_info, because we need to do special things once we're
in iomap_begin, and we have no way to pass private information into
iomap_dio_rw().

Next we specify a separate iomap_dio_ops for sync, which implements an
->end_io() callback that gets called when the dio completes.  This is
important for AIO, because we really do need to run generic_write_sync()
if we complete asynchronously.  However if we're still in the submitting
context when we enter ->end_io() we clear the flag so that the submitter
knows it's the one that needs to run generic_write_sync().

This is meant to be temporary.  We need to work out how to eliminate the
inode_lock() and the dio_sem in our fsync and use another mechanism to
protect these operations.
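
A toy, single-threaded model of that handoff follows; the field and
function names are invented and this is not the btrfs code:

  #include <stdbool.h>
  #include <stdio.h>

  struct dio_sync_state {
          bool in_submit_context;    /* models the current->journal_info marker */
          bool submitter_must_sync;
  };

  static void do_write_sync(const char *who)
  {
          printf("%s runs generic_write_sync()\n", who);
  }

  /* models ->end_io(): runs at dio completion, maybe still in submit context */
  static void end_io(struct dio_sync_state *s)
  {
          if (s->in_submit_context)
                  s->submitter_must_sync = true; /* hand the sync back */
          else
                  do_write_sync("completion (AIO)");
  }

  static void submit_dsync_write(struct dio_sync_state *s, bool completes_inline)
  {
          s->in_submit_context = true;   /* stash marker before submitting */
          if (completes_inline)
                  end_io(s);             /* dio finished synchronously */
          s->in_submit_context = false;

          if (completes_inline && s->submitter_must_sync)
                  do_write_sync("submitter");
          else if (!completes_inline)
                  end_io(s);             /* completion happens later, async */
  }

  int main(void)
  {
          struct dio_sync_state s = { false, false };

          submit_dsync_write(&s, true);   /* inline completion path */
          s = (struct dio_sync_state){ false, false };
          submit_dsync_write(&s, false);  /* async (AIO) completion path */
          return 0;
  }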

Tested-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Goldwyn Rodrigues [Mon, 17 Aug 2020 16:18:21 +0000 (11:18 -0500)]
btrfs: switch to iomap for direct IO

We're using a direct I/O implementation based on buffer heads. This
patch switches to the new iomap infrastructure.

Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
iomap_begin() for iomap direct I/O functions. This function allocates
and locks all the blocks required for the I/O.  btrfs_submit_direct() is
used as the submit_io() hook for direct I/O ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore: set it to noop.

Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:

  - for writes, call __endio_write_update_ordered()
  - for reads, unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data for
which to call __endio_write_update_ordered in case of an error.

This patch removes last use of struct buffer_head from btrfs.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Qu Wenruo [Thu, 20 Aug 2020 07:42:46 +0000 (15:42 +0800)]
btrfs: add owner and fs_info to alloc_state io_tree

Commit 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io
tree") introduced btrfs_device::alloc_state extent io tree, but it
doesn't initialize the fs_info and owner member.

This means the following features are not properly supported:

- Fs owner report for insert_state() error
  Without fs_info initialized, although btrfs_err() won't panic, it
  won't output which fs is causing the error.

- Wrong owner for trace events
  alloc_state will get the owner as pinned extents.

Fix this by assigning the proper fs_info and owner for
btrfs_device::alloc_state.

Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree")
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Marcos Paulo de Souza [Mon, 17 Aug 2020 13:56:10 +0000 (10:56 -0300)]
btrfs: make read_block_group_item return void

Since its inclusion in 9afc66498a0b ("btrfs: block-group: refactor how
we read one block group item") this function has always returned 0, so
there is no need to check the returned value.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Leon Romanovsky [Wed, 19 Aug 2020 14:16:28 +0000 (17:16 +0300)]
btrfs: sysfs: fix unused-but-set-variable warnings

The compilation with W=1 generates the following warnings:
 fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
  1630 |  int ret;
       |      ^~~
 fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
  1629 |  u64 features;
       |      ^~~~~~~~

[ The unused variables are leftover from e410e34fad91 ("Revert "btrfs:
  synchronize incompat feature bits with sysfs files""), which needs
  to be properly fixed by moving feature bit manipulation from the sysfs
  context.  Silence the warning to save people time, as we got several
  reports. ]

Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Filipe Manana [Tue, 11 Aug 2020 11:43:58 +0000 (12:43 +0100)]
btrfs: make fast fsyncs wait only for writeback

Currently, regardless of a full or a fast fsync, we always wait for
ordered extents to complete and then start logging the inode. However
for fast fsyncs we can just wait for the writeback to complete; we don't
need to wait for the ordered extents to complete, since we use the list
of modified extent maps to figure out which extents we must log, and we
can get their checksums directly from the ordered extents that are still
in flight, otherwise look them up from the checksums tree.

Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs we used to start logging without even
waiting for the writeback to complete first; we would wait for it to
complete after logging, while holding a transaction open, which led to
performance issues when using cgroups and probably for other cases too,
as waiting for IO while holding a transaction handle should be avoided
as much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs, and we even got at least one report about a
performance drop which bisected to that particular change:

https://lore.kernel.org/linux-btrfs/20181109215148[email protected]/

This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.

When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.

This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES feature or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, where we
end up logging the full extent but only checksums for the written range,
which results in missing checksums after log replay. Allowing ranged
fsyncs to operate again only on the original range, when using the
NO_HOLES feature and doing a fast fsync, is doable but requires some
non-trivial changes to the writeback path, which can always be worked on
later if needed, but I don't think they are a very common use case.

Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).

The following script that calls fio was used:

  $ cat test-fsync.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 5 ]; then
    echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
    exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3
  BLOCK_SIZE=$4
  WRITE_MODE=$5

  if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
    echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
    exit 1
  fi

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=$WRITE_MODE
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=$BLOCK_SIZE
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT
  fio /tmp/fio-job.ini
  umount $MNT

The results were the following:

*************************
*** sequential writes ***
*************************

==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec

After patch:

WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8% throughput, -8.8% runtime)

==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec

After patch:

WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)

==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec

After patch:

WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec

After patch:

WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)

==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec

After patch:

WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)

==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec

After patch:

WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)

==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec

After patch:

WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)

************************
***   random writes  ***
************************

==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec

After patch:

WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)

==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec

After patch:

WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)

==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec

After patch:

WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec

After patch:

WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)

==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec

After patch:

WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)

==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec

After patch:

WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)

==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec

After patch:

WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Filipe Manana [Tue, 11 Aug 2020 11:43:48 +0000 (12:43 +0100)]
btrfs: do not commit logs and transactions during link and rename operations

Since commit d4682ba03ef618 ("Btrfs: sync log after logging new name") we
started to commit logs, and fall back to transaction commits when we
failed to log the new names or commit the logs, after link and rename
operations when the target inodes (or their parents) were previously
logged in the current transaction. This was to avoid losing directories
despite an explicit fsync on them when they are ancestors of some inode
that got a new name logged, due to a link or rename operation. However
that adds the cost of starting IO and waiting for it to complete, which
can cause higher latencies for applications.

Instead of doing that, just make sure that when we log a new name for an
inode we don't mark any of its ancestors as logged, so that if anyone
does an fsync against any of them, without doing any other change on
them, the fsync commits the log. This way we only pay the cost of a log
commit (or a transaction commit if something goes wrong or a new block
group was created) if the application explicitly asks to fsync any of
the parent directories.

Using dbench, which mixes several filesystem operations including
renames, revealed some significant latency gains. The following script
that uses dbench was used to test this:

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-m single -d single"
  THREADS=16

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -t 300 -D $MNT $THREADS

  umount $MNT

The test was run on bare metal, no virtualization, on a box with 12 cores
(Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
configuration that is the default of typical distributions (debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc).

Results before this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    10750455     0.011   155.088
 Close         7896674     0.001     0.243
 Rename         455222     2.158  1101.947
 Unlink        2171189     0.067   121.638
 Deltree           256     2.425     7.816
 Mkdir             128     0.002     0.003
 Qpathinfo     9744323     0.006    21.370
 Qfileinfo     1707092     0.001     0.146
 Qfsinfo       1786756     0.001    11.228
 Sfileinfo      875612     0.003    21.263
 Find          3767281     0.025     9.617
 WriteX        5356924     0.011   211.390
 ReadX        16852694     0.003     9.442
 LockX           35008     0.002     0.119
 UnlockX         35008     0.001     0.138
 Flush          753458     4.252  1102.249

Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms

Results after this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    11471098     0.012   448.281
 Close         8426396     0.001     0.925
 Rename         485746     0.123   267.183
 Unlink        2316477     0.080    63.433
 Deltree           288     2.830    11.144
 Mkdir             144     0.003     0.010
 Qpathinfo    10397420     0.006    10.288
 Qfileinfo     1822039     0.001     0.169
 Qfsinfo       1906497     0.002    14.039
 Sfileinfo      934433     0.004     2.438
 Find          4019879     0.026    10.200
 WriteX        5718932     0.011   200.985
 ReadX        17981671     0.003    10.036
 LockX           37352     0.002     0.076
 UnlockX         37352     0.001     0.109
 Flush          804018     5.015   778.033

Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
(+6.5% throughput, -29.4% max latency, -75.8% rename latency)

Test case generic/498 from fstests tests the scenario that the previously
mentioned commit fixed.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Filipe Manana [Tue, 11 Aug 2020 11:43:37 +0000 (12:43 +0100)]
btrfs: do not take the log_mutex of the subvolume when pinning the log

During a rename we pin the log to make sure no one commits a log that
reflects an ongoing rename operation, as it might result in a committed
log where it recorded the unlink of the old name without having recorded
the new name. However we are taking the subvolume's log_mutex before
incrementing the log_writers counter, which is not necessary since that
counter is atomic. We only remove the old name from the log and add the
new name after we have incremented log_writers, ensuring that no one can
commit the log after we have removed the old name and before we have
added the new name.

By taking the log_mutex lock we are just adding unnecessary contention on
the lock, which can become visible for workloads that mix renames with
fsyncs, writes for files opened with O_SYNC and unlink operations (if the
inode or its parent were fsynced before in the current transaction).

So just remove the lock and unlock of the subvolume's log_mutex at
btrfs_pin_log_trans().
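
A minimal sketch of that point using C11 atomics (illustrative only,
not the kernel code): pinning needs just the atomic counter bump, not
the mutex:

  #include <stdatomic.h>
  #include <stdio.h>

  struct root { atomic_int log_writers; };

  static void pin_log_trans(struct root *root)
  {
          /* was: mutex_lock(&root->log_mutex); counter++; mutex_unlock(...); */
          atomic_fetch_add(&root->log_writers, 1);
  }

  int main(void)
  {
          struct root root = { 0 };

          pin_log_trans(&root);
          printf("log_writers = %d\n", atomic_load(&root.log_writers));
          return 0;
  }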

Using dbench, which mixes different types of operations that end up taking
that mutex (fsyncs, renames, unlinks and writes into files opened with
O_SYNC) revealed some small gains. The following script that calls dbench
was used:

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-m single -d single"
  THREADS=32

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -s -t 600 -D $MNT $THREADS

  umount $MNT

The test was run on bare metal, no virtualization, on a box with 12 cores
(Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
configuration that is the default of typical distributions (debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc).

Results before this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4410848     0.017   738.640
 Close        3240222     0.001     0.834
 Rename        186850     7.478  1272.476
 Unlink        890875     0.128   785.018
 Deltree          128     2.846    12.081
 Mkdir             64     0.002     0.003
 Qpathinfo    3997659     0.009    11.171
 Qfileinfo     701307     0.001     0.478
 Qfsinfo       733494     0.002     1.103
 Sfileinfo     359362     0.004     3.266
 Find         1546226     0.041     4.128
 WriteX       2202803     7.905  1376.989
 ReadX        6917775     0.003     3.887
 LockX          14392     0.002     0.043
 UnlockX        14392     0.001     0.085
 Flush         309225     0.128  1033.936

Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms

Results after this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4603244     0.017   232.776
 Close        3381299     0.001     1.041
 Rename        194871     7.251  1073.165
 Unlink        929730     0.133   119.233
 Deltree          128     2.871    10.199
 Mkdir             64     0.002     0.004
 Qpathinfo    4171343     0.009    11.317
 Qfileinfo     731227     0.001     1.635
 Qfsinfo       765079     0.002     3.568
 Sfileinfo     374881     0.004     1.220
 Find         1612964     0.041     4.675
 WriteX       2296720     7.569  1178.204
 ReadX        7213633     0.003     3.075
 LockX          14976     0.002     0.076
 UnlockX        14976     0.001     0.061
 Flush         322635     0.102   579.505

Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
(+4.3% throughput, -14.4% max latency)

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
David Sterba [Mon, 17 Aug 2020 10:16:57 +0000 (12:16 +0200)]
btrfs: send: remove indirect callback parameter for changed_cb

There's a custom callback passed to btrfs_compare_trees which happens to
be named exactly the same as the existing function implementing it. This
is confusing and the indirection is not necessary for our needs. The
compiler is clever enough to call it directly so there's effectively no
change.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
David Sterba [Mon, 17 Aug 2020 10:12:38 +0000 (12:12 +0200)]
btrfs: scrub: rename ratelimit state variable to avoid shadowing

There's already a _rs defined within ctree.h:btrfs_printk_ratelimited;
local variables should not use the _ prefix, to avoid such name clashes
with macro-local variables.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
David Sterba [Mon, 17 Aug 2020 10:08:37 +0000 (12:08 +0200)]
btrfs: remove unnecessarily shadowed variables

In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
the same as the one we already have.

In btrfs_backref_finish_upper_links, rb_node is the same type and is
used as a temporary cursor into the tree.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
David Sterba [Mon, 17 Aug 2020 08:58:38 +0000 (10:58 +0200)]
btrfs: compression: move declarations to header

The declarations of the compression algorithm callbacks are defined in
the .c file as they're used from there. The compiler warns that there
are no declarations for the public functions when compiling
lzo.c/zlib.c/zstd.c. Fix that by moving the declarations to the header
as it's the common place for all of them.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
David Sterba [Mon, 17 Aug 2020 08:56:00 +0000 (10:56 +0200)]
btrfs: remove const from btrfs_feature_set_name

The function btrfs_feature_set_name returns a const char pointer; the
second const is not necessary and is reported as a warning:

 In file included from fs/btrfs/space-info.c:6:
 fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
    16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
       | ^~~~~

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Qu Wenruo [Thu, 13 Aug 2020 06:33:52 +0000 (14:33 +0800)]
btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need()

We're just rounding up to the sectorsize to calculate the lockend.
There is no need for the extra length calculation; a direct round_up()
is enough.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 25 Aug 2020 20:56:59 +0000 (16:56 -0400)]
btrfs: fix possible infinite loop in data async reclaim

Dave reported an issue where generic/102 would sometimes hang.  This
turned out to be because we'd get into this spot where we were no longer
making progress on data reservations because our exit condition was not
met.  The loop is basically

  while (!space_info->full && !list_empty(&space_info->tickets))
          flush_space(space_info, flush_state);

where flush_state is one of our various flush states, which don't
include ALLOC_CHUNK_FORCE.  This is because we actually lead with
allocating chunks, and so the assumption was that once you got to the
actual
flushing states you could no longer allocate chunks.  This was a stupid
assumption, because you could have deleted block groups that would be
reclaimed by a transaction commit, thus unsetting space_info->full.
This is essentially what happens with generic/102, and so sometimes
you'd get stuck in the flushing loop because we weren't allocating
chunks, but flushing space wasn't giving us what we needed to make
progress.

Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states;
that way we will eventually bail out, because we will end up with
space_info->full if we freed a chunk previously.  Otherwise, as is the
case for this test, we'll allocate our chunk and continue on our happy
merry way.
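
A toy model of that loop with made-up states (not the kernel code),
showing how appending ALLOC_CHUNK_FORCE lets a pass through the states
set ->full again and terminate:

  #include <stdbool.h>
  #include <stdio.h>

  enum flush_state { FLUSH_DELALLOC, COMMIT_TRANS, ALLOC_CHUNK_FORCE, NR_STATES };

  struct space_info { bool full; int tickets; };

  static void flush_space(struct space_info *si, int state)
  {
          if (state == ALLOC_CHUNK_FORCE)
                  si->full = true;   /* chunk allocated, space info now full */
          else if (state == COMMIT_TRANS)
                  si->full = false;  /* deleted block groups reclaimed: not full */
  }

  int main(void)
  {
          struct space_info si = { false, 1 };
          int state = 0;

          while (!si.full && si.tickets > 0) {
                  flush_space(&si, state);
                  if (++state == NR_STATES)
                          state = 0; /* without ALLOC_CHUNK_FORCE this spins */
          }
          printf("bailed out, full=%d\n", si.full);
          return 0;
  }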

Reported-by: David Sterba <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:34 +0000 (10:22 -0400)]
btrfs: add a comment explaining the data flush steps

The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.

Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:33 +0000 (10:22 -0400)]
btrfs: do async reclaim for data reservations

Now that we have the data ticketing stuff in place, move normal data
reservations to use an async reclaim helper to satisfy tickets.  Before
we could have multiple tasks race in and both allocate chunks, resulting
in more data chunks than we would necessarily need.  Serializing these
allocations and making a single thread responsible for flushing will
only allocate chunks as needed, as well as cut down on transaction
commits and other flush related activities.

Priority reservations will still work as they have before, simply
trying to allocate a chunk until they can make their reservation.
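
An illustrative model of that idea (not the kernel code): reservations
that can't be met immediately become tickets, and a single reclaim
pass, standing in for the async worker, satisfies them in order:

  #include <stdio.h>

  #define NR_TICKETS 3

  struct ticket { unsigned long long bytes; int granted; };

  static unsigned long long free_space; /* bytes currently available */

  /* stand-in for one round of flushing (chunk allocation, delalloc, ...) */
  static void flush_one_round(void)
  {
          free_space += 64 * 1024;
  }

  static void reclaim_worker(struct ticket *tickets, int n)
  {
          for (int i = 0; i < n; i++) {
                  while (free_space < tickets[i].bytes)
                          flush_one_round();
                  free_space -= tickets[i].bytes;
                  tickets[i].granted = 1; /* wake the waiter, FIFO order */
          }
  }

  int main(void)
  {
          struct ticket tickets[NR_TICKETS] = {
                  { 128 * 1024, 0 }, { 32 * 1024, 0 }, { 256 * 1024, 0 },
          };

          reclaim_worker(tickets, NR_TICKETS);
          for (int i = 0; i < NR_TICKETS; i++)
                  printf("ticket %d granted=%d\n", i, tickets[i].granted);
          return 0;
  }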

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:32 +0000 (10:22 -0400)]
btrfs: flush delayed refs when trying to reserve data space

We can end up with freed extents in the delayed refs, and thus
may_commit_transaction() may not think we have enough pinned space to
commit the transaction and we'll ENOSPC early.  Handle this by running
the delayed refs in order to make sure pinned is uptodate before we try
to commit the transaction.

Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:31 +0000 (10:22 -0400)]
btrfs: run delayed iputs before committing the transaction for data

Before we were waiting on iputs after we committed the transaction, but
this doesn't really make much sense.  We want to reclaim any space we
may have in order to be more likely to commit the transaction, due to
pinned space being added by running the delayed iputs.  Fix this by
making delayed iputs run before committing the transaction.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:30 +0000 (10:22 -0400)]
btrfs: don't force commit if we are data

We used to unconditionally commit the transaction at least 2 times and
then on the 3rd try check against pinned space to make sure committing
the transaction was worth the effort.  This is overkill, we know nobody
is going to steal our reservation, and if we can't make our reservation
with the pinned amount simply bail out.

This also cleans up the passing of bytes_needed to
may_commit_transaction, as that was the thing we put into place in
order to accomplish this behavior.  We no longer need it, so remove
that mess.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:29 +0000 (10:22 -0400)]
btrfs: drop the commit_cycles stuff for data reservations

This was an old wart left over from how we previously did data
reservations.  Before we could have people race in and take a
reservation while we were flushing space, so we needed to make sure we
looped a few times before giving up.  Now that we're using the ticketing
infrastructure we don't have to worry about this and can drop the logic
altogether.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:28 +0000 (10:22 -0400)]
btrfs: use the same helper for data and metadata reservations

Now that data reservations follow the same pattern as metadata
reservations we can simply rename __reserve_metadata_bytes to
__reserve_bytes and use that helper for data reservations.

Things to keep in mind: btrfs_can_overcommit() returns 0 for data
because we can never overcommit.  We also will never pass in FLUSH_ALL
for data, so we'll simply be added to the priority list and go straight
into handle_reserve_ticket.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:27 +0000 (10:22 -0400)]
btrfs: serialize data reservations if we are flushing

Nikolay reported a problem where generic/371 would fail sometimes with a
slow drive.  The gist of the test is that we fallocate a file in
parallel with a pwrite of a different file.  These two files combined
are smaller than the file system, but sometimes the pwrite would ENOSPC.

A fair bit of investigation uncovered the fact that the fallocate
workload was racing in and grabbing the free space that the pwrite
workload was trying to free up so it could make its own reservation.
After a few loops of this eventually the pwrite workload would error out
with an ENOSPC.

We've had the same problem with metadata as well, and we serialized all
metadata allocations to satisfy this problem.  This wasn't usually a
problem with data because data reservations are more straightforward,
but obviously could still happen.

Fix this by not allowing reservations to occur if there are any pending
tickets waiting to be satisfied on the space info.
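
A sketch of that rule with illustrative names (not the kernel code):
even if enough free space just appeared, a new reservation has to queue
behind the tickets already waiting instead of stealing the space:

  #include <stdbool.h>
  #include <stdio.h>

  struct space_info {
          unsigned long long free;
          int pending_tickets; /* tickets waiting to be satisfied */
  };

  static bool can_reserve_now(const struct space_info *si,
                              unsigned long long bytes)
  {
          if (si->pending_tickets > 0)
                  return false; /* waiters first: queue a ticket instead */
          return si->free >= bytes;
  }

  int main(void)
  {
          struct space_info si = { 1 << 20, 1 };

          /* plenty of free space, but a pending ticket blocks the fast path */
          printf("can reserve now: %d\n", can_reserve_now(&si, 4096));
          return 0;
  }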

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:26 +0000 (10:22 -0400)]
btrfs: use ticketing for data space reservations

Now that we have all the infrastructure in place, use the ticketing
infrastructure to make data allocations.  This still maintains the exact
same flushing behavior, but now we're using tickets to get our
reservations satisfied.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:25 +0000 (10:22 -0400)]
btrfs: add btrfs_reserve_data_bytes and use it

Create a new function btrfs_reserve_data_bytes() in order to handle data
reservations.  This uses the new flush types and flush states to handle
making data reservations.

This patch specifically does not change any functionality, and is
purposefully not cleaned up in order to make bisection easier for the
future patches.  The new helper is identical to the old helper in how it
handles data reservations.  We first try to force a chunk allocation,
and then we run through the flush states all at once and in the same
order that they were done with the old helper.

Subsequent patches will clean this up and change the behavior of the
flushing, and it is important to keep those changes separate so we can
easily bisect down to the patch that caused the regression, rather than
the patch that made us start using the new infrastructure.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Josef Bacik [Tue, 21 Jul 2020 14:22:24 +0000 (10:22 -0400)]
btrfs: add the data transaction commit logic into may_commit_transaction

Data space flushing currently unconditionally commits the transaction
twice in a row, and the last time it checks if there's enough pinned
extents to satisfy its reservation before deciding to commit the
transaction for the 3rd and final time.

Encode this logic into may_commit_transaction().  In the next patch we
will pass in U64_MAX for bytes_needed the first two times, and the final
time we will pass in the actual bytes we need so the normal logic will
apply.
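
The described decision is roughly (a model, not the kernel function):

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * bytes_needed == UINT64_MAX means "commit unconditionally";
   * otherwise only commit when reclaiming pinned extents could
   * actually satisfy the reservation.
   */
  static bool may_commit_model(uint64_t pinned_bytes, uint64_t bytes_needed)
  {
          if (bytes_needed == UINT64_MAX)
                  return true;
          return pinned_bytes >= bytes_needed;
  }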

This patch exists solely to keep the logical changes I will make to the
flushing state machine separate, making it easier to bisect any
performance-related regressions.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: add flushing states for handling data reservations
Josef Bacik [Tue, 21 Jul 2020 14:22:23 +0000 (10:22 -0400)]
btrfs: add flushing states for handling data reservations

Currently the way we do data reservations is by seeing if we have enough
space in our space_info.  If we do not and we're a normal inode we'll

1) Attempt to force a chunk allocation until we can't anymore.
2) If that fails we'll flush delalloc, then commit the transaction, then
   run the delayed iputs.

If we are a free space inode we're only allowed to force a chunk
allocation.  In order to use the normal flushing mechanism we need to
encode this into a flush state array for normal inodes.  Since both will
start by allocating chunks until the space info is full, there is no
need to add this as a flush state; that case will be handled specially.
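
A sketch of what such a state array and its driver loop could look like
(enum and function names are illustrative):

  #include <stdbool.h>
  #include <stddef.h>

  enum data_flush_state_model {
          FLUSH_DELALLOC_M,
          COMMIT_TRANSACTION_M,
          RUN_DELAYED_IPUTS_M,
  };

  static const enum data_flush_state_model data_flush_states_model[] = {
          FLUSH_DELALLOC_M,
          COMMIT_TRANSACTION_M,
          RUN_DELAYED_IPUTS_M,
  };

  /* Chunk allocation runs first, outside the array, since both normal
   * and free space inodes start there; only normal inodes then walk
   * the states in order until the reservation is satisfied. */
  static void flush_data_space_model(bool (*satisfied)(void),
                                     void (*run_state)(int))
  {
          size_t i;

          for (i = 0; i < sizeof(data_flush_states_model) /
                          sizeof(data_flush_states_model[0]); i++) {
                  if (satisfied())
                          return;
                  run_state(data_flush_states_model[i]);
          }
  }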

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: check tickets after waiting on ordered extents
Josef Bacik [Tue, 21 Jul 2020 14:22:22 +0000 (10:22 -0400)]
btrfs: check tickets after waiting on ordered extents

Right now if the space is freed up after the ordered extents complete
(which is likely since the reservations are held until they complete),
we would do extra delalloc flushing before we'd notice that we didn't
have any more tickets.  Fix this by moving the tickets check after our
wait_ordered_extents check.
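
In outline, the reordered loop becomes (a model with stubbed helpers):

  #include <stdbool.h>

  bool to_reclaim_remaining(void);
  void start_delalloc_flush(void);
  void wait_ordered_extents(void);
  bool space_info_has_tickets(void);

  static void shrink_delalloc_model(void)
  {
          while (to_reclaim_remaining()) {
                  start_delalloc_flush();
                  wait_ordered_extents();
                  /* the wait itself may have freed enough space, so
                   * re-check before flushing yet more delalloc */
                  if (!space_info_has_tickets())
                          break;
          }
  }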

Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: use btrfs_start_delalloc_roots in shrink_delalloc
Josef Bacik [Tue, 21 Jul 2020 14:22:21 +0000 (10:22 -0400)]
btrfs: use btrfs_start_delalloc_roots in shrink_delalloc

The original iteration of flushing had us flushing delalloc and then
checking to see if we could make our reservation, thus we were very
careful about how many pages we would flush at once.

But now that everything is async and we satisfy tickets as the space
becomes available, we don't have to keep track of any of this; simply
try to flush the number of dirty inodes we may have in order to
reclaim space for our reservation.  This cleans up our delalloc
flushing significantly.

The async_pages logic is dropped because btrfs_start_delalloc_roots()
handles the case where async extents are generated for us, so we no
longer require this extra logic.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc
Josef Bacik [Tue, 21 Jul 2020 14:22:20 +0000 (10:22 -0400)]
btrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc

We are going to use the ticket infrastructure for data, so use the
btrfs_space_info_free_bytes_may_use() helper in
btrfs_free_reserved_data_space_noquota() so we get the
btrfs_try_granting_tickets call when we free our reservation.

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: call btrfs_try_granting_tickets when reserving space
Josef Bacik [Tue, 21 Jul 2020 14:22:19 +0000 (10:22 -0400)]
btrfs: call btrfs_try_granting_tickets when reserving space

If we have compression on we could free up more space than we reserved,
and thus be able to make a space reservation.  Add the call for this
scenario.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: call btrfs_try_granting_tickets when unpinning anything
Josef Bacik [Tue, 21 Jul 2020 14:22:18 +0000 (10:22 -0400)]
btrfs: call btrfs_try_granting_tickets when unpinning anything

When unpinning we were only calling btrfs_try_granting_tickets() if
global_rsv->space_info == space_info, which is problematic because we
use ticketing for SYSTEM chunks, and want to use it for DATA as well.
Fix this by moving this call outside of that if statement.
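
A kernel-style sketch of the call site after the change (not the
verbatim diff):

  spin_lock(&space_info->lock);
  /* was gated on: global_rsv->space_info == space_info */
  btrfs_try_granting_tickets(fs_info, space_info);
  spin_unlock(&space_info->lock);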

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: call btrfs_try_granting_tickets when freeing reserved bytes
Josef Bacik [Tue, 21 Jul 2020 14:22:17 +0000 (10:22 -0400)]
btrfs: call btrfs_try_granting_tickets when freeing reserved bytes

We were missing a call to btrfs_try_granting_tickets in
btrfs_free_reserved_bytes, so add it to handle the case where we're able
to satisfy an allocation because we've freed a pending reservation.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: make ALLOC_CHUNK use the space info flags
Josef Bacik [Tue, 21 Jul 2020 14:22:16 +0000 (10:22 -0400)]
btrfs: make ALLOC_CHUNK use the space info flags

We have traditionally used flush_space() to flush metadata space, so
we've been unconditionally using btrfs_metadata_alloc_profile() for our
profile to allocate a chunk. However if we're going to use this for
data we need to use btrfs_get_alloc_profile() on the space_info we pass
in.
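
So the ALLOC_CHUNK step ends up selecting its profile roughly like this
(a sketch from memory of that era's helpers, not the verbatim diff):

  /* was: btrfs_metadata_alloc_profile(fs_info) */
  u64 flags = btrfs_get_alloc_profile(fs_info, space_info->flags);

  ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_NO_FORCE);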

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: make shrink_delalloc take space_info as an arg
Josef Bacik [Tue, 21 Jul 2020 14:22:15 +0000 (10:22 -0400)]
btrfs: make shrink_delalloc take space_info as an arg

Currently shrink_delalloc just looks up the metadata space info, but
this won't work if we're trying to reclaim space for data chunks.  The
space_info we want is already passed into flush_space, so simply pass
that along to shrink_delalloc.

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: handle U64_MAX for shrink_delalloc
Josef Bacik [Tue, 21 Jul 2020 14:22:14 +0000 (10:22 -0400)]
btrfs: handle U64_MAX for shrink_delalloc

Data allocations are going to want to pass in U64_MAX for flushing
space, adjust shrink_delalloc to handle this properly.
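
One way to model the special case (the real scaling lives in
shrink_delalloc; the constant below is only illustrative):

  #include <stdint.h>

  static uint64_t items_to_flush_model(uint64_t to_reclaim)
  {
          if (to_reclaim == UINT64_MAX)
                  return UINT64_MAX;      /* flush every dirty inode */
          /* otherwise scale, assuming ~256K reclaimed per item */
          return (to_reclaim >> 18) + 1;
  }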

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: remove orig from shrink_delalloc
Josef Bacik [Tue, 21 Jul 2020 14:22:13 +0000 (10:22 -0400)]
btrfs: remove orig from shrink_delalloc

We don't use this anywhere inside of shrink_delalloc since 17024ad0a0fd
("Btrfs: fix early ENOSPC due to delalloc"), remove it.

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: change nr to u64 in btrfs_start_delalloc_roots
Josef Bacik [Tue, 21 Jul 2020 14:22:12 +0000 (10:22 -0400)]
btrfs: change nr to u64 in btrfs_start_delalloc_roots

We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() which takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and its callers appropriately.

This means we've adjusted start_delalloc_inodes() to take a pointer for
nr, since we want to preserve its ability to return an error, so simply
make it adjust nr as necessary.

Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.

Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid
Nikolay Borisov [Wed, 12 Aug 2020 13:18:51 +0000 (16:18 +0300)]
btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid

It can be accessed from 'fs_devices' as it's identical to
fs_info->fs_devices. Also add a comment about why we are calling the
function. No semantic changes.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: remove spurious BUG_ON in btrfs_get_extent
Nikolay Borisov [Mon, 3 Aug 2020 09:43:18 +0000 (12:43 +0300)]
btrfs: remove spurious BUG_ON in btrfs_get_extent

That BUG_ON cannot ever trigger because, as the comment there states,
'err' is always set. Simply remove it as it brings no value.

Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: delete duplicated words + other fixes in comments
Randy Dunlap [Wed, 5 Aug 2020 02:48:34 +0000 (19:48 -0700)]
btrfs: delete duplicated words + other fixes in comments

Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: tracepoints: output proper root owner for trace_find_free_extent()
Qu Wenruo [Tue, 28 Jul 2020 01:42:49 +0000 (09:42 +0800)]
btrfs: tracepoints: output proper root owner for trace_find_free_extent()

The current trace event always outputs results like this:

 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)

It's saying we're allocating a data extent for the EXTENT tree, which is
not even possible.

This is because we always use the EXTENT tree as the owner for
trace_find_free_extent() without using the @root from
btrfs_reserve_extent().

This patch changes the parameter to use the proper @root for
trace_find_free_extent().

Now it looks much better:

 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)

Reported-by: Hans van Kranenburg <[email protected]>
CC: [email protected] # 5.4+
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agogpio: pca953x: Survive spurious interrupts
Marc Zyngier [Mon, 5 Oct 2020 14:02:17 +0000 (15:02 +0100)]
gpio: pca953x: Survive spurious interrupts

The pca953x driver never checks the result of irq_find_mapping(),
which returns 0 when no mapping is found. When a spurious interrupt
is delivered (which can happen under obscure circumstances), the
kernel explodes as it still tries to handle the error code as
a real interrupt.

Handle this particular case and warn on spurious interrupts.
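
The shape of the fix is to validate the mapping before dispatching
(kernel-style sketch; 'dev' and 'level' stand in for the handler's
locals):

  unsigned int irq = irq_find_mapping(gc->irq.domain, level);

  if (!irq) {
          /* 0 means no mapping: spurious interrupt, don't dispatch */
          dev_warn_ratelimited(dev, "spurious interrupt\n");
          continue;
  }
  handle_nested_irq(irq);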

Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
4 years agogpiolib: Disable compat ->read() code in UML case
Andy Shevchenko [Mon, 5 Oct 2020 13:10:44 +0000 (16:10 +0300)]
gpiolib: Disable compat ->read() code in UML case

It appears that UML (arch/um) has no compat.h header defined and hence
can't compile a recently added piece of code in the GPIO library.

Disable compat ->read() code in UML case to avoid compilation errors.

While at it, use a pattern which is already being used elsewhere in the
kernel.

Fixes: 5ad284ab3a01 ("gpiolib: Fix line event handling in syscall compatible mode")
Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andy Shevchenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
4 years agox86/mce: Decode a kernel instruction to determine if it is copying from user
Tony Luck [Tue, 6 Oct 2020 21:09:10 +0000 (14:09 -0700)]
x86/mce: Decode a kernel instruction to determine if it is copying from user

All instructions copying data between kernel and user memory
are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY
entries in the exception table. ex_fault_handler_type() returns
EX_HANDLER_UACCESS for both of these.

Recovery is only possible when the machine check was triggered
on a read from user memory. In this case the same strategy for
recovery applies as if the user had made the access in ring3. If
the fault was in kernel memory while copying to user there is no
current recovery plan.

For MOV and MOVZ instructions a full decode of the instruction
is done to find the source address. For MOVS instructions
the source address is in the %rsi register. The function
fault_in_kernel_space() determines whether the source address is
kernel or user; upgrade it from "static" so it can be used here.
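
So the recoverability test reduces to something like (a sketch built
around the real fault_in_kernel_space() helper):

  static bool copy_fault_recoverable(unsigned long src_addr)
  {
          /* only a kernel read from a *user* source address can be
           * recovered with the ring3 strategy; a kernel-memory source
           * has no recovery plan */
          return !fault_in_kernel_space(src_addr);
  }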

Co-developed-by: Youquan Song <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/mce: Recover from poison found while copying from user space
Tony Luck [Tue, 6 Oct 2020 21:09:09 +0000 (14:09 -0700)]
x86/mce: Recover from poison found while copying from user space

Existing kernel code can only recover from a machine check on code that
is tagged in the exception table with a fault handling recovery path.

Add two new fields in the task structure to pass information from
machine check handler to the "task_work" that is queued to run before
the task returns to user mode:

+ mce_vaddr: will be initialized to the user virtual address of the fault
  in the case where the fault occurred in the kernel copying data from
  a user address.  This is so that kill_me_maybe() can provide that
  information to the user SIGBUS handler.

+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
  to determine if mce_vaddr is applicable to this error.

Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the task.

Use a new helper function to share common code between the "fault
in user mode" case and the "fault while copying from user" case.

New code paths will be activated by the next patch which sets
MCE_IN_KERNEL_COPYIN.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/mce: Avoid tail copy when machine check terminated a copy from user
Tony Luck [Tue, 6 Oct 2020 21:09:08 +0000 (14:09 -0700)]
x86/mce: Avoid tail copy when machine check terminated a copy from user

In the page fault case it is ok to see if a few more unaligned bytes
can be copied from the source address. Worst case is that the page fault
will be triggered again.

Machine checks are more serious. Just give up at the point where the
main copy loop triggered the #MC and return from the copy code as if
the copy succeeded. The machine check handler will use task_work_add() to
make sure that the task is sent a SIGBUS.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agommc: sdhci-pci-gli: Add CQHCI Support for GL9763E
Ben Chuang [Mon, 5 Oct 2020 10:55:09 +0000 (18:55 +0800)]
mmc: sdhci-pci-gli: Add CQHCI Support for GL9763E

Add CQHCI initialization and implement CQHCI operations for GL9763E.
Use bit 19 of the register at offset 0x888 to decide whether to disable
command queuing. If the bit is set, command queuing will be disabled.
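
In driver terms the check is something like this (offset and bit number
come from the text above; the macro names are hypothetical):

  #define PCI_GLI_9763E_CQE_CFG  0x888    /* hypothetical register name */
  #define GLI_9763E_CQE_DISABLE  BIT(19)  /* hypothetical bit name */

  static bool gl9763e_cqe_disabled(struct sdhci_host *host)
  {
          u32 value = readl(host->ioaddr + PCI_GLI_9763E_CQE_CFG);

          return value & GLI_9763E_CQE_DISABLE;  /* set => skip CQHCI */
  }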

Signed-off-by: Ben Chuang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ulf Hansson <[email protected]>
4 years agox86/mce: Add _ASM_EXTABLE_CPY for copy user access
Youquan Song [Tue, 6 Oct 2020 21:09:07 +0000 (14:09 -0700)]
x86/mce: Add _ASM_EXTABLE_CPY for copy user access

_ASM_EXTABLE_UA is a general exception entry to record the exception fixup
for all exception spots between kernel and user space access.

To enable recovery from machine checks while copying data from user
addresses, it is necessary to be able to distinguish the places that are
looping copying data from those that copy a single byte/word/etc.

Add a new macro _ASM_EXTABLE_CPY and use it in place of _ASM_EXTABLE_UA
in the copy functions.

Record the exception reason number in regs->ax at ex_handler_uaccess(),
which is used to check whether an MCE was triggered.

The new fixup routine ex_handler_copy() is almost an exact copy of
ex_handler_uaccess().  The difference is that it sets regs->ax to the
trap number.  Following patches use this to avoid trying to copy
remaining bytes from the tail of the copy and possibly hitting the
poison again.

New mce.kflags bit MCE_IN_KERNEL_COPYIN will be used by mce_severity()
calculation to indicate that a machine check is recoverable because the
kernel was copying from user space.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/mce: Provide method to find out the type of an exception handler
Tony Luck [Tue, 6 Oct 2020 21:09:06 +0000 (14:09 -0700)]
x86/mce: Provide method to find out the type of an exception handler

Avoid a proliferation of ex_has_*_handler() functions by having just
one function that returns the type of the handler (if any).

Drop the __visible attribute for this function. It is not called
from assembler so the attribute is not necessary.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/mce: Pass pointer to saved pt_regs to severity calculation routines
Youquan Song [Tue, 6 Oct 2020 21:09:05 +0000 (14:09 -0700)]
x86/mce: Pass pointer to saved pt_regs to severity calculation routines

New recovery features require additional information about processor
state when a machine check occurred. Pass pt_regs down to the routines
that need it.

No functional change.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update Copyrights to conform to HPE standards
Mike Travis [Mon, 5 Oct 2020 20:39:29 +0000 (15:39 -0500)]
x86/platform/uv: Update Copyrights to conform to HPE standards

Add Copyrights to those files that have been updated for UV5 changes.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update for UV5 NMI MMR changes
Mike Travis [Mon, 5 Oct 2020 20:39:28 +0000 (15:39 -0500)]
x86/platform/uv: Update for UV5 NMI MMR changes

The UV NMI MMR addresses and fields moved between UV4 and UV5
necessitating a rewrite of the UV NMI handler.  Adjust references
to accommodate those changes.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update UV5 TSC checking
Mike Travis [Mon, 5 Oct 2020 20:39:27 +0000 (15:39 -0500)]
x86/platform/uv: Update UV5 TSC checking

Update check of BIOS TSC sync status to include both possible "invalid"
states provided by newer UV5 BIOS.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update node present counting
Mike Travis [Mon, 5 Oct 2020 20:39:26 +0000 (15:39 -0500)]
x86/platform/uv: Update node present counting

The changes in the UV5 arch shrunk the NODE PRESENT table to just 2x64
entries (128 total), so they now fit in two 64-bit MMRs instead of an
array with a depth of 64 bits.  Adjust references when counting up the
nodes present.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update UV5 MMR references in UV GRU
Mike Travis [Mon, 5 Oct 2020 20:39:25 +0000 (15:39 -0500)]
x86/platform/uv: Update UV5 MMR references in UV GRU

Make modifications to the GRU mappings to accommodate changes for UV5.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Adjust GAM MMR references affected by UV5 updates
Mike Travis [Mon, 5 Oct 2020 20:39:24 +0000 (15:39 -0500)]
x86/platform/uv: Adjust GAM MMR references affected by UV5 updates

Make modifications to the GAM MMR mappings to accommodate changes for UV5.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update MMIOH references based on new UV5 MMRs
Mike Travis [Mon, 5 Oct 2020 20:39:23 +0000 (15:39 -0500)]
x86/platform/uv: Update MMIOH references based on new UV5 MMRs

Make modifications to the MMIOH mappings to accommodate changes for UV5.

[ Fix W=1 build warnings. ]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Add and decode Arch Type in UVsystab
Mike Travis [Mon, 5 Oct 2020 20:39:22 +0000 (15:39 -0500)]
x86/platform/uv: Add and decode Arch Type in UVsystab

When the UV BIOS starts the kernel it passes the UVsystab info struct to
the kernel which contains information elements more specific than ACPI,
and generally pertinent only to the MMRs. These are read only fields
so information is passed one way only. A new field starting with UV5 is
the UV architecture type so the ACPI OEM_ID field can be used for other
purposes going forward. The UV Arch Type selects the entirety of the
MMRs available, with their addresses and fields defined in uv_mmrs.h.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Add UV5 direct references
Mike Travis [Mon, 5 Oct 2020 20:39:21 +0000 (15:39 -0500)]
x86/platform/uv: Add UV5 direct references

Add new references to UV5 (and UVY class) system MMR addresses and
fields primarily caused by the expansion from 46 to 52 bits of physical
memory address.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Update UV MMRs for UV5
Mike Travis [Mon, 5 Oct 2020 20:39:20 +0000 (15:39 -0500)]
x86/platform/uv: Update UV MMRs for UV5

Update UV MMRs in uv_mmrs.h for UV5 based on Verilog output from the
UV Hub hardware design files.  This is the next UV architecture with
a new class (UVY) being defined for 52 bit physical address masks.
Uses a bitmask for UV arch identification so a single test can cover
multiple versions.  Includes other adjustments to match the uv_mmrs.h
file to keep from encountering compile errors.  New UV5 functionality
is added in the patches that follow.

[ Fix W=1 build warnings. ]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agodrivers/misc/sgi-xp: Adjust references in UV kernel modules
Mike Travis [Tue, 6 Oct 2020 21:34:27 +0000 (16:34 -0500)]
drivers/misc/sgi-xp: Adjust references in UV kernel modules

Remove the is_uv() define for is_uv_system() and just use the latter as
is.  This removes a conflict with a new symbol (is_uv()) in the
generated uv_mmrs.h file.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Steve Wahl <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Remove SCIR MMR references for UV systems
Mike Travis [Mon, 5 Oct 2020 20:39:18 +0000 (15:39 -0500)]
x86/platform/uv: Remove SCIR MMR references for UV systems

UV class systems no longer use the System Controller for the monitoring
of CPU activity provided by this driver. Other methods have been
developed for BIOS and the management controller (BMC). Remove that
supporting code.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agox86/platform/uv: Remove UV BAU TLB Shootdown Handler
Mike Travis [Mon, 5 Oct 2020 20:39:17 +0000 (15:39 -0500)]
x86/platform/uv: Remove UV BAU TLB Shootdown Handler

The Broadcast Assist Unit (BAU) TLB shootdown handler is being rewritten
to become the UV BAU APIC driver. It is designed to speed up sending
IPIs to selective CPUs within the system. Remove the current TLB
shootdown handler (tlb_uv.c) file and a couple of kernel hooks in the
interim.

Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
4 years agonvme-core: remove extra condition for vwc
Chaitanya Kulkarni [Thu, 1 Oct 2020 18:54:32 +0000 (11:54 -0700)]
nvme-core: remove extra condition for vwc

In nvme_set_queue_limits() we initialize vwc to false and later add
a condition to set vwc to true. The value of vwc can instead be
initialized at declaration, which makes all the blk_queue_XXX() calls
uniform.
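
That is, something along these lines (a sketch of the cleanup):

  /* was: bool vwc = false;
   *      ... if (ctrl->vwc & NVME_CTRL_VWC_PRESENT) vwc = true; */
  bool vwc = ctrl->vwc & NVME_CTRL_VWC_PRESENT;

  blk_queue_write_cache(q, vwc, vwc);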

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
4 years agonvme-core: remove extra variable
Chaitanya Kulkarni [Thu, 1 Oct 2020 18:54:31 +0000 (11:54 -0700)]
nvme-core: remove extra variable

In nvme_validate_ns() the extra variable ctrl is used only twice.
Using ns->ctrl directly still maintains the readability and original
length of the lines in the code. Get rid of the extra variable ctrl and
use ns->ctrl directly.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
4 years agonvme: remove nvme_identify_ns_list
Christoph Hellwig [Mon, 28 Sep 2020 12:08:28 +0000 (14:08 +0200)]
nvme: remove nvme_identify_ns_list

Just fold it into the only caller.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
4 years agonvme: refactor nvme_validate_ns
Christoph Hellwig [Mon, 28 Sep 2020 11:59:06 +0000 (13:59 +0200)]
nvme: refactor nvme_validate_ns

Move the logic to revalidate the block_device size or remove the
namespace from the caller into nvme_validate_ns.  This removes
the return value and thus the status code translation.  Additionally
it catches non-permanent errors from nvme_update_ns_info using
the existing logic.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: move nvme_validate_ns
Christoph Hellwig [Mon, 28 Sep 2020 11:55:22 +0000 (13:55 +0200)]
nvme: move nvme_validate_ns

Move nvme_validate_ns just above its only remaining caller.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: query namespace identifiers before adding the namespace
Christoph Hellwig [Mon, 28 Sep 2020 12:07:56 +0000 (14:07 +0200)]
nvme: query namespace identifiers before adding the namespace

Check the namespace identifier list first thing when scanning namespaces.
This keeps the code to query the CSI common between the alloc and validate
path, and helps to structure the code better for multiple command set
support.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: revalidate zone bitmaps in nvme_update_ns_info
Christoph Hellwig [Mon, 28 Sep 2020 10:30:16 +0000 (12:30 +0200)]
nvme: revalidate zone bitmaps in nvme_update_ns_info

Consolidate the two calls into a single place.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
4 years agonvme: remove nvme_update_formats
Christoph Hellwig [Mon, 28 Sep 2020 09:10:36 +0000 (11:10 +0200)]
nvme: remove nvme_update_formats

Now that the queue is frozen before updating ->lba_shift we can't hit the
invalid references mentioned in the comment any more.  More importantly
this code would not have helped us if the format was changed by another
controller or through implementation defined back channels.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: update the known admin effects
Christoph Hellwig [Fri, 25 Sep 2020 06:34:43 +0000 (08:34 +0200)]
nvme: update the known admin effects

A Format NVM command can change the capabilities of namespaces, while
Sanitize does change the Logical Block Content and must be serialized.

Also remove CSUPP bit for Format - it is not a mandatory command,
and we don't check for the bit anyway.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: set the queue limits in nvme_update_ns_info
Christoph Hellwig [Mon, 28 Sep 2020 10:05:28 +0000 (12:05 +0200)]
nvme: set the queue limits in nvme_update_ns_info

Only set the queue limits once we have the real block size.  This also
updates the limits on a rescan if needed.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: remove the 0 lba_shift check in nvme_update_ns_info
Christoph Hellwig [Mon, 28 Sep 2020 10:12:02 +0000 (12:12 +0200)]
nvme: remove the 0 lba_shift check in nvme_update_ns_info

We can no longer reach this code if Identify Namespace failed.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: clean up the check for too large logic block sizes
Christoph Hellwig [Mon, 28 Sep 2020 10:03:13 +0000 (12:03 +0200)]
nvme: clean up the check for too large logic block sizes

Use a single statement to set both the capacity and fake block size
instead of two.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
4 years agonvme: freeze the queue over ->lba_shift updates
Christoph Hellwig [Mon, 28 Sep 2020 10:11:42 +0000 (12:11 +0200)]
nvme: freeze the queue over ->lba_shift updates

Ensure that there can't be any I/O in flight when we change the disk
geometry in nvme_update_ns_info, most notably the LBA size, by lifting
the queue freeze from nvme_update_disk_info into the caller.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: factor out a nvme_configure_metadata helper
Christoph Hellwig [Fri, 25 Sep 2020 05:19:13 +0000 (07:19 +0200)]
nvme: factor out a nvme_configure_metadata helper

Factor out a helper from nvme_update_ns_info that configures the
per-namespaces metadata and PI settings.  Also make sure the helpers
clear the flags explicitly instead of all of ->features to allow for
potentially reusing ->features for future non-metadata flags.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: call nvme_identify_ns as the first thing in nvme_alloc_ns_block
Christoph Hellwig [Mon, 28 Sep 2020 10:34:04 +0000 (12:34 +0200)]
nvme: call nvme_identify_ns as the first thing in nvme_alloc_ns_block

Check if the namespace actually exists as the very first thing and don't
bother with any extra work if not.  This should speed up and simplify
the sequential scanning for NVMe 1.0 devices.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: lift the check for an unallocated namespace into nvme_identify_ns
Christoph Hellwig [Mon, 28 Sep 2020 10:33:19 +0000 (12:33 +0200)]
nvme: lift the check for an unallocated namespace into nvme_identify_ns

Move the check from the two callers into the common helper.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
4 years agonvme: rename __nvme_revalidate_disk
Christoph Hellwig [Mon, 28 Sep 2020 10:14:20 +0000 (12:14 +0200)]
nvme: rename __nvme_revalidate_disk

Rename __nvme_revalidate_disk to nvme_update_ns_info and pass a
namespace instead of the gendisk.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: rename _nvme_revalidate_disk
Christoph Hellwig [Mon, 28 Sep 2020 08:25:54 +0000 (10:25 +0200)]
nvme: rename _nvme_revalidate_disk

Rename _nvme_revalidate_disk to nvme_validate_ns to better describe
what the function does, and pass the struct nvme_ns instead of the
gendisk to better match the call chain.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: rename nvme_validate_ns to nvme_validate_or_alloc_ns
Christoph Hellwig [Mon, 28 Sep 2020 07:42:17 +0000 (09:42 +0200)]
nvme: rename nvme_validate_ns to nvme_validate_or_alloc_ns

Use a slightly more descriptive name to enable reusing nvme_validate_ns
in the next patch for a lower level function.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: remove the disk argument to nvme_update_zone_info
Christoph Hellwig [Thu, 20 Aug 2020 12:02:18 +0000 (14:02 +0200)]
nvme: remove the disk argument to nvme_update_zone_info

The queue can trivially be derived from the nvme_ns structure.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme: fix initialization of the zone bitmaps
Christoph Hellwig [Thu, 20 Aug 2020 07:31:36 +0000 (09:31 +0200)]
nvme: fix initialization of the zone bitmaps

The removal of the ->revalidate_disk method broke the initialization of
the zone bitmaps, as nvme_revalidate_disk now never gets called during
initialization.

Move the zone related code from nvme_revalidate_disk into a new helper in
zns.c, and call it from nvme_alloc_ns in addition to nvme_validate_ns to
ensure the zone bitmaps are initialized during probe.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agoblock: optimize blk_queue_zoned_model for !CONFIG_BLK_DEV_ZONED
Christoph Hellwig [Thu, 20 Aug 2020 07:56:58 +0000 (09:56 +0200)]
block: optimize blk_queue_zoned_model for !CONFIG_BLK_DEV_ZONED

Always return BLK_ZONED_NONE if zoned device support is not enabled.
This allows various compiler optimizations including the dead code
elimination that we so like for avoiding ifdefs.
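
The pattern is the usual IS_ENABLED() one, roughly:

  static inline enum blk_zoned_model
  blk_queue_zoned_model(struct request_queue *q)
  {
          if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
                  return q->limits.zoned;
          /* compile-time constant: zoned-only branches in callers
           * become dead code and are eliminated */
          return BLK_ZONED_NONE;
  }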

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
4 years agonvme-loop: don't put ctrl on nvme_init_ctrl error
Chaitanya Kulkarni [Wed, 30 Sep 2020 04:24:17 +0000 (21:24 -0700)]
nvme-loop: don't put ctrl on nvme_init_ctrl error

The function nvme_init_ctrl() takes a ctrl reference and, when it fails,
puts that reference itself in its error unwind code.

When creating a loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl()
returns a non-zero (i.e. error) value it jumps to the "out_put_ctrl"
label, which calls nvme_put_ctrl(); that leads to a double ctrl put in
the error unwind path.

Update nvme_loop_create_ctrl(): remove the "out_put_ctrl" label, add a
new "out" label after nvme_put_ctrl() in the error unwind path, and jump
to the newly added label when the nvme_init_ctrl() call returns an error.
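
The resulting control flow looks roughly like this (surrounding code
elided; labels as described above):

  ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_loop_ctrl_ops, 0);
  if (ret)
          goto out;   /* nvme_init_ctrl() already dropped its ref */
  ...
          nvme_put_ctrl(&ctrl->ctrl);   /* unwind for later failures */
  out:
          return ERR_PTR(ret);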

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
4 years agonvme-core: put ctrl ref when module ref get fail
Chaitanya Kulkarni [Tue, 6 Oct 2020 23:36:47 +0000 (16:36 -0700)]
nvme-core: put ctrl ref when module ref get fail

When try_module_get() fails in nvme_dev_open() it returns without
releasing the ctrl reference which was taken earlier.

Put the ctrl reference which is taken before calling the
try_module_get() in the error return code path.

Fixes: 52a3974feb1a "nvme-core: get/put ctrl and transport module in nvme_dev_open/release()"
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>