Josef Bacik [Fri, 18 Sep 2020 20:44:32 +0000 (16:44 -0400)]
btrfs: init device stats for seed devices
We recently started recording device stats across the fleet, and noticed
a large increase in messages such as this
BTRFS warning (device dm-0): get dev_stats failed, not yet valid
on our tiers that use seed devices for their root devices. This is
because we do not initialize the device stats for any seed devices if we
have a sprout device and mount using that sprout device. The basic
steps for reproducing are:
Fix this by iterating over the fs_devices->seed if they exist in
btrfs_init_dev_stats. This fixed the problem and properly reports the
stats for both devices.
Nikolay Borisov [Fri, 18 Sep 2020 13:34:39 +0000 (16:34 +0300)]
btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.
Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.
Nikolay Borisov [Fri, 18 Sep 2020 13:34:38 +0000 (16:34 +0300)]
btrfs: call submit_bio_hook directly for metadata pages
No need to go through a function pointer indirection simply call
submit_bio_hook directly by exporting and renaming the helper to
btrfs_submit_metadata_bio. This makes the code more readable and should
result in somewhat faster code due to no longer paying the price for
specualtive attack mitigations that come with indirect function calls.
Nikolay Borisov [Fri, 18 Sep 2020 13:34:37 +0000 (16:34 +0300)]
btrfs: stop calling submit_bio_hook for data inodes
Instead export and rename the function to btrfs_submit_data_bio and
call it directly in submit_one_bio. This avoids paying the cost for
speculative attacks mitigations and improves code readability.
Nikolay Borisov [Fri, 18 Sep 2020 13:34:35 +0000 (16:34 +0300)]
btrfs: call submit_bio_hook directly in submit_one_bio
BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
- ordinary data inodes (including the freespace inode) and the btree
inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
never be called from submit_one_bio so just remove it.
Nikolay Borisov [Fri, 18 Sep 2020 13:34:33 +0000 (16:34 +0300)]
btrfs: replace readpage_end_io_hook with direct calls
Don't call readpage_end_io_hook for the btree inode. Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.
This is an improvement in 2 directions:
1. We aren't paying the penalty of indirect calls in a post-speculation
attacks world.
2. The function is now named more explicitly so it's obvious what's
going on
This is in preparation to removing struct extent_io_ops altogether.
Filipe Manana [Mon, 21 Sep 2020 13:13:30 +0000 (14:13 +0100)]
btrfs: send, recompute reference path after orphanization of a directory
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
# Now do a series of changes such that:
#
# *) inode 258 has one new hardlink and the previous name changed
#
# *) both names conflict with the old names of two other inodes:
#
# 1) the new name "d1" conflicts with the old name of inode 259,
# under directory inode 256 (root)
#
# 2) the new name "d2" conflicts with the old name of inode 260
# under directory inode 259
#
# *) inodes 259 and 260 now have the old names of inode 258
#
# *) inode 257 is now located under inode 260 - an inode with a number
# smaller than the inode (258) for which we created a second hard
# link and swapped its names with inodes 259 and 260
#
ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
When executed the receive of the incremental stream fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
This happens because:
1) When processing inode 257 we end up computing the name for inode 259
because it is an ancestor in the send snapshot, and at that point it
still has its old name, "d1", from the parent snapshot because inode
259 was not yet processed. We then cache that name, which is valid
until we start processing inode 259 (or set the progress to 260 after
processing its references);
2) Later we start processing inode 258 and collecting all its new
references into the list sctx->new_refs. The first reference in the
list happens to be the reference for name "d1" while the reference for
name "d2" is next (the last element of the list).
We compute the full path "d1/d2" for this second reference and store
it in the reference (its ->full_path member). The path used for the
new parent directory was "d1" and not "f2" because inode 259, the
new parent, was not yet processed;
3) When we start processing the new references at process_recorded_refs()
we start with the first reference in the list, for the new name "d1".
Because there is a conflicting inode that was not yet processed, which
is directory inode 259, we orphanize it, renaming it from "d1" to
"o259-6-0";
4) Then we start processing the new reference for name "d2", and we
realize it conflicts with the reference of inode 260 in the parent
snapshot. So we issue an orphanization operation for inode 260 by
emitting a rename operation with a destination path of "o260-6-0"
and a source path of "d1/d2" - this source path is the value we
stored in the reference earlier at step 2), corresponding to the
->full_path member of the reference, however that path is no longer
valid due to the orphanization of the directory inode 259 in step 3).
This makes the receiver fail since the path does not exists, it should
have been "o259-6-0/d2".
Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.
Filipe Manana [Mon, 21 Sep 2020 13:13:29 +0000 (14:13 +0100)]
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
Commit 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
introduced btrfs root item size check, however btrfs root item has two
versions, the legacy one which just ends before generation_v2 member, is
smaller than current btrfs root item size.
This caused btrfs kernel to reject valid but old tree root leaves.
Fix this problem by also allowing legacy root item, since kernel can
already handle them pretty well and upgrade to newer root item format
when needed.
David Sterba [Tue, 15 Sep 2020 12:58:42 +0000 (14:58 +0200)]
btrfs: use unaligned helpers for stack and header set/get helpers
In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
pointer assignment but we should use the helpers for unaligned access
for clarity. It hasn't been a problem so far because of the natural
alignment.
Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
from stack that has an aligned start but some members may not be aligned
due to packing. This as well hasn't caused problems so far.
Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
David Sterba [Tue, 15 Sep 2020 11:32:34 +0000 (13:32 +0200)]
btrfs: free-space-cache: use unaligned helpers to access data
The free space inode stores the tracking data, checksums etc, using the
io_ctl structure and moving the pointers. The data are generally aligned
to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
for clarity we should use the proper helpers whenever a struct is
initialized from io_ctl->cur pointer.
David Sterba [Tue, 15 Sep 2020 08:54:23 +0000 (10:54 +0200)]
btrfs: send: use helpers for unaligned access to header members
The header is mapped onto the send buffer and thus its members may be
potentially unaligned so use the helpers instead of directly assigning
the pointers. This has worked so far but let's use the helpers to make
that clear.
All of these lockup reports have the call chain btrfs_clone_files() ->
btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
both source and destination extents locked and loops over the source
extent to create the clones.
Conditionally reschedule in the btrfs_clone() loop, to give some time back
to other processes.
btrfs: use kvcalloc for allocation in btrfs_ioctl_send()
Replace kvzalloc() call with kvcalloc() that also checks the size
internally. There's a standalone overflow check in the function so we
can return invalid parameter combination. Use array_size() helper to
compute the memory size for clone_sources_tmp.
btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send()
btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.
Nikolay Borisov [Fri, 18 Sep 2020 09:15:52 +0000 (12:15 +0300)]
btrfs: remove inode argument from add_pending_csums
It's used to reference the csum root which can be done from the trans
handle as well. Simplify the signature and while at it also remove the
noinline attribute as the function uses only at most 16 bytes of stack
space.
Nikolay Borisov [Mon, 14 Sep 2020 09:37:08 +0000 (12:37 +0300)]
btrfs: promote extent_read_full_page to btrfs_readpage
Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.
Nikolay Borisov [Mon, 14 Sep 2020 09:37:06 +0000 (12:37 +0300)]
btrfs: remove btrfs_get_extent indirection from __do_readpage
Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.
Nikolay Borisov [Mon, 14 Sep 2020 09:37:05 +0000 (12:37 +0300)]
btrfs: remove btree_get_extent
The sole purpose of this function was to satisfy the requirements of
__do_readpage. Since that function is no longer used to read metadata
pages the need to keep btree_get_extent around has also disappeared.
Simply remove it.
Nikolay Borisov [Mon, 14 Sep 2020 09:37:04 +0000 (12:37 +0300)]
btrfs: simplify metadata pages reading
Metadata pages currently use __do_readpage to read metadata pages,
unfortunately this function is also used to deal with ordinary data
pages. This makes the metadata pages reading code to go through multiple
hoops in order to adhere to __do_readpage invariants. Most of these are
necessary for data pages which could be compressed. For metadata it's
enough to simply build a bio and submit it.
To this effect simply call submit_extent_page directly from
read_extent_buffer_pages which is the only callpath used to populate
extent_buffers with data. This in turn enables further cleanups.
Nikolay Borisov [Mon, 14 Sep 2020 09:37:03 +0000 (12:37 +0300)]
btrfs: remove btree_readpage
There is no way for this function to be called as ->readpage() since
it's called from
generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead
code. BTRFS doesn't utilize the first 3 for the btree inode and
implements it's owon readhead mechanism. So simply remove the function.
Filipe Manana [Mon, 14 Sep 2020 14:27:50 +0000 (15:27 +0100)]
btrfs: reschedule if necessary when logging directory items
Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.
Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.
So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.
The stack trace produced when the soft lockup happens is the following:
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Josef Bacik [Tue, 1 Sep 2020 21:40:37 +0000 (17:40 -0400)]
btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed.
We create all the space_info's on mount, and we destroy them on unmount.
The list never changes and we're protected from messing with it by the
normal mount/umount path, so kill the RCU stuff around it.
Nikolay Borisov [Tue, 1 Sep 2020 14:39:59 +0000 (17:39 +0300)]
btrfs: sink total_data parameter in setup_items_for_insert
That parameter can easily be derived based on the "data_size" and "nr"
parameters exploit this fact to simply the function's signature. No
functional changes.
Nikolay Borisov [Tue, 1 Sep 2020 14:39:58 +0000 (17:39 +0300)]
btrfs: eliminate total_size parameter from setup_items_for_insert
The value of this argument can be derived from the total_data as it's
simply the value of the data size + size of btrfs_items being touched.
Move the parameter calculation inside the function. This results in a
simpler interface and also a minor size reduction:
./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
Function old new delta
btrfs_duplicate_item 260 259 -1
setup_items_for_insert 1200 1190 -10
btrfs_insert_empty_items 177 154 -23
Omar Sandoval [Fri, 21 Aug 2020 07:39:53 +0000 (00:39 -0700)]
btrfs: send: use btrfs_file_extent_end() in send_write_or_clone()
send_write_or_clone() basically has an open-coded copy of
btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
instead of sectorsize. Fix and simplify the code by using
btrfs_file_extent_end().
Omar Sandoval [Fri, 21 Aug 2020 07:39:52 +0000 (00:39 -0700)]
btrfs: send: avoid copying file data
send_write() currently copies from the page cache to sctx->read_buf, and
then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
zeroes sctx->read_buf and then copies from sctx->read_buf to
sctx->send_buf. However, if we write the TLV header manually, we can
copy to sctx->send_buf directly and get rid of sctx->read_buf.
Omar Sandoval [Fri, 21 Aug 2020 07:39:51 +0000 (00:39 -0700)]
btrfs: send: get rid of i_size logic in send_write()
send_write()/fill_read_buf() have some logic for avoiding reading past
i_size. However, everywhere that we call
send_write()/send_extent_data(), we've already clamped the length down
to i_size. Get rid of the i_size handling, which simplifies the next
change.
Filipe Manana [Tue, 8 Sep 2020 10:27:24 +0000 (11:27 +0100)]
btrfs: rename btrfs_insert_clone_extent() to a more generic name
Now that we use the same mechanism to replace all the extents in a file
range with either a hole, an existing extent (when cloning) or a new
extent (when using fallocate), the name of btrfs_insert_clone_extent()
no longer reflects its genericity.
So rename it to btrfs_insert_replace_extent(), since what it does is
to either insert an existing extent or a new extent into a file range.
Filipe Manana [Tue, 8 Sep 2020 10:27:23 +0000 (11:27 +0100)]
btrfs: rename btrfs_punch_hole_range() to a more generic name
The function btrfs_punch_hole_range() is now used to replace all the file
extents in a given file range with an extent described in the given struct
btrfs_replace_extent_info argument. This extent can either be an existing
extent that is being cloned or it can be a new extent (namely a prealloc
extent). When that argument is NULL it only punches a hole (drops all the
existing extents) in the file range.
So rename the function to btrfs_replace_file_extents().
Filipe Manana [Tue, 8 Sep 2020 10:27:22 +0000 (11:27 +0100)]
btrfs: rename struct btrfs_clone_extent_info to a more generic name
Now that we can use btrfs_clone_extent_info to convey information for a
new prealloc extent as well, and not just for existing extents that are
being cloned, rename it to btrfs_replace_extent_info, which reflects the
fact that this is now more generic and it is used to replace all existing
extents in a file range with the extent described by the structure.
Filipe Manana [Tue, 8 Sep 2020 10:27:21 +0000 (11:27 +0100)]
btrfs: remove item_size member of struct btrfs_clone_extent_info
The value of item_size of struct btrfs_clone_extent_info is always set to
the size of a non-inline file extent item, and in fact the infrastructure
that uses this structure (btrfs_punch_hole_range()) does not work with
inline file extents at all (and it is not supposed to).
So just remove that field from the structure and use directly
sizeof(struct btrfs_file_extent_item) instead. Also assert that the
file extent type is not inline at btrfs_insert_clone_extent().
Filipe Manana [Tue, 8 Sep 2020 10:27:20 +0000 (11:27 +0100)]
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
btrfs: move btrfs_dev_replace_update_device_in_mapping_tree to drop declaration
The function is short and simple, we can get rid of the declaration as
it's not necessary for a static function. Move it before its first
caller. No functional changes.
btrfs: use sprout device_list_mutex in btrfs_init_devices_late
On a mounted sprout filesystem, all threads now are using the
sprout::device_list_mutex, and this is the only code using the
seed::device_list_mutex. This patch converts to use the sprouts
fs_info->fs_devices->device_list_mutex.
The same reasoning holds true here, that device delete is holding
the sprout::device_list_mutex.
btrfs: reada: lock all seed/sprout devices in __reada_start_machine
On an fs mounted using a sprout device, the seed fs_devices are
maintained in a linked list under fs_info->fs_devices. Each seeds
fs_devices also has device_list_mutex initialized to protect against the
potential race with delete threads. But the delete thread (at
btrfs_rm_device()) is holding the fs_info::fs_devices::device_list_mutex
mutex which belongs to sprout device_list_mutex instead of seed
device_list_mutex. Moreover, there aren't any significient benefits in
using the seed::device_list_mutex instead of sprout::device_list_mutex.
So this patch converts them of using the seed::device_list_mutex to
sprout::device_list_mutex.
btrfs: handle errors in btrfs_sysfs_add_fs_devices
btrfs_sysfs_add_fs_devices() is called by btrfs_sysfs_add_mounted().
btrfs_sysfs_add_mounted() assumes that btrfs_sysfs_add_fs_devices() will
either add sysfs entries for all the devices or none. So this patch keeps up
to its caller expecatation and cleans up the created sysfs entries if it
has to fail at some device in the list.
btrfs: initialize sysfs devid and device link for seed device
We don't initialize the sysfs devid kobject and device-link yet for the
seed devices in an sprouted filesystem.
So this patch initializes the seed device devid kobject and the device
link in the sysfs.
btrfs: split and refactor btrfs_sysfs_remove_devices_dir
Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
argument to indicate whether to free all devices or just one device.
Export btrfs_sysfs_remove_device() as device operations outside of
sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().
btrfs_sysfs_remove_devices_dir() is renamed to
btrfs_sysfs_remove_fs_devices() to suite its new role.
Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
so it is redeclared s static. And the same function had to be moved
before its first caller.
btrfs: simplify parameters of btrfs_sysfs_add_devices_dir
When we add a device we need to add it to sysfs, so instead of using the
btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
add a device or all of fs_devices, call the helper function directly
btrfs_sysfs_add_device() and thus make it non-static.
btrfs_sysfs_remove_devices_dir() removes device link and devid kobject
(sysfs entries) for a device or all the devices in the btrfs_fs_devices.
In preparation to remove these sysfs entries for the seed as well, add
a btrfs_sysfs_remove_device() helper function and avoid code
duplication.
btrfs_sysfs_add_devices_dir() adds device link and devid kobject
(sysfs entries) for a device or all the devices in the btrfs_fs_devices.
In preparation to add these sysfs entries for the seed as well, add
a btrfs_sysfs_add_device() helper function and avoid code duplication.
If you replace a seed device in a sprouted fs, it appears to have
successfully replaced the seed device, but if you look closely, it
didn't. Here is an example.
BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished
$ btrfs fi show
Label: none uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
Total devices 2 FS bytes used 256.00KiB
devid 1 size 3.00GiB used 520.00MiB path /dev/sdc
devid 2 size 3.00GiB used 896.00MiB path /dev/sdb
Label: none uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
Total devices 1 FS bytes used 128.00KiB
devid 1 size 3.00GiB used 536.00MiB path /dev/sda
So as per the replace start command and kernel log replace was successful.
Now let's try to clean mount.
$ umount /btrfs
$ btrfs device scan --forget
$ mount -o device=/dev/sdc /dev/sdb /btrfs
mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
Fix in this patch:
If a seed is not sprouted then there is no replacement of it, because of
its read-only filesystem with a read-only device. Similarly, in the case
of a sprouted filesystem, the seed device is still read only. So, mark
it as you can't replace a seed device, you can only add a new device and
then delete the seed device. If replace is attempted then returns
-EINVAL.
Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.
The current message is too long, so drop the unnecessary UUID and add
process name and PID.
While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent
Josef Bacik [Thu, 3 Sep 2020 18:29:51 +0000 (14:29 -0400)]
btrfs: pretty print leaked root name
I'm a actual human being so am incapable of converting u64 to s64 in my
head, so add a helper to get the pretty name of a root objectid and use
that helper to spit out the name for any special roots for leaked roots,
so I don't have to scratch my head and figure out which root I messed up
the refs for.
btrfs: sysfs: export currently running exclusive operation
/sys/fs/<fsid>/exclusive_operation contains the currently executing
exclusive operation. Add a sysfs_notify() when operation end, so
userspace can be notified of exclusive operation is finished.
btrfs: enumerate the type of exclusive operation in progress
Instead of using a flag bit for exclusive operation, use a variable to
store which exclusive operation is being performed. Introduce an API
to start and finish an exclusive operation.
This would enable another way for tools to check which operation is
running on why starting an exclusive operation failed. The followup
patch adds a sysfs_notify() to alert userspace when the state changes, so
userspace can perform select() on it to get notified of the change.
This would enable us to enqueue a command which will wait for current
exclusive operation to complete before issuing the next exclusive
operation. This has been done synchronously as opposed to a background
process, or else error collection (if any) will become difficult.
This happens because we are holding the chunk_mutex at the time of
adding in a new device. However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices. Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.
Nikolay Borisov [Mon, 31 Aug 2020 11:42:42 +0000 (14:42 +0300)]
btrfs: convert btrfs_inode_sectorsize to take btrfs_inode
It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.
Josef Bacik [Thu, 20 Aug 2020 15:46:08 +0000 (11:46 -0400)]
btrfs: use BTRFS_NESTED_NEW_ROOT for double splits
I've made this change separate since it requires both of the newly added
NESTED flags and I didn't want to slip it into one of those changes.
If we do a double split of a node we can end up doing a
BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it
appears as a double lock. Since we're maxed out on subclasses, use
BTRFS_NESTED_NEW_ROOT if we had to do a double split. This is OK
because we won't have to do a double split if we had to insert a new
root, and the new root would be at a higher level anyway.
Josef Bacik [Thu, 20 Aug 2020 15:46:07 +0000 (11:46 -0400)]
btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots
The way we add new roots is confusing from a locking perspective for
lockdep. We generally have the rule that we lock things in order from
highest level to lowest, but in the case of adding a new level to the
tree we actually allocate a new block for the root, which makes the
locking go in reverse. A similar issue exists for snapshotting, we cow
the original root for the root of a new tree, however they're at the
same level. Address this by using BTRFS_NESTING_NEW_ROOT for these
operations.
Josef Bacik [Thu, 20 Aug 2020 15:46:06 +0000 (11:46 -0400)]
btrfs: introduce BTRFS_NESTING_SPLIT for split blocks
If we are splitting a leaf/node, we could do something like the
following
lock(leaf) BTRFS_NESTING_NORMAL
lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
push from leaf -> left
reset path to point to left
split left
allocate new block, lock block BTRFS_NESTING_SPLIT
at the new block point we need to have a different nesting level,
because we have already used either BTRFS_NESTING_LEFT or
BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
adjacent leaves.
Our lockdep maps are based on rootid+level, however in some cases we
will lock adjacent blocks on the same level, namely in searching forward
or in split/balance. Because of this lockdep will complain, so we need
a separate subclass to indicate to lockdep that these are different
locks.
lock leaf -> BTRFS_NESTING_NORMAL
cow leaf -> BTRFS_NESTING_COW
split leaf
lock left -> BTRFS_NESTING_LEFT
lock right -> BTRFS_NESTING_RIGHT
The above graph illustrates the need for this new nesting subclass.
Josef Bacik [Thu, 20 Aug 2020 15:46:03 +0000 (11:46 -0400)]
btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
When we COW a block we are holding a lock on the original block, and
then we lock the new COW block. Because our lockdep maps are based on
root + level, this will make lockdep complain. We need a way to
indicate a subclass for locking the COW'ed block, so plumb through our
btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
The reason I've added all this extra infrastructure is because there
will be need of different nesting classes in follow up patches.
Josef Bacik [Thu, 20 Aug 2020 15:46:02 +0000 (11:46 -0400)]
btrfs: add nesting tags to the locking helpers
We will need these when we switch to an rwsem, so plumb in the
infrastructure here to use later on. I violate the 80 character limit
some here because it'll be cleaned up later.
Josef Bacik [Thu, 20 Aug 2020 15:46:01 +0000 (11:46 -0400)]
btrfs: introduce btrfs_path::recurse
Our current tree locking stuff allows us to recurse with read locks if
we're already holding the write lock. This is necessary for the space
cache inode, as we could be holding a lock on the root_tree root when we
need to cache a block group, and thus need to be able to read down the
root_tree to read in the inode cache.
We can get away with this in our current locking, but we won't be able
to with a rwsem. Handle this by purposefully annotating the places
where we require recursion, so that in the future we can maybe come up
with a way to avoid the recursion. In the case of the free space inode,
this will be superseded by the free space tree.
Josef Bacik [Thu, 20 Aug 2020 15:46:00 +0000 (11:46 -0400)]
btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed
Nested locking with lockdep and everything else refers to lock hierarchy
within the same lock map. This is how we indicate the same locks for
different objects are ok to take in a specific order, for our use case
that would be to take the lock on a leaf and then take a lock on an
adjacent leaf.
What ->lock_nested _actually_ refers to is if we happen to already be
holding the write lock on the extent buffer and we're allowing a read
lock to be taken on that extent buffer, which is recursion. Rename this
so we don't get confused when we switch to a rwsem and have to start
using the _nested helpers.
Nikolay Borisov [Wed, 22 Jul 2020 08:09:25 +0000 (11:09 +0300)]
btrfs: don't opencode sync_blockdev in btrfs_init_new_device
Instead of opencoding filemap_write_and_wait simply call syncblockdev as
it makes it abundantly clear what's going on and why this is used. No
semantics changes.
Nikolay Borisov [Wed, 22 Jul 2020 08:09:24 +0000 (11:09 +0300)]
btrfs: remove redundant code from btrfs_free_stale_devices
Following the refactor of btrfs_free_stale_devices in 7bcb8164ad94 ("btrfs: use device_list_mutex when removing stale devices")
fs_devices are freed after they have been iterated by the inner
list_for_each so the use-after-free fixed by introducing the break in fd649f10c3d2 ("btrfs: Fix use-after-free when cleaning up fs_devs with
a single stale device") is no longer necessary. Just remove it
altogether. No functional changes.
Nikolay Borisov [Wed, 22 Jul 2020 08:09:23 +0000 (11:09 +0300)]
btrfs: refactor locked condition in btrfs_init_new_device
Invert unlocked to locked and exploit the fact it can only ever be
modified if we are adding a new device to a seed filesystem. This allows
to simplify the check in error: label. No semantics changes.
Nikolay Borisov [Wed, 22 Jul 2020 08:09:22 +0000 (11:09 +0300)]
btrfs: use RCU for quick device check in btrfs_init_new_device
When adding a new device there's a mandatory check to see if a device is
being duplicated to the filesystem it's added to. Since this is a
read-only operations not necessary to take device_list_mutex and can simply
make do with an rcu-readlock.
Using just RCU is safe because there won't be another device add delete
running in parallel as btrfs_init_new_device is called only from
btrfs_ioctl_add_dev.
leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
range start 73785344 end 75497472 length 1712128
item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
range start 75497472 end 75501568 length 4096
item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
range start 75501568 end 77283328 length 1781760
item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
range start 77283328 end 77287424 length 4096
item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
range start 4120596480 end 4120903680 length 307200
leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
range start 77594624 end 79306752 length 1712128
...
Note the item 4 key of leaf 29757440, which is obviously too large, and
even larger than the first key of the next leaf.
However it still follows the key order in that tree block, thus tree
checker is unable to detect it at read time, since tree checker can only
work inside one leaf, thus such complex corruption can't be detected in
advance.
[FIX]
The next time to detect such problem is at tree block merge time,
which is in push_node_left(), balance_node_right(), push_leaf_left() or
push_leaf_right().
Now we check if the key order of the right-most key of the left node is
larger than the left-most key of the right node.
By this we don't need to call the full tree-checker, while still keeping
the key order correct as key order in each node is already checked by
tree checker thus we only need to check the above two slots.
[CAUSE]
Due to extent tree corruption (still valid by itself, but bad cross
ref), we can allocate an extent which is still in extent tree. The
offending tree block of that case is from csum tree. The newly
allocated tree block is also for csum tree.
Then we will try to insert a tree block ref for the existing tree block
ref.
For a tree extent item, tree block can never be shared directly by the
same tree twice. We have such BUG_ON() to prevent such problem, but
this is not a proper error handling.
[FIX]
Replace that BUG_ON() with proper error message and leaf dump for debug
build.
Qu Wenruo [Wed, 19 Aug 2020 06:35:48 +0000 (14:35 +0800)]
btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
__btrfs_free_extent() is doing two things:
1. Reduce the refs number of an extent backref
Either it's an inline extent backref (inside EXTENT/METADATA item) or
a keyed extent backref (SHARED_* item).
We only need to locate that backref line, either reduce the number or
remove the backref line completely.
2. Update the refs count in EXTENT/METADATA_ITEM
During step 1), we will try to locate the EXTENT/METADATA_ITEM without
triggering another btrfs_search_slot() as fast path.
Only when we fail to locate that item, we will trigger another
btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
updated/deleted the backref line.
And we have a lot of strict checks on things like refs_to_drop against
extent refs and special case checks for single ref extents.
There are 7 BUG_ON()s, although they're doing correct checks, they can
be triggered by crafted images.
This patch improves the function:
- Introduce two examples to show what __btrfs_free_extent() is doing
One inline backref case and one keyed case. Should cover most cases.
- Kill all BUG_ON()s with proper error message and optional leaf dump
- Add comment to show the overall flow
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
[ The report triggers one BUG_ON() in __btrfs_free_extent() ] Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
Qu Wenruo [Wed, 19 Aug 2020 06:35:47 +0000 (14:35 +0800)]
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]>
[ use check_add_overflow ] Signed-off-by: David Sterba <[email protected]>
Nikolay Borisov [Wed, 12 Aug 2020 13:16:35 +0000 (16:16 +0300)]
btrfs: rework error detection in init_tree_roots
To avoid duplicating 3 lines of code the error detection logic in
init_tree_roots is somewhat quirky. It first checks for the presence of
any error condition, then checks for the specific condition to perform
any specific actions. That's spurious because directly checking for
each respective error condition and doing the necessary steps is more
obvious. While at it change the -EUCLEAN to -EIO in case the extent
buffer is not read correctly, this is in line with other sites which
return -EIO when the eb couldn't be read.
Additionally it results in smaller code and the code reads
more linearly:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
Function old new delta
open_ctree 17243 17148 -95
Total: Before=113104, After=113009, chg -0.08%