This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without unnecessary ASSERT and
check. The max_zone_append_size will be used as a hint to estimate the
number of extents to cover delalloc/writeback region in the later commits.
The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit considers it to calculate max_zone_append_size. Technically, a
bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
as an upper limit of an extent size to calculate the number of extents
needed to write data.
Filipe Manana [Mon, 11 Jul 2022 14:22:50 +0000 (15:22 +0100)]
btrfs: add optimized btrfs_ino() version for 64 bits systems
Currently btrfs_ino() tries to use first the objectid of the inode's
location key. This is to avoid truncation of the inode number on 32 bits
platforms because the i_ino field of struct inode has the unsigned long
type, while the objectid is a 64 bits unsigned type (u64) on every system.
This logic was added in commit 33345d01522f81 ("Btrfs: Always use 64bit
inode number").
However if we are running on a 64 bits system, we can always directly
return the i_ino value from struct inode, which eliminates the need for
he special if statement that tests for a location key type of
BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as
the objectid in the inode's location objectid, it may have a value of
BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with
subvolumes/snapshots inside them.
So add a special version for 64 bits system that directly returns i_ino
of struct inode. This eliminates one branch and reduces the overall code
size, since btrfs_ino() is an inline function that is extensively used.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1612028 189180 29032 1830240 1bed60 fs/btrfs/btrfs.ko
Filipe Manana [Mon, 11 Jul 2022 14:22:49 +0000 (15:22 +0100)]
btrfs: set the objectid of the btree inode's location key
We currently don't use the location key of the btree inode, its content
is set to zeroes, as it's a special inode that is not persisted (it has
no inode item stored in any btree).
At btrfs_ino(), an inline function used extensively in btrfs, we have
this special check if the given inode's location objectid is 0, and if it
is, we return the value stored in the VFS' inode i_ino field instead
(which is BTRFS_BTREE_INODE_OBJECTID for the btree inode).
To reduce the code at btrfs_ino(), we can simply set the objectid of the
btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the
need to check for the special case of the objectid being zero, with the
side effect of reducing the overall code size and having less code to
execute, as btrfs_ino() is an inline function.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko
btrfs: replace kmap_atomic() with kmap_local_page()
kmap_atomic() is being deprecated in favor of kmap_local_page() where it
is feasible. With kmap_local_page() mappings are per thread, CPU local,
and not globally visible.
The last use of kmap_atomic is in inode.c where the context is atomic [1]
and can be safely replaced by kmap_local_page.
Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a
kernel with HIGHMEM64GB enabled.
btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zlib_decompress_bio()
because in this function the mappings are per thread and are not visible
in other contexts.
Tested with xfstests on QEMU + KVM 32-bits VM with 4GB of RAM and
HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress".
btrfs: zlib: replace kmap() with kmap_local_page() in zlib_compress_pages()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zlib_compress_pages()
because in this function the mappings are per thread and are not visible
in other contexts. Furthermore, drop the mappings of "out_page" which is
allocated within zlib_compress_pages() with alloc_page(GFP_NOFS) and use
page_address().
Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB of RAM booting
a kernel with HIGHMEM64G enabled. This patch passes 26/26 tests of group
"compress".
btrfs: zstd: replace kmap() with kmap_local_page()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zstd.c because in this
file the mappings are per thread and are not visible in other contexts. In
the meanwhile use plain page_address() on output pages allocated with
the GFP_NOFS flag instead of calling kmap*() on them (since they are
always allocated from ZONE_NORMAL).
Tested with xfstests on QEMU + KVM 32 bits VM with 4GB of RAM, booting a
kernel with HIGHMEM64G enabled.
highmem: Make __kunmap_{local,atomic}() take const void pointer
__kunmap_ {local,atomic}() currently take pointers to void. However, this
is semantically incorrect, since these functions do not change the memory
their arguments point to.
Therefore, make this semantics explicit by modifying the
__kunmap_{local,atomic}() prototypes to take pointers to const void.
As a side effect, compilers may produce more efficient code.
Filipe Manana [Mon, 4 Jul 2022 11:42:04 +0000 (12:42 +0100)]
btrfs: don't fallback to buffered IO for NOWAIT direct IO writes
Currently, for a direct IO write, if we need to fallback to buffered IO,
either to satisfy the whole write operation or just a part of it, we do
it in the current context even if it's a NOWAIT context. This is not ideal
because we currently don't have support for NOWAIT semantics in the
buffered IO path (we can block for several reasons), so we should instead
return -EAGAIN to the caller, so that it knows it should retry (the whole
operation or what's left of it) in a context where blocking is acceptable.
David Sterba [Thu, 23 Jun 2022 15:15:37 +0000 (17:15 +0200)]
btrfs: use enum for btrfs_block_rsv::type
The number of block group reserve types BTRFS_BLOCK_RSV_* is small and
fits to u8 and there's enough left in case we want to add more.
For type safety use the enum but make it 8 bits in the structure to save
space.
The structure size is now 48 on release build, making a slight
improvement in structures where it's embedded, like btrfs_fs_info or
btrfs_inode.
btrfs: do not return errors from btrfs_submit_dio_bio
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission and the other btrfs bio submission handlers do
and avoids any confusion on who needs to handle errors.
btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully
btrfs_wq_submit_bio is used for writeback under memory pressure.
Instead of failing the I/O when we can't allocate the async_submit_bio,
just punt back to the synchronous submission path.
btrfs: simplify sync/async submission in btrfs_submit_data_write_bio
btrfs_submit_data_write_bio special cases the reloc root because the
checksums are preloaded, but only does so for the !sync case. The sync
case can't happen for data relocation, but just handling it more generally
significantly simplifies the logic.
btrfs: raid56: transfer the bio counter reference to the raid submission helpers
Transfer the bio counter reference acquired by btrfs_submit_bio to
raid56_parity_write and raid56_parity_recovery together with the bio
that the reference was acquired for instead of acquiring another
reference in those helpers and dropping the original one in
btrfs_submit_bio.
btrfs: do not return errors from raid56_parity_recover
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Also use the proper bool type for the generic_io argument.
btrfs: do not return errors from raid56_parity_write
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches
what the block layer submission does and avoids any confusion on who
needs to handle errors.
As this requires touching all the callers, rename the function to
btrfs_submit_bio, which describes the functionality much better.
Qu Wenruo [Fri, 17 Jun 2022 10:04:06 +0000 (12:04 +0200)]
btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block()
For profiles other than RAID56, __btrfs_map_block() returns @map_length
as min(stripe_end, logical + *length), which is also the same result
from btrfs_get_io_geometry().
But for RAID56, __btrfs_map_block() returns @map_length as stripe_len.
This strange behavior is going to hurt incoming bio split at
btrfs_map_bio() time, as we will use @map_length as bio split size.
Fix this behavior by returning @map_length by the same calculation as
for other profiles.
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there
are functions passing it as arguments, this is not necessary. The fixed
value has been used for a long time and though the stripe length should
be configurable by super block member stripesize, this hasn't been
implemented and would require more changes so we don't need to keep this
code around until then.
Filipe Manana [Wed, 6 Jul 2022 10:14:23 +0000 (11:14 +0100)]
btrfs: remove the inode cache check at btrfs_is_free_space_inode()
The inode cache feature was removed in kernel 5.11, and we no longer have
any code that reads from or writes to inode caches. We may still mount a
filesystem that has inode caches, but they are ignored.
Remove the check for an inode cache from btrfs_is_free_space_inode(),
since we no longer have code to trigger reads from an inode cache or
writes to an inode cache. The check at send.c is still needed, because
in case we find a filesystem with an inode cache, we must ignore it.
Also leave the checks at tree-checker.c, as they are sanity checks.
This eliminates a dead branch and reduces the amount of code since it's
in an inline function.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko
Nikolay Borisov [Fri, 24 Jun 2022 08:01:23 +0000 (11:01 +0300)]
btrfs: sysfs: remove BIG_METADATA feature files
This flag has been merged in 3.10 and is effectively always-on. Its
status depends on the host page size so there's another way to guarantee
compatibility with old kernels.
Due to a bug introduced in 6f93e834fa7c ("btrfs: fix upper limit for
max_inline for page size 64K") the flag is not persisted among features
in the superblock so it's not reliable.
Nikolay Borisov [Fri, 24 Jun 2022 08:01:22 +0000 (11:01 +0300)]
btrfs: sysfs: remove MIXED_BACKREF feature file
This feature has been the default for about 13 year. At this point it's
safe to consider it an indispensable feature of BTRFS as such there's
no need to advertise it in sysfs. Remove the global sysfs feature file,
the per-filesystem feature file has never been there.
Nikolay Borisov [Thu, 23 Jun 2022 08:08:58 +0000 (11:08 +0300)]
btrfs: don't print 'has skinny extents' anymore on mount
Skinny extents have been a default mkfs feature since version 3.18 i
(introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs:
make skinny-metadata default") ). It really doesn't bring any value to
users to simply remove it.
Nikolay Borisov [Thu, 23 Jun 2022 07:57:52 +0000 (10:57 +0300)]
btrfs: don't print 'flagging with big metadata' anymore on mount
Added in commit 727011e07cbd ("Btrfs: allow metadata blocks larger than
the page size") in 2010 and it's been default for mkfs since 3.12
(2013). The message doesn't really convey any useful information to
users. Remove it.
David Sterba [Tue, 21 Jun 2022 16:40:48 +0000 (18:40 +0200)]
btrfs: clean up chained assignments
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception. Making it
consistent everywhere avoids surprises.
The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.
Nikolay Borisov [Thu, 23 Jun 2022 07:55:47 +0000 (10:55 +0300)]
btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA
Commit 6f93e834fa7c seemingly inadvertently moved the code responsible
for flagging the filesystem as having BIG_METADATA to a place where
setting the flag was essentially lost. This means that
filesystems created with kernels containing this bug (starting with 5.15)
can potentially be mounted by older (pre-3.4) kernels. In reality
chances for this happening are low because there are other incompat
flags introduced in the mean time. Still the correct behavior is to set
INCOMPAT_BIG_METADATA flag and persist this in the superblock.
David Sterba [Wed, 22 Jun 2022 18:45:18 +0000 (20:45 +0200)]
btrfs: print checksum type and implementation at mount time
Per user request, print the checksum type and implementation at mount
time among the messages. The checksum is user configurable and the
actual crypto implementation is useful to see for performance reasons.
The same information is also available after mount in
/sys/fs/FSID/checksum file.
Example:
[25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
Josef Bacik [Mon, 13 Jun 2022 22:31:17 +0000 (18:31 -0400)]
btrfs: reset block group chunk force if we have to wait
If you try to force a chunk allocation, but you race with another chunk
allocation, you will end up waiting on the chunk allocation that just
occurred and then allocate another chunk. If you have many threads all
doing this at once you can way over-allocate chunks.
Fix this by resetting force to NO_FORCE, that way if we think we need to
allocate we can, otherwise we don't force another chunk allocation if
one is already happening.
David Sterba [Wed, 18 May 2022 16:02:55 +0000 (18:02 +0200)]
btrfs: send: add new command FILEATTR for file attributes
There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS
and later from XFLAGS interfaces, now commonly found under the
'fileattr' API. This corresponds to the individual inode bits and that's
part of the on-disk format, so this is suitable for the protocol. The
other interfaces contain a lot of cruft or bits that btrfs does not
support yet.
Currently the value is u64 and matches btrfs_inode_item. Not all the
bits can be set by ioctls (like NODATASUM or READONLY), but we can send
them over the protocol and leave it up to the receiving side what and
how to apply.
As some of the flags, eg. IMMUTABLE, can prevent any further changes,
the receiving side needs to understand that and apply the changes in the
right order, or possibly with some intermediate steps. This should be
easier, future proof and simpler on the protocol layer than implementing
in kernel.
David Sterba [Tue, 17 May 2022 14:50:30 +0000 (16:50 +0200)]
btrfs: send: add OTIME as utimes attribute for proto 2+ by default
When send v1 was introduced the otime (inode creation time) was not
available, however the attribute in btrfs send protocol exists. Though
it would be possible to add it for v1 too as the attribute would be
ignored by v1 receive, let's not change the layout of v1 and only add
that to v2+. The otime cannot be changed and is only informative.
Naohiro Aota [Tue, 21 Jun 2022 06:41:02 +0000 (15:41 +0900)]
btrfs: replace unnecessary goto with direct return at cow_file_range()
The 'goto out' in cow_file_range() in the exit block are not necessary
and jump back. Replace them with return, while still keeping 'goto out'
in the main code.
Naohiro Aota [Tue, 21 Jun 2022 06:41:01 +0000 (15:41 +0900)]
btrfs: fix error handling of fallback uncompress write
When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but leaves the ordered extents intact. Thus, we need
to call btrfs_cleanup_ordered_extents() to finish the created ordered
extents.
Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never processes the region on
the locked_page.
Furthermore, we need to set the mapping as error if locked_page is
unavailable before unlocking the pages, so that the errno is properly
propagated to the user space.
Naohiro Aota [Tue, 21 Jun 2022 06:41:00 +0000 (15:41 +0900)]
btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page
btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
is not usable for submit_uncompressed_range() which can have NULL
locked_page.
Add support supports locked_page == NULL case. Also, it rewrites
redundant "page_offset(locked_page)".
While we debug the issue, we found running fstests generic/551 on 5GB
non-zoned null_blk device in the emulated zoned mode also had a
similar hung issue.
Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.
The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the give region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.
Considering the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked on error case.
In summary, cow_file_range now behaves like this:
- page_started == 1 (return value)
- All the pages are unlocked. IO is started.
- unlock == 1
- All the pages except @locked_page are unlocked in any case
- unlock == 0
- On success, all the pages are locked for writing out them
- On failure, all the pages except @locked_page are unlocked
The values are in one file so reading them at a single time will give a
more consistent view. The stats are internally tracked in nanoseconds so
the cumulative values should not suffer from rounding errors.
Writing 0 to the file 'commit_stats' will reset max_commit_ms.
Initial values are set at first mount of the filesystem.
Track several stats about transaction commit, to be later exported via
sysfs:
- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns
The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.
btrfs: remove extent writepage address space operation
Same as in commit 21b4ee7029c9 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.
We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.
David Sterba [Thu, 2 Jun 2022 13:40:46 +0000 (15:40 +0200)]
btrfs: send: remove old TODO regarding ERESTARTSYS
The whole send operation is restartable and handling properly a buffer
write may not be easy. We can't know what caused that and if a short
delay and retry will fix it or how many retries should be performed in
case it's a temporary condition.
The error value is returned to the ioctl caller so in case it's
transient problem, the user would be notified about the reason. Remove
the TODO note as there's no plan to handle ERESTARTSYS.
btrfs: increase direct io read size limit to 256 sectors
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da406 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef. That commit changes the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work, instead they
use a csums allocation on a per-btrfs_dio_private basis (which have their
own performance problem for small I/O, but that will be addressed later).
There is no fundamental limit in btrfs itself to limit the I/O size
except for the size of the checksum array that scales linearly with
the number of sectors in an I/O. Pick a somewhat arbitrary limit of
256 limits, which matches what the buffered reads typically see as
the upper limit as the limit for direct I/O as well.
This significantly improves direct read performance. For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:
Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, this leads to us to read all the
sectors from disk.
Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.
This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.
Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.
When we really need to recover certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).
This explains why we have no event raid56_scrub_read_recover()
triggered.
Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.
In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.
[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:
0 32K 64K
Data1: | Good | Good |
Data2: | Bad | Bad |
Parity: | Good | Good |
In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:
- Btrfs will cache sectors for unrelated vertical stripes.
In above example, if we're only writing into 0~32K range, btrfs will
still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
This behavior is to cache sectors for later update.
Incidentally commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio()
subpage compatible") has a bug which makes RAID56 to never trust the
cached sectors, thus slightly improve the situation for recovery.
Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
steal_rbio" will revert the behavior back to the old one.
- Btrfs raid56 partial write will update all P/Q sectors and cache them
This means, even if data at (64K ~ 96K) of Data2 is free space, and
only (96K ~ 128K) of Data2 is really stale data.
And we write into that (96K ~ 128K), we will update all the parity
sectors for the full stripe.
This unnecessary behavior will completely kill the chance of recovery.
Thankfully, an unrelated optimization "btrfs: only write the sectors
in the vertical stripe which has data stripes" will prevent
submitting the write bio for untouched vertical sectors.
That optimization will keep the on-disk P/Q untouched for a chance for
later recovery.
[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.
With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.
Now we really need to do is just drop all the cached sectors when doing
recovery.
By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keep the cache to speed up
regular write path.
In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.
The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.
So this patch will fix the behavior by not trust any cache in
__raid56_parity_recover(), to solve the problem while still keep the
cache useful.
But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do better damage control a little
better.
Related patches:
- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio
btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished
finish_func is always set to finish_ordered_fn, so remove it and also
the now pointless and somewhat confusingly named
__endio_write_update_ordered wrapper.
Nikolay Borisov [Fri, 17 Jun 2022 12:53:34 +0000 (15:53 +0300)]
btrfs: batch up release of reserved metadata for delayed items used for deletion
With Filipe's recent rework of the delayed inode code one aspect which
isn't batched is the release of the reserved metadata of delayed inode's
delete items. With this patch on top of Filipe's rework and running the
same test as provided in the description of a patch titled
"btrfs: improve batch deletion of delayed dir index items" I observe
the following change of the number of calls to btrfs_block_rsv_release:
Before this change:
- block_rsv_release: 1004
- btrfs_delete_delayed_items_total_time: 14602
- delete_batches: 505
Qu Wenruo [Mon, 13 Jun 2022 07:06:35 +0000 (15:06 +0800)]
btrfs: warn about dev extents that are inside the reserved range
Btrfs on-disk format has reserved the first 1MiB for the primary super
block (at 64KiB offset) and bootloaders may also use this space.
This behavior is only introduced since v4.1 btrfs-progs release,
although kernel can ensure we never touch the reserved range of super
blocks, it's better to inform the end users, and a balance will resolve
the problem.
Qu Wenruo [Mon, 13 Jun 2022 07:06:34 +0000 (15:06 +0800)]
btrfs: use named constant for reserved device space
There's a reserved space on each device of size 1MiB that can be used by
bootloaders or to avoid accidental overwrite. Use a symbolic constant
with the explaining comment instead of hard coding the value and
multiple comments.
Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
the primary super block (at offset 64KiB), until then the range could
have been used by mistake. Kernel has been always respecting the 1MiB
range for writes.
David Sterba [Mon, 6 Jun 2022 17:32:59 +0000 (19:32 +0200)]
btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino
There's only one function we pass to iterate_inodes_from_logical as
iterator, so we can drop the indirection and call it directly, after
moving the function to backref.c
David Sterba [Mon, 6 Jun 2022 17:06:17 +0000 (19:06 +0200)]
btrfs: simplify parameters of backref iterators
The inode reference iterator interface takes parameters that are derived
from the context parameter, but as it's a void* type the values are
passed individually.
Change the ctx type to inode_fs_path as it's the only thing we pass and
drop any parameters that are derived from that.
David Sterba [Mon, 6 Jun 2022 16:52:24 +0000 (18:52 +0200)]
btrfs: call inode_to_path directly and drop indirection
The functions for iterating inode reference take a function parameter
but there's only one value, inode_to_path(). Remove the indirection and
call the function. As paths_from_inode would become just an alias for
iterate_irefs(), merge the two into one function.
Qu Wenruo [Fri, 13 May 2022 08:34:31 +0000 (16:34 +0800)]
btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies()
For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
directly, only for RAID5 and RAID6 we need some extra handling as
there's no table value for that.
For RAID10 there's a change from sub_stripes to ncopies. The values are
the same but semantically we want to use number of copies, as this is
what btrfs_num_copies does.
Qu Wenruo [Fri, 13 May 2022 08:34:30 +0000 (16:34 +0800)]
btrfs: use btrfs_raid_array to calculate number of parity stripes
Use the raid table instead of hard coded values and rename the helper as
it is exported. This could make later extension on RAID56 based
profiles easier.
Qu Wenruo [Fri, 13 May 2022 08:34:28 +0000 (16:34 +0800)]
btrfs: remove parameter dev_extent_len from scrub_stripe()
For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.
Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.
David Sterba [Thu, 25 Jun 2020 17:03:41 +0000 (19:03 +0200)]
btrfs: unify tree search helper returning prev and next nodes
Simplify helper to return only next and prev pointers, we don't need all
the node/parent/prev/next pointers of __etree_search as there are now
other specialized helpers. Rename parameters so they follow the naming.
David Sterba [Thu, 25 Jun 2020 16:49:39 +0000 (18:49 +0200)]
btrfs: make tree search for insert more generic and use it for tree_search
With a slight extension of tree_search_for_insert (fill the return node
and parent return parameters) we can avoid calling __etree_search from
tree_search, that could be removed eventually in followup patches.
David Sterba [Thu, 25 Jun 2020 16:35:24 +0000 (18:35 +0200)]
btrfs: open code inexact rbtree search in tree_search
The call chain from
tree_search
tree_search_for_insert
__etree_search
can be open coded and allow further simplifications, here we need a tree
search with fallback to the next node in case it's not found. This is
represented as __etree_search parameters next_ret=valid, prev_ret=NULL.
David Sterba [Thu, 25 Jun 2020 16:11:31 +0000 (18:11 +0200)]
btrfs: add fast path for extent_state insertion
In two cases the exact location where to insert the extent state is
known at the call time so we don't need to pass it to insert_state that
takes the fast path.
David Sterba [Thu, 25 Jun 2020 15:54:54 +0000 (17:54 +0200)]
btrfs: pass bits by value not by pointer for extent_state helpers
The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.
Qu Wenruo [Thu, 2 Jun 2022 07:51:19 +0000 (15:51 +0800)]
btrfs: raid56: avoid double for loop inside __raid56_parity_recover()
The double for loop can be easily converted to single for loop as we're
really iterating the sectors in their bytenr order.
The only exception is the full stripe skip, however that can also easily
be done inside the loop. Add an ASSERT() along with a comment for that
specific case.
Qu Wenruo [Thu, 2 Jun 2022 07:51:18 +0000 (15:51 +0800)]
btrfs: raid56: avoid double for loop inside finish_rmw()
We can easily calculate the stripe number and sector number inside the
stripe. Thus there is not much need for a double for loop.
For the only case we want to skip the whole stripe, we can manually
increase @total_sector_nr.
This is not a recommended behavior, thus every time the iterator gets
modified there will be a comment along with an ASSERT() for it.
Josef Bacik [Mon, 13 Jun 2022 19:09:48 +0000 (15:09 -0400)]
btrfs: tree-log: make the return value for log syncing consistent
Currently we will return 1 or -EAGAIN if we decide we need to commit
the transaction rather than sync the log. In practice this doesn't
really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
commit the transaction. However this makes it hard to figure out what
the correct thing to do is.
Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
places where we want to force the transaction to be committed.
David Sterba [Mon, 6 Jun 2022 16:36:35 +0000 (18:36 +0200)]
btrfs: sysfs: advertise zoned support among features
We've hidden the zoned support in sysfs under debug config for the first
releases but now the stability is reasonable, though not all features
have been implemented.
btrfs: split discard handling out of btrfs_map_block
Mapping block for discard doesn't really share any code with the regular
block mapping case. Split it out into an entirely separate helper
that just returns an array of btrfs_discard_stripe structures and the
number of stripes.
This removes the need for the length field in the btrfs_io_context
structure, so remove tht.
btrfs: stop looking at btrfs_bio->iter in index_one_bio
All the bios that index_one_bio operates on are the bios submitted by the
upper layer. These are never resubmitted to an actual device by the
raid56 code, and thus the iter never changes from the initial state.
Thus we can always just use bi_iter directly as it will be the same as
the saved copy.
Then even if we can only mount it RO, we will still cause metadata
update for log replay:
BTRFS info (device dm-1): flagging fs with big metadata feature
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents
BTRFS info (device dm-1): start tree-log replay
This is definitely against RO compact flag requirement.
[CAUSE]
RO compact flag only forces us to do RO mount, but we will still do log
replay for plain RO mount.
Thus this will result us to do log replay and update metadata.
This can be very problematic for new RO compat flag, for example older
kernel can not understand v2 cache, and if we allow metadata update on
RO mount and invalidate/corrupt v2 cache.
[FIX]
Just reject the mount unless rescue=nologreplay is provided:
BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
We don't want to set rescue=nologreply directly, as this would make the
end user to read the old data, and cause confusion.
Since the such case is really rare, we're mostly fine to just reject the
mount with an error message, which also includes the proper workaround.
It turns out that, btrfs_super_block::log_root_transid is never really
utilized (even no read for it).
This can date back to the introduction of btrfs into upstream kernel.
In fact, when reading log tree root, we always use
btrfs_super_block::generation + 1 as the expected generation.
So here we're completely safe to mark this member deprecated.
In theory we can easily reuse this member for other purposes, but to be
extra safe, here we follow the leafsize way, by adding "__unused_" for
log_root_transid.
And we can safely remove the accessors, since there is no such callers
from the very beginning.
submit_one_bio always works on the bio and compression flags from a
btrfs_bio_ctrl structure. Pass the explicitly and clean up the
calling conventions by handling a NULL bio in submit_one_bio, and
using the btrfs_bio_ctrl to pass the mirror number as well.
Merge end_write_bio and flush_write_bio into a single submit_write_bio
helper, that either submits the bio or ends it if a negative errno was
passed in. This consolidates a lot of duplicated checks in the callers.
David Sterba [Mon, 27 Jul 2020 18:59:20 +0000 (20:59 +0200)]
btrfs: remove redundant check in up check_setget_bounds
There are two separate checks in the bounds checker, the first one being
a special case of the second. As this function is performance critical
due to checking access to any eb member, reducing the size can slightly
improve performance.
On a release build on x86_64 the helper is completely inlined so the
function call overhead is also gone.
There was a report of 5% performance drop on metadata heavy workload,
that disappeared after disabling asserts. The most significant part of
that is the bounds checker.
The optimizations are reducing the number of ifs to 1 and inlining the
hot path. Low-level stuff, gets the performance back. Patch below.
4. asserts off, no setget check
run time: 44s
run time with perf: 45s
This verifies that asserts other than the setget check have negligible
impact on performance and it's not harmful to keep them on.
Analysis where the performance is lost:
* check_setget_bounds is short function, but it's still a function call,
changing the flow of instructions and given how many times it's
called the overhead adds up
* there are two conditions, one to check if the range is
completely outside (member_offset > eb->len) or partially inside
(member_offset + size > eb->len)
btrfs: replace kmap() with kmap_local_page() in lzo.c
The use of kmap() is being deprecated in favor of kmap_local_page() where
it is feasible. With kmap_local_page(), the mapping is per thread, CPU
local and not globally visible.
Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the
mappings are per thread and not globally visible.
Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
btrfs: replace kmap() with kmap_local_page() in inode.c
The use of kmap() is being deprecated in favor of kmap_local_page() where
it is feasible. With kmap_local_page(), the mapping is per thread, CPU
local and not globally visible.
Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the
mappings are per thread and not globally visible.
Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
btrfs: do not allocate a btrfs_bio for low-level bios
The bios submitted from btrfs_map_bio don't really interact with the
rest of btrfs and the only btrfs_bio member actually used in the
low-level bios is the pointer to the btrfs_io_context used for endio
handler.
Use a union in struct btrfs_io_stripe that allows the endio handler to
find the btrfs_io_context and remove the spurious ->device assignment
so that a plain fs_bio_set bio can be used for the low-level bios
allocated inside btrfs_map_bio.
All reads bio that go through btrfs_map_bio need to be completed in
user context. And read I/Os are the most common and timing critical
in almost any file system workloads.
Embed a work_struct into struct btrfs_bio and use it to complete all
read bios submitted through btrfs_map, using the REQ_META flag to decide
which workqueue they are placed on.
This removes the need for a separate 128 byte allocation (typically
rounded up to 192 bytes by slab) for all reads with a size increase
of 24 bytes for struct btrfs_bio. Future patches will reorganize
struct btrfs_bio to make use of this extra space for writes as well.
(All sizes are based a on typical 64-bit non-debug build)
Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
We'll start relying on this flag inside of btrfs in a bit, and this
ensures it is always set correctly.
btrfs: don't use btrfs_bio_wq_end_io for compressed writes
Compressed write bio completion is the only user of btrfs_bio_wq_end_io
for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
here as we only real need user context for the final completion of a
compressed_bio structure, and not every single bio completion.
Add a work_struct to struct compressed_bio instead and use that to call
finish_compressed_bio_write. This allows to remove all handling of
write bios in the btrfs_bio_wq_end_io infrastructure.
btrfs: don't double-defer bio completions for compressed reads
The bio completion handler of the bio used for the compressed data is
already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule
the completion of the original bio to the same workqueue again but just
execute it directly.
btrfs: defer I/O completion based on the btrfs_raid_bio
Instead of attaching an extra allocation an indirect call to each
low-level bio issued by the RAID code, add a work_struct to struct
btrfs_raid_bio and only defer the per-rbio completion action. The
per-bio action for all the I/Os are trivial and can be safely done
from interrupt context.
As a nice side effect this also allows sharing the boilerplate code
for the per-bio completions
There is no exit block and cleanup and the function is reasonably short
so we can use inline return and not the goto. This makes the function
more straight forward.
Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of
duplicating the logic in the callers. Also remove the bio argument as
it always must be bioc->orig_bio and the now pointless bioc_error that
did nothing but assign bi_sector to the same value just sampled in the
caller.