Git Repo - linux.git/log

dm table: fix zoned iterate_devices based device capability checks

Fix dm_table_supports_zoned_model() and invert logic of both
iterate_devices_callout_fn so that all devices' zoned capabilities are
properly checked.

Add one more parameter to dm_table_any_dev_attr(), which is actually
used as the @data parameter of iterate_devices_callout_fn, so that
dm_table_matches_zone_sectors() can be replaced by
dm_table_any_dev_attr().

Fixes: dd88d313bef02 ("dm table: add zoned block devices validation")
Cc: [email protected]
Signed-off-by: Jeffle Xu <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm table: fix DAX iterate_devices based device capability checks

Fix dm_table_supports_dax() and invert logic of both
iterate_devices_callout_fn so that all devices' DAX capabilities are
properly checked.

Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: [email protected]
Signed-off-by: Jeffle Xu <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm table: fix iterate_devices based device capability checks

According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.

For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.

Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.

Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.

Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
Cc: [email protected]
Signed-off-by: Jeffle Xu <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm writecache: return the exact table values that were set

LVM doesn't like it when the target returns different values from what
was set in the constructor. Fix dm-writecache so that the returned
table values are exactly the same as requested values.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected] # v4.18+
Signed-off-by: Mike Snitzer <[email protected]>

dm crypt: support using trusted keys

Commit 27f5411a718c ("dm crypt: support using encrypted keys") extended
dm-crypt to allow use of "encrypted" keys along with "user" and "logon".

Along the same lines, teach dm-crypt to support "trusted" keys as well.

Signed-off-by: Ahmad Fatoum <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm crypt: replaced #if defined with IS_ENABLED

IS_ENABLED(CONFIG_ENCRYPTED_KEYS) is true whether the option is built-in
or a module, so use it instead of #if defined checking for each
separately.

The other #if was to avoid a static function defined, but unused
warning. As we now always build the callsite when the function
is defined, we can remove that first #if guard.

Suggested-by: Arnd Bergmann <[email protected]>
Signed-off-by: Ahmad Fatoum <[email protected]>
Acked-by: Dmitry Baryshkov <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm writecache: fix unnecessary NULL check warnings

Remove NULL checks before vfree() to fix these warnings:
./drivers/md/dm-writecache.c:2008:2-7: WARNING: NULL check before some
freeing functions is not needed.
./drivers/md/dm-writecache.c:2024:2-7: WARNING: NULL check before some
freeing functions is not needed.

Signed-off-by: Tian Tao <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm writecache: fix performance degradation in ssd mode

Fix a thinko in ssd_commit_superblock. region.count is in sectors, not
bytes. This bug doesn't corrupt data, but it causes performance
degradation.

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: dc8a01ae1dbd ("dm writecache: optimize superblock write")
Cc: [email protected] # v5.7+
Reported-by: J. Bruce Fields <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm integrity: introduce the "fix_hmac" argument

The "fix_hmac" argument improves security of internal_hash and
journal_mac:
- the section number is mixed to the mac, so that an attacker can't
  copy sectors from one journal section to another journal section
- the superblock is protected by journal_mac
- a 16-byte salt stored in the superblock is mixed to the mac, so
  that the attacker can't detect that two disks have the same hmac
  key and also to disallow the attacker to move sectors from one
  disk to another

Signed-off-by: Mikulas Patocka <[email protected]>
Reported-by: Daniel Glockner <[email protected]>
Signed-off-by: Lukas Bulwahn <[email protected]> # ReST fix
Tested-by: Milan Broz <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm persistent data: fix return type of shadow_root()

shadow_root() truncates 64-bit dm_block_t into 32-bit int. This is
not an issue in practice, since dm metadata as of v5.11 can only hold at
most 4161600 blocks (255 index entries * ~16k metadata blocks).

Nevertheless, this can confuse users debugging some specific data
corruption scenarios. Also, DM_SM_METADATA_MAX_BLOCKS may be bumped in
the future, or persistent-data may find its use in other places.

Therefore, switch the return type of shadow_root from int to dm_block_t.

Signed-off-by: Jinoh Kang <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm: cleanup of front padding calculation

Add two helper macros calculating the offset of bio in struct dm_io and
struct dm_target_io respectively.

Besides, simplify the front padding calculation in
dm_alloc_md_mempools().

Signed-off-by: Jeffle Xu <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm integrity: fix spelling mistake "flusing" -> "flushing"

There is a spelling mistake in a dm_integrity_io_error error
message. Fix it.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm crypt: Spelling s/cihper/cipher/

Fix a misspelling of "cipher".

Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

dm dust: remove h from printk format specifier

See Documentation/core-api/printk-formats.rst.

commit cbacb5ab0aa0 ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]")

Standard integer promotion is already done and %hx and %hhx is useless
so do not encourage the use of %hh[xudi] or %h[xudi].

Signed-off-by: Tom Rix <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

block: fix memory leak of bvec

bio_init() clears bio instance, so the bvec index has to be set after
bio_init(), otherwise bio->bi_io_vec may be leaked.

Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset")
Cc: Johannes Thumshirn <[email protected]>
Cc: Chaitanya Kulkarni <[email protected]>
Cc: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md: use rdev_read_only in restart_array

Make the read-only check in restart_array identical to the other two
read-only checks.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md: check for NULL ->meta_bdev before calling bdev_read_only

->meta_bdev is optional and not set for most arrays. Add a
rdev_read_only helper that calls bdev_read_only for both devices
in a safe way.

Fixes: 6f0d9689b670 ("block: remove the NULL bdev check in bdev_read_only")
Reported-by: Guoqing Jiang <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: drop removed argument from kernel-doc of blk_execute_rq()

Commit 684da7628d93 ("block: remove unnecessary argument from
blk_execute_rq") changes the signature of blk_execute_rq(), but misses
to adjust its kernel-doc.

Hence, make htmldocs warns on ./block/blk-exec.c:78:

warning: Excess function parameter 'q' description in 'blk_execute_rq'

Drop removed argument from kernel-doc of blk_execute_rq() as well.

Signed-off-by: Lukas Bulwahn <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: remove typo in kernel-doc of set_disk_ro()

Commit 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
provides some kernel-doc for set_disk_ro(), but introduces a small typo.

Hence, make htmldocs warns on ./block/genhd.c:1441:

warning: Function parameter or member 'read_only' not described in 'set_disk_ro'
warning: Excess function parameter 'ready_only' description in 'set_disk_ro'

Remove that typo in the kernel-doc for set_disk_ro().

Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk-cgroup: Remove obsolete macro

Remove the obsolete 'MAX_KEY_LEN' macro.

Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-core: check bdev value for NULL

The nvme-core sets the bdev to NULL when admin comamnd is issued from
IOCTL in the following path e.g. nvme list :-

block_ioctl()
blkdev_ioctl()
  nvme_ioctl()
   nvme_user_cmd()
    nvme_submit_user_cmd()

The commit 309dca309fc3 ("block: store a block_device pointer in struct bio")
now uses bdev unconditionally in the macro bio_set_dev() and assumes
that bdev value is not NULL which results in the following crash in
since thats where bdev is actually accessed :-

void bio_associate_blkg_from_css(struct bio *bio,
struct cgroup_subsys_state *css)
{
if (bio->bi_blkg)
blkg_put(bio->bi_blkg);

if (css && css->parent) {
bio->bi_blkg = blkg_tryget_closest(bio, css);
} else {
--------------> blkg_get(bio->bi_bdev->bd_disk->queue->root_blkg);
bio->bi_blkg = bio->bi_bdev->bd_disk->queue->root_blkg;
}
}
EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);

[  345.385947] BUG: kernel NULL pointer dereference, address: 0000000000000690
[  345.387103] #PF: supervisor read access in kernel mode
[  345.387894] #PF: error_code(0x0000) - not-present page
[  345.388756] PGD 162a2b067 P4D 162a2b067 PUD 1633eb067 PMD 0
[  345.389625] Oops: 0000 [#1] SMP NOPTI
[  345.390206] CPU: 15 PID: 4100 Comm: nvme Tainted: G           OE     5.11.0-rc5blk+ #141
[  345.391377] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba52764
[  345.393074] RIP: 0010:bio_associate_blkg_from_css.cold.47+0x58/0x21f

[  345.396362] RSP: 0018:ffffc90000dbbce8 EFLAGS: 00010246
[  345.397078] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[  345.398114] RDX: 0000000000000000 RSI: ffff888813be91f0 RDI: ffff888813be91f8
[  345.399039] RBP: ffffc90000dbbd30 R08: 0000000000000001 R09: 0000000000000001
[  345.399950] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff888812d32800
[  345.401031] R13: 0000000000000000 R14: ffff888113e51540 R15: ffff888113e51540
[  345.401976] FS:  00007f3747f1d780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
[  345.402997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.403737] CR2: 0000000000000690 CR3: 000000081a4bc000 CR4: 00000000003506e0
[  345.404685] Call Trace:
[  345.405031]  bio_associate_blkg+0x71/0x1c0
[  345.405649]  nvme_submit_user_cmd+0x1aa/0x38e [nvme_core]
[  345.406348]  nvme_user_cmd.isra.73.cold.98+0x54/0x92 [nvme_core]
[  345.407117]  nvme_ioctl+0x226/0x260 [nvme_core]
[  345.407707]  blkdev_ioctl+0x1c8/0x2b0
[  345.408183]  block_ioctl+0x3f/0x50
[  345.408627]  __x64_sys_ioctl+0x84/0xc0
[  345.409117]  do_syscall_64+0x33/0x40
[  345.409592]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  345.410233] RIP: 0033:0x7f3747632107

[  345.413125] RSP: 002b:00007ffe461b6648 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  345.414086] RAX: ffffffffffffffda RBX: 00000000007b7fd0 RCX: 00007f3747632107
[  345.414998] RDX: 00007ffe461b6650 RSI: 00000000c0484e41 RDI: 0000000000000004
[  345.415966] RBP: 0000000000000004 R08: 00000000007b7fe8 R09: 00000000007b9080
[  345.416883] R10: 00007ffe461b62c0 R11: 0000000000000206 R12: 00000000007b7fd0
[  345.417808] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000000

Add a NULL check before we set the bdev for bio.

This issue is found on block/for-next tree.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

mm: only make map_swap_entry available for CONFIG_HIBERNATION

Current tree spews this on compile:

mm/swapfile.c:2290:17: warning: ‘map_swap_entry’ defined but not used [-Wunused-function]
2290 | static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
| ^~~~~~~~~~~~~~

if !CONFIG_HIBERNATION, as we don't use the function unless we have that
config option set.

Fixes: 48d15436fde6 ("mm: remove get_swap_bio")
Signed-off-by: Jens Axboe <[email protected]>

mm: remove get_swap_bio

Just reuse the block_device and sector from the swap_info structure,
just as used by the SWP_SYNCHRONOUS path. Also remove the checks for
NULL returns from bio_alloc as that can't happen for sleeping
allocations.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nilfs2: remove cruft in nilfs_alloc_seg_bio

bio_alloc never returns NULL when it can sleep.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nfs/blocklayout: remove cruft in bl_alloc_init_bio

bio_alloc never returns NULL when it can sleep.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md/raid6: refactor raid5_read_one_chunk

Refactor raid5_read_one_chunk so that all simple checks are done
before allocating the bio.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Song Liu <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md: remove md_bio_alloc_sync

md_bio_alloc_sync is never called with a NULL mddev, and ->sync_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Song Liu <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md: simplify sync_page_io

Use an on-stack bio and biovec for the single page synchronous I/O.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Song Liu <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

md: remove bio_alloc_mddev

bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is
initialized in md_run, so it always must be initialized as well. Just
open code the remaining call to bio_alloc_bioset.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Song Liu <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

drbd: remove drbd_req_make_private_bio

Open code drbd_req_make_private_bio in the two callers to prepare
for further changes. Also don't bother to initialize bi_next as the
bio code already does that that.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

drbd: remove bio_alloc_drbd

Given that drbd_md_io_bio_set is initialized during module initialization
and the module fails to load if the initialization fails there is no need
to fall back to plain bio_alloc.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

f2fs: remove FAULT_ALLOC_BIO

Sleeping bio allocations do not fail, which means that injecting an error
into sleeping bio allocations is a little silly.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

f2fs: use blkdev_issue_flush in __submit_flush_wait

Use the blkdev_issue_flush helper instead of duplicating it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

dm-clone: use blkdev_issue_flush in commit_metadata

Use blkdev_issue_flush instead of open coding it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: use an on-stack bio in blkdev_issue_flush

There is no point in allocating memory for a synchronous flush.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: split bio_kmalloc from bio_alloc_bioset

bio_kmalloc shares almost no logic with the bio_set based fast path
in bio_alloc_bioset. Split it into an entirely separate implementation.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk-crypto: use bio_kmalloc in blk_crypto_clone_bio

Use bio_kmalloc instead of open coding it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

btrfs: use bio_kmalloc in __alloc_device

Use bio_kmalloc instead of open coding it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

zonefs: use bio_alloc in zonefs_file_dio_append

Use bio_alloc instead of open coding it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bfq: Use only idle IO periods for think time calculations

Currently whenever bfq queue has a request queued we add now -
last_completion_time to the think time statistics. This is however
misleading in case the process is able to submit several requests in
parallel because e.g. if the queue has request completed at time T0 and
then queues new requests at times T1, T2, then we will add T1-T0 and
T2-T0 to think time statistics which just doesn't make any sence (the
queue's think time is penalized by the queue being able to submit more
IO). So add to think time statistics only time intervals when the queue
had no IO pending.

Signed-off-by: Jan Kara <[email protected]>
Acked-by: Paolo Valente <[email protected]>
[axboe: fix whitespace on empty line]
Signed-off-by: Jens Axboe <[email protected]>

bfq: Use 'ttime' local variable

Use local variable 'ttime' instead of dereferencing bfqq.

Signed-off-by: Jan Kara <[email protected]>
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bfq: Avoid false bfq queue merging

bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:

[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read

[reader]
numjobs=4
directory=/mnt

where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).

Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.

Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging")
CC: [email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blkcg: delete redundant get/put operations for queue

When calling blkcg_schedule_throttle(), for the same queue,
redundant get/put operations can be removed.

Signed-off-by: Chunguang Xu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: unexport truncate_bdev_range

truncate_bdev_range is only used in always built-in block layer code,
so remove the export and the !CONFIG_BLOCK stub.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk: wbt: remove unused parameter from wbt_should_throttle

The first parameter rwb is not used for this function.
So just remove it.

Signed-off-by: Lei Chen <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bdev: Do not return EBUSY if bdev discard races with write

blkdev_fallocate() tries to detect whether a discard raced with an
overlapping write by calling invalidate_inode_pages2_range(). However
this check can give both false negatives (when writing using direct IO
or when writeback already writes out the written pagecache range) and
false positives (when write is not actually overlapping but ends in the
same page when blocksize < pagesize). This actually causes issues for
qemu which is getting confused by EBUSY errors.

Fix the problem by removing this conflicting write detection since it is
inherently racy and thus of little use anyway.

Reported-by: Maxim Levitsky <[email protected]>
CC: "Darrick J. Wong" <[email protected]>
Link: https://lore.kernel.org/qemu-devel/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: inherit BIO_REMAPPED when cloning bios

Cloned bios are can be used to on the same device, in which case we need
to inherit the BIO_REMAPPED flag to avoid a double partition remap. When
the cloned bios are used on another device, bio_set_dev will clear the flag.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bcache: use bio_set_dev to assign ->bi_bdev

Always use the bio_set_dev helper to assign ->bi_bdev to make sure
other state related to the device is uptodate.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: use bio_set_dev to assign ->bi_bdev

Always use the bio_set_dev helper to assign ->bi_bdev to make sure
other state related to the device is uptodate.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bfq: bfq_check_waker() should be static

It's only used in the same file, mark is appropriately static.

Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: make waker-queue detection more robust

In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commits addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet waker requirements
three times before being promoted to actual waker.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: save also injection state on queue merging

To prevent injection information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: save also weight-raised service on queue merging

To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: fix switch back from soft-rt weitgh-raising

A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to loose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: re-evaluate convenience of I/O plugging on rq arrivals

Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.

To address this issue, this commit makes such a check, and stops
plugging I/O if it is better to stop plugging I/O.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: replace mechanism for evaluating I/O intensity

Some BFQ mechanisms make their decisions on a bfq_queue basing also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commits replaces this logic with a more effective one.

The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if the latter if this
percentage is above a fixed threshold.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: skip bio_check_eod for partition-remapped bios

When an already remapped bio is resubmitted (e.g. by blk_queue_split),
bio_check_eod will compare the remapped bi_sector against the size
of the partition, leading to spurious I/O failures.

Skip the EOD check in this case.

Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
Reported-by: Jens Axboe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bio: don't copy bvec for direct IO

The block layer spends quite a while in blkdev_direct_IO() to copy and
initialise bio's bvec. However, if we've already got a bvec in the input
iterator it might be reused in some cases, i.e. when new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
performance boost, and it also reduces memory footprint.

Suggested-by: Matthew Wilcox <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bio: add a helper calculating nr segments to alloc

Add a helper function calculating the number of bvec segments we need to
allocate to construct a bio. It doesn't change anything functionally,
but will be used to not duplicate special cases in the future.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

iov_iter: optimise bvec iov_iter_advance()

iov_iter_advance() is heavily used, but implemented through generic
means. For bvecs there is a specifically crafted function for that, so
use bvec_iter_advance() instead, it's faster and slimmer.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

target/file: allocate the bvec array as part of struct target_core_file_cmd

This saves one memory allocation, and ensures the bvecs aren't freed
before the AIO completion. This will allow the lower level code to be
optimized so that it can avoid allocating another bvec array.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block/psi: remove PSI annotations from direct IO

Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out BIO_WORKINGSET flag. Do same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().

Reported-by: Christoph Hellwig <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bvec/iter: disallow zero-length segment bvecs

zero-length bvec segments are allowed in general, but not handled by bio
and down the block layer so filtered out. This inconsistency may be
confusing and prevent from optimisations. As zero-length segments are
useless and places that were generating them are patched, declare them
not allowed.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

splice: don't generate zero-len segement bvecs

iter_file_splice_write() may spawn bvec segments with zero-length. In
preparation for prohibiting them, filter out by hand at splice level.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: remove unnecessary argument from blk_execute_rq

We can remove 'q' from blk_execute_rq as well after the previous change
in blk_execute_rq_nowait.

And more importantly it never really was needed to start with given
that we can trivial derive it from struct request.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Ulf Hansson <[email protected]> # for mmc
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: remove unnecessary argument from blk_execute_rq_nowait

The 'q' is not used since commit a1ce35fa4985 ("block: remove dead
elevator code"), also update the comment of the function.

And more importantly it never really was needed to start with given
that we can trivial derive it from struct request.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bsg: free the request before return error code

Free the request rq before returning error code.

Fixes: 972248e9111e ("scsi: bsg-lib: handle bidi requests without block layer help")
Signed-off-by: Pan Bian <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bcache: don't pass BIOSET_NEED_BVECS for the 'bio_set' embedded in 'cache_set'

This bioset is just for allocating bio only from bio_next_split, and it
needn't bvecs, so remove the flag.

Cc: [email protected]
Cc: Coly Li <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Acked-by: Coly Li <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: move three bvec helpers declaration into private helper

bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block
layer core functions, no need to declare them in public header.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: set .bi_max_vecs as actual allocated vector number

bvec_alloc() may allocate more bio vectors than requested, so set
.bi_max_vecs as actual allocated vector number, instead of the requested
number. This way can help fs build bigger bio because new bio often won't
be allocated until the current one becomes full.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: don't allocate inline bvecs if this bioset needn't bvecs

The inline bvecs won't be used if user needn't bvecs by not passing
BIOSET_NEED_BVECS, so don't allocate bvecs in this situation.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Tested-by: Pavel Begunkov <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: don't pass BIOSET_NEED_BVECS for q->bio_split

q->bio_split is only used by bio_split() for fast cloning bio, and no
need to allocate bvecs, so remove this flag.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Tested-by: Pavel Begunkov <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: manage bio slab cache by xarray

Managing bio slab cache via xarray by using slab cache size as xarray
index, and storing 'struct bio_slab' instance into xarray.

So code is simplified a lot, meantime it becomes more readable than before.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Tested-by: Pavel Begunkov <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

bfq: don't duplicate code for different paths

As we can see, returns parent_sched_may_change whether
sd->next_in_service changes or not, so remove this judgment.

Signed-off-by: huhai <[email protected]>
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues

Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for
a queue with multiple HW queues, the performance it rather bad. The
problem is that these IO schedulers use queue-wide locking and their
dispatch function does not respect the hctx it is passed in and returns
any request it finds appropriate. Thus locality of request access is
broken and dispatch from multiple CPUs just contends on IO scheduler
locks. For these IO schedulers there's little point in dispatching from
multiple CPUs. Instead dispatch always only from a single CPU to limit
contention.

Below is a comparison of dbench runs on XFS filesystem where the storage
is a raid card with 64 HW queues and to it attached a single rotating
disk. BFQ is used as IO scheduler:

      clients           MQ                     SQ             MQ-Patched
Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*

Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is
the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
advertises just a single HW queue. MQ-Patched is a kernel with this
patch applied.

You can see multiple hardware queues heavily hurt performance in
combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Revert "blk-mq, elevator: Count requests per hctx to improve performance"

This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.

Since both mq-deadline and BFQ completely ignore hctx they are passed to
their dispatch function and dispatch whatever request they deem fit
checking whether any request for a particular hctx is queued is just
pointless since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: do not expire a queue when it is the only busy one

This commits preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
case even already dispatched requests of bfqq to be delayed, inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queues. By doing so, I/O-dispatch plugging is performed for bfqq.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: avoid spurious switches to soft_rt of interactive queues

BFQ tags some bfq_queues as interactive or soft_rt if it deems that
these bfq_queues contain the I/O of, respectively, interactive or soft
real-time applications. BFQ privileges both these special types of
bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ
mainly raises the weight of the bfq_queue. In particular, soft_rt
bfq_queues get a higher weight than interactive bfq_queues.

A bfq_queue may turn from interactive to soft_rt. And this leads to a
tricky issue. Soft real-time applications usually start with an
I/O-bound, interactive phase, in which they load themselves into main
memory. BFQ correctly detects this phase, and keeps the bfq_queues
associated with the application in interactive mode for a
while. Problems arise when the I/O pattern of the application finally
switches to soft real-time. One of the conditions for a bfq_queue to
be deemed as soft_rt is that the bfq_queue does not consume too much
bandwidth. But the bfq_queues associated with a soft real-time
application consume as much bandwidth as they can in the loading phase
of the application. So, after the application becomes truly soft
real-time, a lot of time should pass before the average bandwidth
consumed by its bfq_queues finally drops to a value acceptable for
soft_rt bfq_queues. As a consequence, there might be a time gap during
which the application is not privileged at all, because its bfq_queues
are not interactive any longer, but cannot be deemed as soft_rt yet.

To avoid this problem, BFQ pretends that an interactive bfq_queue
consumes zero bandwidth, and allows an interactive bfq_queue to switch
to soft_rt. Yet, this fake zero-bandwidth consumption easily causes
the bfq_queue to often switch to soft_rt deceptively, during its
loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth
correctly computed, and therefore soon switches back to
interactive. Then it switches again to soft_rt, and so on. These
spurious fluctuations usually cause losses of throughput, because they
deceive BFQ's mechanisms for boosting throughput (injection,
I/O-plugging avoidance, ...).

This commit addresses this issue as follows:
1) It does compute actual bandwidth consumption also for interactive
   bfq_queues. This avoids the above false positives.
2) When a bfq_queue switches from interactive to normal mode, the
   consumed bandwidth is reset (forgotten). This allows the
   bfq_queue to enjoy soft_rt very quickly. In particular, two
   alternatives are possible in this switch:
    - the bfq_queue still has backlog, and therefore there is a budget
      already scheduled to serve the bfq_queue; in this case, the
      scheduling of the current budget of the bfq_queue is not
      hindered, because only the scheduling of the next budget will
      be affected by the weight drop. After that, if the bfq_queue is
      actually in a soft_rt phase, and becomes empty during the
      service of its current budget, which is the natural behavior of
      a soft_rt bfq_queue, then the bfq_queue will be considered as
      soft_rt when its next I/O arrives. If, in contrast, the
      bfq_queue remains constantly non-empty, then its next budget
      will be scheduled with a low weight, which is the natural
      treatment for an I/O-bound (non soft_rt) bfq_queue.
    - the bfq_queue is empty; in this case, the bfq_queue may be
      considered unjustly soft_rt when its new I/O arrives. Yet
      the problem is now much smaller than before, because it is
      unlikely that more than one spurious fluctuation occurs.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: do not raise non-default weights

BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such an I/O. Yet, if also the user changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing the same weight.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: increase time window for waker detection

Tests on slower machines showed current window to be way too
small. This commit increases it.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this goal, the value
pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.

This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.

This commit corrects this mistake.

Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block, bfq: use half slice_idle as a threshold to check short ttime

The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time. In this respect, a good value of this timeout for rotational
drives is un the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.

Tested-by: Jan Kara <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: use an xarray for disk->part_tbl

Now that no fast path lookups in the partition table are left, there is
no point in micro-optimizing the data structure for it. Just use a bog
standard xarray.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: remove DISK_PITER_REVERSE

There is good reason to iterate backwards when deleting all partitions in
del_gendisk, just like we don't in blk_drop_partitions.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: add a disk_uevent helper

Add a helper to call kobject_uevent for the disk and all partitions, and
unexport the disk_part_iter_* helpers that are now only used in the core
block code.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blk-mq: use ->bi_bdev for I/O accounting

Remove the reverse map from a sector to a partition for I/O accounting by
simply using ->bi_bdev.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: use ->bi_bdev for bio based I/O accounting

Rework the I/O accounting for bio based drivers to use ->bi_bdev. This
means all drivers can now simply use bio_start_io_acct to start
accounting, and it will take partitions into account automatically. To
end I/O account either bio_end_io_acct can be used if the driver never
remaps I/O to a different device, or bio_end_io_acct_remapped if the
driver did remap the I/O.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: do not reassig ->bi_bdev when partition remapping

There is no good reason to reassign ->bi_bdev when remapping the
partition-relative block number to the device wide one, as all the
information required by the drivers comes from the gendisk anyway.

Keeping the original ->bi_bdev alive will allow to greatly simplify
the partition-away I/O accounting.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: simplify submit_bio_checks a bit

Merge a few checks for whole devices vs partitions to streamline the
sanity checks.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: store a block_device pointer in struct bio

Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

dcssblk: remove the end of device check in dcssblk_submit_bio

The block layer already checks for this conditions in bio_check_eod
before calling the driver.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

brd: remove the end of device check in brd_do_bvec

The block layer already checks for this conditions in bio_check_eod
before calling the driver.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: allow revalidate to set a namespace read-only

Unconditionally call set_disk_ro now that it only updates the hardware
state. This allows to properly set up the Linux devices read-only when
the controller turns a previously writable namespace read-only.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

rbd: remove the ->set_read_only method

Now that the hardware read-only state can't be changed by the BLKROSET
ioctl, the code in this method is not required anymore.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Ilya Dryomov <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: propagate BLKROSET on the whole device to all partitions

Change the policy so that a BLKROSET on the whole device also affects
partitions.  To quote Martin K. Petersen:

It's very common for database folks to twiddle the read-only state of
block devices and partitions. I know that our users will find it very
counter-intuitive that setting /dev/sda read-only won't prevent writes
to /dev/sda1.

The existing behavior is inconsistent in the sense that doing:

  # blockdev --setro /dev/sda
  # echo foo > /dev/sda1

permits writes. But:

  # blockdev --setro /dev/sda
  <something triggers revalidate>
  # echo foo > /dev/sda1

doesn't.

And a subsequent:

  # blockdev --setrw /dev/sda
  # echo foo > /dev/sda1

doesn't work either since sda1's read-only policy has been inherited
from the whole-disk device.

You need to do:

  # blockdev --rereadpt

after setting the whole-disk device rw to effectuate the same change on
the partitions, otherwise they are stuck being read-only indefinitely.

However, setting the read-only policy on a partition does *not* require
the revalidate step. As a matter of fact, doing the revalidate will blow
away the policy setting you just made.

So the user needs to take different actions depending on whether they
are trying to read-protect a whole-disk device or a partition. Despite
using the same ioctl. That is really confusing.

I have lost count how many times our customers have had data clobbered
because of ambiguity of the existing whole-disk device policy. The
current behavior violates the principle of least surprise by letting the
user think they write protected the whole disk when they actually
didn't.

Suggested-by: Martin K. Petersen <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: add a hard-readonly flag to struct gendisk

Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading
partition") addressed a long-standing problem with user read-only
policy being overridden as a result of a device-initiated revalidate.
The commit has since been reverted due to a regression that left some
USB devices read-only indefinitely.

To fix the underlying problems with revalidate we need to keep track
of hardware state and user policy separately.

The gendisk has been updated to reflect the current hardware state set
by the device driver. This is done to allow returning the device to
the hardware state once the user clears the BLKROSET flag.

The resulting semantics are as follows:

- If BLKROSET sets a given partition read-only, that partition will
   remain read-only even if the underlying storage stack initiates a
   revalidate. However, the BLKRRPART ioctl will cause the partition
   table to be dropped and any user policy on partitions will be lost.

- If BLKROSET has not been set, both the whole disk device and any
   partitions will reflect the current write-protect state of the
   underlying device.

Based on a patch from Martin K. Petersen <[email protected]>.

Reported-by: Oleksii Kurochko <[email protected]>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: remove the NULL bdev check in bdev_read_only

Only a single caller can end up in bdev_read_only, so move the check
there.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

dm: use bdev_read_only to check if a device is read-only

dm-thin and dm-cache also work on partitions, so use the proper
interface to check if the device is read-only.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Linux 5.11-rc5

Merge tag 'sh-for-5.11' of git://git.libc.org/linux-sh

Pull arch/sh updates from Rich Felker:
"Cleanup and warning fixes"

* tag 'sh-for-5.11' of git://git.libc.org/linux-sh:
  sh/intc: Restore devm_ioremap() alignment
  sh: mach-sh03: remove duplicate include
  arch: sh: remove duplicate include
  sh: Drop ARCH_NR_GPIOS definition
  sh: Remove unused HAVE_COPY_THREAD_TLS macro
  sh: remove CONFIG_IDE from most defconfig
  sh: mm: Convert to DEFINE_SHOW_ATTRIBUTE
  sh: intc: Convert to DEFINE_SHOW_ATTRIBUTE
  arch/sh: hyphenate Non-Uniform in Kconfig prompt
  sh: dma: fix kconfig dependency for G2_DMA