From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 26 Jul 2019 17:32:12 +0000 (-0700)
Subject: Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block
X-Git-Tag: v5.3-rc2~25
X-Git-Url: https://repo.jachan.dev/linux.git/commitdiff_plain/04412819652fe30f900d11e96c67b4adfdf17f6b?hp=-c

Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Several io_uring fixes/improvements:
     - Blocking fix for O_DIRECT (me)
     - Latter page slowness for registered buffers (me)
     - Fix poll hang under certain conditions (me)
     - Defer sequence check fix for wrapped rings (Zhengyuan)
     - Mismatch in async inc/dec accounting (Zhengyuan)
     - Memory ordering issue that could cause stall (Zhengyuan)
     - Track sequential defer in bytes, not pages (Zhengyuan)

 - NVMe pull request from Christoph

 - Set of hang fixes for wbt (Josef)

 - Redundant error message kill for libahci (Ding)

 - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

 - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

 - blkcg ->pd_stat() non-debug print (Tejun)

 - bcache memory leak fix (Wei)

 - Comment fix (Akinobu)

 - BFQ perf regression fix (Paolo)

* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
  io_uring: ensure ->list is initialized for poll commands
  Revert "nvme-pci: don't create a read hctx mapping without read queues"
  nvme: fix multipath crash when ANA is deactivated
  nvme: fix memory leak caused by incorrect subsystem free
  nvme: ignore subnqn for ADATA SX6000LNP
  drbd: dynamically allocate shash descriptor
  block: blk-mq: Remove blk_mq_sched_started_request and started_request
  bcache: fix possible memory leak in bch_cached_dev_run()
  io_uring: track io length in async_list based on bytes
  io_uring: don't use iov_iter_advance() for fixed buffers
  block: properly handle IOCB_NOWAIT for async O_DIRECT IO
  blk-mq: allow REQ_NOWAIT to return an error inline
  io_uring: add a memory barrier before atomic_read
  rq-qos: use a mb for got_token
  rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
  rq-qos: don't reset has_sleepers on spurious wakeups
  rq-qos: fix missed wake-ups in rq_qos_throttle
  wait: add wq_has_single_sleeper helper
  block, bfq: check also in-flight I/O in dispatch plugging
  block: fix sysfs module parameters directory path in comment
  ...
---
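
Note on the io_sequence_defer() hunk in fs/io_uring.c: the check changes
from ">" to "!=". The user-space sketch below illustrates why a plain
greater-than comparison can misfire once the counters wrap. It assumes all
three values are free-running 32-bit unsigned counters (the diff only shows
that "sequence" is a u32; the other two types are an assumption), and the
helper names defer_gt() and defer_ne() are hypothetical, not kernel
functions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old-style check: defer while the request's sequence is numerically
 * greater than the number of completed plus dropped entries.
 */
static bool defer_gt(uint32_t sequence, uint32_t cq_tail, uint32_t dropped)
{
	return sequence > (uint32_t)(cq_tail + dropped);
}

/*
 * New-style check: defer until the two counters are exactly equal.
 * Equality is unaffected by 32-bit wraparound.
 */
static bool defer_ne(uint32_t sequence, uint32_t cq_tail, uint32_t dropped)
{
	return sequence != (uint32_t)(cq_tail + dropped);
}
```

With sequence = 2 and cq_tail = UINT32_MAX - 1, four completions are still
outstanding, yet defer_gt() returns false because the sequence has already
wrapped. defer_ne() keeps returning true until the sum also wraps to 2.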

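Note on the io_async_list_note() hunk in fs/io_uring.c: the old code shifted
each request length down to pages before adding it to the running total, so
any request smaller than a page contributed zero and the limit could never be
reached. The sketch below replays a stream of equal-sized requests through
both accounting schemes. It is a simplified user-space model, not the kernel
code: SKETCH_PAGE_SHIFT assumes 4 KiB pages, and resets_by_pages() and
resets_by_bytes() are hypothetical helpers that only count how often the
sequential state would be reset.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Page-based accounting: lengths are truncated to whole pages. */
static size_t resets_by_pages(size_t n, size_t len, size_t max_pages)
{
	size_t io_pages = 0, resets = 0;

	for (size_t i = 0; i < n; i++) {
		size_t pages = len >> SKETCH_PAGE_SHIFT;

		if (io_pages + pages <= max_pages) {
			io_pages += pages;
		} else {
			io_pages = 0;	/* limit exceeded, reset the state */
			resets++;
		}
	}
	return resets;
}

/* Byte-based accounting: every byte counts toward the limit. */
static size_t resets_by_bytes(size_t n, size_t len, size_t max_pages)
{
	size_t max_bytes = max_pages << SKETCH_PAGE_SHIFT;
	size_t io_len = 0, resets = 0;

	for (size_t i = 0; i < n; i++) {
		if (io_len + len <= max_bytes) {
			io_len += len;
		} else {
			io_len = 0;	/* limit exceeded, reset the state */
			resets++;
		}
	}
	return resets;
}
```

For 100 requests of 512 bytes with a limit of 8 pages, the page-based model
never resets, while the byte-based model resets once the running total would
pass 32768 bytes. For page-sized requests the two models agree.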
04412819652fe30f900d11e96c67b4adfdf17f6b
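
Note on the io_import_fixed() hunk in fs/io_uring.c: the new comment explains
that, because every bvec except possibly the first and last is PAGE_SIZE
long, the target segment can be computed directly instead of walking the
segments one by one. The sketch below shows only that index arithmetic in
user space and checks it against a linear walk. It does not model the
iov_iter count adjustment. SKETCH_PAGE_SHIFT (4 KiB pages), struct seg_pos,
locate_linear() and locate_direct() are all hypothetical names introduced for
illustration. Unlike the kernel code, which hands the first-segment case to
iov_iter_advance(), the sketch folds that case into locate_direct().

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE ((size_t)1 << SKETCH_PAGE_SHIFT)

struct seg_pos {
	size_t index;	/* which segment the offset lands in */
	size_t offset;	/* byte offset inside that segment */
};

/* Reference: step over each segment, cost grows with the offset. */
static struct seg_pos locate_linear(const size_t *seg_len, size_t nr_segs,
				    size_t offset)
{
	struct seg_pos pos = { 0, 0 };

	while (pos.index < nr_segs && offset >= seg_len[pos.index]) {
		offset -= seg_len[pos.index];
		pos.index++;
	}
	pos.offset = offset;
	return pos;
}

/*
 * Direct: skip the (possibly short) first segment, then use a shift and
 * a mask, relying on all middle segments being exactly one page long.
 */
static struct seg_pos locate_direct(size_t first_len, size_t offset)
{
	struct seg_pos pos;

	if (offset < first_len) {
		pos.index = 0;
		pos.offset = offset;
	} else {
		offset -= first_len;
		pos.index = 1 + (offset >> SKETCH_PAGE_SHIFT);
		pos.offset = offset & (SKETCH_PAGE_SIZE - 1);
	}
	return pos;
}
```

For segment lengths { 1024, 4096, 4096, 4096, 512 } and offset 13317, both
functions land in segment 4 at byte 5, but locate_direct() does so in
constant time.
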
diff --combined block/bfq-iosched.c
index 72860325245a,b627e3fc6d53..586fcfe227ea
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@@ -17,7 -17,7 +17,7 @@@
   * low-latency capabilities. BFQ also supports full hierarchical
   * scheduling through cgroups. Next paragraphs provide an introduction
   * on BFQ inner workings. Details on BFQ benefits, usage and
 - * limitations can be found in Documentation/block/bfq-iosched.txt.
 + * limitations can be found in Documentation/block/bfq-iosched.rst.
   *
   * BFQ is a proportional-share storage-I/O scheduling algorithm based
   * on the slice-by-slice service scheme of CFQ. But BFQ assigns
@@@ -3354,38 -3354,57 +3354,57 @@@ static void bfq_dispatch_remove(struct 
   * there is no active group, then the primary expectation for
   * this device is probably a high throughput.
   *
-  * We are now left only with explaining the additional
-  * compound condition that is checked below for deciding
-  * whether the scenario is asymmetric. To explain this
-  * compound condition, we need to add that the function
+  * We are now left only with explaining the two sub-conditions in the
+  * additional compound condition that is checked below for deciding
+  * whether the scenario is asymmetric. To explain the first
+  * sub-condition, we need to add that the function
   * bfq_asymmetric_scenario checks the weights of only
-  * non-weight-raised queues, for efficiency reasons (see
-  * comments on bfq_weights_tree_add()). Then the fact that
-  * bfqq is weight-raised is checked explicitly here. More
-  * precisely, the compound condition below takes into account
-  * also the fact that, even if bfqq is being weight-raised,
-  * the scenario is still symmetric if all queues with requests
-  * waiting for completion happen to be
-  * weight-raised. Actually, we should be even more precise
-  * here, and differentiate between interactive weight raising
-  * and soft real-time weight raising.
+  * non-weight-raised queues, for efficiency reasons (see comments on
+  * bfq_weights_tree_add()). Then the fact that bfqq is weight-raised
+  * is checked explicitly here. More precisely, the compound condition
+  * below takes into account also the fact that, even if bfqq is being
+  * weight-raised, the scenario is still symmetric if all queues with
+  * requests waiting for completion happen to be
+  * weight-raised. Actually, we should be even more precise here, and
+  * differentiate between interactive weight raising and soft real-time
+  * weight raising.
+  *
+  * The second sub-condition checked in the compound condition is
+  * whether there is a fair amount of already in-flight I/O not
+  * belonging to bfqq. If so, I/O dispatching is to be plugged, for the
+  * following reason. The drive may decide to serve in-flight
+  * non-bfqq's I/O requests before bfqq's ones, thereby delaying the
+  * arrival of new I/O requests for bfqq (recall that bfqq is sync). If
+  * I/O-dispatching is not plugged, then, while bfqq remains empty, a
+  * basically uncontrolled amount of I/O from other queues may be
+  * dispatched too, possibly causing the service of bfqq's I/O to be
+  * delayed even longer in the drive. This problem gets more and more
+  * serious as the speed and the queue depth of the drive grow,
+  * because, as these two quantities grow, the probability to find no
+  * queue busy but many requests in flight grows too. By contrast,
+  * plugging I/O dispatching minimizes the delay induced by already
+  * in-flight I/O, and enables bfqq to recover the bandwidth it may
+  * lose because of this delay.
   *
   * As a side note, it is worth considering that the above
-  * device-idling countermeasures may however fail in the
-  * following unlucky scenario: if idling is (correctly)
-  * disabled in a time period during which all symmetry
-  * sub-conditions hold, and hence the device is allowed to
-  * enqueue many requests, but at some later point in time some
-  * sub-condition stops to hold, then it may become impossible
-  * to let requests be served in the desired order until all
-  * the requests already queued in the device have been served.
+  * device-idling countermeasures may however fail in the following
+  * unlucky scenario: if I/O-dispatch plugging is (correctly) disabled
+  * in a time period during which all symmetry sub-conditions hold, and
+  * therefore the device is allowed to enqueue many requests, but at
+  * some later point in time some sub-condition stops to hold, then it
+  * may become impossible to make requests be served in the desired
+  * order until all the requests already queued in the device have been
+  * served. The last sub-condition commented above somewhat mitigates
+  * this problem for weight-raised queues.
   */
  static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
  						 struct bfq_queue *bfqq)
  {
  	return (bfqq->wr_coeff > 1 &&
- 		bfqd->wr_busy_queues <
- 		bfq_tot_busy_queues(bfqd)) ||
+ 		(bfqd->wr_busy_queues <
+ 		 bfq_tot_busy_queues(bfqd) ||
+ 		 bfqd->rq_in_driver >=
+ 		 bfqq->dispatched + 4)) ||
  		bfq_asymmetric_scenario(bfqd, bfqq);
  }
  
diff --combined fs/block_dev.c
index 4707dfff991b,5dc613eec4d2..c2a85b587922
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@@ -26,7 -26,6 +26,7 @@@
  #include <linux/writeback.h>
  #include <linux/mpage.h>
  #include <linux/mount.h>
 +#include <linux/pseudo_fs.h>
  #include <linux/uio.h>
  #include <linux/namei.h>
  #include <linux/log2.h>
@@@ -345,15 -344,24 +345,24 @@@ __blkdev_direct_IO(struct kiocb *iocb, 
  	struct bio *bio;
  	bool is_poll = (iocb->ki_flags & IOCB_HIPRI) != 0;
  	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+ 	bool nowait = (iocb->ki_flags & IOCB_NOWAIT) != 0;
  	loff_t pos = iocb->ki_pos;
  	blk_qc_t qc = BLK_QC_T_NONE;
- 	int ret = 0;
+ 	gfp_t gfp;
+ 	ssize_t ret;
  
  	if ((pos | iov_iter_alignment(iter)) &
  	    (bdev_logical_block_size(bdev) - 1))
  		return -EINVAL;
  
- 	bio = bio_alloc_bioset(GFP_KERNEL, nr_pages, &blkdev_dio_pool);
+ 	if (nowait)
+ 		gfp = GFP_NOWAIT;
+ 	else
+ 		gfp = GFP_KERNEL;
+ 
+ 	bio = bio_alloc_bioset(gfp, nr_pages, &blkdev_dio_pool);
+ 	if (!bio)
+ 		return -EAGAIN;
  
  	dio = container_of(bio, struct blkdev_dio, bio);
  	dio->is_sync = is_sync = is_sync_kiocb(iocb);
@@@ -375,7 -383,10 +384,10 @@@
  	if (!is_poll)
  		blk_start_plug(&plug);
  
+ 	ret = 0;
  	for (;;) {
+ 		int err;
+ 
  		bio_set_dev(bio, bdev);
  		bio->bi_iter.bi_sector = pos >> 9;
  		bio->bi_write_hint = iocb->ki_hint;
@@@ -383,8 -394,10 +395,10 @@@
  		bio->bi_end_io = blkdev_bio_end_io;
  		bio->bi_ioprio = iocb->ki_ioprio;
  
- 		ret = bio_iov_iter_get_pages(bio, iter);
- 		if (unlikely(ret)) {
+ 		err = bio_iov_iter_get_pages(bio, iter);
+ 		if (unlikely(err)) {
+ 			if (!ret)
+ 				ret = err;
  			bio->bi_status = BLK_STS_IOERR;
  			bio_endio(bio);
  			break;
@@@ -399,6 -412,14 +413,14 @@@
  			task_io_account_write(bio->bi_iter.bi_size);
  		}
  
+ 		/*
+ 		 * Tell underlying layer to not block for resource shortage.
+ 		 * And if we would have blocked, return error inline instead
+ 		 * of through the bio->bi_end_io() callback.
+ 		 */
+ 		if (nowait)
+ 			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
+ 
  		dio->size += bio->bi_iter.bi_size;
  		pos += bio->bi_iter.bi_size;
  
@@@ -412,6 -433,11 +434,11 @@@
  			}
  
  			qc = submit_bio(bio);
+ 			if (qc == BLK_QC_T_EAGAIN) {
+ 				if (!ret)
+ 					ret = -EAGAIN;
+ 				goto error;
+ 			}
  
  			if (polled)
  				WRITE_ONCE(iocb->ki_cookie, qc);
@@@ -432,8 -458,20 +459,20 @@@
  			atomic_inc(&dio->ref);
  		}
  
- 		submit_bio(bio);
- 		bio = bio_alloc(GFP_KERNEL, nr_pages);
+ 		qc = submit_bio(bio);
+ 		if (qc == BLK_QC_T_EAGAIN) {
+ 			if (!ret)
+ 				ret = -EAGAIN;
+ 			goto error;
+ 		}
+ 		ret += bio->bi_iter.bi_size;
+ 
+ 		bio = bio_alloc(gfp, nr_pages);
+ 		if (!bio) {
+ 			if (!ret)
+ 				ret = -EAGAIN;
+ 			goto error;
+ 		}
  	}
  
  	if (!is_poll)
@@@ -453,13 -491,16 +492,16 @@@
  	}
  	__set_current_state(TASK_RUNNING);
  
+ out:
  	if (!ret)
  		ret = blk_status_to_errno(dio->bio.bi_status);
- 	if (likely(!ret))
- 		ret = dio->size;
  
  	bio_put(&dio->bio);
  	return ret;
+ error:
+ 	if (!is_poll)
+ 		blk_finish_plug(&plug);
+ 	goto out;
  }
  
  static ssize_t
@@@ -822,19 -863,19 +864,19 @@@ static const struct super_operations bd
  	.evict_inode = bdev_evict_inode,
  };
  
 -static struct dentry *bd_mount(struct file_system_type *fs_type,
 -	int flags, const char *dev_name, void *data)
 +static int bd_init_fs_context(struct fs_context *fc)
  {
 -	struct dentry *dent;
 -	dent = mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
 -	if (!IS_ERR(dent))
 -		dent->d_sb->s_iflags |= SB_I_CGROUPWB;
 -	return dent;
 +	struct pseudo_fs_context *ctx = init_pseudo(fc, BDEVFS_MAGIC);
 +	if (!ctx)
 +		return -ENOMEM;
 +	fc->s_iflags |= SB_I_CGROUPWB;
 +	ctx->ops = &bdev_sops;
 +	return 0;
  }
  
  static struct file_system_type bd_type = {
  	.name		= "bdev",
 -	.mount		= bd_mount,
 +	.init_fs_context = bd_init_fs_context,
  	.kill_sb	= kill_anon_super,
  };
  
diff --combined fs/io_uring.c
index e2a66e12fbc6,15d9b16ed29d..012bc0efb9d3
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@@ -202,7 -202,7 +202,7 @@@ struct async_list 
  
  	struct file		*file;
  	off_t			io_end;
- 	size_t			io_pages;
+ 	size_t			io_len;
  };
  
  struct io_ring_ctx {
@@@ -333,7 -333,8 +333,8 @@@ struct io_kiocb 
  #define REQ_F_IO_DRAIN		16	/* drain existing IO first */
  #define REQ_F_IO_DRAINED	32	/* drain done */
  #define REQ_F_LINK		64	/* linked sqes */
- #define REQ_F_FAIL_LINK		128	/* fail rest of links */
+ #define REQ_F_LINK_DONE		128	/* linked sqes done */
+ #define REQ_F_FAIL_LINK		256	/* fail rest of links */
  	u64			user_data;
  	u32			result;
  	u32			sequence;
@@@ -429,7 -430,7 +430,7 @@@ static inline bool io_sequence_defer(st
  	if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN)
  		return false;
  
- 	return req->sequence > ctx->cached_cq_tail + ctx->sq_ring->dropped;
+ 	return req->sequence != ctx->cached_cq_tail + ctx->sq_ring->dropped;
  }
  
  static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx)
@@@ -632,6 -633,7 +633,7 @@@ static void io_req_link_next(struct io_
  			nxt->flags |= REQ_F_LINK;
  		}
  
+ 		nxt->flags |= REQ_F_LINK_DONE;
  		INIT_WORK(&nxt->work, io_sq_wq_submit_work);
  		queue_work(req->ctx->sqo_wq, &nxt->work);
  	}
@@@ -1064,8 -1066,44 +1066,44 @@@ static int io_import_fixed(struct io_ri
  	 */
  	offset = buf_addr - imu->ubuf;
  	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
- 	if (offset)
- 		iov_iter_advance(iter, offset);
+ 
+ 	if (offset) {
+ 		/*
+ 		 * Don't use iov_iter_advance() here, as it's really slow for
+ 		 * using the latter parts of a big fixed buffer - it iterates
+ 		 * over each segment manually. We can cheat a bit here, because
+ 		 * we know that:
+ 		 *
+ 		 * 1) it's a BVEC iter, we set it up
+ 		 * 2) all bvecs are PAGE_SIZE in size, except potentially the
+ 		 *    first and last bvec
+ 		 *
+ 		 * So just find our index, and adjust the iterator afterwards.
+ 		 * If the offset is within the first bvec (or the whole first
+ 		 * bvec, just use iov_iter_advance(). This makes it easier
+ 		 * since we can just skip the first segment, which may not
+ 		 * be PAGE_SIZE aligned.
+ 		 */
+ 		const struct bio_vec *bvec = imu->bvec;
+ 
+ 		if (offset <= bvec->bv_len) {
+ 			iov_iter_advance(iter, offset);
+ 		} else {
+ 			unsigned long seg_skip;
+ 
+ 			/* skip first vec */
+ 			offset -= bvec->bv_len;
+ 			seg_skip = 1 + (offset >> PAGE_SHIFT);
+ 
+ 			iter->bvec = bvec + seg_skip;
+ 			iter->nr_segs -= seg_skip;
+ 			iter->count -= (seg_skip << PAGE_SHIFT);
+ 			iter->iov_offset = offset & ~PAGE_MASK;
+ 			if (iter->iov_offset)
+ 				iter->count -= iter->iov_offset;
+ 		}
+ 	}
+ 
  	return 0;
  }
  
@@@ -1120,28 -1158,26 +1158,26 @@@ static void io_async_list_note(int rw, 
  	off_t io_end = kiocb->ki_pos + len;
  
  	if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) {
- 		unsigned long max_pages;
+ 		unsigned long max_bytes;
  
  		/* Use 8x RA size as a decent limiter for both reads/writes */
- 		max_pages = filp->f_ra.ra_pages;
- 		if (!max_pages)
- 			max_pages = VM_READAHEAD_PAGES;
- 		max_pages *= 8;
- 
- 		/* If max pages are exceeded, reset the state */
- 		len >>= PAGE_SHIFT;
- 		if (async_list->io_pages + len <= max_pages) {
+ 		max_bytes = filp->f_ra.ra_pages << (PAGE_SHIFT + 3);
+ 		if (!max_bytes)
+ 			max_bytes = VM_READAHEAD_PAGES << (PAGE_SHIFT + 3);
+ 
+ 		/* If max len are exceeded, reset the state */
+ 		if (async_list->io_len + len <= max_bytes) {
  			req->flags |= REQ_F_SEQ_PREV;
- 			async_list->io_pages += len;
+ 			async_list->io_len += len;
  		} else {
  			io_end = 0;
- 			async_list->io_pages = 0;
+ 			async_list->io_len = 0;
  		}
  	}
  
  	/* New file? Reset state. */
  	if (async_list->file != filp) {
- 		async_list->io_pages = 0;
+ 		async_list->io_len = 0;
  		async_list->file = filp;
  	}
  	async_list->io_end = io_end;
@@@ -1630,6 -1666,8 +1666,8 @@@ static int io_poll_add(struct io_kiocb 
  	INIT_LIST_HEAD(&poll->wait.entry);
  	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
  
+ 	INIT_LIST_HEAD(&req->list);
+ 
  	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
  
  	spin_lock_irq(&ctx->completion_lock);
@@@ -1844,6 -1882,10 +1882,10 @@@ restart
  		/* async context always use a copy of the sqe */
  		kfree(sqe);
  
+ 		/* req from defer and link list needn't decrease async cnt */
+ 		if (req->flags & (REQ_F_IO_DRAINED | REQ_F_LINK_DONE))
+ 			goto out;
+ 
  		if (!async_list)
  			break;
  		if (!list_empty(&req_list)) {
@@@ -1891,6 -1933,7 +1933,7 @@@
  		}
  	}
  
+ out:
  	if (cur_mm) {
  		set_fs(old_fs);
  		unuse_mm(cur_mm);
@@@ -1917,6 -1960,10 +1960,10 @@@ static bool io_add_to_prev_work(struct 
  	ret = true;
  	spin_lock(&list->lock);
  	list_add_tail(&req->list, &list->list);
+ 	/*
+ 	 * Ensure we see a simultaneous modification from io_sq_wq_submit_work()
+ 	 */
+ 	smp_mb();
  	if (!atomic_read(&list->cnt)) {
  		list_del_init(&req->list);
  		ret = false;
@@@ -2400,6 -2447,7 +2447,6 @@@ static int io_cqring_wait(struct io_rin
  			  const sigset_t __user *sig, size_t sigsz)
  {
  	struct io_cq_ring *ring = ctx->cq_ring;
 -	sigset_t ksigmask, sigsaved;
  	int ret;
  
  	if (io_cqring_events(ring) >= min_events)
@@@ -2409,17 -2457,21 +2456,17 @@@
  #ifdef CONFIG_COMPAT
  		if (in_compat_syscall())
  			ret = set_compat_user_sigmask((const compat_sigset_t __user *)sig,
 -						      &ksigmask, &sigsaved, sigsz);
 +						      sigsz);
  		else
  #endif
 -			ret = set_user_sigmask(sig, &ksigmask,
 -					       &sigsaved, sigsz);
 +			ret = set_user_sigmask(sig, sigsz);
  
  		if (ret)
  			return ret;
  	}
  
  	ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= min_events);
 -
 -	if (sig)
 -		restore_user_sigmask(sig, &sigsaved, ret == -ERESTARTSYS);
 -
 +	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
  	if (ret == -ERESTARTSYS)
  		ret = -EINTR;