Git Repo - linux.git/log

Jens Axboe [Thu, 15 Feb 2024 16:42:03 +0000 (09:42 -0700)]

Merge tag 'nvme-6.8-2024-02-15' of git://git.infradead.org/nvme into block-6.8

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.8

- Fabrics connection error handling (Chaitanya)
- Use relaxed effects to reduce unnecessary queue freezes (Keith)"

* tag 'nvme-6.8-2024-02-15' of git://git.infradead.org/nvme:
  nvmet: remove superfluous initialization
  nvme: implement support for relaxed effects
  nvme-fabrics: fix I/O connect error handling

Chaitanya Kulkarni [Tue, 13 Feb 2024 07:58:24 +0000 (23:58 -0800)]

nvmet: remove superfluous initialization

Remove superfluous initialization of status variable in
nvmet_execute_admin_connect() and nvmet_execute_io_connect(), since it
will get overwritten by nvmet_copy_from_sgl().

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Keith Busch [Mon, 5 Feb 2024 19:10:25 +0000 (11:10 -0800)]

nvme: implement support for relaxed effects

NVM Express TP4167 provides a way for controllers to report a relaxed
execution constraint. Specifically, it notifies of exclusivity for IO
vs. admin commands instead of grouping these together. If set, then we
don't need to freeze IO in order to execute that admin command. The
freezing disrupts IO processes, so it's nice to avoid that if the
controller tells us it's not necessary.
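
The decision described above can be sketched as a small userspace model. This is not the kernel implementation: the helper name is hypothetical, and the bit positions chosen for the exclusivity and relaxation fields are assumptions made for illustration, so check the specification before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of a command-effects entry (illustrative only). */
#define EFFECTS_CSER_MASK (0x3u << 14)	/* relaxed execution hints */
#define EFFECTS_CSE_MASK  (0x7u << 16)	/* execution exclusivity */

/* Hypothetical helper: does this admin command still require an IO freeze? */
static bool admin_cmd_needs_io_freeze(uint32_t effects)
{
	/* A relaxation hint overrides the exclusivity requirement. */
	if (effects & EFFECTS_CSER_MASK)
		effects &= ~EFFECTS_CSE_MASK;

	return (effects & EFFECTS_CSE_MASK) != 0;
}

static void example(void)
{
	/* Exclusivity requested, no relaxation: freeze. */
	assert(admin_cmd_needs_io_freeze(1u << 16));
	/* Exclusivity requested, but relaxed: no freeze. */
	assert(!admin_cmd_needs_io_freeze((1u << 16) | (1u << 14)));
	/* Nothing requested: no freeze. */
	assert(!admin_cmd_needs_io_freeze(0));
}
```

In this model the freeze is skipped whenever any relaxation bit is set, which matches the intent stated in the message but simplifies whatever finer distinctions the real code may make.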

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Chaitanya Kulkarni [Tue, 13 Feb 2024 08:26:46 +0000 (00:26 -0800)]

nvme-fabrics: fix I/O connect error handling

In nvmf_connect_io_queue(), if the connect I/O command fails, we log the
error and continue on to authentication. This overrides the error captured
from __nvme_submit_sync_cmd(), causing a wrong return value.

Add goto out_free_data after logging connect error to fix the issue.
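
The control flow after the fix can be sketched in plain C. This is a simplified userspace model, not the kernel function: `submit_connect()`, `authenticate()`, `connect_io_queue()` and `MODEL_EIO` are hypothetical stand-ins, and only the `out_free_data` label name is taken from the message above.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_EIO (-5)	/* stand-in for a connect failure code */

/* Hypothetical stand-in for the connect command submission. */
static int submit_connect(int should_fail)
{
	return should_fail ? MODEL_EIO : 0;
}

/* Hypothetical stand-in for the authentication step. */
static int authenticate(void)
{
	return 0;	/* always succeeds in this model */
}

static int connect_io_queue(int connect_fails)
{
	char *data = malloc(16);	/* models the connect data buffer */
	int ret;

	if (!data)
		return -1;

	ret = submit_connect(connect_fails);
	if (ret) {
		/* the connect error is logged at this point */
		goto out_free_data;	/* skip auth so ret is preserved */
	}

	ret = authenticate();

out_free_data:
	free(data);
	return ret;
}

static void example(void)
{
	assert(connect_io_queue(1) == MODEL_EIO);	/* error survives */
	assert(connect_io_queue(0) == 0);
}
```

Without the jump, `authenticate()` would run after a failed connect and its return value of 0 would replace the error, which is the wrong-return-value symptom the message describes.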

Fixes: f50fff73d620c ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Jens Axboe [Thu, 8 Feb 2024 22:05:18 +0000 (15:05 -0700)]

Merge tag 'nvme-6.8-2023-02-08' of git://git.infradead.org/nvme into block-6.8

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.8

- Update a potentially stale firmware attribute (Maurizio)
- Fixes for the recent verbose error logging (Keith, Chaitanya)
- Protection information payload size fix for passthrough (Francis)"

* tag 'nvme-6.8-2023-02-08' of git://git.infradead.org/nvme:
  nvme: use ns->head->pi_size instead of t10_pi_tuple structure size
  nvme-core: fix comment to reflect right functions
  nvme: move passthrough logging attribute to head
  nvme-host: fix the updating of the firmware version

Yi Sun [Mon, 29 Jan 2024 08:52:50 +0000 (16:52 +0800)]

virtio-blk: Ensure no requests in virtqueues before deleting vqs.

Ensure no requests remain in the virtqueues before resetting the vdev and
deleting the virtqueues. Otherwise these requests will never be completed,
which may cause the system to become unresponsive.

blk_mq_quiesce_queue() can ensure that requests have reached the
in-flight state, but it cannot guarantee that they have been processed
by the device. Virtqueues should never be deleted before all requests
have completed.

blk_mq_freeze_queue() ensures that all requests in the virtqueues have
completed and that no new requests can enter the virtqueues.

Signed-off-by: Yi Sun <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Tejun Heo [Mon, 20 Nov 2023 22:25:56 +0000 (12:25 -1000)]

blk-iocost: Fix an UBSAN shift-out-of-bounds warning

When iocg_kick_delay() is called from a CPU different than the one which set
the delay, @now may be in the past of @iocg->delay_at leading to the
following warning:

  UBSAN: shift-out-of-bounds in block/blk-iocost.c:1359:23
  shift exponent 18446744073709 is too large for 64-bit type 'u64' (aka 'unsigned long long')
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x79/0xc0
   __ubsan_handle_shift_out_of_bounds+0x2ab/0x300
   iocg_kick_delay+0x222/0x230
   ioc_rqos_merge+0x1d7/0x2c0
   __rq_qos_merge+0x2c/0x80
   bio_attempt_back_merge+0x83/0x190
   blk_attempt_plug_merge+0x101/0x150
   blk_mq_submit_bio+0x2b1/0x720
   submit_bio_noacct_nocheck+0x320/0x3e0
   __swap_writepage+0x2ab/0x9d0

The underflow itself doesn't really affect the behavior in any meaningful
way; however, the past timestamp may exaggerate the delay amount calculated
later in the code, which shouldn't be a material problem given the nature of
the delay mechanism.

If @now is in the past, this CPU is racing another CPU which recently set up
the delay and there's nothing this CPU can contribute w.r.t. the delay.
Let's bail early from iocg_kick_delay() in such cases.
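
A minimal userspace sketch of the early bail-out follows. It assumes a simplified model in which the delay halves for every second elapsed since it was set; the function name, the parameters and the plain `<` comparison are illustrative choices, not the code in block/blk-iocost.c.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/*
 * Simplified model: return the delay after decaying it by one halving per
 * elapsed second. All timestamps are in microseconds.
 */
static uint64_t decayed_delay(uint64_t delay, uint64_t delay_at, uint64_t now)
{
	uint64_t elapsed_sec;

	/* Another CPU set the delay after our timestamp: nothing to do. */
	if (now < delay_at)
		return delay;

	elapsed_sec = (now - delay_at) / USEC_PER_SEC;

	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	if (elapsed_sec >= 64)
		return 0;

	return delay >> elapsed_sec;
}

static void example(void)
{
	/* Two seconds elapsed: 800 >> 2. */
	assert(decayed_delay(800, 1000000, 3000000) == 200);
	/* @now is older than @delay_at: bail out, delay untouched. */
	assert(decayed_delay(800, 5000000, 1000000) == 800);
}
```

Without the first check, `now - delay_at` would wrap around to a value close to 2^64, and dividing that by one million gives a shift count of roughly 18446744073709, which is consistent with the exponent shown in the UBSAN report.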

Reported-by: Breno Leitão <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Fixes: 5160a5a53c0c ("blk-iocost: implement delay adjustment hysteresis")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Francis Pravin [Tue, 6 Feb 2024 23:34:17 +0000 (05:04 +0530)]

nvme: use ns->head->pi_size instead of t10_pi_tuple structure size

Currently the kernel supports 8-byte and 16-byte protection information,
so use ns->head->pi_size instead of sizeof(struct t10_pi_tuple).

Signed-off-by: Francis Pravin <[email protected]>
Signed-off-by: Sathyavathi M <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Chaitanya Kulkarni [Tue, 6 Feb 2024 00:30:21 +0000 (16:30 -0800)]

nvme-core: fix comment to reflect right functions

The functions and the attribute listed in the comment don't exist in
the code (ns->logging_enabled, nvme_passthru_err_log_enabled_store()
and nvme_passthru_err_log_enabled_show()).

Update the comment with the right attribute and function names:
ns->head->passthru_err_log_enabled,
nvme_io_passthru_err_log_enabled_store() and
nvme_io_passthru_err_log_enabled_show().

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Alan Adamson <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Keith Busch [Tue, 6 Feb 2024 17:47:21 +0000 (09:47 -0800)]

nvme: move passthrough logging attribute to head

The namespace does not have attributes, but the head does. Move the new
logging attribute to that structure instead of dereferencing the wrong
type.

And while we're here, fix the reverse-tree coding style.

Fixes: 9f079dda14339e ("nvme: allow passthru cmd error logging")
Reported-by: Tasmiya Nalatwad <[email protected]>
Tested-by: Tasmiya Nalatwad <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Alan Adamson <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Jan Kara [Tue, 23 Jan 2024 17:58:26 +0000 (18:58 +0100)]

blk-wbt: Fix detection of dirty-throttled tasks

The detection of dirty-throttled tasks in blk-wbt has been subtly broken
since its beginning in 2016. Namely if we are doing cgroup writeback and
the throttled task is not in the root cgroup, balance_dirty_pages() will
set dirty_sleep for the non-root bdi_writeback structure. However
blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback
structure. Thus detection of recently throttled tasks is not working in
this case (we noticed this when we switched to cgroup v2 and suddenly
writeback was slow).

Since blk-wbt has no easy way to get to proper bdi_writeback and
furthermore its intention has always been to work on the whole device
rather than on individual cgroups, just move the dirty_sleep timestamp
from bdi_writeback to backing_dev_info. That fixes the checking for
recently throttled task and saves memory for everybody as a bonus.

CC: [email protected]
Fixes: b57d74aff9ab ("writeback: track if we're sleeping on progress in balance_dirty_pages()")
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: fixup indentation errors]
Signed-off-by: Jens Axboe <[email protected]>

Hongyu Jin [Tue, 30 Jan 2024 20:26:34 +0000 (15:26 -0500)]

block: Fix where bio IO priority gets set

Commit 82b74cac2849 ("blk-ioprio: Convert from rqos policy to direct
call") pushed setting bio I/O priority down into blk_mq_submit_bio()
-- which is too low within block core's submit_bio() because it
skips setting I/O priority for block drivers that implement
fops->submit_bio() (e.g. DM, MD, etc).

Fix this by moving bio_set_ioprio() up from blk-mq.c to blk-core.c and
call it from submit_bio(). This ensures all block drivers call
bio_set_ioprio() during initial bio submission.
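
The effect of moving the call can be illustrated with a small userspace model. Everything here is a hypothetical stand-in (the `struct bio` with a single field, the `submit()` dispatcher, and the convention that 0 means "priority not set"); it only shows why setting the priority before dispatch covers both kinds of driver.

```c
#include <assert.h>
#include <stddef.h>

struct bio {
	int ioprio;	/* 0 means "not set" in this model */
};

typedef void (*submit_fn)(struct bio *bio);

static int last_seen_ioprio;

/* Models a driver with its own ->submit_bio(), such as DM or MD. */
static void stacked_submit(struct bio *bio)
{
	last_seen_ioprio = bio->ioprio;
}

/* Models the request-based path. */
static void mq_submit(struct bio *bio)
{
	last_seen_ioprio = bio->ioprio;
}

/* Inherit the submitter's priority unless the bio already has one. */
static void set_ioprio(struct bio *bio, int task_ioprio)
{
	if (bio->ioprio == 0)
		bio->ioprio = task_ioprio;
}

/* Common entry point: set the priority once, then dispatch. */
static void submit(struct bio *bio, submit_fn driver_submit, int task_ioprio)
{
	set_ioprio(bio, task_ioprio);
	if (driver_submit)
		driver_submit(bio);
	else
		mq_submit(bio);
}

/* Returns the priority the chosen backend observed for a fresh bio. */
static int ioprio_seen_by(submit_fn driver_submit, int task_ioprio)
{
	struct bio bio = { .ioprio = 0 };

	last_seen_ioprio = -1;
	submit(&bio, driver_submit, task_ioprio);
	return last_seen_ioprio;
}

static void example(void)
{
	assert(ioprio_seen_by(stacked_submit, 4) == 4);
	assert(ioprio_seen_by(NULL, 4) == 4);
}
```

If `set_ioprio()` were called only inside `mq_submit()`, the stacked driver in this model would observe priority 0, which mirrors the regression described above.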

Fixes: a78418e6a04c ("block: Always initialize bio IO priority on submit")
Co-developed-by: Yibin Ding <[email protected]>
Signed-off-by: Yibin Ding <[email protected]>
Signed-off-by: Hongyu Jin <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
[snitzer: revised commit header]
Signed-off-by: Mike Snitzer <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Maurizio Lombardi [Thu, 18 Jan 2024 11:48:54 +0000 (12:48 +0100)]

nvme-host: fix the updating of the firmware version

The original code didn't update the firmware version if the
"next slot" of the AFI register isn't zero or if the
"current slot" field is zero; in those cases it assumed
that a reset was needed.

However, the NVMe specification doesn't exclude the possibility that
the "next slot" value is equal to the "current slot" value,
meaning that the same firmware slot will be activated after performing
a controller level reset; in this case a reset is clearly not
necessary and we can safely update the firmware version.

Modify the code so the kernel will report that a Controller Level Reset
is needed only in the following cases:

1) If the "current slot" field is zero. This is invalid and means that
   something is wrong, a reset is needed.

or

2) if the "next slot" field isn't zero AND it's not equal to the
   "current slot" value. This means that at the next reset a different
   firmware slot will be activated.
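
The two conditions above can be expressed as a short predicate. This is a sketch rather than the driver code: the function name is hypothetical, and it assumes the AFI byte carries the current slot in bits 2:0 and the next slot in bits 6:4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed AFI layout: bits 2:0 = current slot, bits 6:4 = next slot. */
static bool fw_reset_needed(uint8_t afi)
{
	uint8_t cur_slot = afi & 0x7;
	uint8_t next_slot = (afi >> 4) & 0x7;

	/* Case 1: a current slot of zero is invalid. */
	if (cur_slot == 0)
		return true;

	/* Case 2: a different slot will be activated at the next reset. */
	if (next_slot != 0 && next_slot != cur_slot)
		return true;

	return false;
}

static void example(void)
{
	assert(fw_reset_needed(0x00));	/* current slot is zero */
	assert(!fw_reset_needed(0x01));	/* slot 1 active, nothing pending */
	assert(!fw_reset_needed(0x11));	/* next slot equals current slot */
	assert(fw_reset_needed(0x21));	/* slot 2 pending, slot 1 active */
}
```

The third assertion is the case the original code got wrong: next slot equal to current slot no longer counts as needing a reset.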

Fixes: 983a338b96c8 ("nvme: update firmware version after commit")
Signed-off-by: Maurizio Lombardi <[email protected]>
Reviewed-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Jens Axboe [Thu, 1 Feb 2024 16:11:02 +0000 (09:11 -0700)]

Merge tag 'nvme-6.8-2024-02-01' of git://git.infradead.org/nvme into block-6.8

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.8

- Remove duplicated enums (Guixin)
- Use appropriate controller state accessors (Keith)
- Retryable authentication (Hannes)
- Add missing module descriptions (Chaitanya)
- Fibre-channel fixes for blktests (Daniel)
- Various type correctness updates (Caleb)
- Improve fabrics connection debugging prints (Nitin)
- Passthrough command verbose error logging (Alan)"

* tag 'nvme-6.8-2024-02-01' of git://git.infradead.org/nvme: (31 commits)
  nvme: allow passthru cmd error logging
  nvme-fc: show hostnqn when connecting to fc target
  nvme-rdma: show hostnqn when connecting to rdma target
  nvme-tcp: show hostnqn when connecting to tcp target
  nvmet-fc: use RCU list iterator for assoc_list
  nvmet-fc: take ref count on tgtport before delete assoc
  nvmet-fc: avoid deadlock on delete association path
  nvmet-fc: abort command when there is no binding
  nvmet-fc: do not tack refs on tgtports from assoc
  nvmet-fc: remove null hostport pointer check
  nvmet-fc: hold reference on hostport match
  nvmet-fc: free queue and assoc directly
  nvmet-fc: defer cleanup using RCU properly
  nvmet-fc: release reference on target port
  nvmet-fcloop: swap the list_add_tail arguments
  nvme-fc: do not wait in vain when unloading module
  nvme-fc: log human-readable opcode on timeout
  nvme: split out fabrics version of nvme_opcode_str()
  nvme: take const cmd pointer in read-only helpers
  nvme: remove redundant status mask
  ...

Alan Adamson [Tue, 30 Jan 2024 00:19:38 +0000 (16:19 -0800)]

nvme: allow passthru cmd error logging

Commit d7ac8dca938c ("nvme: quiet user passthrough command errors")
disabled error logging for user passthrough commands.  This commit
adds the ability to opt-in to passthrough admin error logging. IO
commands initiated as passthrough will always be logged.

The logging output for passthrough commands (Admin and IO) has been
changed to include CDWXX fields.

nvme0n1: Read(0x2), LBA Out of Range (sct 0x0 / sc 0x80) DNR cdw10=0x0 cdw11=0x1
        cdw12=0x70000 cdw13=0x0 cdw14=0x0 cdw15=0x0

Add a helper function nvme_log_err_passthru() which allows us to log
error for passthru commands by decoding cdw10-cdw15 values of nvme
command.

Add a new sysfs attribute, passthru_err_log_enabled, that allows the user to
conditionally enable passthrough command logging for either passthrough Admin
commands sent to the controller or passthrough IO commands sent to a namespace.

By default, passthrough error logging is disabled.

To enable passthrough admin error logging:
        echo 1 > /sys/class/nvme/nvme0/passthru_err_log_enabled

To disable passthrough admin error logging:
        echo 0 > /sys/class/nvme/nvme0/passthru_err_log_enabled

To enable passthrough io error logging:
        echo 1 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled

To disable passthrough io error logging:
        echo 0 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled

Signed-off-by: Alan Adamson <[email protected]>
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Nitin U. Yewale [Mon, 29 Jan 2024 11:06:39 +0000 (16:36 +0530)]

nvme-fc: show hostnqn when connecting to fc target

Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <[email protected]>
Reviewed-by: John Meneghini <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Nitin U. Yewale [Mon, 29 Jan 2024 11:06:38 +0000 (16:36 +0530)]

nvme-rdma: show hostnqn when connecting to rdma target

Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <[email protected]>
Reviewed-by: John Meneghini <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Nitin U. Yewale [Mon, 29 Jan 2024 11:06:37 +0000 (16:36 +0530)]

nvme-tcp: show hostnqn when connecting to tcp target

Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <[email protected]>
Reviewed-by: John Meneghini <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:12 +0000 (09:51 +0100)]

nvmet-fc: use RCU list iterator for assoc_list

The assoc_list is an RCU-protected list, thus use the RCU flavor of the
list functions.

Let's use this opportunity and refactor this code and move the lookup
into a helper and give it a descriptive name.

Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:11 +0000 (09:51 +0100)]

nvmet-fc: take ref count on tgtport before delete assoc

We have to ensure that the tgtport is not going away
before we have removed all the associations.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:10 +0000 (09:51 +0100)]

nvmet-fc: avoid deadlock on delete association path

When deleting an association, the shutdown path deadlocks because we
try to flush the nvmet_wq in a nested fashion. Avoid this deadlock by
deferring the put work into its own work item.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:09 +0000 (09:51 +0100)]

nvmet-fc: abort command when there is no binding

When the target port has no active port binding, there is no point in
trying to process the command as it has to fail anyway. Instead of adding
checks to all commands, abort the command early.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:08 +0000 (09:51 +0100)]

nvmet-fc: do not tack refs on tgtports from assoc

The association lifetime is tied to the lifetime of the target port.
That means we should not take an extra refcount when creating an
association.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:07 +0000 (09:51 +0100)]

nvmet-fc: remove null hostport pointer check

An association always has a valid hostport pointer. Remove the useless
null pointer check.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:06 +0000 (09:51 +0100)]

nvmet-fc: hold reference on hostport match

The hostport data structure is shared between associations, which is why
we keep track of the users via a refcount. So we should not decrement
the refcount on a match and free the hostport several times.

Reported by KASAN.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:05 +0000 (09:51 +0100)]

nvmet-fc: free queue and assoc directly

Neither struct nvmet_fc_tgt_queue nor struct nvmet_fc_tgt_assoc is a data
structure which is used in an RCU context. So there is no reason to
delay the free operation.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:04 +0000 (09:51 +0100)]

nvmet-fc: defer cleanup using RCU properly

When the target executes a disconnect and the host triggers a reconnect
immediately, the reconnect command still finds an existing association.

The reconnect crashes later on because nvmet_fc_delete_target_assoc
blindly removes resources while the reconnect code wants to use it.

To address this, nvmet_fc_find_target_assoc should not be able to
lookup an association which is being removed. The association list
is already under RCU lifetime management, so let's properly use it
and remove the association from the list and wait for a grace period
before cleaning everything up. This means we can also drop the RCU management
on the queues, because this is now handled via the association itself.

A second step splits the execution context so that the initial disconnect
command can complete without running the reconnect code in the same
context. As usual, this is done by deferring the ->done to a workqueue.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:03 +0000 (09:51 +0100)]

nvmet-fc: release reference on target port

In case we return early out of __nvmet_fc_finish_ls_req() we still have
to release the reference on the target port.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:02 +0000 (09:51 +0100)]

nvmet-fcloop: swap the list_add_tail arguments

The first argument of list_add_tail function is the new element which
should be added to the list which is the second argument. Swap the
arguments to allow processing more than one element at a time.
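
To see why the argument order matters, here is a minimal userspace reimplementation of a circular doubly linked list in the style of the kernel's list API. It is a sketch for illustration only; `init_list_head()`, `list_length()` and the two demo functions are hypothetical helpers, and the swapped case assumes the entries were initialized beforehand.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert @new_entry just before @head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *new_entry, struct list_head *head)
{
	struct list_head *last = head->prev;

	new_entry->next = head;
	new_entry->prev = last;
	last->next = new_entry;
	head->prev = new_entry;
}

/* Count the entries reachable from @head, excluding @head itself. */
static size_t list_length(const struct list_head *head)
{
	const struct list_head *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Correct order: both entries end up on the list. */
static size_t demo_correct(void)
{
	struct list_head head, a, b;

	init_list_head(&head);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);
	return list_length(&head);
}

/* Swapped order: each call re-links the head, dropping the earlier entry. */
static size_t demo_swapped(void)
{
	struct list_head head, a, b;

	init_list_head(&head);
	init_list_head(&a);
	init_list_head(&b);
	list_add_tail(&head, &a);
	list_add_tail(&head, &b);
	return list_length(&head);
}

static void example(void)
{
	assert(demo_correct() == 2);
	assert(demo_swapped() == 1);
}
```

With the arguments swapped, the list head is inserted into each entry's own one-element list in turn, so only the most recent entry stays reachable from the head.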

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Daniel Wagner [Wed, 31 Jan 2024 08:51:01 +0000 (09:51 +0100)]

nvme-fc: do not wait in vain when unloading module

The module exit path has a race between deleting all controllers and
freeing 'left over IDs'. To prevent a double free, a synchronization
between nvme_delete_ctrl and ida_destroy has been added by the initial
commit.

There is some logic around trying to prevent hanging forever in
wait_for_completion, though it does not handle all cases. E.g.
blktests is able to reproduce the situation where the module unload
hangs forever.

If we completely rely on the cleanup code executed from the
nvme_delete_ctrl path, all IDs will be freed eventually. This makes
calling ida_destroy unnecessary. We only have to ensure that all
nvme_delete_ctrl code has been executed before we leave
nvme_fc_exit_module. This is done by flushing the nvme_delete_wq
workqueue.

While at it, remove the unused nvme_fc_wq workqueue too.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Caleb Sander [Wed, 31 Jan 2024 16:43:15 +0000 (09:43 -0700)]

nvme-fc: log human-readable opcode on timeout

The fc transport logs the opcode and fctype on command timeout.
This is sufficient information to identify the command issued,
but not very human-readable. Use the nvme_fabrics_opcode_str()
helper to also log the name of the command, as rdma and tcp already do.

Signed-off-by: Caleb Sander <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Caleb Sander [Wed, 31 Jan 2024 16:43:14 +0000 (09:43 -0700)]

nvme: split out fabrics version of nvme_opcode_str()

nvme_opcode_str() currently supports admin, IO, and fabrics commands.
However, fabrics commands aren't allowed for the pci transport.
Currently the pci caller passes 0 as the fctype,
which means any fabrics command would be displayed as "Property Set".

Move fabrics command support into a function nvme_fabrics_opcode_str()
and remove the fctype argument to nvme_opcode_str().
This way, a fabrics command will display as "Unknown" for pci.
Convert the rdma and tcp transports to use nvme_fabrics_opcode_str().

Signed-off-by: Caleb Sander <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Caleb Sander [Wed, 31 Jan 2024 16:43:13 +0000 (09:43 -0700)]

nvme: take const cmd pointer in read-only helpers

nvme_is_fabrics() and nvme_is_write() only read struct nvme_command,
so take it by const pointer. This allows callers to pass a const pointer
and communicates that these functions don't modify the command.

Signed-off-by: Caleb Sander <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Caleb Sander [Wed, 31 Jan 2024 16:43:12 +0000 (09:43 -0700)]

nvme: remove redundant status mask

In nvme_get_error_status_str(), the status code is already masked
with 0x7ff at the beginning of the function.
Don't bother masking it again when indexing nvme_statuses.

Signed-off-by: Caleb Sander <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Caleb Sander [Wed, 31 Jan 2024 16:43:11 +0000 (09:43 -0700)]

nvme: return string as char *, not unsigned char *

The functions in drivers/nvme/host/constants.c returning human-readable
status and opcode strings currently use type "const unsigned char *".
Typically string constants use type "const char *",
so remove "unsigned" from the return types.
This is a purely cosmetic change to clarify that the functions
return text strings instead of an array of bytes, for example.

Signed-off-by: Caleb Sander <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Chaitanya Kulkarni [Wed, 31 Jan 2024 23:01:38 +0000 (15:01 -0800)]

nvme-common: add module description

Add MODULE_DESCRIPTION() to remove the following warnings and get a clean build:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-auth.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-keyring.o

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Hannes Reinecke [Mon, 29 Jan 2024 06:39:48 +0000 (07:39 +0100)]

nvme: enable retries for authentication commands

Authentication commands might trigger a lengthy computation on the
controller or even a callout to an external entity.
In these cases the controller might return a status without the DNR
bit set, indicating that the command should be retried.
This patch enables retries for authentication commands by setting
NVME_SUBMIT_RETRY for __nvme_submit_sync_cmd().

Reported-by: Martin George <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Hannes Reinecke [Mon, 29 Jan 2024 06:39:46 +0000 (07:39 +0100)]

nvme: change __nvme_submit_sync_cmd() calling conventions

Combine the two arguments 'flags' and 'at_head' from __nvme_submit_sync_cmd()
into a single 'flags' argument and use function-specific values to indicate
what should be set within the function.
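
The idea of folding a boolean parameter into a flags bitmask can be sketched as follows. The flag names, the `submit_plan` struct and `plan_submit()` are hypothetical and exist only for this example; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical function-specific flags for this sketch. */
enum submit_flags {
	SUBMIT_AT_HEAD = 1u << 0,	/* queue the request at the head */
	SUBMIT_NOWAIT  = 1u << 1,	/* do not block while allocating */
};

/* What the callee derives from the single flags argument. */
struct submit_plan {
	bool at_head;
	bool nowait;
};

/* One bitmask replaces separate 'flags' and 'at_head' parameters. */
static struct submit_plan plan_submit(unsigned int flags)
{
	struct submit_plan plan = {
		.at_head = (flags & SUBMIT_AT_HEAD) != 0,
		.nowait  = (flags & SUBMIT_NOWAIT) != 0,
	};

	return plan;
}

static void example(void)
{
	assert(plan_submit(SUBMIT_AT_HEAD).at_head);
	assert(!plan_submit(SUBMIT_AT_HEAD).nowait);
	assert(plan_submit(SUBMIT_AT_HEAD | SUBMIT_NOWAIT).nowait);
	assert(!plan_submit(0).at_head);
}
```

A design benefit of this shape is that adding another behaviour later only needs a new flag bit rather than another positional argument at every call site.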

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Hannes Reinecke [Mon, 29 Jan 2024 06:39:45 +0000 (07:39 +0100)]

nvme-auth: open-code single-use macros

No point in having macros just for a single function nvme_auth_submit().
Open-code them into the caller.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Keith Busch [Wed, 24 Jan 2024 17:27:27 +0000 (09:27 -0800)]

nvme: use ctrl state accessor

The ctrl->state value is updated in another thread using WRITE_ONCE, so
ensure all the readers use the appropriate accessor.

Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Guixin Liu [Fri, 26 Jan 2024 08:26:43 +0000 (16:26 +0800)]

nvmet-tcp: fix nvme tcp ida memory leak

The nvmet_tcp_queue_ida should be destroyed when the nvmet-tcp module
exits.

Signed-off-by: Guixin Liu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Jens Axboe [Fri, 26 Jan 2024 00:03:54 +0000 (17:03 -0700)]

Merge tag 'md-6.8-20240126' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.8

Pull MD fix from Song:

"This change fixes an RCU warning."

* tag 'md-6.8-20240126' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix a suspicious RCU usage warning

Mikulas Patocka [Wed, 17 Jan 2024 18:22:36 +0000 (19:22 +0100)]

md: fix a suspicious RCU usage warning

RCU protection was removed in the commit 2d32777d60de ("raid1: remove rcu
protection to access rdev from conf").

However, the code in fix_read_error does rcu_dereference outside
rcu_read_lock - this triggers the following warning. The warning is
triggered by a LVM2 test shell/integrity-caching.sh.

This commit removes rcu_dereference.

=============================
WARNING: suspicious RCU usage
6.7.0 #2 Not tainted
-----------------------------
drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
no locks held by mdX_raid1/1859.

stack backtrace:
CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x70
lockdep_rcu_suspicious+0x153/0x1b0
raid1d+0x1732/0x1750 [raid1]
? lock_acquire+0x9f/0x270
? finish_wait+0x3d/0x80
? md_thread+0xf7/0x130 [md_mod]
? lock_release+0xaa/0x230
? md_register_thread+0xd0/0xd0 [md_mod]
md_thread+0xa0/0x130 [md_mod]
? housekeeping_test_cpu+0x30/0x30
kthread+0xdc/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x28/0x40
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: ca294b34aaf3 ("md/raid1: support read error check")
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Maksim Kiselev [Wed, 24 Jan 2024 07:24:36 +0000 (10:24 +0300)]

aoe: avoid potential deadlock at set_capacity

Move set_capacity() outside of the section protected by (&d->lock)
to avoid a possible interrupt-unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
[1] lock(&bdev->bd_size_lock);
                                local_irq_disable();
                            [2] lock(&d->lock);
                            [3] lock(&bdev->bd_size_lock);
   <Interrupt>
[4]  lock(&d->lock);

  *** DEADLOCK ***

Here [1] (&bdev->bd_size_lock) is held by zram_add()->set_capacity(), and
[2] lock(&d->lock) is held by aoeblk_gdalloc(), which is then trying to
acquire [3] (&bdev->bd_size_lock) in its set_capacity() call. In this
situation, an attempt to acquire [4] lock(&d->lock) from aoecmd_cfg_rsp()
will lead to deadlock.

The simplest solution is to break the lock dependency
[2](&d->lock) -> [3](&bdev->bd_size_lock) by moving set_capacity()
outside.

Signed-off-by: Maksim Kiselev <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
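
The lock-dependency break described above can be sketched in userspace C. This is only a model under stated assumptions, not the aoe driver's code: the `model_*` names are hypothetical, pthread mutexes stand in for the kernel spinlocks, and copying the size into a local variable before dropping the lock is one assumed way of moving the call outside the critical section.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the gendisk and the aoe device. */
struct model_disk {
	pthread_mutex_t size_lock;	/* plays the role of bdev->bd_size_lock */
	uint64_t capacity;
};

struct model_dev {
	pthread_mutex_t lock;		/* plays the role of d->lock */
	uint64_t ssize;			/* size reported by the target */
	struct model_disk *gd;
};

/* Takes size_lock internally, as set_capacity() takes bd_size_lock. */
static void model_set_capacity(struct model_disk *gd, uint64_t sectors)
{
	pthread_mutex_lock(&gd->size_lock);
	gd->capacity = sectors;
	pthread_mutex_unlock(&gd->size_lock);
}

/*
 * Snapshot the size under d->lock, drop the lock, and only then call the
 * helper that takes size_lock. The two locks are never nested, so the
 * d->lock -> size_lock dependency does not exist.
 */
static void model_gdalloc(struct model_dev *d)
{
	uint64_t ssize;

	pthread_mutex_lock(&d->lock);
	ssize = d->ssize;
	pthread_mutex_unlock(&d->lock);

	model_set_capacity(d->gd, ssize);
}
```

The trade-off in this kind of fix is that the value passed to the capacity helper is a snapshot; the sketch assumes that is acceptable for the caller.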

commit | commitdiff | tree

Israel Rukshin [Wed, 24 Jan 2024 12:16:27 +0000 (12:16 +0000)]

nvme-rdma: Fix transfer length when write_generate/read_verify are 0

When the block layer doesn't generate/verify metadata, the SG length is
smaller than the transfer length. This is because the SG length doesn't
include the metadata length that is added by the HW on the wire. The
target fails those commands with "Data SGL Length Invalid" by comparing
the transfer length and the SG length. Fix it by adding the metadata
length to the transfer length when there is no metadata SGL. The bug
reproduces when setting read_verify/write_generate configs to 0 at the
child multipath device or at the primary device when NVMe multipath is
disabled.

Note that setting those configs to 0 on the multipath device (ns_head)
doesn't have any impact on the I/Os.

Fixes: 5ec5d3bddc6b ("nvme-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
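
The length adjustment described above can be illustrated with a small arithmetic sketch. It is not the nvme-rdma code: `model_wire_len` and its parameters are hypothetical, and it covers only the case the message describes, where no metadata SGL exists and the hardware adds the metadata on the wire.

```c
#include <stdint.h>

/*
 * Hypothetical helper: wire transfer length for a command with no metadata
 * SGL. data_sg_len is the length of the data scatterlist in bytes,
 * nr_blocks the number of logical blocks, md_size the metadata bytes per
 * block that the hardware generates or strips.
 */
static uint32_t model_wire_len(uint32_t data_sg_len, uint32_t nr_blocks,
			       uint32_t md_size)
{
	/* The SG list holds data only; the metadata still crosses the wire. */
	return data_sg_len + nr_blocks * md_size;
}
```

For example, assuming 8 blocks of 512 bytes with 8 bytes of metadata each, the data SG length is 4096 bytes while the wire transfer length is 4160 bytes.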

commit | commitdiff | tree

Chaitanya Kulkarni [Tue, 23 Jan 2024 22:13:41 +0000 (14:13 -0800)]

nvmet: add module description to stop warnings

Add MODULE_DESCRIPTION() to remove the following warnings and get a clean build:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-loop.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-rdma.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-fc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-fcloop.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-tcp.o

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

commit | commitdiff | tree

Chaitanya Kulkarni [Tue, 23 Jan 2024 22:13:40 +0000 (14:13 -0800)]

nvme: add module description to stop warnings

Add MODULE_DESCRIPTION() to remove the following warnings and get a clean build:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

commit | commitdiff | tree

Guixin Liu [Thu, 18 Jan 2024 12:51:45 +0000 (20:51 +0800)]

nvmet: unify aer type enum

The host and target use two definitions of the AER type; unify
them into a single one.

Signed-off-by: Guixin Liu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

commit | commitdiff | tree

Christian A. Ehrhardt [Sun, 21 Jan 2024 20:26:34 +0000 (21:26 +0100)]

block: Fix WARNING in _copy_from_iter

Syzkaller reports a warning in _copy_from_iter because an
iov_iter is supposedly used in the wrong direction. The reason
is that syzkaller managed to generate a request with
a transfer direction of SG_DXFER_TO_FROM_DEV. This instructs
the kernel to copy user buffers into the kernel, read into
the copied buffers and then copy the data back to user space.

Thus the iovec is used in both directions.

Detect this situation in the block layer and construct a new
iterator with the correct direction for the copy-in.

Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/t/
Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/T/
Signed-off-by: Christian A. Ehrhardt <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

commit | commitdiff | tree

Li Lingfeng [Thu, 18 Jan 2024 13:04:01 +0000 (21:04 +0800)]

block: Move checking GENHD_FL_NO_PART to bdev_add_partition()

Commit 1a721de8489f ("block: don't add or resize partition on the disk
with GENHD_FL_NO_PART") prevented all partition operations on disks
with GENHD_FL_NO_PART in blkpg_do_ioctl() since they are meaningless.
However, it changed the error code returned in some scenarios. Move the
GENHD_FL_NO_PART check to bdev_add_partition() to eliminate that impact.

Fixes: 1a721de8489f ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART")
Reported-by: Allison Karlitskaya <[email protected]>
Closes: https://lore.kernel.org/all/CAOYeF9VsmqKMcQjo1k6YkGNujwN-nzfxY17N3F-CMikE1tYp+w@mail.gmail.com/
Signed-off-by: Li Lingfeng <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

commit | commitdiff | tree

Linus Torvalds [Sun, 21 Jan 2024 22:11:32 +0000 (14:11 -0800)]

Linux 6.8-rc1

commit | commitdiff | tree

Linus Torvalds [Sun, 21 Jan 2024 22:01:12 +0000 (14:01 -0800)]

Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs

Pull more bcachefs updates from Kent Overstreet:
"Some fixes, some refactoring, some minor features:

   - Assorted prep work for disk space accounting rewrite

   - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
     makes our trigger context more explicit

   - A few fixes to avoid excessive transaction restarts on
     multithreaded workloads: fstests (in addition to ktest tests) are
     now checking slowpath counters, and that's shaking out a few bugs

   - Assorted tracepoint improvements

   - Starting to break up bcachefs_format.h and move on disk types so
     they're with the code they belong to; this will make room to start
     documenting the on disk format better.

   - A few minor fixes"

* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
  bcachefs: Improve inode_to_text()
  bcachefs: logged_ops_format.h
  bcachefs: reflink_format.h
  bcachefs; extents_format.h
  bcachefs: ec_format.h
  bcachefs: subvolume_format.h
  bcachefs: snapshot_format.h
  bcachefs: alloc_background_format.h
  bcachefs: xattr_format.h
  bcachefs: dirent_format.h
  bcachefs: inode_format.h
  bcachefs; quota_format.h
  bcachefs: sb-counters_format.h
  bcachefs: counters.c -> sb-counters.c
  bcachefs: comment bch_subvolume
  bcachefs: bch_snapshot::btime
  bcachefs: add missing __GFP_NOWARN
  bcachefs: opts->compression can now also be applied in the background
  bcachefs: Prep work for variable size btree node buffers
  bcachefs: grab s_umount only if snapshotting
  ...

commit | commitdiff | tree

Linus Torvalds [Sun, 21 Jan 2024 19:14:40 +0000 (11:14 -0800)]

Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:

   - A fix for the idle and iowait time accounting vs CPU hotplug.

     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.

   - Assorted fixes and improvements for clocksource/event drivers"

* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
  clocksource/drivers/ep93xx: Fix error handling during probe
  clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
  clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
  clocksource/timer-riscv: Add riscv_clock_shutdown callback
  dt-bindings: timer: Add StarFive JH8100 clint
  dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs

commit | commitdiff | tree

Linus Torvalds [Sun, 21 Jan 2024 19:04:29 +0000 (11:04 -0800)]

Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Aneesh Kumar:

- Increase default stack size to 32KB for Book3S

Thanks to Michael Ellerman.

* tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Increase default stack size to 32KB

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 17:19:01 +0000 (12:19 -0500)]

bcachefs: Improve inode_to_text()

Add line breaks - inode_to_text() is now much easier to read.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:57:45 +0000 (02:57 -0500)]

bcachefs: logged_ops_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:54:47 +0000 (02:54 -0500)]

bcachefs: reflink_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:51:56 +0000 (02:51 -0500)]

bcachefs; extents_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:47:14 +0000 (02:47 -0500)]

bcachefs: ec_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:42:53 +0000 (02:42 -0500)]

bcachefs: subvolume_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 07:41:06 +0000 (02:41 -0500)]

bcachefs: snapshot_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 05:01:52 +0000 (00:01 -0500)]

bcachefs: alloc_background_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:59:15 +0000 (23:59 -0500)]

bcachefs: xattr_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:57:10 +0000 (23:57 -0500)]

bcachefs: dirent_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:55:39 +0000 (23:55 -0500)]

bcachefs: inode_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:53:52 +0000 (23:53 -0500)]

bcachefs; quota_format.h

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:50:56 +0000 (23:50 -0500)]

bcachefs: sb-counters_format.h

bcachefs_format.h has gotten too big; let's do some organizing.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:46:35 +0000 (23:46 -0500)]

bcachefs: counters.c -> sb-counters.c

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:44:17 +0000 (23:44 -0500)]

bcachefs: comment bch_subvolume

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 21 Jan 2024 04:35:41 +0000 (23:35 -0500)]

bcachefs: bch_snapshot::btime

Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Wed, 17 Jan 2024 22:16:07 +0000 (17:16 -0500)]

bcachefs: add missing __GFP_NOWARN

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Tue, 16 Jan 2024 21:20:21 +0000 (16:20 -0500)]

bcachefs: opts->compression can now also be applied in the background

The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.

Signed-off-by: Kent Overstreet <[email protected]>
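
The fallback described above amounts to a one-line selection. The sketch below is a hypothetical model, not the bcachefs option code: the enum values and struct are invented for illustration, and 0 is assumed to mean "not set".

```c
/* Hypothetical option values; 0 stands for "not set". */
enum model_compression {
	MODEL_COMP_NONE = 0,
	MODEL_COMP_LZ4,
	MODEL_COMP_ZSTD,
};

struct model_opts {
	enum model_compression compression;
	enum model_compression background_compression;
};

/* Method used by the background paths: explicit setting first, else fall back. */
static enum model_compression
model_background_method(const struct model_opts *o)
{
	return o->background_compression != MODEL_COMP_NONE
		? o->background_compression
		: o->compression;
}
```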

commit | commitdiff | tree

Kent Overstreet [Tue, 16 Jan 2024 18:29:59 +0000 (13:29 -0500)]

bcachefs: Prep work for variable size btree node buffers

bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.

And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.

Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.

Signed-off-by: Kent Overstreet <[email protected]>
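
As a rough worked example of the overhead mentioned above, the following sketch multiplies the figures from the message. It assumes one pinned, full-size root buffer per btree, which is a simplification; `model_pinned_root_bytes` is a hypothetical name.

```c
/* Hypothetical helper: bytes pinned by nr_btrees root buffers of node_bytes each. */
static unsigned long model_pinned_root_bytes(unsigned long nr_btrees,
					     unsigned long node_bytes)
{
	return nr_btrees * node_bytes;
}
```

With 18 btrees and 256 KiB nodes this gives 18 * 262144 = 4718592 bytes, or 4.5 MiB, pinned even when the roots are mostly empty.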

commit | commitdiff | tree

Su Yue [Mon, 15 Jan 2024 02:21:25 +0000 (10:21 +0800)]

bcachefs: grab s_umount only if snapshotting

When I was testing mongodb over bcachefs with compression,
there was a lockdep warning when snapshotting the mongodb data volume.

$ cat test.sh
prog=bcachefs

$prog subvolume create /mnt/data
$prog subvolume create /mnt/data/snapshots

while true;do
    $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
    sleep 1s
done

$ cat /etc/mongodb.conf
systemLog:
  destination: file
  logAppend: true
  path: /mnt/data/mongod.log

storage:
  dbPath: /mnt/data/

lockdep reports:
[ 3437.452330] ======================================================
[ 3437.452750] WARNING: possible circular locking dependency detected
[ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
[ 3437.453562] ------------------------------------------------------
[ 3437.453981] bcachefs/35533 is trying to acquire lock:
[ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
[ 3437.454875]
               but task is already holding lock:
[ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.456009]
               which lock already depends on the new lock.

[ 3437.456553]
               the existing dependency chain (in reverse order) is:
[ 3437.457054]
               -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
[ 3437.457507]        down_read+0x3e/0x170
[ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
[ 3437.458498]        do_syscall_64+0x42/0xf0
[ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.459155]
               -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
[ 3437.459615]        down_read+0x3e/0x170
[ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
[ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
[ 3437.460686]        notify_change+0x1f1/0x4a0
[ 3437.461283]        do_truncate+0x7f/0xd0
[ 3437.461555]        path_openat+0xa57/0xce0
[ 3437.461836]        do_filp_open+0xb4/0x160
[ 3437.462116]        do_sys_openat2+0x91/0xc0
[ 3437.462402]        __x64_sys_openat+0x53/0xa0
[ 3437.462701]        do_syscall_64+0x42/0xf0
[ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.463359]
               -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
[ 3437.463843]        down_write+0x3b/0xc0
[ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
[ 3437.464493]        vfs_write+0x21b/0x4c0
[ 3437.464653]        ksys_write+0x69/0xf0
[ 3437.464839]        do_syscall_64+0x42/0xf0
[ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.465231]
               -> #0 (sb_writers#10){.+.+}-{0:0}:
[ 3437.465471]        __lock_acquire+0x1455/0x21b0
[ 3437.465656]        lock_acquire+0xc6/0x2b0
[ 3437.465822]        mnt_want_write+0x46/0x1a0
[ 3437.465996]        filename_create+0x62/0x190
[ 3437.466175]        user_path_create+0x2d/0x50
[ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
[ 3437.466791]        do_syscall_64+0x42/0xf0
[ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.467180]
               other info that might help us debug this:

[ 3437.467507] Chain exists of:
                 sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48

[ 3437.467979]  Possible unsafe locking scenario:

[ 3437.468223]        CPU0                    CPU1
[ 3437.468405]        ----                    ----
[ 3437.468585]   rlock(&type->s_umount_key#48);
[ 3437.468758]                                lock(&c->snapshot_create_lock);
[ 3437.469030]                                lock(&type->s_umount_key#48);
[ 3437.469291]   rlock(sb_writers#10);
[ 3437.469434]
                *** DEADLOCK ***

[ 3437.469670] 2 locks held by bcachefs/35533:
[ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
[ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.470744]
               stack backtrace:
[ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
[ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 3437.471694] Call Trace:
[ 3437.471795]  <TASK>
[ 3437.471884]  dump_stack_lvl+0x57/0x90
[ 3437.472035]  check_noncircular+0x132/0x150
[ 3437.472202]  __lock_acquire+0x1455/0x21b0
[ 3437.472369]  lock_acquire+0xc6/0x2b0
[ 3437.472518]  ? filename_create+0x62/0x190
[ 3437.472683]  ? lock_is_held_type+0x97/0x110
[ 3437.472856]  mnt_want_write+0x46/0x1a0
[ 3437.473025]  ? filename_create+0x62/0x190
[ 3437.473204]  filename_create+0x62/0x190
[ 3437.473380]  user_path_create+0x2d/0x50
[ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.473819]  ? lock_acquire+0xc6/0x2b0
[ 3437.474002]  ? __fget_files+0x2a/0x190
[ 3437.474195]  ? __fget_files+0xbc/0x190
[ 3437.474380]  ? lock_release+0xc5/0x270
[ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
[ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
[ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
[ 3437.475277]  do_syscall_64+0x42/0xf0
[ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.475691] RIP: 0033:0x7f2743c313af
======================================================

In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
and unlock it at the end of the function. There is a comment
"why do we need this lock?" about the lock, coming from
commit 42d237320e98 ("bcachefs: Snapshot creation, deletion").
The reason is that __bch2_ioctl_subvolume_create() calls
sync_inodes_sb(), which requires s_umount to be held, to write back all
dirty inodes before doing the snapshot work.

Fix it by read locking s_umount for snapshotting only and unlocking
s_umount after sync_inodes_sb().

Signed-off-by: Su Yue <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Su Yue [Tue, 16 Jan 2024 11:05:37 +0000 (19:05 +0800)]

bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit

bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
It should be freed by kvfree, not kfree.
Otherwise umount will trigger:

[  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
[  406.830676 ] #PF: supervisor read access in kernel mode
[  406.831643 ] #PF: error_code(0x0000) - not-present page
[  406.832487 ] PGD 0 P4D 0
[  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
[  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
[  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  406.835796 ] RIP: 0010:kfree+0x62/0x140
[  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
[  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
[  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
[  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
[  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
[  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
[  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
[  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
[  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
[  406.841464 ] Call Trace:
[  406.841583 ]  <TASK>
[  406.841682 ]  ? __die+0x1f/0x70
[  406.841828 ]  ? page_fault_oops+0x159/0x470
[  406.842014 ]  ? fixup_exception+0x22/0x310
[  406.842198 ]  ? exc_page_fault+0x1ed/0x200
[  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
[  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
[  406.842842 ]  ? kfree+0x62/0x140
[  406.842988 ]  ? kfree+0x104/0x140
[  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
[  406.843390 ]  kobject_put+0xb7/0x170
[  406.843552 ]  deactivate_locked_super+0x2f/0xa0
[  406.843756 ]  cleanup_mnt+0xba/0x150
[  406.843917 ]  task_work_run+0x59/0xa0
[  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
[  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
[  406.844510 ]  do_syscall_64+0x4e/0xf0
[  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  406.844907 ] RIP: 0033:0x7f0a2664e4fb

Signed-off-by: Su Yue <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Tue, 16 Jan 2024 16:38:04 +0000 (11:38 -0500)]

bcachefs: bios must be 512 byte algined

Fixes: 023f9ac9f70f bcachefs: Delete dio read alignment check
Reported-by: Brian Foster <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Colin Ian King [Tue, 16 Jan 2024 11:07:23 +0000 (11:07 +0000)]

bcachefs: remove redundant variable tmp

The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.

Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Tue, 16 Jan 2024 01:40:06 +0000 (20:40 -0500)]

bcachefs: Improve trace_trans_restart_relock

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Tue, 16 Jan 2024 01:37:23 +0000 (20:37 -0500)]

bcachefs: Fix excess transaction restarts in __bchfs_fallocate()

drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 23:19:52 +0000 (18:19 -0500)]

bcachefs: extents_to_bp_state

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 23:08:32 +0000 (18:08 -0500)]

bcachefs: bkey_and_val_eq()

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 22:59:51 +0000 (17:59 -0500)]

bcachefs: Better journal tracepoints

Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 22:57:44 +0000 (17:57 -0500)]

bcachefs: Print size of superblock with space allocated

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 22:56:22 +0000 (17:56 -0500)]

bcachefs: Avoid flushing the journal in the discard path

When issuing discards, we may need to flush the journal if there are too
many buckets that can't be discarded until a journal flush.

But the heuristic was bad; we should be comparing the number of buckets
that need a journal flush against the number of free buckets, not the
number of buckets we saw.

Signed-off-by: Kent Overstreet <[email protected]>
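
The corrected comparison described above can be sketched as follows. The function name and the exact threshold (a plain greater-than) are assumptions made for illustration; the message only states which two quantities should be compared.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical heuristic: flush the journal only when the buckets waiting
 * on a journal flush outnumber the buckets that are already free.
 */
static bool model_should_flush_journal(uint64_t need_journal_flush,
				       uint64_t free_buckets)
{
	return need_journal_flush > free_buckets;
}
```

Under this model, a discard pass in which all 10 scanned buckets need a flush does not force one while 1000 buckets are free, whereas comparing against the number of buckets seen would make the same pass look urgent.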

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 20:33:39 +0000 (15:33 -0500)]

bcachefs: Improve move_extent tracepoint

Also print out the data_opts, so that we can see what specifically is
being done to an extent.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 20:06:43 +0000 (15:06 -0500)]

bcachefs: Add missing bch2_moving_ctxt_flush_all()

This fixes a bug with rebalance IOs getting stuck with reads completed,
but writes never being issued.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 20:04:40 +0000 (15:04 -0500)]

bcachefs: Re-add move_extent_write tracepoint

It appears this was accidentally deleted at some point - also, do a bit
of cleanup.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 19:15:26 +0000 (14:15 -0500)]

bcachefs: bch2_kthread_io_clock_wait() no longer sleeps until full amount

Drop the loop in bch2_kthread_io_clock_wait(): this allows the code
that uses it to be woken up for other reasons, and fixes a bug where
rebalance wouldn't wake up when a scan was requested.

This raises the possibility of spurious wakeups, but callers should
always be able to handle that reasonably well.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 19:15:03 +0000 (14:15 -0500)]

bcachefs: Add .val_to_text() for KEY_TYPE_cookie

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Mon, 15 Jan 2024 19:12:43 +0000 (14:12 -0500)]

bcachefs: Don't pass memcmp() as a pointer

Some (buggy!) compilers have issues with this.

Fixes: https://github.com/koverstreet/bcachefs/issues/625
Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Linus Torvalds [Sun, 21 Jan 2024 18:21:43 +0000 (10:21 -0800)]

Merge tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs

Pull header fix from Kent Overstreet:
"Just one small fixup for the RT build"

* tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs:
spinlock: Fix failing build for PREEMPT_RT

commit | commitdiff | tree

Kent Overstreet [Thu, 11 Jan 2024 04:47:04 +0000 (23:47 -0500)]

bcachefs: Reduce would_deadlock restarts

We don't have to take locks in any particular ordering - we'll make
forward progress just fine - but if we try to stick to an ordering, it
can help to avoid excessive would_deadlock transaction restarts.

This tweaks the reflink path to take extents btree locks in the right
order.

Signed-off-by: Kent Overstreet <[email protected]>
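
One way to picture "sticking to an ordering" is to sort the positions to be locked before taking any lock. The sketch below is hypothetical: the `model_pos` key (btree ID, then offset) is an assumed ordering chosen for illustration, and the message does not say which key bcachefs uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical lock target: a btree and an offset within it. */
struct model_pos {
	unsigned int btree_id;
	unsigned long long offset;
};

/* Order by btree first, then by offset within the btree. */
static int model_pos_cmp(const void *l, const void *r)
{
	const struct model_pos *a = l;
	const struct model_pos *b = r;

	if (a->btree_id != b->btree_id)
		return a->btree_id < b->btree_id ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Sort targets so every thread would take its locks in the same order. */
static void model_sort_lock_order(struct model_pos *p, size_t n)
{
	qsort(p, n, sizeof(*p), model_pos_cmp);
}
```

When all threads acquire in the same sorted order, fewer acquisitions arrive out of order, which in the terms of the message means fewer would_deadlock restarts.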

commit | commitdiff | tree

Kent Overstreet [Sat, 11 Nov 2023 20:08:36 +0000 (15:08 -0500)]

bcachefs: bch2_trans_account_disk_usage_change()

The disk space accounting rewrite is splitting out accounting for each
replicas set - those are moving to btree keys, instead of percpu
counters.

This breaks bch2_trans_fs_usage_apply() up, splitting out the part we
will still need.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Fri, 17 Nov 2023 05:03:45 +0000 (00:03 -0500)]

bcachefs: bch_fs_usage_base

Split out base filesystem usage into its own type; prep work for
breaking up bch2_trans_fs_usage_apply().

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 7 Jan 2024 02:01:47 +0000 (21:01 -0500)]

bcachefs: bch2_prt_compression_type()

Bounds checking helper, since compression types are extensible.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 7 Jan 2024 01:57:43 +0000 (20:57 -0500)]

bcachefs: helpers for printing data types

We need bounds checking since new versions may introduce new data types.

Signed-off-by: Kent Overstreet <[email protected]>
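
A bounds-checked name lookup of the kind described above might look like the following sketch. The table contents, the `model_*` names, and the "(unknown)" fallback string are assumptions for illustration; they are not the bcachefs helpers.

```c
#include <stddef.h>

/* Hypothetical name table; a newer on-disk version may use values past its end. */
static const char * const model_data_types[] = {
	"free", "sb", "journal", "btree", "user",
};

#define MODEL_NR_DATA_TYPES \
	(sizeof(model_data_types) / sizeof(model_data_types[0]))

/* Return the name for a known type, or a placeholder for anything else. */
static const char *model_data_type_str(unsigned int type)
{
	return type < MODEL_NR_DATA_TYPES
		? model_data_types[type]
		: "(unknown)";
}
```

The check matters because the value comes from disk: indexing the table directly with a type written by a newer version would read past the end of the array.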

commit | commitdiff | tree

Kent Overstreet [Sun, 7 Jan 2024 22:14:46 +0000 (17:14 -0500)]

bcachefs: BTREE_TRIGGER_ATOMIC

Add a new flag to be explicit about when we're running atomic triggers.

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 7 Jan 2024 00:47:09 +0000 (19:47 -0500)]

bcachefs: drop to_text code for obsolete bps in alloc keys

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Sun, 7 Jan 2024 00:29:14 +0000 (19:29 -0500)]

bcachefs: eytzinger_for_each() declares loop iter

Signed-off-by: Kent Overstreet <[email protected]>

commit | commitdiff | tree

Kent Overstreet [Thu, 11 Jan 2024 04:08:30 +0000 (23:08 -0500)]

bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT

Previously, we added logging in the write path to ensure that any
unexpected errors getting reported to userspace have a log message; but
BCH_WRITE_ALLOC_NOWAIT is a special case, it's used for promotes where
errors are expected and not reported out to userspace - so we need to
silence those.

Signed-off-by: Kent Overstreet <[email protected]>
