Hannes Reinecke [Wed, 20 Sep 2017 06:58:53 +0000 (08:58 +0200)]
scsi: scsi_transport_fc: set scsi_target_id upon rescan
When an rport is found in the bindings array there is no guarantee that
it had been a target port, so we need to call fc_remote_port_rolechg()
here to ensure the scsi_target_id is set correctly. Otherwise the port
will never be scanned.
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Two sets of NVMe pull requests from Christoph:
- Fixes for the Fibre Channel host/target to fix spec compliance
- Allow a zero keep alive timeout
- Make the debug printk for broken SGLs work better
- Fix queue zeroing during initialization
- Set of RDMA and FC fixes
- Target div-by-zero fix
- bsg double-free fix.
- ndb unknown ioctl fix from Josef.
- Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
around for a long time, well reviewed. From Lukas.
- brd overflow fix from Mikulas.
- Fix for a loop regression in this merge window, where using a union
for two members of the loop_cmd turned out to be a really bad idea.
From Omar.
- Fix for an iostat regression fix in this series, using the wrong API
to get at the block queue. From Shaohua.
- Fix for a potential blktrace delection deadlock. From Waiman.
* 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
nvme-fcloop: fix port deletes and callbacks
nvmet-fc: sync header templates with comments
nvmet-fc: ensure target queue id within range.
nvmet-fc: on port remove call put outside lock
nvme-rdma: don't fully stop the controller in error recovery
nvme-rdma: give up reconnect if state change fails
nvme-core: Use nvme_wq to queue async events and fw activation
nvme: fix sqhd reference when admin queue connect fails
block: fix a crash caused by wrong API
fs: Fix page cache inconsistency when mixing buffered and AIO DIO
nvmet: implement valid sqhd values in completions
nvme-fabrics: Allow 0 as KATO value
nvme: allow timed-out ios to retry
nvme: stop aer posting if controller state not live
nvme-pci: Print invalid SGL only once
nvme-pci: initialize queue memory before interrupts
nvmet-fc: fix failing max io queue connections
nvme-fc: use transport-specific sgl format
nvme: add transport SGL definitions
nvme.h: remove FC transport-specific error values
...
PM / OPP: Call notifier without holding opp_table->lock
The notifier callbacks may want to call some OPP helper routines which
may try to take the same opp_table->lock again and cause a deadlock. One
such usecase was reported by Chanwoo Choi, where calling
dev_pm_opp_disable() leads us to the devfreq's OPP notifier handler,
which further calls dev_pm_opp_find_freq_floor() and it deadlocks.
We don't really need the opp_table->lock to be held across the notifier
call though, all we want to make sure is that the 'opp' doesn't get
freed while being used from within the notifier chain. We can do it with
help of dev_pm_opp_get/put() as well. Let's do it.
Merge tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
"GFS2: Fix an old regression in GFS2's debugfs interface
This fixes a regression introduced by commit 88ffbf3e037e ("GFS2: Use
resizable hash table for glocks"). The regression caused the glock dump
in debugfs to not report all the glocks, which makes debugging
extremely difficult"
* tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix debugfs glocks dump
This started out as just replacing the use of crypto/rng with
get_random_bytes_wait, so that we wouldn't use bad randomness at boot
time. But, upon looking further, it appears that there were even deeper
underlying cryptographic problems, and that this seems to have been
committed with very little crypto review. So, I rewrote the whole thing,
trying to keep to the conventions introduced by the previous author, to
fix these cryptographic flaws.
It makes no sense to seed crypto/rng at boot time and then keep
using it like this, when in fact there's already get_random_bytes_wait,
which can ensure there's enough entropy and be a much more standard way
of generating keys. Since this sensitive material is being stored
untrusted, using ECB and no authentication is simply not okay at all. I
find it surprising and a bit horrifying that this code even made it past
basic crypto review, which perhaps points to some larger issues. This
patch moves from using AES-ECB to using AES-GCM. Since keys are uniquely
generated each time, we can set the nonce to zero. There was also a race
condition in which the same key would be reused at the same time in
different threads. A mutex fixes this issue now.
So, to summarize, this commit fixes the following vulnerabilities:
* Low entropy key generation, allowing an attacker to potentially
guess or predict keys.
* Unauthenticated encryption, allowing an attacker to modify the
cipher text in particular ways in order to manipulate the plaintext,
which is is even more frightening considering the next point.
* Use of ECB mode, allowing an attacker to trivially swap blocks or
compare identical plaintext blocks.
* Key re-use.
* Faulty memory zeroing.
Merge tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Stack tracing and RCU has been having issues with each other and
lockdep has been pointing out constant problems.
The changes have been going into the stack tracer, but it has been
discovered that the problem isn't with the stack tracer itself, but it
is with calling save_stack_trace() from within the internals of RCU.
The stack tracer is the one that can trigger the issue the easiest,
but examining the problem further, it could also happen from a WARN()
in the wrong place, or even if an NMI happened in this area and it did
an rcu_read_lock().
The critical area is where RCU is not watching. Which can happen while
going to and from idle, or bringing up or taking down a CPU.
The final fix was to put the protection in kernel_text_address() as it
is the one that requires RCU to be watching while doing the stack
trace.
To make this work properly, Paul had to allow rcu_irq_enter() happen
after rcu_nmi_enter(). This should have been done anyway, since an NMI
can page fault (reading vmalloc area), and a page fault triggers
rcu_irq_enter().
One patch is just a consolidation of code so that the fix only needed
to be done in one location"
* tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove RCU work arounds from stack tracer
extable: Enable RCU if it is not watching in kernel_text_address()
extable: Consolidate *kernel_text_address() functions
rcu: Allow for page faults in NMI handlers
Peter Zijlstra [Wed, 20 Sep 2017 17:00:19 +0000 (19:00 +0200)]
smp/hotplug: Differentiate the AP completion between up and down
With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.
Peter Zijlstra [Wed, 20 Sep 2017 17:00:20 +0000 (19:00 +0200)]
smp/hotplug: Differentiate the AP-work lockdep class between up and down
With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.
Peter Zijlstra [Wed, 20 Sep 2017 17:00:18 +0000 (19:00 +0200)]
smp/hotplug: Callback vs state-machine consistency
While the generic callback functions have an 'int' return and thus
appear to be allowed to return error, this is not true for all states.
Specifically, what used to be STARTING/DYING are ran with IRQs
disabled from critical parts of CPU bringup/teardown and are not
allowed to fail. Add WARNs to enforce this rule.
But since some callbacks are indeed allowed to fail, we have the
situation where a state-machine rollback encounters a failure, in this
case we're stuck, we can't go forward and we can't go back. Also add a
WARN for that case.
AFAICT this is a fundamental 'problem' with no real obvious solution.
We want the 'prepare' callbacks to allow failure on either up or down.
Typically on prepare-up this would be things like -ENOMEM from
resource allocations, and the typical usage in prepare-down would be
something like -EBUSY to avoid CPUs being taken away.
Peter Zijlstra [Wed, 20 Sep 2017 17:00:17 +0000 (19:00 +0200)]
smp/hotplug: Rewrite AP state machine core
There is currently no explicit state change on rollback. That is,
st->bringup, st->rollback and st->target are not consistent when doing
the rollback.
Rework the AP state handling to be more coherent. This does mean we
have to do a second AP kick-and-wait for rollback, but since rollback
is the slow path of a slowpath, this really should not matter.
Take this opportunity to simplify the AP thread function to only run a
single callback per invocation. This unifies the three single/up/down
modes is supports. The looping it used to do for up/down are achieved
by retaining should_run and relying on the main smpboot_thread_fn()
loop.
(I have most of a patch that does the same for the BP state handling,
but that's not critical and gets a little complicated because
CPUHP_BRINGUP_CPU does the AP handoff from a callback, which gets
recursive @st usage, I still have de-fugly that.)
[ tglx: Move cpuhp_down_callbacks() et al. into the HOTPLUG_CPU section to
avoid gcc complaining about unused functions. Make the HOTPLUG_CPU
one piece instead of having two consecutive ifdef sections of the
same type. ]
Currently the rollback of multi-instance states is handled inside
cpuhp_invoke_callback(). The problem is that when we want to allow an
explicit state change for rollback, we need to return from the
function without doing the rollback.
Change cpuhp_invoke_callback() to optionally return the multi-instance
state, such that rollback can be done from a subsequent call.
It's caused by skb_shared_info at the end of sk_buff was overwritten by
ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.
During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
new value to skb_shinfo(SKB)->nr_frags by ev->type.
This patch is to fix it by checking nlh->nlmsg_len properly there to
avoid over accessing sk_buff.
Paul Burton [Fri, 22 Sep 2017 06:24:40 +0000 (23:24 -0700)]
irqchip/mips-gic: Use effective affinity to unmask
Commit 7778c4b27cbe ("irqchip: mips-gic: Use pcpu_masks to avoid reading
GIC_SH_MASK*") adjusted the way we handle masking interrupts to set &
clear the interrupt's bit in each pcpu_mask. This allows us to avoid
needing to read the GIC mask registers and perform a bitwise and of
their values with the pending & pcpu_masks.
Unfortunately this didn't quite work for IPIs, which were mapped to a
particular CPU/VP during initialisation but never set the affinity or
effective_affinity fields of their struct irq_desc. This led to them
losing their affinity when gic_unmask_irq() was called for them, and
they'd all become affine to cpu0.
Fix this by:
1) Setting the effective affinity of interrupts in
gic_shared_irq_domain_map(), which is where we actually map an
interrupt to a CPU/VP. This ensures that the effective affinity mask
is always valid, not just after explicitly setting affinity.
2) Using an interrupt's effective affinity when unmasking it, which
prevents gic_unmask_irq() from unintentionally changing which
pcpu_mask includes an interrupt.
Paul Burton [Fri, 22 Sep 2017 06:24:39 +0000 (23:24 -0700)]
irqchip/mips-gic: Fix shifts to extract register fields
The MIPS GIC driver is incorrectly using __fls to shift registers,
intending to shift to the least significant bit of a value based upon
its mask but instead shifting off all but the value's top bit. It should
actually be using __ffs to shift to the first, not last, bit of the
value.
Apparently the system I used when testing commit 3680746abd87
("irqchip: mips-gic: Convert remaining shared reg access to new
accessors") and commit b2b2e584ceab ("irqchip: mips-gic: Clean up mti,
reserved-cpu-vectors handling") managed to work correctly despite this
issue, but not all systems do...
James Smart [Tue, 19 Sep 2017 21:01:50 +0000 (14:01 -0700)]
nvme-fcloop: fix port deletes and callbacks
Now that there are potentially long delays between when a remoteport or
targetport delete calls is made and when the callback occurs (dev_loss_tmo
timeout), no longer block in the delete routines and move the final nport
puts to the callbacks.
Moved the fcloop_nport_get/put/free routines to avoid forward declarations.
Ensure port_info structs used in registrations are nulled in case fields
are not set (ex: devloss_tmo values).
James Smart [Wed, 20 Sep 2017 18:07:26 +0000 (11:07 -0700)]
nvmet-fc: sync header templates with comments
Comments were incorrect:
- defer_rcv was in host port template. moved to target port template
- Added Mandatory statements for target port template items
nvme-rdma: don't fully stop the controller in error recovery
By calling nvme_stop_ctrl on a already failed controller will wait for the
scan work to complete (only by identify timeout expiration which is 60
seconds). This is unnecessary when we already know that the controller has
failed.
nvme-rdma: give up reconnect if state change fails
If we failed to transition to state LIVE after a successful reconnect,
then controller deletion already started. In this case there is no
point moving forward with reconnect.
James Smart [Thu, 21 Sep 2017 15:13:49 +0000 (08:13 -0700)]
nvme: fix sqhd reference when admin queue connect fails
Fix bug in sqhd patch.
It wasn't the sq that was at risk. In the case where the admin queue
connect command fails, the sq->size field is not set. Therefore, this
becomes a divide by zero error.
Add a quick check to bypass under this failure condition.
The switch to rhashtables (commit 88ffbf3e03) broke the debugfs glock
dump (/sys/kernel/debug/gfs2/<device>/glocks) for dumps bigger than a
single buffer: the right function for restarting an rhashtable iteration
from the beginning of the hash table is rhashtable_walk_enter;
rhashtable_walk_stop + rhashtable_walk_start will just resume from the
current position.
selftests: timers: set-timer-lat: Fix hang when testing unsupported alarms
When timer_create() fails on a bootime or realtime clock, setup_timer()
returns 0 as if timer has been set. Callers wait forever for the timer
to expire.
This hang is seen on a system that doesn't have support for:
Test hangs waiting for a timer that hasn't been set to expire. Fix
setup_timer() to return 1, add handling in callers to detect the
unsupported case and return 0 without waiting to not fail the test.
selftests: timers: set-timer-lat: fix hang when std out/err are redirected
do_timer_oneshot() uses select() as a timer with FD_SETSIZE and readfs
is cleared with FD_ZERO without FD_SET.
When stdout and stderr are redirected, the test hangs in select forever.
Fix the problem calling select() with readfds empty and nfds zero. This
is sufficient for using select() for timer.
With this fix "./set-timer-lat > /dev/null 2>&1" no longer hangs.
Li Zhijian [Thu, 21 Sep 2017 09:13:27 +0000 (17:13 +0800)]
selftests/memfd: correct run_tests.sh permission
to fix the following issue:
------------------
TAP version 13
selftests: run_tests.sh
========================================
selftests: Warning: file run_tests.sh is not executable, correct this.
not ok 1..1 selftests: run_tests.sh [FAIL]
------------------
The 2.26 release of glibc changed how siginfo_t is defined, and the earlier
work-around to using the kernel definition are no longer needed. The old
way needs to stay around for a while, though.
selftests: futex: Makefile: fix for loops in targets to run silently
Fix for loops in targets to run silently to avoid cluttering the test
results.
Suppresses the following from targets:
for DIR in functional; do \
BUILD_TARGET=./tools/testing/selftests/futex/$DIR; \
mkdir $BUILD_TARGET -p; \
make OUTPUT=$BUILD_TARGET -C $DIR all;\
done
selftests: mqueue: Use full path to run tests from Makefile
Use full path including $(OUTPUT) to run tests from Makefile for
normal case when objects reside in the source tree as well as when
objects are relocated with make O=dir. In both cases $(OUTPUT) will
be set correctly by lib.mk.
selftests: futex: copy sub-dir test scripts for make O=dir run
For make O=dir run_tests to work, test scripts from sub-directories
need to be copied over to the object directory. Running tests from the
object directory is necessary to avoid making the source tree dirty.
PCI: Add dummy pci_acs_enabled() for CONFIG_PCI=n build
If CONFIG_PCI=n and gcc (e.g. 4.1.2) decides not to inline
get_pci_function_alias_group(), the build fails with:
drivers/iommu/iommu.o: In function `get_pci_function_alias_group':
iommu.c:(.text+0xfdc): undefined reference to `pci_acs_enabled'
Due to the various dummies for PCI calls in the CONFIG_PCI=n case,
pci_acs_enabled() never called, but not all versions of gcc are smart
enough to realize that.
While explicitly marking get_pci_function_alias_group() inline would fix
the build, this would inflate the code for the CONFIG_PCI=y case, as
get_pci_function_alias_group() is a not-so-small function called from two
places.
Hence fix the issue by introducing a dummy for pci_acs_enabled() instead.
IB/mlx5: Fix NULL deference on mlx5_ib_update_xlt failure
mlx5_ib_reg_user_mr called mlx5_ib_dereg_mr in case of MR population
failure. This resulted in a NULL dereference as ibmr->device wasn't
initialized yet.
We address this by adding an internal dereg_mr function that can handle
partially initialized MRs, and fixing clean_mr to work on partially
initialized MRs.
Fixes: ff740aefecb9 ("IB/mlx5: Decouple MR allocation and population flows") Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
The patch simplifies mlx5_ib_cont_pages and fixes the following
issues in the original implementation:
First issues is related to alignment of the PFNs. After the check
base + p != PFN, the alignment of the PFN wasn't checked. So the PFN
sequence 0, 1, 1, 2 would result in a page_shift of 13 even though
the 3rd PFN is not 8KB aligned.
This wasn't actually a bug because it was supported by all the
existing mlx5 compatible device, but we don't want to require
this support in all future devices.
Another issue is because the inner loop didn't advance PFN so
the test "if (base + p != pfn)" always failed for SGE with
len > (1<<page_shift).
Alex Vesker [Sun, 24 Sep 2017 18:46:33 +0000 (21:46 +0300)]
IB/ipoib: Fix inconsistency with free_netdev and free_rdma_netdev
Call free_rdma_netdev instead of free_netdev each time we want to
release a netdevice. This call is also relevant for future freeing
of offloaded child interfaces.
This patch also adds a missing call for free netdevice when releasing
a parent interface that has child interfaces using ipoib_remove_one.
Fixes: cd565b4b51e5 ('IB/IPoIB: Support acceleration options callbacks') Signed-off-by: Alex Vesker <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
IB/ipoib: Fix sysfs Pkey create<->remove possible deadlock
A possible ABBA lock can happen with RTNL and vlan_rwsem.
For example:
Flow A: Device Flush
__ipoib_ib_dev_flush
down_read(vlan_rwsem) // Lock A
ipoib_flush_ah
flush_workqueue(priv->wq) // Wait for completion
A work on shared WQ (Mcast carrier)
ipoib_mcast_carrier_on_task
while (!rtnl_trylock()) // Wait for lock B
Flow B: Sysfs PKEY delete
ipoib_vlan_delete
lock(RTNL) // Lock B
down_write(vlan_rwsem) // Wait for lock A
This can happen with PKEY creates as well. The solution is to release
the RTNL lock in sysfs functions in case it is not possible to lock
VLAN RW semaphore and reset the SYS call.
Fixes: 69956d83267e ("IB/ipoib: Sync between remove_one to sysfs calls that use rtnl_lock") Signed-off-by: Shalom Lagziel <[email protected]> Signed-off-by: Alex Vesker <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
The ib_mr->length represents the length of the MR in bytes as per
the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).
Currently ib_mr->length field is defined as only 32-bits field.
This might result into truncation and failed WRs of consumers who
registers more than 4GB bytes memory regions and whose WRs accessing
such MRs.
This patch makes the length 64-bit to avoid such truncation.
When security_ib_alloc_security fails, qp->qp_sec memory is freed.
However ib_destroy_qp still tries to access this memory which result
in kernel crash. So its initialized to NULL to avoid such access.
Leon Romanovsky [Sun, 24 Sep 2017 18:46:29 +0000 (21:46 +0300)]
IB/core: Fix typo in the name of the tag-matching cap struct
The tag matching functionality is implemented by mlx5 driver
by extending XRQ, however this internal kernel information was
exposed to user space applications with *xrq* name instead of *tm*.
perf report: Fix debug messages with --call-graph option
With --call-graph option, perf report can display call chains using
type, min percent threshold, optional print limit and order. And the
default call-graph parameter is 'graph,0.5,caller,function,percent'.
Before this patch, 'perf report --call-graph' shows incorrect debug
messages as below:
That is because in function __parse_callchain_report_opt(),each field of
the call-graph parameter is passed to parse_callchain_{mode,order,
sort_key,value} in turn until it meets the matching value.
For example, the order field "caller" is passed to
parse_callchain_mode() firstly and obviously it doesn't match any mode
field. Therefore parse_callchain_mode() will shows the debug message
"Invalid callchain mode: caller", which could confuse users.
The patch fixes this issue by moving the warning out of the function
parse_callchain_{mode,order,sort_key,value}.
fs: Fix page cache inconsistency when mixing buffered and AIO DIO
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.
Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.
This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.
This was based on proposal from Jeff Moyer, thanks!
James Smart [Mon, 18 Sep 2017 16:08:29 +0000 (09:08 -0700)]
nvmet: implement valid sqhd values in completions
To support sqhd, for initiators that are following the spec and
paying attention to sqhd vs their sqtail values:
- add sqhd to struct nvmet_sq
- initialize sqhd to 0 in nvmet_sq_setup
- rather than propagate the 0's-based qsize value from the connect message
which requires a +1 in every sqhd update, and as nothing else references
it, convert to 1's-based value in nvmt_sq/cq_setup() calls.
- validate connect message sqsize being non-zero per spec.
- updated assign sqhd for every completion that goes back.
Also remove handling the NULL sq case in __nvmet_req_complete, as it can't
happen with the current code.
James Smart [Thu, 7 Sep 2017 20:18:04 +0000 (13:18 -0700)]
nvme: allow timed-out ios to retry
Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. On of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.
Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.
On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.
James Smart [Thu, 14 Sep 2017 18:03:09 +0000 (11:03 -0700)]
nvme: stop aer posting if controller state not live
If an nvme async_event command completes, in most cases, a new
async event is posted. However, if the controller enters a
resetting or reconnecting state, there is nothing to block the
scheduled work element from posting the async event again. Nor are
there calls from the transport to stop async events when an
association dies.
In the case of FC, where the association is torn down, the aer must
be aborted on the FC link and completes through the normal job
completion path. Thus the terminated async event ends up being
rescheduled even though the controller isn't in a valid state for
the aer, and the reposting gets the transport into a partially torn
down data structure.
It's possible to hit the scenario on rdma, although much less likely
due to an aer completing right as the association is terminated and
as the association teardown reclaims the blk requests via
nvme_cancel_request() so its immediate, not a link-related action
like on FC.
Fix by putting controller state checks in both the async event
completion routine where it schedules the async event and in the
async event work routine before it calls into the transport. It's
effectively a "stop_async_events()" behavior. The transport, when
it creates a new association with the subsystem will transition
the state back to live and is already restarting the async event
posting.
Signed-off-by: James Smart <[email protected]>
[hch: remove taking a lock over reading the controller state] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
Keith Busch [Fri, 15 Sep 2017 17:05:38 +0000 (13:05 -0400)]
nvme-pci: Print invalid SGL only once
The WARN_ONCE macro returns true if the condition is true, not if the
warn was raised, so we're printing the scatter list every time it's
invalid. This is excessive and makes debugging harder, so this patch
prints it just once.
Keith Busch [Thu, 14 Sep 2017 17:54:39 +0000 (13:54 -0400)]
nvme-pci: initialize queue memory before interrupts
A spurious interrupt before the nvme driver has initialized the completion
queue may inadvertently cause the driver to believe it has a completion
to process. This may result in a NULL dereference since the nvmeq's tags
are not set at this point.
The patch initializes the host's CQ memory so that a spurious interrupt
isn't mistaken for a real completion.
James Smart [Mon, 11 Sep 2017 23:16:53 +0000 (16:16 -0700)]
nvmet-fc: fix failing max io queue connections
fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g.
admin queue plus NVMET_NR_QUEUES-1 io queues. But NVMET_NR_QUEUES is
the number of io queues, so maximum queue count is really
NVMET_NR_QUEUES+1.
James Smart [Thu, 7 Sep 2017 20:20:24 +0000 (13:20 -0700)]
nvme-fc: use transport-specific sgl format
Sync with NVM Express spec change and FC-NVME 1.18.
FC transport sets SGL type to Transport SGL Data Block Descriptor and
subtype to transport-specific value 0x0A.
Removed the warn-on's on the PRP fields. They are unneeded. They were
to check for values from the upper layer that weren't set right, and
for the most part were fine. But, with Async events, which reuse the
same structure and 2nd time issued the SGL overlay converted them to
the Transport SGL values - the warn-on's were errantly firing.
James Smart [Thu, 7 Sep 2017 23:27:28 +0000 (16:27 -0700)]
nvmet-fcloop: remove use of FC-specific error codes
The FC-NVME transport loopback test module used the FC-specific error
codes in cases where it emulated a transport abort case. Instead of
using the FC-specific values, now use a generic value (NVME_SC_INTERNAL).
James Smart [Thu, 7 Sep 2017 23:27:27 +0000 (16:27 -0700)]
nvmet-fc: remove use of FC-specific error codes
The FC-NVME target transport used the FC-specific error codes in
return codes when the transport or lldd failed. Instead of using the
FC-specific values, now use a generic value (NVME_SC_INTERNAL).
James Smart [Thu, 7 Sep 2017 23:27:26 +0000 (16:27 -0700)]
nvme-fc: remove use of FC-specific error codes
The FC-NVME transport used the FC-specific error codes in cases where
it had to fabricate an error to go back up stack. Instead of using the
FC-specific values, now use a generic value (NVME_SC_INTERNAL).
loop: remove union of use_aio and ref in struct loop_cmd
When the request is completed, lo_complete_rq() checks cmd->use_aio.
However, if this is in fact an aio request, cmd->use_aio will have
already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a
union. On x86_64, there's a hole after the union anyways, so this
doesn't make struct loop_cmd any bigger.
Fixes: 92d773324b7e ("block/loop: fix use after free") Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.
The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.
The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.
Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.
Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Waiman Long <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]>
Fix typo in patch subject line, and prune a comment detailing how
the code used to work.
Josef Bacik [Sat, 6 May 2017 02:25:18 +0000 (22:25 -0400)]
nbd: ignore non-nbd ioctl's
In testing we noticed that nbd would spew if you ran a fio job against
the raw device itself. This is because fio calls a block device
specific ioctl, however the block layer will first pass this back to the
driver ioctl handler in case the driver wants to do something special.
Since the device was setup using netlink this caused us to spew every
time fio called this ioctl. Since we don't have special handling, just
error out for any non-nbd specific ioctl's that come in. This fixes the
spew.
The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.
This patch replaces multiplication and division with multiplication by the
number of sectors per page.
Alexandru Moise [Tue, 19 Sep 2017 20:04:12 +0000 (22:04 +0200)]
genirq: Check __free_irq() return value for NULL
__free_irq() can return a NULL irqaction for example when trying to free
already-free IRQ, but the callsite unconditionally dereferences the
returned pointer.
Peter Zijlstra [Fri, 22 Sep 2017 15:48:06 +0000 (17:48 +0200)]
futex: Fix pi_state->owner serialization
There was a reported suspicion about a race between exit_pi_state_list()
and put_pi_state(). The same report mentioned the comment with
put_pi_state() said it should be called with hb->lock held, and it no
longer is in all places.
As it turns out, the pi_state->owner serialization is indeed broken. As per
the new rules:
pi_state->owner should be serialized by pi_state->pi_mutex.wait_lock.
For the sites setting pi_state->owner we already hold wait_lock (where
required) but exit_pi_state_list() and put_pi_state() were not and
raced on clearing it.
Eric Biggers [Mon, 18 Sep 2017 18:38:29 +0000 (11:38 -0700)]
KEYS: restrict /proc/keys by credentials at open time
When checking for permission to view keys whilst reading from
/proc/keys, we should use the credentials with which the /proc/keys file
was opened. This is because, in a classic type of exploit, it can be
possible to bypass checks for the *current* credentials by passing the
file descriptor to a suid program.
Following commit 34dbbcdbf633 ("Make file credentials available to the
seqfile interfaces") we can finally fix it. So let's do it.
Eric Biggers [Mon, 18 Sep 2017 18:37:39 +0000 (11:37 -0700)]
KEYS: reset parent each time before searching key_user_tree
In key_user_lookup(), if there is no key_user for the given uid, we drop
key_user_lock, allocate a new key_user, and search the tree again. But
we failed to set 'parent' to NULL at the beginning of the second search.
If the tree were to be empty for the second search, the insertion would
be done with an invalid 'parent', scribbling over freed memory.
Fortunately this can't actually happen currently because the tree always
contains at least the root_key_user. But it still should be fixed to
make the code more robust.
Eric Biggers [Mon, 18 Sep 2017 18:37:23 +0000 (11:37 -0700)]
KEYS: prevent KEYCTL_READ on negative key
Because keyctl_read_key() looks up the key with no permissions
requested, it may find a negatively instantiated key. If the key is
also possessed, we went ahead and called ->read() on the key. But the
key payload will actually contain the ->reject_error rather than the
normal payload. Thus, the kernel oopses trying to read the
user_key_payload from memory address (int)-ENOKEY = 0x00000000ffffff82.
Fortunately the payload data is stored inline, so it shouldn't be
possible to abuse this as an arbitrary memory read primitive...
Reproducer:
keyctl new_session
keyctl request2 user desc '' @s
keyctl read $(keyctl show | awk '/user: desc/ {print $1}')
Fixes: 61ea0c0ba904 ("KEYS: Skip key state checks when checking for possession") Cc: <[email protected]> [v3.13+] Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
This is problematic because these "fake" keyrings won't have the right
permissions. In particular, the user who created them first will own
them and will have full access to them via the possessor permissions,
which can be used to compromise the security of a user's keys:
Fix it by marking user and user session keyrings with a flag
KEY_FLAG_UID_KEYRING. Then, when searching for a user or user session
keyring by name, skip all keyrings that don't have the flag set.
Fixes: 69664cf16af4 ("keys: don't generate user and user session keyrings unless they're accessed") Cc: <[email protected]> [v2.6.26+] Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
Eric Biggers [Mon, 18 Sep 2017 18:36:45 +0000 (11:36 -0700)]
KEYS: fix writing past end of user-supplied buffer in keyring_read()
Userspace can call keyctl_read() on a keyring to get the list of IDs of
keys in the keyring. But if the user-supplied buffer is too small, the
kernel would write the full list anyway --- which will corrupt whatever
userspace memory happened to be past the end of the buffer. Fix it by
only filling the space that is available.
Eric Biggers [Mon, 18 Sep 2017 18:36:31 +0000 (11:36 -0700)]
KEYS: fix key refcount leak in keyctl_read_key()
In keyctl_read_key(), if key_permission() were to return an error code
other than EACCES, we would leak a the reference to the key. This can't
actually happen currently because key_permission() can only return an
error code other than EACCES if security_key_permission() does, only
SELinux and Smack implement that hook, and neither can return an error
code other than EACCES. But it should still be fixed, as it is a bug
waiting to happen.
Fixes: 29db91906340 ("[PATCH] Keys: Add LSM hooks for key management [try #3]") Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
Eric Biggers [Mon, 18 Sep 2017 18:36:12 +0000 (11:36 -0700)]
KEYS: fix key refcount leak in keyctl_assume_authority()
In keyctl_assume_authority(), if keyctl_change_reqkey_auth() were to
fail, we would leak the reference to the 'authkey'. Currently this can
only happen if prepare_creds() fails to allocate memory. But it still
should be fixed, as it is a more severe bug waiting to happen.
This patch also moves the read of 'authkey->serial' to before the
reference to the authkey is dropped. Doing the read after dropping the
reference is very fragile because it assumes we still hold another
reference to the key. (Which we do, in current->cred->request_key_auth,
but there's no reason not to write it in the "obviously correct" way.)
Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials") Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
Eric Biggers [Thu, 21 Sep 2017 20:57:41 +0000 (13:57 -0700)]
KEYS: don't revoke uninstantiated key in request_key_auth_new()
If key_instantiate_and_link() were to fail (which fortunately isn't
possible currently), the call to key_revoke(authkey) would crash with a
NULL pointer dereference in request_key_auth_revoke() because the key
has not yet been instantiated.
Fix this by removing the call to key_revoke(). key_put() is sufficient,
as it's not possible for an uninstantiated authkey to have been used for
anything yet.
Fixes: b5f545c880a2 ("[PATCH] keys: Permit running process to instantiate keys") Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
Eric Biggers [Thu, 21 Sep 2017 20:57:40 +0000 (13:57 -0700)]
KEYS: fix cred refcount leak in request_key_auth_new()
In request_key_auth_new(), if key_alloc() or key_instantiate_and_link()
were to fail, we would leak a reference to the 'struct cred'. Currently
this can only happen if key_alloc() fails to allocate memory. But it
still should be fixed, as it is a more severe bug waiting to happen.
Fix it by cleaning things up to use a helper function which frees a
'struct request_key_auth' correctly.
Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials") Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David Howells <[email protected]>
perf evsel: Fix attr.exclude_kernel setting for default cycles:p
Yet another fix for probing the max attr.precise_ip setting: it is not
enough settting attr.exclude_kernel for !root users, as they _can_
profile the kernel if the kernel.perf_event_paranoid sysctl is set to
-1, so check that as well.
I.e. non-root, default kernel.perf_event_paranoid: :uppp modifier = not allowed to sample the kernel,
non-root, most permissible kernel.perf_event_paranoid: :ppp = allowed to sample the kernel.
In both cases, use the highest available precision: attr.precise_ip = 3.
New drm_syncobj_create flags definitions, new drm_syncobj_wait
and drm_syncobj_array ABIs. DRM_I915_PERF_* calls and a new
I915_PARAM_HAS_EXEC_FENCE_ARRAY ABI for the Intel driver.
- tools/include/uapi/linux/bpf.h:
New bpf_sock fields (::mark and ::priority), new XDP_REDIRECT
action, new kvm_ppc_smmu_info fields (::data_keys, instr_keys)
perf tools: Get all of tools/{arch,include}/ in the MANIFEST
Now that I'm switching the container builds from using a local volume
pointing to the kernel repository with the perf sources, instead getting
a detached tarball to be able to use a container cluster, some places
broke because I forgot to put some of the required files in
tools/perf/MANIFEST, namely some bitsperlong.h files.
So, to fix it do the same as for tools/build/ and pack the whole
tools/arch/ directory.
Michal Simek [Tue, 19 Sep 2017 14:40:21 +0000 (16:40 +0200)]
microblaze: Add missing kvm_para.h to Kbuild
Running make allmodconfig;make is throwing compilation error:
CC kernel/watchdog.o
In file included from ./include/linux/kvm_para.h:4:0,
from kernel/watchdog.c:29:
./include/uapi/linux/kvm_para.h:32:26: fatal error: asm/kvm_para.h: No
such file or directory
#include <asm/kvm_para.h>
^
compilation terminated.
make[1]: *** [kernel/watchdog.o] Error 1
make: *** [kernel/watchdog.o] Error 2
Merge tag 'iio-fixes-for-4.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
First round of IIO fixes for the 4.14 cycle
Note this includes fixes from recent merge window. As such the tree
is based on top of a prior staging/staging-next tree.
* iio core
- return and error for a failed read_reg debugfs call rather than
eating the error.
* ad7192
- Use the dedicated reset function in the ad_sigma_delta library
instead of an spi transfer with the data on the stack which
could cause problems with DMA.
* ad7793
- Implement a dedicate reset function in the ad_sigma_delta library
and use it to correctly reset this part.
* bme280
- ctrl_reg write must occur after any register writes
for updates to take effect.
* mcp320x
- negative voltage readout was broken.
- Fix an oops on module unload due to spi_set_drvdata not being called
in probe.
* st_magn
- Fix the data ready line configuration for the lis3mdl. It is not
configurable so the st_magn core was assuming it didn't exist
and so wasn't consuming interrupts resulting in an unhandled
interrupt.
* stm32-adc
- off by one error on max channels checking.
* stm32-timer
- preset should not be buffered - reorganising register writes avoids
this.
- fix a corner case in which write preset goes wrong when a timer is
used first as a trigger then as a counter with preset. Odd case but
you never know.
* ti-ads1015
- Fix setting of comparator polarity by fixing bitfield definition.
* twl4030
- Error path handling fix to cleanup in event of regulator
registration failure.
- Disable the vusb3v1 regulator correctly in error handling
- Don't paper over a regulator enable failure.
USB: cdc-wdm: ignore -EPIPE from GetEncapsulatedResponse
The driver will forward errors to userspace after turning most of them
into -EIO. But all status codes are not equal. The -EPIPE (stall) in
particular can be seen more as a result of normal USB signaling than
an actual error. The state is automatically cleared by the USB core
without intervention from either driver or userspace.
And most devices and firmwares will never trigger a stall as a result
of GetEncapsulatedResponse. This is in fact a requirement for CDC WDM
devices. Quoting from section 7.1 of the CDC WMC spec revision 1.1:
The function shall not return STALL in response to
GetEncapsulatedResponse.
But this driver is also handling GetEncapsulatedResponse on behalf of
the qmi_wwan and cdc_mbim drivers. Unfortunately the relevant specs
are not as clear wrt stall. So some QMI and MBIM devices *will*
occasionally stall, causing the GetEncapsulatedResponse to return an
-EPIPE status. Translating this into -EIO for userspace has proven to
be harmful. Treating it as an empty read is safer, making the driver
behave as if the device was conforming to the CDC WDM spec.
There have been numerous reports of issues related to -EPIPE errors
from some newer CDC MBIM devices in particular, like for example the
Fibocom L831-EAU. Testing on this device has shown that the issues
go away if we simply ignore the -EPIPE status. Similar handling of
-EPIPE is already known from e.g. usb_get_string()
The -EPIPE log message is still kept to let us track devices with this
unexpected behaviour, hoping that it attracts attention from firmware
developers.
Dan Carpenter [Fri, 22 Sep 2017 20:43:46 +0000 (23:43 +0300)]
USB: devio: Don't corrupt user memory
The user buffer has "uurb->buffer_length" bytes. If the kernel has more
information than that, we should truncate it instead of writing past
the end of the user's buffer. I added a WARN_ONCE() to help the user
debug the issue.
Dan Carpenter [Fri, 22 Sep 2017 20:43:25 +0000 (23:43 +0300)]
USB: devio: Prevent integer overflow in proc_do_submiturb()
There used to be an integer overflow check in proc_do_submiturb() but
we removed it. It turns out that it's still required. The
uurb->buffer_length variable is a signed integer and it's controlled by
the user. It can lead to an integer overflow when we do:
x86/mm: Fix fault error path using unsafe vma pointer
commit 7b2d0dbac489 ("x86/mm/pkeys: Pass VMA down in to fault signal
generation code") passes down a vma pointer to the error path, but that is
done once the mmap_sem is released when calling mm_fault_error() from
__do_page_fault().
This is dangerous as the vma structure is no more safe to be used once the
mmap_sem has been released. As only the protection key value is required in
the error processing, we could just pass down this value.
Fix it by passing a pointer to a protection key value down to the fault
signal generation code. The use of a pointer allows to keep the check
generating a warning message in fill_sig_info_pkey() when the vma was not
known. If the pointer is valid, the protection value can be accessed by
deferencing the pointer.
[ tglx: Made *pkey u32 as that's the type which is passed in siginfo ]
Eric Biggers [Sat, 23 Sep 2017 13:00:09 +0000 (15:00 +0200)]
x86/fpu: Reinitialize FPU registers if restoring FPU state fails
Userspace can change the FPU state of a task using the ptrace() or
rt_sigreturn() system calls. Because reserved bits in the FPU state can
cause the XRSTOR instruction to fail, the kernel has to carefully
validate that no reserved bits or other invalid values are being set.
Unfortunately, there have been bugs in this validation code. For
example, we were not checking that the 'xcomp_bv' field in the
xstate_header was 0. As-is, such bugs are exploitable to read the FPU
registers of other processes on the system. To do so, an attacker can
create a task, assign to it an invalid FPU state, then spin in a loop
and monitor the values of the FPU registers. Because the task's FPU
registers are not being restored, sometimes the FPU registers will have
the values from another process.
This is likely to continue to be a problem in the future because the
validation done by the CPU instructions like XRSTOR is not immediately
visible to kernel developers. Nor will invalid FPU states ever be
encountered during ordinary use --- they will only be seen during
fuzzing or exploits. There can even be reserved bits outside the
xstate_header which are easy to forget about. For example, the MXCSR
register contains reserved bits, which were not validated by the
KVM_SET_XSAVE ioctl until commit a575813bfe4b ("KVM: x86: Fix load
damaged SSEx MXCSR register").
Therefore, mitigate this class of vulnerability by restoring the FPU
registers from init_fpstate if restoring from the task's state fails.
We actually used to do this, but it was (perhaps unwisely) removed by
commit 9ccc27a5d297 ("x86/fpu: Remove error return values from
copy_kernel_to_*regs() functions"). This new patch is also a bit
different. First, it only clears the registers, not also the bad
in-memory state; this is simpler and makes it easier to make the
mitigation cover all callers of __copy_kernel_to_fpregs(). Second, it
does the register clearing in an exception handler so that no extra
instructions are added to context switches. In fact, we *remove*
instructions, since previously we were always zeroing the register
containing 'err' even if CONFIG_X86_DEBUG_FPU was disabled.