Thomas Falcon [Sun, 5 Mar 2017 18:18:42 +0000 (12:18 -0600)]
ibmvnic: Allocate number of rx/tx buffers agreed on by firmware
The amount of TX/RX buffers that the vNIC driver currently allocates
is different from the amount agreed upon in negotiation with firmware.
Correct that by allocating the requested number of buffers confirmed
by firmware.
Use a counter to track the number of outstanding transmissions sent
that have not received completions. If the counter reaches the maximum
number of queue entries, stop transmissions on that queue. As we receive
more completions from firmware, wake the queue once the counter reaches
an acceptable level.
This patch prevents hardware/firmware TX queue from filling up and
and generating errors. Since incorporating this fix, internal testing
has reported that these firmware errors have stopped.
David S. Miller [Tue, 7 Mar 2017 22:09:59 +0000 (14:09 -0800)]
Merge branch 'rds-fixes'
Sowmini Varadhan says:
====================
rds: tcp: fix various rds-tcp issues during netns create/delete sequences
Dmitry Vyukov reported some syszkaller panics during netns deletion.
While I have not been able to reproduce those exact panics, my attempts
to do so uncovered a few other problems, which are fixed patch 2 and
patch 3 of this patch series. In addition, as mentioned in,
https://www.spinics.net/lists/netdev/msg422997.html
code-inspection points that the rds_connection needs to take an explicit
refcnt on the struct net so that it is held down until all cleanup is
completed for netns removal, and this is fixed by patch1.
The following scripts were run concurrently to uncover/test patch{2, 3}
while simultaneously running rds-ping to 12.0.0.18 from another system:
# cat del.rds
while [ 1 ]; do
modprobe rds_tcp
modprobe -r rds-tcp
done
# cat del.netns
while [ 1 ]; do
ip netns delete blue
ip netns add blue
ip link add link eth1 address a:b:c:d:e:f blue0 type macvlan
ip link set blue0 netns blue
ip netns exec blue ip addr add 12.0.0.18/24 dev blue0
ip netns exec blue ifconfig blue0 up
sleep 3;
done
====================
rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f5777e ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable() to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
null and we would hit a panic of the form
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: (null)
:
? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
tcp_data_queue+0x39d/0x5b0
tcp_rcv_established+0x2e5/0x660
tcp_v4_do_rcv+0x122/0x220
tcp_v4_rcv+0x8b7/0x980
:
In the above case, it is not fatal to encounter a NULL value for
ready- we should just drop the packet and let the flush of the
acceptor thread finish gracefully.
In general, the tear-down sequence for listen() and accept() socket
that is ensured by this commit is:
rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
In rds_tcp_listen_stop():
serialize with, and prevent, further callbacks using lock_sock()
flush rds_wq
flush acceptor workq
sock_release(listen socket)
rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
Order of initialization in rds_tcp_init needs to be done so
that resources are set up and destroyed in the correct synchronization
sequence with both the data path, as well as netns create/destroy
path. Specifically,
- we must call register_pernet_subsys and get the rds_tcp_netid
before calling register_netdevice_notifier, otherwise we risk
the sequence
1. register_netdevice_notifier sets up netdev notifier callback
2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
the wrong rtn, resulting in a panic with string that is of the form:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
:
- the rds_tcp_incoming_slab kmem_cache must be initialized before the
datapath starts up. The latter can happen any time after the
pernet_subsys registration of rds_tcp_net_ops, whose -> init
function sets up the listen socket. If the rds_tcp_incoming_slab has
not been set up at that time, a panic of the form below may be
encountered
BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: kmem_cache_alloc+0x90/0x1c0
:
rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
tcp_read_sock+0x96/0x1c0
rds_tcp_recv_path+0x65/0x80 [rds_tcp]
:
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
take a ref on the net, and hold the netns down till the connection
tear-down is complete.
David S. Miller [Tue, 7 Mar 2017 22:06:15 +0000 (14:06 -0800)]
Merge branch 'sock_hold-misuses'
Eric Dumazet says:
====================
net: fix possible sock_hold() misuses
skb_complete_wifi_ack() and skb_complete_tx_timestamp() currently
call sock_hold() on sockets that might have transitioned their sk_refcnt
to zero already.
====================
David Howells [Sat, 4 Mar 2017 00:01:41 +0000 (00:01 +0000)]
rxrpc: Call state should be read with READ_ONCE() under some circumstances
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.
Eric Dumazet [Fri, 3 Mar 2017 22:08:21 +0000 (14:08 -0800)]
tcp: fix various issues for sockets morphing to listen state
Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)
I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.
1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction
Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN
states from tcp_v6_mtu_reduced()
here are fixes for a crash with PTP, a crash in setting of VF multicast
addresses, and non-working VLAN filters configuration from the VF side.
====================
Michal Schmidt [Fri, 3 Mar 2017 16:08:33 +0000 (17:08 +0100)]
bnx2x: fix incorrect filter count in an error message
filters->count is the number of filters we were supposed to configure.
There is no reason to increase it by +1 when printing the count in an error
message.
Michal Schmidt [Fri, 3 Mar 2017 16:08:30 +0000 (17:08 +0100)]
bnx2x: fix possible overrun of VFPF multicast addresses array
It is too late to check for the limit of the number of VF multicast
addresses after they have already been copied to the req->multicast[]
array, possibly overflowing it.
Do the check before copying.
Also fix the error path to not skip unlocking vf2pf_mutex.
Michal Schmidt [Fri, 3 Mar 2017 16:08:29 +0000 (17:08 +0100)]
bnx2x: lower verbosity of VF stats debug messages
When BNX2X_MSG_IOV is enabled, the driver produces too many VF statistics
messages. Lower the verbosity of the VF stats messages similarly as in
commit 76ca70fabbdaa3 ("bnx2x: [Debug] change verbosity of some prints").
Michal Schmidt [Fri, 3 Mar 2017 16:08:28 +0000 (17:08 +0100)]
bnx2x: prevent crash when accessing PTP with interface down
It is possible to crash the kernel by accessing a PTP device while its
associated bnx2x interface is down. Before the interface is brought up,
the timecounter is not initialized, so accessing it results in NULL
dereference.
Fix it by checking if the interface is up.
Use -ENETDOWN as the error code when the interface is down.
-EFAULT in bnx2x_ptp_adjfreq() did not seem right.
Arnd Bergmann [Tue, 28 Feb 2017 21:12:04 +0000 (22:12 +0100)]
net/mlx5e: add IPV6 dependency
The ethernet support now calls directly into the ipv6 core code, which
fails if IPV6 is a loadable module but mlx5 is built-in:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.o: In function `mlx5e_create_encap_header_ipv6':
en_tc.c:(.text.mlx5e_create_encap_header_ipv6+0x110): undefined reference to `ip6_route_output_flags'
This adds a dependency to ensure that MLX5_CORE_EN can only be built
if we are able link the kernel successfully. The downside is that the
ethernet option can be hidden. Alternatively we could make MLX5_CORE
depend on "IPV6 || !IPV6", which would force MLX5_CORE to be a module
when IPV6 is, including in configurations where we don't use the ethernet
support at all.
Yinghai Lu [Wed, 1 Mar 2017 08:25:40 +0000 (00:25 -0800)]
PCI/ASPM: Always set link->downstream to avoid NULL dereference on remove
We call pcie_aspm_exit_link_state() when we remove a device. If the device
is the last PCIe function to be removed below a bridge and the bridge has
an ASPM link_state struct, we disable ASPM on the link. Disabling ASPM
requires link->downstream (used in pcie_config_aspm_link()).
We previously set link->downstream in pcie_aspm_cap_init(), but only if the
device was not blacklisted. Removing the blacklisted device caused a NULL
pointer dereference in the pcie_aspm_exit_link_state() ->
pcie_config_aspm_link() path:
dt: emac: document device-tree based phy discovery and setup
This patch adds documentation for a new "phy-handle" property,
"fixed-link" and "mdio" sub-node. These allows the enumeration
of PHYs which are supported by the phy library under drivers/net/phy.
The EMAC ethernet controller in IBM and AMCC 4xx chips is
currently stuck with a few privately defined phy
implementations. It has no support for PHYs which
are supported by the generic phylib.
Linus Torvalds [Tue, 7 Mar 2017 18:52:26 +0000 (10:52 -0800)]
Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax
Pull idr fix (and new tests) from Matthew Wilcox:
"One urgent patch in here; freeing the correct IDA bitmap.
Everything else is changes to the test suite"
* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax:
radix tree test suite: Specify -m32 in LDFLAGS too
ida: Free correct IDA bitmap
radix tree test suite: Depend on Makefile and quieten grep
radix tree test suite: Fix build with --as-needed
radix tree test suite: Build 32 bit binaries
radix tree test suite: Add performance test for radix_tree_join()
radix tree test suite: Add performance test for radix_tree_split()
radix tree test suite: Add performance benchmarks
radix tree test suite: Add test for radix_tree_clear_tags()
radix tree test suite: Add tests for ida_simple_get() and ida_simple_remove()
radix tree test suite: Add test for idr_get_next()
Jaehoon Chung [Tue, 7 Mar 2017 10:54:05 +0000 (19:54 +0900)]
PCI: exynos: Initialize elbi_base even when using PHY framework
Even when using the PHY framework, we need the elbi_base. Before this
patch, we didn't initialize elbi_base, which caused NULL pointer
dereferences later.
Fixes: e7cd7ef58e1f ("PCI: exynos: Support the PHY generic framework") Signed-off-by: Jaehoon Chung <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
Linus Torvalds [Tue, 7 Mar 2017 18:46:10 +0000 (10:46 -0800)]
Merge tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Five fairly small fixes for things that went in this cycle.
A fairly large patch to rework the CAS logic on Power9, necessitated
by a late change to the firmware API, and we can't boot without it.
Three fixes going to stable, allowing more instructions to be emulated
on LE, fixing a boot crash on 32-bit Freescale BookE machines, and the
OPAL XICS workaround.
And a patch from me to sort the selects under CONFIG PPC. Annoying
churn, but worth it in the long run, and best for it to go in now to
avoid conflicts.
Thanks to:
Alexey Kardashevskiy, Anton Blanchard, Balbir Singh, Gautham R.
Shenoy, Laurentiu Tudor, Nicholas Piggin, Paul Mackerras, Ravi
Bangoria, Sachin Sant, Shile Zhang, Suraj Jitindar Singh"
* tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Sort the selects under CONFIG_PPC
powerpc/64: Fix L1D cache shape vector reporting L1I values
powerpc/64: Avoid panic during boot due to divide by zero in init_cache_info()
powerpc: Update to new option-vector-5 format for CAS
powerpc: Parse the command line before calling CAS
powerpc/xics: Work around limitations of OPAL XICS priority handling
powerpc/64: Fix checksum folding in csum_add()
powerpc/powernv: Fix opal tracepoints with JUMP_LABEL=n
powerpc/booke: Fix boot crash due to null hugepd
powerpc: Fix compiling a BE kernel with a powerpc64le toolchain
selftest/powerpc: Fix false failures for skipped tests
powerpc/powernv: Fix bug due to labeling ambiguity in power_enter_stop
powerpc/64: Invalidate process table caching after setting process table
powerpc: emulate_step() tests for load/store instructions
powerpc: Emulation support for load/store instructions on LE
Matthew Wilcox [Fri, 3 Mar 2017 17:28:37 +0000 (12:28 -0500)]
radix tree test suite: Specify -m32 in LDFLAGS too
Michael's patch to use the default make rule for linking and the patch
from Rehas to use -m32 if building a 32-bit test-suite on a 64-bit
platform don't work well together.
Matthew Wilcox [Fri, 3 Mar 2017 17:16:10 +0000 (12:16 -0500)]
ida: Free correct IDA bitmap
There's a relatively rare race where we look at the per-cpu preallocated
IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
If the kmalloc() happened to sleep and we were rescheduled to a different
CPU, or an interrupt came in at the exact right time, another task
might have successfully allocated a bitmap and already deposited it.
I forgot what the semantics of cmpxchg() were and ended up freeing the
wrong bitmap leading to KASAN reporting a use-after-free.
Dmitry found the bug with syzkaller & wrote the patch. I wrote the test
case that will reproduce the bug without his patch being applied.
Matthew Wilcox [Thu, 2 Mar 2017 17:24:28 +0000 (12:24 -0500)]
radix tree test suite: Depend on Makefile and quieten grep
Changing the CFLAGS in the Makefile didn't always lead to a
recompilation because the OFILES didn't depend on the Makefile.
Also, after doing make clean, grep would still complain about
a missing map-shift.h; we need -s as well as -q.
Currently the radix tree test suite doesn't build with toolchains that
use --as-needed by default, for example Ubuntu's:
cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address -lpthread -lurcu main.o ... -o main
/usr/bin/ld: regression1.o: undefined reference to symbol 'pthread_join@@GLIBC_2.17'
/lib/powerpc64le-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
This is caused by the custom makefile rules placing LDFLAGS before the
.o files that need the libraries.
We could fix it by using --no-as-needed, or rewriting the custom rules.
But we can also just drop the custom rules and move the libraries to
LDLIBS, and then the default rules work correctly - with the one caveat
that we need to add -fsanitize=address to LDFLAGS because that must be
passed to the linker as well as the compiler.
Rehas Sachdeva [Sun, 26 Feb 2017 11:33:00 +0000 (06:33 -0500)]
radix tree test suite: Add test for radix_tree_clear_tags()
Assert that radix_tree_clear_tags() clears the tags on the passed node and
slot. Assert that the case where the radix tree has only one entry at index
zero and the node is NULL, is also handled.
Linus Torvalds [Tue, 7 Mar 2017 18:06:25 +0000 (10:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fix from Eric Biederman:
"This fixes a race between put_ucounts and get_ucounts that can cause a
use after free. The fix works by simplifying the code and so there is
not even a temptation to be clever and play spinlock vs atomic
reference games"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucount: Remove the atomicity from ucount->count
Alexander Popov [Tue, 28 Feb 2017 16:54:40 +0000 (19:54 +0300)]
tty: n_hdlc: get rid of racy n_hdlc.tbuf
Currently N_HDLC line discipline uses a self-made singly linked list for
data buffers and has n_hdlc.tbuf pointer for buffer retransmitting after
an error.
The commit be10eb7589337e5defbe214dae038a53dd21add8
("tty: n_hdlc add buffer flushing") introduced racy access to n_hdlc.tbuf.
After tx error concurrent flush_tx_queue() and n_hdlc_send_frames() can put
one data buffer to tx_free_buf_list twice. That causes double free in
n_hdlc_release().
Let's use standard kernel linked list and get rid of n_hdlc.tbuf:
in case of tx error put current data buffer after the head of tx_buf_list.
Linus Torvalds [Tue, 7 Mar 2017 17:37:28 +0000 (09:37 -0800)]
Merge tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"There was some breakage with the changes for jump labels in the 4.11
merge window:
- powerpc broke as jump labels uses the two LSB bits as flags in
initialization.
A check was added to make sure that all jump label entries were 4
bytes aligned, but powerpc didn't work that way for modules. Adding
an alignment in the module linker script appeared to be the best
solution.
- Jump labels also added an anonymous union to access those LSB bits
as a normal long. But because this structure had static
initialization, it broke older compilers that could not statically
initialize anonymous unions without brackets.
- The command line parameter for setting function graph filter broke
the "EMPTY_HASH" descriptor by modifying it instead of creating a
new hash to hold the entries.
- The command line parameter ftrace_graph_max_depth was added to
allow its setting at boot time. It uses existing code and only the
command line hook was added.
This is not really a fix, but as it uses existing code without
affecting anything else, I added it to this release. It was ready
before the merge window closed, but I wanted to let it sit in
linux-next for a couple of days first"
* tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/graph: Add ftrace_graph_max_depth kernel parameter
tracing: Add #undef to fix compile error
jump_label: Add comment about initialization order for anonymous unions
jump_label: Fix anonymous union initialization
module: set __jump_table alignment to 8
ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
tracing: Fix code comment for ftrace_ops_get_func()
Andre Przywara [Thu, 16 Feb 2017 10:41:20 +0000 (10:41 +0000)]
KVM: arm/arm64: VGIC: Fix command handling while ITS being disabled
The ITS spec says that ITS commands are only processed when the ITS
is enabled (section 8.19.4, Enabled, bit[0]). Our emulation was not taking
this into account.
Fix this by checking the enabled state before handling CWRITER writes.
On the other hand that means that CWRITER could advance while the ITS
is disabled, and enabling it would need those commands to be processed.
Fix this case as well by refactoring actual command processing and
calling this from both the GITS_CWRITER and GITS_CTLR handlers.
Mark Rutland [Mon, 20 Feb 2017 12:30:12 +0000 (12:30 +0000)]
arm64: KVM: Survive unknown traps from guests
Currently we BUG() if we see an ESR_EL2.EC value we don't recognise. As
configurable disables/enables are added to the architecture (controlled
by RES1/RES0 bits respectively), with associated synchronous exceptions,
it may be possible for a guest to trigger exceptions with classes that
we don't recognise.
While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0487A.k_iss10775, page
D7-1937, EC values within the range 0x00 - 0x2c are reserved for future
use with synchronous exceptions, and EC values within the range 0x2d -
0x3f may be used for either synchronous or asynchronous exceptions.
The patch makes KVM handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the host dmesg. We could later improve on this with with a new (opt-in)
exit to the host userspace.
Mark Rutland [Mon, 20 Feb 2017 12:30:11 +0000 (12:30 +0000)]
arm: KVM: Survive unknown traps from guests
Currently we BUG() if we see a HSR.EC value we don't recognise. As
configurable disables/enables are added to the architecture (controlled
by RES1/RES0 bits respectively), with associated synchronous exceptions,
it may be possible for a guest to trigger exceptions with classes that
we don't recognise.
While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0406C.c, all currently
unallocated HSR EC encodings are reserved, and per ARM DDI
0487A.k_iss10775, page G6-4395, EC values within the range 0x00 - 0x2c
are reserved for future use with synchronous exceptions, and EC values
within the range 0x2d - 0x3f may be used for either synchronous or
asynchronous exceptions.
The patch makes KVM handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the host dmesg. We could later improve on this with with a new (opt-in)
exit to the host userspace.
Jintack Lim [Mon, 6 Mar 2017 13:42:37 +0000 (05:42 -0800)]
KVM: arm/arm64: Let vcpu thread modify its own active state
Currently, if a vcpu thread tries to change the active state of an
interrupt which is already on the same vcpu's AP list, it will loop
forever. Since the VGIC mmio handler is called after a vcpu has
already synced back the LR state to the struct vgic_irq, we can just
let it proceed safely.
The VMLANCH/VMRESUME emulation should be stopped since the CPU is going to
reset here. This patch resets the nested_run_pending since the CPU is going
to be reset hence there should be nothing pending.
irqchip/crossbar: Fix incorrect type of register size
The 'size' variable is unsigned according to the dt-bindings.
As this variable is used as integer in other places, create a new variable
that allows to fix the following sparse issue (-Wtypesign):
drivers/irqchip/irq-crossbar.c:279:52: warning: incorrect type in argument 3 (different signedness)
drivers/irqchip/irq-crossbar.c:279:52: expected unsigned int [usertype] *out_value
drivers/irqchip/irq-crossbar.c:279:52: got int *<noident>
irqchip/gicv3-its: Add workaround for QDF2400 ITS erratum 0065
On Qualcomm Datacenter Technologies QDF2400 SoCs, the ITS hardware
implementation uses 16Bytes for Interrupt Translation Entry (ITE),
but reports an incorrect value of 8Bytes in GITS_TYPER.ITTE_size.
It might cause kernel memory corruption depending on the number
of MSI(x) that are configured and the amount of memory that has
been allocated for ITEs in its_create_device().
This patch fixes the potential memory corruption by setting the
correct ITE size to 16Bytes.
Ilya Dryomov [Sun, 12 Feb 2017 16:11:07 +0000 (17:11 +0100)]
libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request. 0 (default)
means no limit.
osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds. With ack vs commit behaviour
gone, abort_request() is really simple.
Ilya Dryomov [Wed, 1 Mar 2017 16:33:27 +0000 (17:33 +0100)]
libceph: don't set weight to IN when OSD is destroyed
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs. Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.
Josh Poimboeuf [Thu, 2 Mar 2017 22:57:23 +0000 (16:57 -0600)]
objtool: Fix another GCC jump table detection issue
Arnd Bergmann reported a (false positive) objtool warning:
drivers/infiniband/sw/rxe/rxe_resp.o: warning: objtool: rxe_responder()+0xfe: sibling call from callable instruction with changed frame pointer
The issue is in find_switch_table(). It tries to find a switch
statement's jump table by walking backwards from an indirect jump
instruction, looking for a relocation to the .rodata section. In this
case it stopped walking prematurely: the first .rodata relocation it
encountered was for a variable (resp_state_name) instead of a jump
table, so it just assumed there wasn't a jump table.
The fix is to ignore any .rodata relocation which refers to an ELF
object symbol. This works because the jump tables are anonymous and
have no symbols associated with them.
Guenter Roeck [Mon, 6 Mar 2017 01:05:57 +0000 (17:05 -0800)]
avr32: Fix build error caused by include file reshuffling
Various avr32 builds fail:
arch/avr32/oprofile/backtrace.c:58: error: dereferencing pointer to incomplete type
arch/avr32/oprofile/backtrace.c:60: error: implicit declaration of function 'user_mode'
James Smart [Sat, 4 Mar 2017 17:30:34 +0000 (09:30 -0800)]
scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
Without apriori understanding of what the define is, the name gives
a very different impression of what it is (a max delay value
for an EQ). Rename the define so it reflects what it is: the number
of EQ IDs that can be set in one instance of the MODIFY_EQ_DELAY
mbx command.
James Smart [Sat, 4 Mar 2017 17:30:33 +0000 (09:30 -0800)]
scsi: lpfc: Rework lpfc Kconfig for NVME options
Reworked Kconfig so that lfpc only requires the scsi stack.
NVME Initiator and NVME Target support can be enabled if
the other NVMe subsystems have been enabled.
James Smart [Sat, 4 Mar 2017 17:30:32 +0000 (09:30 -0800)]
scsi: lpfc: add transport eh_timed_out reference
Christoph's prior patch missed the template for the sli3 adapters,
which is now the "no host reset" template. Add the transport
eh_timed_out handler to the no host reset template
James Smart [Sat, 4 Mar 2017 17:30:31 +0000 (09:30 -0800)]
scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
A previous change unilaterally removed the hba reset entry point
from the sli3 host template. This was done to allow tape devices
being used for back up from being removed. Why was this done ?
When there was non-responding device on the fabric, the error
escalation policy would escalate to the reset handler. When the
reset handler was called, it would reset the adapter, dropping
link, thus logging out and terminating all i/o's - on any target.
If there was a tape device on the same adapter that wasn't in
error, it would kill the tape i/o's, effectively killing the
tape device state. With the reset point removed, the adapter
reset avoided the fabric logout, allowing the other devices to
continue to operate unaffected. A hack - yes. Hint: we really
need a transport I_T nexus reset callback added to the eh process
(in between the SCSI target reset and hba reset points), so a
fc logout could occur to the one bad target only and stop the error
escalation process.
This patch commonizes the approach so it can be used for sli3 and sli4
adapters, but mandates the admin, via module parameter, specifically
identify which adapters the resets are to be removed for. Additionally,
bus_reset, which sends Target Reset TMFs to all targets, is also removed
from the template as it too has the same effect as the adapter reset.
James Smart [Sat, 4 Mar 2017 17:30:30 +0000 (09:30 -0800)]
scsi: lpfc: add NVME exchange aborts
previous code did little more than log a message.
This patch adds abort path support, modeled after the SCSI code paths.
Currently addresses only the initiator path. Target path under
development, but stubbed out.
James Smart [Sat, 4 Mar 2017 17:30:26 +0000 (09:30 -0800)]
scsi: lpfc: Fix RCTL value on NVME LS request and response
NVME LS requests and responses had wrong R_CTL values.
Use the FC4 ELS Request and Response defines (defines badly
named, they are FC4 LS's) instead of the base ELS values.
In the case where sglq is null, the current code just returns without
unlocking the spinlock sql_list_lock. Fix this by breaking out of the
while loop and the exit path will then unlock and return NULL as was
the original intention.
Detected by CoverityScan, CID#1411635 ("Missing unlock")
dma_buf->iocbq is being dereferenced immediately before it is
being null checked, so we have a potential null pointer dereference
bug. Fix this by only dereferencing it only once we have passed
a null check on the pointer.
Detected by CoverityScan, CID#1411652 ("Dereference before null check")
Tomas Jasek [Fri, 3 Mar 2017 12:45:48 +0000 (13:45 +0100)]
scsi: lpfc: replace init_timer by setup_timer
This patch shortens every init_timer in lpfc module followed by function
and data assignment using setup_timer. This is purely cleanup patch, it
does not add new functionality nor remove any existing functionality.
Arnd Bergmann [Thu, 2 Mar 2017 14:58:03 +0000 (15:58 +0100)]
scsi: qedi: fix build error without DEBUG_FS
Without CONFIG_DEBUG_FS, we run into a link error:
drivers/scsi/qedi/qedi_iscsi.o: In function `qedi_ep_poll':
qedi_iscsi.c:(.text.qedi_ep_poll+0x134): undefined reference to `do_not_recover'
drivers/scsi/qedi/qedi_iscsi.o: In function `qedi_ep_disconnect':
qedi_iscsi.c:(.text.qedi_ep_disconnect+0x36c): undefined reference to `do_not_recover'
drivers/scsi/qedi/qedi_iscsi.o: In function `qedi_ep_connect':
qedi_iscsi.c:(.text.qedi_ep_connect+0x350): undefined reference to `do_not_recover'
drivers/scsi/qedi/qedi_fw.o: In function `qedi_tmf_work':
qedi_fw.c:(.text.qedi_tmf_work+0x3b4): undefined reference to `do_not_recover'
This defines the symbol as a constant in this case, as there is no way to
set it to anything other than zero without DEBUG_FS. In addition, I'm renaming
it to qedi_do_not_recover in order to put it into a driver specific namespace,
as "do_not_recover" is a really bad name for a kernel-wide global identifier
when it is used only in one driver.
Always increment/decrement ucount->count under the ucounts_lock. The
increments are there already and moving the decrements there means the
locking logic of the code is simpler. This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.
A bug presumably this one was found by a combination of syzkaller and
KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.
Tahsin Erdogan [Sat, 25 Feb 2017 21:00:19 +0000 (13:00 -0800)]
percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
Update to pcpu_nr_empty_pop_pages in pcpu_alloc() is currently done
without holding pcpu_lock. This can lead to bad updates to the variable.
Add missing lock calls.
Tejun Heo [Mon, 6 Mar 2017 20:33:42 +0000 (15:33 -0500)]
workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Tejun Heo [Mon, 6 Mar 2017 20:26:54 +0000 (15:26 -0500)]
libata: drop WARN from protocol error in ata_sff_qc_issue()
ata_sff_qc_issue() expects upper layers to never issue commands on a
command protocol that it doesn't implement. While the assumption
holds fine with the usual IO path, nothing filters based on the
command protocol in the passthrough path (which was added later),
allowing the warning to be tripped with a passthrough command with the
right (well, wrong) protocol.
Failing with AC_ERR_SYSTEM is the right thing to do anyway. Remove
the unnecessary WARN.
Gwendal Grignou [Fri, 3 Mar 2017 17:00:09 +0000 (09:00 -0800)]
libata: transport: Remove circular dependency at free time
Without this patch, failed probe would not free resources like irq.
ata port tdev object currently hold a reference to the ata port
object. Therefore the ata port object release function will not get
called until the ata_tport_release is called. But that would never
happen, releasing the last reference of ata port dev is done by
scsi_host_release, which is called by ata_host_release when the ata
port object is released.
The ata device objects actually do not need to explicitly hold a
reference to their real counterpart, given the transport objects are
the children of these objects and device_add() is call for each child.
We know the parent will not be deleted until we call the child's
device_del().
pids_can_fork() is special in that the css association is guaranteed
to be stable throughout the function and thus doesn't need RCU
protection around task_css access. When determining the css to charge
the pid, task_css_check() is used to override the RCU sanity check.
While adding a warning message on fork rejection from pids limit, 135b8b37bd91 ("cgroup: Add pids controller event when fork fails
because of pid limit") incorrectly added a task_css access which is
neither RCU protected or explicitly annotated. This triggers the
following suspicious RCU usage warning when RCU debugging is enabled.
There's no reason to dereference task_css again here when the
associated css is already available. Fix it by replacing the
task_cgroup() call with css->cgroup.
Eryu Guan [Thu, 2 Mar 2017 23:02:06 +0000 (15:02 -0800)]
iomap: invalidate page caches should be after iomap_dio_complete() in direct write
After XFS switching to iomap based DIO (commit acdda3aae146 ("xfs:
use iomap_dio_rw")), I started to notice dio29/dio30 tests failures
from LTP run on ppc64 hosts, and they can be reproduced on x86_64
hosts with 512B/1k block size XFS too.
The failure message is like:
bufcmp: offset 0: Expected: 0x62, got 0x0
diotest03 1 TPASS : Read with Direct IO, Write without
diotest03 2 TFAIL : diotest3.c:142: comparsion failed; child=98 offset=1425408
diotest03 3 TFAIL : diotest3.c:194: Write Direct-child 98 failed
Direct write wrote 0x62 but buffer read got zero. This is because,
when doing direct write to a hole or preallocated file, we
invalidate the page caches before converting the extent from
unwritten state to normal state, which is done by
iomap_dio_complete(), thus leave a window for other buffer reader to
cache the unwritten state extent.
Consider this case, with sub-page blocksize XFS, two processes are
direct writing to different blocksize-aligned regions (say 512B) of
the same preallocated file, and reading the region back via buffered
I/O to compare contents.
process A, region [0,512] process B, region [512,1024]
xfs_file_write_iter
xfs_file_aio_dio_write
iomap_dio_rw
iomap_apply
invalidate_inode_pages2_range
xfs_file_write_iter
xfs_file_aio_dio_write
iomap_dio_rw
iomap_apply
invalidate_inode_pages2_range
iomap_dio_complete
xfs_file_read_iter
xfs_file_buffered_aio_read
generic_file_read_iter
do_generic_file_read
<readahead fills pagecache with 0>
iomap_dio_complete
xfs_file_read_iter
<read gets 0 from pagecache>
Process A first invalidates page caches, at this point the
underlying extent is still in unwritten state (iomap_dio_complete
not called yet), and process B finishs direct write and populates
page caches via readahead, which caches zeros in page for region A,
then process A reads zeros from page cache, instead of the actual
data.
Fix it by invalidating page caches after converting unwritten extent
to make sure we read content from disk after extent state changed,
as what we did before switching to iomap based dio.
Also introduce a new 'start' variable to save the original write
offset (iomap_dio_complete() updates iocb->ki_pos), and a 'err'
variable for invalidating caches result, cause we can't reuse 'ret'
anymore.