2) Memory disclosure in appletalk ipddp routing code, from Vlad
Tsyrklevich.
3) r8152 can erroneously split an RX packet into multiple URBs if the
Rx FIFO is not empty when we suspend. Fix this by waiting for the
FIFO to empty before suspending. From Hayes Wang.
4) Two GRO fixes (enter slow path when not enough SKB tail room exists,
disable frag0 optimizations when there are IPV6 extension headers)
from Eric Dumazet and Herbert Xu.
5) A series of mlx5e bug fixes (do source udp port offloading for
tunnels properly, Ip fragment matching fixes, handling firmware
errors properly when installing TC rules, etc.) from Saeed Mahameed,
Or Gerlitz, Roi Dayan, Hadar Hen Zion, Gil Rockah, and Daniel
Jurgens.
6) Two VRF fixes from David Ahern (don't skip multipath selection for
VRF paths, disallow VRF to be configured with table ID 0).
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
net: vrf: do not allow table id 0
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
sctp: Fix spelling mistake: "Atempt" -> "Attempt"
net: ipv4: Fix multipath selection with vrf
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
gro: use min_t() in skb_gro_reset_offset()
net/mlx5: Only cancel recovery work when cleaning up device
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
net/mlx5e: Un-register uplink representor on nic_disable
net/mlx5e: Properly handle FW errors while adding TC rules
net/mlx5e: Fix kbuild warnings for uninitialized parameters
net/mlx5e: Set inline mode requirements for matching on IP fragments
net/mlx5e: Properly get address type of encapsulation IP headers
net/mlx5e: TC ipv4 tunnel encap offload error flow fixes
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
gro: Disable frag0 optimization on IPv6 ext headers
gro: Enter slow-path if there is no tailroom
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
net/af_iucv: don't use paged skbs for TX on HiperSockets
...
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk to adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay or else the action of reading the bit NVME_CSTS_RDY could
somehow corrupt adapter's registers state and it never recovers.
When this quirk was added, we checked ctrl->tagset in order to avoid
quirking in probe time, supposing we would never require such delay
during probe. Well, it was too optimistic; we in fact need this quirk
at probe time in some cases, like after a kexec.
In some experiments, after abnormal shutdown of machine (aka power cord
unplug), we booted into our bootloader in Power, which is a Linux kernel,
and kexec'ed into another distro. If this kexec is too quick, we end up
reaching the probe of NVMe adapter in that distro when adapter is in
bad state (not fully initialized on our bootloader). What happens next
is that nvme_wait_ready() is unable to complete, except if the quirk is
enabled.
So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even on probe time.
Now that we don't abuse the cmd field in struct request for nvme command
passthrough this function needs to be converted to the proper accessor
as well.
Fixes: d49187e97e ("nvme: introduce struct nvme_request") Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]>
Mathias Nyman [Wed, 11 Jan 2017 15:10:34 +0000 (17:10 +0200)]
xhci: fix deadlock at host remove by running watchdog correctly
If a URB is killed while the host is removed we can end up in a situation
where the hub thread takes the roothub device lock, and waits for
the URB to be given back by xhci-hcd, blocking the host remove code.
xhci-hcd tries to stop the endpoint and give back the urb, but can't
as the host is removed from PCI bus at the same time, preventing the normal
way of giving back urb.
Instead we need to rely on the stop command timeout function to give back
the urb. This xhci_stop_endpoint_command_watchdog() timeout function
used a XHCI_STATE_DYING flag to indicate if the timeout function is already
running, but later this flag has been taking into use in other places to
mark that xhci is dying.
Remove checks for XHCI_STATE_DYING in xhci_urb_dequeue. We are still
checking that reading from pci state does not return 0xffffffff or that
host is not halted before trying to stop the endpoint.
This whole area of stopping endpoints, giving back URBs, and the wathdog
timeout need rework, this fix focuses on solving a specific deadlock
issue that we can then send to stable before any major rework.
David Ahern [Tue, 10 Jan 2017 23:22:25 +0000 (15:22 -0800)]
net: vrf: do not allow table id 0
Frank reported that vrf devices can be created with a table id of 0.
This breaks many of the run time table id checks and should not be
allowed. Detect this condition at create time and fail with EINVAL.
Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Reported-by: Frank Kellermann <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Russell King [Tue, 10 Jan 2017 23:13:45 +0000 (23:13 +0000)]
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
When an Marvell 88E1512 PHY is connected to a nic in SGMII mode, the
fiber page is used for the SGMII host-side connection. The PHY driver
notices that SUPPORTED_FIBRE is set, so it tries reading the fiber page
for the link status, and ends up reading the MAC-side status instead of
the outgoing (copper) link. This leads to incorrect results reported
via ethtool.
If the PHY is connected via SGMII to the host, ignore the fiber page.
However, continue to allow the existing power management code to
suspend and resume the fiber page.
Fixes: 6cfb3bcc0641 ("Marvell phy: check link status in case of fiber link.") Signed-off-by: Russell King <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David Ahern [Tue, 10 Jan 2017 22:37:35 +0000 (14:37 -0800)]
net: ipv4: Fix multipath selection with vrf
fib_select_path does not call fib_select_multipath if oif is set in the
flow struct. For VRF use cases oif is always set, so multipath route
selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
check similar to what is done in fib_table_lookup.
Add saddr and proto to the flow struct for the fib lookup done by the
VRF driver to better match hash computation for a flow.
Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX") Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Arnd Bergmann [Tue, 10 Jan 2017 12:08:06 +0000 (13:08 +0100)]
cgroup: move CONFIG_SOCK_CGROUP_DATA to init/Kconfig
We now 'select SOCK_CGROUP_DATA' but Kconfig complains that this is
not right when CONFIG_NET is disabled and there is no socket interface:
warning: (CGROUP_BPF) selects SOCK_CGROUP_DATA which has unmet direct dependencies (NET)
I don't know what the correct solution for this is, but simply removing
the dependency on NET from SOCK_CGROUP_DATA by moving it out of the
'if NET' section avoids the warning and does not produce other build
errors.
Fixes: 483c4933ea09 ("cgroup: Fix CGROUP_BPF config") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Eric Dumazet [Wed, 11 Jan 2017 03:52:43 +0000 (19:52 -0800)]
gro: use min_t() in skb_gro_reset_offset()
On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
so we shall use min_t() instead of min() to avoid a compiler error.
Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom") Reported-by: kernel test robot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Prarit Bhargava [Thu, 5 Jan 2017 15:09:25 +0000 (10:09 -0500)]
perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
hswep_uncore_cpu_init() uses a hardcoded physical package id 0 for the boot
cpu. This works as long as the boot CPU is actually on the physical package
0, which is normaly the case after power on / reboot.
But it fails with a NULL pointer dereference when a kdump kernel is started
on a secondary socket which has a different physical package id because the
locigal package translation for physical package 0 does not exist.
Use the logical package id of the boot cpu instead of hard coded 0.
Huang Shijie [Wed, 11 Jan 2017 06:02:00 +0000 (14:02 +0800)]
arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags
In current code, the @changed always returns the last one's status for
the huge page with the contiguous bit set. This is really not what we
want. Even one of the PTEs is changed, we should tell it to the caller.
So it must be dereferenced to be used in calculation of pci_base:
*pci_base = (dma_addr_t)*vme_base + pci_offset;
This bug was caught thanks to the following gcc warning:
drivers/vme/bridges/vme_ca91cx42.c: In function ‘ca91cx42_slave_get’:
drivers/vme/bridges/vme_ca91cx42.c:467:14: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
*pci_base = (dma_addr_t)vme_base + pci_offset;
nohz: Fix collision between tick and other hrtimers
When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.
In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.
Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.
In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:
Task in a full dynticks CPU
----------------------------
* hrtimer A is queued 2 seconds ahead
* the tick is stopped, scheduled 1 second ahead
* tick fires 1 second later
* on tick exit, nohz schedules the tick 1 second ahead but sees
the clockevent device is already programmed to that deadline,
fooled by hrtimer A, the tick isn't rescheduled.
* hrtimer A is cancelled before its deadline
* tick never fires again until an interrupt happens...
In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.
On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.
Randy Dunlap [Mon, 26 Dec 2016 17:58:34 +0000 (09:58 -0800)]
auxdisplay: fix new ht16k33 build errors
Fix build errors caused by selecting incorrect kconfig symbols.
drivers/built-in.o:(.data+0x19cec): undefined reference to `sys_fillrect'
drivers/built-in.o:(.data+0x19cf0): undefined reference to `sys_copyarea'
drivers/built-in.o:(.data+0x19cf4): undefined reference to `sys_imageblit'
Akinobu Mita [Thu, 5 Jan 2017 17:14:16 +0000 (02:14 +0900)]
sysrq: attach sysrq handler correctly for 32-bit kernel
The sysrq input handler should be attached to the input device which has
a left alt key.
On 32-bit kernels, some input devices which has a left alt key cannot
attach sysrq handler. Because the keybit bitmap in struct input_device_id
for sysrq is not correctly initialized. KEY_LEFTALT is 56 which is
greater than BITS_PER_LONG on 32-bit kernels.
I found this problem when using a matrix keypad device which defines
a KEY_LEFTALT (56) but doesn't have a KEY_O (24 == 56%32).
Colin Ian King [Fri, 2 Dec 2016 16:23:55 +0000 (16:23 +0000)]
ppdev: don't print a free'd string
A previous fix of a memory leak now prints the string 'name'
that was previously free'd. Fix this by free'ing the string
at the end of the function and adding an error exit path for
the error conditions.
Pan Bian [Sat, 3 Dec 2016 08:56:49 +0000 (16:56 +0800)]
extcon: return error code on failure
Function get_zeroed_page() returns a NULL pointer if there is no enough
memory. In function extcon_sync(), it returns 0 if the call to
get_zeroed_page() fails. The return value 0 indicates success in the
context, which is incosistent with the execution status. This patch
fixes the bug by returning -ENOMEM.
Herbert Xu [Sun, 11 Dec 2016 02:05:49 +0000 (10:05 +0800)]
Revert "tty: serial: 8250: add CON_CONSDEV to flags"
This commit needs to be reverted because it prevents people from
using the serial console as a secondary console with input being
directed to tty0.
IOW, if you boot with console=ttyS0 console=tty0 then all kernels
prior to this commit will produce output on both ttyS0 and tty0
but input will only be taken from tty0. With this patch the serial
console will always be the primary console instead of tty0,
potentially preventing people from getting into their machines in
emergency situations.
Clearing FIFOs in RS485 emulation mode causes subsequent transmits to break
When in RS485 emulation mode, __do_stop_tx_rs485() calls
serial8250_clear_fifos(). This not only clears the FIFOs, but also sets
all bits in their control register (UART_FCR) to 0.
One of the effects of this is the disabling of the FIFOs, which turns
them into single-byte holding registers. The rest of the driver doesn't
know this, which results in the lions share of characters passed into a
write call to be dropped.
(I can supply logic analyzer screenshots if necessary)
This fix replaces the serial8250_clear_fifos() call to
serial8250_clear_and_reinit_fifos() - this prevents the "dropped
characters" issue from manifesting again while retaining the requirement
of clearing the RX FIFO after transmission if the SER_RS485_RX_DURING_TX
flag is disabled.
8250_pci: Fix potential use-after-free in error path
Commit f209fa03fc9d ("serial: 8250_pci: Detach low-level driver during
PCI error recovery") introduces a potential use-after-free in case the
pciserial_init_ports call in serial8250_io_resume fails, which may
happen if a memory allocation fails or if the .init quirk failed for
whatever reason). If this happen, further pci_get_drvdata will return a
pointer to freed memory.
This patch reworks the PCI recovery resume hook to restore the old priv
structure in this case, which should be ok, since the ports were already
detached. Such error during recovery causes us to give up on the
recovery.
Fixes: f209fa03fc9d ("serial: 8250_pci: Detach low-level driver during
PCI error recovery") Reported-by: Michal Suchanek <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Guilherme G. Piccoli <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Richard Genoud [Tue, 6 Dec 2016 12:05:33 +0000 (13:05 +0100)]
tty/serial: atmel: RS485 half duplex w/DMA: enable RX after TX is done
When using RS485 in half duplex, RX should be enabled when TX is
finished, and stopped when TX starts.
Before commit 0058f0871efe7b01c6 ("tty/serial: atmel: fix RS485 half
duplex with DMA"), RX was not disabled in atmel_start_tx() if the DMA
was used. So, collisions could happened.
But disabling RX in atmel_start_tx() uncovered another bug:
RX was enabled again in the wrong place (in atmel_tx_dma) instead of
being enabled when TX is finished (in atmel_complete_tx_dma), so the
transmission simply stopped.
This bug was not triggered before commit 0058f0871efe7b01c6
("tty/serial: atmel: fix RS485 half duplex with DMA") because RX was
never disabled before.
Moving atmel_start_rx() in atmel_complete_tx_dma() corrects the problem.
Richard Genoud [Tue, 13 Dec 2016 16:27:56 +0000 (17:27 +0100)]
tty/serial: atmel_serial: BUG: stop DMA from transmitting in stop_tx
If we don't disable the transmitter in atmel_stop_tx, the DMA buffer
continues to send data until it is emptied.
This cause problems with the flow control (CTS is asserted and data are
still sent).
So, disabling the transmitter in atmel_stop_tx is a sane thing to do.
Tested on at91sam9g35-cm(DMA)
Tested for regressions on sama5d2-xplained(Fifo) and at91sam9g20ek(PDC)
Robin Murphy [Thu, 5 Jan 2017 17:15:01 +0000 (17:15 +0000)]
drivers: char: mem: Fix thinkos in kmem address checks
When borrowing the pfn_valid() check from mmap_kmem(), somebody managed
to get physical and virtual addresses spectacularly muddled up, such
that we've ended up with checks for one being the other. Whilst this
does indeed prevent out-of-bounds accesses crashing, on most systems
it also prevents the more desirable use-case of working at all ever.
Check the *virtual* offset correctly for what it is. Furthermore, do
so in the right place - a read or write may span multiple pages, so a
single up-front check is insufficient. High memory accesses already
have a similar validity check just before the copy_to_user() call, so
just make the low memory path fully consistent with that.
mei: bus: enable OS version only for SPT and newer
Sending OS version for support of TPM2_ChangeEPS() is required only
for SPT FW (HMB version 2.0) and newer.
On older platforms the command should be just ignored by the firmware
but some older platforms misbehave so it's safer to send the command
only if required.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=192051 Fixes: 7279b238bade (mei: send OS type to the FW) Signed-off-by: Alexander Usyskin <[email protected]> Signed-off-by: Tomas Winkler <[email protected]> Tested-by: Jan Niehusmann <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
David S. Miller [Wed, 11 Jan 2017 02:34:03 +0000 (21:34 -0500)]
Merge branch 'mlx5-fixes'
Saeed Mahameed says:
====================
Mellanox mlx5 fixes and cleanups 2017-01-10
This series includes some mlx5e general cleanups from Daniel, Gil, Hadar
and myself.
Also it includes some critical mlx5e TC offloads fixes from Or Gerlitz.
For -stable:
- net/mlx5e: Remove WARN_ONCE from adaptive moderation code
Although this fix doesn't affect any functionality, I thought it is
better to clean this -WARN_ONCE- up for -stable in case someone hits
such corner case.
Please apply and let me know if there's any problem.
====================
Daniel Jurgens [Tue, 10 Jan 2017 20:33:39 +0000 (22:33 +0200)]
net/mlx5: Only cancel recovery work when cleaning up device
Do not attempt to drain the health workqueue when unloading the device in
the recovery flow, this can cause a deadlock when the recovery work
tries to cancel itself with sync.
Because the work is no longer unconditionally canceled when unloading, it
must be explicitly canceled in the AER flow.
Gil Rockah [Tue, 10 Jan 2017 20:33:38 +0000 (22:33 +0200)]
net/mlx5e: Remove WARN_ONCE from adaptive moderation code
When trying to do interface down or changing interface configuration
under heavy traffic, some of the adaptive moderation corner cases can
occur and leave a WARN_ONCE call trace in the kernel log.
Those WARN_ONCE are meant for debug only, and should have been inserted
only under debug. We avoid such call traces by removing those WARN_ONCE.
Fixes: cb3c7fd4f839 ("net/mlx5e: Support adaptive RX coalescing") Signed-off-by: Gil Rockah <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Saeed Mahameed [Tue, 10 Jan 2017 20:33:37 +0000 (22:33 +0200)]
net/mlx5e: Un-register uplink representor on nic_disable
The code before this patch registered uplink e-Switch representor
on nic_enable and unregistered on nic_cleanup, the right place
for this unregister is in nic_disable.
Fixes: 127ea380acc9 ("net/mlx5: Add Representors registration API") Signed-off-by: Saeed Mahameed <[email protected]> Reviewed-by: Mohamad Haj Yahia <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Or Gerlitz [Tue, 10 Jan 2017 20:33:36 +0000 (22:33 +0200)]
net/mlx5e: Properly handle FW errors while adding TC rules
When the firmware returns an error (common example is an attempt to
add twice the same rule which is refused by the some FWs), we are not
properly derefing/cleaning few resources allocated on the way.
Examples are vport vlan deref under eswitch vlan offloads, and encap
entry/neighbour deref under eswitch encapsulation offloads, fix that.
Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads') Fixes: 8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Hadar Hen Zion [Tue, 10 Jan 2017 20:33:35 +0000 (22:33 +0200)]
net/mlx5e: Fix kbuild warnings for uninitialized parameters
kbuild warn about parameters that may be used uninitialized, fix it.
Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads') Signed-off-by: Hadar Hen Zion <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Or Gerlitz [Tue, 10 Jan 2017 20:33:34 +0000 (22:33 +0200)]
net/mlx5e: Set inline mode requirements for matching on IP fragments
For e-switch level matching on packets being an IP fragment, we
need to make sure the source vport inline mode is L3, fix that.
Fixes: 3f7d0eb42d59 ('net/mlx5e: Offload TC matching on packets being IP fragments') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Or Gerlitz [Tue, 10 Jan 2017 20:33:33 +0000 (22:33 +0200)]
net/mlx5e: Properly get address type of encapsulation IP headers
As done elsewhere in our TC/flower offload code, the address type of
the encapsulation IP headers should be realized accroding to the
addr_type field of the encapsulation control dissector key, do that.
Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Hadar Hen Zion <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
When the route lookup fails we should return the actual error.
When the neigh isn't valid, we should return -EOPNOTSUPP as done
in similar cases along the code.
When the offload can't take place as of invalid neigh etc, we
must release the neigh.
Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Hadar Hen Zion <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Or Gerlitz [Tue, 10 Jan 2017 20:33:31 +0000 (22:33 +0200)]
net/mlx5e: Warn when rejecting offload attempts of IP tunnels
We silently reject offloading of IPv6 tunnels, non vxlan tunnels,
vxlan tunnels where the dst port to match is not provided, etc.
Be a bit more verbose and print a warning so the user better
realizes what went wrong here and can fix it.
Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads') Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Hadar Hen Zion <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Or Gerlitz [Tue, 10 Jan 2017 20:33:30 +0000 (22:33 +0200)]
net/mlx5e: Properly handle offloading of source udp port for IP tunnels
We can offload the matching on source udp port of ip tunnels for
decapsulation. We can not offload setting source udp port for tunnels
as part of encapsulation. Fix both the code that deals with matching
offload (decap) and the code that deal with encap offload to align with
that.
Fixes: a54e20b4fcae ('net/mlx5e: Add basic TC tunnel set action for SRIOV offloads') Fixes: bbd00f7e2349 ('net/mlx5e: Add TC tunnel release action for SRIOV offloads') Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Hadar Hen Zion <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Mike Frysinger [Wed, 11 Jan 2017 00:58:30 +0000 (16:58 -0800)]
timerfd: export defines to userspace
Since userspace is expected to call timerfd syscalls directly with these
flags/ioctls, make sure we export them so they don't have to duplicate
the values themselves.
Mike Kravetz [Wed, 11 Jan 2017 00:58:27 +0000 (16:58 -0800)]
mm/hugetlb.c: fix reservation race when freeing surplus pages
return_unused_surplus_pages() decrements the global reservation count,
and frees any unused surplus pages that were backing the reservation.
Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
return_unused_surplus_pages()") added a call to cond_resched_lock in the
loop freeing the pages.
As a result, the hugetlb_lock could be dropped, and someone else could
use the pages that will be freed in subsequent iterations of the loop.
This could result in inconsistent global hugetlb page state, application
api failures (such as mmap) failures or application crashes.
When dropping the lock in return_unused_surplus_pages, make sure that
the global reservation count (resv_huge_pages) remains sufficiently
large to prevent someone else from claiming pages about to be freed.
This patch fixes a bug in the freelist randomization code. When a high
random number is used, the freelist will contain duplicate entries. It
will result in different allocations sharing the same chunk.
It will result in odd behaviours and crashes. It should be uncommon but
it depends on the machines. We saw it happening more often on some
machines (every few hours of running tests).
Minchan Kim [Wed, 11 Jan 2017 00:58:21 +0000 (16:58 -0800)]
zram: support BDI_CAP_STABLE_WRITES
zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.
In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below
https://bugzilla.suse.com/show_bug.cgi?id=997574
This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".
Minchan Kim [Wed, 11 Jan 2017 00:58:18 +0000 (16:58 -0800)]
zram: revalidate disk under init_lock
Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit 08eee69fcf6b ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.
This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.
With investigation, it reveals currently stable page doesn't support
anonymous page. IOW, reuse_swap_page can reuse the page without waiting
writeback completion so it can overwrite page zram is compressing.
Unfortunately, zram has used per-cpu stream feature from v4.7.
It aims for increasing cache hit ratio of scratch buffer for
compressing. Downside of that approach is that zram should ask
memory space for compressed page in per-cpu context which requires
stricted gfp flag which could be failed. If so, it retries to
allocate memory space out of per-cpu context so it could get memory
this time and compress the data again, copies it to the memory space.
In this scenario, zram assumes the data should never be changed
but it is not true unless stable page supports. So, If the data is
changed under us, zram can make buffer overrun because second
compression size could be bigger than one we got in previous trial
and blindly, copy bigger size object to smaller buffer which is
buffer overrun. The overrun breaks zsmalloc free object chaining
so system goes crash like above.
I think below is same problem.
https://bugzilla.suse.com/show_bug.cgi?id=997574
Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
writeback in there so the approach in this patch is simply return false if
we found it needs stable page. Although it increases memory footprint
temporarily, it happens rarely and it should be reclaimed easily althoug
it happened. Also, It would be better than waiting of IO completion,
which is critial path for application latency.
Alexander Duyck [Wed, 11 Jan 2017 00:58:12 +0000 (16:58 -0800)]
mm: add documentation for page fragment APIs
This is a first pass at trying to add documentation for the page_frag
APIs. They may still change over time but for now I thought I would try
to get these documented so that as more network drivers and stack calls
make use of them we have one central spot to document how they are meant
to be used.
Alexander Duyck [Wed, 11 Jan 2017 00:58:09 +0000 (16:58 -0800)]
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
This patch does two things.
First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or
refilling the cache, not the frags themselves.
Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.
Alexander Duyck [Wed, 11 Jan 2017 00:58:06 +0000 (16:58 -0800)]
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
Patch series "Page fragment updates", v4.
This patch series takes care of a few cleanups for the page fragments
API.
First we do some renames so that things are much more consistent. First
we move the page_frag_ portion of the name to the front of the functions
names. Secondly we split out the cache specific functions from the
other page fragment functions by adding the word "cache" to the name.
Finally I added a bit of documentation that will hopefully help to
explain some of this. I plan to revisit this later as we get things
more ironed out in the near future with the changes planned for the DMA
setup to support eXpress Data Path.
This patch (of 3):
This patch renames the page frag functions to be more consistent with
other APIs. Specifically we place the name page_frag first in the name
and then have either an alloc or free call name that we append as the
suffix. This makes it a bit clearer in terms of naming.
In addition we drop the leading double underscores since we are
technically no longer a backing interface and instead the front end that
is called from the networking APIs.
the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.
The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts 0 and return
false. This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.
Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.
We are losing empty LRU but non-zero lru size detection introduced by ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.
Ard Biesheuvel [Wed, 11 Jan 2017 00:58:00 +0000 (16:58 -0800)]
mm: don't dereference struct page fields of invalid pages
The VM_BUG_ON() check in move_freepages() checks whether the node id of
a page matches the node id of its zone. However, it does this before
having checked whether the struct page pointer refers to a valid struct
page to begin with. This is guaranteed in most cases, but may not be
the case if CONFIG_HOLES_IN_ZONE=y.
So reorder the VM_BUG_ON() with the pfn_valid_within() check.
Jamie Iles [Wed, 11 Jan 2017 00:57:54 +0000 (16:57 -0800)]
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
The problem is currently page fault handler doesn't supports dirty bit
emulation of pmd for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.
How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pmd dirty and arch code for
setting pmd will clear PTE_RDONLY for application to proceed.
IOW, if VM doesn't mark the pmd dirty, application hangs forever by
repeated fault(i.e., store op but the pmd is PTE_RDONLY).
This patch enables pmd dirty-bit emulation for those architectures.
[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
The slow (i.e.: failure to acquire) syscall exit from semtimedop()
incorrectly assumed that the the same lock is acquired as it was at the
initial syscall entry.
This is wrong:
- thread A: single semop semop(), sleeps
- thread B: multi semop semop(), sleeps
- thread A: woken up by signal/timeout
With this sequence, the initial sem_lock() call locks the per-semaphore
spinlock, and it is unlocked with sem_unlock(). The call at the syscall
return locks the global spinlock. Because locknum is not updated, the
following sem_unlock() call unlocks the per-semaphore spinlock, which is
actually not locked.
The fix is trivial: Use the return value from sem_lock.
Sudip Mukherjee [Wed, 11 Jan 2017 00:57:45 +0000 (16:57 -0800)]
lib/Kconfig.debug: fix frv build failure
The build of frv allmodconfig was failing with the errors like:
/tmp/cc0JSPc3.s: Assembler messages:
/tmp/cc0JSPc3.s:1839: Error: symbol `.LSLT0' is already defined
/tmp/cc0JSPc3.s:1842: Error: symbol `.LASLTP0' is already defined
/tmp/cc0JSPc3.s:1969: Error: symbol `.LELTP0' is already defined
/tmp/cc0JSPc3.s:1970: Error: symbol `.LELT0' is already defined
Commit 866ced950bcd ("kbuild: Support split debug info v4") introduced
splitting the debug info and keeping that in a separate file. Somehow,
the frv-linux gcc did not like that and I am guessing that instead of
splitting it started copying. The first report about this is at:
I will try and see if this can work with frv and if still fails I will
open a bug report with gcc. But meanwhile this is the easiest option to
solve build failure of frv.
Michal Hocko [Wed, 11 Jan 2017 00:57:42 +0000 (16:57 -0800)]
mm: get rid of __GFP_OTHER_NODE
The flag was introduced by commit 78afd5612deb ("mm: add
__GFP_OTHER_NODE flag") to allow proper accounting of remote node
allocations done by kernel daemons on behalf of a process - e.g.
khugepaged.
After "mm: fix remote numa hits statistics" we do not need and actually
use the flag so we can safely remove it because all allocations which
are satisfied from their "home" node are accounted properly.
Michal Hocko [Wed, 11 Jan 2017 00:57:39 +0000 (16:57 -0800)]
mm: fix remote numa hits statistics
Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
branches in zone_statistics") has an unintentional side effect that
remote node allocation requests are accounted as NUMA_MISS rathat than
NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.
There are many of these potentially because the flag is used very rarely
while we have many users of __alloc_pages_node.
Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
follow up patch) and treat all allocations that were satisfied from the
preferred zone's node as NUMA_HITS because this is the same node we
requested the allocation from in most cases. If this is not the local
node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
One downsize would be that an allocation request for a node which is
outside of the mempolicy nodemask would be reported as a hit which is a
bit weird but that was the case before b9f00e147f27 already.
The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization. This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Michal Hocko [Wed, 11 Jan 2017 00:57:30 +0000 (16:57 -0800)]
bpf: do not use KMALLOC_SHIFT_MAX
Commit 01b3f52157ff ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.
While this is not incorrect it is not very clean because we already have
KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
KMALLOC_MAX_SIZE instead.
The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".
The issue is caused by a lack of size check for the request size in
ep_write_iter which should be fixed. It, however, points to another
problem, that SLUB defines KMALLOC_MAX_SIZE too large because the its
KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT) which means that the
resulting page allocator request might be MAX_ORDER which is too large
(see __alloc_pages_slowpath).
The same applies to the SLOB allocator which allows even larger sizes.
Make sure that they are capped properly and never request more than
MAX_ORDER order.
Ross Zwisler [Wed, 11 Jan 2017 00:57:24 +0000 (16:57 -0800)]
dax: wrprotect pmd_t in dax_mapping_entry_mkclean
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
Ross Zwisler [Wed, 11 Jan 2017 00:57:21 +0000 (16:57 -0800)]
mm: add follow_pte_pmd()
Patch series "Write protect DAX PMDs in *sync path".
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss, as detailed in patch 2.
This series is based on Dan's "libnvdimm-pending" branch, which is the
current home for Jan's "dax: Page invalidation fixes" series. You can
find a working tree here:
mm/thp/pagecache/collapse: free the pte page table on collapse for thp page cache.
With THP page cache, when trying to build a huge page from regular pte
pages, we just clear the pmd entry. We will take another fault and at
that point we will find the huge page in the radix tree, thereby using
the huge page to complete the page fault
The second fault path will allocate the needed pgtable_t page for archs
like ppc64. So no need to deposit the same in collapse path.
Depositing them in the collapse path resulting in a pgtable_t memory
leak also giving errors like
The crux of the problem is that once we insert a 4k zero page, all
locking from then on is done in terms of that 4k zero page and any
additional threads sleeping on the empty DAX entry will never be woken.
Fix this by waking all sleepers when we replace the DAX radix tree entry
with a 4k zero page. This will allow all sleeping threads to
successfully transition from locking based on the DAX empty entry to
locking on the 4k zero page.
With the test case reported by Xiong this happens very regularly in my
test setup, with some runs resulting in 9+ threads in this deadlocked
state. With this fix I've been able to run that same test dozens of
times in a loop without issue.
I have noticed that two different descriptions for B: entries in
MAINTAINERS were merged: commit 686564434e88 ("MAINTAINERS: Add bug
tracking system location entry type") and 2de2bd95f456 ("MAINTAINERS:
add "B:" for URI where to file bugs").
This patch keeps the description from 2de2bd95f456. There has been a
discussion [1] about whether this more detailed description is useful
and what it exactly implies. I find it more useful and general, and the
author of 686564434e88 agreed in the end that either is fine.
Herbert Xu [Tue, 10 Jan 2017 20:24:15 +0000 (12:24 -0800)]
gro: Disable frag0 optimization on IPv6 ext headers
The GRO fast path caches the frag0 address. This address becomes
invalid if frag0 is modified by pskb_may_pull or its variants.
So whenever that happens we must disable the frag0 optimization.
This is usually done through the combination of gro_header_hard
and gro_header_slow, however, the IPv6 extension header path did
the pulling directly and would continue to use the GRO fast path
incorrectly.
This patch fixes it by disabling the fast path when we enter the
IPv6 extension header path.
Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Reported-by: Slava Shwartsman <[email protected]> Signed-off-by: Herbert Xu <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Herbert Xu [Tue, 10 Jan 2017 20:24:01 +0000 (12:24 -0800)]
gro: Enter slow-path if there is no tailroom
The GRO path has a fast-path where we avoid calling pskb_may_pull
and pskb_expand by directly accessing frag0. However, this should
only be done if we have enough tailroom in the skb as otherwise
we'll have to expand it later anyway.
This patch adds the check by capping frag0_len with the skb tailroom.
Martin KaFai Lau [Tue, 10 Jan 2017 17:41:49 +0000 (09:41 -0800)]
mlx4: Return EOPNOTSUPP instead of ENOTSUPP
In commit b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs"),
it changed EOPNOTSUPP to ENOTSUPP by mistake. This patch fixes it.
Fixes: b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs") Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Julian Wiedmann [Tue, 10 Jan 2017 16:10:34 +0000 (17:10 +0100)]
net/af_iucv: don't use paged skbs for TX on HiperSockets
With commit e53743994e21
("af_iucv: use paged SKBs for big outbound messages"),
we transmit paged skbs for both of AF_IUCV's transport modes
(IUCV or HiperSockets).
The qeth driver for Layer 3 HiperSockets currently doesn't
support NETIF_F_SG, so these skbs would just be linearized again
by the stack.
Avoid that overhead by using paged skbs only for IUCV transport.
cc stable, since this also circumvents a significant skb leak when
sending large messages (where the skb then needs to be linearized).
Anna, Suman [Tue, 10 Jan 2017 03:48:56 +0000 (21:48 -0600)]
net: add the AF_QIPCRTR entries to family name tables
Commit bdabad3e363d ("net: Add Qualcomm IPC router") introduced a
new address family. Update the family name tables accordingly so
that the lockdep initialization can use the proper names for this
family.
It is perfectly possible to have non zero indexed switches being present
in a DSA switch tree, in such a case, we will be deferencing a NULL
pointer while dsa_cpu_port_ethtool_{setup,restore}. Be more defensive
and ensure that dst->ds[0] is valid before doing anything with it.
Fixes: 0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree") Signed-off-by: Florian Fainelli <[email protected]> Reviewed-by: Vivien Didelot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Tue, 10 Jan 2017 21:51:54 +0000 (13:51 -0800)]
Merge tag 'linux-kselftest-4.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"This update consists of fixes to use shell instead of bash to run
tests in embedded devices where the only shell available is the
busybox ash.
Also included is a typo fix to a test result message"
* tag 'linux-kselftest-4.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: x86/pkeys: fix spelling mistake: "itertation" -> "iteration"
selftests: do not require bash to run netsocktests testcase
selftests: do not require bash to run bpf tests
selftests: do not require bash for the generated test
Omar Sandoval [Mon, 9 Jan 2017 19:44:12 +0000 (11:44 -0800)]
virtio_blk: fix panic in initialization error path
If blk_mq_init_queue() returns an error, it gets assigned to
vblk->disk->queue. Then, when we call put_disk(), we end up calling
blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
only assigning to vblk->disk->queue on success.
virtio_blk: avoid DMA to stack for the sense buffer
Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel. Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.
Chandan Rajendra [Tue, 10 Jan 2017 20:29:54 +0000 (13:29 -0700)]
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
The code currently uses sdio->blkbits to compute the number of blocks to
be cleaned. However sdio->blkbits is derived from the logical block size
of the underlying block device (Refer to the definition of
do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
fail when executed on an ext4 filesystem with 64k as the block size and
when using a virtio based disk (having 512 byte as the logical block
size) inside a kvm guest.
This commit fixes the bug by using inode->i_blkbits to compute the
number of blocks to be cleaned.
Signed-off-by: Chandan Rajendra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
to avoid issues with block size changes with IO in flight.
Eric Dumazet [Mon, 9 Jan 2017 19:18:01 +0000 (11:18 -0800)]
net: skb_flow_get_be16() can be static
Removes following sparse complain :
net/core/flow_dissector.c:70:8: warning: symbol 'skb_flow_get_be16'
was not declared. Should it be static?
Fixes: 972d3876faa8 ("flow dissector: ICMP support") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Bryant G. Ly [Mon, 9 Jan 2017 16:21:20 +0000 (10:21 -0600)]
ibmvscsis: Fix srp_transfer_data fail return code
If srp_transfer_data fails within ibmvscsis_write_pending, then
the most likely scenario is that the client timed out the op and
removed the TCE mapping. Thus it will loop forever retrying the
op that is pretty much guaranteed to fail forever. A better return
code would be EIO instead of EAGAIN.
hayeswang [Tue, 10 Jan 2017 09:04:07 +0000 (17:04 +0800)]
r8152: fix rx issue for runtime suspend
Pause the rx and make sure the rx fifo is empty when the autosuspend
occurs.
If the rx data comes when the driver is canceling the rx urb, the host
controller would stop getting the data from the device and continue
it after next rx urb is submitted. That is, one continuing data is
split into two different urb buffers. That let the driver take the
data as a rx descriptor, and unexpected behavior happens.
SPC4r37 6.4.1 EXTENDED COPY(LID4) states (also applying to LID1 reqs):
A parameter list length of zero specifies that the copy manager shall
not transfer any data or alter any internal state, and this shall not
be considered an error.
This behaviour can be tested using the libiscsi ExtendedCopy.ParamHdr
test.
Check for XCOPY header, CSCD descriptor and segment descriptor list
truncation, and respond accordingly.
SPC4r37 6.4.1 EXTENDED COPY(LID4) states (also applying to LID1 reqs):
If the parameter list length causes truncation of the parameter list,
then the copy manager shall transfer no data and shall terminate the
EXTENDED COPY command with CHECK CONDITION status, with the sense key
set to ILLEGAL REQUEST, and the additional sense code set to PARAMETER
LIST LENGTH ERROR.
This behaviour can be tested using the libiscsi ExtendedCopy.ParamHdr
test.
The XCOPY specification in SPC4r37 states that the XCOPY source and
destination device(s) should be derived from the copy source and copy
destination (CSCD) descriptor IDs in the XCOPY segment descriptor.
The CSCD IDs are generally (for block -> block copies), indexes into
the corresponding CSCD descriptor list, e.g.
=================================
EXTENDED COPY Header
=================================
CSCD Descriptor List
- entry 0
+ LU ID <--------------<------------------\
- entry 1 |
+ LU ID <______________<_____________ |
================================= | |
Segment Descriptor List | |
- segment 0 | |
+ src CSCD ID = 0 --------->---------+----/
+ dest CSCD ID = 1 ___________>______|
+ len
+ src lba
+ dest lba
=================================
Currently LIO completely ignores the src and dest CSCD IDs in the
Segment Descriptor List, and instead assumes that the first entry in the
CSCD list corresponds to the source, and the second to the destination.
This commit removes this assumption, by ensuring that the Segment
Descriptor List is parsed prior to processing the CSCD Descriptor List.
CSCD Descriptor List processing is modified to compare the current list
index with the previously obtained src and dest CSCD IDs.
Additionally, XCOPY requests where the src and dest CSCD IDs refer to
the CSCD Descriptor List entry can now be successfully processed.
Ensure that the segment descriptor CSCD descriptor ID values correspond
to CSCD descriptor entries located in the XCOPY command parameter list.
SPC4r37 6.4.6.1 Table 150 specifies this range as 0000h to 07FFh, where
the CSCD descriptor location in the parameter list can be located via:
16 + (id * 32)
Signed-off-by: David Disseldorp <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
[ bvanassche: inserted "; " in the format string of an error message
and also moved a "||" operator from the start of a line to the end
of the previous line ] Signed-off-by: Bart Van Assche <[email protected]>
target_xcopy_locate_se_dev_e4() is used to locate an se_dev, based on
the WWN provided with the XCOPY request. Remove a couple of unneeded
arguments, and rely on the caller for the src/dst test.
target: return UNSUPPORTED TARGET/SEGMENT DESC TYPE CODE sense
Use UNSUPPORTED TARGET DESCRIPTOR TYPE CODE and UNSUPPORTED SEGMENT
DESCRIPTOR TYPE CODE additional sense codes if a descriptor type in an
XCOPY request is not supported, as specified in spc4r37 6.4.5 and 6.4.6.
David Disseldorp [Fri, 23 Dec 2016 10:37:56 +0000 (11:37 +0100)]
target: bounds check XCOPY total descriptor list length
spc4r37 6.4.3.5 states:
If the combined length of the CSCD descriptors and segment descriptors
exceeds the allowed value, then the copy manager shall terminate the
command with CHECK CONDITION status, with the sense key set to ILLEGAL
REQUEST, and the additional sense code set to PARAMETER LIST LENGTH
ERROR.
This functionality can be tested using the libiscsi
ExtendedCopy.DescrLimits test.
David Disseldorp [Fri, 23 Dec 2016 10:37:55 +0000 (11:37 +0100)]
target: bounds check XCOPY segment descriptor list
Check the length of the XCOPY request segment descriptor list against
the value advertised via the MAXIMUM SEGMENT DESCRIPTOR COUNT field in
the RECEIVE COPY OPERATING PARAMETERS response.
spc4r37 6.4.3.5 states:
If the number of segment descriptors exceeds the allowed number, the
copy manager shall terminate the command with CHECK CONDITION status,
with the sense key set to ILLEGAL REQUEST, and the additional sense
code set to TOO MANY SEGMENT DESCRIPTORS.
This functionality is testable using the libiscsi
ExtendedCopy.DescrLimits test.
David Disseldorp [Fri, 23 Dec 2016 10:37:54 +0000 (11:37 +0100)]
target: use XCOPY TOO MANY TARGET DESCRIPTORS sense
spc4r37 6.4.3.4 states:
If the number of CSCD descriptors exceeds the allowed number, the copy
manager shall terminate the command with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code
set to TOO MANY TARGET DESCRIPTORS.
LIO currently responds with INVALID FIELD IN PARAMETER LIST, which sees
it fail the libiscsi ExtendedCopy.DescrLimits test.
David S. Miller [Tue, 10 Jan 2017 16:27:46 +0000 (11:27 -0500)]
Merge tag 'wireless-drivers-for-davem-2017-01-10' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.10
Only two fixes at this time. The rtlwifi fix is an important one as it
fixes a reported oops and Linus was already asking about it. The
orinoco fix is not tested on a real device, because it's old legacy
hardware and hardly no-one use it, but it should fix a (theoretical)
issue with VMAP_STACK.
====================
Andy Lutomirski [Wed, 14 Dec 2016 02:50:13 +0000 (18:50 -0800)]
wusbcore: Fix one more crypto-on-the-stack bug
The driver put a constant buffer of all zeros on the stack and
pointed a scatterlist entry at it. This doesn't work with virtual
stacks. Use ZERO_PAGE instead.
The current implementation failed to detect short transfers when
attempting to read the line state, and also, to make things worse,
logged the content of the uninitialised heap transfer buffer.