Daniel Borkmann [Wed, 6 May 2015 14:12:30 +0000 (16:12 +0200)]
seccomp, filter: add and use bpf_prog_create_from_user from seccomp
Seccomp has always been a special candidate when it comes to preparation
of its filters in seccomp_prepare_filter(). Due to the extra checks and
filter rewrite it partially duplicates code and has BPF internals exposed.
This patch adds a generic API inside the BPF code code that seccomp can use
and thus keep it's filter preparation code minimal and better maintainable.
The other side-effect is that now classic JITs can add seccomp support as
well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
Daniel Borkmann [Wed, 6 May 2015 14:12:29 +0000 (16:12 +0200)]
net: filter: add __GFP_NOWARN flag for larger kmem allocs
When seccomp BPF was added, it was discussed to add __GFP_NOWARN
flag for their configuration path as f.e. up to 32K allocations are
more prone to fail under stress. As we're going to reuse BPF API,
add __GFP_NOWARN flags where larger kmalloc() and friends allocations
could fail.
It doesn't make much sense to pass around __GFP_NOWARN everywhere as
an extra argument only for seccomp while we just as well could run
into similar issues for socket filters, where it's not desired to
have a user application throw a WARN() due to allocation failure.
seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filter
Remove the calls to bpf_check_classic(), bpf_convert_filter() and
bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
instead.
seccomp_check_filter() is passed to bpf_prepare_filter() so that it
gets called from there, after bpf_check_classic().
We can now remove exposure of two internal classic BPF functions
previously used by seccomp. The export of bpf_check_classic() symbol,
previously known as sk_chk_filter(), was there since pre git times,
and no in-tree module was using it, therefore remove it.
net: filter: add a callback to allow classic post-verifier transformations
This is in preparation for use by the seccomp code, the rationale is
not to duplicate additional code within the seccomp layer, but instead,
have it abstracted and hidden within the classic BPF API.
As an interim step, this now also makes bpf_prepare_filter() visible
(not as exported symbol though), so that seccomp can reuse that code
path instead of reimplementing it.
David S. Miller [Sat, 9 May 2015 21:27:25 +0000 (17:27 -0400)]
Merge tag 'mac80211-next-for-davem-2015-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Lots of updates for net-next for this cycle. As usual, we have
a lot of small fixes and cleanups, the bigger items are:
* proper mac80211 rate control locking, to fix some random crashes
(this required changing other locking as well)
* mac80211 "fast-xmit", a mechanism to reduce, in most cases, the
amount of code we execute while going from ndo_start_xmit() to
the driver
* this also clears the way for properly supporting S/G and checksum
and segmentation offloads
====================
transfer_buffer_length is of type u32. It's therefore wrong to assign it
to a signed integer. This patch avoids the overflow.
It's worth noting that entry->length here is a long; perhaps it would be
beneficial at somepoint to change this to be unsigned as well, if
nothing else relies on its signedness for error conditions or the like.
Tony Camuso [Wed, 6 May 2015 13:09:18 +0000 (09:09 -0400)]
netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
This patch should have been part of the previous patch having the
same summary. See http://marc.info/?l=linux-kernel&m=143039470103795&w=2
Unfortunately, I didn't check to see where else this lock was used before
submitting that patch. This should take care of it for netxen_nic, as I
did a thorough search this time.
To recap from the original patch; although testing this driver with
DEBUG_LOCKDEP and DEBUG_SPINLOCK enabled did not produce any traces,
it would be more prudent in the case of tx_clean_lock to use _bh
versions of spin_[un]lock, since this lock is manipulated in both
the process and softirq contexts.
This patch was tested for functionality and regressions with netperf
and DEBUG_LOCKDEP and DEBUG_SPINLOCK enabled.
Eric Dumazet [Wed, 6 May 2015 21:26:25 +0000 (14:26 -0700)]
tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
Diagnosing problems related to Window Probes has been hard because
we lack a counter.
TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.
TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)
Eric Dumazet [Wed, 6 May 2015 21:26:24 +0000 (14:26 -0700)]
tcp: adjust window probe timers to safer values
With the advent of small rto timers in datacenter TCP,
(ip route ... rto_min x), the following can happen :
1) Qdisc is full, transmit fails.
TCP sets a timer based on icsk_rto to retry the transmit, without
exponential backoff.
With low icsk_rto, and lot of sockets, all cpus are servicing timer
interrupts like crazy.
Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
2) Receivers can send zero windows if they don't drain their receive queue.
TCP sends zero window probes, based on icsk_rto current value, with
exponential backoff.
With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
some cases), sender can abort in less than one or two minutes !
If receiver stops the sender, it obviously doesn't care of very tight
rto. Probability of dropping the ACK reopening the window is not
worth the risk.
Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
events (but not normal RTO based retransmits)
A followup patch adds a new SNMP counter, as it would have helped a lot
diagnosing this issue.
Richard Alpe [Wed, 6 May 2015 11:58:55 +0000 (13:58 +0200)]
tipc: add broadcast link window set/get to nl api
Add the ability to get or set the broadcast link window through the
new netlink API. The functionality was unintentionally missing from
the new netlink API. Adding this means that we also fix the breakage
in the old API when coming through the compat layer.
Richard Alpe [Wed, 6 May 2015 11:58:54 +0000 (13:58 +0200)]
tipc: fix default link prop regression in nl compat
Default link properties can be set for media or bearer. This
functionality was missed when introducing the NL compatibility layer.
This patch implements this functionality in the compat netlink
layer. It works the same way as it did in the old API. We search for
media and bearers matching the "link name". If we find a matching
media or bearer the link tolerance, priority or window is used as
default for new links on that media or bearer.
Implements t4_init_rss_mode() to initialize the rss_mode for all the ports. If
Tunnel All Lookup isn't specified in the global RSS Configuration, then we need
to specify a default Ingress Queue for any ingress packets which aren't hashed.
We'll use our first ingress queue.
David S. Miller [Sat, 9 May 2015 20:27:04 +0000 (16:27 -0400)]
Merge branch 'be2net'
Sathya Perla says:
====================
be2net: patch-set
The following patch-set has two new feature additions, and a few
minor fixes and cleanups.
Pls consider applying to the net-next tree. Thanks.
v2 changes:
a) dropped the "don't enable pause by default" patch
b) described how the "spoof check" works in patch 1's commit log
c) I had to update our email addresses from "@emulex" to
"@avagotech". I'll send a separate patch updating the
maintainers
Patch 1 adds support for the "spoofchk" knob for VFs.
When it is enabled, "spoof checking" is done for both MAC-address
and VLAN. For each VF, the HW ensures that the source MAC address
(or vlan) of every outgoing packet of the VF exists in the MAC-list
(or vlan-list) configured for RX filtering for that VF.
If not, the packet is dropped and an error is reported to the driver
in the TX completion.
Patch 2 improves interrupt moderation on Skyhawk-R chip by using
the EQ-DB mechanism to set a "re-arm to interrupt" delay. Currently
interrupt moderation is adjusted by calculating and configuring an
EQ-delay every second. This is done via a FW-cmd. This patch uses
the EQ_DB facility to calculate and set the interrupt delay every 1ms.
This helps moderating interrupts better when the traffic is bursty.
Patch 3 adds L3/L4 error accounting to BE3 VFs, by passing L3/4 error
packets to the network stack.
Patch 4 adds an extra FW-cmd error value check in the driver to identify
an "out of vlan filters" scenario.
Patch 5 stops enabling pause by default as this setting fails in
some HW-configs where priority pause is enabled in FW. If the user
tries to do the same, an appropriate error is returned via ethtool.
Patch 5 posts the full RXQ in be_open() to prevent packet drops due to
bursty traffic when the interface is enabled.
Patch 6 refactors the be_check_ufi_compatibility() routine, that checks
to see if a UFI file meant for a lower rev of a chip is being flashed
on a higher rev, to make it simpler.
Patch 7 replaces the usage of !be_physfn() macro with be_virtfn()
that is already avialble in the driver.
Patch 8 updates the year in the copyright text to 2015.
Path 9 bumps up the driver version to 10.6.02.
====================
The code in be_check_ufi_compatibility() checks to see if a UFI file meant
for a lower rev of a chip is being flashed on a higher rev, which is
disallowed. This patch re-writes the code needed for this check in a much
simpler manner.
Suresh Reddy [Wed, 6 May 2015 09:30:36 +0000 (05:30 -0400)]
be2net: post full RXQ on interface enable
When an RXQ is created in be_open(), the driver currently posts only
64 buffers. This sometimes results in packet drops when there is a traffic
burst as soon as the interface is enabled.
This patch fixes this problem by posting the full RXQ on interface enable.
Kalesh AP [Wed, 6 May 2015 09:30:35 +0000 (05:30 -0400)]
be2net: check for INSUFFICIENT_VLANS error
When the FW runs out of vlan filters it can either return an
INSUFFICIENT_RESOURCES error or an INSUFFICIENT_VLANS error.
The driver currently checks only for the former error value.
This patch adds a check for the latter value too.
Somnath Kotur [Wed, 6 May 2015 09:30:34 +0000 (05:30 -0400)]
be2net: receive pkts with L3, L4 errors on VFs
Currently pkts with L3 or L4 errors received on PFs are not dropped
by the adapter, but instead sent to the stack. This helps the network stack
to better reflect error statistics. This was not being done on BE3 VFs.
This patch fixes this for BE3 VFs.
be2net: set interrupt moderation for Skyhawk-R using EQ-DB
Currently adaptive interrupt moderation is set by calculating
and configuring an EQ-delay every second. This is done via
a FW-cmd. But, on Skyhawk-R a "re-arm to interrupt" delay
can be set while ringing the EQ-DB. This patch uses this
facility to calculate and set the interrupt delay every 1ms.
This helps moderating interrupts better when the traffic
is bursty.
Kalesh AP [Wed, 6 May 2015 09:30:32 +0000 (05:30 -0400)]
be2net: add support for spoofchk setting
This patch adds support for spoofchk configuration for VFs.
When it is enabled, "spoof checking" is done for both MAC-address and VLAN.
For each VF, the HW ensures that the source MAC address (or vlan) of
every outgoing packet exists in the MAC-list (or vlan-list) configured
for RX filtering for that VF. If not, the packet is dropped and an error
is reported to the driver in the TX completion; this is reflected in the
"tx_spoof_check_err" ethtool counter.
This feature is supported in Skyhawk FW version 10.6.31.0 and above.
Jean Delvare [Wed, 6 May 2015 07:04:40 +0000 (09:04 +0200)]
net: amd-xgbe: Add hardware dependency
The amd-xgbe driver currently only works with the Seattle SoC, which
is ARM64 architecture, so there is no point in building this driver on
other architectures except for build testing purpose. The dependency
list can be updated later if the driver ever supports other
architectures.
David S. Miller [Sat, 9 May 2015 20:16:49 +0000 (16:16 -0400)]
Merge branch 'sfc-next'
Shradha Shah says:
====================
sfc: Enabling EF10 Vf's, set up vswitching and bind the SFC driver to the VF's
This set of patches makes way for the implementation of EF10
SR-IOV driver starting with some cleanup code.
NIC specific SR-IOV functions are moved to their own header
and netdev_ops are made generic instead of being NIC specific
Next in line comes the patch to enable VF's using sriov_configure.
VEB vswitching hierarchy is set up next followed by patches to
prepare sfc driver to bind to enabled VF's
This is followed by patch to support use of shared RSS contexts
which makes VF's use shared RSS contexts in all cases.
Patch series ends with a patch to bind the sfc driver to the
enabled VF's which creates network interfaces corresponding to
the VF's.
Coming up soon are the patches to set_vf_mac, set_vf_config,
set_vf_vlan, vf_spoofcheck, etc.
These patches have been tested with and without CONFIG_SFC_SRIOV.
In the case of CONFIG_SFC_SRIOV=y enabling of VF's using
sriov_configure is also tested. The enabled VF's bind to the
installed sfc driver succesfully to create network interfaces.
In the case of CONFIG_SFC_SRIOV=n enabling of VF's using
sriov_configure returns the correct error message:
"Function not implemented".
====================
Shradha Shah [Wed, 6 May 2015 00:00:07 +0000 (01:00 +0100)]
sfc: Bind the sfc driver to any available VF's
Add the device ID of the VF to the PCI device ID table.
Added a boolean flag is_vf in efx_nic_type to differentiate
between a VF and PF at probe time. This flag is useful in later
patches while setting MAC address specially in the
PCI-passthrough case.
Jon Cooper [Tue, 5 May 2015 23:59:38 +0000 (00:59 +0100)]
sfc: Add use of shared RSS contexts.
Allow PFs to allocate shared RSS contexts if we exhaust our
exclusive RSS contexts. Make VFs use shared RSS contexts in
all cases.
Spruce up error handling so that the shadow copy of the RSS
table is updated after successful update, rather than in all
cases, so that we report the actual contents of the RSS table
after a failure to set it, rather than what we'd like it to be.
Populate context_size parameter when vacuously allocating RSS
context of size 1.
Edward Cree [Tue, 5 May 2015 23:59:18 +0000 (00:59 +0100)]
sfc: Cope with permissions enforcement added to firmware for SR-IOV
* Accept EPERM in some simple cases, the following cases are handled:
1) efx_mcdi_read_assertion()
Unprivileged PCI functions aren't allowed to GET_ASSERTS.
We return success as it's up to the primary PF to deal with asserts.
2) efx_mcdi_mon_probe() in efx_ef10_probe()
Unprivileged PCI functions aren't allowed to read sensor info, and
worrying about sensor data is the primary PF's job.
3) phy_op->reconfigure() in efx_init_port() and efx_reset_up()
Unprivileged functions aren't allowed to MC_CMD_SET_LINK, they just have
to accept the settings (including flow-control, which is what
efx_init_port() is worried about) they've been given.
4) Fallback to GET_WORKAROUNDS in efx_ef10_probe()
Unprivileged PCI functions aren't allowed to set workarounds. So if
efx_mcdi_set_workaround() fails EPERM, use efx_mcdi_get_workarounds()
to find out if workaround_35388 is enabled.
5) If DRV_ATTACH gets EPERM, try without specifying fw-variant
Unprivileged PCI functions have to use a FIRMWARE_ID of 0xffffffff
(MC_CMD_FW_DONT_CARE).
6) Don't try to exit_assertion unless one had fired
Previously we called efx_mcdi_exit_assertion even if
efx_mcdi_read_assertion had received MC_CMD_GET_ASSERTS_FLAGS_NO_FAILS.
This is unnecessary, and the resulting MC_CMD_REBOOT, even if the
AFTER_ASSERTION flag made it a no-op, would fail EPERM for unprivileged
PCI functions.
So make efx_mcdi_read_assertion return whether an assert happened, and only
call efx_mcdi_exit_assertion if it has.
Shradha Shah [Tue, 5 May 2015 23:58:54 +0000 (00:58 +0100)]
sfc: manually allocate and free vadaptors
To be able to use MC_CMD_VADAPTOR_SET_MAC, vadaptors must be
manually allocated and freed as automatic vadaptors will disappear
when their reference_count reaches zero, which must happen before
the MAC address is changed.
Vadaptors are allocated and freed in the vswitching_probe/remove
functions for PFs and VFs, and this means that vadaptors are restored
correctly following an MC reboot or other reset when required.
Shradha Shah [Tue, 5 May 2015 23:58:31 +0000 (00:58 +0100)]
sfc: create vports for VFs and assign random MAC addresses
The parent PF creates vports for all its child VFs and adds MAC
addresses to these. When the VF driver loads, it can make an MCDI
call to get the MAC address that the parent PF assigned it.
The parent PF also assigns a mac address to its own vport because
implicit creation of a vAdaptor will only work on evb ports with
MAC addresses assigned.
The vport MAC address needs to be stored in the PF's nic_data
struct as it can later be changed on the vadaptor (and its net_dev
struct). When removing a vport the original MAC address must be
deleted.
A new flag is needed in the VF data structure to identify whether
a vport has been assigned to the VF. This is to determine whether
it needs to be un-assigned before freeing the vport. Also,
attempting to un-assign a vport which is not assigned will result
in an EALREADY error.
Daniel Pieczko [Tue, 5 May 2015 23:57:34 +0000 (00:57 +0100)]
sfc: create VEB vswitch and vport above default firmware setup
Adds functions to allocate and free vswitches and vports; vadaptors
are automatically allocated and freed when TX/RX queues are
initialised and finalised. This vswitching structure is only created
if the firmware supports it, so a check that full-featured firmware
is running is performed first.
If the MC resets, the vswitching infrastructure will need to be
recreated, so mark the "must_probe_vswitching" flag when an MC reboot
is detected.
Don't try to create a vswitch if vf-count=0
This allocation of vswitches and vports does not currently support
configuring VLAN tags, but that can be added in a future change.
Daniel Pieczko [Tue, 5 May 2015 23:57:14 +0000 (00:57 +0100)]
sfc: record the PF's vport ID in nic_data
The default port ID of EVB_PORT_ID_ASSIGNED is a "magic" number
for the MCFW to select the physical port of the PF. If other
vswitches and vports are created on top of the default firmware
configuration, the ID of the newly created vport is then required
when passed to MCDI commands. Currently, this doesn't happen so
the vport_id is never changed, but a subsequent patch will change
this behaviour so that other vswitches and vports are created.
The vport_id recorded in nic_data is only relevant for PFs.
VFs will have their vports created by their parent PF, and in
that case the parent PF will record the vport ID of each VF.
For a VF, nic_data->vport_id is expected to remain at the default
value.
Daniel Pieczko [Tue, 5 May 2015 23:56:55 +0000 (00:56 +0100)]
sfc: Record [rt]x_dpcpu_fw_id in EF10 nic_data
The (future) code to add/remove vswitches and vports will be
dependent on the firmware variant.
To simplify the checking of the firmware variant, record
values for rx_dpcpu_fw_id and tx_dpcpu_fw_id in EF10 nic_data.
There was only one place where this was previously used:
efx_mcdi_print_fwver() in ethtool.c.
The MC_CMD_GET_CAPABILITIES can be replaced and the values from
nic_data used instead.
Note that the printing of "?" if the MC command fails or if the
outlength is incorrect no longer apply, because errors are returned
in efx_ef10_init_datapath_caps() in both of these cases.
Daniel Pieczko [Tue, 5 May 2015 23:55:36 +0000 (00:55 +0100)]
sfc: Move and rename efx_vf struct to siena_vf
The efx_vf struct contains Siena-specific fields for VFs,
so rename to siena_vf.
Also move it into the siena_nic_data struct, as EF10 will
track its VFs in its own ef10_nic_data, storing much less
information about them since VFDI is no longer used.
Shradha Shah [Tue, 5 May 2015 23:55:13 +0000 (00:55 +0100)]
sfc: Own header for nic-specific sriov functions, single instance of netdev_ops and sriov removed from Falcon code
By putting all the efx_{siena,ef10}_sriov_* declarations in
{siena,ef10}_sriov.h, ensure they cannot be called from nic-generic code.
Also fixes up an instance of this, where mcdi.c was calling
efx_siena_sriov_flr.
The single instance of netdev_ops should call general high level
functions that can then call something adapter specific in efx_nic_type.
We should only do adapter specialisation via efx_nic_type.
Removal of sriov functionality from the Falcon code means that tests
are needed for the presence of some callbacks.
David S. Miller [Sat, 9 May 2015 20:05:54 +0000 (16:05 -0400)]
Merge branch 'dsa-next'
Andrew Lunn says:
====================
More Marvell DSA refactring and fixup
This patch setup continues the refactoring and cleanup of the Marvell
DSA drivers.
Patch #1 Centralizes the duplicated parts of port setup and global
setup into the shared mv88e6xxx.
Patch #2 Centralizes looping over the ports setting them up
Patch #3 Uses mnemonics for the remaining register access in the
drivers.
Patch #4 The 6172 is actually a member of the 6352 family. This moves
the probe code into the correct driver.
Patch #5 Adds more members of the 6171 family to the 6171 driver. The
new devices are untested.
Patch #6 The 6185 is a member of the 6131 family. Add it to the probe
code of the 6131 driver.
Patch #7 and Patch #8 Simply the mutex's in mv88e6xxx.c. The SMI bus
is the bottleneck, not the granularity of the mutex's so simply the
code down to a single mutex.
Patch #8 Fixes a false positive lockdep splat, due to nested uses of
MDIO busses.
Patch #9 Fixes another false positive lockdep splat with the transmit
queue because of stacked Ethernet devices.
====================
Andrew Lunn [Tue, 5 May 2015 23:09:56 +0000 (01:09 +0200)]
net: dsa: Add lockdep class to tx queues to avoid lockdep splat
DSA stacks an Ethernet device on top of an Ethernet device. This can
cause false positive lockdep splats for the transmit queue: Acked-by: Florian Fainelli <[email protected]>
=============================================
[ INFO: possible recursive locking detected ] 4.0.0-rc7-01838-g70621a215fc7 #386 Not tainted
---------------------------------------------
kworker/0:0/4 is trying to acquire lock:
(_xmit_ETHER#2){+.-...}, at: [<c040e95c>] sch_direct_xmit+0xa8/0x1fc
but task is already holding lock:
(_xmit_ETHER#2){+.-...}, at: [<c03f4208>] __dev_queue_xmit+0x4d4/0x56c
other info that might help us debug this:
Possible unsafe locking scenario:
DSA can have nested MDIO busses, where the Ethernet MDIO bus is used
to access an MDIO bus within the switch which has the PHYs connected
to it. This nesting causes lockdep to give false positives. Use
mutex_lock_nested() to avoid this.
Andrew Lunn [Tue, 5 May 2015 23:09:54 +0000 (01:09 +0200)]
net: dsa: mv88e6xxx: Replace stats mutex with SMI mutex
The SMI bus is the bottleneck in all switch operations, not the
granularity of locks. Replace the stats mutex by the SMI mutex to make
the locking concept simpler.
The REG_READ/REG_WRITE macros cannot be used while holding the SMI
mutex, since they try to acquire it. Replace with calls to the
appropriate function which does not try to get the mutex.
Andrew Lunn [Tue, 5 May 2015 23:09:53 +0000 (01:09 +0200)]
net: dsa: mv88e6xxx: Replace PHY mutex by SMI mutex
The SMI bus is the bottleneck in all switch operations, not the
granularity of locks. Replace the PHY mutex by the SMI mutex to make
the locking concept simpler.
The REG_READ/REG_WRITE macros cannot be used while holding the SMI
mutex, since they try to acquire it. Replace with calls to the
appropriate function which does not try to get the mutex.
Andrew Lunn [Tue, 5 May 2015 23:09:50 +0000 (01:09 +0200)]
net: dsa: Move mv88e6172 support into mv88e6352 family driver
The mv88e6172 is part of the mv88e6352 family of devices. Move support
for it out of the mv88e6171 driver into the mv88e6352, which results
in some simplifications to the code.
Andrew Lunn [Tue, 5 May 2015 23:09:47 +0000 (01:09 +0200)]
net: dsa: Centralise global and port setup code into mv88e6xxx.
The port setup code in the individual drivers is identical for 6123,
6171, and 6352, and very similar in 6131. Move it all into mv88e6xxx,
using the chip families to differentiate on features.
Similarly, the global setup is also very similar. Move the majority
into mv8e6xxx.
The chips themselves fall into families. Add helpers which uses the
device IDs to determine if a device is a member of a family or not.
Add some additional device IDs to the existing list, to make these
helper functions more complete. However these IDs are not yet added to
the probe functions.
Nathan Sullivan [Tue, 5 May 2015 20:00:25 +0000 (15:00 -0500)]
net: macb: Handle the RXUBR interrupt on all devices
The same hardware issue the at91 must work around applies to at least the
Zynq ethernet, and possibly more devices. The driver also needs to handle
the RXUBR interrupt since it turns it on with MACB_RX_INT_FLAGS anyway.
This patch-set contains bug fixes for state-recovery at the RDS
layer when the underlying transport is TCP and the TCP state at one
of the endpoints is reset
V2 changes: DaveM comments to reduce memory footprint, follow
NFS/RPC model where possible. Added test-case #3
Without the changes in this set, when one of the endpoints is reset,
the existing code does not correctly clean up RDS socket state for stale
connections, resulting in some unstable, timing-dependant behavior on
the wire, including an infinite exchange of 3WHs back-and-forth, and a
resulting potential to never converge RDS state.
Test cases used to verify the changes in this set are:
1. Start rds client/server applications on two participating nodes,
node1 and node2. After at least one packet has been sent (to establish
the TCP connection), restart the rds_tcp module on the client, and
now resend packets. Tcpdump should show server sending a FIN for the
"old" client port, and clean connection establishment/exchange for
the new client port.
2. At the end of step 1, restart rds srever on node2, and start client on
node1, make sure using tcpdump, 'netstat -an|grep 16385' that
packets flow correctly.
3. start RDS client/server application on two participating nodes, and
repeat steps 1 and 2, but this time, simulate node failure by doing
"ifconfig <intf> down", so no FIN is sent.
====================
net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.
When the peer of an RDS-TCP connection restarts, a reconnect
attempt should only be made from the active side of the TCP
connection, i.e. the side that has a transient TCP port
number. Do not add the passive side of the TCP connection
to the c_hash_node and thus avoid triggering rds_queue_reconnect()
for passive rds connections.
net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.
When running RDS over TCP, the active (client) side connects to the
listening ("passive") side at the RDS_TCP_PORT. After the connection
is established, if the client side reboots (potentially without even
sending a FIN) the server still has a TCP socket in the esablished
state. If the server-side now gets a new SYN comes from the client
with a different client port, TCP will create a new socket-pair, but
the RDS layer will incorrectly pull up the old rds_connection (which
is still associated with the stale t_sock and RDS socket state).
This patch corrects this behavior by having rds_tcp_accept_one()
always create a new connection for an incoming TCP SYN.
The rds and tcp state associated with the old socket-pair is cleaned
up via the rds_tcp_state_change() callback which would typically be
invoked in most cases when the client-TCP sends a FIN on TCP restart,
triggering a transition to CLOSE_WAIT state. In the rarer event of client
death without a FIN, TCP_KEEPALIVE probes on the socket will detect
the stale socket, and the TCP transition to CLOSE state will trigger
the RDS state cleanup.
Markus Stenberg [Tue, 5 May 2015 10:36:59 +0000 (13:36 +0300)]
ipv6: Fixed source specific default route handling.
If there are only IPv6 source specific default routes present, the
host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail
calls ip6_route_output first, and given source address any, it fails,
and ip6_route_get_saddr is never called.
The change is to use the ip6_route_get_saddr, even if the initial
ip6_route_output fails, and then doing ip6_route_output _again_ after
we have appropriate source address available.
Note that this is '99% fix' to the problem; a correct fix would be to
do route lookups only within addrconf.c when picking a source address,
and never call ip6_route_output before source address has been
populated.
David S. Miller [Sat, 9 May 2015 19:51:00 +0000 (15:51 -0400)]
Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:
====================
Here are a couple of important Bluetooth & mac802154 fixes for 4.1:
- mac802154 fix for crypto algorithm allocation failure checking
- mac802154 wpan phy leak fix for error code path
- Fix for not calling Bluetooth shutdown() if interface is not up
Let me know if there are any issues pulling. Thanks.
====================
Linus Torvalds [Sat, 9 May 2015 04:39:12 +0000 (21:39 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes for bugs caught while digging in fs/namei.c. The
first one is this cycle regression, the second is 3.11 and later"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
path_openat(): fix double fput()
namei: d_is_negative() should be checked before ->d_seq validation
Al Viro [Sat, 9 May 2015 02:53:15 +0000 (22:53 -0400)]
path_openat(): fix double fput()
path_openat() jumps to the wrong place after do_tmpfile() - it has
already done path_cleanup() (as part of path_lookupat() called by
do_tmpfile()), so doing that again can lead to double fput().
Al Viro [Thu, 7 May 2015 23:24:57 +0000 (19:24 -0400)]
namei: d_is_negative() should be checked before ->d_seq validation
Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
be true does *not* mean that inode we'd fetched had been NULL - that
holds only while ->d_seq is still unchanged.
Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
verification.
Linus Torvalds [Sat, 9 May 2015 03:59:02 +0000 (20:59 -0700)]
Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
"When an arm user reported crashes near page_address(page) in my new
code, it became clear that I can't be trusted with GFP masks. Filipe
beat me to the patch, and I'll just be in the corner with my dunce cap
on"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong mapping flags for free space inode
Linus Torvalds [Sat, 9 May 2015 03:38:21 +0000 (20:38 -0700)]
Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Two additional fixes for changes introduced via DM during the 4.1
merge window.
The first reverts a dm-crypt change that wasn't correct. The second
fixes a device format regression that impacted userspace"
* tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
init: fix regression by supporting devices with major:minor:offset format
Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
Linus Torvalds [Sat, 9 May 2015 02:49:35 +0000 (19:49 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes since the merge window;
- fix for a double elevator module release, from Chao Yu. Ancient bug.
- the splice() MORE flag fix from Christophe Leroy.
- a fix for NVMe, fixing a patch that went in in the merge window.
From Keith.
- two fixes for blk-mq CPU hotplug handling, from Ming Lei.
- bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md.
- two blk-mq fixes from Shaohua, fixing a race on queue stop and a
bad merge issue with FUA writes.
- division-by-zero fix for writeback from Tejun.
- a block bounce page accounting fix, making sure we inc/dec after
bouncing so that pre/post IO pages match up. From Wang YanQing"
* 'for-linus' of git://git.kernel.dk/linux-block:
splice: sendfile() at once fails for big files
blk-mq: don't lose requests if a stopped queue restarts
blk-mq: fix FUA request hang
block: destroy bdi before blockdev is unregistered.
block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
elevator: fix double release of elevator module
writeback: use |1 instead of +1 to protect against div by zero
blk-mq: fix CPU hotplug handling
blk-mq: fix race between timeout and CPU hotplug
NVMe: Fix VPD B0 max sectors translation
Linus Torvalds [Sat, 9 May 2015 02:42:59 +0000 (19:42 -0700)]
Merge tag 'gpio-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Here is a bunch of GPIO fixes that I collected since -rc1, nothing
controversial, nothing special:
- fix a memory leak for GPIO hotplug.
- fix a signedness bug in the ACPI GPIO pin validation.
- driver fixes: Qualcomm SPMI and OMAP MPUIO IRQ issues"
* tag 'gpio-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: omap: Fix regression for MPUIO interrupts
gpio: sysfs: fix memory leaks and device hotplug
pinctrl: qcom-spmi-gpio: Fix input value report
pinctrl: qcom-spmi-gpio: Fix output type configuration
gpiolib: change gpio pin from unsigned to signed in acpi callback
Linus Torvalds [Sat, 9 May 2015 02:34:35 +0000 (19:34 -0700)]
Merge tag 'mmc-4.1-rc2' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Don't access RPMB partitions for normal read/write
- Fix hibernation restore sequence
MMC host:
- dw_mmc: Fix card detection for non removable cards
- dw_mmc: Fix sglist issue in 32-bit mode
- sh_mmcif: Fix timeout value for command request"
* tag 'mmc-4.1-rc2' of git://git.linaro.org/people/ulf.hansson/mmc:
mmc: dw_mmc: dw_mci_get_cd check MMC_CAP_NONREMOVABLE
mmc: dw_mmc: init desc in dw_mci_idmac_init
mmc: card: Don't access RPMB partitions for normal read/write
mmc: sh_mmcif: Fix timeout value for command request
mmc: core: add missing pm event in mmc_pm_notify to fix hib restore
Linus Torvalds [Sat, 9 May 2015 01:22:05 +0000 (18:22 -0700)]
Merge tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"The newly added ftrace_print_array_seq() function had a bug in it.
Luckily, the only user of it didn't make the 4.1 merge window.
But the helper function should be fixed before 4.2 when the users
start coming in"
* tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Make ftrace_print_array_seq compute buf_len
ARM: dts: Add keep-power-in-suspend to WiFi SDIO node for exynos5250-snow
The Marvell mwifiex driver prevents the system to enter into a suspend
state if the card power is not preserved during a suspend/resume cycle.
So Suspend-to-RAM and Suspend-to-idle are failing on Exynos5250 Snow.
Add the keep-power-in-suspend Power Management property to the SDIO/MMC
node so the mwifiex suspend handler doesn't fail and the system is able
to enter into a suspend state.
Markus Reichl [Fri, 8 May 2015 18:05:51 +0000 (03:05 +0900)]
ARM: dts: add 'rtc_src' clock to rtc node for exynos4412-odroid boards
The Exynos4412 SoC has a s3c6410 RTC where the source clock
is now a mandatory property.
This patch fixes probe failure of s3c-rtc on Odroid-X2/U2/U3 boards.
ARM: dts: Make DP a consumer of DISP1 power domain on Exynos5420
Commit ea08de16eb1b ("ARM: dts: Add DISP1 power domain for exynos5420")
added a device node for the Exynos5420 DISP1 power domain but dit not
make the DP controller a consumer of that power domain.
This causes an "Unhandled fault: imprecise external abort" error if the
exynos-dp driver tries to access the DP controller registers and the PD
was turned off. This lead to a kernel panic and a complete system hang.
Make the DP controller device node a consumer of the DISP1 power domain
to ensure that the PD is turned on when the exynos-dp driver is probed.
Fixes: ea08de16eb1b ("ARM: dts: Add DISP1 power domain for exynos5420") Signed-off-by: Javier Martinez Canillas <[email protected]> Signed-off-by: Kukjin Kim <[email protected]>
MAINTAINERS: socfpga: update the git repo for SoCFPGA
The git tree at rocketboards.org is going away. Update the entry to reflect
the address of the new location. Also add an entry for all the socfpga_*
dts files.
Mario Kleiner [Mon, 4 May 2015 04:29:44 +0000 (06:29 +0200)]
drm/tegra: Don't use vblank_disable_immediate on incapable driver.
Tegra would not only need a hardware vblank counter that
increments at leading edge of vblank, but also support
for instantaneous high precision vblank timestamp queries, ie.
a proper implementation of dev->driver->get_vblank_timestamp().
Without these, there can be off-by-one errors during vblank
disable/enable if the scanout is inside vblank at en/disable
time, and additionally clients will never see any useable
vblank timestamps when querying via drmWaitVblank ioctl. This
would negatively affect swap scheduling under X11 and Wayland.
Dave Airlie [Fri, 8 May 2015 10:52:51 +0000 (20:52 +1000)]
Merge tag 'drm-amdkfd-fixes-2015-05-07' of git://people.freedesktop.org/~gabbayo/linux into drm-fixes
- Add missing initialization of SDMA vm register when creating an SDMA queue
- Don't report local memory size, as we don't support local memory allocation
yet.
- Allow to unregister process with exisiting queues. Until now we blocked
it with BUG_ON, which was also an error by itself.
* tag 'drm-amdkfd-fixes-2015-05-07' of git://people.freedesktop.org/~gabbayo/linux:
drm/amdkfd: Initialize sdma vm when creating sdma queue
drm/amdkfd: Don't report local memory size
drm/amdkfd: allow unregister process with queues
Dave Airlie [Fri, 8 May 2015 10:52:21 +0000 (20:52 +1000)]
Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Mostly stability fixes for UVD and VCE, plus a few other bug and regression
fixes.
* 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: stop trying to suspend UVD sessions
drm/radeon: more strictly validate the UVD codec
drm/radeon: make UVD handle checking more strict
drm/radeon: make VCE handle check more strict
drm/radeon: fix userptr lockup
drm/radeon: fix userptr BO unpin bug v3
drm/radeon: don't setup audio on asics that don't support it
drm/radeon: disable semaphores for UVD V1 (v2)
NeilBrown [Fri, 8 May 2015 08:19:40 +0000 (18:19 +1000)]
md/raid5: fix handling of degraded stripes in batches.
There is no need for special handling of stripe-batches when the array
is degraded.
There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.
So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.
So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.
Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device. This simply removes the replacement device and
requires no extra handling.
NeilBrown [Fri, 8 May 2015 08:19:39 +0000 (18:19 +1000)]
md/raid5: fix allocation of 'scribble' array.
As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.
We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.
So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.
Signed-off-by: NeilBrown <[email protected]> Fixes: 46d5b785621a ("raid5: use flex_array for scribble data")
NeilBrown [Fri, 8 May 2015 08:19:34 +0000 (18:19 +1000)]
md/raid5: don't record new size if resize_stripes fails.
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.
This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.
So only update pool_size if there is no error.
This bug was introduced in 2.6.17 and the patch is suitable for
-stable.
Fixes: ad01c9e3752f ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Cc: [email protected] (v2.6.17+) Signed-off-by: NeilBrown <[email protected]>
NeilBrown [Fri, 8 May 2015 08:19:32 +0000 (18:19 +1000)]
md/raid5: more incorrect BUG_ON in handle_stripe_fill.
It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.
md-raid0: conditional mddev->queue access to suit dm-raid
This patch is a prerequisite for dm-raid "raid0" support to allow
dm-raid to access the MD RAID0 personality doing unconditional
accesses to mddev->queue, which is NULL in case of dm-raid stacked on
top of MD.
Most of the conditional mddev->queue accesses made it to upstream but
this missing one, which prohibits md raid0 to set disk stack limits
(being done in dm core in case of md underneath dm).
Linus Torvalds [Thu, 7 May 2015 22:58:00 +0000 (15:58 -0700)]
Merge tag 'pm+acpi-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These include three regression fixes (PCI resources management,
ACPI/PNP device enumeration, ACPI SBS on MacBook) and two ACPI
documentation fixes related to GPIO.
Specifics:
- Fix for a PCI resources management regression introduced during the
4.0 cycle and related to the handling of ACPI resources'
Producer/Consumer flags that turn out to be useless (Jiang Liu)
- Fix for a MacBook regression related to the Smart Battery Subsystem
(SBS) driver causing various problems (stalls on boot, failure to
detect or report battery) to happen and introduced during the 3.18
cycle (Chris Bainbridge)
- Fix for an ACPI/PNP device enumeration regression introduced during
the 3.16 cycle caused by failing to include two PNP device IDs into
the list of IDs that PNP device objects need to be created for
(Witold Szczeponik)
- Fixes for two minor mistakes in the ACPI GPIO properties
documentation (Antonio Ospite, Rafael J Wysocki)"
* tag 'pm+acpi-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PNP: add two IDs to list for PNPACPI device enumeration
ACPI / documentation: Fix ambiguity in the GPIO properties document
ACPI / documentation: fix a sentence about GPIO resources
ACPI / SBS: Add 5 us delay to fix SBS hangs on MacBook
x86/PCI/ACPI: Make all resources except [io 0xcf8-0xcff] available on PCI bus
Nicolas Ferre [Wed, 29 Apr 2015 09:57:18 +0000 (11:57 +0200)]
MAINTAINERS: replace an AT91 maintainer
As some help is needed from an active maintainer, replace Andrew Victor
by Alexandre Belloni in the ARM/Atmel MAINTAINERS' entry (aka AT91).
Add an entry to the CREDITS file.
Thanks Andrew for the great role you played during the early days of this
product family.
Mark Salter [Tue, 28 Apr 2015 17:09:32 +0000 (13:09 -0400)]
drivers: CCI: fix used_mask init in validate_group()
Currently in validate_group(), there is a static initializer
for fake_pmu.used_mask which is based on CPU_BITS_NONE but
the used_mask array size is based on CCI_PMU_MAX_HW_EVENTS.
CCI_PMU_MAX_HW_EVENTS is not based on NR_CPUS, so CPU_BITS_NONE
is not correct and will cause a build failure if NR_CPUS
is set high enough to make CPU_BITS_NONE larger than used_mask.
Arnd Bergmann [Thu, 7 May 2015 16:28:04 +0000 (18:28 +0200)]
Merge tag 'stericsson-fixes-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson into fixes
Merge "Ux500 fixes" from Linus Walleij:
This fixes an MMC/SD configuration issue present for some time
in the Ux500 DT but triggered by proper error handling in v4.1-rc1.
* tag 'stericsson-fixes-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson:
ARM: ux500: Enable GPIO regulator for SD-card for snowball
ARM: ux500: Enable GPIO regulator for SD-card for HREF boards
ARM: ux500: Move GPIO regulator for SD-card into board DTSs
Arnd Bergmann [Thu, 7 May 2015 16:25:38 +0000 (18:25 +0200)]
Merge tag 'omap-for-v4.1/fixes-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
Merge "omap fixes against v4.1-rc1" from Tony Lindgren:
Fixes for omaps, mostly a fix for power power consumption
creeping up during idle, and two l3-noc device fixes:
- Fix power consumption creeping up with I2C4 staying on
- Fix n900 microphone bias voltages
- Fix dra7 l3-noc for host clock
- Fix omap5 l3-noc id address decoding
The rest are all just minor dts fixes:
- Fix changed EXTCON_USB_GPIO_USB in defconfig
- Fix missing isp and iva #iommu-cells property
- Various beagle x15 dts fixes for pre-production changes
- Fix am437x-sk display dts entries
* tag 'omap-for-v4.1/fixes-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
bus: omap_l3_noc: Fix master id address decoding for OMAP5
bus: omap_l3_noc: Fix offset for DRA7 CLK1_HOST_CLK1_2 instance
ARM: dts: dra7: Fix efuse register size for ABB
ARM: dts: am57xx-beagle-x15: Switch GPIO fan number
ARM: dts: am57xx-beagle-x15: Switch UART mux pins
ARM: dts: am437x-sk: reduce col-scan-delay-us
ARM: dts: am437x-sk: fix for new newhaven display module revision
ARM: dts: am57xx-beagle-x15: Fix RTC aliases
ARM: dts: am57xx-beagle-x15: Fix IRQ type for mcp7941x
ARM: dts: omap3: Add #iommu-cells to isp and iva iommu
ARM: omap2plus_defconfig: Enable EXTCON_USB_GPIO
ARM: dts: OMAP3-N900: Add microphone bias voltages
ARM: OMAP2+: Fix omap off idle power consumption creeping up
Arnd Bergmann [Thu, 7 May 2015 16:24:32 +0000 (18:24 +0200)]
Merge tag 'arm-soc/for-4.1/maintainers' of http://github.com/broadcom/stblinux into fixes
Merge "MAINTAINERS update for Broadcom SoCs for 4.1 #2" from Florian Fainelli:
This pull request contains 3 changes to the MAINTAINERS file for Broadcom SoCs:
- add Ray and Scott for mach-bcm
- remove Christian for mach-bcm
- remove Marc for brcmstb
* tag 'arm-soc/for-4.1/maintainers' of http://github.com/broadcom/stblinux:
MAINTAINERS: Update brcmstb entry
MAINTAINERS: Remove Christian Daudt for mach-bcm
MAINTAINERS: Update mach-bcm maintainers list