mwifiex: handle error if IRQ request fails in mwifiex_sdio_of()
When this failure occurs, we will clear card->plt_wake_cfg so that
device would initialize without wake up on external interrupt feature.
This feature specific code in suspend and resume handlers will be
skipped.
Kalle Valo [Wed, 14 Sep 2016 16:34:50 +0000 (19:34 +0300)]
Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git
ath.git patches for 4.9. Major changes:
ath10k
* add nl80211 testmode support for 10.4 firmware
* hide kernel addresses from logs using %pK format specifier
* implement NAPI support
* enable peer stats by default
ath9k
* use ieee80211_tx_status_noskb where possible
wil6210
* extract firmware capabilities from the firmware file
ath6kl
* enable firmware crash dumps on the AR6004
ath-current is also merged to fix a conflict in ath10k.
David Howells [Tue, 13 Sep 2016 07:49:05 +0000 (08:49 +0100)]
rxrpc: Add IPv6 support
Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created:
service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);
instead of:
service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.
Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.
David Howells [Tue, 13 Sep 2016 07:49:05 +0000 (08:49 +0100)]
rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually
There are two places that want to transmit a packet in response to one just
received and manually pick the address to reply to out of the sk_buff.
Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
automatically.
David Howells [Tue, 13 Sep 2016 07:49:05 +0000 (08:49 +0100)]
rxrpc: Create an address for sendmsg() to bind unbound socket with
Create an address for sendmsg() to bind unbound socket with rather than
using a completely blank address otherwise the transport socket creation
will fail because it will try to use address family 0.
We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default. For anything
else, bind() must be used.
David Howells [Tue, 13 Sep 2016 21:36:22 +0000 (22:36 +0100)]
rxrpc: Correctly initialise, limit and transmit call->rx_winsize
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit. Further, we
need to place this in the ACK info instead of the sysctl setting.
Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window. Just discard the excess subpackets instead. This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.
David Howells [Tue, 13 Sep 2016 08:05:14 +0000 (09:05 +0100)]
rxrpc: Fix prealloc refcounting
The preallocated call buffer holds a ref on the calls within that buffer.
The ref was being released in the wrong place - it worked okay for incoming
calls to the AFS cache manager service, but doesn't work right for incoming
calls to a userspace service.
Instead of releasing an extra ref service calls in rxrpc_release_call(),
the ref needs to be released during the acceptance/rejectance process. To
this end:
(1) The prealloc ref is now normally released during
rxrpc_new_incoming_call().
(2) For preallocated kernel API calls, the kernel API's ref needs to be
released when the call is discarded on socket close.
(3) We shouldn't take a second ref in rxrpc_accept_call().
(4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds
the call to the to_be_accepted socket queue.
In doing (4) above, we would prefer not to put the call's refcount down to
0 as that entails doing cleanup in softirq context, but it's unlikely as
there are several refs held elsewhere, at least one of which must be put by
someone in process context calling rxrpc_release_call(). However, it's not
a problem if we do have to do that.
David Howells [Tue, 13 Sep 2016 08:12:34 +0000 (09:12 +0100)]
rxrpc: Adjust the call ref tracepoint to show kernel API refs
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace to at
the allocation point from the preallocation buffer for an incoming call.
Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.
David Howells [Tue, 13 Sep 2016 21:36:22 +0000 (22:36 +0100)]
rxrpc: Use skb->len not skb->data_len
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet. This will only cause a malfunction in the
following cases:
(1) We receive a jumbo packet (validation and splitting both are wrong).
(2) We see if there's extra ACK info in an ACK packet (we think it's not
there and just ignore it).
David Howells [Tue, 13 Sep 2016 21:36:21 +0000 (22:36 +0100)]
rxrpc: Requeue call for recvmsg if more data
rxrpc_recvmsg() needs to make sure that the call it has just been
processing gets requeued for further attention if the buffer has been
filled and there's more data to be consumed. The softirq producer only
queues the call and wakes the socket if it fills the first slot in the
window, so userspace might end up sleeping forever otherwise, despite there
being data available.
This is not a problem provided the userspace buffer is big enough or it
empties the buffer completely before more data comes in.
David Howells [Tue, 13 Sep 2016 21:36:21 +0000 (22:36 +0100)]
rxrpc: Add missing wakeup on Tx window rotation
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer otherwise the sender is liable to just hang
endlessly.
This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.
David Howells [Tue, 13 Sep 2016 21:36:21 +0000 (22:36 +0100)]
rxrpc: Make sure we initialise the peer hash key
Peer records created for incoming connections weren't getting their hash
key set. This meant that incoming calls wouldn't see more than one DATA
packet - which is not a problem for AFS CM calls with small request data
blobs.
Johannes Berg [Tue, 13 Sep 2016 14:39:38 +0000 (16:39 +0200)]
cfg80211: reduce connect key caching struct size
After the previous patches, connect keys can only (correctly)
be used for storing static WEP keys. Therefore, remove all the
data for dealing with key index 4/5 and reduce the size of the
key material to the maximum for WEP keys.
Johannes Berg [Tue, 13 Sep 2016 14:37:40 +0000 (16:37 +0200)]
cfg80211: validate key index better
Don't accept it if a key_idx < 0 snuck through, reject WEP keys with
key index 4 and 5 (which are used for IGTKs) and don't allow IGTKs
with key indices other than 4 and 5. This makes the key data match
expectations better.
Johannes Berg [Tue, 13 Sep 2016 14:11:32 +0000 (16:11 +0200)]
cfg80211: wext: only allow WEP keys to be configured before connected
When not connected, anything but WEP keys shouldn't be allowed to be
configured for later - only static WEP keys make sense at this point.
Change wext to reject anything else just like nl80211 does.
Johannes Berg [Tue, 13 Sep 2016 14:10:02 +0000 (16:10 +0200)]
nl80211: only allow WEP keys during connect command
This was already documented that way in nl80211.h, but the
parsing code still accepted other key types. Change it to
really only accept WEP keys as documented.
Johannes Berg [Tue, 13 Sep 2016 13:51:03 +0000 (15:51 +0200)]
nl80211: fix connect keys range check
Only key index 0-3 should be accepted, 4/5 are for IGTKs and
cannot be used as connect keys. Fix the range checking to not
allow such erroneous configurations.
David S. Miller [Tue, 13 Sep 2016 16:16:44 +0000 (12:16 -0400)]
Merge branch 'mlxsw-ethtool'
Jiri Pirko says:
====================
mlxsw: ethtool enhancements
Ido says:
Patches 1-4 do some minor cleanup in current ethtool ops. Patch 5
replace legacy {get,set}_settings callbacks with
{get,set}_link_ksettings.
====================
mlxsw: spectrum: Report port type according to operational speed
In case port isn't operational we shouldn't report the port type, but
instead return PORT_OTHER. This is consistent with most other drivers
that return PORT_OTHER when media type can't be determined.
mlxsw: spectrum: Report link partner's advertised speeds
If autonegotiation was performed successfully, then we should report the
link partner's advertised speeds instead of the operational speed of the
port.
Philippe Reynes [Sun, 11 Sep 2016 15:54:03 +0000 (17:54 +0200)]
net: ethernet: apm: xgene: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy_dev in the private structure, and update the driver to use the
one contained in struct net_device.
net: macb: fix missing unlock on error in macb_start_xmit()
Fix missing unlock before return from function macb_start_xmit()
in the error handling case.
Fixes: 007e4ba3ee13 ("net: macb: initialize checksum when using
checksum offloading") Signed-off-by: Wei Yongjun <[email protected]> Acked-by: Nicolas Ferre <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vivien Didelot [Thu, 8 Sep 2016 16:50:43 +0000 (12:50 -0400)]
net: bridge: add helper to call /sbin/bridge-stp
If /sbin/bridge-stp is available on the system, bridge tries to execute
it instead of the kernel implementation when starting/stopping STP.
If anything goes wrong with /sbin/bridge-stp, bridge silently falls back
to kernel STP, making hard to debug userspace STP.
This patch adds a br_stp_call_user helper to start/stop userspace STP
and debug errors from the program: abnormal exit status is stored in the
lower byte and normal exit status is stored in higher byte.
Below is a simple example on a kernel with dynamic debug enabled:
# ln -s /bin/false /sbin/bridge-stp
# brctl stp br0 on
br0: failed to start userspace STP (256)
# dmesg
br0: /sbin/bridge-stp exited with code 1
br0: failed to start userspace STP (256)
br0: using kernel STP
Johannes Berg [Tue, 13 Sep 2016 13:39:29 +0000 (15:39 +0200)]
mac80211: remove useless open_count check
__ieee80211_suspend() checks early on if there's anything
to do by checking open_count, so there's no need to check
again later in the function. Remove the useless check.
ath10k: remove 4-addr padding related hw_param configuration
hw_4addr_pad was added to handle different types of padding
in 4-address rx frame. But this padding is not very specific
to 4-address, it can happen even with three address + ethernet
decap mode. Since the padding information can be obtained
through Rx desc for QCA99X0 and newer chips, this hw_param
is not needed any more.
ath10k: properly remove padding from the start of rx payload
In QCA99X0 (QCA99X0, QCA9984, QCA9888 and QCA4019) family chips,
hw adds padding at the begining of the rx payload to make L3
header 4-byte aligned. In the chips doing this type of padding,
the number of bytes padded will be indicated through msdu_end:info1.
Define a hw_rx_desc_ops wrapper to retrieve the number of padded
bytes and use this while doing undecap. This should fix padding
related issues with ethernt decap format with QCA99X0, QCA9984,
QCA9888 and QCA4019 hw.
Signed-off-by: Vasanthakumar Thiagarajan <[email protected]>
[Rename operations to hw_ops for other purposes] Signed-off-by: Benjamin Berg <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
ath10k: add provision for Rx descriptor abstraction
There are slight differences in Rx hw descriptor information
among different chips. So far driver does not use those new
information for any functionalities, but there is one important
information which is available from QCA99X0 onwards to indicate
the number of bytes that hw padded at the begining of the rx
payload and this information is needed to undecap the rx
packet. Add an abstraction for Rx desc to make use of the
new desc information available. The callback that this patch
defines to retrieve the padding bytes will be used in follow-up
patch.
Signed-off-by: Vasanthakumar Thiagarajan <[email protected]>
[Rename operations to hw_ops for other purposes] Signed-off-by: Benjamin Berg <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
This is to prepare for rx descriptor abstraction where we'll
be dereferencing ath10k_hw_params member in hw.h. Moreover
hw.h looks more suitable to house ath10k_hw_params definition
than core.h
Thomas Pedersen [Tue, 6 Sep 2016 19:05:28 +0000 (12:05 -0700)]
ath10k: enable peer stats by default
IFTYPE_MESH_POINT need to rely on these for accurate path
selection metrics. Other modes will probably also find
them useful. Enabling peer stats has the side effect of
reducing max number of STAs from 128 to 118. There should
be negligible performance impact.
If users really need 128 STAs and don't mind losing out on
peer stats, they can still disable them:
Trival fix to remove unused variable ar_pci in ath10k_pci_tx_pipe_cleanup
when building with W=1:
drivers/net/wireless/ath/ath10k/pci.c:1696:21: warning: variable
'ar_pci' set but not used [-Wunused-but-set-variable]
Johannes Berg [Tue, 13 Sep 2016 06:28:22 +0000 (08:28 +0200)]
mac80211: simplify TDLS RA lookup
smatch pointed out that the second check of "tdls_auth" was
pointless since if it was true, we returned from the function
already. We can further simplify the code by moving the first
check (if it's a TDLS peer at all) into the outer if, to only
handle that inside. This simplifies the control flow here.
mac80211: Re-structure aqm debugfs output and keep CoDel stats per txq
Currently the 'aqm' stats in mac80211 only keeps overlimit drop stats,
not CoDel stats. This moves the CoDel stats into the txqi structure to
keep them per txq in order to show them in debugfs.
In addition, the aqm debugfs output is restructured by splitting it up
into three files: One global per phy, one per netdev and one per
station, in the appropriate directories. The files are all called aqm,
and are only created if the driver supports the wake_tx_queue op (rather
than emitting an error on open as previously).
Pull networking fixes from David Miller:
"Mostly small sets of driver fixes scattered all over the place.
1) Mediatek driver fixes from Sean Wang. Forward port not written
correctly during TX map, missed handling of EPROBE_DEFER, and
mistaken use of put_page() instead of skb_free_frag().
2) Fix socket double-free in KCM code, from WANG Cong.
3) QED driver fixes from Sudarsana Reddy Kalluru, including a fix for
using the dcbx buffers before initializing them.
4) Mellanox Switch driver fixes from Jiri Pirko, including a fix for
double fib removals and an error handling fix in
mlxsw_sp_module_init().
5) Fix kernel panic when enabling LLDP in i40e driver, from Dave
Ertman.
6) Fix padding of TSO packets in thunderx driver, from Sunil Goutham.
7) TCP's rcv_wup not initialized properly when using fastopen, from
Neal Cardwell.
8) Don't use uninitialized flow keys in flow dissector, from Gao
Feng.
9) Use after free in l2tp module unload, from Sabrina Dubroca.
10) Fix interrupt registry ordering issues in smsc911x driver, from
Jeremy Linton.
11) Fix crashes in bonding having to do with enslaving and rx_handler,
from Mahesh Bandewar.
12) AF_UNIX deadlock fixes from Linus.
13) In mlx5 driver, don't read skb->xmit_mode after it might have been
freed from the TX reclaim path. From Tariq Toukan.
14) Fix a bug from 2015 in TCP Yeah where the congestion window does
not increase, from Artem Germanov.
15) Don't pad frames on receive in NFP driver, from Jakub Kicinski.
16) Fix chunk fragmenting in SCTP wrt. GSO, from Marcelo Ricardo
Leitner.
17) Fix deletion of VRF routes, from Mark Tomlinson.
18) Fix device refcount leak when DAD fails in ipv6, from Wei Yongjun"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (101 commits)
net/mlx4_en: Fix panic on xmit while port is down
net/mlx4_en: Fixes for DCBX
net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state()
net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_all()
net: ethernet: renesas: sh_eth: add POST registers for rz
drivers: net: phy: mdio-xgene: Add hardware dependency
dwc_eth_qos: do not register semi-initialized device
sctp: identify chunks that need to be fragmented at IP level
mlxsw: spectrum: Set port type before setting its address
mlxsw: spectrum_router: Fix error path in mlxsw_sp_router_init
nfp: don't pad frames on receive
nfp: drop support for old firmware ABIs
nfp: remove linux/version.h includes
tcp: cwnd does not increase in TCP YeAH
net/mlx5e: Fix parsing of vlan packets when updating lro header
net/mlx5e: Fix global PFC counters replication
net/mlx5e: Prevent casting overflow
net/mlx5e: Move an_disable_cap bit to a new position
net/mlx5e: Fix xmit_more counter race issue
tcp: fastopen: avoid negative sk_forward_alloc
...
Johannes Berg [Mon, 29 Aug 2016 20:25:19 +0000 (23:25 +0300)]
mac80211: send delBA on unexpected BlockAck Request
If we don't have a BA session, send delBA, as requested by the
IEEE 802.11 spec. Apply the same limit of sending such a delBA
only once as in the previous patch.
Johannes Berg [Mon, 29 Aug 2016 20:25:18 +0000 (23:25 +0300)]
mac80211: send delBA on unexpected BlockAck data frames
When we receive data frames with ACK policy BlockAck, send
delBA as requested by the 802.11 spec. Since this would be
happening for every frame inside an A-MPDU if it's really
received outside a session, limit it to a single attempt.
Johannes Berg [Mon, 29 Aug 2016 20:25:17 +0000 (23:25 +0300)]
mac80211: add support for radiotap timestamp field
Use the existing device timestamp from the RX status information
to add support for the new radiotap timestamp field. Currently
only 32-bit counters are supported, but we also add the radiotap
mactime where applicable. This new field allows more flexibility
in where the timestamp is taken etc. The non-timestamp data in
the field is taken from a new field in the hw struct.
Aviya Erenfeld [Mon, 29 Aug 2016 20:25:16 +0000 (23:25 +0300)]
mac80211: add support for MU-MIMO air sniffer
add support to MU-MIMO air sniffer according groupID:
in monitor mode, use a given MU-MIMO groupID to monitor stations
that belongs to that group using MU-MIMO.
add support for following a station according to its MAC address
using VHT MU-MIMO sniffer:
the monitors wait until they get an action MU-MIMO notification
frame, then parses it in order to find the groupID that corresponds
to the given MAC address and monitors packets destined to that
groupID using VHT MU-MIMO.
Maxim Altshul [Mon, 22 Aug 2016 14:14:04 +0000 (17:14 +0300)]
mac80211: RX BA support for sta max_rx_aggregation_subframes
The ability to change the max_rx_aggregation frames is useful
in cases of IOP.
There exist some devices (latest mobile phones and some AP's)
that tend to not respect a BA sessions maximum size (in Kbps).
These devices won't respect the AMPDU size that was negotiated during
association (even though they do respect the maximal number of packets).
This violation is characterized by a valid number of packets in
a single AMPDU. Even so, the total size will exceed the size negotiated
during association.
Eventually, this will cause some undefined behavior, which in turn
causes the hw to drop packets, causing the throughput to plummet.
This patch will make the subframe limitation to be held by each station,
instead of being held only by hw.
The workqueue "cfg80211_wq" is involved in cleanup, scan and event related
works. It queues multiple work items &rdev->event_work,
&rdev->dfs_update_channels_wk,
&wiphy_to_rdev(request->wiphy)->scan_done_wk,
&wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict
execution ordering.
Hence, an ordered dedicated workqueue has been used.
Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure
forward progress under memory pressure.
Aviya Erenfeld [Mon, 29 Aug 2016 20:25:15 +0000 (23:25 +0300)]
mac80211: refactor monitor representation in sdata
Insert the u32 monitor flags variable in a new structure
that represents a monitor interface.
This will allow to add more configuration variables to
that structure which will happen in an upcoming change.
Denis Kenzior [Wed, 3 Aug 2016 22:02:15 +0000 (17:02 -0500)]
nl80211: Allow GET_INTERFACE dumps to be filtered
This patch allows GET_INTERFACE dumps to be filtered based on
NL80211_ATTR_WIPHY or NL80211_ATTR_WDEV. The documentation for
GET_INTERFACE mentions that this is possible:
"Request an interface's configuration; either a dump request on
a %NL80211_ATTR_WIPHY or ..."
However, this behavior has not been implemented until now.
Johannes: rewrite most of the patch:
* use nl80211_dump_wiphy_parse() to also allow passing an interface
to be able to dump its siblings
* fix locking (must hold rtnl around using nl80211_fam.attrbuf)
* make init self-contained instead of relying on other cb->args
When port is down, tx drop counter update is not needed.
Updating the counter in this case can cause a kernel
panic as when the port is down, ring can be NULL.
This patch adds a capability check before enabling DCBX.
In addition, it re-organizes the relevant data structures,
and fixes a typo in a define.
Fixes: af7d51852631 ("net/mlx4_en: Add DCB PFC support through CEE netlink commands") Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Kamal Heib [Sun, 11 Sep 2016 07:56:18 +0000 (10:56 +0300)]
net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state()
mlx4_en_dcbnl_set_state() returns u8, the return value from
mlx4_en_setup_tc() could be negative in case of failure, so fix that.
Fixes: af7d51852631 ("net/mlx4_en: Add DCB PFC support through CEE netlink commands") Signed-off-by: Kamal Heib <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net: dsa: bcm_sf2: Get VLAN_PORT_MASK from b53_device
While migrating the bcm_sf2 driver to use b53_common, we left a small
piece untouched where we kept our local copy of the per-port
port_vlan_ctl bitmask value. This value is now maintained by b53_device
so we need to use it instead of our local (and now stale) copy of it.
Fixes: f458995b9ad8 ("net: dsa: bcm_sf2: Utilize core B53 driver when possible") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Commit aa71987472a9 ("nvme: fabrics drivers don't need the nvme-pci
driver") removed the dependency on BLK_DEV_NVME, but the cdoe does
depend on the block layer (which used to be an implicit dependency
through BLK_DEV_NVME).
Otherwise you get various errors from the kbuild test robot random
config testing when that happens to hit a configuration with BLOCK
device support disabled.
Merge tag 'staging-4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO fixes from Greg KH:
"Here are a few small IIO fixes for 4.8-rc6.
Nothing major, full details are in the shortlog, all of these have
been in linux-next with no reported issues"
* tag 'staging-4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio:core: fix IIO_VAL_FRACTIONAL sign handling
iio: ensure ret is initialized to zero before entering do loop
iio: accel: kxsd9: Fix scaling bug
iio: accel: bmc150: reset chip at init time
iio: fix pressure data output unit in hid-sensor-attributes
tools:iio:iio_generic_buffer: fix trigger-less mode
David S. Miller [Sun, 11 Sep 2016 06:12:54 +0000 (23:12 -0700)]
Merge branch 'vrf-tx-hook'
David Ahern says:
====================
net: Convert vrf to tx hook
The motivation for this series is that ICMP Unreachable - Fragmentation
Needed packets are not handled properly for VRFs. Specifically, the
FIB lookup in __ip_rt_update_pmtu fails so no nexthop exception is
created with the reduced MTU. As a result connections stall if packets
larger than the smallest MTU in the path are generated.
While investigating that problem I also noticed that the MSS for all
connections in a VRF is based on the VRF device's MTU and not the
route the packets ultimately go through. VRF currently uses a dst
to direct packets to the device. The first FIB lookup returns this dst
and then the lookup in the VRF driver gets the actual output route. A
side effect of this design is that the VRF dst is cached on sockets
and then used for calculations like the MSS.
This series fixes this problem by removing the hook in the FIB lookups
that returns the dst pointing to the VRF device to the VRF and always
doing the actual FIB lookup. This allows the real dst to be used
throughout the stack (for example the MSS). Packets are diverted to
the VRF device on Tx using an l3mdev hook in the output path similar to
to what is done for Rx. The end result is a simpler implementation for
VRF with fewer intrusions into the network stack and symmetrical packet
handling for Rx and Tx paths.
Comparison of netperf performance for a build without l3mdev (best case
performance), the old vrf driver and the VRF driver from this series.
Data are collected using VMs with virtio + vhost. The netperf client
runs in the VM and netserver runs in the host. 1-byte RR tests are done
as these packets exaggerate the performance hit due to the extra lookups
done for l3mdev and VRF.
TCP_RR UDP_RR
IPv4 IPv6 IPv4 IPv6
no l3mdev 29,996 30,601 31,638 24,336
vrf old 27,417 27,626 29,159 24,801
vrf new 28,036 28,372 30,110 24,857
l3mdev, no vrf 29,534 30,465 30,670 24,346
* Transactions per second as reported by netperf
* netperf modified to take a bind-to-device argument -- the -J red option
1. 'no l3mdev' == NET_L3_MASTER_DEV is unset so code is compiled out
2. 'vrf old' == data for existing implementation
3. 'vrf new' == data with this series
4. 'l3mdev, no vrf' == NET_L3_MASTER_DEV is enabled but traffic is not
going through a VRF
About the series
- patch 1 adds the flow update (changing oif or iif to L3 master device
and setting the flag to skip the oif check) to ipv4 and ipv6 paths just
before hitting the rules. This catches all code paths in a single spot.
- patch 2 adds the Tx hook to push the packet to the l3mdev if relevant
- patch 3 adds some checks so the vrf device can act as a vrf-local
loopback. These changes were not needed before since the vrf dst was
returned from the lookup.
- patches 4 and 5 flip the ipv4 and ipv6 stacks to the tx hook leaving
the route lookup to be the real one. The dst flip happens at the
beginning of the L3 output path so the VRFs can have device based
features such as netfilter, tc and tcpdump.
- patches 6-11 remove no longer needed l3mdev code
v2
- properly handle IPv6 link scope addresses
- keep the device xmit path and associated dst which is switched in by
the l3_out hook. packets still need to go through the xmit path in
case the user puts a qdisc on the vrf device and to allow tc rules.
version 1 short circuited the tx handling and only covered netfilter
and tcpdump.
====================
David Ahern [Sat, 10 Sep 2016 19:09:56 +0000 (12:09 -0700)]
net: vrf: Flip IPv6 output path from FIB lookup hook to out hook
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv6 output processing to
send the packet to the VRF driver on xmit.
Link scope addresses (linklocal and multicast) need special handling:
specifically the oif the flow struct can not be changed because we
want the lookup tied to the enslaved interface. ie., the source address
and the returned route MUST point to the interface scope passed in.
Convert the existing vrf_get_rt6_dst to handle only link scope addresses.
David Ahern [Sat, 10 Sep 2016 19:09:55 +0000 (12:09 -0700)]
net: vrf: Flip IPv4 output path from FIB lookup hook to out hook
Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv4 output processing to
send the packet to the VRF driver on xmit.
David Ahern [Sat, 10 Sep 2016 19:09:53 +0000 (12:09 -0700)]
net: l3mdev: Add hook to output path
This patch adds the infrastructure to the output path to pass an skb
to an l3mdev device if it has a hook registered. This is the Tx parallel
to l3mdev_ip{6}_rcv in the receive path and is the basis for removing
the existing hook that returns the vrf dst on the fib lookup.
David Ahern [Sat, 10 Sep 2016 19:09:52 +0000 (12:09 -0700)]
net: flow: Add l3mdev flow update
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif
in flow struct if its oif or iif points to a device enslaved to an L3
Master device. Only 1 needs to be converted to match the l3mdev FIB
rule. This moves the flow adjustment for l3mdev to a single point
catching all lookups. It is redundant for existing hooks (those are
removed in later patches) but is needed for missed lookups such as
PMTU updates.
Markus Elfring [Sat, 10 Sep 2016 08:21:15 +0000 (10:21 +0200)]
ATM-ZeitNet: Replace one kzalloc() call by kcalloc()
* The script "checkpatch.pl" can point information out like the following.
WARNING: Prefer kcalloc over kzalloc with multiply
Thus fix the affected source code place.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
* Delete the local variable "size" which became unnecessary with
this refactoring.
Markus Elfring [Sat, 10 Sep 2016 08:07:38 +0000 (10:07 +0200)]
ATM-ZeitNet: Improve a size determination in zatm_open()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Markus Elfring [Sat, 10 Sep 2016 07:55:53 +0000 (09:55 +0200)]
ATM-ZeitNet: Use kmalloc_array() in start_tx()
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Markus Elfring [Sat, 10 Sep 2016 06:56:03 +0000 (08:56 +0200)]
ATM-nicstar: Refactor a dev_alloc_skb() call in dequeue_rx()
The script "checkpatch.pl" can point out that assignments should usually
not be performed within condition checks.
Thus move an assignment for a local variable to a separate statement
in this function.
Markus Elfring [Sat, 10 Sep 2016 06:48:17 +0000 (08:48 +0200)]
ATM-nicstar: Refactor a kmalloc() call in ns_init_card()
* The script "checkpatch.pl" can point out that assignments should usually
not be performed within condition checks.
Thus move an assignment for a local variable to a separate statement
in this function.
* Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.