David S. Miller [Fri, 23 Sep 2016 12:38:59 +0000 (08:38 -0400)]
Merge branch 'dsa-port-fast-ageing'
Vivien Didelot says:
====================
net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding to Disabled, Blocking or Listening.
This makes the drivers more complex and hides this generic switch logic.
This patchset introduces a new optional port_fast_age operation to
dsa_switch_ops, to move this logic to the DSA layer and keep drivers
simple. b53 and mv88e6xxx are updated accordingly.
====================
Vivien Didelot [Thu, 22 Sep 2016 20:49:22 +0000 (16:49 -0400)]
net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding to Disabled, Blocking or Listening.
This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
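As a rough sketch of how a driver wires this up (callback shape per the description above; the driver-side helper and private struct are made up for illustration):

struct foo_priv;				/* made-up driver private data */
void foo_flush_port_fdb(struct foo_priv *priv, int port);	/* hypothetical */

/* Driver side: flush the MAC addresses learned on @port. */
static void foo_port_fast_age(struct dsa_switch *ds, int port)
{
	struct foo_priv *priv = ds->priv;

	foo_flush_port_fdb(priv, port);
}

static struct dsa_switch_ops foo_switch_ops = {
	/* ... */
	.port_fast_age	= foo_port_fast_age,
};

/* DSA layer side (conceptual, accessor name approximate): fast-age when
 * the port leaves the Learning/Forwarding states.
 */
if (ds->ops->port_fast_age)
	ds->ops->port_fast_age(ds, port);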
The current driver programs a static MSS value into the hardware register
used by the TSO offload engine to segment the TCP payload, regardless of
the MSS value provided by the network stack.
This patch fixes this by programming the hardware registers with the
stack-provided MSS value.
Since the hardware has the limitation of having only 4 MSS registers,
this patch uses a reference count of the MSS values being used.
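A minimal sketch of the kind of reference counting described above (structure and helper names are hypothetical, not taken from the driver):

#define NUM_MSS_REGS	4

struct mss_slot {
	u32 mss;		/* MSS value programmed in this HW register */
	u32 refcnt;		/* number of users of this value */
};

/* Return the slot index programmed with @mss, reusing or allocating one. */
static int select_mss_slot(struct mss_slot *slots, u32 mss)
{
	int i, free = -1;

	for (i = 0; i < NUM_MSS_REGS; i++) {
		if (slots[i].refcnt && slots[i].mss == mss) {
			slots[i].refcnt++;	/* reuse a matching slot */
			return i;
		}
		if (!slots[i].refcnt && free < 0)
			free = i;		/* remember a free slot */
	}

	if (free < 0)
		return -EBUSY;			/* all 4 HW registers busy */

	slots[free].mss = mss;
	slots[free].refcnt = 1;
	/* the real driver would program the HW MSS register here */
	return free;
}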
David Howells [Fri, 23 Sep 2016 11:39:23 +0000 (12:39 +0100)]
rxrpc: Don't send an ACK at the end of service call response transmission
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the sent data. At that point, the call is complete.
David Howells [Fri, 23 Sep 2016 12:17:33 +0000 (13:17 +0100)]
rxrpc: Preset timestamp on Tx sk_buffs
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Colin Ian King [Thu, 22 Sep 2016 17:48:58 +0000 (18:48 +0100)]
cxgb4: fix signed wrap around when decrementing index idx
Change predecrement compare to post decrement compare to avoid an
unsigned integer wrap-around comparison when decrementing idx in
the while loop.
For example, when idx is zero, the current situation will
predecrement idx in the while loop, wrapping idx to the maximum
signed integer and cause out of bounds reads on rxq_info->msix_tbl[idx].
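An illustration of the pattern described above (not the actual cxgb4 code), assuming an unsigned index:

unsigned int idx = 0;

/* Before: the pre-decrement runs even when idx == 0, wrapping idx to a
 * huge value, so msix_tbl[idx] is read far out of bounds.
 */
while (--idx)
	use(rxq_info->msix_tbl[idx]);

/* After: the comparison is done against the old value, so the loop body
 * is never entered once idx has reached zero.
 */
while (idx-- > 0)
	use(rxq_info->msix_tbl[idx]);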
This series further enhances the SRIOV TC offloads of mlx5 to handle
the TC vlan push and pop actions. This serves a common use-case in
virtualization systems where the virtual switch adds (pushes) vlan tags
to packets sent from VMs and removes (pops) vlan tags before the packet
is received by the VM. We use the new E-Switch switchdev mode and the
TC vlan action to achieve that also in SW defined SRIOV environments by
offloading TC rules that contain this action along with forwarding
(TC mirred/redirect action) the packet.
In the first patch we add some helpers to access the TC vlan action info
by offloading drivers. The next five patches don't add any new functionality;
they do some refactoring and cleanups in the current code to be used next.
The seventh patch deals with supporting vlans by the mlx5 e-switch in switchdev
mode. The eighth patch does the vlan action offload from TC and the last patch
adds matching for vlans as typically required by TC flows that involve vlan
pop action.
The series was applied on top of commit 524605e "cxgb4: Convert to use simple_open()"
====================
Or Gerlitz [Thu, 22 Sep 2016 17:01:47 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g. for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, whereas another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host carry a
vlan tag. The VF rep driver removes such a vlan tag before calling into
the host networking stack.
Or Gerlitz [Thu, 22 Sep 2016 17:01:43 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Set vport representor fields explicitly on registration
The structure we use for the eswitch vport representor (mlx5_eswitch_rep)
has some fields which are set from upper layers in the driver when they
register the rep. Use explicit setting on registration time for them and
avoid global memcpy. This patch doesn't add new functionality.
Or Gerlitz [Thu, 22 Sep 2016 17:01:42 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Set the vport when registering the uplink rep
Set the vport value in the PF entry to be that of the uplink so
we can use it blindly over the tc / eswitch offload code without
translating it each time we deal with the uplink representor.
Eric Dumazet [Thu, 22 Sep 2016 15:58:55 +0000 (08:58 -0700)]
net_sched: sch_fq: account for schedule/timers drifts
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.
We take into account the difference between the time that was programmed
when the last packet was sent and the current time (a drift of tens of
usecs is often observed).
Add an EWMA of the unthrottle latency to help diagnostics.
This latency is the difference between the current time and the oldest
packet in the delayed RB-tree. This accounts for the high resolution timer
latency, but can be different under stress, as fq_check_throttled() can
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.
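A sketch of the integer EWMA described above, with a weight of 1/8 as is common in the kernel (field names may differ slightly from sch_fq):

u64 sample = now - time_next_delayed_flow;	/* oldest delayed packet */

/* new_avg = 7/8 * old_avg + 1/8 * sample, using only shifts */
unthrottle_latency_ns -= unthrottle_latency_ns >> 3;
unthrottle_latency_ns += sample >> 3;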
sfc: check async completer is !NULL before calling
Add a NULL check before calling asynchronous MCDI completion functions
during device removal.
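The guard amounts to something like the following (argument and field names are assumptions for illustration, not copied from the sfc code):

/* Only invoke the asynchronous completer if one was registered. */
if (async->complete)
	async->complete(efx, async->cookie, rc, outbuf, outlen);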
Fixes: 7014d7f6 ("sfc: allow asynchronous MCDI without completion function")
Signed-off-by: Bert Kenward <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 23 Sep 2016 10:55:04 +0000 (06:55 -0400)]
Merge branch 'sctp-fix-gap-ack-blocks'
Marcelo Ricardo Leitner says:
====================
sctp: fix the handling of SACK Gap Ack blocks
sctp_acked() is using 32-bit arithmetic on 16-bit vars, via the TSN_lte()
macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled,
and thus to verify the Gap Ack blocks we can just use plain
less-than / greater-or-equal checks.
Also, rename the gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
sctp: improve how SSN, TSN and ASCONF serial are compared
Make it similar to time_before() macros:
- easier to understand
- make use of typecheck() to avoid working on unexpected variable types
(made the issue on previous patch visible)
- for the _[lg]te versions, slightly faster, as the compiler used to generate
a sequence of cmp/je/cmp/js instructions and now it's sub/test/jle
(for _lte); a sketch of the new macros is shown after the example below.
Before, for sctp_outq_sack:
if (primary->cacc.changeover_active) {
1f01: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx)
1f08: 74 6e je 1f78 <sctp_outq_sack+0xe8>
u8 clear_cycling = 0;
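Roughly, the new typecheck()-based comparison macros look like this (see include/net/sctp/sctp.h for the exact definitions):

/* Modeled after time_before(): reject anything that isn't a __u32 at
 * compile time, then compare using serial number arithmetic.
 */
#define TSN_lt(a, b)			\
	(typecheck(__u32, a) &&		\
	 typecheck(__u32, b) &&		\
	 ((__s32)((a) - (b)) < 0))

#define TSN_lte(a, b)			\
	(typecheck(__u32, a) &&		\
	 typecheck(__u32, b) &&		\
	 ((__s32)((a) - (b)) <= 0))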
sctp_acked() is using 32-bit arithmetic on 16-bit vars, via the TSN_lte()
macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled,
and thus to verify the Gap Ack blocks we can just use plain
less-than / greater-or-equal checks.
Also, rename the gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
netfilter: nf_tables: check tprot_set first when we use xt.thoff
pkt->xt.thoff is not always set properly, but we use it without any check.
For the payload expr, it will cause wrong results. For nftrace, we may notify
the wrong network or transport header to user space; furthermore, if the
following nft rule is input, a warning message will be printed out:
# nft add rule arp filter output meta nftrace set 1
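The fix boils down to a guard of this shape (close to, but not necessarily identical to, the nft_payload change):

case NFT_PAYLOAD_TRANSPORT_HEADER:
	if (!pkt->tprot_set)
		goto err;		/* thoff was never set, don't use it */
	offset = pkt->xt.thoff;
	break;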
netfilter: nf_tables: improve nft payload fast eval
There's an off-by-one issue in nft_payload_fast_eval: skb_tail_pointer
and ptr + priv->len both point to the last valid address plus 1. So if
they are equal, we can still fetch valid data, and it's unnecessary to
fall back to nft_payload_eval.
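In other words, the boundary check only has to reject the strictly-greater case; a sketch (following the names in the text, not a verbatim copy of nft_payload_fast_eval):

ptr += priv->offset;
/* ptr + priv->len == skb_tail_pointer(skb) still covers only valid data */
if (unlikely(ptr + priv->len > skb_tail_pointer(skb)))
	return false;	/* fall back to nft_payload_eval() */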
netfilter: nf_queue: improve queue range support for bridge family
After commit ac2863445686 ("netfilter: bridge: add nf_afinfo to enable
queuing to userspace"), we can queue packets to the user space in bridge
family. But when the user specifies a queue range, packets will only be
delivered to the first queue num, because nfqueue_hash only supports the
ipv4 and ipv6 families. Now add support for the bridge family too.
netfilter: nft_queue: add _SREG_QNUM attr to select the queue number
Currently, the user can specify the queue numbers via the _QUEUE_NUM and
_QUEUE_TOTAL attributes; this is enough in most situations.
But actually, it is not very flexible, for example:
tcp dport 80 mapped to queue0
tcp dport 81 mapped to queue1
tcp dport 82 mapped to queue2
In order to do this, we must add 3 nft rules, and more
mappings mean more rules ...
So take one register to select the queue number; then we can add one
simple rule to map the queues, maybe like this:
queue num tcp dport map { 80:0, 81:1, 82:2 ... }
Florian Westphal also proposed wider usage scenarios:
queue num jhash ip saddr . ip daddr mod ...
queue num meta cpu ...
queue num meta mark ...
The last point is how to load the queue number from the source register.
We could use *(u16 *)&regs->data[reg] to load the queue number, just as
the nat expr does to load its l4 port.
But we will cooperate with the hash expr, meta cpu, meta mark expr and so
on. They all store their result as a u32, so casting it to a u16 pointer
and dereferencing it will generate a wrong result on big endian systems.
So just keep it simple: we treat the queue number as a u32, although a
u16 would already be enough.
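A sketch of the resulting load (register and field names approximate):

/* Read the full 32-bit register; works regardless of endianness ... */
u32 queue = regs->data[priv->sreg_qnum];

/* ... unlike casting to a narrower pointer, which breaks on big endian:
 * u16 queue = *(u16 *)&regs->data[priv->sreg_qnum];
 */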
netfilter: nf_tables: validate maximum value of u32 netlink attributes
Fetch the value of a u32 netlink attribute and validate it. This
validation is usually required when a u32 netlink attribute is being
stored in a field whose size is smaller.
This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Laura Garcia Liebana <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Emil Tantilov [Fri, 9 Sep 2016 19:59:10 +0000 (12:59 -0700)]
ixgbe: reset before SRIOV init to avoid mailbox issues
Enabling SRIOV while the ixgbevf driver is loaded will result in all
mailbox requests from ixgbevf_open() being rejected by ixgbe because
adapter->clear_to_send is set to false on reset.
Call ixgbe_sriov_reinit() before pci_enable_sriov() to make sure that
mailbox requests are handled from the time ixgbevf is loaded.
Alexander Duyck [Thu, 8 Sep 2016 03:28:24 +0000 (20:28 -0700)]
ixgbe: Support 4 queue RSS on VFs with 1 or 2 queue RSS on PF
Instead of limiting the VFs if we don't use 4 queues for RSS in the PF we
can instead just limit the RSS queues used to a power of 2. By doing this
we can support use cases where VFs are using more queues than the PF is
currently using and can support RSS if so desired.
The only limitation on this is that we cannot support 3 queues of RSS in
the PF or VF. In either of these cases we should fall back to 2 queues in
order to be able to use the power of 2 masking provided by the psrtype
register.
Alexander Duyck [Thu, 8 Sep 2016 03:28:17 +0000 (20:28 -0700)]
ixgbe: Limit reporting of redirection table if SR-IOV is enabled
The hardware redirection table can support more queues than the PF
currently has when SR-IOV is enabled. In order to account for this, use the
RSS mask to trim off the bits that are not used.
Alexander Duyck [Thu, 8 Sep 2016 03:28:11 +0000 (20:28 -0700)]
ixgbe: Allow setting multiple queues when SR-IOV is enabled
The maximum queue count reported was 1; however, support for multiple queues
with SR-IOV was added some time ago, so we should report support for it to
the user so that they can select multiple queues if they so desire.
Lihong Yang [Wed, 7 Sep 2016 01:05:05 +0000 (18:05 -0700)]
i40evf: remove unnecessary error checking against i40e_shutdown_adminq
The i40e_shutdown_adminq function never returns failure. There is no need to
check the non-0 return value. Clean up the unnecessary error checking and
warning against it.
Alexander Duyck [Wed, 7 Sep 2016 01:05:04 +0000 (18:05 -0700)]
i40e: Limit TX descriptor count in cases where frag size is greater than 16K
The i40e driver was incorrectly assuming that we would always be pulling
no more than 1 descriptor from each fragment. It is in fact possible for
us to end up with the case where 2 descriptors worth of data may be pulled
when a frame is larger than one of the pieces generated when aligning the
payload to either 4K or pieces smaller than 16K.
To adjust for this we just need to make certain to test all the way to the
end of the fragments as it is possible for us to span 2 descriptors in the
block before us so we need to guarantee that even the last 6 descriptors
have enough data to fill a full frame.
i40evf: remove unnecessary error checking against i40evf_up_complete
Function i40evf_up_complete() always returns success. Changed this to a
void type and removed the code that checks the return status and prints
an error message.
Currently disabling the link state from PF via
ip link set enp5s0f0 vf 0 state disable
doesn't disable the CARRIER on the VF.
This patch updates the carrier and starts/stops the tx queues based on the
link state notification from PF.
PF: enp5s0f0, VF: enp5s2
#modprobe i40e
#echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
#ip link set enp5s2 up
#ip -d link show enp5s2
175: enp5s2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether ea:4d:60:bc:6f:85 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
#ip link set enp5s0f0 vf 0 state disable
#ip -d link show enp5s0f0
171: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 72 numrxqueues 72 portid 6805ca2e7268
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state disable, trust off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
#ip -d link show enp5s2
175: enp5s2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether ea:4d:60:bc:6f:85 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 16 numrxqueues 16
Colin Ian King [Sun, 28 Aug 2016 17:41:01 +0000 (18:41 +0100)]
i40e: avoid potential null pointer dereference when assigning len
There is a sanity check for desc being null in the first line of
function i40evf_debug_aq. However, before that, aq_desc is cast from
desc, and aq_desc is being dereferenced on the assignment of len, so
this is a potential null pointer dereference. Fix this by moving
the initialization of len to the code block where len is being used,
at which point we know it is safe to dereference aq_desc.
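A simplified sketch of the reordering (abbreviated; the datalen field follows the i40e admin queue descriptor layout):

struct i40e_aq_desc *aq_desc = (struct i40e_aq_desc *)desc;
u16 len;

if (!desc)
	return;

/* only computed here, where desc is known to be non-NULL */
len = le16_to_cpu(aq_desc->datalen);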
Carolyn Wyborny [Wed, 24 Aug 2016 18:33:51 +0000 (11:33 -0700)]
i40e: Fix for extra byte swap in tunnel setup
This patch fixes an issue where we were byte swapping the port
parameter, then byte swapping it again in function execution.
Obviously, that's unnecessary, so take it out of the function calls.
Without this patch, the udp based tunnel configuration would
not be correct.
Carolyn Wyborny [Wed, 24 Aug 2016 18:33:50 +0000 (11:33 -0700)]
i40e: Fix to check for NULL
This patch fixes an issue in the virt channel code, where a return
from i40e_find_vsi_from_id was not checked for NULL when applicable.
Without this patch, there is a risk for panic and static analysis
tools complain. This patch fixes the problem by adding the check
and adding an additional input check for similar reasons.
Alan Brady [Wed, 24 Aug 2016 18:33:47 +0000 (11:33 -0700)]
i40e: fix "dump port" command when NPAR enabled
When using the debugfs to issue the "dump port" command
with NPAR enabled, the firmware reports back with invalid argument.
The issue occurs because the pf->mac_seid was used to perform the query.
This is fine when NPAR is disabled because the switch ID == pf->mac_seid;
however, this is not the case when NPAR is enabled. This fix instead
goes through the VSI to determine the correct ID to use in either case.
Alan Brady [Wed, 24 Aug 2016 18:33:46 +0000 (11:33 -0700)]
i40e: fix setting user defined RSS hash key
Previously, when using ethtool to change the RSS hash key, ethtool would
report back saying the old key was still being used and no error was
reported. It was unclear whether it was being reported incorrectly or
being set incorrectly. Debugging revealed 'i40e_set_rxfh()' returned
zero immediately instead of setting the key because a user defined
indirection table is not supplied when changing the hash key.
This fix instead changes it such that if an indirection table is not
supplied, then a default one is created and the hash key is now
correctly set.
Bluetooth: Fix NULL pointer dereference in mgmt context
Adds missing callback assignment to cmd_complete in the pending management
command context. The dump path involves a security procedure performed on
legacy (pre-SSP) devices with service security requirements set to HIGH
(16-digit PIN). It fails when a shorter PIN is delivered by the user.
netfilter: nft_numgen: add number generation offset
Add support for an offset value for the incremental counter and random
modes. With this option the sysadmin is able to start the counter at a
certain value and then apply the generated number.
Example:
meta mark set numgen inc mod 2 offset 100
This will generate marks with the series 100, 101, 100, 101, ...
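A sketch of how the offset is applied by the incremental generator (close to, but not necessarily identical to, the nft_numgen code):

struct nft_ng_inc {
	atomic_t	counter;
	u32		modulus;
	u32		offset;
	/* ... */
};

static u32 nft_ng_inc_gen(struct nft_ng_inc *priv)
{
	u32 nval, oval;

	do {
		oval = atomic_read(&priv->counter);
		nval = (oval + 1 < priv->modulus) ? oval + 1 : 0;
	} while (atomic_cmpxchg(&priv->counter, oval, nval) != oval);

	return nval + priv->offset;	/* offset shifts the whole series */
}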
Sean Wang [Thu, 22 Sep 2016 08:36:15 +0000 (16:36 +0800)]
net: ethernet: mediatek: remove superfluous local variable for phy address
Remove the unused variable for parsing the PHY address
and the related sanity-check logic, all of which is
already handled when of_mdiobus_register is called.
David S. Miller [Thu, 22 Sep 2016 12:21:28 +0000 (08:21 -0400)]
Merge branch 'mediatek-trgmii'
Sean Wang says:
====================
mediatek: add support for RGMII on GMAC0 through TRGMII hardware module
By default, GMAC0 is connected to the built-in switch called
MT7530 through the proprietary interface called Turbo RGMII
(TRGMII). TRGMII also works well for RGMII as used by generic external
PHYs, but this requires some slight changes to the TRGMII setup and
is not well supported by the current driver.
So this patchset
1) provides the slight setup changes needed for RGMII to work
through TRGMII
2) adds an additional phy-mode setting "trgmii" as
PHY_INTERFACE_MODE_TRGMII in the device tree to let GMAC0
distinguish which mode it runs in
3) dynamically changes the source clock, TX/RX delay and interface
mode on TRGMII to adapt to the various links
Changes since v1:
- fixed the comment style that was missing a space at
the beginning and end of comment lines
- add support for phy-mode "trgmii" as PHY_INTERFACE_MODE_TRGMII
into linux/phy.h
- enhance the Documentation about device tree binding for trgmii
which is applicable only for GMAC0 which uses fixed-link
====================
Sean Wang [Thu, 22 Sep 2016 02:33:55 +0000 (10:33 +0800)]
net: ethernet: mediatek: add support for GMAC0 connecting with external PHY through TRGMII
Dynamically change the source clock, TX/RX delay and interface mode
used by the TRGMII hardware module inside the PHY capability polling
routine, to adapt to the various RGMII speeds used by the external PHY
on GMAC0.
Sean Wang [Thu, 22 Sep 2016 02:33:54 +0000 (10:33 +0800)]
net: ethernet: mediatek: add extension of phy-mode for TRGMII
Add PHY-mode "trgmii" as an extension of the operation modes of the
PHY interface, i.e. PHY_INTERFACE_MODE_TRGMII, and add a trgmii variable
inside mtk_mac to indicate whether the MAC is connected to the internal
switch or to an external PHY, based on the configuration given by the
board, and then perform the corresponding setup of the TRGMII hardware
module.
David S. Miller [Thu, 22 Sep 2016 12:14:59 +0000 (08:14 -0400)]
Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Preparation for slow-start algorithm [ver #2]
Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:
(1) Stop storing the protocol header in the Tx socket buffers, but rather
generate it on the fly. This potentially saves a little space and
makes it easier to alter the header just before transmission (the
flags may get altered and the serial number has to be changed).
(2) Mask off the Tx buffer annotations and add a flag to record which ones
have already been resent.
(3) Track RTT on a per-peer basis for use in future changes. Tracepoints
are added to log this.
(4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
ACK from which RTT data can be calculated. The response also carries
other useful information.
(5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively
using sendmsg, this allows us, under some circumstances, to avoid
having to rely on the background work item to run to generate this
ACK.
This requires ktime_sub_ms() to be added.
(6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
ACKs from which RTT data can be calculated.
(7) Limit the use of pings and ACK requests for RTT determination.
Changes:
(V2) Don't use the C division operator for 64-bit division. One instance
should use do_div() and the other should be using nsecs_to_jiffies().
The last two patches got transposed, leading to an undefined symbol
in one of them.
Reported-by: kbuild test robot <[email protected]>
====================
David Howells [Wed, 21 Sep 2016 23:29:32 +0000 (00:29 +0100)]
rxrpc: Reduce the number of PING ACKs sent
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Reduce the number of ACK-Requests sent
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic. We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.
This could be made tunable in future.
Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
when the handshake completes. This patch fixes it.
This patch fixes the under-accounting of these SNMP rtx stats
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets
Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Thu, 22 Sep 2016 07:31:22 +0000 (03:31 -0400)]
Merge branch 'ftgmac100-ast2500-support'
Joel Stanley says:
====================
ftgmac100 support for ast2500
This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.
They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.
V2 reworks the two patches relating to PHYSTS_CHG into the one patch that
disables the interrupt instead of playing with interrupt sensitivity. I kept
patch 4 'net/faraday: Clear stale interrupts' which was first introduced to
clear the stale PHYSTS_CHG interrupt, as it helps keep us safe from unhygienic
(vendor) bootloaders.
====================
Joel Stanley [Wed, 21 Sep 2016 23:05:03 +0000 (08:35 +0930)]
net/faraday: Mask out PHYSTS_CHG interrupt
The PHYSTS_CHG interrupt (the ftgmac100's PHY IRQ) tells the system to go
look at the PHY registers for a link status change.
The interrupt was causing issues on Aspeed SoCs where some board designs
had an active-high configuration, some active-low, and in some cases the
line was repurposed for other functions. When misconfigured, Linux would
chew 100% of CPU cycles servicing interrupts:
While the interrupt in the ftgmac100 IP can be configured for high, low
and edge sensitivity, the current driver always polls the PHY, so we
chose to mask out the interrupt.
See https://patchwork.ozlabs.org/patch/672099/ for more discussion.
Joel Stanley [Wed, 21 Sep 2016 23:05:02 +0000 (08:35 +0930)]
net/faraday: Configure old MDIO interface on Aspeed SoCs
The Aspeed SoCs have a new MDIO interface as an option in the G4 and G5
SoCs. The old one is still available, so select it in order to remain
compatible with the ftgmac100 driver.
There is a stale interrupt (PHYSTS_CHG in the ISR, bit #6 in register 0x0)
left over from the bootloader (u-boot) when enabling the MAC. These stale
interrupts aren't generated by the kernel and should be cleared.
This clears the stale interrupts in the ISR (0x0) when enabling the MAC.
Joel Stanley [Wed, 21 Sep 2016 23:05:00 +0000 (08:35 +0930)]
net/faraday: Adapt for Aspeed SoCs
The RXDES and TXDES register bits in the ftgmac100 indicate EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.
It appears that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a workaround.
It was confirmed with engineers from Aspeed that using bit 30 is
correct for both the ast2400 and ast2500 SoCs.
Andrew Jeffery [Wed, 21 Sep 2016 23:04:59 +0000 (08:34 +0930)]
net/faraday: Make EDO{R,T}R bits configurable
These bits are #defined at a fixed location. In order to support future
hardware that has chosen to move these bits around, move the bits into a
member of struct ftgmac100.
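A sketch of what this looks like (member names and the probe-time selection are illustrative, not necessarily the final driver code):

struct ftgmac100 {
	/* ... */
	u32 rxdes0_edorr_mask;	/* End Of Rx Ring bit for this hardware */
	u32 txdes0_edotr_mask;	/* End Of Tx Ring bit for this hardware */
};

/* at probe time, following the bit positions discussed above */
if (is_aspeed_mac)			/* e.g. matched via OF compatible */
	priv->rxdes0_edorr_mask = BIT(30);
else
	priv->rxdes0_edorr_mask = BIT(15);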
Andrew Jeffery [Wed, 21 Sep 2016 23:04:58 +0000 (08:34 +0930)]
net/faraday: Separate rx page storage from rxdesc
The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Obtain RTT data by requesting ACKs on DATA packets
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.
This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.
This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Expedite ping response transmission
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending. We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).
If we don't expedite it, it's left to the background processing thread to
transmit.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Send pings to get RTT data
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for. The PING RESPONSE ACK packet will tell us
the following about the peer:
(1) its receive window size
(2) its MTU sizes
(3) its support for jumbo DATA packets
(4) if it supports slow start (similar to RFC 5681)
(5) an estimate of the RTT
This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.
A pair of tracepoints are added so that RTT determination can be observed.
David S. Miller [Thu, 22 Sep 2016 07:13:39 +0000 (03:13 -0400)]
Merge branch 'sctp-align'
Marcelo Ricardo Leitner says:
====================
Rename WORD_TRUNC/ROUND macros and use them
This patchset aims to rename these macros to a non-confusing name, as
reported by David Laight and David Miller, and to update all remaining
places to make use of them; there was just one last remaining spot.
v3:
- Name it SCTP_PAD4 instead of SCTP_ALIGN4, as suggested by David Laight
v2:
- fixed 2nd patch summary
Details on the specific changelogs.
====================
Rename WORD_TRUNC/ROUND to something more meaningful these days,
especially because these macros work on packet headers or lengths, which
are not tied to any CPU arch but to the protocol itself.
So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.
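For reference, the renamed helpers are simple 4-byte alignment macros, roughly:

/* truncate a length down to a 4-byte boundary */
#define SCTP_TRUNC4(len)	((len) & ~3)
/* pad a length up to a 4-byte boundary */
#define SCTP_PAD4(len)		(((len) + 3) & ~3)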
David S. Miller [Thu, 22 Sep 2016 06:51:48 +0000 (02:51 -0400)]
Merge branch 'mlx5e-xdp'
Tariq Toukan says:
====================
mlx5e XDP support
This series adds XDP support in mlx5e driver.
This includes the use cases: XDP_DROP, XDP_PASS, and XDP_TX.
Single stream performance tests show 16.5 Mpps for XDP_DROP,
and 12.4 Mpps for XDP_TX, with nice scalability for multiple streams/rings.
This rate of XDP_DROP is lower than the 32 Mpps we got in previous
implementation, when Striding RQ was used.
We moved to non-Striding RQ, as some XDP_TX requirements (like headroom,
packet-per-page) cannot be satisfied with the current Striding RQ HW,
and we decided to fully support both DROP/TX.
A few directions are being considered in order to enable the faster rate for
XDP_DROP, e.g. a possibility for users to enable Striding RQ so they choose
optimized XDP_DROP at the price of partial XDP_TX functionality, or some HW
changes.
Series generated against net-next commit: cf714ac147e0 'ipvlan: Fix dependency issue'
Thanks,
Tariq
V2:
* patch 8:
- when XDP_TX fails, call mlx5e_page_release and drop the packet.
- update xdp_tx counter within mlx5e_xmit_xdp_frame.
(mlx5e_xmit_xdp_frame return value becomes obsolete, change it to void)
- drop the packet for unknown XDP return code.
* patch 9:
- use a boolean for xdp_doorbell in SQ struct, instead of dragging it
throughout the functions calls.
- handle doorbell and counters within mlx5e_xmit_xdp_frame.
====================
Add support for XDP_TX forwarding from the xdp program.
Using XDP, the user can now loop packets out of the same port.
We create a dedicated TX SQ for each channel that will serve
XDP programs that return XDP_TX action to loop packets back to
the wire directly from the channel RQ RX path.
For that, RX pages will now need to be mapped bi-directionally,
and on the XDP_TX action we will sync the page back to the device and then
queue it into the SQ for transmission. The XDP xmit frame function will
report back to the RX path whether the page was consumed (transmitted); if
so, the RX path will forget about that page as if it were released to the
stack. Later on, on XDP TX completion, the page will be released back to
the page cache.
For simplicity this patch will hit a doorbell on every XDP TX packet.
The next patch will introduce an xmit-more-like mechanism that will
queue up more than one packet into the SQ without notifying the hardware;
once the RX napi loop is done we will hit the doorbell once for all XDP TX
packets from the previous loop. This should drastically improve
XDP TX performance.
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and request the user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from the XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we will update
each RQ's xdp program on the fly, and for synchronization with the current
data path RX activity of that RQ, we temporarily disable that RQ and
ensure the RX path is not running, then quickly update and re-enable that
RQ. For that we do the following (sketched in code below the list):
- rq.state = disabled
- napi_synchronize
- xchg(rq->xdp_prg)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
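Sketched in C, with approximate helper and flag names (not the exact mlx5e code):

clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = disabled */
napi_synchronize(&rq->channel->napi);		/* wait out in-flight RX */

old_prog = xchg(&rq->xdp_prog, new_prog);	/* swap the program */
if (old_prog)
	bpf_prog_put(old_prog);

set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = enabled */
napi_schedule(&rq->channel->napi);		/* in case we missed an IRQ */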
Packet rate performance testing was done with pktgen 64B packets on the
TX side, and TC drop action on the RX side compared to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
RX Cores Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.3Mpps 5.3Mpps 16.5Mpps
2 10.2Mpps 10.2Mpps 31.3Mpps
4 20.5Mpps 19.9Mpps 36.3Mpps*
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that the receive side can handle more.