David S. Miller [Fri, 23 Sep 2016 12:38:59 +0000 (08:38 -0400)]
Merge branch 'dsa-port-fast-ageing'
Vivien Didelot says:
====================
net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding to Disabled, Blocking or Listening.
This makes the drivers more complex and hides this generic switch logic.
This patchset introduces a new optional port_fast_age operation to
dsa_switch_ops, to move this logic to the DSA layer and keep drivers
simple. b53 and mv88e6xxx are updated accordingly.
====================
Vivien Didelot [Thu, 22 Sep 2016 20:49:22 +0000 (16:49 -0400)]
net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding to Disabled, Blocking or Listening.
This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
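As a rough sketch of how a driver wires this up (callback shape per the description above; the driver-side helper and private struct are made up for illustration):

struct foo_priv;				/* made-up driver private data */
void foo_flush_port_fdb(struct foo_priv *priv, int port);	/* hypothetical */

/* Driver side: flush the MAC addresses learned on @port. */
static void foo_port_fast_age(struct dsa_switch *ds, int port)
{
	struct foo_priv *priv = ds->priv;

	foo_flush_port_fdb(priv, port);
}

static struct dsa_switch_ops foo_switch_ops = {
	/* ... */
	.port_fast_age	= foo_port_fast_age,
};

/* DSA layer side (conceptual, accessor name approximate): fast-age when
 * the port leaves the Learning/Forwarding states.
 */
if (ds->ops->port_fast_age)
	ds->ops->port_fast_age(ds, port);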
The current driver programs a static MSS value into the hardware register
used by the TSO offload engine to segment the TCP payload, regardless of
the MSS value provided by the network stack.
This patch fixes this by programming the hardware registers with the
stack-provided MSS value.
Since the hardware has the limitation of having only 4 MSS registers,
this patch uses a reference count of the MSS values being used.
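A minimal sketch of the kind of reference counting described above (structure and helper names are hypothetical, not taken from the driver):

#define NUM_MSS_REGS	4

struct mss_slot {
	u32 mss;		/* MSS value programmed in this HW register */
	u32 refcnt;		/* number of users of this value */
};

/* Return the slot index programmed with @mss, reusing or allocating one. */
static int select_mss_slot(struct mss_slot *slots, u32 mss)
{
	int i, free = -1;

	for (i = 0; i < NUM_MSS_REGS; i++) {
		if (slots[i].refcnt && slots[i].mss == mss) {
			slots[i].refcnt++;	/* reuse a matching slot */
			return i;
		}
		if (!slots[i].refcnt && free < 0)
			free = i;		/* remember a free slot */
	}

	if (free < 0)
		return -EBUSY;			/* all 4 HW registers busy */

	slots[free].mss = mss;
	slots[free].refcnt = 1;
	/* the real driver would program the HW MSS register here */
	return free;
}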
David Howells [Fri, 23 Sep 2016 11:39:23 +0000 (12:39 +0100)]
rxrpc: Don't send an ACK at the end of service call response transmission
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the sent data. At that point, the call is complete.
David Howells [Fri, 23 Sep 2016 12:17:33 +0000 (13:17 +0100)]
rxrpc: Preset timestamp on Tx sk_buffs
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Colin Ian King [Thu, 22 Sep 2016 17:48:58 +0000 (18:48 +0100)]
cxgb4: fix signed wrap around when decrementing index idx
Change predecrement compare to post decrement compare to avoid an
unsigned integer wrap-around comparison when decrementing idx in
the while loop.
For example, when idx is zero, the current situation will
predecrement idx in the while loop, wrapping idx to the maximum
signed integer and cause out of bounds reads on rxq_info->msix_tbl[idx].
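An illustration of the pattern described above (not the actual cxgb4 code), assuming an unsigned index:

unsigned int idx = 0;

/* Before: the pre-decrement runs even when idx == 0, wrapping idx to a
 * huge value, so msix_tbl[idx] is read far out of bounds.
 */
while (--idx)
	use(rxq_info->msix_tbl[idx]);

/* After: the comparison is done against the old value, so the loop body
 * is never entered once idx has reached zero.
 */
while (idx-- > 0)
	use(rxq_info->msix_tbl[idx]);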
This series further enhances the SRIOV TC offloads of mlx5 to handle
the TC vlan push and pop actions. This serves a common use-case in
virtualization systems where the virtual switch adds (pushes) vlan tags
to packets sent from VMs and removes (pops) vlan tags before the packet
is received by the VM. We use the new E-Switch switchdev mode and the
TC vlan action to achieve that also in SW defined SRIOV environments by
offloading TC rules that contain this action along with forwarding
(TC mirred/redirect action) the packet.
In the first patch we add some helpers to access the TC vlan action info
by offloading drivers. The next five patches don't add any new functionality;
they do some refactoring and cleanups in the current code to be used next.
The seventh patch deals with supporting vlans by the mlx5 e-switch in switchdev
mode. The eighth patch does the vlan action offload from TC and the last patch
adds matching for vlans as typically required by TC flows that involve vlan
pop action.
The series was applied on top of commit 524605e "cxgb4: Convert to use simple_open()"
====================
Or Gerlitz [Thu, 22 Sep 2016 17:01:47 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g. for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, whereas another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host carry a
vlan tag. The VF rep driver removes such a vlan tag before calling into
the host networking stack.
Or Gerlitz [Thu, 22 Sep 2016 17:01:43 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Set vport representor fields explicitly on registration
The structure we use for the eswitch vport representor (mlx5_eswitch_rep)
has some fields which are set from upper layers in the driver when they
register the rep. Use explicit setting on registration time for them and
avoid global memcpy. This patch doesn't add new functionality.
Or Gerlitz [Thu, 22 Sep 2016 17:01:42 +0000 (20:01 +0300)]
net/mlx5: E-Switch, Set the vport when registering the uplink rep
Set the vport value in the PF entry to be that of the uplink so
we can use it blindly over the tc / eswitch offload code without
translating it each time we deal with the uplink representor.
Eric Dumazet [Thu, 22 Sep 2016 15:58:55 +0000 (08:58 -0700)]
net_sched: sch_fq: account for schedule/timers drifts
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.
We take into account the difference between the time that was programmed
when the last packet was sent and the current time (a drift of tens of
usecs is often observed).
Add an EWMA of the unthrottle latency to help diagnostics.
This latency is the difference between the current time and the oldest
packet in the delayed RB-tree. This accounts for the high resolution timer
latency, but can be different under stress, as fq_check_throttled() can
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.
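A sketch of the integer EWMA described above, with a weight of 1/8 as is common in the kernel (field names may differ slightly from sch_fq):

u64 sample = now - time_next_delayed_flow;	/* oldest delayed packet */

/* new_avg = 7/8 * old_avg + 1/8 * sample, using only shifts */
unthrottle_latency_ns -= unthrottle_latency_ns >> 3;
unthrottle_latency_ns += sample >> 3;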
sfc: check async completer is !NULL before calling
Add a NULL check before calling asynchronous MCDI completion functions
during device removal.
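The guard amounts to something like the following (argument and field names are assumptions for illustration, not copied from the sfc code):

/* Only invoke the asynchronous completer if one was registered. */
if (async->complete)
	async->complete(efx, async->cookie, rc, outbuf, outlen);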
Fixes: 7014d7f6 ("sfc: allow asynchronous MCDI without completion function")
Signed-off-by: Bert Kenward <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 23 Sep 2016 10:55:04 +0000 (06:55 -0400)]
Merge branch 'sctp-fix-gap-ack-blocks'
Marcelo Ricardo Leitner says:
====================
sctp: fix the handling of SACK Gap Ack blocks
sctp_acked() is using 32-bit arithmetic on 16-bit vars, via the TSN_lte()
macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled,
and thus to verify the Gap Ack blocks we can just use plain
less-than / greater-or-equal checks.
Also, rename the gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
sctp: improve how SSN, TSN and ASCONF serial are compared
Make it similar to time_before() macros:
- easier to understand
- make use of typecheck() to avoid working on unexpected variable types
(made the issue on previous patch visible)
- for the _[lg]te versions, slightly faster, as the compiler used to generate
a sequence of cmp/je/cmp/js instructions and now it's sub/test/jle
(for _lte); a sketch of the new macros is shown after the example below.
Before, for sctp_outq_sack:
if (primary->cacc.changeover_active) {
1f01: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx)
1f08: 74 6e je 1f78 <sctp_outq_sack+0xe8>
u8 clear_cycling = 0;
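Roughly, the new typecheck()-based comparison macros look like this (see include/net/sctp/sctp.h for the exact definitions):

/* Modeled after time_before(): reject anything that isn't a __u32 at
 * compile time, then compare using serial number arithmetic.
 */
#define TSN_lt(a, b)			\
	(typecheck(__u32, a) &&		\
	 typecheck(__u32, b) &&		\
	 ((__s32)((a) - (b)) < 0))

#define TSN_lte(a, b)			\
	(typecheck(__u32, a) &&		\
	 typecheck(__u32, b) &&		\
	 ((__s32)((a) - (b)) <= 0))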
sctp_acked() is using 32-bit arithmetic on 16-bit vars, via the TSN_lte()
macros, which is weird and confusing.
Once the offset to ctsn is calculated, all wrapping is already handled,
and thus to verify the Gap Ack blocks we can just use plain
less-than / greater-or-equal checks.
Also, rename the gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.
Even so, I don't think this discrepancy resulted in any practical bug.
This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.
netfilter: nf_tables: check tprot_set first when we use xt.thoff
pkt->xt.thoff is not always set properly, but we use it without any check.
For the payload expr, it will cause wrong results. For nftrace, we may notify
the wrong network or transport header to user space; furthermore, if the
following nft rule is input, a warning message will be printed out:
# nft add rule arp filter output meta nftrace set 1
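The fix boils down to a guard of this shape (close to, but not necessarily identical to, the nft_payload change):

case NFT_PAYLOAD_TRANSPORT_HEADER:
	if (!pkt->tprot_set)
		goto err;		/* thoff was never set, don't use it */
	offset = pkt->xt.thoff;
	break;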
netfilter: nf_tables: improve nft payload fast eval
There's an off-by-one issue in nft_payload_fast_eval: skb_tail_pointer
and ptr + priv->len both point to the last valid address plus 1. So if
they are equal, we can still fetch valid data, and it's unnecessary to
fall back to nft_payload_eval.
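In other words, the boundary check only has to reject the strictly-greater case; a sketch (following the names in the text, not a verbatim copy of nft_payload_fast_eval):

ptr += priv->offset;
/* ptr + priv->len == skb_tail_pointer(skb) still covers only valid data */
if (unlikely(ptr + priv->len > skb_tail_pointer(skb)))
	return false;	/* fall back to nft_payload_eval() */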
netfilter: nf_queue: improve queue range support for bridge family
After commit ac2863445686 ("netfilter: bridge: add nf_afinfo to enable
queuing to userspace"), we can queue packets to the user space in bridge
family. But when the user specifies a queue range, packets will only be
delivered to the first queue num, because nfqueue_hash only supports the
ipv4 and ipv6 families. Now add support for the bridge family too.
netfilter: nft_queue: add _SREG_QNUM attr to select the queue number
Currently, the user can specify the queue numbers via the _QUEUE_NUM and
_QUEUE_TOTAL attributes; this is enough in most situations.
But actually, it is not very flexible, for example:
tcp dport 80 mapped to queue0
tcp dport 81 mapped to queue1
tcp dport 82 mapped to queue2
In order to do this, we must add 3 nft rules, and more
mappings mean more rules ...
So take one register to select the queue number; then we can add one
simple rule to map the queues, maybe like this:
queue num tcp dport map { 80:0, 81:1, 82:2 ... }
Florian Westphal also proposed wider usage scenarios:
queue num jhash ip saddr . ip daddr mod ...
queue num meta cpu ...
queue num meta mark ...
The last point is how to load the queue number from the source register.
We could use *(u16 *)&regs->data[reg] to load the queue number, just as
the nat expr does to load its l4 port.
But we will cooperate with the hash expr, meta cpu, meta mark expr and so
on. They all store their result as a u32, so casting it to a u16 pointer
and dereferencing it will generate a wrong result on big endian systems.
So just keep it simple: we treat the queue number as a u32, although a
u16 would already be enough.
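A sketch of the resulting load (register and field names approximate):

/* Read the full 32-bit register; works regardless of endianness ... */
u32 queue = regs->data[priv->sreg_qnum];

/* ... unlike casting to a narrower pointer, which breaks on big endian:
 * u16 queue = *(u16 *)&regs->data[priv->sreg_qnum];
 */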
netfilter: nf_tables: validate maximum value of u32 netlink attributes
Fetch the value of a u32 netlink attribute and validate it. This
validation is usually required when a u32 netlink attribute is being
stored in a field whose size is smaller.
This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Laura Garcia Liebana <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Emil Tantilov [Fri, 9 Sep 2016 19:59:10 +0000 (12:59 -0700)]
ixgbe: reset before SRIOV init to avoid mailbox issues
Enabling SRIOV while the ixgbevf driver is loaded will result in all
mailbox requests from ixgbevf_open() being rejected by ixgbe because
adapter->clear_to_send is set to false on reset.
Call ixgbe_sriov_reinit() before pci_enable_sriov() to make sure that
mailbox requests are handled from the time ixgbevf is loaded.
Alexander Duyck [Thu, 8 Sep 2016 03:28:24 +0000 (20:28 -0700)]
ixgbe: Support 4 queue RSS on VFs with 1 or 2 queue RSS on PF
Instead of limiting the VFs if we don't use 4 queues for RSS in the PF we
can instead just limit the RSS queues used to a power of 2. By doing this
we can support use cases where VFs are using more queues than the PF is
currently using and can support RSS if so desired.
The only limitation on this is that we cannot support 3 queues of RSS in
the PF or VF. In either of these cases we should fall back to 2 queues in
order to be able to use the power of 2 masking provided by the psrtype
register.
Alexander Duyck [Thu, 8 Sep 2016 03:28:17 +0000 (20:28 -0700)]
ixgbe: Limit reporting of redirection table if SR-IOV is enabled
The hardware redirection table can support more queues than the PF
currently has when SR-IOV is enabled. In order to account for this, use the
RSS mask to trim off the bits that are not used.
Alexander Duyck [Thu, 8 Sep 2016 03:28:11 +0000 (20:28 -0700)]
ixgbe: Allow setting multiple queues when SR-IOV is enabled
The maximum queue count reported was 1; however, support for multiple queues
with SR-IOV was added some time ago, so we should report support for it to
the user so that they can select multiple queues if they so desire.
Lihong Yang [Wed, 7 Sep 2016 01:05:05 +0000 (18:05 -0700)]
i40evf: remove unnecessary error checking against i40e_shutdown_adminq
The i40e_shutdown_adminq function never returns failure. There is no need to
check the non-0 return value. Clean up the unnecessary error checking and
warning against it.
Alexander Duyck [Wed, 7 Sep 2016 01:05:04 +0000 (18:05 -0700)]
i40e: Limit TX descriptor count in cases where frag size is greater than 16K
The i40e driver was incorrectly assuming that we would always be pulling
no more than 1 descriptor from each fragment. It is in fact possible for
us to end up with the case where 2 descriptors worth of data may be pulled
when a frame is larger than one of the pieces generated when aligning the
payload to either 4K or pieces smaller than 16K.
To adjust for this we just need to make certain to test all the way to the
end of the fragments as it is possible for us to span 2 descriptors in the
block before us so we need to guarantee that even the last 6 descriptors
have enough data to fill a full frame.
i40evf: remove unnecessary error checking against i40evf_up_complete
Function i40evf_up_complete() always returns success. Changed this to a
void type and removed the code that checks the return status and prints
an error message.
Currently disabling the link state from PF via
ip link set enp5s0f0 vf 0 state disable
doesn't disable the CARRIER on the VF.
This patch updates the carrier and starts/stops the tx queues based on the
link state notification from PF.
PF: enp5s0f0, VF: enp5s2
#modprobe i40e
#echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
#ip link set enp5s2 up
#ip -d link show enp5s2
175: enp5s2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether ea:4d:60:bc:6f:85 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64
#ip link set enp5s0f0 vf 0 state disable
#ip -d link show enp5s0f0
171: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 72 numrxqueues 72 portid 6805ca2e7268
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state disable, trust off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off
#ip -d link show enp5s2
175: enp5s2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether ea:4d:60:bc:6f:85 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 16 numrxqueues 16
Colin Ian King [Sun, 28 Aug 2016 17:41:01 +0000 (18:41 +0100)]
i40e: avoid potential null pointer dereference when assigning len
There is a sanity check for desc being null in the first line of
function i40evf_debug_aq. However, before that, aq_desc is cast from
desc, and aq_desc is being dereferenced on the assignment of len, so
this is a potential null pointer dereference. Fix this by moving
the initialization of len to the code block where len is being used,
at which point we know it is safe to dereference aq_desc.
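A simplified sketch of the reordering (abbreviated; the datalen field follows the i40e admin queue descriptor layout):

struct i40e_aq_desc *aq_desc = (struct i40e_aq_desc *)desc;
u16 len;

if (!desc)
	return;

/* only computed here, where desc is known to be non-NULL */
len = le16_to_cpu(aq_desc->datalen);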
Carolyn Wyborny [Wed, 24 Aug 2016 18:33:51 +0000 (11:33 -0700)]
i40e: Fix for extra byte swap in tunnel setup
This patch fixes an issue where we were byte swapping the port
parameter, then byte swapping it again in function execution.
Obviously, that's unnecessary, so take it out of the function calls.
Without this patch, the udp based tunnel configuration would
not be correct.
Carolyn Wyborny [Wed, 24 Aug 2016 18:33:50 +0000 (11:33 -0700)]
i40e: Fix to check for NULL
This patch fixes an issue in the virt channel code, where a return
from i40e_find_vsi_from_id was not checked for NULL when applicable.
Without this patch, there is a risk for panic and static analysis
tools complain. This patch fixes the problem by adding the check
and adding an additional input check for similar reasons.
Alan Brady [Wed, 24 Aug 2016 18:33:47 +0000 (11:33 -0700)]
i40e: fix "dump port" command when NPAR enabled
When using the debugfs to issue the "dump port" command
with NPAR enabled, the firmware reports back with invalid argument.
The issue occurs because the pf->mac_seid was used to perform the query.
This is fine when NPAR is disabled because the switch ID == pf->mac_seid;
however, this is not the case when NPAR is enabled. This fix instead
goes through the VSI to determine the correct ID to use in either case.
Alan Brady [Wed, 24 Aug 2016 18:33:46 +0000 (11:33 -0700)]
i40e: fix setting user defined RSS hash key
Previously, when using ethtool to change the RSS hash key, ethtool would
report back saying the old key was still being used and no error was
reported. It was unclear whether it was being reported incorrectly or
being set incorrectly. Debugging revealed 'i40e_set_rxfh()' returned
zero immediately instead of setting the key because a user defined
indirection table is not supplied when changing the hash key.
This fix instead changes it such that if an indirection table is not
supplied, then a default one is created and the hash key is now
correctly set.
Bluetooth: Fix NULL pointer dereference in mgmt context
Adds missing callback assignment to cmd_complete in the pending management
command context. The dump path involves a security procedure performed on
legacy (pre-SSP) devices with service security requirements set to HIGH
(16-digit PIN). It fails when a shorter PIN is delivered by the user.
netfilter: nft_numgen: add number generation offset
Add support for an offset value for the incremental counter and random
modes. With this option the sysadmin is able to start the counter at a
certain value and then apply the generated number.
Example:
meta mark set numgen inc mod 2 offset 100
This will generate marks with the series 100, 101, 100, 101, ...
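A sketch of how the offset is applied by the incremental generator (close to, but not necessarily identical to, the nft_numgen code):

struct nft_ng_inc {
	atomic_t	counter;
	u32		modulus;
	u32		offset;
	/* ... */
};

static u32 nft_ng_inc_gen(struct nft_ng_inc *priv)
{
	u32 nval, oval;

	do {
		oval = atomic_read(&priv->counter);
		nval = (oval + 1 < priv->modulus) ? oval + 1 : 0;
	} while (atomic_cmpxchg(&priv->counter, oval, nval) != oval);

	return nval + priv->offset;	/* offset shifts the whole series */
}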
Sean Wang [Thu, 22 Sep 2016 08:36:15 +0000 (16:36 +0800)]
net: ethernet: mediatek: remove superfluous local variable for phy address
Remove the unused variable for parsing the PHY address
and the related sanity-check logic, all of which is
already handled when of_mdiobus_register is called.
David S. Miller [Thu, 22 Sep 2016 12:21:28 +0000 (08:21 -0400)]
Merge branch 'mediatek-trgmii'
Sean Wang says:
====================
mediatek: add support for RGMII on GMAC0 through TRGMII hardware module
By default, GMAC0 is connected to the built-in switch called
MT7530 through the proprietary interface called Turbo RGMII
(TRGMII). TRGMII also works well for RGMII as used by generic external
PHYs, but this requires some slight changes to the TRGMII setup and
is not well supported by the current driver.
So this patchset
1) provides the slight setup changes needed for RGMII to work
through TRGMII
2) adds an additional phy-mode setting "trgmii" as
PHY_INTERFACE_MODE_TRGMII in the device tree to let GMAC0
distinguish which mode it runs in
3) dynamically changes the source clock, TX/RX delay and interface
mode on TRGMII to adapt to the various links
Changes since v1:
- fixed the comment style that was missing a space at
the beginning and end of comment lines
- add support for phy-mode "trgmii" as PHY_INTERFACE_MODE_TRGMII
into linux/phy.h
- enhance the Documentation about device tree binding for trgmii
which is applicable only for GMAC0 which uses fixed-link
====================
Sean Wang [Thu, 22 Sep 2016 02:33:55 +0000 (10:33 +0800)]
net: ethernet: mediatek: add support for GMAC0 connecting with external PHY through TRGMII
Dynamically change the source clock, TX/RX delay and interface mode
used by the TRGMII hardware module inside the PHY capability polling
routine, to adapt to the various RGMII speeds used by the external PHY
on GMAC0.
Sean Wang [Thu, 22 Sep 2016 02:33:54 +0000 (10:33 +0800)]
net: ethernet: mediatek: add extension of phy-mode for TRGMII
Add PHY-mode "trgmii" as an extension of the operation modes of the
PHY interface, i.e. PHY_INTERFACE_MODE_TRGMII, and add a trgmii variable
inside mtk_mac to indicate whether the MAC is connected to the internal
switch or to an external PHY, based on the configuration given by the
board, and then perform the corresponding setup of the TRGMII hardware
module.
David S. Miller [Thu, 22 Sep 2016 12:14:59 +0000 (08:14 -0400)]
Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Preparation for slow-start algorithm [ver #2]
Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:
(1) Stop storing the protocol header in the Tx socket buffers, but rather
generate it on the fly. This potentially saves a little space and
makes it easier to alter the header just before transmission (the
flags may get altered and the serial number has to be changed).
(2) Mask off the Tx buffer annotations and add a flag to record which ones
have already been resent.
(3) Track RTT on a per-peer basis for use in future changes. Tracepoints
are added to log this.
(4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
ACK from which RTT data can be calculated. The response also carries
other useful information.
(5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively
using sendmsg, this allows us, under some circumstances, to avoid
having to rely on the background work item to run to generate this
ACK.
This requires ktime_sub_ms() to be added.
(6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
ACKs from which RTT data can be calculated.
(7) Limit the use of pings and ACK requests for RTT determination.
Changes:
(V2) Don't use the C division operator for 64-bit division. One instance
should use do_div() and the other should be using nsecs_to_jiffies().
The last two patches got transposed, leading to an undefined symbol
in one of them.
Reported-by: kbuild test robot <[email protected]>
====================
David Howells [Wed, 21 Sep 2016 23:29:32 +0000 (00:29 +0100)]
rxrpc: Reduce the number of PING ACKs sent
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Reduce the number of ACK-Requests sent
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic. We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.
This could be made tunable in future.
Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
when the handshake completes. This patch fixes it.
This patch fixes the under-accounting of these SNMP rtx stats
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets
Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Thu, 22 Sep 2016 07:31:22 +0000 (03:31 -0400)]
Merge branch 'ftgmac100-ast2500-support'
Joel Stanley says:
====================
ftgmac100 support for ast2500
This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.
They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.
V2 reworks the two patches relating to PHYSTS_CHG into the one patch that
disables the interrupt instead of playing with interrupt sensitivity. I kept
patch 4 'net/faraday: Clear stale interrupts' which was first introduced to
clear the stale PHYSTS_CHG interrupt, as it helps keep us safe from unhygienic
(vendor) bootloaders.
====================
Joel Stanley [Wed, 21 Sep 2016 23:05:03 +0000 (08:35 +0930)]
net/faraday: Mask out PHYSTS_CHG interrupt
The PHYSTS_CHG interrupt (the ftgmac100's PHY IRQ) tells the system to go
look at the PHY registers for a link status change.
The interrupt was causing issues on Aspeed SoCs where some board designs
had an active-high configuration, some active-low, and in some cases the
line was repurposed for other functions. When misconfigured, Linux would
chew 100% of CPU cycles servicing interrupts:
While the interrupt in the ftgmac100 IP can be configured for high, low
and edge sensitivity, the current driver always polls the PHY, so we
chose to mask out the interrupt.
See https://patchwork.ozlabs.org/patch/672099/ for more discussion.
Joel Stanley [Wed, 21 Sep 2016 23:05:02 +0000 (08:35 +0930)]
net/faraday: Configure old MDIO interface on Aspeed SoCs
The Aspeed SoCs have a new MDIO interface as an option in the G4 and G5
SoCs. The old one is still available, so select it in order to remain
compatible with the ftgmac100 driver.
There is a stale interrupt (PHYSTS_CHG in the ISR, bit #6 in register 0x0)
left over from the bootloader (u-boot) when enabling the MAC. These stale
interrupts aren't generated by the kernel and should be cleared.
This clears the stale interrupts in the ISR (0x0) when enabling the MAC.
Joel Stanley [Wed, 21 Sep 2016 23:05:00 +0000 (08:35 +0930)]
net/faraday: Adapt for Aspeed SoCs
The RXDES and TXDES register bits in the ftgmac100 indicate EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.
It appears that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a workaround.
It was confirmed with engineers from Aspeed that using bit 30 is
correct for both the ast2400 and ast2500 SoCs.
Andrew Jeffery [Wed, 21 Sep 2016 23:04:59 +0000 (08:34 +0930)]
net/faraday: Make EDO{R,T}R bits configurable
These bits are #defined at a fixed location. In order to support future
hardware that has chosen to move these bits around, move the bits into a
member of struct ftgmac100.
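A sketch of what this looks like (member names and the probe-time selection are illustrative, not necessarily the final driver code):

struct ftgmac100 {
	/* ... */
	u32 rxdes0_edorr_mask;	/* End Of Rx Ring bit for this hardware */
	u32 txdes0_edotr_mask;	/* End Of Tx Ring bit for this hardware */
};

/* at probe time, following the bit positions discussed above */
if (is_aspeed_mac)			/* e.g. matched via OF compatible */
	priv->rxdes0_edorr_mask = BIT(30);
else
	priv->rxdes0_edorr_mask = BIT(15);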
Andrew Jeffery [Wed, 21 Sep 2016 23:04:58 +0000 (08:34 +0930)]
net/faraday: Separate rx page storage from rxdesc
The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Obtain RTT data by requesting ACKs on DATA packets
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.
This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.
This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Expedite ping response transmission
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending. We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).
If we don't expedite it, it's left to the background processing thread to
transmit.
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Send pings to get RTT data
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for. The PING RESPONSE ACK packet will tell us
the following about the peer:
(1) its receive window size
(2) its MTU sizes
(3) its support for jumbo DATA packets
(4) if it supports slow start (similar to RFC 5681)
(5) an estimate of the RTT
This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.
A pair of tracepoints are added so that RTT determination can be observed.
David S. Miller [Thu, 22 Sep 2016 07:13:39 +0000 (03:13 -0400)]
Merge branch 'sctp-align'
Marcelo Ricardo Leitner says:
====================
Rename WORD_TRUNC/ROUND macros and use them
This patchset aims to rename these macros to a non-confusing name, as
reported by David Laight and David Miller, and to update all remaining
places to make use of them; there was just one last remaining spot.
v3:
- Name it SCTP_PAD4 instead of SCTP_ALIGN4, as suggested by David Laight
v2:
- fixed 2nd patch summary
Details on the specific changelogs.
====================
Rename WORD_TRUNC/ROUND to something more meaningful these days,
especially because these macros work on packet headers or lengths, which
are not tied to any CPU arch but to the protocol itself.
So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.
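For reference, the renamed helpers are simple 4-byte alignment macros, roughly:

/* truncate a length down to a 4-byte boundary */
#define SCTP_TRUNC4(len)	((len) & ~3)
/* pad a length up to a 4-byte boundary */
#define SCTP_PAD4(len)		(((len) + 3) & ~3)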
David S. Miller [Thu, 22 Sep 2016 06:51:48 +0000 (02:51 -0400)]
Merge branch 'mlx5e-xdp'
Tariq Toukan says:
====================
mlx5e XDP support
This series adds XDP support in mlx5e driver.
This includes the use cases: XDP_DROP, XDP_PASS, and XDP_TX.
Single stream performance tests show 16.5 Mpps for XDP_DROP,
and 12.4 Mpps for XDP_TX, with nice scalability for multiple streams/rings.
This rate of XDP_DROP is lower than the 32 Mpps we got in previous
implementation, when Striding RQ was used.
We moved to non-Striding RQ, as some XDP_TX requirements (like headroom,
packet-per-page) cannot be satisfied with the current Striding RQ HW,
and we decided to fully support both DROP/TX.
A few directions are being considered in order to enable the faster rate for
XDP_DROP, e.g. a possibility for users to enable Striding RQ so they choose
optimized XDP_DROP at the price of partial XDP_TX functionality, or some HW
changes.
Series generated against net-next commit: cf714ac147e0 'ipvlan: Fix dependency issue'
Thanks,
Tariq
V2:
* patch 8:
- when XDP_TX fails, call mlx5e_page_release and drop the packet.
- update xdp_tx counter within mlx5e_xmit_xdp_frame.
(mlx5e_xmit_xdp_frame return value becomes obsolete, change it to void)
- drop the packet for unknown XDP return code.
* patch 9:
- use a boolean for xdp_doorbell in SQ struct, instead of dragging it
throughout the functions calls.
- handle doorbell and counters within mlx5e_xmit_xdp_frame.
====================
Add support for XDP_TX forwarding from the xdp program.
Using XDP, the user can now loop packets out of the same port.
We create a dedicated TX SQ for each channel that will serve
XDP programs that return XDP_TX action to loop packets back to
the wire directly from the channel RQ RX path.
For that, RX pages will now need to be mapped bi-directionally,
and on the XDP_TX action we will sync the page back to the device and then
queue it into the SQ for transmission. The XDP xmit frame function will
report back to the RX path whether the page was consumed (transmitted); if
so, the RX path will forget about that page as if it were released to the
stack. Later on, on XDP TX completion, the page will be released back to
the page cache.
For simplicity this patch will hit a doorbell on every XDP TX packet.
The next patch will introduce an xmit-more-like mechanism that will
queue up more than one packet into the SQ without notifying the hardware;
once the RX napi loop is done we will hit the doorbell once for all XDP TX
packets from the previous loop. This should drastically improve
XDP TX performance.
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and request the user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from the XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we will update
each RQ's xdp program on the fly, and for synchronization with the current
data path RX activity of that RQ, we temporarily disable that RQ and
ensure the RX path is not running, then quickly update and re-enable that
RQ. For that we do the following (sketched in code below the list):
- rq.state = disabled
- napi_synchronize
- xchg(rq->xdp_prg)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
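Sketched in C, with approximate helper and flag names (not the exact mlx5e code):

clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = disabled */
napi_synchronize(&rq->channel->napi);		/* wait out in-flight RX */

old_prog = xchg(&rq->xdp_prog, new_prog);	/* swap the program */
if (old_prog)
	bpf_prog_put(old_prog);

set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = enabled */
napi_schedule(&rq->channel->napi);		/* in case we missed an IRQ */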
Packet rate performance testing was done with pktgen 64B packets on the
TX side, and TC drop action on the RX side compared to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
RX Cores Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.3Mpps 5.3Mpps 16.5Mpps
2 10.2Mpps 10.2Mpps 31.3Mpps
4 20.5Mpps 19.9Mpps 36.3Mpps*
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that the receive side can handle more.