Git Repo - linux.git/log

net: aquantia: Introduce rx refill threshold value

Before that, we've refilled ring even on single descriptor move.
Under high packet load that caused page allocation logic to be triggered
too often. That made overall ring processing slower.

Moreover, with page buffer reuse implemented, we should give a chance
higher networking levels to process received packets faster, release
the pages they consumed and therefore give a higher chance for these
pages to be reused.

RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
descriptors were processed (32 by default). Under regular traffic this
gives quite enough time for packet to be consumed and page to be reused.

Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: optimize rx performance by page reuse strategy

We introduce internal aq_rxpage wrapper over regular page
where extra field is tracked: rxpage offset inside of allocated page.

This offset allows to reuse one page for multiple packets.
When needed (for example with large frames processing), allocated
pageorder could be customized. This gives even larger page reuse
efficiency.

page_ref_count is used to track page users. If during rx refill
underlying page has users, we increase pg_off by rx frame size
thus the top half of the page is reused.

Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: optimize rx path using larger preallocated skb len

Atlantic driver used 14 bytes preallocated skb size. That made L3 protocol
processing inefficient because pskb_pull had to fetch all the L3/L4 headers
from extra fragments.

Specially on UDP flows that caused extra packet drops because CPU was
overloaded with pskb_pull.

This patch uses eth_get_headlen for skb preallocation.

Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-03-20

This series includes updates to mlx5 driver,

1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
Replaces TC VLAN pop and push actions with VLAN modify

Note: This series includes two simple non-mlx5 patches,

1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2019-03-22

This series contains updates to ice driver only.

Akeem enables MAC anti-spoofing by default when a new VSI is being
created.  Fixes an issue when reclaiming VF resources back to the pool
after reset, by freeing VF resources separately using the first VF
vector index to traverse the list, instead of starting at the last
assigned vectors list.  Added support for VF & PF promiscuous mode in
the ice driver.  Fixed the PF driver from letting the VF know it is "not
trusted" when it attempts to add more than its permitted additional MAC
addresses.  Altered how the driver gets the VF VSIs instances, instead
of using the mailbox messages to retrieve VSIs, get it directly via the
VF object in the PF data structure.

Bruce fixes return values to resolve static analysis warnings.  Made
whitespace changes to increase readability and reduce code wrapping.

Anirudh cleans up code by removing a function prototype that was never
implemented and removed an unused field in the ice_sched_vsi_info
structure.

Kiran fixes a potential divide by zero issue by adding a check.

Victor cleans up the transmit scheduler by adjusting the stack variable
usage and added/modified debug prints to make them more useful.

Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
is set if all the right conditions are met.

Christopher ensures the loopback enable bit is not set for prune switch
rules, since all transmit traffic would be looped back to the internal
switch and dropped.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge branch 'tcp-rx-tx-cache'

Eric Dumazet says:

====================
tcp: add rx/tx cache to reduce lock contention

On hosts with many cpus we can observe a very serious contention
on spinlocks used in mm slab layer.

The following can happen quite often :

1) TX path
  sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
  ACK is received on CPU B, and consumes the skb that was in the retransmit
  queue.

2) RX path
  network driver allocates skb on CPU C
  recvmsg() happens on CPU D, freeing the skb after it has been delivered
  to user space.

In both cases, we are hitting the asymetric alloc/free pattern
for which slab has to drain alien caches. At 8 Mpps per second,
this represents 16 Mpps alloc/free per second and has a huge penalty.

In an interesting experiment, I tried to use a single kmem_cache for all the skbs
(in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
                  kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
qnd most of the contention disappeared, since cpus could better use
their local slab per-cpu cache.

But we can do actually better, in the following patches.

TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
     so that next sendmsg() can reuse it immediately.

RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
   so that it can be freed by the cpu feeding the incoming packets in BH.

This increased the performance of small RPC benchmark by about 10 % on a host
with 112 hyperthreads.

v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
       clone has been freed.
     - Really test rps_needed in sk_eat_skb() as claimed.
     - Fixed rps_needed use in drivers/net/tun.c

v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: add one skb cache for rx

Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.

- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: add one skb cache for tx

On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.

    20.69%  [kernel]       [k] queued_spin_lock_slowpath
     5.64%  [kernel]       [k] _raw_spin_lock
     3.83%  [kernel]       [k] syscall_return_via_sysret
     3.48%  [kernel]       [k] __entry_text_start
     1.76%  [kernel]       [k] __netif_receive_skb_core
     1.64%  [kernel]       [k] __fget

For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.

In many cases, ACK packets are handled by another cpus, and this unfortunately
incurs heavy costs for slab layer.

This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.

We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: convert rps_needed and rfs_needed to new static branch api

We prefer static_branch_unlikely() over static_key_false() these days.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-dev-BYPASS-for-lockless-qdisc'

Paolo Abeni says:

====================
net: dev: BYPASS for lockless qdisc

This patch series is aimed at improving xmit performances of lockless qdisc
in the uncontended scenario.

After the lockless refactor pfifo_fast can't leverage the BYPASS optimization.
Due to retpolines the overhead for the avoidables enqueue and dequeue operations
has increased and we see measurable regressions.

The first patch introduces the BYPASS code path for lockless qdisc, and the
second one optimizes such path further. Overall this avoids up to 3 indirect
calls per xmit packet. Detailed performance figures are reported in the 2nd
patch.

v2 -> v3:
  - qdisc_is_empty() has a const argument (Eric)

v1 -> v2:
  - use really an 'empty' flag instead of 'not_empty', as
    suggested by Eric
====================

Signed-off-by: David S. Miller <[email protected]>

net: dev: introduce support for sch BYPASS for lockless qdisc

With commit c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
pfifo_fast no longer benefit from the TCQ_F_CAN_BYPASS optimization.
Due to retpolines the cost of the enqueue()/dequeue() pair has become
relevant and we observe measurable regression for the uncontended
scenario when the packet-rate is below line rate.

After commit 46b1c18f9deb ("net: sched: put back q.qlen into a
single location") we can check for empty qdisc with a reasonably
fast operation even for nolock qdiscs.

This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
The new chunk of code mirrors closely the existing one for traditional
qdisc, leveraging a newly introduced helper to read atomically the
qdisc length.

Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
device, and MQ root qdisc:

threads         vanilla         patched
                kpps            kpps
1               2465            2889
2               4304            5188
4               7898            9589

Same as above, but with a single queue device:

threads         vanilla         patched
                kpps            kpps
1               2556            2827
2               2900            2900
4               5000            5000
8               4700            4700

No mesaurable changes in the contended scenarios, and more 10%
improvement in the uncontended ones.

v1 -> v2:
  - rebased after flag name change

Signed-off-by: Paolo Abeni <[email protected]>
Tested-by: Ivan Vecera <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: add empty status flag for NOLOCK qdisc

The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc clearing such flag on dequeue.
Since the empty status lays on the same cache-line of the
seqlock, it's always hot on cache during the updates.

This makes the empty flag update a little bit loosy. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.

v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)

v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric

Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: add documentation for tcp_ca_state

Add documentation to the tcp_ca_state enum, since this enum is
exposed in uapi.

Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Soheil Hassas Yeganeh <[email protected]>
Cc: Sowmini Varadhan <[email protected]>
Acked-by: Sowmini Varadhan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: remove conditional branches from tcp_mstamp_refresh()

tcp_clock_ns() (aka ktime_get_ns()) is using monotonic clock,
so the checks we had in tcp_mstamp_refresh() are no longer
relevant.

This patch removes cpu stall (when the cache line is not hot)

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Correct Cygnus/Omega PHY driver prompt

The tristate prompt should have been replaced rather than defined a few
lines below, rebase mistake.

Fixes: 17cc9821766c ("net: phy: Move Omega PHY entry to Cygnus PHY driver")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests: bpf: tc-bpf flow shaping with EDT

Add a small test that shows how to shape a TCP flow in tc-bpf
with EDT and ECN.

Signed-off-by: Peter Oskolkov <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: make bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

This helper is useful if a bpf tc filter sets skb->tstamp.

Signed-off-by: Peter Oskolkov <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'bpf-tc-tunneling'

Willem de Bruijn says:

====================
BPF allows for dynamic tunneling, choosing the tunnel destination and
features on-demand. Extend bpf_skb_adjust_room to allow for efficient
tunneling at the TC hooks.

Most features are required for large packets with GSO, as these will
be modified after this patch.

Patch 1
  is a performance optimization, avoiding an unnecessary unclone
  for the TCP hot path.

Patches 2..6
  introduce a regression test. These can be squashed, but the code is
  arguably more readable when gradually expanding the feature set.

Patch 7
  is a performance optimization, avoid copying network headers
  that are going to be overwritten. This also simplifies the bpf
  program.

Patch 8
  reenables bpf_skb_adjust_room for UDP packets.

Patch 9
  configures skb tunneling metadata analogous to tunnel devices.

Patches 10..13
  expand the regression test to make use of the new features and
  enable the GSO testcases.

Changes
  v1->v2
  - move BPF_F_ADJ_ROOM_MASK out of uapi as it can be expanded
  - document new flags
  - in tests replace netcat -q flag with coreutils timeout:
      the -q flag is not supported in all netcat versions
  v2->v3
  - move BPF_F_ADJ_ROOM_ENCAP_L3_MASK out of uapi as it has no
    use in userspace
====================

Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: convert bpf tunnel test to encap modes

Make the tests correctly annotate skbs with tunnel metadata.

This makes the gso tests succeed. Enable them.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: convert bpf tunnel test to BPF_F_ADJ_ROOM_FIXED_GSO

Lower route MTU to ensure packets fit in device MTU after encap, then
skip the gso_size changes.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC

Avoid moving the network layer header when prefixing tunnel headers.

This avoids an explicit call to bpf_skb_store_bytes and an implicit
move of the network header bytes in bpf_skb_adjust_room.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Sync bpf.h to tools

Sync include/uapi/linux/bpf.h with tools/

Changes
  v1->v2:
  - BPF_F_ADJ_ROOM_MASK moved, no longer in this commit
  v2->v3:
  - BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved, no longer in this commit

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: add bpf_skb_adjust_room encap flags

When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.

For GSO packets, the network stack requires certain fields set to
segment packets with tunnel headers. gro_gse_segment depends on
transport and inner mac header, for instance.

Add an option to pass this information.

Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.

Changes
  v1->v2:
  - document new flags
  - BPF_F_ADJ_ROOM_MASK moved
  v2->v3:
  - BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_FIXED_GSO

bpf_skb_adjust_room adjusts gso_size of gso packets to account for the
pushed or popped header room.

This is not allowed with UDP, where gso_size delineates datagrams. Add
an option to avoid these updates and allow this call for datagrams.

It can also be used with TCP, when MSS is known to allow headroom,
e.g., through MSS clamping or route MTU.

Changes v1->v2:
- document flag BPF_F_ADJ_ROOM_FIXED_GSO
- do not expose BPF_F_ADJ_ROOM_MASK through uapi, as it may change.

Link: https://patchwork.ozlabs.org/patch/1052497/
Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: add bpf_skb_adjust_room mode BPF_ADJ_ROOM_MAC

bpf_skb_adjust_room net allows inserting room in an skb.

Existing mode BPF_ADJ_ROOM_NET inserts room after the network header
by pulling the skb, moving the network header forward and zeroing the
new space.

Add new mode BPF_ADJUST_ROOM_MAC that inserts room after the mac
header. This allows inserting tunnel headers in front of the network
header without having to recreate the network header in the original
space, avoiding two copies.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: extend bpf tunnel test with tso

Segmentation offload takes a longer path. Verify that the feature
works with large packets.

The test succeeds if not setting dodgy in bpf_skb_adjust_room, as veth
TSO is permissive.

If not setting SKB_GSO_DODGY, this enables tunneled TSO offload on
supporting NICs.

The feature sets SKB_GSO_DODGY because the caller is untrusted. As a
result the packets traverse through the gso stack at least up to TCP.
And fail the gso_type validation, such as the skb->encapsulation check
in gre_gso_segment and the gso_type checks introduced in commit
418e897e0716 ("gso: validate gso_type on ipip style tunnel").

This will be addressed in a follow-on feature patch. In the meantime,
disable the new gso tests.

Changes v1->v2:
- not all netcat versions support flag '-q', use timeout instead

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: extend bpf tunnel test with gre

GRE is a commonly used protocol. Add GRE cases for both IPv4 and IPv6.

It also inserts different sized headers, which can expose some
unexpected edge cases.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: expand bpf tunnel test to ipv6

The test only uses ipv4 so far, expand to ipv6.
This is mostly a boilerplate near copy of the ipv4 path.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: expand bpf tunnel test with decap

The bpf tunnel test encapsulates using bpf, then decapsulates using
a standard tunnel device to verify correctness.

Once encap is verified, also test decap, by replacing the tunnel
device on decap with another bpf program.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: bpf tunnel encap test

Validate basic tunnel encapsulation using ipip.

Set up two namespaces connected by veth. Connect a client and server.
Do this with and without bpf encap.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: in bpf_skb_adjust_room avoid copy in tx fast path

bpf_skb_adjust_room calls skb_cow on grow.

This expensive operation can be avoided in the fast path when the only
other clone has released the header. This is the common case for TCP,
where one headerless clone is kept on the retransmit queue.

It is safe to do so even when touching the gso fields in skb_shinfo.
Regular tunnel encap with iptunnel_handle_offloads takes the same
optimization.

The tcp stack unclones in the unlikely case that it accesses these
fields through headerless clones packets on the retransmit queue (see
__tcp_retransmit_skb).

If any other clones are present, e.g., from packet sockets,
skb_cow_head returns the same value as skb_cow().

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

net/mlx5e: Replace TC VLAN pop and push actions with VLAN modify

Changing the VLAN header may be implemented by pop the existing header
and push a new one. Translate those operations as VLAN modify.
Applicable for use cases such as OVS where the controller translates a
vlan modify meta (OF) rule to DP pop+push actions rule.

Signed-off-by: Eli Britstein <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Support VLAN modify action

Support VLAN modify action by emulating a rewrite action for the VLAN
fields. Currently, the only supported field is the vid. The prio in the
action must be set to 0 to indicate no change.

Signed-off-by: Eli Britstein <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Add VLAN ID rewrite fields

Add VLAN ID rewrite fields as a pre-step to support this rewrite.

Signed-off-by: Eli Britstein <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net: Add IANA_VXLAN_UDP_PORT definition to vxlan header file

Added IANA_VXLAN_UDP_PORT (4789) definition to vxlan header file so it
can be used by drivers instead of local definition.
Updated drivers which locally defined it as 4789 to use it.

Signed-off-by: Moshe Shemesh <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Cc: John Hurley <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Yunsheng Lin <[email protected]>
Cc: Peng Li <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: TX, Add geneve tunnel stateless offload support

Currently support only default geneve udp port (6081).
For the tx side, the HW is assisted by SW parsing, which sets the
headers offset to offload tunneled LSO and csum. Note that for udp
tunnels, we don't use special rx offloads, as rss on the outer headers
is enough, we support checksum complete and GRO takes care of
aggregation.

Geneve TSO BW and CPU load results (tested using iperf single tcp
stream).
In this patch we add TSO support over Geneve, so the "before" result
doesn't actually get to using the TSO HW offload even when turned on.
Tested on ConnectX-5, Intel(R) Xeon(R) CPU E5-2660 v2 @2.20GHz.

__________________________________
| Before         | After           |
|________________|_________________|
| 12.6 Gbits/sec | 21.7 Gbits/sec  |
| 100% CPU load  | 61.5% CPU load  |
|________________|_________________|

Signed-off-by: Moshe Shemesh <[email protected]>
Acked-by: Or Gerlitz <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Take SW parser code to a separate function

Refactor mlx5e_ipsec_set_swp() code, split the part which sets the eseg
software parser (SWP) offsets and flags, so it can be used in a
downstream patch by other mlx5e functionality which needs to set eseg
SWP.
The new function mlx5e_set_eseg_swp() is useful for setting swp for both
outer and inner headers. It also handles the special ipsec case of xfrm
mode transfer.

Signed-off-by: Moshe Shemesh <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net: Move the definition of the default Geneve udp port to public header file

Move the definition of the default Geneve udp port from the geneve
source to the header file, so we can re-use it from drivers.
Modify existing drivers to use it.

Signed-off-by: Moshe Shemesh <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Cc: John Hurley <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Remove redundant assignment

Remove redundant assignment to tun_entropy->enabled.

Addesses-Coverity-ID: 1477328 ("Unused value")
Fixes: 97417f6182f8 ("net/mlx5e: Fix GRE key by controlling port tunnel entropy calculation")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Reviewed-by: Eli Britstein <[email protected]>
Acked-by: Leon Romanovsky <[email protected]>
Acked-by: Saeed Mahameed <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix compilation warning in en_tc.c

Amazingly a mlx5e_tc function is being called from the eswitch layer,
which is by itself very terrible! The function was declared locally in
eswitch_offloads.c so it could be used there, which caused the following
compilation warning, fix that.

drivers/.../mlx5/core/en_tc.c:3242:6: [-Werror=missing-prototypes]
error: no previous prototype for ‘mlx5e_tc_clean_fdb_peer_flows’

Fixes: 04de7dda7394 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix port buffer function documentation format

This patch fixes compiler warnings:
In drivers/.../mlx5/core/en/port_buffer.c:190:
warning: Function parameter or member 'pfc_en' not described...
...
warning: Function parameter or member 'change' not described...

Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration")
Reviewed-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Fix compilation warning in eq.c

mlx5_eq_table_get_rmap is being used only when CONFIG_RFS_ACCEL is
enabled, this patch fixes the below warning when CONFIG_RFS_ACCEL is
disabled.

drivers/.../mlx5/core/eq.c:903:18: [-Werror=missing-prototypes]
error: no previous prototype for ‘mlx5_eq_table_get_rmap’

Fixes: f2f3df550139 ("net/mlx5: EQ, Privatize eq_table and friends")
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Simplify mlx5_sriov_is_enabled() by using pci core API

It is desired to get rid of num_vfs stored inside mlx5_core_sriov to
safely support vports more than vfs.
To reduce dependency on mlx5_core_sriov num_vfs, start using
pci_num_vf() from pci core.

Signed-off-by: Parav Pandit <[email protected]>
Reviewed-by: Bodong Wang <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Rename total_vfs to total_vports

Macro MLX5_TOTAL_VPORTS() returns total number of vports. Therefore,
rename variable total_vfs to total_vports to improve code readability.

Signed-off-by: Parav Pandit <[email protected]>
Reviewed-by: Bodong Wang <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Simplify sriov enable/disable flow

Simplify sriov enable/disable flow for below two checks.

1. PCI core driver allows sriov configuration only on a PF.
This is done in drivers/pci/pci-sysfs.c sriov_attrs_are_visible().

2. PCI core driver allow sriov enablement if the sriov is currently
disabled for for a PF. This is done in drivers/pci/pci-sysfs.c
sriov_numvfs_store().

Hence there is no need for mlx5 driver to duplicate such checks.

Signed-off-by: Parav Pandit <[email protected]>
Reviewed-by: Bodong Wang <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

ice: Get VF VSI instances directly via PF

This patch changes how we get VF VSIs instances. Instead of relying on
mailbox virtual channel message to retrieve VSI, it is more reliable
getting it directly via VF object in PF data structure.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Don't let VF know that it is untrusted

Don't let the VF know it's not trusted when it tries to add more than
permitted additional MAC addresses.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Set LAN_EN for all directional rules

The LAN_EN bit for a switch rule determines if the packet can go out
on the wire or not. Set the LAN_EN flag in the switch action for all
directional rules.

Signed-off-by: Yashaswini Raghuram Prathivadi Bhayankaram <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Do not set LB_EN for prune switch rules

LB_EN for prune switch rules was causing all TX traffic
to loopback to the internal switch and dropped. When
running bi-directional stress workloads with RDMA
the RDPU would hang blocking tx and rx traffic.

Signed-off-by: Christopher N Bednarz <[email protected]>
Reviewed-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Enable LAN_EN for the right recipes

In VEB mode, enable LAN_EN bit in the action fields for filter rules
corresponding to the right recipes.

Signed-off-by: Yashaswini Raghuram Prathivadi Bhayankaram <[email protected]>
Reviewed-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Add support for PF/VF promiscuous mode

Implement support for VF promiscuous mode, MAC/VLAN/MAC_VLAN and PF
multicast MAC/VLAN/MAC_VLAN promiscuous mode.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: code cleanup in ice_sched.c

This patch does some clean up in the Tx scheduler code:

1. Adjust the stack variable usage
2. Modify the debug prints to display the FW error
3. Add additional debug prints while adding/removing VSIs

Signed-off-by: Victor Raj <[email protected]>
Reviewed-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Remove unused vsi_id field

Remove unused vsi_id field from struct ice_sched_vsi_info.

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: fix some function prototype and signature style issues

Put the return type on a separate line for function prototypes and
signatures that would exceed the 80-character limit if both were on
the same line.

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: fix the divide by zero issue

Static analysis flagged a potential divide by zero error because
vsi->num_rxq can become zero in certain condition and it is used as
divisor.

Signed-off-by: Kiran Patil <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Fix issue reconfiguring VF queues

When VF requested for queues changes, we need to update LAN Tx queue with
correct number of VF queue pairs and re-allocate VF resources based on
this new requested number of queues, which is constraint within maximum
queue supported per VF.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Remove unused function prototype

Commit 7c710869d64e ("ice: Add handlers for VF netdevice operations")
seems to have inadvertently introduced a function prototype for
ice_set_vf_bw that isn't implemented. Remove it.

Fixes: 7c710869d64e ("ice: Add handlers for VF netdevice operations")
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: fix static analysis warnings

cppcheck warns "Identical condition '<var>', second condition is always
false". Fix them.

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Fix issue reclaiming resources back to the pool after reset

This patch fixes issue reclaiming VF resources back to the pool after
reset - Since we only allocate HW vector for all VFs and track together
with resources allocation for PF with ice_search_res, we need to free VFs
resources separately, using first VF vector index to traverse the list.
Otherwise tracker starts from the last assigned vectors list and causes
maximum supported number of HW vectors, 1024 to be exhausted, depending on
the number of VFs enabled, which causes a lot of unwanted issues, and
failed to reassign vectors for VFs.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Enable MAC anti-spoof by default

This patch enables MAC anti-spoof by default, with creation of VF VSIs or
when the VF VSIs are being re-initialized.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

genetlink: make policy common to family

Since maxattr is common, the policy can't really differ sanely,
so make it common as well.

The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.

This reduces the size of e.g. nl80211.o (which has lots of commands):

   text    data     bss     dec     hex filename
398745   14323    2240 415308   6564c net/wireless/nl80211.o (before)
397913   14331    2240 414484   65314 net/wireless/nl80211.o (after)
--------------------------------
   -832      +8       0    -824

Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.

Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
     {
    - .policy = POLICY,
     },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
            .ops = OPS,
            .maxattr = M,
    +       .policy = POLICY,
            ...
    };

This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

r8169: use netif_start_queue instead of netif_wake_qeueue in rtl8169_start_xmit

Replace the call to netif_wake_queue in rtl8169_start_xmit with
netif_start_queue as we don't need to actually wake up the queue since
we are still in mid transmit so we just need to reset the bit so it
doesn't prevent the next transmit.
(Description shamelessly copied from a mail sent by Alex.)

Suggested-by: Alexander Duyck <[email protected]>
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: aquantia: add downshift support

Aquantia PHY's of the AQR107 family support the downshift feature.
Add support for it as standard PHY tunable so that it can be controlled
via ethtool.
The AQCS109 supports a proprietary 2-pair 1Gbps mode. If two such PHY's
are connected to each other with a 2-pair cable, they may not be able
to establish a link if both advertise modes > 1Gbps.

v2:
- add downshift event detection
- warn if downshift occurred
- read downshifted rate from vendor register
- enable downshift per default on all AQR107 family members

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests: bpf: modify urandom_read and link it non-statically

After some experiences I found that urandom_read does not need to be
linked statically. When the 'read' syscall call is moved to separate
non-inlined function then bpf_get_stackid() is able to find
the executable in stack trace and extract its build_id from it.

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

samples: bpf: add xdp_sample_pkts to .gitignore

This commit adds xdp_sample_pkts to .gitignore which is
currently ommited from the ignore file.

Signed-off-by: Daniel T. Lee <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'bpf_tcp_check_syncookie'

Lorenz Bauer says:

====================
This series adds the necessary helpers to determine wheter a given
(encapsulated) TCP packet belongs to a connection known to the network stack.

* bpf_skc_lookup_tcp gives access to request and timewait sockets
* bpf_tcp_check_syncookie identifies the final 3WHS ACK when syncookies
are enabled

The goal is to be able to implement load-balancing approaches like
glb-director [1] or Beamer [2] in pure eBPF. Specifically, we'd like to replace
the functionality of the glb-redirect kernel module [3] by an XDP program or
tc classifier.

Changes in v3:
* Fix missing check for ip4->ihl
* Only cast to unsigned long in BPF_CALLs

Changes in v2:
* Rename bpf_sk_check_syncookie to bpf_tcp_check_syncookie.
* Add bpf_skc_lookup_tcp. Without it bpf_tcp_check_syncookie doesn't make sense.
* Check tcp_synq_no_recent_overflow() in bpf_tcp_check_syncookie.
* Check th->syn in bpf_tcp_check_syncookie.
* Require CONFIG_IPV6 to be a built in.

1: https://github.com/github/glb-director
2: https://www.usenix.org/conference/nsdi18/presentation/olteanu
3: https://github.com/github/glb-director/tree/master/src/glb-redirect
====================

Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: add tests for bpf_tcp_check_syncookie and bpf_skc_lookup_tcp

Add tests which verify that the new helpers work for both IPv4 and
IPv6, by forcing SYN cookies to always on. Use a new network namespace
to avoid clobbering the global SYN cookie settings.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: test references to sock_common

Make sure that returning a struct sock_common * reference invokes
the reference tracking machinery in the verifier.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: allow specifying helper for BPF_SK_LOOKUP

Make the BPF_SK_LOOKUP macro take a helper function, to ease
writing tests for new helpers.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

tools: update include/uapi/linux/bpf.h

Pull definitions for bpf_skc_lookup_tcp and bpf_sk_check_syncookie.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: add helper to check for a valid SYN cookie

Using bpf_skc_lookup_tcp it's possible to ascertain whether a packet
belongs to a known connection. However, there is one corner case: no
sockets are created if SYN cookies are active. This means that the final
ACK in the 3WHS is misclassified.

Using the helper, we can look up the listening socket via
bpf_skc_lookup_tcp and then check whether a packet is a valid SYN
cookie ACK.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: add skc_lookup_tcp helper

Allow looking up a sock_common. This gives eBPF programs
access to timewait and request sockets.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: allow helpers to return PTR_TO_SOCK_COMMON

It's currently not possible to access timewait or request sockets
from eBPF, since there is no way to return a PTR_TO_SOCK_COMMON
from a helper. Introduce RET_PTR_TO_SOCK_COMMON to enable this
behaviour.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: track references based on is_acquire_func

So far, the verifier only acquires reference tracking state for
RET_PTR_TO_SOCKET_OR_NULL. Instead of extending this for every
new return type which desires these semantics, acquire reference
tracking state iff the called helper is an acquire function.

Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'Refactor-flower-classifier-to-remove-dependency-on-rtnl-lock'

Vlad Buslov says:

====================
Refactor flower classifier to remove dependency on rtnl lock

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
already registered with this flag and only take rtnl lock when qdisc or
classifier requires it. Classifiers can indicate that their ops
callbacks don't require caller to hold rtnl lock by setting the
TCF_PROTO_OPS_DOIT_UNLOCKED flag. The goal of this change is to refactor
flower classifier to support unlocked execution and register it with
unlocked flag.

This patch set implements following changes to make flower classifier
concurrency-safe:

- Implement reference counting for individual filters. Change fl_get to
  take reference to filter. Implement tp->ops->put callback that was
  introduced in cls API patch set to release reference to flower filter.

- Use tp->lock spinlock to protect internal classifier data structures
  from concurrent modification.

- Handle concurrent tcf proto deletion by returning EAGAIN, which will
  cause cls API to retry and create new proto instance or return error
  to the user (depending on message type).

- Handle concurrent insertion of filter with same priority and handle by
  returning EAGAIN, which will cause cls API to lookup filter again and
  process it accordingly to netlink message flags.

- Extend flower mask with reference counting and protect masks list with
  masks_lock spinlock.

- Prevent concurrent mask insertion by inserting temporary value to
  masks hash table. This is necessary because mask initialization is a
  sleeping operation and cannot be done while holding tp->lock.

Both chain level and classifier level conflicts are resolved by
returning -EAGAIN to cls API that results restart of whole operation.
This retry mechanism is a result of fine-grained locking approach used
in this and previous changes in series and is necessary to allow
concurrent updates on same chain instance. Alternative approach would be
to lock the whole chain while updating filters on any of child tp's,
adding and removing classifier instances from the chain. However, since
most CPU-intensive parts of filter update code are specifically in
classifier code and its dependencies (extensions and hw offloads), such
approach would negate most of the gains introduced by this change and
previous changes in the series when updating same chain instance.

Tcf hw offloads API is not changed by this patch set and still requires
caller to hold rtnl lock. Refactored flower classifier tracks rtnl lock
state by means of 'rtnl_held' flag provided by cls API and obtains the
lock before calling hw offloads. Following patch set will lift this
restriction and refactor cls hw offloads API to support unlocked
execution.

With these changes flower classifier is safely registered with
TCF_PROTO_OPS_DOIT_UNLOCKED flag in last patch.

Changes from V2 to V3:
- Rebase on latest net-next

Changes from V1 to V2:
- Extend cover letter with explanation about retry mechanism.
- Rebase on current net-next.
- Patch 1:
  - Use rcu_dereference_raw() for tp->root dereference.
  - Update comment in fl_head_dereference().
- Patch 2:
  - Remove redundant check in fl_change error handling code.
  - Add empty line between error check and new handle assignment.
- Patch 3:
  - Refactor loop in fl_get_next_filter() to improve readability.
- Patch 4:
  - Refactor __fl_delete() to improve readability.
- Patch 6:
  - Fix comment in fl_check_assign_mask().
- Patch 9:
  - Extend commit message.
  - Fix error code in comment.
- Patch 11:
  - Fix fl_hw_replace_filter() to always release rtnl lock in error
    handlers.
- Patch 12:
  - Don't take rtnl lock before calling __fl_destroy_filter() in
    workqueue context.
  - Extend commit message with explanation why flower still takes rtnl
    lock before calling hardware offloads API.

Github: <https://github.com/vbuslov/linux/tree/unlocked-flower-cong3>
====================

Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: set unlocked flag for flower proto ops

Set TCF_PROTO_OPS_DOIT_UNLOCKED for flower classifier to indicate that its
ops callbacks don't require caller to hold rtnl lock. Don't take rtnl lock
in fl_destroy_filter_work() that is executed on workqueue instead of being
called by cls API and is not affected by setting
TCF_PROTO_OPS_DOIT_UNLOCKED. Rtnl mutex is still manually taken by flower
classifier before calling hardware offloads API that has not been updated
for unlocked execution.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: track rtnl lock state

Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag
to internal functions that need to know rtnl lock state. Take rtnl lock
before calling tcf APIs that require it (hw offload, bind filter, etc.).

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: protect flower classifier state with spinlock

struct tcf_proto was extended with spinlock to be used by classifiers
instead of global rtnl lock. Use it to protect shared flower classifier
data structures (handle_idr, mask hashtable and list) and fields of
individual filters that can be accessed concurrently. This patch set uses
tcf_proto->lock as per instance lock that protects all filters on
tcf_proto.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: handle concurrent tcf proto deletion

Without rtnl lock protection tcf proto can be deleted concurrently. Check
tcf proto 'deleting' flag after taking tcf spinlock to verify that no
concurrent deletion is in progress. Return EAGAIN error if concurrent
deletion detected, which will cause caller to retry and possibly create new
instance of tcf proto.

Retry mechanism is a result of fine-grained locking approach used in this
and previous changes in series and is necessary to allow concurrent updates
on same chain instance. Alternative approach would be to lock the whole
chain while updating filters on any of child tp's, adding and removing
classifier instances from the chain. However, since most CPU-intensive
parts of filter update code are specifically in classifier code and its
dependencies (extensions and hw offloads), such approach would negate most
of the gains introduced by this change and previous changes in the series
when updating same chain instance.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: handle concurrent filter insertion in fl_change

Check if user specified a handle and another filter with the same handle
was inserted concurrently. Return EAGAIN to retry filter processing (in
case it is an overwrite request).

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: protect masks list with spinlock

Protect modifications of flower masks list with spinlock to remove
dependency on rtnl lock and allow concurrent access.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: handle concurrent mask insertion

Without rtnl lock protection masks with same key can be inserted
concurrently. Insert temporary mask with reference count zero to masks
hashtable. This will cause any concurrent modifications to retry.

Wait for rcu grace period to complete after removing temporary mask from
masks hashtable to accommodate concurrent readers.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Suggested-by: Jiri Pirko <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: add reference counter to flower mask

Extend fl_flow_mask structure with reference counter to allow parallel
modification without relying on rtnl lock. Use rcu read lock to safely
lookup mask and increment reference counter in order to accommodate
concurrent deletes.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: track filter deletion with flag

In order to prevent double deletion of filter by concurrent tasks when rtnl
lock is not used for synchronization, add 'deleted' filter field. Check
value of this field when modifying filters and return error if concurrent
deletion is detected.

Refactor __fl_delete() to accept pointer to 'last' boolean as argument,
and return error code as function return value instead. This is necessary
to signal concurrent filter delete to caller.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: introduce reference counting for filters

Extend flower filters with reference counting in order to remove dependency
on rtnl lock in flower ops and allow to modify filters concurrently.
Reference to flower filter can be taken/released concurrently as soon as it
is marked as 'unlocked' by last patch in this series. Use atomic reference
counter type to make concurrent modifications safe.

Always take reference to flower filter while working with it:
- Modify fl_get() to take reference to filter.
- Implement tp->put() callback as fl_put() function to allow cls API to
release reference taken by fl_get().
- Modify fl_change() to assume that caller holds reference to fold and take
reference to fnew.
- Take reference to filter while using it in fl_walk().

Implement helper functions to get/put filter reference counter.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: refactor fl_change

As a preparation for using classifier spinlock instead of relying on
external rtnl lock, rearrange code in fl_change. The goal is to group the
code which changes classifier state in single block in order to allow
following commits in this set to protect it from parallel modification with
tp->lock. Data structures that require tp->lock protection are mask
hashtable and filters list, and classifier handle_idr.

fl_hw_replace_filter() is a sleeping function and cannot be called while
holding a spinlock. In order to execute all sequence of changes to shared
classifier data structures atomically, call fl_hw_replace_filter() before
modifying them.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: don't check for rtnl on head dereference

Flower classifier only changes root pointer during init and destroy. Cls
API implements reference counting for tcf_proto, so there is no danger of
concurrent access to tp when it is being destroyed, even without protection
provided by rtnl lock.

Implement new function fl_head_dereference() to dereference tp->root
without checking for rtnl lock. Use it in all flower function that obtain
head pointer instead of rtnl_dereference().

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Stefano Brivio <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: remove defines for unused control bits

NFP driver ABI contains bits for L2 switching which were never
implemented in initially envisioned form.

Remove the defines, and open up the possibility of
reclaiming the bits for other uses.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Dirk van der Merwe <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'rhashtable-cleanups'

NeilBrown says:

====================
Two clean-ups for rhashtable.

These two patches make small improvements to
rhashtable, but are otherwise unrelated.

Thanks to Herbert, Miguel, and Paul for the review.
====================

Signed-off-by: David S. Miller <[email protected]>

rhashtable: rename rht_for_each*continue as *from.

The pattern set by list.h is that for_each..continue()
iterators start at the next entry after the given one,
while for_each..from() iterators start at the given
entry.

The rht_for_each*continue() iterators are documented as though the
start at the 'next' entry, but actually start at the given entry,
and they are used expecting that behaviour.
So fix the documentation and change the names to *from for consistency
with list.h

Acked-by: Herbert Xu <[email protected]>
Acked-by: Miguel Ojeda <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

rhashtable: don't hold lock on first table throughout insertion.

rhashtable_try_insert() currently holds a lock on the bucket in
the first table, while also locking buckets in subsequent tables.
This is unnecessary and looks like a hold-over from some earlier
version of the implementation.

As insert and remove always lock a bucket in each table in turn, and
as insert only inserts in the final table, there cannot be any races
that are not covered by simply locking a bucket in each table in turn.

When an insert call reaches that last table it can be sure that there
is no matchinf entry in any other table as it has searched them all, and
insertion never happens anywhere but in the last table. The fact that
code tests for the existence of future_tbl while holding a lock on
the relevant bucket ensures that two threads inserting the same key
will make compatible decisions about which is the "last" table.

This simplifies the code and allows the ->rehash field to be
discarded.

We still need a way to ensure that a dead bucket_table is never
re-linked by rhashtable_walk_stop(). This can be achieved by calling
call_rcu() inside the locked region, and checking with
rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
bucket table is empty and dead.

Acked-by: Herbert Xu <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-phy-Move-Omega-PHY-entry-to-Cygnus-PHY-driver'

Florian Fainelli says:

====================
net: phy: Move Omega PHY entry to Cygnus PHY driver

In order to pave the way for adding some specific Omega PHY features
that may not be desirable on other products covered by the bcm7xxx PHY
driver, split the Omega PHY entry into the Cygnus PHY driver such that
the PHY drivers are reflective of product lines/business units
maintaining them within Broadcom.

No functional changes intended.
====================

Acked-by: Arun Parameswaran <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Move Omega PHY entry to Cygnus PHY driver

Cygnus and Omega are part of the same business unit and product line, it
makes sense to group PHY entries by products such that a platform can
select only the drivers that it needs. Bring all the functionality that
the BCM7XXX_28NM_GPHY() macro hides for us and remove the Omega PHY
entry from bcm7xxx.c.

As an added bonus, we now have a proper mdio_device_id entry to permit
auto-loading.

Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Scott Branden <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Prepare for moving Omega out of bcm7xxx

The Omega PHY entry was added to bcm7xxx.c out of convenience and this
breaks the one driver per product line paradigm that was applied up
until now. Since the AFE initialization is shared between Omega and
BCM7xxx move the relevant functions to bcm-phy-lib.[ch]. No functional
changes introduced.

Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Scott Branden <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dst: remove gc leftovers

Get rid of some obsolete gc-related documentation and macros that were
missed in commit 5b7c9a8ff828 ("net: remove dst gc related code").

CC: Wei Wang <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
Acked-by: Wei Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-broadcom-Remove-print-of-base-address'

Florian Fainelli says:

====================
net: broadcom: Remove print of base address

Some broadcom MDIO/switch/Ethernet MAC drivers insist on printing the
base register virtual address which has little value.
====================

Signed-off-by: David S. Miller <[email protected]>

net: systemport: Remove print of base address

Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.

Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: bcm_sf2: Remove print of base address

Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, we use a dev_info() print which already
displays the platform device's address.

Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: mdio-bcm-unimac: Remove print of base address

Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.

Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: Remove fallback argument from ip6_hold_safe

net and null_fallback are redundant. Remove null_fallback in favor of
!net check.

Signed-off-by: David Ahern <[email protected]>
Acked-by: Wei Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>