Igor Russkikh [Sat, 23 Mar 2019 15:23:34 +0000 (15:23 +0000)]
net: aquantia: Introduce rx refill threshold value
Before that, we've refilled ring even on single descriptor move.
Under high packet load that caused page allocation logic to be triggered
too often. That made overall ring processing slower.
Moreover, with page buffer reuse implemented, we should give a chance
higher networking levels to process received packets faster, release
the pages they consumed and therefore give a higher chance for these
pages to be reused.
RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
descriptors were processed (32 by default). Under regular traffic this
gives quite enough time for packet to be consumed and page to be reused.
Igor Russkikh [Sat, 23 Mar 2019 15:23:32 +0000 (15:23 +0000)]
net: aquantia: optimize rx performance by page reuse strategy
We introduce internal aq_rxpage wrapper over regular page
where extra field is tracked: rxpage offset inside of allocated page.
This offset allows to reuse one page for multiple packets.
When needed (for example with large frames processing), allocated
pageorder could be customized. This gives even larger page reuse
efficiency.
page_ref_count is used to track page users. If during rx refill
underlying page has users, we increase pg_off by rx frame size
thus the top half of the page is reused.
Igor Russkikh [Sat, 23 Mar 2019 15:23:31 +0000 (15:23 +0000)]
net: aquantia: optimize rx path using larger preallocated skb len
Atlantic driver used 14 bytes preallocated skb size. That made L3 protocol
processing inefficient because pskb_pull had to fetch all the L3/L4 headers
from extra fragments.
Specially on UDP flows that caused extra packet drops because CPU was
overloaded with pskb_pull.
This patch uses eth_get_headlen for skb preallocation.
David S. Miller [Sun, 24 Mar 2019 02:03:44 +0000 (22:03 -0400)]
Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-03-20
This series includes updates to mlx5 driver,
1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
Replaces TC VLAN pop and push actions with VLAN modify
Note: This series includes two simple non-mlx5 patches,
1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================
David S. Miller [Sun, 24 Mar 2019 02:02:54 +0000 (22:02 -0400)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
100GbE Intel Wired LAN Driver Updates 2019-03-22
This series contains updates to ice driver only.
Akeem enables MAC anti-spoofing by default when a new VSI is being
created. Fixes an issue when reclaiming VF resources back to the pool
after reset, by freeing VF resources separately using the first VF
vector index to traverse the list, instead of starting at the last
assigned vectors list. Added support for VF & PF promiscuous mode in
the ice driver. Fixed the PF driver from letting the VF know it is "not
trusted" when it attempts to add more than its permitted additional MAC
addresses. Altered how the driver gets the VF VSIs instances, instead
of using the mailbox messages to retrieve VSIs, get it directly via the
VF object in the PF data structure.
Bruce fixes return values to resolve static analysis warnings. Made
whitespace changes to increase readability and reduce code wrapping.
Anirudh cleans up code by removing a function prototype that was never
implemented and removed an unused field in the ice_sched_vsi_info
structure.
Kiran fixes a potential divide by zero issue by adding a check.
Victor cleans up the transmit scheduler by adjusting the stack variable
usage and added/modified debug prints to make them more useful.
Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
is set if all the right conditions are met.
Christopher ensures the loopback enable bit is not set for prune switch
rules, since all transmit traffic would be looped back to the internal
switch and dropped.
====================
David S. Miller [Sun, 24 Mar 2019 01:57:38 +0000 (21:57 -0400)]
Merge branch 'tcp-rx-tx-cache'
Eric Dumazet says:
====================
tcp: add rx/tx cache to reduce lock contention
On hosts with many cpus we can observe a very serious contention
on spinlocks used in mm slab layer.
The following can happen quite often :
1) TX path
sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
ACK is received on CPU B, and consumes the skb that was in the retransmit
queue.
2) RX path
network driver allocates skb on CPU C
recvmsg() happens on CPU D, freeing the skb after it has been delivered
to user space.
In both cases, we are hitting the asymetric alloc/free pattern
for which slab has to drain alien caches. At 8 Mpps per second,
this represents 16 Mpps alloc/free per second and has a huge penalty.
In an interesting experiment, I tried to use a single kmem_cache for all the skbs
(in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
qnd most of the contention disappeared, since cpus could better use
their local slab per-cpu cache.
But we can do actually better, in the following patches.
TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
so that next sendmsg() can reuse it immediately.
RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
so that it can be freed by the cpu feeding the incoming packets in BH.
This increased the performance of small RPC benchmark by about 10 % on a host
with 112 hyperthreads.
v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
clone has been freed.
- Really test rps_needed in sk_eat_skb() as claimed.
- Fixed rps_needed use in drivers/net/tun.c
v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
====================
Eric Dumazet [Fri, 22 Mar 2019 15:56:40 +0000 (08:56 -0700)]
tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
David S. Miller [Sun, 24 Mar 2019 01:52:37 +0000 (21:52 -0400)]
Merge branch 'net-dev-BYPASS-for-lockless-qdisc'
Paolo Abeni says:
====================
net: dev: BYPASS for lockless qdisc
This patch series is aimed at improving xmit performances of lockless qdisc
in the uncontended scenario.
After the lockless refactor pfifo_fast can't leverage the BYPASS optimization.
Due to retpolines the overhead for the avoidables enqueue and dequeue operations
has increased and we see measurable regressions.
The first patch introduces the BYPASS code path for lockless qdisc, and the
second one optimizes such path further. Overall this avoids up to 3 indirect
calls per xmit packet. Detailed performance figures are reported in the 2nd
patch.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
====================
Paolo Abeni [Fri, 22 Mar 2019 15:01:56 +0000 (16:01 +0100)]
net: dev: introduce support for sch BYPASS for lockless qdisc
With commit c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
pfifo_fast no longer benefit from the TCQ_F_CAN_BYPASS optimization.
Due to retpolines the cost of the enqueue()/dequeue() pair has become
relevant and we observe measurable regression for the uncontended
scenario when the packet-rate is below line rate.
After commit 46b1c18f9deb ("net: sched: put back q.qlen into a
single location") we can check for empty qdisc with a reasonably
fast operation even for nolock qdiscs.
This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
The new chunk of code mirrors closely the existing one for traditional
qdisc, leveraging a newly introduced helper to read atomically the
qdisc length.
Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
device, and MQ root qdisc:
Paolo Abeni [Fri, 22 Mar 2019 15:01:55 +0000 (16:01 +0100)]
net: sched: add empty status flag for NOLOCK qdisc
The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc clearing such flag on dequeue.
Since the empty status lays on the same cache-line of the
seqlock, it's always hot on cache during the updates.
This makes the empty flag update a little bit loosy. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
====================
BPF allows for dynamic tunneling, choosing the tunnel destination and
features on-demand. Extend bpf_skb_adjust_room to allow for efficient
tunneling at the TC hooks.
Most features are required for large packets with GSO, as these will
be modified after this patch.
Patch 1
is a performance optimization, avoiding an unnecessary unclone
for the TCP hot path.
Patches 2..6
introduce a regression test. These can be squashed, but the code is
arguably more readable when gradually expanding the feature set.
Patch 7
is a performance optimization, avoid copying network headers
that are going to be overwritten. This also simplifies the bpf
program.
Patch 8
reenables bpf_skb_adjust_room for UDP packets.
Patch 9
configures skb tunneling metadata analogous to tunnel devices.
Patches 10..13
expand the regression test to make use of the new features and
enable the GSO testcases.
Changes
v1->v2
- move BPF_F_ADJ_ROOM_MASK out of uapi as it can be expanded
- document new flags
- in tests replace netcat -q flag with coreutils timeout:
the -q flag is not supported in all netcat versions
v2->v3
- move BPF_F_ADJ_ROOM_ENCAP_L3_MASK out of uapi as it has no
use in userspace
====================
Willem de Bruijn [Fri, 22 Mar 2019 18:32:56 +0000 (14:32 -0400)]
bpf: add bpf_skb_adjust_room encap flags
When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.
For GSO packets, the network stack requires certain fields set to
segment packets with tunnel headers. gro_gse_segment depends on
transport and inner mac header, for instance.
Add an option to pass this information.
Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.
Changes
v1->v2:
- document new flags
- BPF_F_ADJ_ROOM_MASK moved
v2->v3:
- BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved
bpf_skb_adjust_room net allows inserting room in an skb.
Existing mode BPF_ADJ_ROOM_NET inserts room after the network header
by pulling the skb, moving the network header forward and zeroing the
new space.
Add new mode BPF_ADJUST_ROOM_MAC that inserts room after the mac
header. This allows inserting tunnel headers in front of the network
header without having to recreate the network header in the original
space, avoiding two copies.
Willem de Bruijn [Fri, 22 Mar 2019 18:32:53 +0000 (14:32 -0400)]
selftests/bpf: extend bpf tunnel test with tso
Segmentation offload takes a longer path. Verify that the feature
works with large packets.
The test succeeds if not setting dodgy in bpf_skb_adjust_room, as veth
TSO is permissive.
If not setting SKB_GSO_DODGY, this enables tunneled TSO offload on
supporting NICs.
The feature sets SKB_GSO_DODGY because the caller is untrusted. As a
result the packets traverse through the gso stack at least up to TCP.
And fail the gso_type validation, such as the skb->encapsulation check
in gre_gso_segment and the gso_type checks introduced in commit 418e897e0716 ("gso: validate gso_type on ipip style tunnel").
This will be addressed in a follow-on feature patch. In the meantime,
disable the new gso tests.
Changes v1->v2:
- not all netcat versions support flag '-q', use timeout instead
Willem de Bruijn [Fri, 22 Mar 2019 18:32:48 +0000 (14:32 -0400)]
bpf: in bpf_skb_adjust_room avoid copy in tx fast path
bpf_skb_adjust_room calls skb_cow on grow.
This expensive operation can be avoided in the fast path when the only
other clone has released the header. This is the common case for TCP,
where one headerless clone is kept on the retransmit queue.
It is safe to do so even when touching the gso fields in skb_shinfo.
Regular tunnel encap with iptunnel_handle_offloads takes the same
optimization.
The tcp stack unclones in the unlikely case that it accesses these
fields through headerless clones packets on the retransmit queue (see
__tcp_retransmit_skb).
If any other clones are present, e.g., from packet sockets,
skb_cow_head returns the same value as skb_cow().
Eli Britstein [Thu, 21 Mar 2019 22:51:42 +0000 (15:51 -0700)]
net/mlx5e: Replace TC VLAN pop and push actions with VLAN modify
Changing the VLAN header may be implemented by pop the existing header
and push a new one. Translate those operations as VLAN modify.
Applicable for use cases such as OVS where the controller translates a
vlan modify meta (OF) rule to DP pop+push actions rule.
Eli Britstein [Thu, 21 Mar 2019 22:51:41 +0000 (15:51 -0700)]
net/mlx5e: Support VLAN modify action
Support VLAN modify action by emulating a rewrite action for the VLAN
fields. Currently, the only supported field is the vid. The prio in the
action must be set to 0 to indicate no change.
Moshe Shemesh [Thu, 21 Mar 2019 22:51:39 +0000 (15:51 -0700)]
net: Add IANA_VXLAN_UDP_PORT definition to vxlan header file
Added IANA_VXLAN_UDP_PORT (4789) definition to vxlan header file so it
can be used by drivers instead of local definition.
Updated drivers which locally defined it as 4789 to use it.
Moshe Shemesh [Thu, 21 Mar 2019 22:51:38 +0000 (15:51 -0700)]
net/mlx5e: TX, Add geneve tunnel stateless offload support
Currently support only default geneve udp port (6081).
For the tx side, the HW is assisted by SW parsing, which sets the
headers offset to offload tunneled LSO and csum. Note that for udp
tunnels, we don't use special rx offloads, as rss on the outer headers
is enough, we support checksum complete and GRO takes care of
aggregation.
Geneve TSO BW and CPU load results (tested using iperf single tcp
stream).
In this patch we add TSO support over Geneve, so the "before" result
doesn't actually get to using the TSO HW offload even when turned on.
Tested on ConnectX-5, Intel(R) Xeon(R) CPU E5-2660 v2 @2.20GHz.
__________________________________
| Before | After |
|________________|_________________|
| 12.6 Gbits/sec | 21.7 Gbits/sec |
| 100% CPU load | 61.5% CPU load |
|________________|_________________|
Moshe Shemesh [Thu, 21 Mar 2019 22:51:37 +0000 (15:51 -0700)]
net/mlx5e: Take SW parser code to a separate function
Refactor mlx5e_ipsec_set_swp() code, split the part which sets the eseg
software parser (SWP) offsets and flags, so it can be used in a
downstream patch by other mlx5e functionality which needs to set eseg
SWP.
The new function mlx5e_set_eseg_swp() is useful for setting swp for both
outer and inner headers. It also handles the special ipsec case of xfrm
mode transfer.
Moshe Shemesh [Thu, 21 Mar 2019 22:51:36 +0000 (15:51 -0700)]
net: Move the definition of the default Geneve udp port to public header file
Move the definition of the default Geneve udp port from the geneve
source to the header file, so we can re-use it from drivers.
Modify existing drivers to use it.
Saeed Mahameed [Thu, 21 Mar 2019 22:51:33 +0000 (15:51 -0700)]
net/mlx5e: Fix compilation warning in en_tc.c
Amazingly a mlx5e_tc function is being called from the eswitch layer,
which is by itself very terrible! The function was declared locally in
eswitch_offloads.c so it could be used there, which caused the following
compilation warning, fix that.
drivers/.../mlx5/core/en_tc.c:3242:6: [-Werror=missing-prototypes]
error: no previous prototype for ‘mlx5e_tc_clean_fdb_peer_flows’
Fixes: 04de7dda7394 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows") Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Saeed Mahameed [Thu, 21 Mar 2019 22:51:32 +0000 (15:51 -0700)]
net/mlx5e: Fix port buffer function documentation format
This patch fixes compiler warnings:
In drivers/.../mlx5/core/en/port_buffer.c:190:
warning: Function parameter or member 'pfc_en' not described...
...
warning: Function parameter or member 'change' not described...
Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Reviewed-by: Eran Ben Elisha <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Parav Pandit [Thu, 21 Mar 2019 22:51:30 +0000 (15:51 -0700)]
net/mlx5: Simplify mlx5_sriov_is_enabled() by using pci core API
It is desired to get rid of num_vfs stored inside mlx5_core_sriov to
safely support vports more than vfs.
To reduce dependency on mlx5_core_sriov num_vfs, start using
pci_num_vf() from pci core.
Parav Pandit [Thu, 21 Mar 2019 22:51:28 +0000 (15:51 -0700)]
net/mlx5: Simplify sriov enable/disable flow
Simplify sriov enable/disable flow for below two checks.
1. PCI core driver allows sriov configuration only on a PF.
This is done in drivers/pci/pci-sysfs.c sriov_attrs_are_visible().
2. PCI core driver allow sriov enablement if the sriov is currently
disabled for for a PF. This is done in drivers/pci/pci-sysfs.c
sriov_numvfs_store().
Hence there is no need for mlx5 driver to duplicate such checks.
This patch changes how we get VF VSIs instances. Instead of relying on
mailbox virtual channel message to retrieve VSI, it is more reliable
getting it directly via VF object in PF data structure.
The LAN_EN bit for a switch rule determines if the packet can go out
on the wire or not. Set the LAN_EN flag in the switch action for all
directional rules.
LB_EN for prune switch rules was causing all TX traffic
to loopback to the internal switch and dropped. When
running bi-directional stress workloads with RDMA
the RDPU would hang blocking tx and rx traffic.
When VF requested for queues changes, we need to update LAN Tx queue with
correct number of VF queue pairs and re-allocate VF resources based on
this new requested number of queues, which is constraint within maximum
queue supported per VF.
Commit 7c710869d64e ("ice: Add handlers for VF netdevice operations")
seems to have inadvertently introduced a function prototype for
ice_set_vf_bw that isn't implemented. Remove it.
Fixes: 7c710869d64e ("ice: Add handlers for VF netdevice operations") Signed-off-by: Anirudh Venkataramanan <[email protected]> Tested-by: Andrew Bowers <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
ice: Fix issue reclaiming resources back to the pool after reset
This patch fixes issue reclaiming VF resources back to the pool after
reset - Since we only allocate HW vector for all VFs and track together
with resources allocation for PF with ice_search_res, we need to free VFs
resources separately, using first VF vector index to traverse the list.
Otherwise tracker starts from the last assigned vectors list and causes
maximum supported number of HW vectors, 1024 to be exhausted, depending on
the number of VFs enabled, which causes a lot of unwanted issues, and
failed to reassign vectors for VFs.
Johannes Berg [Thu, 21 Mar 2019 21:51:02 +0000 (22:51 +0100)]
genetlink: make policy common to family
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.
The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.
This reduces the size of e.g. nl80211.o (which has lots of commands):
text data bss dec hex filename
398745 14323 2240 415308 6564c net/wireless/nl80211.o (before)
397913 14331 2240 414484 65314 net/wireless/nl80211.o (after)
--------------------------------
-832 +8 0 -824
Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.
Most of the code transformations were done using the following spatch:
@ops@
identifier OPS;
expression POLICY;
@@
struct genl_ops OPS[] = {
...,
{
- .policy = POLICY,
},
...
};
Heiner Kallweit [Thu, 21 Mar 2019 20:41:48 +0000 (21:41 +0100)]
r8169: use netif_start_queue instead of netif_wake_qeueue in rtl8169_start_xmit
Replace the call to netif_wake_queue in rtl8169_start_xmit with
netif_start_queue as we don't need to actually wake up the queue since
we are still in mid transmit so we just need to reset the bit so it
doesn't prevent the next transmit.
(Description shamelessly copied from a mail sent by Alex.)
Heiner Kallweit [Thu, 21 Mar 2019 20:08:35 +0000 (21:08 +0100)]
net: phy: aquantia: add downshift support
Aquantia PHY's of the AQR107 family support the downshift feature.
Add support for it as standard PHY tunable so that it can be controlled
via ethtool.
The AQCS109 supports a proprietary 2-pair 1Gbps mode. If two such PHY's
are connected to each other with a 2-pair cable, they may not be able
to establish a link if both advertise modes > 1Gbps.
v2:
- add downshift event detection
- warn if downshift occurred
- read downshifted rate from vendor register
- enable downshift per default on all AQR107 family members
Ivan Vecera [Fri, 15 Mar 2019 20:04:14 +0000 (21:04 +0100)]
selftests: bpf: modify urandom_read and link it non-statically
After some experiences I found that urandom_read does not need to be
linked statically. When the 'read' syscall call is moved to separate
non-inlined function then bpf_get_stackid() is able to find
the executable in stack trace and extract its build_id from it.
====================
This series adds the necessary helpers to determine wheter a given
(encapsulated) TCP packet belongs to a connection known to the network stack.
* bpf_skc_lookup_tcp gives access to request and timewait sockets
* bpf_tcp_check_syncookie identifies the final 3WHS ACK when syncookies
are enabled
The goal is to be able to implement load-balancing approaches like
glb-director [1] or Beamer [2] in pure eBPF. Specifically, we'd like to replace
the functionality of the glb-redirect kernel module [3] by an XDP program or
tc classifier.
Changes in v3:
* Fix missing check for ip4->ihl
* Only cast to unsigned long in BPF_CALLs
Changes in v2:
* Rename bpf_sk_check_syncookie to bpf_tcp_check_syncookie.
* Add bpf_skc_lookup_tcp. Without it bpf_tcp_check_syncookie doesn't make sense.
* Check tcp_synq_no_recent_overflow() in bpf_tcp_check_syncookie.
* Check th->syn in bpf_tcp_check_syncookie.
* Require CONFIG_IPV6 to be a built in.
Lorenz Bauer [Fri, 22 Mar 2019 01:54:06 +0000 (09:54 +0800)]
selftests/bpf: add tests for bpf_tcp_check_syncookie and bpf_skc_lookup_tcp
Add tests which verify that the new helpers work for both IPv4 and
IPv6, by forcing SYN cookies to always on. Use a new network namespace
to avoid clobbering the global SYN cookie settings.
Lorenz Bauer [Fri, 22 Mar 2019 01:54:02 +0000 (09:54 +0800)]
bpf: add helper to check for a valid SYN cookie
Using bpf_skc_lookup_tcp it's possible to ascertain whether a packet
belongs to a known connection. However, there is one corner case: no
sockets are created if SYN cookies are active. This means that the final
ACK in the 3WHS is misclassified.
Using the helper, we can look up the listening socket via
bpf_skc_lookup_tcp and then check whether a packet is a valid SYN
cookie ACK.
Lorenz Bauer [Fri, 22 Mar 2019 01:54:00 +0000 (09:54 +0800)]
bpf: allow helpers to return PTR_TO_SOCK_COMMON
It's currently not possible to access timewait or request sockets
from eBPF, since there is no way to return a PTR_TO_SOCK_COMMON
from a helper. Introduce RET_PTR_TO_SOCK_COMMON to enable this
behaviour.
Lorenz Bauer [Fri, 22 Mar 2019 01:53:59 +0000 (09:53 +0800)]
bpf: track references based on is_acquire_func
So far, the verifier only acquires reference tracking state for
RET_PTR_TO_SOCKET_OR_NULL. Instead of extending this for every
new return type which desires these semantics, acquire reference
tracking state iff the called helper is an acquire function.
====================
Refactor flower classifier to remove dependency on rtnl lock
Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.
Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
already registered with this flag and only take rtnl lock when qdisc or
classifier requires it. Classifiers can indicate that their ops
callbacks don't require caller to hold rtnl lock by setting the
TCF_PROTO_OPS_DOIT_UNLOCKED flag. The goal of this change is to refactor
flower classifier to support unlocked execution and register it with
unlocked flag.
This patch set implements following changes to make flower classifier
concurrency-safe:
- Implement reference counting for individual filters. Change fl_get to
take reference to filter. Implement tp->ops->put callback that was
introduced in cls API patch set to release reference to flower filter.
- Use tp->lock spinlock to protect internal classifier data structures
from concurrent modification.
- Handle concurrent tcf proto deletion by returning EAGAIN, which will
cause cls API to retry and create new proto instance or return error
to the user (depending on message type).
- Handle concurrent insertion of filter with same priority and handle by
returning EAGAIN, which will cause cls API to lookup filter again and
process it accordingly to netlink message flags.
- Extend flower mask with reference counting and protect masks list with
masks_lock spinlock.
- Prevent concurrent mask insertion by inserting temporary value to
masks hash table. This is necessary because mask initialization is a
sleeping operation and cannot be done while holding tp->lock.
Both chain level and classifier level conflicts are resolved by
returning -EAGAIN to cls API that results restart of whole operation.
This retry mechanism is a result of fine-grained locking approach used
in this and previous changes in series and is necessary to allow
concurrent updates on same chain instance. Alternative approach would be
to lock the whole chain while updating filters on any of child tp's,
adding and removing classifier instances from the chain. However, since
most CPU-intensive parts of filter update code are specifically in
classifier code and its dependencies (extensions and hw offloads), such
approach would negate most of the gains introduced by this change and
previous changes in the series when updating same chain instance.
Tcf hw offloads API is not changed by this patch set and still requires
caller to hold rtnl lock. Refactored flower classifier tracks rtnl lock
state by means of 'rtnl_held' flag provided by cls API and obtains the
lock before calling hw offloads. Following patch set will lift this
restriction and refactor cls hw offloads API to support unlocked
execution.
With these changes flower classifier is safely registered with
TCF_PROTO_OPS_DOIT_UNLOCKED flag in last patch.
Changes from V2 to V3:
- Rebase on latest net-next
Changes from V1 to V2:
- Extend cover letter with explanation about retry mechanism.
- Rebase on current net-next.
- Patch 1:
- Use rcu_dereference_raw() for tp->root dereference.
- Update comment in fl_head_dereference().
- Patch 2:
- Remove redundant check in fl_change error handling code.
- Add empty line between error check and new handle assignment.
- Patch 3:
- Refactor loop in fl_get_next_filter() to improve readability.
- Patch 4:
- Refactor __fl_delete() to improve readability.
- Patch 6:
- Fix comment in fl_check_assign_mask().
- Patch 9:
- Extend commit message.
- Fix error code in comment.
- Patch 11:
- Fix fl_hw_replace_filter() to always release rtnl lock in error
handlers.
- Patch 12:
- Don't take rtnl lock before calling __fl_destroy_filter() in
workqueue context.
- Extend commit message with explanation why flower still takes rtnl
lock before calling hardware offloads API.
Vlad Buslov [Thu, 21 Mar 2019 13:17:44 +0000 (15:17 +0200)]
net: sched: flower: set unlocked flag for flower proto ops
Set TCF_PROTO_OPS_DOIT_UNLOCKED for flower classifier to indicate that its
ops callbacks don't require caller to hold rtnl lock. Don't take rtnl lock
in fl_destroy_filter_work() that is executed on workqueue instead of being
called by cls API and is not affected by setting
TCF_PROTO_OPS_DOIT_UNLOCKED. Rtnl mutex is still manually taken by flower
classifier before calling hardware offloads API that has not been updated
for unlocked execution.
Vlad Buslov [Thu, 21 Mar 2019 13:17:43 +0000 (15:17 +0200)]
net: sched: flower: track rtnl lock state
Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag
to internal functions that need to know rtnl lock state. Take rtnl lock
before calling tcf APIs that require it (hw offload, bind filter, etc.).
Vlad Buslov [Thu, 21 Mar 2019 13:17:42 +0000 (15:17 +0200)]
net: sched: flower: protect flower classifier state with spinlock
struct tcf_proto was extended with spinlock to be used by classifiers
instead of global rtnl lock. Use it to protect shared flower classifier
data structures (handle_idr, mask hashtable and list) and fields of
individual filters that can be accessed concurrently. This patch set uses
tcf_proto->lock as per instance lock that protects all filters on
tcf_proto.
Vlad Buslov [Thu, 21 Mar 2019 13:17:41 +0000 (15:17 +0200)]
net: sched: flower: handle concurrent tcf proto deletion
Without rtnl lock protection tcf proto can be deleted concurrently. Check
tcf proto 'deleting' flag after taking tcf spinlock to verify that no
concurrent deletion is in progress. Return EAGAIN error if concurrent
deletion detected, which will cause caller to retry and possibly create new
instance of tcf proto.
Retry mechanism is a result of fine-grained locking approach used in this
and previous changes in series and is necessary to allow concurrent updates
on same chain instance. Alternative approach would be to lock the whole
chain while updating filters on any of child tp's, adding and removing
classifier instances from the chain. However, since most CPU-intensive
parts of filter update code are specifically in classifier code and its
dependencies (extensions and hw offloads), such approach would negate most
of the gains introduced by this change and previous changes in the series
when updating same chain instance.
Vlad Buslov [Thu, 21 Mar 2019 13:17:40 +0000 (15:17 +0200)]
net: sched: flower: handle concurrent filter insertion in fl_change
Check if user specified a handle and another filter with the same handle
was inserted concurrently. Return EAGAIN to retry filter processing (in
case it is an overwrite request).
Without rtnl lock protection masks with same key can be inserted
concurrently. Insert temporary mask with reference count zero to masks
hashtable. This will cause any concurrent modifications to retry.
Wait for rcu grace period to complete after removing temporary mask from
masks hashtable to accommodate concurrent readers.
Vlad Buslov [Thu, 21 Mar 2019 13:17:37 +0000 (15:17 +0200)]
net: sched: flower: add reference counter to flower mask
Extend fl_flow_mask structure with reference counter to allow parallel
modification without relying on rtnl lock. Use rcu read lock to safely
lookup mask and increment reference counter in order to accommodate
concurrent deletes.
Vlad Buslov [Thu, 21 Mar 2019 13:17:36 +0000 (15:17 +0200)]
net: sched: flower: track filter deletion with flag
In order to prevent double deletion of filter by concurrent tasks when rtnl
lock is not used for synchronization, add 'deleted' filter field. Check
value of this field when modifying filters and return error if concurrent
deletion is detected.
Refactor __fl_delete() to accept pointer to 'last' boolean as argument,
and return error code as function return value instead. This is necessary
to signal concurrent filter delete to caller.
Vlad Buslov [Thu, 21 Mar 2019 13:17:35 +0000 (15:17 +0200)]
net: sched: flower: introduce reference counting for filters
Extend flower filters with reference counting in order to remove dependency
on rtnl lock in flower ops and allow to modify filters concurrently.
Reference to flower filter can be taken/released concurrently as soon as it
is marked as 'unlocked' by last patch in this series. Use atomic reference
counter type to make concurrent modifications safe.
Always take reference to flower filter while working with it:
- Modify fl_get() to take reference to filter.
- Implement tp->put() callback as fl_put() function to allow cls API to
release reference taken by fl_get().
- Modify fl_change() to assume that caller holds reference to fold and take
reference to fnew.
- Take reference to filter while using it in fl_walk().
Implement helper functions to get/put filter reference counter.
Vlad Buslov [Thu, 21 Mar 2019 13:17:34 +0000 (15:17 +0200)]
net: sched: flower: refactor fl_change
As a preparation for using classifier spinlock instead of relying on
external rtnl lock, rearrange code in fl_change. The goal is to group the
code which changes classifier state in single block in order to allow
following commits in this set to protect it from parallel modification with
tp->lock. Data structures that require tp->lock protection are mask
hashtable and filters list, and classifier handle_idr.
fl_hw_replace_filter() is a sleeping function and cannot be called while
holding a spinlock. In order to execute all sequence of changes to shared
classifier data structures atomically, call fl_hw_replace_filter() before
modifying them.
Vlad Buslov [Thu, 21 Mar 2019 13:17:33 +0000 (15:17 +0200)]
net: sched: flower: don't check for rtnl on head dereference
Flower classifier only changes root pointer during init and destroy. Cls
API implements reference counting for tcf_proto, so there is no danger of
concurrent access to tp when it is being destroyed, even without protection
provided by rtnl lock.
Implement new function fl_head_dereference() to dereference tp->root
without checking for rtnl lock. Use it in all flower function that obtain
head pointer instead of rtnl_dereference().
NeilBrown [Thu, 21 Mar 2019 03:42:40 +0000 (14:42 +1100)]
rhashtable: rename rht_for_each*continue as *from.
The pattern set by list.h is that for_each..continue()
iterators start at the next entry after the given one,
while for_each..from() iterators start at the given
entry.
The rht_for_each*continue() iterators are documented as though the
start at the 'next' entry, but actually start at the given entry,
and they are used expecting that behaviour.
So fix the documentation and change the names to *from for consistency
with list.h
NeilBrown [Thu, 21 Mar 2019 03:42:40 +0000 (14:42 +1100)]
rhashtable: don't hold lock on first table throughout insertion.
rhashtable_try_insert() currently holds a lock on the bucket in
the first table, while also locking buckets in subsequent tables.
This is unnecessary and looks like a hold-over from some earlier
version of the implementation.
As insert and remove always lock a bucket in each table in turn, and
as insert only inserts in the final table, there cannot be any races
that are not covered by simply locking a bucket in each table in turn.
When an insert call reaches that last table it can be sure that there
is no matchinf entry in any other table as it has searched them all, and
insertion never happens anywhere but in the last table. The fact that
code tests for the existence of future_tbl while holding a lock on
the relevant bucket ensures that two threads inserting the same key
will make compatible decisions about which is the "last" table.
This simplifies the code and allows the ->rehash field to be
discarded.
We still need a way to ensure that a dead bucket_table is never
re-linked by rhashtable_walk_stop(). This can be achieved by calling
call_rcu() inside the locked region, and checking with
rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
bucket table is empty and dead.
In order to pave the way for adding some specific Omega PHY features
that may not be desirable on other products covered by the bcm7xxx PHY
driver, split the Omega PHY entry into the Cygnus PHY driver such that
the PHY drivers are reflective of product lines/business units
maintaining them within Broadcom.
No functional changes intended.
====================
Florian Fainelli [Wed, 20 Mar 2019 19:53:13 +0000 (12:53 -0700)]
net: phy: Move Omega PHY entry to Cygnus PHY driver
Cygnus and Omega are part of the same business unit and product line, it
makes sense to group PHY entries by products such that a platform can
select only the drivers that it needs. Bring all the functionality that
the BCM7XXX_28NM_GPHY() macro hides for us and remove the Omega PHY
entry from bcm7xxx.c.
As an added bonus, we now have a proper mdio_device_id entry to permit
auto-loading.
Florian Fainelli [Wed, 20 Mar 2019 19:53:12 +0000 (12:53 -0700)]
net: phy: Prepare for moving Omega out of bcm7xxx
The Omega PHY entry was added to bcm7xxx.c out of convenience and this
breaks the one driver per product line paradigm that was applied up
until now. Since the AFE initialization is shared between Omega and
BCM7xxx move the relevant functions to bcm-phy-lib.[ch]. No functional
changes introduced.
Florian Fainelli [Wed, 20 Mar 2019 16:45:17 +0000 (09:45 -0700)]
net: systemport: Remove print of base address
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.
Florian Fainelli [Wed, 20 Mar 2019 16:45:16 +0000 (09:45 -0700)]
net: dsa: bcm_sf2: Remove print of base address
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, we use a dev_info() print which already
displays the platform device's address.
Florian Fainelli [Wed, 20 Mar 2019 16:45:15 +0000 (09:45 -0700)]
net: phy: mdio-bcm-unimac: Remove print of base address
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.