Git Repo - linux.git/log

Merge branch 'tun-timer-cleanups'

Eric Dumazet says:

====================
tun: timer cleanups

While working on a syzkaller issue that might have been
fixed already by Cong Wang in commit 0ad646c81b21
("tun: call dev_get_valid_name() before register_netdevice()")
I made three small changes related to flow_gc_timer.
====================

Signed-off-by: David S. Miller <[email protected]>

tun: do not arm flow_gc_timer in tun_flow_init()

Timer is properly armed on demand from tun_flow_update(),
so there is no need to arm it at tun init.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tun: avoid extra timer schedule in tun_flow_cleanup()

If tun_flow_cleanup() deleted all flows, no need to
arm the timer again. It will be armed next time
tun_flow_update() is called.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tun: do not block BH again in tun_flow_cleanup()

tun_flow_cleanup() being a timer callback, it is already
running in BH context.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bpf-BASE_RTT'

Lawrence Brakmo says:

====================
bpf: add support for BASE_RTT

This patch set adds the following functionality to socket_ops BPF
programs.
1) Add bpf helper function bpf_getsocketops. Currently only supports
   TCP_CONGESTION
2) Add BPF_SOCKET_OPS_BASE_RTT op to get the base RTT of the
   connection. In general, the base RTT indicates the threshold such
   that RTTs above it indicate congestion. More details in the
   relevant patches.

Consists of the following patches:

[PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT
[PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops
[PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to
[PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program
[PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme
====================

Signed-off-by: David S. Miller <[email protected]>

bpf: create samples/bpf/tcp_bpf.readme

Readme file explaining how to create a cgroupv2 and attach one
of the tcp_*_kern.o socket_ops BPF program.

Signed-off-by: Lawrence Brakmo <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked_by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: sample BPF_SOCKET_OPS_BASE_RTT program

Sample socket_ops BPF program to test the BPF helper function
bpf_getsocketops and the new socket_ops op BPF_SOCKET_OPS_BASE_RTT.

The program provides a base RTT of 80us when the calling flow is
within a DC (as determined by the IPV6 prefix) and the congestion
algorithm is "nv".

Signed-off-by: Lawrence Brakmo <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked_by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv

TCP_NV will try to get the base RTT from a socket_ops BPF program if one
is loaded. NV will then use the base RTT to bound its min RTT (its
notion of the base RTT). It uses the base RTT as an upper bound and 80%
of the base RTT as its lower bound.

In other words, NV will consider filtered RTTs larger than base RTT as a
sign of congestion. As a result, there is no minRTT inflation when there
is a lot of congestion. For example, in a DC where the RTTs are less
than 40us when there is no congestion, a base RTT value of 80us improves
the performance of NV. The difference between the uncongested RTT and
the base RTT provided represents how much queueing we are willing to
have (in practice it can be higher).

NV has been tunned to reduce congestion when there are many flows at the
cost of one flow not achieving full bandwith utilization. When a
reasonable base RTT is provided, one NV flow can now fully utilize the
full bandwidth. In addition, the performance is also improved when there
are many flows.

In the following examples the NV results are using a kernel with this
patch set (i.e. both NV results are using the new nv_loss_dec_factor).

With one host sending to another host and only one flow the
goodputs are:
  Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps

With 2 hosts sending to one host (1 flow per host, the goodput per flow
is:
  Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps

But the RTTs seen by a ping process in the sender is:
  Cubic: 3.3ms  NV: 97us,  NV (baseRTT=80us): 146us

With a lot of flows things look even better for NV with baseRTT. Here we
have 3 hosts sending to one host. Each sending host has 6 flows: 1
stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully
utilize the full available bandwidth. However, the distribution of
bandwidth among the flows is very different. For the 10KB RPC flow:
  Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps

The 99% latencies for the 10KB flows are:
  Cubic: 26ms,  NV: 1ms,  NV (baseRTT=80us): 500us

The RTT seen by a ping process at the senders:
  Cubic: 3.2ms  NV: 720us,  NV (baseRTT=80us): 330us

Signed-off-by: Lawrence Brakmo <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: Adding helper function bpf_getsockops

Adding support for helper function bpf_getsockops to socket_ops BPF
programs. This patch only supports TCP_CONGESTION.

Signed-off-by: Vlad Vysotsky <[email protected]>
Acked-by: Lawrence Brakmo <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: add support for BPF_SOCK_OPS_BASE_RTT

A congestion control algorithm can make a call to the BPF socket_ops
program to request the base RTT. The base RTT can be congestion control
dependent and is meant to represent a congestion threshold such that
RTTs above it indicate congestion. This is especially useful for flows
within a DC where the base RTT is easy to obtain.

Being provided a base RTT solves a basic problem in RTT based congestion
avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
to get the base RTT when the network is not congested, it is very
diffcult to do when it is very congested. Newer connections get an
inflated value of the base RTT leading to unfariness (newer flows with a
larger base RTT get more bandwidth). As a result, RTT based congestion
avoidance algorithms tend to update their base RTTs to improve fairness.
In very congested networks this can lead to base RTT inflation, reducing
the ability of these RTT based congestion control algorithms to prevent
congestion.

Note that in my experiments with TCP-NV, the base RTT provided can be
much larger than the actual hardware RTT. For example, experimenting
with hosts within a rack where the hardware RTT is 16-20us, I've used
base RTTs up to 150us. The effect of using a larger base RTT is that the
congestion avoidance algorithm will allow more queueing. When there are
only a few flows the main effect is larger measured RTTs and RPC
latencies due to the increased queueing. When there are a lot of flows,
a larger base RTT can lead to more congestion and more packet drops.
For this case, where the hardware RTT is 20us, a base RTT of 80us
produces good results.

This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this
set adds support for using it in TCP-NV. Further study and testing is
needed before support can be added to other delay based congestion
avoidance algorithms.

Signed-off-by: Lawrence Brakmo <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: use struct fields for 8 bit-wide access

Use direct access struct fields rather than PREP_FIELD()
macros to manipulate the jump ID and length, both of which
are exactly 8-bits wide. This simplifies the code somewhat.

Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: x25: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: af_unix: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

rxrpc: Don't release call mutex on error pointer

Don't release call mutex at the end of rxrpc_kernel_begin_call() if the
call pointer actually holds an error value.

Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Reported-by: Marc Dionne <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'stmmac-hw-tstamp-fixes'

Jose Abreu says:

====================
net: stmmac: Fix HW timestamping

Three fixes for HW timestamping feature, all of them for RX side.
====================

Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Prevent infinite loop in get_rx_timestamp_status()

Prevent infinite loop by correctly setting the loop condition to
break when i == 10.

Signed-off-by: Jose Abreu <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Joao Pinto <[email protected]>
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix stmmac_get_rx_hwtstamp()

When using GMAC4 the valid timestamp is from CTX next desc but
we are passing the previous desc to get_rx_timestamp_status()
callback.

Fix this and while at it rework a little bit the function logic.

Signed-off-by: Jose Abreu <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Joao Pinto <[email protected]>
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Add missing call to dev_kfree_skb()

When RX HW timestamp is enabled and a frame is discarded we are
not freeing the skb but instead only setting to NULL the entry.

Add a call to dev_kfree_skb_any() so that skb entry is correctly
freed.

Signed-off-by: Jose Abreu <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Joao Pinto <[email protected]>
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

- joydev now implements a blacklist to avoid creating joystick nodes
   for accelerometers found in composite devices such as PlaStation
   controllers

- assorted driver fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: ims-psu - check if CDC union descriptor is sane
  Input: joydev - blacklist ds3/ds4/udraw motion sensors
  Input: allow matching device IDs on property bits
  Input: factor out and export input_device_id matching code
  Input: goodix - poll the 'buffer status' bit before reading data
  Input: axp20x-pek - fix module not auto-loading for axp221 pek
  Input: tca8418 - enable interrupt after it has been requested
  Input: stmfts - fix setting ABS_MT_POSITION_* maximum size
  Input: ti_am335x_tsc - fix incorrect step config for 5 wire touchscreen
  Input: synaptics - disable kernel tracking on SMBus devices

geneve: Get rid of is_all_zero(), streamline is_tnl_info_zero()

No need to re-invent memchr_inv() with !is_all_zero(). While at
it, replace conditional and return clauses with a single return
clause in is_tnl_info_zero().

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'dsa-lan9303-Add-fdb-mdb-methods'

Egil Hjelmeland says:

====================
net: dsa: lan9303: Add fdb/mdb methods

This series add support for accessing and managing the lan9303 ALR
(Address Logic Resolution).

The first patch add low level functions for accessing the ALR, along
with port_fast_age and port_fdb_dump methods.

The second patch add functions for managing ALR entires, along with
remaining fdb/mdb methods.

Note that to complete STP support, a special ALR entry with the STP eth
address must be added too. This must be addressed later.

Comments welcome!

Changes v2 -> v3:
- Whitespace polishing. Removed some "section" comments.
- Prefixed ALR constants with LAN9303_ for consistency.
- Patch 2: lan9303_port_fast_age() wrap the "port" into a struct for passing
as context to alr_loop_cb_del_port_learned. Safer in event of type change.
- Patch 2: Reviewed-by: Vivien Didelot <[email protected]>

Changes v1 -> v2:
- Patch 2: Removed question comment
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: Add fdb/mdb manipulation

Add functions for managing the lan9303 ALR (Address Logic
Resolution).

Implement DSA methods: port_fdb_add, port_fdb_del, port_mdb_prepare,
port_mdb_add and port_mdb_del.

Since the lan9303 do not offer reading specific ALR entry, the driver
caches all static entries - in a flat table.

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods

Add DSA method port_fast_age as a step to STP support.

Add low level functions for accessing the lan9303 ALR (Address Logic
Resolution).

Added DSA method port_fdb_dump

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"MS_I_VERSION fixes - Mimi's fix + missing bits picked from Matthew
  (his patch contained a duplicate of the fs/namespace.c fix as well,
  but by that point the original fix had already been applied)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Convert fs/*/* to SB_I_VERSION
  vfs: fix mounting a filesystem with i_version

tipc: refactor tipc_sk_timeout() function

The function tipc_sk_timeout() is more complex than necessary, and
even seems to contain an undetected bug. At one of the occurences
where we renew the timer we just order it with (HZ / 20), instead
of (jiffies + HZ / 20);

In this commit we clean up the function.

Acked-by: Ying Xue <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-driver-refcont_t'

Elena Reshetova says:

====================
networking drivers refcount_t conversions

Note: these are the last patches related to networking that perform
conversion of refcounters from atomic_t to refcount_t.
In contrast to the core network refcounter conversions that
were merged earlier, these are much more straightforward ones.

This series, for various networking drivers, replaces atomic_t reference
counters with the new refcount_t type and API (see include/linux/refcount.h).
By doing this we prevent intentional or accidental
underflows or overflows that can led to use-after-free vulnerabilities.

The patches are fully independent and can be cherry-picked separately.
Patches are based on top of net-next.
If there are no objections to the patches, please merge them via respective trees
====================

Signed-off-by: David S. Miller <[email protected]>

drivers, connector: convert cn_callback_entry.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable cn_callback_entry.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable syncppp.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable ppp_file.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable asyncppp.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net: convert masces_tx_sa.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_tx_sa.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net: convert masces_rx_sc.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_rx_sc.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net: convert masces_rx_sa.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable masces_rx_sa.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, hamradio: convert sixpack.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable sixpack.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, mlx5: convert fs_node.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable fs_node.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, mlx5: convert mlx5_cq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx5_cq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, mlx4: convert mlx4_srq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_srq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, mlx4: convert mlx4_qp.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_qp.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, mlx4: convert mlx4_cq.refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_cq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, ethernet: convert mtk_eth.dma_refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mtk_eth.dma_refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers, net, ethernet: convert clip_entry.refcnt from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable clip_entry.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mlxsw-fixes'

Jiri Pirko says:

====================
mlxsw: spectrum: Configure TTL of "inherit" for offloaded tunnels

Petr says:

Currently mlxsw only offloads tunnels that are configured with TTL of "inherit"
(which is the default). However, Spectrum defaults to 255 and the driver
neglects to change the configuration. Thus the tunnel packets from offloaded
tunnels always have TTL of 255, even though tunnels with explicit TTL of 255 are
never actually offloaded.

To fix this, introduce support for TIGCR, the register that keeps the related
bits of global tunnel configuration, and use it on first offload to properly
configure inheritance of TTL of tunnel packets from overlay packets.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Configure TIGCR on init

Spectrum tunnels do not default to ttl of "inherit" like the Linux ones
do. Configure TIGCR on router init so that the TTL of tunnel packets is
copied from the overlay packets.

Fixes: ee954d1a91b2 ("mlxsw: spectrum_router: Support GRE tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: reg: Add Tunneling IPinIP General Configuration Register

The TIGCR register is used for setting up the IPinIP Tunnel
configuration.

Fixes: ee954d1a91b2 ("mlxsw: spectrum_router: Support GRE tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hns3-loopback-selftest'

Yunsheng Lin says:

====================
Add mac loopback selftest support in hns3 driver

This patchset refactors the skb receiving and transmitting function
before adding mac loopback selftest support in hns3 driver.
====================

Signed-off-by: David S. Miller <[email protected]>

net: hns3: Add mac loopback selftest support in hns3 driver

This patch adds mac loopback selftest support for ethtool cmd
by checking if a transmitted packet can be received correctly
when mac loopback is enabled.

Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: Refactor the skb receiving and transmitting function

This patch refactors the skb receiving and transmitting functions
and export them in order to support the ethtool's mac loopback
selftest.

Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethtool: remove error check for legacy setting transceiver type

Commit 9cab88726929605 ("net: ethtool: Add back transceiver type")
restores the transceiver type to struct ethtool_link_settings and
convert_link_ksettings_to_legacy_settings() but forgets to remove the
error check for the same in convert_legacy_settings_to_link_ksettings().
This prevents older versions of ethtool to change link settings.

    # ethtool --version
    ethtool version 3.16

    # ethtool -s eth0 autoneg on speed 100 duplex full
    Cannot set new settings: Invalid argument
      not setting speed
      not setting duplex
      not setting autoneg

While newer versions of ethtool works.

    # ethtool --version
    ethtool version 4.10

    # ethtool -s eth0 autoneg on speed 100 duplex full
    [   57.703268] sh-eth ee700000.ethernet eth0: Link is Down
    [   59.618227] sh-eth ee700000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

Fixes: 19cab88726929605 ("net: ethtool: Add back transceiver type")
Signed-off-by: Niklas Söderlund <[email protected]>
Reported-by: Renjith R V <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bpftool-add-a-version-command-and-fix-several-items'

Jakub Kicinski says:

====================
tools: bpftool: add a "version" command, and fix several items

Quentin says:

The first seven patches of this series bring several minor fixes to
bpftool. Please see individual commit logs for details.

Last patch adds a "version" commands to bpftool, which is in fact the
version of the kernel from which it was compiled.
====================

Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: add a command to display bpftool version

This command can be used to print the version of the tool, which is in
fact the version from Linux taken from usr/include/linux/version.h.

Example usage:

$ bpftool version
bpftool v4.14.0

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: show that `opcodes` or `file FILE` should be exclusive

For the `bpftool prog dump { jited | xlated } ...` command, adding
`opcodes` keyword (to request opcodes to be printed) will have no effect
if `file FILE` (to write binary output to FILE) is provided.

The manual page and the help message to be displayed in the terminal
should reflect that, and indicate that these options should be mutually
exclusive.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: print all relevant byte opcodes for "load double word"

The eBPF instruction permitting to load double words (8 bytes) into a
register need 8-byte long "immediate" field, and thus occupy twice the
space of other instructions. bpftool was aware of this and would
increment the instruction counter only once on meeting such instruction,
but it would only print the first four bytes of the immediate value to
load. Make it able to dump the whole 16 byte-long double instruction
instead (as would `llvm-objdump -d <program>`).

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: print only one error message on byte parsing failure

Make error messages more consistent. Specifically, when bpftool fails at
parsing map key bytes, make it print a single error message to stderr
and return from the function, instead of (always) printing a second
error message afterwards.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: add `bpftool prog help` as real command i.r.t exit code

Make error messages and return codes more consistent. Specifically, make
`bpftool prog help` a real command, instead of printing usage by default
for a non-recognized "help" command. Output is the same, but this makes
bpftool return with a success value instead of an error.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: use err() instead of info() if there are too many insns

Make error messages and return codes more consistent. Specifically,
replace the use of info() macro with err() when too many eBPF
instructions are received to be dumped, given that bpftool returns with
a non-null exit value in that case.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: fix return value when all eBPF programs have been shown

Change the program to have a more consistent return code. Specifically,
do not make bpftool return an error code simply because it reaches the
end of the list of the eBPF programs to show.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools: bpftool: add pointer to file argument to print_hex()

Make print_hex() able to print to any file instead of standard output
only, and rename it to fprint_hex(). The function can now be called with
the info() macro, for example, without splitting the output between
standard and error outputs.

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

soreuseport: fix initialization race

Syzkaller stumbled upon a way to trigger
WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

There are two initialization paths for the sock_reuseport structure in a
socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through
SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation
assumedthat the socket lock protected both of these paths when it actually
only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered this
double allocation by running these paths concurrently.

This patch moves the check for double allocation into the reuseport_alloc
function which is protected by a global spin lock.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: rose: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: conntrack: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case I placed a "fall through" comment on
its own line, which is what GCC is expecting to find.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: netrom: nr_in: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: fix returning of vlan range op errors

When vlan tunnels were introduced, vlan range errors got silently
dropped and instead 0 was returned always. Restore the previous
behaviour and return errors to user-space.

Fixes: efa5356b0d97 ("bridge: per vlan dst_metadata netlink support")
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Acked-by: Roopa Prabhu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sock: correct sk_wmem_queued accounting on efault in tcp zerocopy

Syzkaller hits WARN_ON(sk->sk_wmem_queued) in sk_stream_kill_queues
after triggering an EFAULT in __zerocopy_sg_from_iter.

On this error, skb_zerocopy_stream_iter resets the skb to its state
before the operation with __pskb_trim. It cannot kfree_skb like
datagram callers, as the skb may have data from a previous send call.

__pskb_trim calls skb_condense for unowned skbs, which adjusts their
truesize. These tcp skbuffs are owned and their truesize must add up
to sk_wmem_queued. But they match because their skb->sk is NULL until
tcp_transmit_skb.

Temporarily set skb->sk when calling __pskb_trim to signal that the
skbuffs are owned and avoid the skb_condense path.

Fixes: 52267790ef52 ("sock: add MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bpf-range-marking-fixes'

Daniel Borkmann says:

====================
Two BPF fixes for range marking

The set contains two fixes for direct packet access range
markings and test cases for all direct packet access patterns
that the verifier matches on.

They are targeted for net tree, note that once net gets merged
into net-next, there will be a minor merge conflict due to
signature change of the function find_good_pkt_pointers() as
well as data_meta patterns present in net-next tree. You can
just add bool false to the data_meta patterns and I will
follow-up with properly converting the patterns for data_meta
in a similar way.
====================

Signed-off-by: David S. Miller <[email protected]>

bpf: add test cases to bpf selftests to cover all access tests

Lets add test cases to cover really all possible direct packet
access tests for good/bad access cases so we keep tracking them.

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: fix pattern matches for direct packet access

Alexander had a test program with direct packet access, where
the access test was in the form of data + X > data_end. In an
unrelated change to the program LLVM decided to swap the branches
and emitted code for the test in form of data + X <= data_end.
We hadn't seen these being generated previously, thus verifier
would reject the program. Therefore, fix up the verifier to
detect all test cases, so we don't run into such issues in the
future.

Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
Reported-by: Alexander Alemayhu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: fix off by one for range markings with L{T, E} patterns

During review I noticed that the current logic for direct packet
access marking in check_cond_jmp_op() has an off by one for the
upper right range border when marking in find_good_pkt_pointers()
with BPF_JLT and BPF_JLE. It's not really harmful given access
up to pkt_end is always safe, but we should nevertheless correct
the range marking before it becomes ABI. If pkt_data' denotes a
pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end
in the true branch as well as for pkt_end <= pkt_data' in the false
branch we mark the range with X although it should really be X - 1
in these cases. For example, X could be pkt_end - pkt_data, then
when testing for pkt_data' < pkt_end the verifier simulation cannot
deduce that a byte load of pkt_data' - 1 would succeed in this
branch.

Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: devmap fix arithmetic overflow in bitmap_size calculation

An integer overflow is possible in dev_map_bitmap_size() when
calculating the BITS_TO_LONG logic which becomes, after macro
replacement,

(((n) + (d) - 1)/ (d))

where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid
overflow cast to u64 before arithmetic.

Reported-by: Richard Weinberger <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'dmaengine-fix-4.14-rc6' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fix from Vinod Koul:
"Late fix for altera driver which fixes the locking in driver"

* tag 'dmaengine-fix-4.14-rc6' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: altera: Use IRQ-safe spinlock calls in the error paths as well

Merge branch 'aquantia-fixes'

Igor Russkikh says:

====================
net: aquantia: Atlantic driver 10/2017 updates

This patchset fixes various issues in driver,
improves parameters for better performance on 10Gbit link
====================

Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Bad udp rate on default interrupt coalescing

Default Tx rates cause very long ISR delays on Tx.
0xff is 510us delay, giving only ~ 2000 interrupts per seconds for
Tx rings cleanup. With these settings udp tx rate was never higher than
~800Mbps on a single stream. Changing min delay to 0xF makes it
way better with ~6Gbps

TCP stream performance is almost unaffected by this change, since LSO
optimizations play important role.

CPU load is affected insignificantly by this change.

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Enable coalescing management via ethtool interface

Aquantia NIC allows both TX and RX interrupt throttle rate (ITR)
management, but this was used in a very limited way via predefined
values. This patch allows to setup ITR default values via module
command line arguments and via standard ethtool coalescing settings.

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: mmio unmap was not performed on driver removal

That may lead to mmio resource leakage.

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Limit number of MSIX irqs to the number of cpus

There is no much practical use from having MSIX vectors more that number
of cpus, thus cap this first with preconfigured limit, then with number
of cpus online.

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Fixed transient link up/down/up notification

When doing ifconfig down/up, driver did not reported carrier_off neither
in nic_stop nor in nic_start. That caused link to be visible as "up"
during couple of seconds immediately after "ifconfig up".

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Add queue restarts stats counter

Queue stat strings are cleaned up, duplicate stat name strings removed,
queue restarts counter added

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: aquantia: Reset nic statistics on interface up/down

Internal statistics system on chip never gets reset until hardware
reboot. This is quite inconvenient in terms of ethtool statistics usage.

This patch implements incremental statistics update inside of
service callback.

Upon nic initialization, first request is done to fetch
initial stat data, current collected stat data gets cleared.
Internal statistics mailbox readout is improved to save space and
increase readability

Signed-off-by: Pavel Belous <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt: Move generic devlink code to new file

Moving generic devlink code (registration) out of VF-R code
into new bnxt_devlink file, in preparation for future work
to add additional devlink functionality to bnxt.

Signed-off-by: Steve Lin <[email protected]>
Acked-by: Andy Gospodarek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tipc: fix broken tipc_poll() function

In commit ae236fb208a6 ("tipc: receive group membership events via
member socket") we broke the tipc_poll() function by checking the
state of the receive queue before the call to poll_sock_wait(), while
relying that state afterwards, when it might have changed.

We restore this in this commit.

Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-sched-convert-cls-ndo_setup_tc-offload-calls-to-per-block-callbacks'

Jiri Pirko says:

====================
net: sched: convert cls ndo_setup_tc offload calls to per-block callbacks

This patchset is a bit bigger, but most of the patches are doing the
same changes in multiple classifiers and drivers. I could do some
squashes, but I think it is better split.

This is another dependency on the way to shared block implementation.
The goal is to remove use of tp->q in classifiers code.

Also, this provides drivers possibility to track binding of blocks to
qdiscs. Legacy drivers which do not support shared block offloading.
register one callback per binding. That maintains the current
functionality we have with ndo_setup_tc. Drivers which support block
sharing offload register one callback per block which safes overhead.

Patches 1-4 introduce the binding notifications and per-block callbacks
Patches 5-8 add block callbacks calls to classifiers
Patches 9-17 do convert from ndo_setup_tc calls to block callbacks for
classifier offloads in drivers
Patches 18-20 do cleanup

v1->v2:
- patch1:
- move new enum value to the end
====================

Signed-off-by: David S. Miller <[email protected]>

net: sched: remove unused is_classid_clsact_ingress/egress helpers

These helpers are no longer in use by drivers, so remove them.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: remove unused classid field from tc_cls_common_offload

It is no longer used by the drivers, so remove it.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: avoid ndo_setup_tc calls for TC_SETUP_CLS*

All drivers are converted to use block callbacks for TC_SETUP_CLS*.
So it is now safe to remove the calls to ndo_setup_tc from cls_*

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dsa: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for matchall offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: bpf: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for bpf offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: flower: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for flower offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlx5e_rep: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for flower offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ixgbe: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for u32 offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for flower and u32 offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for flower offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlx5e: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for flower offloads to block callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum: Convert ndo_setup_tc offloads to block callbacks

Benefit from the newly introduced block callback infrastructure and
convert ndo_setup_tc calls for matchall and flower offloads to block
callbacks.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: cls_bpf: call block callbacks for offload

Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: cls_u32: call block callbacks for offload

Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: cls_u32: swap u32_remove_hw_knode and u32_remove_hw_hnode

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: cls_matchall: call block callbacks for offload

Use the newly introduced callbacks infrastructure and call block
callbacks alongside with the existing per-netdev ndo_setup_tc.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: use tc_setup_cb_call to call per-block callbacks

Extend the tc_setup_cb_call entrypoint function originally used only for
action egress devices callbacks to call per-block callbacks as well.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: introduce per-block callbacks

Introduce infrastructure that allows drivers to register callbacks that
are called whenever tc would offload inserted rule for a specific block.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: use extended variants of block_get/put in ingress and clsact qdiscs

Use previously introduced extended variants of block get and put
functions. This allows to specify a binder types specific to clsact
ingress/egress which is useful for drivers to distinguish who actually
got the block.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>