Git Repo - linux.git/log

ionic: remove adminq napi instance

Remove the adminq's napi struct when tearing down
the adminq.

Fixes: 1d062b7b6f64 ("ionic: Add basic adminq support")
Signed-off-by: Shannon Nelson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ionic: deinit rss only if selected

Don't bother de-initing RSS if it wasn't selected.

Fixes: aa3198819bea ("ionic: Add RSS support")
Signed-off-by: Shannon Nelson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ionic: stop devlink warn on mgmt device

If we don't set a port type, the devlink code will eventually
print a WARN in the kernel log. Because the mgmt device is
not really a useful port, don't register it as a devlink port.

Fixes: b3f064e9746d ("ionic: add support for device id 0x1004")
Signed-off-by: Shannon Nelson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net_sched-allow-use-of-hrtimer-slack'

Eric Dumazet says:

====================
net_sched: allow use of hrtimer slack

Packet schedulers have used hrtimers with exact expiry times.

Some of them can afford having a slack, in order to reduce
the number of timer interrupts and feed bigger batches
to increase efficiency.

FQ for example does not care if throttled packets are
sent with an additional (small) delay.

Original observation of having maybe too many interrupts
was made by Willem de Bruijn.

v2: added strict netlink checking (Jakub Kicinski)
====================

Signed-off-by: David S. Miller <[email protected]>

net_sched: sch_fq: enable use of hrtimer slack

Add a new attribute to control the fq qdisc hrtimer slack.

Default is set to 10 usec.

When/if packets are throttled, fq set up an hrtimer that can
lead to one interrupt per packet in the throttled queue.

By using a timer slack, we allow better use of timer interrupts,
by giving them a chance to call multiple timer callbacks
at each hardware interrupt.

Also, giving a slack allows FQ to dequeue batches of packets
instead of a single one, thus increasing xmit_more efficiency.

This has no negative effect on the rate a TCP flow can sustain,
since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns)

v2: added strict netlink checking (as feedback from Jakub Kicinski)

Tested:
1000 concurrent flows all using paced packets.
1,000,000 packets sent per second.

Before the patch :

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  0      0 60726784  23628 3485992    0    0   138     1  977  535  0 12 87  0  0
0  0      0 60714700  23628 3485628    0    0     0     0 1568827 26462  0 22 78  0  0
1  0      0 60716012  23628 3485656    0    0     0     0 1570034 26216  0 22 78  0  0
0  0      0 60722420  23628 3485492    0    0     0     0 1567230 26424  0 22 78  0  0
0  0      0 60727484  23628 3485556    0    0     0     0 1568220 26200  0 22 78  0  0
2  0      0 60718900  23628 3485380    0    0     0    40 1564721 26630  0 22 78  0  0
2  0      0 60718096  23628 3485332    0    0     0     0 1562593 26432  0 22 78  0  0
0  0      0 60719608  23628 3485064    0    0     0     0 1563806 26238  0 22 78  0  0
1  0      0 60722876  23628 3485236    0    0     0   130 1565874 26566  0 22 78  0  0
1  0      0 60722752  23628 3484908    0    0     0     0 1567646 26247  0 22 78  0  0

After the patch, slack of 10 usec, we can see a reduction of interrupts
per second, and a small decrease of reported cpu usage.

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 60722564  23628 3484728    0    0   133     1  696  545  0 13 87  0  0
1  0      0 60722568  23628 3484824    0    0     0     0 977278 25469  0 20 80  0  0
0  0      0 60716396  23628 3484764    0    0     0     0 979997 25326  0 20 80  0  0
0  0      0 60713844  23628 3484960    0    0     0     0 981394 25249  0 20 80  0  0
2  0      0 60720468  23628 3484916    0    0     0     0 982860 25062  0 20 80  0  0
1  0      0 60721236  23628 3484856    0    0     0     0 982867 25100  0 20 80  0  0
1  0      0 60722400  23628 3484456    0    0     0     8 982698 25303  0 20 80  0  0
0  0      0 60715396  23628 3484428    0    0     0     0 981777 25176  0 20 80  0  0
0  0      0 60716520  23628 3486544    0    0     0    36 978965 27857  0 21 79  0  0
0  0      0 60719592  23628 3486516    0    0     0    22 977318 25106  0 20 80  0  0

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: do not reprogram a timer about to expire

qdisc_watchdog_schedule_range_ns() can use the newly added slack
and avoid rearming the hrtimer a bit earlier than the current
value. This patch has no effect if delta_ns parameter
is zero.

Note that this means the max slack is potentially doubled.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: add qdisc_watchdog_schedule_range_ns()

Some packet schedulers might want to add a slack
when programming hrtimers. This can reduce number
of interrupts and increase batch sizes and thus
give good xmit_more savings.

This commit adds qdisc_watchdog_schedule_range_ns()
helper, with an extra delta_ns parameter.

Legacy qdisc_watchdog_schedule_n() becomes an inline
passing a zero slack.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'nfp-type'

Jakub Kicinski says:

====================
net: rename flow_action stats and set NFP type

Jiri, I hope this is okay with you, I just dropped the "type" from
the helper and value names, and now things should be able to fit
on a line, within 80 characters.

Second patch makes the NFP able to offload DELAYED stats, which
is the type it supports.
====================

Signed-off-by: David S. Miller <[email protected]>

nfp: allow explicitly selected delayed stats

NFP flower offload uses delayed stats. Kernel recently gained
the ability to specify stats types. Make nfp accept DELAYED
stats, not just the catch all "any".

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*

flow_action_hw_stats_types_check() helper takes one of the
FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
the arguments to the opening bracket of the helper there
is no way to call this helper and stay under 80 characters.

Remove the "types" part from the new flow_action helpers
and enum values.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-phy-improve-phy_driver-callback-handle_interrupt'

Heiner Kallweit says:

====================
net: phy: improve phy_driver callback handle_interrupt

did_interrupt() clears the interrupt, therefore handle_interrupt() can
not check which event triggered the interrupt. To overcome this
constraint and allow more flexibility for customer interrupt handlers,
let's decouple handle_interrupt() from parts of the phylib interrupt
handling. Custom interrupt handlers now have to implement the
did_interrupt() functionality in handle_interrupt() if needed.

Fortunately we have just one custom interrupt handler so far (in the
mscc PHY driver), convert it to the changed API and make use of the
benefits.
====================

Signed-off-by: David S. Miller <[email protected]>

net: phy: mscc: consider interrupt source in interrupt handler

Trigger the respective interrupt handler functionality only if the
related interrupt source bit is set.

Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: improve phy_driver callback handle_interrupt

did_interrupt() clears the interrupt, therefore handle_interrupt() can
not check which event triggered the interrupt. To overcome this
constraint and allow more flexibility for customer interrupt handlers,
let's decouple handle_interrupt() from parts of the phylib interrupt
handling. Custom interrupt handlers now have to implement the
did_interrupt() functionality in handle_interrupt() if needed.

Fortunately we have just one custom interrupt handler so far (in the
mscc PHY driver), convert it to the changed API.

Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ethtool-consolidate-irq-coalescing-last-part'

Jakub Kicinski says:

====================
ethtool: consolidate irq coalescing - last part

Convert remaining drivers following the groundwork laid in a recent
patch set [1] and continued in [2], [3], [4], [5]. The aim of
the effort is to consolidate irq coalescing parameter validation
in the core.

This set is the sixth and last installment. It converts the remaining
8 drivers in drivers/net/ethernet. The last patch makes declaring
supported IRQ coalescing parameters a requirement.

[1] https://lore.kernel.org/netdev/20200305051542 [email protected]/
[2] https://lore.kernel.org/netdev/20200306010602.1620354 [email protected]/
[3] https://lore.kernel.org/netdev/20200310021512.1861626 [email protected]/
[4] https://lore.kernel.org/netdev/20200311223302.2171564 [email protected]/
[5] https://lore.kernel.org/netdev/20200313040803.2367590 [email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>

net: ethtool: require drivers to set supported_coalesce_params

Now that all in-tree drivers have been updated we can
make the supported_coalesce_params mandatory.

To save debugging time in case some driver was missed
(or is out of tree) add a warning when netdev is registered
with set_coalesce but without supported_coalesce_params.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: axienet: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters. No functional changes.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ll_temac: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters. No functional changes.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: davinci_emac: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: cpsw: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Grygorii Strashko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: tehuti: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dwc-xlgmac: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver already correctly rejected all unsupported
parameters.

While at it remove unnecessary zeroing on get.

No functional changes.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: socionext: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sfc: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.
The check for use_adaptive_tx_coalesce will now be done by
the core.

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: Avoid forwarding to other eswitch uplink

Do not allow forwarding of encapsulated traffic received from one eswtich's
uplink to another eswtich's uplink.

Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Eswitch, enable forwarding back to uplink port

Add dependencny on cap termination_table_raw_traffic to allow non
encapsulated packets received from uplink to be forwarded back to the
received uplink port.

Refactor the conditions into a separate function.

Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Add support for offloading traffic from uplink to uplink

Termination tables change the direction of a packet in hw from RX to SX
pipeline. Use that to offload hairpin flows received from uplink and
sent back to uplink.

Currently termination tables are used for pushing VLAN to packets
received from uplink and targeting a VF. Extend the implementation to
allow forwarding packets to uplink. These packets can either be
encapsulated or not.

In case encapsulation is needed before forwarding, move the reformat
object to the termination table as required.

Extend the hash table key to include tunnel information for the sake of
reusing reformat objects.

Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Don't use termination tables in slow path

Don't use termination tables for packets that are steered to the slow path,
as a pre-step for supporting packet encap (packet reformat) action on
termination tables. Packet encap (reformat action) actions steer the packet
to the slow path until outer arp entries are resolved.

Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Avoid configuring eswitch QoS if not supported

Check if QoS is enabled for the eswitch before attempting to configure
QoS parameters and emit a netlink error if not supported.

Introduce an API to check if QoS is supported for the eswitch.

Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix devlink port register sequence

If udevd is configured to rename interfaces according to persistent
naming rules and if a network interface has phys_port_name in sysfs,
its contents will be appended to the interface name.
However, register_netdev creates device in sysfs and if
devlink_port_register is called after that, there is a timeframe in
which udevd may read an empty phys_port_name value. The consequence is
that the interface will lose this suffix and its name will not be
really persistent.

The solution is to register the port before registering a netdev.

Fixes: c6acd629eec7 ("net/mlx5e: Add support for devlink-port in non-representors mode")
Signed-off-by: Vladyslav Tarasiuk <[email protected]>
Reviewed-by: Maxim Mikityanskiy <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix rejecting all egress rules not on vlan

The original condition rejected all egress rules that
are not on tunnel device.
Also, the whole point of this egress reject was to disallow bad
rules because of egdev which doesn't exists today, so remove
this check entirely.

Fixes: 0a7fcb78cc21 ("net/mlx5e: Support inner header rewrite with goto action")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Vlad Buslov <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: en_tc: Rely just on register loopback for tunnel restoration

Register loopback which is needed for tunnel restoration, is now always
enabled if supported and not just with metadata enabled, check for
that instead.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: CT: Fix stack usage compiler warning

Fix the following warnings: [-Werror=frame-larger-than=]

In function ‘mlx5_tc_ct_entry_add_rule’:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:541:1:
error: the frame size of 1136 bytes is larger than 1024 bytes

In function ‘__mlx5_tc_ct_flow_offload’:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:1049:1:
error: the frame size of 1168 bytes is larger than 1024 bytes

Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>

net/mlx5e: CT: Fix insert rules when TC_CT config isn't enabled

If CONFIG_MLX5_TC_CT isn't enabled, all offloading of eswitch tc rules
fails on parsing ct match, even if there is no ct match.

Return success if there is no ct match, regardless of config.

Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: CT: remove set but not used variable 'unnew'

drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:
In function mlx5_tc_ct_parse_match:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:699:36: warning:
variable unnew set but not used [-Wunused-but-set-variable]

Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: E-Switch, Skip restore modify header between prios of same chain

Restore modify header writes the chain mapping on the packet.
This modify header and action is added on all prios connections,
and gets overwritten with the same value consecutively in prios
of the same chain.

Use the chain's modify header only for the last prio of a given tc
chain.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: E-Switch: Fix using fwd and modify when firmware doesn't support it

Currently, if firmware doesn't support fwd and modify, driver fails
initializing eswitch chains while entering switchdev mode.

Instead, on such cases, disable the chains and prio feature (as we can't
restore the chain on miss) and the usage of fwd and modify.

Fixes: 8f1e0b97cc70 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Add missing inline to stub esw_add_restore_rule

When CONFIG_MLX5_ESWITCH is unset, clang warns:

In file included from drivers/net/ethernet/mellanox/mlx5/core/main.c:58:
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:670:1: warning: unused
function 'esw_add_restore_rule' [-Wunused-function]
esw_add_restore_rule(struct mlx5_eswitch *esw, u32 tag)
^
1 warning generated.

This stub function is missing inline; add it to suppress the warning.

Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

netfilter: Introduce egress hook

Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.

Allow the same on egress.  Position the hook immediately before a packet
is handed to tc and then sent out on an interface, thereby mirroring the
ingress order.  This order allows marking packets in the netfilter
egress hook and subsequently using the mark in tc.  Another benefit of
this order is consistency with a lot of existing documentation which
says that egress tc is performed after netfilter hooks.

Egress hooks already exist for the most common protocols, such as
NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
they are executed earlier during packet processing.  However for more
exotic protocols, there is currently no provision to apply netfilter on
egress.  A common workaround is to enslave the interface to a bridge and
use ebtables, or to resort to tc.  But when the ingress hook was
introduced, consensus was that users should be given the choice to use
netfilter or tc, whichever tool suits their needs best:
https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
This hook is also useful for NAT46/NAT64, tunneling and filtering of
locally generated af_packet traffic such as dhclient.

There have also been occasional user requests for a netfilter egress
hook in the past, e.g.:
https://www.spinics.net/lists/netfilter/msg50038.html

Performance measurements with pktgen surprisingly show a speedup rather
than a slowdown with this commit:

* Without this commit:
  Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
  2920481pps 1401Mb/sec (1401830880bps) errors: 0

* With this commit:
  Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
  2941410pps 1411Mb/sec (1411876800bps) errors: 0

* Without this commit + tc egress:
  Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
  2562631pps 1230Mb/sec (1230062880bps) errors: 0

* With this commit + tc egress:
  Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
  2659259pps 1276Mb/sec (1276444320bps) errors: 0

* With this commit + nft egress:
  Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
  2413320pps 1158Mb/sec (1158393600bps) errors: 0

Tested on a bare-metal Core i7-3615QM, each measurement was performed
three times to verify that the numbers are stable.

Commands to perform a measurement:
modprobe pktgen
echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000

Commands for testing tc egress:
tc qdisc add dev lo clsact
tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32

Commands for testing nft egress:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop

All testing was performed on the loopback interface to avoid distorting
measurements by the packet handling in the low-level Ethernet driver.

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: Generalize ingress hook

Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af09 ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").

In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.

No functional change intended.

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: Rename ingress hook include file

Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.

No functional change intended.

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

Merge branch 'tcp-fix-stretch-ACK-bugs-in-congestion-control-modules'

Pengcheng Yang says:

====================
tcp: fix stretch ACK bugs in congestion control modules

"stretch ACKs" (caused by LRO, GRO, delayed ACKs or middleboxes)
can cause serious performance shortfalls in common congestion
control algorithms. Neal Cardwell submitted a series of patches
starting with commit e73ebb0881ea ("tcp: stretch ACK fixes prep")
to handle stretch ACKs and fixed stretch ACK bugs in Reno and
CUBIC congestion control algorithms.

This patch series continues to fix bic, scalable, veno and yeah
congestion control algorithms to handle stretch ACKs.

Changes in v2:
- Provide [PATCH 0/N] to describe the modifications of this patch series
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: fix stretch ACK bugs in Yeah

Change Yeah to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

In addition, we re-implemented the scalable path using
tcp_cong_avoid_ai() and removed the pkts_acked variable.

Signed-off-by: Pengcheng Yang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: fix stretch ACK bugs in Veno

Change Veno to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

Signed-off-by: Pengcheng Yang <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: stretch ACK fixes in Veno prep

No code logic has been changed in this patch.

Signed-off-by: Pengcheng Yang <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: fix stretch ACK bugs in Scalable

Change Scalable to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets to
tcp_cong_avoid_ai().

In addition, because we are now precisely accounting for
stretch ACKs, including delayed ACKs, we can now change
TCP_SCALABLE_AI_CNT to 100.

Signed-off-by: Pengcheng Yang <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: fix stretch ACK bugs in BIC

Changes BIC to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

Signed-off-by: Pengcheng Yang <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: fix XDP-redirect in this driver

XDP-redirect is broken in this driver sfc. XDP_REDIRECT requires
tailroom for skb_shared_info when creating an SKB based on the
redirected xdp_frame (both in cpumap and veth).

The fix requires some initial explaining. The driver uses RX page-split
when possible. It reserves the top 64 bytes in the RX-page for storing
dma_addr (struct efx_rx_page_state). It also have the XDP recommended
headroom of XDP_PACKET_HEADROOM (256 bytes). As it doesn't reserve any
tailroom, it can still fit two standard MTU (1500) frames into one page.

The sizeof struct skb_shared_info in 320 bytes. Thus drivers like ixgbe
and i40e, reduce their XDP headroom to 192 bytes, which allows them to
fit two frames with max 1536 bytes into a 4K page (192+1536+320=2048).

The fix is to reduce this drivers headroom to 128 bytes and add the 320
bytes tailroom. This account for reserved top 64 bytes in the page, and
still fit two frame in a page for normal MTUs.

We must never go below 128 bytes of headroom for XDP, as one cacheline
is for xdp_frame area and next cacheline is reserved for metadata area.

Fixes: eb9a36be7f3e ("sfc: perform XDP processing on received packets")
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

remoteproc: clean up notification config

Rearrange the config files for remoteproc and IPA to fix their
interdependencies.

First, have CONFIG_QCOM_Q6V5_MSS select QCOM_Q6V5_IPA_NOTIFY so the
notification code is built regardless of whether IPA needs it.

Next, represent QCOM_IPA as being dependent on QCOM_Q6V5_MSS rather
than setting its value to match QCOM_Q6V5_COMMON (which is selected
by QCOM_Q6V5_MSS).

Drop all dependencies from QCOM_Q6V5_IPA_NOTIFY. The notification
code will be built whenever QCOM_Q6V5_MSS is set, and it has no other
dependencies.

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Bjorn Andersson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: kcm: kcmproc.c: Fix RCU list suspicious usage warning

This path fixes the suspicious RCU usage warning reported by
kernel test robot.

net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section

There is no need to use list_for_each_entry_rcu() in
kcm_stats_seq_show() as the list is always traversed under
knet->mutex held.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Madhuparna Bhowmik <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qede: remove some unused code in function qede_selftest_receive_traffic

Remove set but not used variables 'sw_comp_cons' and 'hw_comp_cons'
to fix gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:20:
warning: variable sw_comp_cons set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:6:
warning: variable hw_comp_cons set but not used [-Wunused-but-set-variable]

After removing 'hw_comp_cons',the memory barrier 'rmb()' and its comments become useless,
so remove them as well.

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Zheng Zengkai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: set the hw_stats_type in pedit loop

For a single pedit action, multiple offload entries may be used. Set the
hw_stats_type to all of them.

Fixes: 44f865801741 ("sched: act: allow user to specify type of HW stats for a filter")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-stmmac-Use-readl_poll_timeout-to-simplify-the-code'

Dejin Zheng says:

====================
net: stmmac: Use readl_poll_timeout() to simplify the code

This patch sets just for replace the open-coded loop to the
readl_poll_timeout() helper macro for simplify the code in
stmmac driver.

v2 -> v3:
- return whatever error code by readl_poll_timeout() returned.
v1 -> v2:
- no changed. I am a newbie and sent this patch a month
  ago (February 6th). So far, I have not received any comments or
  suggestion. I think it may be lost somewhere in the world, so
  resend it.
====================

Signed-off-by: David S. Miller <[email protected]>

net: stmmac: use readl_poll_timeout() function in dwmac4_dma_reset()

The dwmac4_dma_reset() function use an open coded of readl_poll_timeout().
Replace the open coded handling with the proper function.

Signed-off-by: Dejin Zheng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: use readl_poll_timeout() function in init_systime()

The init_systime() function use an open coded of readl_poll_timeout().
Replace the open coded handling with the proper function.

Signed-off-by: Dejin Zheng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

chcr: remove set but not used variable 'status'

drivers/crypto/chelsio/chcr_ktls.c: In function chcr_ktls_cpl_set_tcb_rpl:
drivers/crypto/chelsio/chcr_ktls.c:662:11: warning:
variable status set but not used [-Wunused-but-set-variable]

commit 8a30923e1598 ("cxgb4/chcr: Save tx keys and handle HW response")
involved this unused variable, remove it.

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)

Netlink support of extended packet number cipher suites,
allows adding and updating XPN macsec interfaces.

Added support in:
    * Creating interfaces with GCM-AES-XPN-128 and GCM-AES-XPN-256 suites.
    * Setting and getting 64bit packet numbers with of SAs.
    * Setting (only on SA creation) and getting ssci of SAs.
    * Setting salt when installing a SAK.

Added 2 cipher suite identifiers according to 802.1AE-2018 table 14-1:
    * MACSEC_CIPHER_ID_GCM_AES_XPN_128
    * MACSEC_CIPHER_ID_GCM_AES_XPN_256

In addition, added 2 new netlink attribute types:
    * MACSEC_SA_ATTR_SSCI
    * MACSEC_SA_ATTR_SALT

Depends on: macsec: Support XPN frame handling - IEEE 802.1AEbw.

Signed-off-by: Era Mayflower <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

macsec: Support XPN frame handling - IEEE 802.1AEbw

Support extended packet number cipher suites (802.1AEbw) frames handling.
This does not include the needed netlink patches.

    * Added xpn boolean field to `struct macsec_secy`.
    * Added ssci field to `struct_macsec_tx_sa` (802.1AE figure 10-5).
    * Added ssci field to `struct_macsec_rx_sa` (802.1AE figure 10-5).
    * Added salt field to `struct macsec_key` (802.1AE 10.7 NOTE 1).
    * Created pn_t type for easy access to lower and upper halves.
    * Created salt_t type for easy access to the "ssci" and "pn" parts.
    * Created `macsec_fill_iv_xpn` function to create IV in XPN mode.
    * Support in PN recovery and preliminary replay check in XPN mode.

In addition, according to IEEE 802.1AEbw figure 10-5, the PN of incoming
frame can be 0 when XPN cipher suite is used, so fixed the function
`macsec_validate_skb` to fail on PN=0 only if XPN is off.

Signed-off-by: Era Mayflower <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Bluetooth: btusb: print Intel fw build version in power-on boot

To determine the build version of Bluetooth firmware to ensure reported
issue related to a particular release. This is very helpful for every fw
downloaded to BT controller and issue reported from field test.

Signed-off-by: Amit K Bag <[email protected]>
Signed-off-by: Sukumar Ghorai <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Merge branch 'net-dsa-improve-serdes-integration'

Russell King says:

====================
net: dsa: improve serdes integration

Depends on "net: mii clause 37 helpers".

Andrew Lunn mentioned that the Serdes PCS found in Marvell DSA switches
does not automatically update the switch MACs with the link parameters.
Currently, the DSA code implements a work-around for this.

This series improves the Serdes integration, making use of the recent
phylink changes to support split MAC/PCS setups. One noticable
improvement for userspace is that ethtool can now report the link
partner's advertisement.

This repost has no changes compared to the previous posting; however,
the regression Andrew had found which exists even without this patch
set has now been fixed by Andrew and merged into the net-next tree.
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: use PHY_DETECT in mac_link_up/mac_link_down

Use the status of the PHY_DETECT bit to determine whether we need to
force the MAC settings in mac_link_up() and mac_link_down().

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: remove port_link_state functions

The port_link_state method is only used by mv88e6xxx_port_setup_mac(),
which is now only called during port setup, rather than also being
called via phylink's mac_config method.

Remove this now unnecessary optimisation, which allows us to remove the
port_link_state methods as well.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: combine port_set_speed and port_set_duplex

Setting the speed independently of duplex makes little sense; the two
parameters result from negotiation or fixed setup, and may have inter-
dependencies. Moreover, they are always controlled via the same
register - having them split means we have to read-modify-write this
register twice.

Combine the two operations into a single port_set_speed_duplex()
operation. Not only is this more efficient, it reduces the size of the
code as well.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: fix Serdes link changes

phylink_mac_change() is supposed to be called with a 'false' argument
if the link has gone down since it was last reported up; this is to
ensure that link events along with renegotiation events are always
correctly reported to userspace.

Read the BMSR once when we have an interrupt, and report the link
latched status to phylink via phylink_mac_change(). phylink will deal
automatically with re-reading the link state once it has processed the
link-down event.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: extend phylink to Serdes PHYs

Extend the mv88e6xxx phylink implementation down to Serdes PHYs, which
handle the PCS layer of such links.

- Implement phylink PCS link state reading, so that we can provide
  ethtool with the linkmodes and link speed in the expected manner.
  Note: this will only be called for in-band negotiation, which is
  only supported by the serdes interfaces.
- Implement phylink PCS configuration, so that the in-band AN and
  advertisement can be configured.
- Implement phylink PCS negotiation restart, so that the in-band AN
  can be restarted.
- Implement phylink PCS link up, so that when operating out-of-band,
  the Serdes can be configured for the appropriate fixed speed mode.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: configure interface settings in mac_config

Only configure the interface settings in mac_config(), leaving the
speed and duplex settings to mac_link_up to deal with.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: use BMCR definitions for serdes control register

The SGMII/1000base-X serdes register set is a clause 22 register set
offset at 0x2000 in the PHYXS device. Rather than inventing our own
defintions, use those that already exist, and name the register
MV88E6390_SGMII_BMCR. Also remove the unused MV88E6390_SGMII_STATUS
definitions.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: warn if phylink_mac_link_state returns error

Issue a warning to the kernel log if phylink_mac_link_state() returns
an error. This should not occur, but let's make it visible.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-mii-clause-37-helpers'

Russell King says:

====================
net: mii clause 37 helpers

This is a re-post of two patches that are common to two series that
I've sent in recent weeks; I'm re-posting them separately in the hope
that they can be merged.  No changes from either of the previous
postings.

These patches:

1. convert the existing (unused) mii_lpa_to_ethtool_lpa_x() function
   to a linkmode variant.

2. add a helper for clause 37 advertisements, supporting both the
   1000baseX and defacto 2500baseX variants. Note that ethtool does
   not support half duplex for either of these, and we make no effort
   to do so.
====================

Signed-off-by: David S. Miller <[email protected]>

net: mii: add linkmode_adv_to_mii_adv_x()

Add a helper to convert a linkmode advertisement to a clause 37
advertisement value for 1000base-x and 2500base-x.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mii: convert mii_lpa_to_ethtool_lpa_x() to linkmode variant

Add a LPA to linkmode decoder for 1000BASE-X protocols; this decoder
only provides the modify semantics similar to other such decoders.
This replaces the unused mii_lpa_to_ethtool_lpa_x() helper.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netfilter: conntrack: re-visit sysctls in unprivileged namespaces

since commit b884fa46177659 ("netfilter: conntrack: unify sysctl handling")
conntrack no longer exposes most of its sysctls (e.g. tcp timeouts
settings) to network namespaces that are not owned by the initial user
namespace.

This patch exposes all sysctls even if the namespace is unpriviliged.

compared to a 4.19 kernel, the newly visible and writeable sysctls are:
  net.netfilter.nf_conntrack_acct
  net.netfilter.nf_conntrack_timestamp
  .. to allow to enable accouting and timestamp extensions.

  net.netfilter.nf_conntrack_events
  .. to turn off conntrack event notifications.

  net.netfilter.nf_conntrack_checksum
  .. to disable checksum validation.

  net.netfilter.nf_conntrack_log_invalid
  .. to enable logging of packets deemed invalid by conntrack.

newly visible sysctls that are only exported as read-only:

  net.netfilter.nf_conntrack_count
  .. current number of conntrack entries living in this netns.

  net.netfilter.nf_conntrack_max
  .. global upperlimit (maximum size of the table).

  net.netfilter.nf_conntrack_buckets
  .. size of the conntrack table (hash buckets).

  net.netfilter.nf_conntrack_expect_max
  .. maximum number of permitted expectations in this netns.

  net.netfilter.nf_conntrack_helper
  .. conntrack helper auto assignment.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nft_lookup: update element stateful expression

If the set element comes with an stateful expression, update it.

Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: add nft_set_elem_update_expr() helper function

This helper function runs the eval path of the stateful expression
of an existing set element.

Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: add elements with stateful expressions

Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink
attribute. This patch allows users to to add elements with stateful
expressions.

Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: statify nft_expr_init()

Not exposed anymore to modules, statify this function.

Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: add nft_set_elem_expr_alloc()

Add helper function to create stateful expression.

Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Prepare for single ranged field usage

A few adjustments in nft_pipapo_init() are needed to allow usage of
this set back-end for a single, ranged field.

Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently
makes sure that the rbtree back-end is selected instead, for sets
with a single field.

This finally allows a fair comparison with rbtree sets, by defining
NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:

---------------.--------------------------.-------------------------.
AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
  1 thread      |__________________________|_________________________|
  3.35GHz       |        |        |        |            |            |
  768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
---------------|  hook  |   no   | single |   pipapo   |single field|
type   entries |  drop  | ranges | field  |single field|    AVX2    |
---------------|--------|--------|--------|------------|------------|
net,port       |        |        |        |            |            |
          1000  |   19.0 |   10.4 |    3.8 | 6.0   +58% | 9.6  +153% |
---------------|--------|--------|--------|------------|------------|
port,net       |        |        |        |            |            |
           100  |   18.8 |   10.3 |    5.8 | 9.1   +57% |11.6  +100% |
---------------|--------|--------|--------|------------|------------|
net6,port      |        |        |        |            |            |
          1000  |   16.4 |    7.6 |    1.8 | 2.8   +55% | 6.5  +261% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |     [1]    |    [1]     |
         30000  |   19.6 |   11.6 |    3.9 | 0.9   -77% | 2.7   -31% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |            |            |
         10000  |   19.6 |   11.6 |    4.4 | 2.1   -52% | 5.6   +27% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |            |            |
4 threads 10000|   77.9 |   45.1 |   17.4 | 8.3   -52% |22.4   +29% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac  |        |        |        |            |            |
            10  |   16.5 |    5.4 |    4.3 | 4.5    +5% | 8.2   +91% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac, |        |        |        |            |            |
proto    1000  |   16.5 |    5.7 |    1.9 | 2.8   +47% | 6.6  +247% |
---------------|--------|--------|--------|------------|------------|
net,mac        |        |        |        |            |            |
          1000  |   19.0 |    8.4 |    3.9 | 6.0   +54% | 9.9  +154% |
---------------'--------'--------'--------'------------'------------'
[1] Causes switch of lookup table buckets for 'port' to 4-bit groups

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Introduce AVX2-based lookup implementation

If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.

In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.

Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.

However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.

---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  3.35GHz       |        |        |        |        |            |
  768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single |        |   pipapo   |
type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
          1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
           100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
          1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |            |
         30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
            10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
          1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
---------------'--------'--------'--------'--------'------------'

A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Prepare for vectorised implementation: helpers

Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.

No functional changes are intended here.

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Prepare for vectorised implementation: alignment

SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch

While grouping matching bits in groups of four saves memory compared
to the more natural choice of 8-bit words (lookup table size is one
eighth), it comes at a performance cost, as the number of lookup
comparisons is doubled, and those also needs bitshifts and masking.

Introduce support for 8-bit lookup groups, together with a mapping
mechanism to dynamically switch, based on defined per-table size
thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
grow and shrink. Empty sets start with 8-bit groups, and per-field
tables are converted to 4-bit groups if they get too big.

An alternative approach would have been to swap per-set lookup
operation functions as needed, but this doesn't allow for different
group sizes in the same set, which looks desirable if some fields
need significantly more matching data compared to others due to
heavier impact of ranges (e.g. a big number of subnets with
relatively simple port specifications).

Allowing different group sizes for the same lookup functions implies
the need for further conditional clauses, whose cost, however,
appears to be negligible in tests.

The matching rate figures below were obtained for x86_64 running
the nft_concat_range.sh "performance" cases, averaged over five
runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
clocked at a stable 2147MHz frequency:

---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
1 thread      |___________________________________|____________|
3.35GHz       |        |        |        |        |            |
768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   19.0 |   10.4 |    3.8 |    2.8 | 4.0   +43% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   18.8 |   10.3 |    5.8 |    5.5 | 6.3   +14% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   16.4 |    7.6 |    1.8 |    1.3 | 2.1   +61% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |     [1]    |
        30000  |   19.6 |   11.6 |    3.9 |    0.3 | 0.5   +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   16.5 |    5.4 |    4.3 |    2.6 | 3.4   +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.0 | 1.4   +40% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   19.0 |    8.4 |    3.9 |    1.7 | 2.5   +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto',
    to 4-bit groups

---------------.-----------------------------------.------------.
BCM2711        |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  2147MHz       |        |        |        |        |            |
  32KiB L1D$    | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
          1000  |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
           100  |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
          1000  |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
---------------|--------|--------|--------|--------|------------|
port,proto [2] |        |        |        |        |            |
         10000  |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
            10  |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
          1000  |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too
    long for the test script to generate all of them

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

nft_set_pipapo: Generalise group size for buckets

Get rid of all hardcoded assumptions that buckets in lookup tables
correspond to four-bit groups, and replace them with appropriate
calculations based on a variable group size, now stored in struct
field.

The group size could now be in principle any divisor of eight. Note,
though, that lookup and get functions need an implementation
intimately depending on the group size, and the only supported size
there, currently, is four bits, which is also the initial and only
used size at the moment.

While at it, drop 'groups' from struct nft_pipapo: it was never used.

Signed-off-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: flowtable: add tunnel encap/decap action offload support

This patch add tunnel encap decap action offload in the flowtable
offload.

Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: flowtable: add tunnel match offload support

This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and
tunnel_dst match for flowtable offload

Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: flowtable: add indr block setup support

Add etfilter flowtable support indr-block setup. It makes flowtable offload
vlan and tunnel device.

Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: flowtable: add nf_flow_table_block_offload_init()

Add nf_flow_table_block_offload_init prepare for the indr block
offload patch

Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: xt_IDLETIMER: clean up some indenting

These lines were indented wrong so Smatch complained.
net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting

Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: bitwise: use more descriptive variable-names.

Name the mask and xor data variables, "mask" and "xor," instead of "d1"
and "d2."

Signed-off-by: Jeremy Sowden <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: Replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
int stuff;
struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static

Fix the following sparse warning:

net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static?

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Chen Wandun <[email protected]>
Acked-by: Stefano Brivio <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: cleanup unused macro

TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcfcd
("netfilter: fix netns dependencies with conntrack templates")

PFX is not used after commit 8bee4bad03c5b ("netfilter: xt
extensions: use pr_<level>")

Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: make all set structs const

They do not need to be writeable anymore.

v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: make sets built-in

Placing nftables set support in an extra module is pointless:

1. nf_tables needs dynamic registeration interface for sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
   "nft ... tcp dport { 80, 443 }" will not work with _SETS=n.

IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.

With extra module:
307K net/netfilter/nf_tables.ko
  79K net/netfilter/nf_tables_set.ko

   text  data  bss     dec filename
146416  3072  545  150033 nf_tables.ko
  35496  1817    0   37313 nf_tables_set.ko

This patch:
373K net/netfilter/nf_tables.ko

178563  4049  545  183157 nf_tables.ko

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nft_tunnel: add support for geneve opts

Like vxlan and erspan opts, geneve opts should also be supported in
nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14)
allows a geneve packet to carry multiple geneve opts. So with this
patch, nftables/libnftnl would do:

  # nft add table ip filter
  # nft add chain ip filter input { type filter hook input priority 0 \; }
  # nft add tunnel filter geneve_02 { type geneve\; id 2\; \
    ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
    sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
    opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
  # nft list tunnels table filter
    table ip filter {
     tunnel geneve_02 {
     id 2
     ip saddr 192.168.1.1
     ip daddr 192.168.1.2
     sport 9000
     dport 9001
     tos 18
     ttl 64
     flags 1
     geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
     }
    }

v1->v2:
  - no changes, just post it separately.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: xtables: Add snapshot of hardidletimer target

This is a snapshot of hardidletimer netfilter target.

This patch implements a hardidletimer Xtables target that can be
used to identify when interfaces have been idle for a certain period
of time.

Timers are identified by labels and are created when a rule is set
with a new label. The rules also take a timeout value (in seconds) as
an option. If more than one rule uses the same timer label, the timer
will be restarted whenever any of the rules get a hit.

One entry for each timer is created in sysfs. This attribute contains
the timer remaining for the timer to expire. The attributes are
located under the xt_idletimer class:

/sys/class/xt_idletimer/timers/<label>

When the timer expires, the target module sends a sysfs notification
to the userspace, which can then decide what to do (eg. disconnect to
save power)

Compared to IDLETIMER, HARDIDLETIMER can send notifications when
CPU is in suspend too, to notify the timer expiry.

v1->v2: Moved all functionality into IDLETIMER module to avoid
code duplication per comment from Florian.

Signed-off-by: Manoj Basapathi <[email protected]>
Signed-off-by: Subash Abhinov Kasiviswanathan <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: flowtable: Use nf_flow_offload_tuple for stats as well

This patch doesn't change any functionality.

Signed-off-by: Paul Blakey <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

cdc_ncm: Fix the build warning

The ndp32->wLength is two bytes long, so replace cpu_to_le32 with cpu_to_le16.

Fixes: 0fa81b304a79 ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
Signed-off-by: Alexander Bersenev <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mptcp-simplify-mptcp_accept'

Paolo Abeni says:

====================
mptcp: simplify mptcp_accept()

Currently we allocate the MPTCP master socket at accept time.

The above makes mptcp_accept() quite complex, and requires checks is several
places for NULL MPTCP master socket.

These series simplify the MPTCP accept implementation, moving the master socket
allocation at syn-ack time, so that we drop unneeded checks with the follow-up
patch.

v1 -> v2:
- rebased on top of 2398e3991bda7caa6b112a6f650fbab92f732b91
====================

Signed-off-by: David S. Miller <[email protected]>

mptcp: drop unneeded checks

After the previous patch subflow->conn is always != NULL and
is never changed. We can drop a bunch of now unneeded checks.

v1 -> v2:
- rebased on top of commit 2398e3991bda ("mptcp: always
include dack if possible.")

Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: create msk early

This change moves the mptcp socket allocation from mptcp_accept() to
subflow_syn_recv_sock(), so that subflow->conn is now always set
for the non fallback scenario.

It allows cleaning up a bit mptcp_accept() reducing the additional
locking and will allow fourther cleanup in the next patch.

Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: David S. Miller <[email protected]>