Michael Chan [Mon, 27 Jan 2020 09:56:15 +0000 (04:56 -0500)]
bnxt_en: Remove the setting of dev_port.
The dev_port is meant to distinguish the network ports belonging to
the same PCI function. Our devices only have one network port
associated with each PCI function and so we should not set it for
correctness.
Michael Chan [Mon, 27 Jan 2020 09:56:14 +0000 (04:56 -0500)]
bnxt_en: Improve bnxt_probe_phy().
If the 2nd parameter fw_dflt is not set, we are calling bnxt_probe_phy()
after the firmware has reset. There is no need to query the current
PHY settings from firmware as these settings may be different from
the ethtool settings that the driver will re-establish later. So
return earlier in bnxt_probe_phy() to save one firmware call.
Michael Chan [Mon, 27 Jan 2020 09:56:13 +0000 (04:56 -0500)]
bnxt_en: Improve link up detection.
In bnxt_update_phy_setting(), ethtool_get_link_ksettings() and
bnxt_disable_an_for_lpbk(), we inconsistently use netif_carrier_ok()
to determine link. Instead, we should use bp->link_info.link_up
which has the true link state. The netif_carrier state may be off
during self-test and while the device is being reset and may not always
reflect the true link state.
By always using bp->link_info.link_up, the code is now more
consistent and more correct. Some unnecessary link toggles are
now prevented with this patch.
David S. Miller [Mon, 27 Jan 2020 10:31:36 +0000 (11:31 +0100)]
Merge branch 'ethtool-netlink-interface-part-2'
Michal Kubecek says:
====================
ethtool netlink interface, part 2
This shorter series adds support for getting and setting of wake-on-lan
settings and message mask (originally message level). Together with the
code already in net-next, this will allow full implementation of
"ethtool <dev>" and "ethtool -s <dev> ...".
Older versions of the ethtool netlink series allowed getting WoL settings
by unprivileged users and only filtered out the password but this was
a source of controversy so for now, ETHTOOL_MSG_WOL_GET request always
requires CAP_NET_ADMIN as ETHTOOL_GWOL ioctl request does.
====================
Michal Kubecek [Sun, 26 Jan 2020 22:11:19 +0000 (23:11 +0100)]
ethtool: add WOL_NTF notification
Send ETHTOOL_MSG_WOL_NTF notification whenever wake-on-lan settings of
a device are modified using ETHTOOL_MSG_WOL_SET netlink message or
ETHTOOL_SWOL ioctl request.
As notifications can be received by anyone, do not include SecureOn(tm)
password in notification messages.
Michal Kubecek [Sun, 26 Jan 2020 22:11:10 +0000 (23:11 +0100)]
ethtool: add DEBUG_NTF notification
Send ETHTOOL_MSG_DEBUG_NTF notification message whenever debugging message
mask for a device is modified using ETHTOOL_MSG_DEBUG_SET netlink message
or ETHTOOL_SMSGLVL ioctl request.
The notification message has the same format as reply to DEBUG_GET request.
As with other ethtool notifications, netlink requests only trigger the
notification if the mask is actually changed while ioctl requests trigger it
whenever the request results in calling the ethtool_ops handler.
Michal Kubecek [Sun, 26 Jan 2020 22:11:07 +0000 (23:11 +0100)]
ethtool: set message mask with DEBUG_SET request
Implement DEBUG_SET netlink request to set debugging settings for a device.
At the moment, only message mask corresponding to message level as set by
ETHTOOL_SMSGLVL ioctl request can be set. (It is called message level in
ioctl interface but almost all drivers interpret it as a bit mask.)
Michal Kubecek [Sun, 26 Jan 2020 22:11:04 +0000 (23:11 +0100)]
ethtool: provide message mask with DEBUG_GET request
Implement DEBUG_GET request to get debugging settings for a device. At the
moment, only message mask corresponding to message level as reported by
ETHTOOL_GMSGLVL ioctl request is provided. (It is called message level in
ioctl interface but almost all drivers interpret it as a bit mask.)
As part of the implementation, provide symbolic names for message mask bits
as ETH_SS_MSG_CLASSES string set.
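A small user-space sketch of the "level versus bit mask" distinction mentioned above. The three flag values follow the lowest NETIF_MSG_* bits as an assumption; the helper names are hypothetical and the level-to-mask rule is one common interpretation, not necessarily what any given driver does.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative message classes; values assumed to mirror the lowest bits. */
enum {
	TOY_MSG_DRV   = 1u << 0,
	TOY_MSG_PROBE = 1u << 1,
	TOY_MSG_LINK  = 1u << 2,
};

/* Read "level N" as "enable the N lowest message classes". */
static uint32_t toy_level_to_mask(unsigned int level)
{
	if (level >= 32)
		return UINT32_MAX;
	return (1u << level) - 1;
}

/* With a bit mask, each class can be tested independently. */
static int toy_msg_enabled(uint32_t mask, uint32_t cls)
{
	assert(cls != 0);
	return (mask & cls) != 0;
}
```

A mask can enable the link class alone, which a plain numeric level cannot express under this interpretation.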
Merge branches 'pm-core', 'powercap', 'pm-opp', 'pm-avs' and 'pm-misc'
* pm-core:
PM-runtime: add tracepoints for usage_count changes
* powercap:
powercap/intel_rapl: add support for JasperLake
x86/cpu: Add Jasper Lake to Intel family
powercap/intel_rapl: add support for TigerLake Mobile
* pm-opp:
opp: Replace list_kref with a local counter
opp: Free static OPPs on errors while adding them
* pm-avs:
power: avs: qcom-cpr: remove duplicated include from qcom-cpr.c
power: avs: fix uninitialized error return on failed cpr_read_fuse_uV() call
power: avs: qcom-cpr: make cpr_get_opp_hz_for_req() static
power: avs: qcom-cpr: remove set but unused variable
power: avs: qcom-cpr: make sure that regmap is available
power: avs: qcom-cpr: fix unsigned expression compared with zero
power: avs: qcom-cpr: fix invalid printk specifier in debug print
power: avs: Add support for CPR (Core Power Reduction)
dt-bindings: power: avs: Add support for CPR (Core Power Reduction)
* pm-cpufreq:
cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
cpufreq: s3c: fix unbalances of cpufreq policy refcount
cpufreq: imx-cpufreq-dt: Add i.MX8MP support
cpufreq: Use imx-cpufreq-dt for i.MX8MP's speed grading
cpufreq: tegra186: convert to devm_platform_ioremap_resource
cpufreq: kirkwood: convert to devm_platform_ioremap_resource
cpufreq: CPPC: put ACPI table after using it
cpufreq : CPPC: Break out if HiSilicon CPPC workaround is matched
* pm-sleep:
PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
PM: hibernate: Add more logging on hibernation failure
PM: hibernate: improve arithmetic division in preallocate_highmem_fraction()
PM: wakeup: Show statistics for deleted wakeup sources again
PM: sleep: Switch to rtc_time64_to_tm()/rtc_tm_to_time64()
* pm-cpuidle: (27 commits)
intel_idle: Clean up irtl_2_usec()
intel_idle: Move 3 functions closer to their callers
intel_idle: Annotate initialization code and data structures
intel_idle: Move and clean up intel_idle_cpuidle_devices_uninit()
intel_idle: Rearrange intel_idle_cpuidle_driver_init()
intel_idle: Clean up NULL pointer check in intel_idle_init()
intel_idle: Fold intel_idle_probe() into intel_idle_init()
intel_idle: Eliminate __setup_broadcast_timer()
cpuidle: fix cpuidle_find_deepest_state() kerneldoc warnings
cpuidle: sysfs: fix warnings when compiling with W=1
cpuidle: coupled: fix warnings when compiling with W=1
Documentation: admin-guide: PM: Add intel_idle document
cpuidle: arm: Enable compile testing for some of drivers
cpuidle: Drop unused cpuidle_driver_ref/unref() functions
intel_idle: Use ACPI _CST on server systems
intel_idle: Add module parameter to prevent ACPI _CST from being used
intel_idle: Allow ACPI _CST to be used for selected known processors
cpuidle: Allow idle states to be disabled by default
intel_idle: Use ACPI _CST for processor models without C-state tables
intel_idle: Refactor intel_idle_cpuidle_driver_init()
...
Daniel Borkmann [Mon, 27 Jan 2020 10:25:07 +0000 (11:25 +0100)]
Merge branch 'bpf-flow-dissector-fix-port-ranges'
Yoshiki Komachi says:
====================
When I tried a test based on the selftest program for BPF flow dissector
(test_flow_dissector.sh), I observed an unexpected result as below:
$ tc filter add dev lo parent ffff: protocol ip pref 1337 flower ip_proto \
udp src_port 8-10 action drop
$ tools/testing/selftests/bpf/test_flow_dissector -i 4 -f 9 -F
inner.dest4: 127.0.0.1
inner.source4: 127.0.0.3
pkts: tx=10 rx=10
The last rx means the number of received packets. I expected rx=0 in this
test (i.e., all received packets should have been dropped), but it resulted
in acceptance.
Although the previous commit 8ffb055beae5 ("cls_flower: Fix the behavior
using port ranges with hw-offload") added a new flag and field for filtering
based on port ranges with hw-offload, it did not apply them to the BPF flow
dissector. As a result, the BPF flow dissector currently stores data extracted
from packets in the wrong field, the one used for exact match, whenever packets
are classified by filters based on port ranges. Such packets never match the
rules because the flow dissector generates incorrect flow keys.
This series fixes the issue by replacing incorrect flag and field with new
ones in BPF flow dissector, and adds a test for filtering based on specified
port ranges to the existing selftest program.
Changes in v2:
- set key_ports to NULL at the top of __skb_flow_bpf_to_target()
====================
Yoshiki Komachi [Fri, 17 Jan 2020 07:05:32 +0000 (16:05 +0900)]
flow_dissector: Fix to use new variables for port ranges in bpf hook
This patch applies new flag (FLOW_DISSECTOR_KEY_PORTS_RANGE) and
field (tp_range) to BPF flow dissector to generate appropriate flow
keys when classified by specified port ranges.
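Why a port inside the range failed to match can be shown with a toy model. The structures and functions below are hypothetical stand-ins, not the kernel's flow dissector types; they only assume that range rules read a different key slot than exact-match rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_flow_keys {
	uint16_t exact_src;	/* slot consulted by exact-match rules */
	uint16_t range_src;	/* slot consulted by port-range rules */
};

struct toy_range_rule {
	uint16_t min, max;
};

/* A range rule looks only at the range slot. */
static bool toy_range_match(const struct toy_range_rule *r,
			    const struct toy_flow_keys *k)
{
	assert(r->min <= r->max);
	return k->range_src >= r->min && k->range_src <= r->max;
}

/* Behaviour described as broken: the port lands in the exact-match slot. */
static struct toy_flow_keys toy_dissect_old(uint16_t sport)
{
	struct toy_flow_keys k = { .exact_src = sport, .range_src = 0 };

	return k;
}

/* Behaviour after the fix: the port lands in the range slot. */
static struct toy_flow_keys toy_dissect_fixed(uint16_t sport)
{
	struct toy_flow_keys k = { .exact_src = 0, .range_src = sport };

	return k;
}
```

With a rule for ports 8-10, source port 9 matches only when the key is written to the slot the range rule actually reads.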
Here's (probably) the last bluetooth-next pull request for the 5.6 kernel.
- Initial pieces of Bluetooth 5.2 Isochronous Channels support
- mgmt: Various cleanups and a new Set Blocked Keys command
- btusb: Added support for 04ca:3021 QCA_ROME device
- hci_qca: Multiple fixes & cleanups
- hci_bcm: Fixes & improved device tree support
- Fixed attempts to create duplicate debugfs entries
Please let me know if there are any issues pulling. Thanks.
====================
drivers: net: xgene: Fix the order of the arguments of 'alloc_etherdev_mqs()'
'alloc_etherdev_mqs()' expects first 'tx', then 'rx'. The semantic here
looks reversed.
Reorder the arguments passed to 'alloc_etherdev_mqs()' in order to keep
the correct semantic.
In fact, this is a no-op because both XGENE_NUM_[RT]X_RING are 8.
Fixes: 107dec2749fe ("drivers: net: xgene: Add support for multiple queues")
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
John Fastabend [Mon, 27 Jan 2020 00:14:02 +0000 (16:14 -0800)]
bpf, xdp: Remove no longer required rcu_read_{un}lock()
Now that we depend on call_rcu() and synchronize_rcu() to also wait
for preempt-disabled regions to complete, the RCU read critical section
in __dev_map_flush() is no longer required, except in a few special
cases in drivers that need it for other reasons.
These originally ensured the map reference was safe while a map was
also being free'd. And additionally that bpf program updates via
ndo_bpf did not happen while flush updates were in flight. But flush
by new rules can only be called from preempt-disabled NAPI context.
The synchronize_rcu from the map free path and the call_rcu from the
delete path will ensure the reference there is safe. So let's remove
the rcu_read_lock and rcu_read_unlock pair to avoid any confusion
around how this is being protected.
If the rcu_read_lock was required it would mean errors in the above
logic and the original patch would also be wrong.
Now that we have done the above, we put the rcu_read_lock in the driver
code where it is needed in a driver-dependent way. I think this
helps readability of the code so we know where and why we are
taking read locks. Most drivers will not need rcu_read_locks here,
and further, XDP drivers already have rcu_read_locks in their code
paths for reading xdp programs on the RX side, so this makes it symmetric:
we don't have half of the RCU critical sections defined in the driver
and the other half in devmap.
John Fastabend [Mon, 27 Jan 2020 00:14:01 +0000 (16:14 -0800)]
bpf, xdp: virtio_net use access ptr macro for xdp enable check
virtio_net currently relies on rcu critical section to access the xdp
program in its xdp_xmit handler. However, the pointer to the xdp program
is only used to do a NULL pointer comparison to determine if xdp is
enabled or not.
Use rcu_access_pointer() instead of rcu_dereference() to reflect this.
Then later, when we drop the rcu_read critical section, virtio_net will not
need any special handling.
Heiner Kallweit [Sun, 26 Jan 2020 09:40:44 +0000 (10:40 +0100)]
r8169: don't set min_mtu/max_mtu if not needed
Defaults for min_mtu and max_mtu are set by ether_setup(), which is
called from devm_alloc_etherdev(). Let rtl_jumbo_max() only return
a positive value if jumbo packets are actually supported. This also
allows removing the constant Jumbo_1K, which is a little misleading anyway.
Vladimir Oltean [Sat, 25 Jan 2020 21:01:11 +0000 (23:01 +0200)]
net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).
The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible for actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.
Consider a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
- Calls dsa_switch_touch_ports, populates dst->ports
- Probe doesn't get deferred, so it goes ahead.
- Calls dsa_tree_setup_routing_table, which returns "complete == true"
due to Switch 1 having called dsa_switch_touch_ports before.
- Because the DSA links are complete, it calls dsa_tree_setup_switches
now.
- dsa_tree_setup_switches iterates through dst->ports, initializing
the Switch 1 ds structure (invalid) and the Switch 2 ds structure
(valid).
- Undefined behavior (use after free, sometimes NULL pointers, etc).
Real example below (debugging prints added by me, as well as guards
against NULL pointers):
The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).
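The per-switch cleanup described above can be sketched with a toy singly linked list. All types and names here are hypothetical and use plain malloc/free rather than kernel lists; the point is only that the error path removes the entries owned by one switch and leaves the others alone.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_switch { int index; };

struct toy_port {
	struct toy_switch *ds;		/* owning switch */
	struct toy_port *next;
};

struct toy_tree { struct toy_port *ports; };

/* Record one port of @ds in the tree-wide list; returns 0, or -1 on OOM. */
static int toy_touch_port(struct toy_tree *dst, struct toy_switch *ds)
{
	struct toy_port *dp = malloc(sizeof(*dp));

	if (!dp)
		return -1;
	dp->ds = ds;
	dp->next = dst->ports;
	dst->ports = dp;
	return 0;
}

/* Error-path cleanup: unlink and free only the ports owned by @ds. */
static void toy_release_ports(struct toy_tree *dst, const struct toy_switch *ds)
{
	struct toy_port **link;

	assert(dst != NULL && ds != NULL);
	link = &dst->ports;
	while (*link) {
		struct toy_port *dp = *link;

		if (dp->ds == ds) {
			*link = dp->next;
			free(dp);
		} else {
			link = &dp->next;
		}
	}
}

static int toy_count_ports(const struct toy_tree *dst)
{
	int n = 0;

	for (const struct toy_port *dp = dst->ports; dp; dp = dp->next)
		n++;
	return n;
}
```

If switch 1 bails out of probing and calls the release helper, switch 2 later iterates over a list that no longer holds pointers into switch 1's state.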
Heiner Kallweit [Sat, 25 Jan 2020 12:42:14 +0000 (13:42 +0100)]
net: remove eth_change_mtu
All usage of this function was removed three years ago, and the
function was marked as deprecated: a52ad514fdf3 ("net: deprecate eth_change_mtu, remove usage")
So I think we can remove it now.
Lorenzo Bianconi [Sat, 25 Jan 2020 11:48:51 +0000 (12:48 +0100)]
net: socionext: fix xdp_result initialization in netsec_process_rx
Fix xdp_result initialization in netsec_process_rx in order to not
increase rx counters if there is no bpf program attached to the xdp hook
and napi_gro_receive returns GRO_DROP.
Lorenzo Bianconi [Sat, 25 Jan 2020 11:48:50 +0000 (12:48 +0100)]
net: socionext: fix possible user-after-free in netsec_process_rx
Fix a possible use-after-free in netsec_process_rx that can occur if
the first packet is sent to the normal networking stack and the
following one is dropped by the bpf program attached to the xdp hook.
Fix the issue by defining the skb pointer in the 'budget' loop.
====================
net: allow per-net notifier to follow netdev into namespace
Currently we have a per-net notifier, which delivers only the
notifications relevant to a particular network namespace. That is enough
for drivers that have netdevs local in a particular namespace (cannot
move elsewhere).
However if netdev can change namespace, per-net notifier cannot be used.
Introduce dev_net variant that is basically per-net notifier with an
extension that re-registers the per-net notifier upon netdev namespace
change. Basically the per-net notifier follows the netdev into
namespace.
====================
Introduce dev_net variants of netdev notifier register/unregister functions
and allow per-net notifier to follow the netdevice into the namespace it is
moved to.
David S. Miller [Mon, 27 Jan 2020 10:00:21 +0000 (11:00 +0100)]
Merge branch 'Support-fraglist-GRO-GSO'
Steffen Klassert says:
====================
Support fraglist GRO/GSO
This patchset adds support to do GRO/GSO by chaining packets
of the same flow at the SKB frag_list pointer. This avoids
the overhead to merge payloads into one big packet, and
on the other end, if GSO is needed it avoids the overhead
of splitting the big packet back to the native form.
Patch 1 adds netdev feature flags to enable fraglist GRO,
this implements one of the configuration options discussed
at netconf 2019.
Patch 2 adds a netdev software feature set that defaults to off
and assigns the new fraglist GRO feature flag to it.
Patch 3 adds the core infrastructure to do fraglist GRO/GSO.
Patch 4 enables UDP to use fraglist GRO/GSO if configured.
I have only meaningful forwarding performance measurements.
I did some tests for the local receive path with netperf and iperf,
but in this case the sender that generates the packets is the
bottleneck. So the benchmarks are not that meaningful for the
receive path.
Paolo Abeni did some benchmarks of the local receive path for the
RFC v2 version of this patchset, results can be found here:
Changes from RFC v1:
- Add IPv6 support.
- Split patchset to enable UDP GRO by default before adding
fraglist GRO support.
- Mark fraglist GRO packets as CHECKSUM_NONE.
- Take a refcount on the first segment skb when doing fraglist
segmentation. With this we can use the same error handling
path as with standard segmentation.
Changes from RFC v2:
- Add a netdev feature flag to configure listified GRO.
- Fix UDP GRO enabling for IPv6.
- Fix a rcu_read_lock() imbalance.
- Fix error path in skb_segment_list().
Changes from RFC v3:
- Rename NETIF_F_GRO_LIST to NETIF_F_GRO_FRAGLIST and add
NETIF_F_GSO_FRAGLIST.
- Move introduction of SKB_GSO_FRAGLIST to patch 2.
- Use udpv6_encap_needed_key instead of udp_encap_needed_key in IPv6.
- Move some misplaced code from patch 5 to patch 1 where it belongs.
Changes from RFC v4:
- Drop the 'UDP: enable GRO by default' patch for now. Standard UDP GRO
is not changed with this patchset.
- Rebase to net-next current.
Changes from v1 (December 18th):
- Do a full __copy_skb_header instead of trying to find the really
needed subset of header fields. This can be done later.
- Mark all fraglist GRO packets with CHECKSUM_UNNECESSARY.
- Rebase to net-next current.
Changes from v2 (January 24th):
- Do the CHECKSUM_UNNECESSARY setting from IPv4 for IPv6 too.
====================
Steffen Klassert [Sat, 25 Jan 2020 10:26:45 +0000 (11:26 +0100)]
udp: Support UDP fraglist GRO/GSO.
This patch extends UDP GRO to support fraglist GRO/GSO
by using the previously introduced infrastructure.
If the feature is enabled, all UDP packets are going to
fraglist GRO (local input and forward).
After validating the csum, we mark ip_summed as
CHECKSUM_UNNECESSARY for fraglist GRO packets to
make sure that the csum is not touched.
Steffen Klassert [Sat, 25 Jan 2020 10:26:44 +0000 (11:26 +0100)]
net: Support GRO/GSO fraglist chaining.
This patch adds the core functions to chain/unchain
GSO skbs at the frag_list pointer. This also adds
a new GSO type SKB_GSO_FRAGLIST and a is_flist
flag to napi_gro_cb which indicates that this
flow will be GROed by fraglist chaining.
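A toy model of chaining versus merging may help here. The structure below is a hypothetical stand-in for an skb, and the helpers only track lengths and links; they are a sketch of the idea, not the kernel functions this patch adds.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pkt {
	unsigned int len;		/* on a head: own length plus chained packets */
	struct toy_pkt *frag_list;	/* head only: first chained packet */
	struct toy_pkt *last;		/* head only: tail of the chain */
	struct toy_pkt *next;		/* link between chained packets */
};

/* GRO side: append @pkt to the chain hanging off @head, without copying payload. */
static void toy_chain(struct toy_pkt *head, struct toy_pkt *pkt)
{
	assert(head != pkt);
	pkt->next = NULL;
	if (!head->frag_list)
		head->frag_list = pkt;
	else
		head->last->next = pkt;
	head->last = pkt;
	head->len += pkt->len;
}

/* GSO side: detach the chain again; returns the number of original packets. */
static unsigned int toy_unchain(struct toy_pkt *head)
{
	unsigned int count = 1;

	for (struct toy_pkt *p = head->frag_list; p; p = p->next) {
		head->len -= p->len;
		count++;
	}
	head->frag_list = NULL;
	head->last = NULL;
	return count;
}
```

Because the packets are only linked, going back to the native form is a walk over the chain rather than a split of one large buffer.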
Steffen Klassert [Sat, 25 Jan 2020 10:26:43 +0000 (11:26 +0100)]
net: Add a netdev software feature set that defaults to off.
The previous patch added the NETIF_F_GRO_FRAGLIST feature.
This is a software feature that should default to off.
Current software features default to on, so add a new
feature set that defaults to off.
Sven Auhagen [Sat, 25 Jan 2020 08:07:03 +0000 (08:07 +0000)]
mvneta driver disallow XDP program on hardware buffer management
Recently XDP Support was added to the mvneta driver
for software buffer management only.
It is still possible to attach an XDP program if
hardware buffer management is used.
It is not doing anything at that point.
The patch disallows attaching XDP programs to mvneta
if hardware buffer management is used.
I am sorry about that. It is my first submission and I am having
some troubles with the format of my emails.
v4 -> v5:
- Remove extra tabs
v3 -> v4:
- Please ignore v3; I accidentally submitted
my other patch with git-send-email, and v4 is correct
v2 -> v3:
- My mailserver corrupted the patch
resubmission with git-send-email
Merge branches 'acpi-battery', 'acpi-video', 'acpi-fan' and 'acpi-drivers'
* acpi-battery:
ACPI / battery: Deal better with neither design nor full capacity not being reported
ACPI / battery: Use design-cap for capacity calculations if full-cap is not available
ACPI / battery: Deal with design or full capacity being reported as -1
* acpi-video:
ACPI: video: Do not export a non working backlight interface on MSI MS-7721 boards
ACPI: video: Use native backlight on Lenovo E41-25/45
ACPI: video: fix typo in comment
* acpi-fan:
ACPI: fan: Expose fan performance state information
* acpi-drivers:
thermal: int340x_thermal: Add Tiger Lake ACPI device IDs
platform/x86: intel-hid: Add Tiger Lake ACPI device ID
ACPI: fan: Add Tiger Lake ACPI device ID
ACPI: DPTF: Add Tiger Lake ACPI device IDs
* acpica:
ACPICA: Update version to 20200110
ACPICA: All acpica: Update copyrights to 2020 Including tool signons.
ACPICA: Update the list of maintainers
ACPICA: Update version to 20191213
ACPICA: Dispatcher: always generate buffer objects for ASL create_field() operator
ACPICA: acpisrc: add unix line ending support for non-windows build
ACPICA: Disassembler: create buffer fields in ACPI_PARSE_LOAD_PASS1
ACPICA: debugger: fix spelling mistake "adress" -> "address"
David Howells [Fri, 24 Jan 2020 23:08:04 +0000 (23:08 +0000)]
rxrpc: Fix use-after-free in rxrpc_receive_data()
The subpacket scanning loop in rxrpc_receive_data() references the
subpacket count in the private data part of the sk_buff in the loop
termination condition. However, when the final subpacket is pasted into
the ring buffer, the function no longer has a ref on the sk_buff and
should not be looking at sp->* any more. This point is actually marked in
the code when skb is cleared (but sp is not - which is an error).
Fix this by caching sp->nr_subpackets in a local variable and using that
instead.
Also clear 'sp' to catch accesses after that point.
This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets
trashed by the sk_buff getting freed and reused in the meantime.
Fixes: e2de6c404898 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet")
Signed-off-by: David Howells <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
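The pattern of the rxrpc fix above (cache the count while the reference is still held, then clear the pointer) can be sketched in user space. The types and the hand-over function are hypothetical; free() stands in for the buffer being released by its new owner.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_skb { unsigned int nr_subpackets; };

/* Hypothetical consumer: takes ownership and may free the buffer at once. */
static void toy_hand_over(struct toy_skb *skb)
{
	free(skb);
}

/* Process every subpacket; returns how many were handled. */
static unsigned int toy_receive(struct toy_skb *skb)
{
	unsigned int nr, handled = 0;

	assert(skb != NULL);
	nr = skb->nr_subpackets;	/* cached while we still own the buffer */

	for (unsigned int j = 0; j < nr; j++) {
		handled++;
		if (j == nr - 1) {
			toy_hand_over(skb);
			skb = NULL;	/* catch any later access */
		}
	}
	free(skb);			/* non-NULL only when nr was 0 */
	return handled;
}
```

The loop condition reads the local copy, so it never touches the buffer after ownership has passed on.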
Stephen Worley [Fri, 24 Jan 2020 21:53:27 +0000 (16:53 -0500)]
net: include struct nhmsg size in nh nlmsg size
Include the size of struct nhmsg when calculating
how much of a payload to allocate in a new netlink nexthop
notification message.
Without this, we will fail to fill the skbuff at certain nexthop
group sizes.
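The size arithmetic can be sketched as follows. Everything here is an assumption made for illustration: the 4-byte alignment, the 4-byte attribute header, the layout of the stand-in header struct and the helper names. It only shows that the estimate grows by the aligned header size once the header is counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the fixed nexthop message header (layout assumed). */
struct toy_nhmsg {
	uint8_t family, scope, protocol, resvd;
	uint32_t flags;
};

static size_t toy_align4(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Attribute = assumed 4-byte header plus payload, padded to 4 bytes. */
static size_t toy_attr_total_size(size_t payload)
{
	return toy_align4(4 + payload);
}

/* Payload estimate for a group notification with @members entries. */
static size_t toy_group_payload(size_t members, size_t entry_size,
				int count_header)
{
	size_t sz;

	assert(entry_size > 0);
	sz = count_header ? toy_align4(sizeof(struct toy_nhmsg)) : 0;
	sz += toy_attr_total_size(4);				/* id attribute */
	sz += toy_attr_total_size(members * entry_size);	/* group attribute */
	return sz;
}
```

For 19 members of 8 bytes each, the estimate without the header is 164 bytes and with it 172 bytes under these assumptions.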
You can reproduce the failure with the following iproute2 commands:
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add dummy3 type dummy
ip link add dummy4 type dummy
ip link add dummy5 type dummy
ip link add dummy6 type dummy
ip link add dummy7 type dummy
ip link add dummy8 type dummy
ip link add dummy9 type dummy
ip link add dummy10 type dummy
ip link add dummy11 type dummy
ip link add dummy12 type dummy
ip link add dummy13 type dummy
ip link add dummy14 type dummy
ip link add dummy15 type dummy
ip link add dummy16 type dummy
ip link add dummy17 type dummy
ip link add dummy18 type dummy
ip link add dummy19 type dummy
ip ro add 1.1.1.1/32 dev dummy1
ip ro add 1.1.1.2/32 dev dummy2
ip ro add 1.1.1.3/32 dev dummy3
ip ro add 1.1.1.4/32 dev dummy4
ip ro add 1.1.1.5/32 dev dummy5
ip ro add 1.1.1.6/32 dev dummy6
ip ro add 1.1.1.7/32 dev dummy7
ip ro add 1.1.1.8/32 dev dummy8
ip ro add 1.1.1.9/32 dev dummy9
ip ro add 1.1.1.10/32 dev dummy10
ip ro add 1.1.1.11/32 dev dummy11
ip ro add 1.1.1.12/32 dev dummy12
ip ro add 1.1.1.13/32 dev dummy13
ip ro add 1.1.1.14/32 dev dummy14
ip ro add 1.1.1.15/32 dev dummy15
ip ro add 1.1.1.16/32 dev dummy16
ip ro add 1.1.1.17/32 dev dummy17
ip ro add 1.1.1.18/32 dev dummy18
ip ro add 1.1.1.19/32 dev dummy19
ip next add id 1 via 1.1.1.1 dev dummy1
ip next add id 2 via 1.1.1.2 dev dummy2
ip next add id 3 via 1.1.1.3 dev dummy3
ip next add id 4 via 1.1.1.4 dev dummy4
ip next add id 5 via 1.1.1.5 dev dummy5
ip next add id 6 via 1.1.1.6 dev dummy6
ip next add id 7 via 1.1.1.7 dev dummy7
ip next add id 8 via 1.1.1.8 dev dummy8
ip next add id 9 via 1.1.1.9 dev dummy9
ip next add id 10 via 1.1.1.10 dev dummy10
ip next add id 11 via 1.1.1.11 dev dummy11
ip next add id 12 via 1.1.1.12 dev dummy12
ip next add id 13 via 1.1.1.13 dev dummy13
ip next add id 14 via 1.1.1.14 dev dummy14
ip next add id 15 via 1.1.1.15 dev dummy15
ip next add id 16 via 1.1.1.16 dev dummy16
ip next add id 17 via 1.1.1.17 dev dummy17
ip next add id 18 via 1.1.1.18 dev dummy18
ip next add id 19 via 1.1.1.19 dev dummy19
ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
ip next del id 1111
Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Stephen Worley <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Andrey Ignatov [Fri, 24 Jan 2020 22:41:42 +0000 (14:41 -0800)]
tools/bpf: Allow overriding llvm tools for runqslower
tools/testing/selftests/bpf/Makefile supports overriding clang, llc and
other tools so that custom ones can be used instead of those from PATH.
It's convenient and heavily used by some users.
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
sport 80 0xffff flowid 1:3
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
sport 25 0xffff flowid 1:4
where filters are installed on qdisc 1:0, so we can't merely
search from class 1:1 when creating class 1:3 and class 1:4. We have
to walk through all the child classes of the direct parent qdisc.
Otherwise we would miss filters that need reverse binding.
Cong Wang [Fri, 24 Jan 2020 00:26:18 +0000 (16:26 -0800)]
net_sched: fix ops->bind_class() implementations
The current implementations of ops->bind_class() are merely
searching for classid and updating class in the struct tcf_result,
without invoking either of cl_ops->bind_tcf() or
cl_ops->unbind_tcf(). This breaks their design, as qdiscs
like cbq use them to count filters too. This is why syzbot triggered
the warning in cbq_destroy_class().
In order to fix this, we have to call cl_ops->bind_tcf() and
cl_ops->unbind_tcf() like the filter binding path. This patch does
so by refactoring out two helper functions __tcf_bind_filter()
and __tcf_unbind_filter(), which are lockless and accept a Qdisc
pointer, then teaching each implementation to call them correctly.
Note, we merely pass the Qdisc pointer as an opaque pointer to
each filter, they only need to pass it down to the helper
functions without understanding it at all.
Merge branch 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull operating performance points (OPP) framework updates for v5.6
from Viresh Kumar:
"This contains a single patchset to fix reference counting of OPP
table structures."
* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
opp: Replace list_kref with a local counter
opp: Free static OPPs on errors while adding them
thermal: exynos: Rename Samsung and Exynos to lowercase
Fix up inconsistent usage of upper and lowercase letters in "Samsung"
and "Exynos" names.
"SAMSUNG" and "EXYNOS" are not abbreviations but regular trademarked
names. Therefore they should be written with lowercase letters starting
with capital letter.
The lowercase "Exynos" name is promoted by its manufacturer Samsung
Electronics Co., Ltd., in advertisement materials and on website.
Although advertisement materials usually use uppercase "SAMSUNG", the
lowercase version is used in all legal aspects (e.g. on Wikipedia and in
privacy/legal statements on
https://www.samsung.com/semiconductor/privacy-global/).
thermal: generic-adc: silence info message for IIO_TEMP channels
Since commit d36e2fa0253875 ("thermal: generic-adc: make lookup table
optional") "generic-adc-thermal" can be used with an IIO_TEMP channel.
In this case the following message is logged at probe time:
no lookup table, assuming DAC channel returns milliCelcius
Silence this info message if the channel type is known to be in
millicelsius. Keep this message when the channel type is unknown or not
of type temperature.
thermal: generic-adc: silence "no lookup table" on deferred probe
A "generic-adc-thermal" without "temperature-lookup-table" is perfectly
valid since commit d36e2fa0253875 ("thermal: generic-adc: make lookup
table optional"). On deferred probe the message "no lookup table,
assuming DAC channel returns milliCelcius" is still logged.
Prevent this message on deferred probe of the IIO channel by first
looking up the IIO channel.
Yangtao Li [Thu, 19 Dec 2019 17:28:17 +0000 (09:28 -0800)]
thermal/drivers/sun8i: Add thermal driver for H6/H5/H3/A64/A83T/R40
This patch adds support for the Allwinner thermal sensor found in
Allwinner SoCs. It registers the sensors with the thermal framework
and uses the device tree to bind cooling devices.
Daniel Lezcano [Thu, 19 Dec 2019 22:53:17 +0000 (23:53 +0100)]
thermal/drivers/cpu_cooling: Rename to cpufreq_cooling
As we introduced the idle injection cooling device called
cpuidle_cooling, let's be consistent and rename cpu_cooling to
cpufreq_cooling, as this one mitigates with OPP changes.
Daniel Lezcano [Thu, 19 Dec 2019 22:53:16 +0000 (23:53 +0100)]
thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
The cpu idle cooling device offers a new method to cool down a CPU by
injecting idle cycles at runtime.
It has some similarities with the intel power clamp driver but it is
actually designed to be more generic and relies on the idle injection
powercap framework.
The idle injection duration is fixed while the running duration is
variable. That gives control over the device reactivity for the
user experience.
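The relation between a fixed idle duration and a variable running duration can be worked out with simple arithmetic. The sketch below assumes the mitigation level is expressed as a target idle percentage p, which the commit message does not state; from idle / (idle + run) = p / 100 it follows that run = idle * 100 / p - idle. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Running time (in us) that yields @idle_pct percent idle for a fixed @idle_us. */
static uint64_t toy_run_duration_us(uint32_t idle_us, uint32_t idle_pct)
{
	assert(idle_pct >= 1 && idle_pct <= 100);
	return ((uint64_t)idle_us * 100) / idle_pct - idle_us;
}
```

With a 10 ms idle period, a 50 percent target gives 10 ms of running time and a 25 percent target gives 30 ms, so stronger mitigation shortens only the running part.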
An idle state powering down the CPU or the cluster will allow to drop
the static leakage, thus restoring the heat capacity of the SoC. It
can be set with a trip point between the hot and the critical points,
giving the opportunity to prevent a hard reset of the system when the
cpufreq cooling fails to cool down the CPU.
With more sophisticated boards having a per core sensor, the idle
cooling device allows to cool down a single core without throttling
the compute capacity of several cpus belonging to the same clock line,
so it could be used in collaboration with the cpufreq cooling device.
Provide some documentation for the idle injection cooling effect in
order to let people understand the rationale of the approach for the
idle injection CPU cooling device.
Daniel Lezcano [Wed, 4 Dec 2019 15:39:27 +0000 (16:39 +0100)]
thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
The next changes will add a new way to cool down a CPU by injecting
idle cycles. With the current configuration, a CPU cooling device is
the cpufreq cooling device. As we want to add a new CPU cooling
device, let's convert the CPU cooling to a choice giving a list of CPU
cooling devices. At this point, there is obviously only one CPU
cooling device.
Andrey Smirnov [Tue, 10 Dec 2019 16:41:50 +0000 (08:41 -0800)]
thermal: qoriq: Enable all sensors before registering them
tmu_get_temp() will get called as a part of sensor registration via
devm_thermal_zone_of_sensor_register(). To prevent it from returning
bogus data we need to enable sensor monitoring before that. Looking at
the datasheet (i.MX8MQ RM) there doesn't seem to be any harm in
enabling them all, so, for the sake of simplicity, change the code to
do just that.
Andrey Smirnov [Tue, 10 Dec 2019 16:41:49 +0000 (08:41 -0800)]
thermal: qoriq: Convert driver to use regmap API
Convert driver to use regmap API, drop custom LE/BE IO helpers and
simplify bit manipulation using regmap_update_bits(). This also allows
us to convert some register initialization to use loops and adds
convenient debug access to TMU registers via debugfs.
Andrey Smirnov [Tue, 10 Dec 2019 16:41:47 +0000 (08:41 -0800)]
thermal: qoriq: Pass data to qoriq_tmu_calibration() directly
We can simplify error cleanup code if instead of passing a "struct
platform_device *" to qoriq_tmu_calibration() and deriving a bunch of
pointers from it, we pass those pointers directly. This way we won't
be forced to call platform_set_drvdata() as early in qoriq_tmu_probe()
and won't need to have "platform_set_drvdata(pdev, NULL);" in error path.
Andrey Smirnov [Tue, 10 Dec 2019 16:41:46 +0000 (08:41 -0800)]
thermal: qoriq: Pass data to qoriq_tmu_register_tmu_zone() directly
Pass all necessary data to qoriq_tmu_register_tmu_zone() directly
instead of passing a platform device and then deriving it. This is
done as a first step to simplify resource deallocation code.
Andrey Smirnov [Tue, 10 Dec 2019 16:41:45 +0000 (08:41 -0800)]
thermal: qoriq: Embed per-sensor data into struct qoriq_tmu_data
Embed per-sensor data into struct qoriq_tmu_data so we can drop the
code allocating it. This also allows us to get rid of per-sensor back
reference to struct qoriq_tmu_data since now its address can be
calculated using container_of().
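A minimal userspace sketch of the idea follows. The structure layouts,
SENSOR_COUNT and the helper name are simplified stand-ins chosen for
illustration, not the driver's actual definitions, and container_of() is
re-implemented here without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

#define SENSOR_COUNT 4	/* hypothetical number of sensors */

struct sensor {
	int id;		/* index of this sensor in the parent's array */
};

struct tmu_data {
	int regs;				/* stand-in for driver-wide state */
	struct sensor sensor[SENSOR_COUNT];	/* embedded, no separate allocation */
};

/* Userspace equivalent of the kernel's container_of(), without type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing structure from a pointer to one embedded sensor. */
static struct tmu_data *sensor_to_data(struct sensor *s)
{
	assert(s->id >= 0 && s->id < SENSOR_COUNT);

	/* Step back to element 0 of the array, then to the parent structure. */
	return container_of(s - s->id, struct tmu_data, sensor);
}
```

Since the parent address is computed from the sensor's own address and
index, no back pointer has to be stored or kept in sync.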
Amit Kucheria [Wed, 20 Nov 2019 15:45:19 +0000 (21:15 +0530)]
thermal: amlogic: Appease the kernel-doc deity
Fix up the following warning when compiled with make W=1:
linux.git/drivers/thermal/amlogic_thermal.c:78: warning: Function parameter or member 'A' not described in 'amlogic_thermal_soc_calib_data'
linux.git/drivers/thermal/amlogic_thermal.c:78: warning: Function parameter or member 'B' not described in 'amlogic_thermal_soc_calib_data'
linux.git/drivers/thermal/amlogic_thermal.c:78: warning: Function parameter or member 'm' not described in 'amlogic_thermal_soc_calib_data'
linux.git/drivers/thermal/amlogic_thermal.c:78: warning: Function parameter or member 'n' not described in 'amlogic_thermal_soc_calib_data'
Amit Kucheria [Wed, 20 Nov 2019 15:45:18 +0000 (21:15 +0530)]
thermal: tegra: Appease the kernel-doc deity
Fix up the following warning when compiled with make W=1:
linux.git/drivers/thermal/tegra/soctherm.c:369: warning: Function parameter or member 'value' not described in 'ccroc_writel'
linux.git/drivers/thermal/tegra/soctherm.c:369: warning: Excess function parameter 'v' description in 'ccroc_writel'
linux.git/drivers/thermal/tegra/soctherm.c:447: warning: Function parameter or member 'dev' not described in 'enforce_temp_range'
linux.git/drivers/thermal/tegra/soctherm.c:772: warning: Function parameter or member 'sg' not described in 'tegra_soctherm_set_hwtrips'
linux.git/drivers/thermal/tegra/soctherm.c:772: warning: Function parameter or member 'tz' not described in 'tegra_soctherm_set_hwtrips'
linux.git/drivers/thermal/tegra/soctherm.c:944: warning: Function parameter or member 'ts' not described in 'soctherm_oc_intr_enable'
linux.git/drivers/thermal/tegra/soctherm.c:1167: warning: Function parameter or member 'data' not described in 'soctherm_oc_irq_disable'
linux.git/drivers/thermal/tegra/soctherm.c:1167: warning: Excess function parameter 'irq_data' description in 'soctherm_oc_irq_disable'
linux.git/drivers/thermal/tegra/soctherm.c:1224: warning: Function parameter or member 'ctrlr' not described in 'soctherm_irq_domain_xlate_twocell'
linux.git/drivers/thermal/tegra/soctherm.c:1686: warning: Function parameter or member 'pdev' not described in 'soctherm_init_hw_throt_cdev'
linux.git/drivers/thermal/tegra/soctherm.c:1764: warning: Function parameter or member 'ts' not described in 'throttlectl_cpu_level_cfg'
linux.git/drivers/thermal/tegra/soctherm.c:1812: warning: Function parameter or member 'ts' not described in 'throttlectl_cpu_level_select'
linux.git/drivers/thermal/tegra/soctherm.c:1855: warning: Function parameter or member 'ts' not described in 'throttlectl_cpu_mn'
linux.git/drivers/thermal/tegra/soctherm.c:1886: warning: Function parameter or member 'ts' not described in 'throttlectl_gpu_level_select'
linux.git/drivers/thermal/tegra/soctherm.c:1928: warning: Function parameter or member 'ts' not described in 'soctherm_throttle_program'
Amit Kucheria [Wed, 20 Nov 2019 15:45:17 +0000 (21:15 +0530)]
thermal: samsung: Appease the kernel-doc deity
Fix up the following warning when compiled with make W=1:
linux.git/drivers/thermal/samsung/exynos_tmu.c:141: warning: bad
line: driver
linux.git/drivers/thermal/samsung/exynos_tmu.c:203: warning: Function
parameter or member 'tzd' not described in 'exynos_tmu_data'
linux.git/drivers/thermal/samsung/exynos_tmu.c:203: warning: Function
parameter or member 'tmu_set_trip_temp' not described in
'exynos_tmu_data'
linux.git/drivers/thermal/samsung/exynos_tmu.c:203: warning: Function
parameter or member 'tmu_set_trip_hyst' not described in
'exynos_tmu_data'
Amit Kucheria [Wed, 20 Nov 2019 15:45:16 +0000 (21:15 +0530)]
thermal: rockchip: Appease the kernel-doc deity
Replace a comment starting with /** by simply /* to avoid having it
interpreted as a kernel-doc comment. Describe missing function
parameters where needed.
Fixes up the following warnings when compiled with make W=1:
linux.git/drivers/thermal/rockchip_thermal.c:27: warning: cannot
understand function prototype: 'enum tshut_mode '
linux.git/drivers/thermal/rockchip_thermal.c:37: warning: cannot
understand function prototype: 'enum tshut_polarity '
linux.git/drivers/thermal/rockchip_thermal.c:46: warning: cannot
understand function prototype: 'enum sensor_id '
linux.git/drivers/thermal/rockchip_thermal.c:56: warning: cannot
understand function prototype: 'enum adc_sort_mode '
linux.git/drivers/thermal/rockchip_thermal.c:123: warning: Function
parameter or member 'chn_id' not described in 'rockchip_tsadc_chip'
linux.git/drivers/thermal/rockchip_thermal.c:123: warning: Function
parameter or member 'control' not described in 'rockchip_tsadc_chip'
linux.git/drivers/thermal/rockchip_thermal.c:167: warning: Function
parameter or member 'sensors' not described in 'rockchip_thermal_data'
linux.git/drivers/thermal/rockchip_thermal.c:608: warning: Function
parameter or member 'grf' not described in 'rk_tsadcv2_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:608: warning: Function
parameter or member 'regs' not described in 'rk_tsadcv2_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:608: warning: Function
parameter or member 'tshut_polarity' not described in
'rk_tsadcv2_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:644: warning: Function
parameter or member 'grf' not described in 'rk_tsadcv3_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:644: warning: Function
parameter or member 'regs' not described in 'rk_tsadcv3_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:644: warning: Function
parameter or member 'tshut_polarity' not described in
'rk_tsadcv3_initialize'
linux.git/drivers/thermal/rockchip_thermal.c:732: warning: Function
parameter or member 'regs' not described in 'rk_tsadcv3_control'
linux.git/drivers/thermal/rockchip_thermal.c:732: warning: Function
parameter or member 'enable' not described in 'rk_tsadcv3_control'
linux.git/drivers/thermal/rockchip_thermal.c:1211: warning: Function
parameter or member 'reset' not described in
'rockchip_thermal_reset_controller'
Amit Kucheria [Wed, 20 Nov 2019 15:45:15 +0000 (21:15 +0530)]
thermal: mediatek: Appease the kernel-doc deity
Replace a comment starting with /** by simply /* to avoid having it
interpreted as a kernel-doc comment. Describe missing function
parameters where needed.
Fixes up the following warnings when compiled with make W=1:
linux.git/drivers/thermal/mtk_thermal.c:374: warning: cannot understand
function prototype: 'const struct mtk_thermal_data mt8173_thermal_data =
'
linux.git/drivers/thermal/mtk_thermal.c:413: warning: cannot understand
function prototype: 'const struct mtk_thermal_data mt2701_thermal_data =
'
linux.git/drivers/thermal/mtk_thermal.c:443: warning: cannot understand
function prototype: 'const struct mtk_thermal_data mt2712_thermal_data =
'
linux.git/drivers/thermal/mtk_thermal.c:499: warning: cannot understand
function prototype: 'const struct mtk_thermal_data mt8183_thermal_data =
'
linux.git/drivers/thermal/mtk_thermal.c:529: warning: Function parameter
or member 'sensno' not described in 'raw_to_mcelsius'
Amit Kucheria [Wed, 20 Nov 2019 15:45:13 +0000 (21:15 +0530)]
thermal: devfreq_cooling: Appease the kernel-doc deity
Fix up the following warnings with make W=1:
linux.git/drivers/thermal/devfreq_cooling.c:68: warning: Function
parameter or member 'capped_state' not described in
'devfreq_cooling_device'
linux.git/drivers/thermal/devfreq_cooling.c:593: warning: Function
parameter or member 'cdev' not described in 'devfreq_cooling_unregister'
linux.git/drivers/thermal/devfreq_cooling.c:593: warning: Excess
function parameter 'dfc' description in 'devfreq_cooling_unregister'
Amit Kucheria [Wed, 20 Nov 2019 15:45:12 +0000 (21:15 +0530)]
thermal: step_wise: Appease the kernel-doc deity
Replace - with : to appease the kernel-doc gods and fix warnings such as
the following when compiled with make W=1:
linux-amit.git/drivers/thermal/step_wise.c:187: warning: Function
parameter or member 'tz' not described in 'step_wise_throttle'
linux-amit.git/drivers/thermal/step_wise.c:187: warning: Function
parameter or member 'trip' not described in 'step_wise_throttle'
linux.git/drivers/thermal/fair_share.c:79: warning: Function parameter
or member 'tz' not described in 'fair_share_throttle'
linux.git/drivers/thermal/fair_share.c:79: warning: Function parameter
or member 'trip' not described in 'fair_share_throttle'
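For reference, a kernel-doc comment in the form the tool accepts looks
roughly like the sketch below. The function is a made-up example, not
code from step_wise.c or fair_share.c.

```c
#include <assert.h>

/**
 * clamp_cooling_state() - Limit a requested cooling state to a maximum
 * @requested: state asked for by the governor
 * @max: highest state the cooling device supports, must not be negative
 *
 * Each parameter line is "@name:" followed by its description. With
 * "@name -" the parameter is reported as not described.
 *
 * Return: @requested, or @max if @requested exceeds it.
 */
static int clamp_cooling_state(int requested, int max)
{
	assert(max >= 0);

	return requested > max ? max : requested;
}
```

A comment that is not meant for kernel-doc should start with /* rather
than the /** marker.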
Amit Kucheria [Wed, 20 Nov 2019 15:45:10 +0000 (21:15 +0530)]
thermal: of-thermal: Appease the kernel-doc deity
Replace a comment starting with /** by simply /* to avoid having
it interpreted as a kernel-doc comment.
Fixes the following warning when compiled with make W=1:
linux.git/drivers/thermal/of-thermal.c:761: warning: cannot understand function prototype: 'const char *trip_types[] = '
====================
Netfilter updates for net-next
This batch contains Netfilter updates for net-next:
1) Add nft_setelem_parse_key() helper function.
2) Add NFTA_SET_ELEM_KEY_END to specify a range with one single element.
3) Add NFTA_SET_DESC_CONCAT to describe the set element concatenation,
from Stefano Brivio.
4) Add bitmap_cut() to copy n-bits from source to destination,
from Stefano Brivio.
5) Add set to match on arbitrary concatenations, from Stefano Brivio.
6) Add selftest for this new set type. An extract of Stefano's
description follows:
"Existing nftables set implementations allow matching entries with
interval expressions (rbtree), e.g. 192.0.2.1-192.0.2.4, entries
specifying field concatenation (hash, rhash), e.g. 192.0.2.1:22,
but not both.
In other words, none of the set types allows matching on range
expressions for more than one packet field at a time, such as ipset
does with types bitmap:ip,mac, and, to a more limited extent
(netmasks, not arbitrary ranges), with types hash:net,net,
hash:net,port, hash:ip,port,net, and hash:net,port,net.
As a pure hash-based approach is unsuitable for matching on ranges,
and "proxying" the existing red-black tree type looks impractical as
elements would need to be shared and managed across all employed
trees, this new set implementation intends to fill the functionality
gap by employing a relatively novel approach.
The fundamental idea, illustrated in deeper detail in patch 5/9, is to
use lookup tables classifying a small number of grouped bits from each
field, and map the lookup results in a way that yields a verdict for
the full set of specified fields.
The grouping bit aspect is loosely inspired by the Grouper algorithm,
by Jay Ligatti, Josh Kuhn, and Chris Gage (see patch 5/9 for the full
reference).
A reference, stand-alone implementation of the algorithm itself is
available at:
https://pipapo.lameexcu.se
Some notes about possible future optimisations are also mentioned
there. This algorithm reduces the matching problem to, essentially,
a repetitive sequence of simple bitwise operations, and is
particularly suitable to be optimised by leveraging SIMD instruction
sets."
====================
Stefano Brivio [Tue, 21 Jan 2020 23:17:56 +0000 (00:17 +0100)]
selftests: netfilter: Introduce tests for sets with range concatenation
This test covers functionality and stability of the newly added
nftables set implementation supporting concatenation of ranged
fields.
For some selected set expression types, test:
- correctness, by checking that packets match or don't
- concurrency, by attempting races between insertion, deletion, lookup
- timeout feature, checking that packets don't match expired entries
and (roughly) estimate matching rates, comparing to baselines for
simple drop on netdev ingress hook and for hash and rbtrees sets.
In order to send packets, this needs one of sendip, netcat or bash.
To flood with traffic, iperf3, iperf and netperf are supported. For
performance measurements, this relies on the sample pktgen script
pktgen_bench_xmit_mode_netif_receive.sh.
If none of the tools suitable for a given test are available, specific
tests will be skipped.
Stefano Brivio [Tue, 21 Jan 2020 23:17:55 +0000 (00:17 +0100)]
nf_tables: Add set type for arbitrary concatenation of ranges
This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.
Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 2^4 (16) buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
http://www.cse.usf.edu/~ligatti/projects/grouper/
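The expansion of a range into composing netmasks can be sketched as
follows for a single 32-bit field. This is the generic range-to-prefix
decomposition, written for illustration; the names are hypothetical and
it is not the code from nft_set_pipapo.c.

```c
#include <assert.h>
#include <stdint.h>

struct prefix {
	uint32_t base;		/* first value covered */
	unsigned int len;	/* number of leading bits that must match */
};

/*
 * Split [start, end] into the smallest list of aligned power-of-two blocks,
 * each one expressible as base/len. Returns the number of entries written.
 */
static unsigned int range_to_prefixes(uint32_t start, uint32_t end,
				      struct prefix *out, unsigned int max)
{
	uint64_t cur = start;	/* 64-bit so that end == 0xffffffff terminates */
	unsigned int n = 0;

	assert(start <= end);

	while (cur <= end) {
		unsigned int host_bits = 0;

		/* Grow the block while it stays aligned and inside the range. */
		while (host_bits < 32) {
			uint64_t size = UINT64_C(1) << (host_bits + 1);

			if ((cur & (size - 1)) || cur + size - 1 > end)
				break;
			host_bits++;
		}

		assert(n < max);
		out[n].base = (uint32_t)cur;
		out[n].len = 32 - host_bits;
		n++;

		cur += UINT64_C(1) << host_bits;
	}

	return n;
}
```

For instance, 192.0.2.0-192.0.2.42 becomes four entries (/27, /29, /31
and /32), and the port range 22-25 becomes two /31 entries on a 32-bit
field.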
Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.
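A toy version of this lookup, for one 8-bit field and at most 32 rules,
might look like the sketch below. Sizes, names and the single-word rule
bitmap are simplifications chosen for illustration; the real data
structures are described in nft_set_pipapo.c.

```c
#include <assert.h>
#include <stdint.h>

#define GROUPS     2	/* an 8-bit field split in two 4-bit groups */
#define BUCKETS   16	/* all possible values of a 4-bit group */
#define MAX_RULES 32	/* one bit per rule in a uint32_t */

struct field_lt {
	/* bit r set in bucket[g][v]: rule r accepts value v for group g */
	uint32_t bucket[GROUPS][BUCKETS];
};

/* Insert a rule matching 'value' on its 'mask_len' leading bits. */
static void lt_insert(struct field_lt *lt, unsigned int rule,
		      uint8_t value, unsigned int mask_len)
{
	uint8_t mask;

	assert(rule < MAX_RULES && mask_len <= 8);
	mask = mask_len ? (uint8_t)(0xff << (8 - mask_len)) : 0;

	for (int g = 0; g < GROUPS; g++) {
		int shift = 4 * (GROUPS - 1 - g);	/* group 0: upper bits */
		uint8_t gmask = (mask >> shift) & 0xf;
		uint8_t gval = (value >> shift) & 0xf;

		for (int v = 0; v < BUCKETS; v++)
			if ((v & gmask) == (gval & gmask))
				lt->bucket[g][v] |= UINT32_C(1) << rule;
	}
}

/* AND together the buckets selected by each 4-bit group of the key. */
static uint32_t lt_lookup(const struct field_lt *lt, uint8_t key)
{
	uint32_t res = ~UINT32_C(0);

	for (int g = 0; g < GROUPS; g++) {
		int shift = 4 * (GROUPS - 1 - g);

		res &= lt->bucket[g][(key >> shift) & 0xf];
	}

	return res;	/* bitmap of rules matching this field */
}
```

The lookup cost depends on the number of groups, not on the number of
rules held in each bucket word.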
In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.
The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.
A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.
A stand-alone implementation of this algorithm is available at:
https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).
This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.
At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):
TEST: performance
net,port [ OK ]
baseline (drop from netdev hook): 10190076pps
baseline hash (non-ranged entries): 6179564pps
baseline rbtree (match on first field only): 2950341pps
set with 1000 full, ranged entries: 2304165pps
port,net [ OK ]
baseline (drop from netdev hook): 10143615pps
baseline hash (non-ranged entries): 6135776pps
baseline rbtree (match on first field only): 4311934pps
set with 100 full, ranged entries: 4131471pps
net6,port [ OK ]
baseline (drop from netdev hook): 9730404pps
baseline hash (non-ranged entries): 4809557pps
baseline rbtree (match on first field only): 1501699pps
set with 1000 full, ranged entries: 1092557pps
port,proto [ OK ]
baseline (drop from netdev hook): 10812426pps
baseline hash (non-ranged entries): 6929353pps
baseline rbtree (match on first field only): 3027105pps
set with 30000 full, ranged entries: 284147pps
net6,port,mac [ OK ]
baseline (drop from netdev hook): 9660114pps
baseline hash (non-ranged entries): 3778877pps
baseline rbtree (match on first field only): 3179379pps
set with 10 full, ranged entries: 2082880pps
net6,port,mac,proto [ OK ]
baseline (drop from netdev hook): 9718324pps
baseline hash (non-ranged entries): 3799021pps
baseline rbtree (match on first field only): 1506689pps
set with 1000 full, ranged entries: 783810pps
net,mac [ OK ]
baseline (drop from netdev hook): 10190029pps
baseline hash (non-ranged entries): 5172218pps
baseline rbtree (match on first field only): 2946863pps
set with 1000 full, ranged entries: 1279122pps
v4:
- fix build for 32-bit architectures: 64-bit division needs
div_u64() (kbuild test robot)
v3:
- rework interface for field length specification,
NFT_SET_SUBKEY disappears and information is stored in
description
- remove scratch area to store closing element of ranges,
as elements now come with an actual attribute to specify
the upper range limit (Pablo Neira Ayuso)
- also remove pointer to 'start' element from mapping table,
closing key is now accessible via extension data
- use bytes right away instead of bits for field lengths,
this way we can also double the inner loop of the lookup
function to take care of upper and lower bits in a single
iteration (minor performance improvement)
- make it clearer that set operations are actually atomic
API-wise, but we can't e.g. implement flush() as one-shot
action
- fix type for 'dup' in nft_pipapo_insert(), check for
duplicates only in the next generation, and in general take
care of differentiating generation mask cases depending on
the operation (Pablo Neira Ayuso)
- report C implementation matching rate in commit message, so
that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
- protect access to scratch maps in nft_pipapo_lookup() with
local_bh_disable/enable() (Florian Westphal)
- drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
already implied (Florian Westphal)
- explain why partial allocation failures don't need handling
in pipapo_realloc_scratch(), rename 'm' to clone and update
related kerneldoc to make it clear we're not operating on
the live copy (Florian Westphal)
- add explicit check for priv->start_elem in
nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
with a NULL start element, and also zero it out in every
operation that might make it invalid, so that insertion
doesn't proceed with an invalid element (Florian Westphal)
Stefano Brivio [Tue, 21 Jan 2020 23:17:54 +0000 (00:17 +0100)]
bitmap: Introduce bitmap_cut(): cut bits and shift remaining
The new bitmap function bitmap_cut() copies bits from source to
destination by removing the region specified by parameters first
and cut, and remapping the bits above the cut region by right
shifting them.
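The effect can be illustrated on a single 64-bit word. The helper below
is a simplified stand-in written for this purpose, not the bitmap_cut()
implementation, which works on multi-word bitmaps.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Remove 'cut' bits starting at bit 'first' from 'src' and shift the bits
 * above the removed region right to close the gap.
 */
static uint64_t cut_bits(uint64_t src, unsigned int first, unsigned int cut)
{
	uint64_t below, above;

	assert(first + cut < 64);	/* keeps every shift well defined */

	below = src & ((UINT64_C(1) << first) - 1);	/* bits under the cut */
	above = src >> (first + cut);			/* bits over the cut */

	return below | (above << first);
}
```

For example, cutting 8 bits at position 4 from 0xABCD drops the 0xBC
part and leaves 0xAD.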
Stefano Brivio [Tue, 21 Jan 2020 23:17:53 +0000 (00:17 +0100)]
netfilter: nf_tables: Support for sets with multiple ranged fields
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.
This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.
In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.
While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.
For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements: