Alice Michael [Wed, 25 Oct 2023 21:41:53 +0000 (14:41 -0700)]
ice: Add 200G speed/phy type use
Add support for 200G PHY speeds and the mapping for their
advertisement in link. Add the new PHY type bits for the AQ command, as
needed for 200G E830 controllers.
Paul Greenwalt [Wed, 25 Oct 2023 21:41:52 +0000 (14:41 -0700)]
ice: Add E830 device IDs, MAC type and registers
E830 is the 200G NIC family which uses the ice driver.
Add specific E830 registers. Embed macros that select the proper register
based on (hw)->mac_type, and name those macros [ORIGINAL]_BY_MAC(hw).
Registers only available on one of the MACs need to be explicitly referred
to as E800_NAME instead of just NAME. PTP is not yet supported.
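The register-selection convention described above could look roughly like the
sketch below. Only the [ORIGINAL]_BY_MAC(hw) naming pattern and the
(hw)->mac_type test come from the commit message; the struct, enum values and
register offsets are hypothetical placeholders, not the ice driver's
definitions.

```c
#include <stdint.h>

/* Hypothetical MAC types; the real driver defines its own enum. */
enum demo_mac_type { DEMO_MAC_E800, DEMO_MAC_E830 };

struct demo_hw {
	enum demo_mac_type mac_type;
};

/* Placeholder offsets chosen for illustration only. */
#define E800_DEMO_REG 0x1000u
#define E830_DEMO_REG 0x2000u

/* Pick the register that matches the MAC type stored in hw. */
#define DEMO_REG_BY_MAC(hw) \
	((hw)->mac_type == DEMO_MAC_E830 ? E830_DEMO_REG : E800_DEMO_REG)

static inline uint32_t demo_reg_offset(const struct demo_hw *hw)
{
	return DEMO_REG_BY_MAC(hw);
}
```

Callers that only make sense on one MAC would use the prefixed name
(E800_DEMO_REG here) directly instead of the _BY_MAC macro.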
We've added 51 non-merge commits during the last 10 day(s) which contain
a total of 75 files changed, 5037 insertions(+), 200 deletions(-).
The main changes are:
1) Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF,
from Chuyi Zhou.
2) Fix BPF verifier's iterator convergence logic to use exact states
comparison for convergence checks, from Eduard Zingerman,
Andrii Nakryiko and Alexei Starovoitov.
3) Add BPF programmable net device where bpf_mprog defines the logic
of its xmit routine. It can operate in L3 and L2 mode,
from Daniel Borkmann and Nikolay Aleksandrov.
4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking
for global per-CPU allocator, from Hou Tao.
5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section
was going to be present whenever a binary has SHT_GNU_versym section,
from Andrii Nakryiko.
6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into
atomic_set_release(), from Paul E. McKenney.
7) Add a warning if NAPI callback missed xdp_do_flush() under
CONFIG_DEBUG_NET which helps checking if drivers were missing
the former, from Sebastian Andrzej Siewior.
8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing
a warning under sleepable programs, from Yafang Shao.
9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before
checking map_locked, from Song Liu.
10) Make BPF CI linked_list failure test more robust,
from Kumar Kartikeya Dwivedi.
11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik.
12) Fix xsk starving when multiple xsk sockets were associated with
a single xsk_buff_pool, from Albert Huang.
13) Clarify the signed modulo implementation for the BPF ISA standardization
document that it uses truncated division, from Dave Thaler.
14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider
signed bounds knowledge, from Andrii Nakryiko.
15) Add an option to XDP selftests to use multi-buffer AF_XDP
xdp_hw_metadata and mark used XDP programs as capable to use frags,
from Larysa Zaremba.
16) Fix bpftool's BTF dumper wrt printing a pointer value and another
one to fix struct_ops dump in an array, from Manu Bretelle.
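Item 13 refers to truncated division, which can be illustrated in plain C.
This is a minimal sketch of the arithmetic only, not the BPF interpreter's or
any JIT's code; the function name is hypothetical, and the caller is assumed
to rule out b == 0 and the INT64_MIN / -1 overflow case.

```c
#include <stdint.h>

/*
 * Signed modulo built on truncated (round-toward-zero) division.
 * The result takes the sign of the dividend a.
 * Assumes b != 0 and that a / b does not overflow.
 */
static int64_t demo_trunc_smod(int64_t a, int64_t b)
{
	int64_t q = a / b;	/* C99 and later truncate toward zero */

	return a - q * b;
}
```

With truncated division, -13 modulo 3 yields -1, whereas a floored
definition would yield 2.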
* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits)
netkit: Remove explicit active/peer ptr initialization
selftests/bpf: Fix selftests broken by mitigations=off
samples/bpf: Allow building with custom bpftool
samples/bpf: Fix passing LDFLAGS to libbpf
samples/bpf: Allow building with custom CFLAGS/LDFLAGS
bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
selftests/bpf: Add selftests for netkit
selftests/bpf: Add netlink helper library
bpftool: Extend net dump with netkit progs
bpftool: Implement link show support for netkit
libbpf: Add link-based API for netkit
tools: Sync if_link uapi header
netkit, bpf: Add bpf programmable net device
bpf: Improve JEQ/JNE branch taken logic
bpf: Fold smp_mb__before_atomic() into atomic_set_release()
bpf: Fix unnecessary -EBUSY from htab_lock_bucket
xsk: Avoid starving the xsk further down the list
bpf: print full verifier states on infinite loop detection
selftests/bpf: test if state loops are detected in a tricky case
bpf: correct loop detection for iterators convergence
...
====================
Alexey Makhalov [Wed, 25 Oct 2023 23:19:31 +0000 (16:19 -0700)]
MAINTAINERS: Maintainer change for ptp_vmw driver
Deep has decided to transfer the maintainership of the VMware virtual
PTP clock driver (ptp_vmw) to Jeff. Update the MAINTAINERS file to
reflect this change.
Michael Chan [Thu, 26 Oct 2023 01:32:31 +0000 (18:32 -0700)]
bnxt_en: Fix 2 stray ethtool -S counters
The recent firmware interface change has added 2 counters in struct
rx_port_stats_ext. This caused 2 stray ethtool counters to be
displayed.
Since new counters are added from time to time, fix it so that the
ethtool logic will only display up to the maximum known counters.
These 2 counters are not used by production firmware yet.
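The idea of showing only up to the maximum known counters can be sketched as
a simple clamp. The helper name and parameters below are hypothetical and do
not reflect the bnxt_en code; they only illustrate the bound described above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many extended counters to display.
 * fw_reported is the count the firmware interface provides,
 * max_known is the number of counters the driver has names for.
 */
static size_t demo_counters_to_show(size_t fw_reported, size_t max_known)
{
	return fw_reported < max_known ? fw_reported : max_known;
}
```

With this bound in place, counters added by newer firmware interfaces are
ignored until the driver learns about them.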
Jakub Kicinski [Wed, 25 Oct 2023 18:27:39 +0000 (11:27 -0700)]
tools: ynl-gen: respect attr-cnt-name at the attr set level
Davide reports that we look for the attr-cnt-name in the wrong
object. We try to read it from the family, but the schema only
allows for it to exist at attr-set level.
Jakub Kicinski [Wed, 25 Oct 2023 16:22:53 +0000 (09:22 -0700)]
netlink: specs: support conditional operations
Page pool code is compiled conditionally, but the operations
are part of the shared netlink family. We can handle this
by reporting an empty list of pools or -EOPNOTSUPP / -ENOSYS,
but the cleanest way seems to be removing the ops completely
at compilation time. That way the user can see that the page
pool ops are not present using genetlink introspection,
the same way they'd check if the kernel is "new enough" to
support the ops.
Extend the specs with the ability to specify the config
condition under which an op (and its policies, etc.) should
be hidden.
Jakub Kicinski [Wed, 25 Oct 2023 16:22:04 +0000 (09:22 -0700)]
netlink: make range pointers in policies const
struct nla_policy is usually constant itself, but unless
we make the ranges inside constant we won't be able to
make range structs const. The ranges are not modified
by the core.
net/vmw_vsock/virtio_transport.c 64c99d2d6ada ("vsock/virtio: support to send non-linear skb") 53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs")
Linus Torvalds [Thu, 26 Oct 2023 17:41:27 +0000 (07:41 -1000)]
Merge tag 'net-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi and netfilter.
Most regressions addressed here come from quite old versions, with the
exceptions of the iavf one and the WiFi fixes. No known outstanding
reports or investigation.
Fixes to fixes:
- eth: iavf: in iavf_down, disable queues when removing the driver
Previous releases - regressions:
- sched: act_ct: additional checks for outdated flows
- tcp: do not leave an empty skb in write queue
- tcp: fix wrong RTO timeout when received SACK reneging
- wifi: cfg80211: pass correct pointer to rdev_inform_bss()
- eth: i40e: sync next_to_clean and next_to_process for programming
status desc
- eth: iavf: initialize waitqueues before starting watchdog_task
Previous releases - always broken:
- eth: r8169: fix data-races
- eth: igb: fix potential memory leak in igb_add_ethtool_nfc_entry
- eth: r8152: avoid writing garbage to the adapter's registers
- eth: gtp: fix fragmentation needed check with gso"
* tag 'net-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits)
iavf: in iavf_down, disable queues when removing the driver
vsock/virtio: initialize the_virtio_vsock before using VQs
net: ipv6: fix typo in comments
net: ipv4: fix typo in comments
net/sched: act_ct: additional checks for outdated flows
netfilter: flowtable: GC pushes back packets to classic path
i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
gtp: fix fragmentation needed check with gso
gtp: uapi: fix GTPA_MAX
Fix NULL pointer dereference in cn_filter()
sfc: cleanup and reduce netlink error messages
net/handshake: fix file ref count in handshake_nl_accept_doit()
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
isdn: mISDN: hfcsusb: Spelling fix in comment
tcp: fix wrong RTO timeout when received SACK reneging
r8152: Block future register access if register access fails
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
...
Yafang Shao [Wed, 25 Oct 2023 03:11:44 +0000 (03:11 +0000)]
selftests/bpf: Fix selftests broken by mitigations=off
When we configure the kernel command line with 'mitigations=off' and set
the sysctl knob 'kernel.unprivileged_bpf_disabled' to 0, the commit bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations")
causes issues in the execution of `test_progs -t verifier`. This is
because 'mitigations=off' bypasses Spectre v1 and Spectre v4 protections.
Currently, when a program requests to run in unprivileged mode
(kernel.unprivileged_bpf_disabled = 0), the BPF verifier may prevent
it from running due to the following conditions not being enabled:
While 'mitigations=off' enables the first two conditions, it does not
enable the latter two. As a result, some test cases in
'test_progs -t verifier' that were expected to fail to run may run
successfully, while others still fail but with different error messages.
This makes it challenging to address them comprehensively.
Moreover, in the future, we may introduce more fine-grained control over
CPU mitigations, such as enabling only bypass_spec_v1 or bypass_spec_v4.
Given the complexity of the situation, rather than fixing each broken test
case individually, it's preferable to skip them when 'mitigations=off' is
in effect and introduce specific test cases for the new 'mitigations=off'
scenario. For instance, we can introduce new BTF declaration tags like
'__failure__nospec', '__failure_nospecv1' and '__failure_nospecv4'.
In this patch, the approach is to simply skip the broken test cases when
'mitigations=off' is enabled. The result of `test_progs -t verifier` as
follows after this commit,
Viktor Malik [Wed, 25 Oct 2023 06:19:14 +0000 (08:19 +0200)]
samples/bpf: Allow building with custom bpftool
samples/bpf builds its own bpftool bootstrap to generate vmlinux.h as well
as some BPF objects. This is a redundant step if bpftool has been
already built, so update samples/bpf/Makefile such that it accepts a
path to bpftool passed via the BPFTOOL variable. The approach is
practically the same as tools/testing/selftests/bpf/Makefile uses.
Viktor Malik [Wed, 25 Oct 2023 06:19:13 +0000 (08:19 +0200)]
samples/bpf: Fix passing LDFLAGS to libbpf
samples/bpf/Makefile passes LDFLAGS=$(TPROGS_LDFLAGS) to libbpf build
without surrounding quotes, which may cause compilation errors when
passing custom TPROGS_USER_LDFLAGS.
Viktor Malik [Wed, 25 Oct 2023 06:19:12 +0000 (08:19 +0200)]
samples/bpf: Allow building with custom CFLAGS/LDFLAGS
Currently, it is not possible to specify custom flags when building
samples/bpf. The flags are defined in TPROGS_CFLAGS/TPROGS_LDFLAGS
variables, however, when trying to override those from the make command,
compilation fails.
For example, when trying to build with PIE:
$ make -C samples/bpf TPROGS_CFLAGS="-fpie" TPROGS_LDFLAGS="-pie"
This is because samples/bpf/Makefile updates these variables, especially
appends include paths to TPROGS_CFLAGS and these updates are overridden
by setting the variables from the make command.
This patch introduces variables TPROGS_USER_CFLAGS/TPROGS_USER_LDFLAGS
for this purpose, which can be set from the make command and their
values are propagated to TPROGS_CFLAGS/TPROGS_LDFLAGS.
The source and destination ports should be taken into account when
determining the route destination; they can affect the result, for
example in case there are routing rules defined.
Paolo Abeni [Thu, 26 Oct 2023 10:20:35 +0000 (12:20 +0200)]
Merge tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Mostly
nf_tables updates with two patches for connlabel and br_netfilter.
1) Rename function name to perform on-demand GC for rbtree elements,
and replace async GC in rbtree by sync GC. Patches from Florian Westphal.
2) Use commit_mutex for NFT_MSG_GETRULE_RESET to ensure that two
concurrent threads invoking this command do not underrun stateful
objects. Patches from Phil Sutter.
3) Use single hook to deal with IP and ARP packets in br_netfilter.
Patch from Florian Westphal.
4) Use atomic_t in netns->connlabel use counter instead of using a
spinlock, also patch from Florian.
5) Cleanups for stateful objects infrastructure in nf_tables.
Patches from Phil Sutter.
6) Flush path uses opaque set element offered by the iterator, instead of
calling pipapo_deactivate() which looks it up again.
7) Set backend .flush interface always succeeds, make it return void
instead.
8) Add struct nft_elem_priv placeholder structure and use it to replace
void *, which defeats compiler type checks, when passing the opaque set
element representation from backend to frontend.
9) Shrink memory consumption of set element transactions, by reducing
struct nft_trans_elem object size and reducing stack memory usage.
10) Use struct nft_elem_priv for the set backend .insert operation too.
11) Carry reset flag in nft_set_dump_ctx structure, instead of passing it
as a function argument, from Phil Sutter.
netfilter pull request 23-10-25
* tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
netfilter: nf_tables: shrink memory consumption of set elements
netfilter: nf_tables: expose opaque set element as struct nft_elem_priv
netfilter: nf_tables: set backend .flush always succeeds
netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
netfilter: nf_tables: nft_obj_filter fits into cb->ctx
netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
netfilter: nf_tables: A better name for nft_obj_filter
netfilter: nf_tables: Unconditionally allocate nft_obj_filter
netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
netfilter: conntrack: switch connlabels to atomic_t
br_netfilter: use single forward hook for ip and arp
netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests
netfilter: nf_tables: Introduce nf_tables_getrule_single()
netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()
netfilter: nft_set_rbtree: prefer sync gc to async worker
netfilter: nft_set_rbtree: rename gc deactivate+erase function
====================
Alex Henrie [Tue, 24 Oct 2023 21:23:08 +0000 (15:23 -0600)]
net: ipv6/addrconf: clamp preferred_lft to the minimum required
If the preferred lifetime was less than the minimum required lifetime,
ipv6_create_tempaddr would error out without creating any new address.
On my machine and network, this error happened immediately with the
preferred lifetime set to 1 second, after a few minutes with the
preferred lifetime set to 4 seconds, and not at all with the preferred
lifetime set to 5 seconds. During my investigation, I found a Stack
Exchange post from another person who seems to have had the same
problem: They stopped getting new addresses if they lowered the
preferred lifetime below 3 seconds, and they didn't really know why.
The preferred lifetime is a preference, not a hard requirement. The
kernel does not strictly forbid new connections on a deprecated address,
nor does it guarantee that the address will be disposed of the instant
its total valid lifetime expires. So rather than disable IPv6 privacy
extensions altogether if the minimum required lifetime swells above the
preferred lifetime, it is more in keeping with the user's intent to
increase the temporary address's lifetime to the minimum necessary for
the current network conditions.
With these fixes, setting the preferred lifetime to 3 or 4 seconds "just
works" because the extra fraction of a second is practically
unnoticeable. It's even possible to reduce the time before deprecation
to 1 or 2 seconds by also disabling duplicate address detection (setting
/proc/sys/net/ipv6/conf/*/dad_transmits to 0). I realize that that is a
pretty niche use case, but I know at least one person who would gladly
sacrifice performance and convenience to be sure that they are getting
the maximum possible level of privacy.
Alex Henrie [Tue, 24 Oct 2023 21:23:07 +0000 (15:23 -0600)]
net: ipv6/addrconf: clamp preferred_lft to the maximum allowed
Without this patch, there is nothing to stop the preferred lifetime of a
temporary address from being greater than its valid lifetime. If that
was the case, the valid lifetime was effectively ignored.
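The two clamps described in these commits can be sketched as follows. This is
a simplified illustration under stated assumptions, not the code in
ipv6_create_tempaddr(): the helper name, its parameters and the choice to
report failure when the minimum exceeds the valid lifetime are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper; all lifetimes are in seconds.
 * 1. Cap the preferred lifetime at the valid lifetime.
 * 2. Raise it to min_required if it is shorter than that.
 * Returns false when min_required itself exceeds the valid lifetime,
 * in which case no usable preferred lifetime exists and *out is
 * left untouched.
 */
static bool demo_clamp_preferred_lft(uint32_t preferred, uint32_t valid,
				     uint32_t min_required, uint32_t *out)
{
	if (preferred > valid)
		preferred = valid;
	if (preferred < min_required)
		preferred = min_required;
	if (preferred > valid)
		return false;
	*out = preferred;
	return true;
}
```

For example, a requested preferred lifetime of 1 second with a 3 second
minimum becomes 3 seconds instead of causing address creation to fail.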
====================
ipv6: avoid atomic fragment on GSO output
When the ipv6 stack outputs a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
existing report from APNIC also shows that atomic fragments are more
likely to be dropped even though they are equivalent to a no-op [1].
The series contains the following changes:
* drop feature RTAX_FEATURE_ALLFRAG, which has been broken. This helps
simplify other changes in this set.
* refactor __ip6_finish_output code to separate GSO and non-GSO packet
processing, mirroring IPv4 side logic.
* avoid generating atomic fragments on GSO packets.
Yan Zhai [Tue, 24 Oct 2023 14:26:40 +0000 (07:26 -0700)]
ipv6: avoid atomic fragment on GSO packets
When the ipv6 stack outputs a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
existing report from APNIC also shows that atomic fragments are more
likely to be dropped even though they are equivalent to a no-op [1].
Add an extra check in the GSO slow output path. For each segment from
the original over-sized packet, if it fits with the path MTU, then avoid
generating an atomic fragment.
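The per-segment check described above can be sketched like this. It is a
minimal model of the decision only, with hypothetical names and plain lengths
in place of skbs; it is not the code in the IPv6 output path.

```c
#include <stdbool.h>
#include <stddef.h>

/* A segment only needs the fragmentation path if it exceeds the MTU. */
static bool demo_segment_needs_fragmenting(size_t seg_len, size_t path_mtu)
{
	return seg_len > path_mtu;
}

/* Count how many segments of a split GSO packet would be fragmented. */
static size_t demo_count_fragmented(const size_t *seg_lens, size_t n,
				    size_t path_mtu)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (demo_segment_needs_fragmenting(seg_lens[i], path_mtu))
			count++;
	return count;
}
```

A short trailing segment that already fits the path MTU is sent as-is, so it
never carries a fragment header that would make it an atomic fragment.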
Yan Zhai [Tue, 24 Oct 2023 14:26:37 +0000 (07:26 -0700)]
ipv6: refactor ip6_finish_output for GSO handling
Separate GSO and non-GSO packet handling to make the logic cleaner. For
GSO packets, frag_max_size check can be omitted because it is only
useful for packets defragmented by netfilter hooks. Both local output
and GRO logic won't produce GSO packets when defragment is needed. This
also mirrors what IPv4 side code is doing.
The feature would send packets to the fragmentation path if a box
receives a PMTU value of less than 1280 bytes. However, since commit 9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such
messages are simply discarded. The feature flag is not supported
in the iproute2 utility either. In theory one can still manipulate it with a
direct netlink message, but it is not ideal because it was based on obsolete
guidance of RFC-2460 (replaced by RFC-8200).
The feature would always test false at the moment, so remove the related
code or mark it as unused.
Michal Schmidt [Wed, 25 Oct 2023 18:32:13 +0000 (11:32 -0700)]
iavf: in iavf_down, disable queues when removing the driver
In iavf_down, we're skipping the scheduling of certain operations if
the driver is being removed. However, the IAVF_FLAG_AQ_DISABLE_QUEUES
request must not be skipped in this case, because iavf_close waits
for the transition to the __IAVF_DOWN state, which happens in
iavf_virtchnl_completion after the queues are released.
Without this fix, "rmmod iavf" takes half a second per interface that's
up and prints the "Device resources not yet released" warning.
Jakub Kicinski [Wed, 25 Oct 2023 23:02:06 +0000 (16:02 -0700)]
Merge tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
This patchset contains two late Netfilter flowtable fixes for net:
1) Flowtable GC pushes back packets to classic path in every GC run,
ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only
used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset
by the time the flow is offloaded (this status bit is only reliable
in the sched/act_ct datapath).
2) sched/act_ct logic to push back packets to classic path to reevaluate
if UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is
set on and no hardware offload request is pending to be handled.
From Vlad Buslov.
These two patches fix two problems that were introduced in the
previous 6.5 development cycle.
* tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
net/sched: act_ct: additional checks for outdated flows
netfilter: flowtable: GC pushes back packets to classic path
====================
Alexandru Matei [Tue, 24 Oct 2023 19:17:42 +0000 (22:17 +0300)]
vsock/virtio: initialize the_virtio_vsock before using VQs
Once VQs are filled with empty buffers and we kick the host, it can send
connection requests. If the_virtio_vsock is not initialized before,
replies are silently dropped and do not reach the host.
virtio_transport_send_pkt() can queue packets once the_virtio_vsock is
set, but they won't be processed until vsock->tx_run is set to true. We
queue vsock->send_pkt_work when initialization finishes to send those
packets queued earlier.
Jakub Kicinski [Wed, 25 Oct 2023 19:23:36 +0000 (12:23 -0700)]
Merge branch 'mptcp-features-and-fixes-for-v6-7'
Mat Martineau says:
====================
mptcp: Features and fixes for v6.7
Patch 1 adds a configurable timeout for the MPTCP connection when all
subflows are closed, to support break-before-make use cases.
Patch 2 is a fix for a 1-byte error in rx data counters with MPTCP
fastopen connections.
Patch 3 is a minor code cleanup.
Patches 4 & 5 add handling of rcvlowat for MPTCP sockets, with a
prerequisite patch to use a common scaling ratio between TCP and MPTCP.
Patch 6 improves efficiency of memory copying in MPTCP transmit code.
Patch 7 refactors syncing of socket options from the MPTCP socket to
its subflows.
Patches 8 & 9 help the MPTCP packet scheduler perform well by changing
the handling of notsent_lowat in subflows and how available buffer space
is calculated for MPTCP-level sends.
====================
Paolo Abeni [Mon, 23 Oct 2023 20:44:42 +0000 (13:44 -0700)]
mptcp: refactor sndbuf auto-tuning
The MPTCP protocol charges the data enqueued on all the subflows
to the main socket send buffer, while the send buffer auto-tuning
algorithm sets the main socket send buffer size as the max size among
the subflows.
That causes bad performance when at least one subflow is sndbuf
limited, e.g. due to very high latency, as the MPTCP scheduler can't
even fill such a buffer.
Change the send-buffer auto-tuning algorithm to compute the main socket
send buffer size as the sum of all the subflows' buffer sizes.
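The difference between the two sizing policies can be sketched as below. The
function names are hypothetical and the sketch ignores any limits or locking
the real auto-tuning code applies; it only contrasts "max" with "sum".

```c
#include <stddef.h>

/* Policy described as the old one: the largest subflow buffer wins. */
static unsigned long demo_sndbuf_max(const unsigned long *subflow, size_t n)
{
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (subflow[i] > best)
			best = subflow[i];
	return best;
}

/* Policy described as the new one: add up every subflow buffer. */
static unsigned long demo_sndbuf_sum(const unsigned long *subflow, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += subflow[i];
	return total;
}
```

Because data queued on every subflow is charged to the main socket, sizing
the main buffer as the sum leaves room to fill each subflow's own buffer.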
Paolo Abeni [Mon, 23 Oct 2023 20:44:40 +0000 (13:44 -0700)]
mptcp: consolidate sockopt synchronization
Move the socket option synchronization for active subflows
to subflow creation time. This allows removing the now unused
unlocked variant of such helper.
While at it, clean up the mptcp_subflow_create_socket()
error path a bit.
Paolo Abeni [Mon, 23 Oct 2023 20:44:38 +0000 (13:44 -0700)]
mptcp: give rcvlowat some love
The MPTCP protocol allows setting sk_rcvlowat, but the value there
is currently ignored.
Additionally, the default subflows sk_rcvlowat basically disables per
subflow delayed ack: the MPTCP protocol moves the incoming data from the
subflows into the msk socket as soon as the TCP stack invokes the subflow
data_ready callback. Later, when __tcp_ack_snd_check() takes action,
the subflow-level copied_seq matches rcv_nxt, and that mandates an
immediate ack.
Let the mptcp receive path be aware of such threshold, explicitly tracking
the amount of data available to be read and checking vs sk_rcvlowat in
mptcp_poll() and before waking-up readers.
Additionally implement the set_rcvlowat() callback, to properly handle
the rcvbuf auto-tuning on sk_rcvlowat changes.
Finally to properly handle delayed ack, force the subflow level threshold
to 0 and instead explicitly ask for an immediate ack when the msk level
threshold is not reached.
Paolo Abeni [Mon, 23 Oct 2023 20:44:34 +0000 (13:44 -0700)]
mptcp: add a new sysctl for make after break timeout
The MPTCP protocol allows sockets with no alive subflows to stay
in ESTABLISHED status for a user-defined timeout, to allow for
later subflows creation.
Currently such timeout is constant - TCP_TIMEWAIT_LEN. Let
user-space configure it via a newly added sysctl, to better cope
with busy servers and simplify (and speed up) the relevant
pktdrill tests.
Note that the new knob does not apply to orphaned MPTCP sockets
waiting for the data_fin handshake completion: they always wait
TCP_TIMEWAIT_LEN.
Kalle Valo [Mon, 23 Oct 2023 16:41:20 +0000 (19:41 +0300)]
Revert "wifi: ath11k: call ath11k_mac_fils_discovery() without condition"
This reverts commit e149353e6562f3e3246f75dfc4cca6a0cc5b4efc. The commit caused
QCA6390 hw2.0 firmware WLAN.HST.1.0.1-05266-QCAHSTSWPLZ_V2_TO_X86-1 to crash
during disconnect:
Jeff Johnson [Thu, 19 Oct 2023 16:57:50 +0000 (09:57 -0700)]
wifi: ath12k: Introduce and use ath12k_sta_to_arsta()
Currently, the logic to return an ath12k_sta pointer, given an
ieee80211_sta pointer, uses typecasting throughout the driver. In
general, conversion functions are preferable to typecasting since
using a conversion function allows the compiler to validate the types
of both the input and output parameters.
ath12k already defines a conversion function ath12k_vif_to_arvif() for
a similar conversion. So introduce ath12k_sta_to_arsta() for this use
case, and convert all of the existing typecasting to use this
function.
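The benefit of a conversion function over an open-coded cast can be sketched
as follows. The structures below are hypothetical stand-ins, not the real
ieee80211_sta or ath12k_sta layouts, and the helper only mirrors the naming
idea of ath12k_sta_to_arsta().

```c
/* Hypothetical stand-in for the driver-private station data. */
struct demo_arsta {
	int link_id;
};

/* Hypothetical stand-in for the generic station structure. */
struct demo_sta {
	int aid;
	void *drv_priv;	/* points at driver-private data */
};

/*
 * Conversion helper: the compiler checks that the argument really is
 * a struct demo_sta * and that the result is used as a
 * struct demo_arsta *. An open-coded cast checks neither.
 */
static inline struct demo_arsta *demo_sta_to_arsta(struct demo_sta *sta)
{
	return sta->drv_priv;
}
```

Passing any other pointer type to demo_sta_to_arsta() produces a compiler
diagnostic, whereas a cast such as (struct demo_arsta *)ptr compiles silently
for any pointer.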
Johan Hovold [Thu, 19 Oct 2023 11:36:50 +0000 (13:36 +0200)]
wifi: ath12k: fix htt mlo-offset event locking
The ath12k active pdevs are protected by RCU but the htt mlo-offset
event handling code calling ath12k_mac_get_ar_by_pdev_id() was not
marked as a read-side critical section.
Mark the code in question as an RCU read-side critical section to avoid
any potential use-after-free issues.
Johan Hovold [Thu, 19 Oct 2023 11:36:49 +0000 (13:36 +0200)]
wifi: ath12k: fix dfs-radar and temperature event locking
The ath12k active pdevs are protected by RCU but the DFS-radar and
temperature event handling code calling ath12k_mac_get_ar_by_pdev_id()
was not marked as a read-side critical section.
Mark the code in question as RCU read-side critical sections to avoid
any potential use-after-free issues.
Note that the temperature event handler looks like a placeholder
currently but would still trigger an RCU lockdep splat.
Johan Hovold [Thu, 19 Oct 2023 15:53:42 +0000 (17:53 +0200)]
wifi: ath11k: fix gtk offload status event locking
The ath11k active pdevs are protected by RCU but the gtk offload status
event handling code calling ath11k_mac_get_arvif_by_vdev_id() was not
marked as a read-side critical section.
Mark the code in question as an RCU read-side critical section to avoid
any potential use-after-free issues.
Johan Hovold [Thu, 19 Oct 2023 11:25:21 +0000 (13:25 +0200)]
wifi: ath11k: fix htt pktlog locking
The ath11k active pdevs are protected by RCU but the htt pktlog handling
code calling ath11k_mac_get_ar_by_pdev_id() was not marked as a
read-side critical section.
Mark the code in question as an RCU read-side critical section to avoid
any potential use-after-free issues.
Johan Hovold [Thu, 19 Oct 2023 15:31:15 +0000 (17:31 +0200)]
wifi: ath11k: fix dfs radar event locking
The ath11k active pdevs are protected by RCU but the DFS radar event
handling code calling ath11k_mac_get_ar_by_pdev_id() was not marked as a
read-side critical section.
Mark the code in question as an RCU read-side critical section to avoid
any potential use-after-free issues.
Johan Hovold [Thu, 19 Oct 2023 15:31:14 +0000 (17:31 +0200)]
wifi: ath11k: fix temperature event locking
The ath11k active pdevs are protected by RCU but the temperature event
handling code calling ath11k_mac_get_ar_by_pdev_id() was not marked as a
read-side critical section as reported by RCU lockdep:
wifi: ath12k: rename the sc naming convention to ab
In PCI and HAL interface layer module, the identifier sc is used
to represent an instance of ath12k_base structure. However,
within ath12k, the convention is to use "ab" to represent an SoC
"base" struct. So change all instances of sc to ab.
wifi: ath12k: rename the wmi_sc naming convention to wmi_ab
In WMI layer module, the identifier wmi_sc is used to represent
an instance of ath12k_wmi_base structure. However, within ath12k,
the convention is to use "ab" to represent an SoC "base" struct.
So change all instances of wmi_sc to wmi_ab.
Anilkumar Kolli [Wed, 18 Oct 2023 08:37:06 +0000 (11:37 +0300)]
wifi: ath11k: add firmware-2.bin support
Firmware IE containers can dynamically provide various information
about what the firmware supports. A container can also embed more than one
image, so updating firmware is easy: the user just needs to update one file in
/lib/firmware/.
The firmware API 2 or higher will use the IE container format; the
current API 1 will not use the new format but it will still be supported
for some time. Firmware API 2 files are named firmware-2.bin
(which contains both amss.bin and m3.bin images) and API 1 files are
amss.bin and m3.bin.
Currently the ath11k PCI driver provides the firmware binary (amss.bin) path to
the MHI driver, which reads the firmware from the filesystem and boots it. Add
provision to read firmware files from the ath11k driver and provide the amss.bin
firmware data and size to MHI using a pointer.
Currently enum ath11k_fw_features is empty; the patches adding features will
add the flags.
With AHB devices there's no amss.bin or m3.bin, so no changes in how AHB
firmware files are used. But AHB devices can use future additions to the
metadata, for example in enum ath11k_fw_features.
Kalle Valo [Wed, 18 Oct 2023 08:37:06 +0000 (11:37 +0300)]
wifi: ath11k: qmi: refactor ath11k_qmi_m3_load()
Simple refactoring to make it easier to add firmware-2.bin support in the
following patch.
Earlier ath11k_qmi_m3_load() supported changing m3.bin contents while ath11k is
running. But that's not going to actually work: m3.bin is supposed to be the
same during the lifetime of ath11k, for example we don't support changing the
firmware capabilities on the fly. Due to this ath11k requests the m3.bin firmware
file first and only then checks m3_mem->vaddr, so we are basically requesting
the firmware file even if it's not needed. Reverse the code so that the m3_mem
buffer is checked first, and only if it doesn't exist is m3.bin requested
from user space.
Checking for m3_mem->size is redundant when m3_mem->vaddr is NULL, as we would
not be able to use the buffer in that case. So remove the check for size.
Vlad Buslov [Tue, 24 Oct 2023 19:58:57 +0000 (21:58 +0200)]
net/sched: act_ct: additional checks for outdated flows
Current nf_flow_is_outdated() implementation considers any flow table flow
whose state has diverged from its underlying CT connection status for teardown,
which can be problematic in the following cases:
- Flow has never been offloaded to hardware in the first place either
because flow table has hardware offload disabled (flag
NF_FLOWTABLE_HW_OFFLOAD is not set) or because it is still pending on 'add'
workqueue to be offloaded for the first time. The former is incorrect, the
latter generates excessive deletions and additions of flows.
- Flow is already pending to be updated on the workqueue. Tearing down such
flows will also generate excessive removals from the flow table, especially
on a highly loaded system where the latency to re-offload a flow via 'add'
workqueue can be quite high.
When considering a flow for teardown as outdated, verify that it is both
offloaded to hardware and doesn't have any pending updates.
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
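The combined condition from the act_ct change above can be sketched as a plain predicate. The flag names below are hypothetical stand-ins chosen for this illustration (they are not the kernel's bit definitions), and the "diverged" test is assumed to mean "CT has seen a reply but the flow is not yet marked established in hardware".

```c
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
enum {
	SK_CT_SEEN_REPLY	= 1u << 0,	/* CT connection has seen a reply */
	SK_CT_HW_OFFLOAD	= 1u << 1,	/* flow was offloaded to hardware */
	SK_FLOW_HW_PENDING	= 1u << 2,	/* an update is queued on the workqueue */
	SK_FLOW_HW_ESTABLISHED	= 1u << 3,	/* hardware already has established state */
};

/*
 * A flow is only treated as outdated when its state diverged from the CT
 * status AND it is really offloaded AND nothing is pending for it.
 */
static bool flow_is_outdated_sketch(unsigned int ct_status,
				    unsigned int flow_flags)
{
	bool diverged = (ct_status & SK_CT_SEEN_REPLY) &&
			!(flow_flags & SK_FLOW_HW_ESTABLISHED);

	return diverged &&
	       (ct_status & SK_CT_HW_OFFLOAD) &&
	       !(flow_flags & SK_FLOW_HW_PENDING);
}
```

With this shape, a flow that was never offloaded, or one that already has an update queued, is left alone instead of being torn down and re-added.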
netfilter: flowtable: GC pushes back packets to classic path
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded
unreplied tuple"), flowtable GC pushes back flows with IPS_SEEN_REPLY
back to classic path in every run, i.e. every second. This is because of
a new check for NF_FLOW_HW_ESTABLISHED which is specific to sched/act_ct.
In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set,
and IPS_SEEN_REPLY is unreliable: users decide when to offload the
flow, so this bit might only be set at a later stage.
Fix it by adding a custom .gc handler that sched/act_ct can use to
deal with its NF_FLOW_HW_ESTABLISHED bit.
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reported-by: Vladimir Smelhaus <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
amd/pds_core: core: No need for Null pointer check before kfree
kfree()/vfree() internally perform a NULL check on the
pointer handed to them and take no action if it is indeed
NULL. Hence there is no need for a pre-check of the memory
pointer before handing it to kfree()/vfree().
Issue reported by ifnullfree.cocci Coccinelle semantic
patch script.
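A userspace analogue of the cleanup, using free() (which the C standard also defines as a no-op for NULL) in place of kfree(). The struct and function names are hypothetical and only illustrate the pattern that the Coccinelle script flags.

```c
#include <stdlib.h>

/* Hypothetical container with an optional heap buffer. */
struct ctx_sketch {
	char *buf;	/* may legitimately be NULL */
};

static void ctx_cleanup_sketch(struct ctx_sketch *c)
{
	/*
	 * No "if (c->buf)" pre-check: free(NULL) does nothing, just as
	 * kfree(NULL) and vfree(NULL) do in the kernel.
	 */
	free(c->buf);
	c->buf = NULL;
}
```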
David S. Miller [Wed, 25 Oct 2023 09:28:00 +0000 (10:28 +0100)]
Merge branch 'mv88e6xxx-dsa-bindings'
Linus Walleij says:
====================
Create a binding for the Marvell MV88E6xxx DSA switches
The Marvell switches are lacking DT bindings.
I need proper schema checking to add LED support to the
Marvell switch. Just how it is, it can't go on like this.
Some Device Tree fixes are included in the series, these
remove the major and most annoying warnings fallout noise:
some warnings remain, and these are of more serious nature,
such as missing phy-mode. They can be applied individually,
or to the networking tree with the rest of the patches.
Thanks to Andrew Lunn, Vladimir Oltean and Russell King
for excellent review and feedback!
---
Changes in v7:
- Fix the elaborate spacing to satisfy yamllint in the
ports/ethernet-ports requirement.
- Link to v6: https://lore.kernel.org/r/20231024-marvell-88e6152-wan-led-v6-0-993ab0949344@linaro.org
Changes in v6:
- Fix ports/ethernet-ports requirement with proper indenting
(hopefully).
- Link to v5: https://lore.kernel.org/r/20231023-marvell-88e6152-wan-led-v5-0-0e82952015a7@linaro.org
Changes in v5:
- Consistently rename switch@n to ethernet-switch@n in all cleanup patches
- Consistently rename ports to ethernet-ports in all cleanup patches
- Consistently rename all port@n to ethernet-port@n in all cleanup patches
- Consistently rename all phy@n to ethernet-phy@n in all cleanup patches
- Restore the nodename on the Turris MOX which has a U-Boot binary using the
nodename as ABI, put in a blurb warning about this so no-one else tries
to change it in the future.
- Drop dsa.yaml direct references where we reference dsa.yaml#/$defs/ethernet-ports
- Replace the conjured MV88E6xxx example by a better one based on imx6qdl
plus strictly named nodes and added reset-gpios for a more complete example,
and another example using the interrupt controller based on
armada-381-netgear-gs110emx.dts
- Bump lineage to 2008 as Vladimir says the code was developed starting 2008.
- Link to v4: https://lore.kernel.org/r/20231018-marvell-88e6152-wan-led-v4-0-3ee0c67383be@linaro.org
Changes in v4:
- Rebase the series on top of Rob's series
"dt-bindings: net: Child node schema cleanups" (or the hex numbered
ports will not work)
- Fix up a whitespacing error corrupting v3...
- Add a new patch making the generic DSA binding require ports or
ethernet-ports in the switch node.
- Drop any corrections of port@a in the patches.
- Drop oneOf in the compatible enum for mv88e6xxx
- Use ethernet-switch, ethernet-ports and ethernet-phy in the examples
- Transclude the dsa.yaml#/$defs/ethernet-ports define for ports
- Move the DTS and binding fixes first, before the actual bindings,
so they apply without (too many) warnings as fallout.
- Drop stray colon in text.
- Drop example port in the mvusb binding.
- Link to v3: https://lore.kernel.org/r/20231016-marvell-88e6152-wan-led-v3-0-38cd449dfb15@linaro.org
Changes in v3:
- Fix up a related mvusb example in a different binding that
the scripts were complaining about.
- Fix up the wording on internal vs external MDIO buses in the
mv88e6xxx binding document.
- Remove pointless label and put the right rev-mii into the
MV88E6060 schema.
- Link to v2: https://lore.kernel.org/r/20231014-marvell-88e6152-wan-led-v2-0-7fca08b68849@linaro.org
Changes in v2:
- Break out a separate Marvell MV88E6060 binding file. I stand corrected.
- Drop the idea to rely on nodename mdio-external for the external
MDIO bus, keep the compatible, drop patch for the driver.
- Fix more Marvell DT mistakes.
- Fix NXP DT mistakes in a separate patch.
- Fix Marvell ARM64 mistakes in a separate patch.
- Link to v1: https://lore.kernel.org/r/20231013-marvell-88e6152-wan-led-v1-0-0712ba99857c@linaro.org
====================
Linus Walleij [Tue, 24 Oct 2023 13:20:32 +0000 (15:20 +0200)]
dt-bindings: marvell: Rewrite MV88E6xxx in schema
This is an attempt to rewrite the Marvell MV88E6xxx switch bindings
in YAML schema.
The current text binding says:
WARNING: This binding is currently unstable. Do not program it into a
FLASH never to be changed again. Once this binding is stable, this
warning will be removed.
Well that never happened before we switched to YAML markup,
we can't have it like this, what about fixing the mess?
Linus Walleij [Tue, 24 Oct 2023 13:20:31 +0000 (15:20 +0200)]
ARM64: dts: marvell: Fix some common switch mistakes
Fix some errors in the Marvell MV88E6xxx switch descriptions:
- The top node had no address size or cells.
- switch0@0 is not OK, should be ethernet-switch@0.
- ports should be ethernet-ports
- port@0 should be ethernet-port@0
- PHYs should be named ethernet-phy@
Linus Walleij [Tue, 24 Oct 2023 13:20:30 +0000 (15:20 +0200)]
ARM: dts: nxp: Fix some common switch mistakes
Fix some errors in the Marvell MV88E6xxx switch descriptions:
- switch0@0 is not OK, should be ethernet-switch@0
- ports should be ethernet-ports
- port should be ethernet-port
- phy should be ethernet-phy
Linus Walleij [Tue, 24 Oct 2023 13:20:29 +0000 (15:20 +0200)]
ARM: dts: marvell: Fix some common switch mistakes
Fix some errors in the Marvell MV88E6xxx switch descriptions:
- The top node had no address size or cells.
- switch0@0 is not OK, should be ethernet-switch@0.
- The ports node should be named ethernet-ports
- The ethernet-ports node should have port@0 etc children, no
plural "ports" in the children.
- Ports should be named ethernet-port@0 etc
- PHYs should be named ethernet-phy@0 etc
This serves as an example of fixes needed for introducing a
schema for the bindings, but the patch can simply be applied.
Linus Walleij [Tue, 24 Oct 2023 13:20:28 +0000 (15:20 +0200)]
dt-bindings: net: mvusb: Fix up DSA example
When adding a proper schema for the Marvell mv88e6xxx switch,
the scripts start complaining about this embedded example:
dtschema/dtc warnings/errors:
net/marvell,mvusb.example.dtb: switch@0: ports: '#address-cells'
is a required property
from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml#
net/marvell,mvusb.example.dtb: switch@0: ports: '#size-cells'
is a required property
from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml#
Fix this up by extending the example with those properties in
the ports node.
While we are at it, rename "ports" to "ethernet-ports" and rename
"switch" to "ethernet-switch" as this is recommended practice.
Linus Walleij [Tue, 24 Oct 2023 13:20:27 +0000 (15:20 +0200)]
dt-bindings: net: dsa: Require ports or ethernet-ports
Bindings using dsa.yaml#/$defs/ethernet-ports specify that
a DSA switch node needs to have a ports or ethernet-ports
subnode, and that is actually required, so add requirements
using oneOf.
Dmitry Antipov [Fri, 20 Oct 2023 04:09:36 +0000 (07:09 +0300)]
wifi: rtw89: cleanup firmware elements parsing
When compiling with clang-18, I've noticed the following:
drivers/net/wireless/realtek/rtw89/fw.c:389:28: warning: cast to smaller
integer type 'enum rtw89_fw_type' from 'const void *' [-Wvoid-pointer-to-enum-cast]
389 | enum rtw89_fw_type type = (enum rtw89_fw_type)data;
| ^~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/wireless/realtek/rtw89/fw.c:569:13: warning: cast to smaller
integer type 'enum rtw89_rf_path' from 'const void *' [-Wvoid-pointer-to-enum-cast]
569 | rf_path = (enum rtw89_rf_path)data;
| ^~~~~~~~~~~~~~~~~~~~~~~~
So avoid brutal everything-to-const-void-and-back casts, introduce
'union rtw89_fw_element_arg' to pass parameters to element handler
callbacks, and adjust all of the related bits accordingly. Compile
tested only.
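The idea of passing a typed union instead of casting enums through 'const void *' can be sketched like this. All names below are hypothetical stand-ins; the actual members of 'union rtw89_fw_element_arg' are not shown in this message and are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical enums standing in for the driver's own types. */
enum fw_type_sketch { FW_TYPE_NORMAL_SK, FW_TYPE_WOWLAN_SK };
enum rf_path_sketch { RF_PATH_A_SK, RF_PATH_B_SK };

/* One typed member per kind of handler argument, no pointer casts. */
union element_arg_sketch {
	size_t offset;
	enum fw_type_sketch fw_type;
	enum rf_path_sketch rf_path;
};

struct element_handler_entry_sketch {
	union element_arg_sketch arg;
	int (*fn)(union element_arg_sketch arg);
};

/* The handler reads the member it expects instead of casting a pointer. */
static int handle_rf_path_sketch(union element_arg_sketch arg)
{
	return arg.rf_path == RF_PATH_B_SK ? 2 : 1;
}

static const struct element_handler_entry_sketch rf_b_entry_sketch = {
	.arg = { .rf_path = RF_PATH_B_SK },
	.fn = handle_rf_path_sketch,
};

static int dispatch_sketch(const struct element_handler_entry_sketch *e)
{
	return e->fn(e->arg);
}
```

Because the table entry initializes a named union member, the compiler no longer sees a pointer-to-enum conversion, which is what -Wvoid-pointer-to-enum-cast complains about.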
Shiji Yang [Thu, 19 Oct 2023 11:58:58 +0000 (19:58 +0800)]
wifi: rt2x00: rework MT7620 PA/LNA RF calibration
1. Move MT7620 PA/LNA calibration code to dedicated functions.
2. For external PA/LNA devices, restore RF and BBP registers before
R-Calibration.
3. Do Rx DCOC calibration again before RXIQ calibration.
4. Add some missing LNA related registers' initialization.
Shiji Yang [Thu, 19 Oct 2023 11:58:57 +0000 (19:58 +0800)]
wifi: rt2x00: rework MT7620 channel config function
1. Move the channel configuration code from rt2800_vco_calibration()
to the rt2800_config_channel().
2. Use MT7620 SoC specific AGC initial LNA value instead of the
RT5592's value.
3. BBP{195,196} pairing write has been replaced with
rt2800_bbp_glrt_write() to reduce redundant code.
1. Do not hard reset the BBP. We can use soft reset instead. This
change helps somewhat with the calibration failure issue.
2. Enable falling back to legacy rate from the HT/RTS rate by
setting the HT_FBK_TO_LEGACY register.
3. Implement MCS rate specific maximum PSDU size. It can improve
the transmission quality under the low RSSI condition.
4. Set BBP_84 register value to 0x19. This is used for extension
channel overlapping IOT.
Oleksij Rempel [Mon, 23 Oct 2023 09:33:38 +0000 (11:33 +0200)]
net: dsa: microchip: ksz9477: add Wake on LAN support
Add WoL support for KSZ9477 family of switches. This code was tested on
KSZ8563 chip.
KSZ9477 family of switches supports multiple PHY events:
- wake on Link Up
- wake on Energy Detect.
Since current UAPI can't differentiate between these PHY events, map all
of them to WAKE_PHY.
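A minimal sketch of the mapping in both directions. SK_WAKE_PHY is a local stand-in for the ethtool WAKE_PHY flag, and the two PME event bits are hypothetical; the real register layout is not described in this message.

```c
#include <stdint.h>

#define SK_WAKE_PHY		(1u << 0)	/* stand-in for ethtool's WAKE_PHY */
#define SK_PME_LINK_UP		(1u << 0)	/* hypothetical "wake on Link Up" bit */
#define SK_PME_ENERGY_DET	(1u << 1)	/* hypothetical "wake on Energy Detect" bit */

/* Report: either hardware event shows up as the single WAKE_PHY option. */
static uint32_t pme_to_wolopts_sketch(uint8_t pme_events)
{
	if (pme_events & (SK_PME_LINK_UP | SK_PME_ENERGY_DET))
		return SK_WAKE_PHY;
	return 0;
}

/* Configure: WAKE_PHY enables both hardware events at once. */
static uint8_t wolopts_to_pme_sketch(uint32_t wolopts)
{
	if (wolopts & SK_WAKE_PHY)
		return SK_PME_LINK_UP | SK_PME_ENERGY_DET;
	return 0;
}
```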
Oleksij Rempel [Mon, 23 Oct 2023 09:33:37 +0000 (11:33 +0200)]
net: dsa: microchip: use wakeup-source DT property to enable PME output
KSZ switches with WoL support signal wake events over the PME pin. If this
pin is attached to some external PMIC or System Controller that can't be
described as a GPIO, the only way to describe it in the devicetree is to
use the wakeup-source property. So, add support for this property and enable
PME switch output if this property is present.
Justin Stitt [Mon, 23 Oct 2023 19:39:39 +0000 (19:39 +0000)]
s390/qeth: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
We expect new_entry->dbf_name to be NUL-terminated based on its use with
strcmp():
| if (strcmp(entry->dbf_name, name) == 0) {
Moreover, NUL-padding is not required as new_entry is kzalloc'd just
before this assignment:
| new_entry = kzalloc(sizeof(struct qeth_dbf_entry), GFP_KERNEL);
... rendering any future NUL-byte assignments (like the ones strncpy()
does) redundant.
Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.
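To illustrate the semantics relied on above, here is a userspace sketch of a strscpy-like helper. It is not the kernel implementation: strscpy_sketch() is a hypothetical name, and it returns -1 on truncation where the kernel helper returns -E2BIG.

```c
#include <stddef.h>

/*
 * Copy src into dst, always NUL-terminating and never NUL-padding.
 * Returns the number of bytes copied (excluding the NUL), or -1 if
 * src did not fit and was truncated.
 */
static long strscpy_sketch(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';	/* termination is guaranteed, unlike strncpy() */

	return src[i] == '\0' ? (long)i : -1;
}
```

With a 4-byte destination, copying "abcdef" leaves "abc" plus a terminator and reports truncation, whereas strncpy() with the same arguments would fill all four bytes and leave the buffer unterminated.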
Justin Stitt [Mon, 23 Oct 2023 19:35:07 +0000 (19:35 +0000)]
s390/ctcm: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
We expect chid to be NUL-terminated based on its use with format
strings:
Moreover, NUL-padding is not required as it is _only_ used in this one
instance with a format string.
Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.
We can also drop the +1 from chid's declaration as we no longer need to
be cautious about leaving a spot for a NUL-byte. Let's use the more
idiomatic strscpy usage of (dest, src, sizeof(dest)) as this more
closely ties the destination buffer to the length.
Lorenzo Bianconi [Mon, 23 Oct 2023 22:00:19 +0000 (00:00 +0200)]
net: ethernet: mtk_wed: fix firmware loading for MT7986 SoC
The WED mcu firmware does not contain all the memory regions defined in
the dts reserved_memory node (e.g. MT7986 WED firmware does not contain
cpu-boot region).
Reverse the mtk_wed_mcu_run_firmware() logic to check all the fw
sections are defined in the dts reserved_memory node.
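The reversed check can be sketched as "for every firmware section, find a matching reserved region" rather than "for every reserved region, find a firmware section". The struct and function below are hypothetical, and matching on start address plus size is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a memory region or firmware section. */
struct mem_region_sketch {
	uint32_t addr;
	uint32_t size;
};

/*
 * Iterate over the firmware sections and require each one to be backed by
 * a reserved-memory region. Regions without a firmware section (such as a
 * cpu-boot region the firmware does not carry) are simply not visited.
 */
static bool fw_sections_covered_sketch(const struct mem_region_sketch *fw,
				       size_t n_fw,
				       const struct mem_region_sketch *dts,
				       size_t n_dts)
{
	for (size_t i = 0; i < n_fw; i++) {
		bool found = false;

		for (size_t j = 0; j < n_dts; j++) {
			if (fw[i].addr == dts[j].addr &&
			    fw[i].size <= dts[j].size) {
				found = true;
				break;
			}
		}
		if (!found)
			return false;
	}
	return true;
}
```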
====================
net: ethernet: renesas: infrastructure preparations for upcoming driver
Before we upstream a new driver, Niklas and I thought that a few
cleanups for Kconfig/Makefile will help readability and maintainability.
Here they are, looking forward to comments.
====================
Wolfram Sang [Sun, 22 Oct 2023 20:53:16 +0000 (22:53 +0200)]
net: ethernet: renesas: drop SoC names in Kconfig
Mentioning SoCs in Kconfig descriptions tends to get stale (e.g. RAVB is
missing RZV2M) or imprecise (e.g. SH_ETH is not available on all
R8A779x). Drop them instead of providing vague information. Improve the
file description a tad while here.
Wolfram Sang [Sun, 22 Oct 2023 20:53:15 +0000 (22:53 +0200)]
net: ethernet: renesas: group entries in Makefile
A new Renesas driver shall be added soon. Prepare the Makefile by
grouping the specific objects to the Kconfig symbol for better
readability. Improve the file description a tad while here.
Martin KaFai Lau [Tue, 24 Oct 2023 23:05:02 +0000 (16:05 -0700)]
Merge branch 'Add bpf programmable net device'
Daniel Borkmann says:
====================
This work adds a BPF programmable device which can operate in L3 or L2
mode where the BPF program is part of the xmit routine. Its program
management is done via bpf_mprog and it comes with BPF link support.
For details see patch 1 and following. Thanks!
v3 -> v4:
- Moved netkit_release_all() into ndo_uninit (Stan)
- Two small commit msg corrections (Toke)
- Added Acked/Reviewed-by
v2 -> v3:
- Remove setting dev->min_mtu to ETH_MIN_MTU (Andrew)
- Do not populate ethtool info->version (Andrew)
- Populate netdev private data before register_netdevice (Andrew)
- Use strscpy for ifname template (Jakub)
- Use GFP_KERNEL_ACCOUNT for link kzalloc (Jakub)
- Carry and dump link attach type for bpftool (Toke)
v1 -> v2:
- Rename from meta (Toke, Andrii, Alexei)
- Reuse skb_scrub_packet (Stan)
- Remove IFF_META and use netdev_ops (Toke)
- Add comment to multicast handler (Toke)
- Remove silly version info (Toke)
- Fix attach_type_name (Quentin)
- Rework libbpf link attach api to be similar
as tcx (Andrii)
- Move flags last for bpf_netkit_opts (Andrii)
- Rebased to bpf_mprog query api changes
- Folded link support patch into main one
====================
Daniel Borkmann [Tue, 24 Oct 2023 21:49:03 +0000 (23:49 +0200)]
selftests/bpf: Add netlink helper library
Add a minimal netlink helper library for the BPF selftests. This has been
taken and cut down and cleaned up from iproute2. This covers basics such
as netdevice creation which we need for BPF selftests / BPF CI given
iproute2 package cannot cover it yet.
Stanislav Fomichev suggested that this could be replaced in future by ynl
tool generated C code once it has RTNL support to create devices. Once we
get to this point the BPF CI would also need to add libmnl. If no further
extensions are needed, a second option could be that we remove this code
again once iproute2 package has support.
Daniel Borkmann [Tue, 24 Oct 2023 21:49:02 +0000 (23:49 +0200)]
bpftool: Extend net dump with netkit progs
Add support to dump BPF programs on netkit via bpftool. This includes both
the BPF link and attach ops programs. Dumped information contains the attach
location, function entry name, program ID and link ID when applicable.
Daniel Borkmann [Tue, 24 Oct 2023 21:49:01 +0000 (23:49 +0200)]
bpftool: Implement link show support for netkit
Add support to dump netkit link information to bpftool in similar way as
we have for XDP. The netkit link info only exposes the ifindex and the
attach_type.
Below shows an example link dump output, and a cgroup link is included for
comparison, too:
The struct bpf_netkit_opts is done in similar way as struct bpf_tcx_opts
for supporting bpf_mprog control parameters. The attach location for the
primary and peer device is derived from the program section "netkit/primary"
and "netkit/peer", respectively.
Daniel Borkmann [Tue, 24 Oct 2023 21:48:59 +0000 (23:48 +0200)]
tools: Sync if_link uapi header
Sync if_link uapi header to the latest version as we need the refresher
in tooling for netkit device. Given it's been a while since the last sync
and the diff is fairly big, it has been done as its own commit.
Daniel Borkmann [Tue, 24 Oct 2023 21:48:58 +0000 (23:48 +0200)]
netkit, bpf: Add bpf programmable net device
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the driver's xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in case of Pod egress traffic, this allows moving
BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms. For example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
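The default-policy behaviour can be sketched as follows. This is a simplified userspace model, not the driver's xmit path: the verdict values, the function-pointer "programs" and the choice to treat "every program deferred" as pass are all assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical verdicts for this sketch. */
enum verdict_sketch {
	SK_NEXT = -1,	/* defer to the next attached program */
	SK_PASS = 0,
	SK_DROP = 2,
};

typedef int (*prog_fn_sketch)(const void *pkt);

/* Two sample "programs" used to exercise the dispatcher. */
static int prog_next_sketch(const void *pkt)
{
	(void)pkt;
	return SK_NEXT;
}

static int prog_drop_sketch(const void *pkt)
{
	(void)pkt;
	return SK_DROP;
}

/*
 * With no program attached the device-wide default policy decides.
 * Otherwise the first program that returns something other than
 * SK_NEXT determines the verdict.
 */
static int run_progs_sketch(const prog_fn_sketch *progs, size_t n,
			    const void *pkt, int default_policy)
{
	if (n == 0)
		return default_policy;

	for (size_t i = 0; i < n; i++) {
		int verdict = progs[i](pkt);

		if (verdict != SK_NEXT)
			return verdict;
	}
	return SK_PASS;	/* assumption of this sketch: all programs deferred */
}
```

A peer created with a default of 'drop' therefore lets nothing through until a program is attached, which matches the bring-up use case described for Pods later in this message.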
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Replace ifconfig with the ip command; on a system where ifconfig is
not available this script will not work correctly.
Test result with this patchset:
sudo make TARGETS="net" kselftest
....
TAP version 13
1..1
timeout set to 1500
selftests: net: route_localnet.sh
run arp_announce test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_announce = 2
net.ipv4.conf.veth1.arp_announce = 2
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4073ms
rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
ok
run arp_ignore test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_ignore = 3
net.ipv4.conf.veth1.arp_ignore = 3
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4092ms
rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
ok
ok 1 selftests: net: route_localnet.sh
...
Jakub Kicinski [Tue, 24 Oct 2023 20:10:53 +0000 (13:10 -0700)]
Merge tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Three more fixes:
- don't drop all unprotected public action frames since
some don't have a protected dual
- fix pointer confusion in scanning code
- fix warning in some connections with multiple links
* tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
====================