Linus Torvalds [Thu, 17 Feb 2022 19:33:59 +0000 (11:33 -0800)]
Merge tag 'net-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless and netfilter.
Current release - regressions:
- dsa: lantiq_gswip: fix use after free in gswip_remove()
- smc: avoid overwriting the copies of clcsock callback functions
Current release - new code bugs:
- iwlwifi:
- fix use-after-free when no FW is present
- mei: fix the pskb_may_pull check in ipv4
- mei: retry mapping the shared area
- mvm: don't feed the hardware RFKILL into iwlmei
Previous releases - regressions:
- ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()
- tipc: fix wrong publisher node address in link publications
- iwlwifi: mvm: don't send SAR GEO command for 3160 devices, avoid FW
assertion
- bgmac: make idm and nicpm resource optional again
- atl1c: fix tx timeout after link flap
Previous releases - always broken:
- vsock: remove vsock from connected table when connect is
interrupted by a signal
- ping: change destination interface checks to match raw sockets
- crypto: af_alg - get rid of alg_memory_allocated to avoid confusing
semantics (and null-deref) after SO_RESERVE_MEM was added
- ipv6: make exclusive flowlabel checks per-netns
- bonding: force carrier update when releasing slave
- sched: limit TC_ACT_REPEAT loops
- bridge: multicast: notify switchdev driver whenever MC processing
gets disabled because of max entries reached
- wifi: brcmfmac: fix crash in brcm_alt_fw_path when WLAN not found
- iwlwifi: fix locking when "HW not ready"
- phy: mediatek: remove PHY mode check on MT7531
- dsa: mv88e6xxx: flush switchdev FDB workqueue before removing VLAN
- dsa: lan9303:
- fix polarity of reset during probe
- fix accelerated VLAN handling"
* tag 'net-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
bonding: force carrier update when releasing slave
nfp: flower: netdev offload check for ip6gretap
ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt
ipv4: fix data races in fib_alias_hw_flags_set
net: dsa: lan9303: add VLAN IDs to master device
net: dsa: lan9303: handle hwaccel VLAN tags
vsock: remove vsock from connected table when connect is interrupted by a signal
Revert "net: ethernet: bgmac: Use devm_platform_ioremap_resource_byname"
ping: fix the dif and sdif check in ping_lookup
net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990
net: sched: limit TC_ACT_REPEAT loops
tipc: fix wrong notification node addresses
net: dsa: lantiq_gswip: fix use after free in gswip_remove()
ipv6: per-netns exclusive flowlabel checks
net: bridge: multicast: notify switchdev driver whenever MC processing gets disabled
CDC-NCM: avoid overflow in sanity checking
mctp: fix use after free
net: mscc: ocelot: fix use-after-free in ocelot_vlan_del()
bonding: fix data-races around agg_select_timer
dpaa2-eth: Initialize mutex used in one step timestamping path
...
Zhang Changzhong [Wed, 16 Feb 2022 14:18:08 +0000 (22:18 +0800)]
bonding: force carrier update when releasing slave
In __bond_release_one(), bond_set_carrier() is only called when bond
device has no slave. Therefore, if we remove the up slave from a master
with two slaves and keep the down slave, the master will remain up.
Fix this by moving bond_set_carrier() out of if (!bond_has_slaves(bond))
statement.
Commit b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl
to its own file") fixed a regression, however it failed to add a
kmemleak_not_leak().
- Fix perf_cpu_map__for_each_cpu macro in libperf, providing access to
the CPU iterator
- Sync linux/perf_event.h UAPI with the kernel sources
- Update Jiri Olsa's email address in MAINTAINERS
* tag 'perf-tools-fixes-for-v5.17-2022-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf bpf: Defer freeing string after possible strlen() on it
perf test: Fix arm64 perf_event_attr tests wrt --call-graph initialization
libsubcmd: Fix use-after-free for realloc(..., 0)
libperf: Fix perf_cpu_map__for_each_cpu macro
perf cs-etm: Fix corrupt inject files when only last branch option is enabled
perf cs-etm: No-op refactor of synth opt usage
libperf: Fix 32-bit build for tests uint64_t printf
tools headers UAPI: Sync linux/perf_event.h with the kernel sources
perf trace: Avoid early exit due SIGCHLD from non-workload processes
MAINTAINERS: Update Jiri's email address
Danie du Toit [Thu, 17 Feb 2022 12:48:20 +0000 (14:48 +0200)]
nfp: flower: netdev offload check for ip6gretap
IPv6 GRE tunnels are not being offloaded, this is caused by a missing
netdev offload check. The functionality of IPv6 GRE tunnel offloading
was previously added but this check was not included. Adding the
ip6gretap check allows IPv6 GRE tunnels to be offloaded correctly.
Eric Dumazet [Wed, 16 Feb 2022 17:32:17 +0000 (09:32 -0800)]
ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt
Because fib6_info_hw_flags_set() is called without any synchronization,
all accesses to gi6->offload, fi->trap and fi->offload_failed
need some basic protection like READ_ONCE()/WRITE_ONCE().
BUG: KCSAN: data-race in fib6_info_hw_flags_set / fib6_purge_rt
write to 0xffff8881087d5886 of 1 bytes by task 1912 on cpu 1:
fib6_info_hw_flags_set+0x155/0x3b0 net/ipv6/route.c:6230
nsim_fib6_rt_hw_flags_set drivers/net/netdevsim/fib.c:668 [inline]
nsim_fib6_rt_add drivers/net/netdevsim/fib.c:691 [inline]
nsim_fib6_rt_insert drivers/net/netdevsim/fib.c:756 [inline]
nsim_fib6_event drivers/net/netdevsim/fib.c:853 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:886 [inline]
nsim_fib_event_work+0x284f/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
worker_thread+0x616/0xa70 kernel/workqueue.c:2454
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x22 -> 0x2a
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 1912 Comm: kworker/1:3 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work
Eric Dumazet [Wed, 16 Feb 2022 17:32:16 +0000 (09:32 -0800)]
ipv4: fix data races in fib_alias_hw_flags_set
fib_alias_hw_flags_set() can be used by concurrent threads,
and is only RCU protected.
We need to annotate accesses to following fields of struct fib_alias:
offload, trap, offload_failed
Because of READ_ONCE()WRITE_ONCE() limitations, make these
field u8.
BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set
read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
process_scheduled_works kernel/workqueue.c:2370 [inline]
worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
kthread+0x1bf/0x1e0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30
write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
process_scheduled_works kernel/workqueue.c:2370 [inline]
worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
kthread+0x1bf/0x1e0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30
value changed: 0x00 -> 0x02
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e82623-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_fib_event_work
Mans Rullgard [Wed, 16 Feb 2022 20:48:18 +0000 (20:48 +0000)]
net: dsa: lan9303: add VLAN IDs to master device
If the master device does VLAN filtering, the IDs used by the switch
must be added for any frames to be received. Do this in the
port_enable() function, and remove them in port_disable().
Mans Rullgard [Wed, 16 Feb 2022 12:46:34 +0000 (12:46 +0000)]
net: dsa: lan9303: handle hwaccel VLAN tags
Check for a hwaccel VLAN tag on rx and use it if present. Otherwise,
use __skb_vlan_pop() like the other tag parsers do. This fixes the case
where the VLAN tag has already been consumed by the master.
Linus Torvalds [Thu, 17 Feb 2022 16:57:47 +0000 (08:57 -0800)]
mm: don't try to NUMA-migrate COW pages that have other uses
Oded Gabbay reports that enabling NUMA balancing causes corruption with
his Gaudi accelerator test load:
"All the details are in the bug, but the bottom line is that somehow,
this patch causes corruption when the numa balancing feature is
enabled AND we don't use process affinity AND we use GUP to pin pages
so our accelerator can DMA to/from system memory.
Either disabling numa balancing, using process affinity to bind to
specific numa-node or reverting this patch causes the bug to
disappear"
and Oded bisected the issue to commit 09854ba94c6a ("mm: do_wp_page()
simplification").
Now, the NUMA balancing shouldn't actually be changing the writability
of a page, and as such shouldn't matter for COW. But it appears it
does. Suspicious.
However, regardless of that, the condition for enabling NUMA faults in
change_pte_range() is nonsensical. It uses "page_mapcount(page)" to
decide if a COW page should be NUMA-protected or not, and that makes
absolutely no sense.
The number of mappings a page has is irrelevant: not only does GUP get a
reference to a page as in Oded's case, but the other mappings migth be
paged out and the only reference to them would be in the page count.
Since we should never try to NUMA-balance a page that we can't move
anyway due to other references, just fix the code to use 'page_count()'.
Oded confirms that that fixes his issue.
Now, this does imply that something in NUMA balancing ends up changing
page protections (other than the obvious one of making the page
inaccessible to get the NUMA faulting information). Otherwise the COW
simplification wouldn't matter - since doing the GUP on the page would
make sure it's writable.
The cause of that permission change would be good to figure out too,
since it clearly results in spurious COW events - but fixing the
nonsensical test that just happened to work before is obviously the
CorrectThing(tm) to do regardless.
Seth Forshee [Thu, 17 Feb 2022 14:13:12 +0000 (08:13 -0600)]
vsock: remove vsock from connected table when connect is interrupted by a signal
vsock_connect() expects that the socket could already be in the
TCP_ESTABLISHED state when the connecting task wakes up with a signal
pending. If this happens the socket will be in the connected table, and
it is not removed when the socket state is reset. In this situation it's
common for the process to retry connect(), and if the connection is
successful the socket will be added to the connected table a second
time, corrupting the list.
Prevent this by calling vsock_remove_connected() if a signal is received
while waiting for a connection. This is harmless if the socket is not in
the connected table, and if it is in the table then removing it will
prevent list corruption from a double add.
Note for backporting: this patch requires d5afa82c977e ("vsock: correct
removal of socket from the list"), which is in all current stable trees
except 4.9.y.
Since idm_base and nicpm_base are still optional resources not present
on all platforms, this breaks the driver for everything except Northstar
2 (which has both).
The same change was already reverted once with 755f5738ff98 ("net:
broadcom: fix a mistake about ioremap resource").
Jakub Kicinski [Tue, 15 Feb 2022 22:53:10 +0000 (14:53 -0800)]
net: allow out-of-order netdev unregistration
Sprinkle for each loops to allow netdevices to be unregistered
out of order, as their refs are released.
This prevents problems caused by dependencies between netdevs
which want to release references in their ->priv_destructor.
See commit d6ff94afd90b ("vlan: move dev_put into vlan_dev_uninit")
for example.
Eric has removed the only known ordering requirement in
commit c002496babfd ("Merge branch 'ipv6-loopback'")
so let's try this and see if anything explodes...
Xin Long [Wed, 16 Feb 2022 05:20:52 +0000 (00:20 -0500)]
ping: fix the dif and sdif check in ping_lookup
When 'ping' changes to use PING socket instead of RAW socket by:
# sysctl -w net.ipv4.ping_group_range="0 100"
There is another regression caused when matching sk_bound_dev_if
and dif, RAW socket is using inet_iif() while PING socket lookup
is using skb->dev->ifindex, the cmd below fails due to this:
# ip link add dummy0 type dummy
# ip link set dummy0 up
# ip addr add 192.168.111.1/24 dev dummy0
# ping -I dummy0 192.168.111.1 -c1
The issue was also reported on:
https://github.com/iputils/iputils/issues/104
But fixed in iputils in a wrong way by not binding to device when
destination IP is on device, and it will cause some of kselftests
to fail, as Jianlin noticed.
This patch is to use inet(6)_iif and inet(6)_sdif to get dif and
sdif for PING socket, and keep consistent with RAW socket.
David S. Miller [Thu, 17 Feb 2022 14:22:10 +0000 (14:22 +0000)]
Merge branch 'ping6-SOL_IPV6'
Jakub Kicinski says:
====================
net: ping6: support setting basic SOL_IPV6 options via cmsg
Support for IPV6_HOPLIMIT, IPV6_TCLASS, IPV6_DONTFRAG on ICMPv6
sockets and associated tests. I have no immediate plans to
implement IPV6_FLOWINFO and all the extension header stuff.
====================
Jakub Kicinski [Thu, 17 Feb 2022 01:21:18 +0000 (17:21 -0800)]
selftests: net: test IPV6_TCLASS
Test setting IPV6_TCLASS via setsockopt and cmsg
across socket types.
Output without the kernel support (this series):
Case TCLASS ICMP cmsg - packet data returned 1, expected 0
Case TCLASS ICMP cmsg - rejection returned 0, expected 1
Case TCLASS ICMP diff - pass returned 1, expected 0
Case TCLASS ICMP diff - packet data returned 1, expected 0
Case TCLASS ICMP diff - rejection returned 0, expected 1
Jakub Kicinski [Thu, 17 Feb 2022 01:21:17 +0000 (17:21 -0800)]
selftests: net: test IPV6_DONTFRAG
Test setting IPV6_DONTFRAG via setsockopt and cmsg
across socket types.
Output without the kernel support (this series):
Case DONTFRAG ICMP setsock returned 0, expected 1
Case DONTFRAG ICMP cmsg returned 0, expected 1
Case DONTFRAG ICMP both returned 0, expected 1
Case DONTFRAG ICMP diff returned 0, expected 1
FAIL - 4/24 cases failed
Jakub Kicinski [Thu, 17 Feb 2022 01:21:16 +0000 (17:21 -0800)]
net: ping6: support setting basic SOL_IPV6 options via cmsg
Support setting IPV6_HOPLIMIT, IPV6_TCLASS, IPV6_DONTFRAG
during sendmsg via SOL_IPV6 cmsgs.
tclass and dontfrag are init'ed from struct ipv6_pinfo in
ipcm6_init_sk(), while hlimit is inited to -1, so we need
to handle it being populated via cmsg explicitly.
Leave extension headers and flowlabel unimplemented.
Those are slightly more laborious to test and users
seem to primarily care about IPV6_TCLASS.
no switchdev driver makes use of VLAN port objects that lack the
BRIDGE_VLAN_INFO_BRENTRY flag. Notifying them in the first place rather
seems like an omission of commit 9c86ce2c1ae3 ("net: bridge: Notify
about bridge VLANs").
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag") that was just merged, the bridge no
longer notifies switchdev upon creation of these VLANs, so we can remove
the checks from drivers.
====================
Vladimir Oltean [Wed, 16 Feb 2022 16:47:52 +0000 (18:47 +0200)]
net: ti: cpsw: remove guards against !BRIDGE_VLAN_INFO_BRENTRY
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag"), the bridge no longer emits
switchdev notifiers for VLANs that don't have the
BRIDGE_VLAN_INFO_BRENTRY flag, so these checks are dead code.
Remove them.
Vladimir Oltean [Wed, 16 Feb 2022 16:47:51 +0000 (18:47 +0200)]
net: ti: am65-cpsw-nuss: remove guards against !BRIDGE_VLAN_INFO_BRENTRY
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag"), the bridge no longer emits
switchdev notifiers for VLANs that don't have the
BRIDGE_VLAN_INFO_BRENTRY flag, so these checks are dead code.
Remove them.
Vladimir Oltean [Wed, 16 Feb 2022 16:47:50 +0000 (18:47 +0200)]
net: sparx5: remove guards against !BRIDGE_VLAN_INFO_BRENTRY
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag"), the bridge no longer emits
switchdev notifiers for VLANs that don't have the
BRIDGE_VLAN_INFO_BRENTRY flag, so these checks are dead code.
Remove them.
Vladimir Oltean [Wed, 16 Feb 2022 16:47:49 +0000 (18:47 +0200)]
net: lan966x: remove guards against !BRIDGE_VLAN_INFO_BRENTRY
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag"), the bridge no longer emits
switchdev notifiers for VLANs that don't have the
BRIDGE_VLAN_INFO_BRENTRY flag, so these checks are dead code.
Remove them.
Vladimir Oltean [Wed, 16 Feb 2022 16:47:48 +0000 (18:47 +0200)]
mlxsw: spectrum: remove guards against !BRIDGE_VLAN_INFO_BRENTRY
Since commit 3116ad0696dd ("net: bridge: vlan: don't notify to switchdev
master VLANs without BRENTRY flag"), the bridge no longer emits
switchdev notifiers for VLANs that don't have the
BRIDGE_VLAN_INFO_BRENTRY flag, so these checks are dead code.
Remove them.
David S. Miller [Thu, 17 Feb 2022 14:06:51 +0000 (14:06 +0000)]
Merge branch 'ptp-over-udp-dsa'
Vladimir Oltean says:
====================
Support PTP over UDP with the ocelot-8021q DSA tagging protocol
The alternative tag_8021q-based tagger for Ocelot switches, added here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210129010009.3959398[email protected]/
mostly as a minimum viable requirement. That PTP support was mostly
self-contained code that installed some rules to replicate PTP packets
on the CPU queue, in felix_setup_mmio_filtering().
However ocelot-8021q starts to look more interesting for general purpose
usage, so it is now time to reduce the technical debt by integrating the
PTP traps used by Felix for tag_8021q with the rest of the Ocelot driver.
There is further consolidation of traps to be done. The cookies used by
MRP traps overlap with the cookies used for tag_8021q PTP traps, so
those features could not be used at the same time.
====================
Vladimir Oltean [Wed, 16 Feb 2022 14:30:14 +0000 (16:30 +0200)]
net: dsa: tag_ocelot_8021q: calculate TX checksum in software for deferred packets
DSA inherits NETIF_F_CSUM_MASK from master->vlan_features, and the
expectation is that TX checksumming is offloaded and not done in
software.
Normally the DSA master takes care of this, but packets handled by
ocelot_defer_xmit() are a very special exception, because they are
actually injected into the switch through register-based MMIO. So the
DSA master is not involved at all for these packets => no one calculates
the checksum.
This allows PTP over UDP to work using the ocelot-8021q tagging
protocol.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:13 +0000 (16:30 +0200)]
net: dsa: felix: update destinations of existing traps with ocelot-8021q
Historically, the felix DSA driver has installed special traps such that
PTP over L2 works with the ocelot-8021q tagging protocol; commit 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP
timestamping") has the details.
Then the ocelot switch library also gained more comprehensive support
for PTP traps through commit 96ca08c05838 ("net: mscc: ocelot: set up
traps for PTP packets").
Right now, PTP over L2 works using ocelot-8021q via the traps it has set
for itself, but nothing else does. Consolidating the two code blocks
would make ocelot-8021q gain support for PTP over L4 and tc-flower
traps, and at the same time avoid some code and TCAM duplication.
The traps are similar in intent, but different in execution, so some
explanation is required. The traps set up by felix_setup_mmio_filtering()
are VCAP IS1 filters, which have a PAG that chains them to a VCAP IS2
filter, and the IS2 is where the 'trap' action resides. The traps set up
by ocelot_trap_add(), on the other hand, have a single filter, in VCAP
IS2. The reason for chaining VCAP IS1 and IS2 in Felix was to ensure
that the hardcoded traps take precedence and cannot be overridden by the
Ocelot switch library.
So in principle, the PTP traps needed for ocelot-8021q in the Felix
driver can rely on ocelot_trap_add(), but the filters need to be patched
to account for a quirk that LS1028A has: the quirk_no_xtr_irq described
in commit 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP
timestamping"). Live-patching is done by iterating through the trap list
every time we know it has been updated, and transforming a trap into a
redirect + CPU copy if ocelot-8021q is in use.
Making the DSA ocelot-8021q tagger work with the Ocelot traps means we
can eliminate the dedicated OCELOT_VCAP_IS1_TAG_8021Q_PTP_MMIO and
OCELOT_VCAP_IS2_TAG_8021Q_PTP_MMIO cookies. To minimize the patch delta,
OCELOT_VCAP_IS2_MRP_TRAP takes the place of OCELOT_VCAP_IS2_TAG_8021Q_PTP_MMIO
(the alternative would have been to left-shift all cookie numbers by 1).
Vladimir Oltean [Wed, 16 Feb 2022 14:30:12 +0000 (16:30 +0200)]
net: dsa: felix: remove dead code in felix_setup_mmio_filtering()
There has been some controversy related to the sanity check that a CPU
port exists, and commit e8b1d7698038 ("net: dsa: felix: Fix memory leak
in felix_setup_mmio_filtering") even "corrected" an apparent memory leak
as static analysis tools see it.
However, the check is completely dead code, since the earliest point at
which felix_setup_mmio_filtering() can be called is:
felix_pci_probe
-> dsa_register_switch
-> dsa_switch_probe
-> dsa_tree_setup
-> dsa_tree_setup_cpu_ports
-> dsa_tree_setup_default_cpu
-> contains the "DSA: tree %d has no CPU port\n" check
-> dsa_tree_setup_master
-> dsa_master_setup
-> sysfs_create_group(&dev->dev.kobj, &dsa_group);
-> makes tagging_store() callable
-> dsa_tree_change_tag_proto
-> dsa_tree_notify
-> dsa_switch_event
-> dsa_switch_change_tag_proto
-> ds->ops->change_tag_protocol
-> felix_change_tag_protocol
-> felix_set_tag_protocol
-> felix_setup_tag_8021q
-> felix_setup_mmio_filtering
-> breaks at first CPU port
So probing would have failed earlier if there wasn't any CPU port
defined.
To avoid all confusion, delete the dead code and replace it with a
comment.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:11 +0000 (16:30 +0200)]
net: mscc: ocelot: annotate which traps need PTP timestamping
The ocelot switch library does not need this information, but the felix
DSA driver does.
As a reminder, the VSC9959 switch in LS1028A doesn't have an IRQ line
for packet extraction, so to be notified that a PTP packet needs to be
dequeued, it receives that packet also over Ethernet, by setting up a
packet trap. The Felix driver needs to install special kinds of traps
for packets in need of RX timestamps, such that the packets are
replicated both over Ethernet and over the CPU port module.
But the Ocelot switch library sets up more than one trap for PTP event
messages; it also traps PTP general messages, MRP control messages etc.
Those packets don't need PTP timestamps, so there's no reason for the
Felix driver to send them to the CPU port module.
By knowing which traps need PTP timestamps, the Felix driver can
adjust the traps installed using ocelot_trap_add() such that only those
will actually get delivered to the CPU port module.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:10 +0000 (16:30 +0200)]
net: mscc: ocelot: keep traps in a list
When using the ocelot-8021q tagging protocol, the CPU port isn't
configured as an NPI port, but is a regular port. So a "trap to CPU"
operation is actually a "redirect" operation. So DSA needs to set up the
trapping action one way or another, depending on the tagging protocol in
use.
To ease DSA's work of modifying the action, keep all currently installed
traps in a list, so that DSA can live-patch them when the tagging
protocol changes.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:08 +0000 (16:30 +0200)]
net: mscc: ocelot: avoid overlap in VCAP IS2 between PTP and MRP traps
OCELOT_VCAP_IS2_TAG_8021Q_TXVLAN overlaps with OCELOT_VCAP_IS2_MRP_REDIRECT.
To avoid this, make OCELOT_VCAP_IS2_MRP_REDIRECT take the cookie region
from N to 2 * N - 1 (where N is ocelot->num_phys_ports).
To avoid any risk that the singleton (not per port) VCAP IS2 filters
overlap with per-port VCAP IS2 filters, we must ensure that the number
of singleton filters is smaller than the number of physical ports.
This is true right now, but may change in the future as switches with
less ports get supported, or more singleton filters get added. So to be
future-proof, let's move the singleton filters at the end of the range,
where they won't overlap with anything to their right.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:07 +0000 (16:30 +0200)]
net: mscc: ocelot: use a single VCAP filter for all MRP traps
The MRP assist code installs a VCAP IS2 trapping rule for each port, but
since the key and the action is the same, just the ingress port mask
differs, there isn't any need to do this. We can save some space in the
TCAM by using a single filter and adjusting the ingress port mask.
Reuse the ocelot_trap_add() and ocelot_trap_del() functions for this
purpose.
Now that the cookies are no longer per port, we need to change the
allocation scheme such that MRP traps use a fixed number.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:06 +0000 (16:30 +0200)]
net: mscc: ocelot: delete OCELOT_MRP_CPUQ
MRP frames are configured to be trapped to the CPU queue 7, and this
number is reflected in the extraction header. However, the information
isn't used anywhere, so just leave MRP frames to go to CPU queue 0
unless needed otherwise.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:05 +0000 (16:30 +0200)]
net: mscc: ocelot: consolidate cookie allocation for private VCAP rules
Every use case that needed VCAP filters (in order: DSA tag_8021q, MRP,
PTP traps) has hardcoded filter identifiers that worked well enough for
that use case alone. But when two or more of those use cases would be
used together, some of those identifiers would overlap, leading to
breakage.
Add definitions for each cookie and centralize them in ocelot_vcap.h,
such that the overlaps are more obvious.
Vladimir Oltean [Wed, 16 Feb 2022 14:30:04 +0000 (16:30 +0200)]
net: mscc: ocelot: use a consistent cookie for MRP traps
The driver uses an identifier equal to (ocelot->num_phys_ports + port)
for MRP traps installed when the system is in the role of an MRC, and an
identifier equal to (port) otherwise.
Use the same identifier in both cases as a consolidation for the various
cookie values spread throughout the driver.
David S. Miller [Thu, 17 Feb 2022 10:57:21 +0000 (10:57 +0000)]
Merge tag 'mlx5-updates-2022-02-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-02-16
Misc updates for mlx5:
1) Alex Liu Adds support for using xdp->data_meta
2) Aya Levin Adds PTP counters and port time stamp mode for representors and
switchdev mode.
3) Tariq Toukan, Striding RQ simple improvements.
4) Roi Dayan (7): Create multiple attr instances per flow
Some TC actions use post actions for their implementation.
For example CT and sample actions.
Create a new flow attr instance after each multi table action and
create a post action rule for it as a generic parsing step.
Now multi table actions like CT, sample don't require to do it.
When flow has multiple attr instances, the first flow attr is being
offloaded normally and linked to the next attr (post action rule) by
setting an id on reg_c for matching.
Post action rule (rule created from second attr instance) match the
id on reg_c and does rest of the actions.
Example rule with actions CT,goto will be created with 2 attr instances
as following: attr1(CT)->attr2(goto)
====================
perf bpf: Defer freeing string after possible strlen() on it
This was detected by the gcc in Fedora Rawhide's gcc:
50 11.01 fedora:rawhide : FAIL gcc version 12.0.1 20220205 (Red Hat 12.0.1-0) (GCC)
inlined from 'bpf__config_obj' at util/bpf-loader.c:1242:9:
util/bpf-loader.c:1225:34: error: pointer 'map_opt' may be used after 'free' [-Werror=use-after-free]
1225 | *key_scan_pos += strlen(map_opt);
| ^~~~~~~~~~~~~~~
util/bpf-loader.c:1223:9: note: call to 'free' here
1223 | free(map_name);
| ^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
So do the calculations on the pointer before freeing it.
Roi Dayan [Sun, 19 Dec 2021 08:31:18 +0000 (10:31 +0200)]
net/mlx5e: TC, Make post_act parse CT and sample actions
Before this commit post_act can be used for normal rules
and didn't handle special cases like CT and sample.
With this commit post_act rule can also handle the special cases
when needed.
Roi Dayan [Sun, 21 Nov 2021 09:38:53 +0000 (11:38 +0200)]
net/mlx5e: TC, Clean redundant counter flag from tc action parsers
When tc actions being parsed only the last flow attr created needs the
counter flag and the previous flags being reset.
Clean the flag from the tc action parsers.
Roi Dayan [Wed, 1 Dec 2021 11:27:37 +0000 (13:27 +0200)]
net/mlx5e: Use multi table support for CT and sample actions
CT and sample actions use post actions for their implementation.
Flag those actions as multi table actions so the post act infrastructure
will handle the post actions allocation.
Roi Dayan [Sun, 8 Aug 2021 06:38:03 +0000 (09:38 +0300)]
net/mlx5e: Create new flow attr for multi table actions
Some TC actions use post actions for their implementation.
For example CT and sample actions.
Create a new flow attr after each multi table action and
create a post action rule for it.
First flow attr being offloaded normally and linked to the next
attr (post action rule) with setting an id on reg_c.
Post action rules match the id on reg_c and continue to the next one.
Roi Dayan [Thu, 14 Oct 2021 08:31:51 +0000 (11:31 +0300)]
net/mlx5e: Add post act offload/unoffload API
Introduce mlx5e_tc_post_act_offload() and mlx5e_tc_post_act_unoffload()
to be able to unoffload and reoffload existing post action rules handles.
For example in neigh update events, the driver removes and readds rules in
hardware.
Roi Dayan [Sun, 12 Sep 2021 08:43:17 +0000 (11:43 +0300)]
net/mlx5e: Pass actions param to actions_match_supported()
Currently the mlx5_flow object contains a single mlx5_attr instance.
However, multi table actions (e.g. CT) instantiate multiple attr instances.
Currently action_match_supported() reads the actions flag from the
flow's attribute instance. Modify the function to receive the action
flags as a parameter which is set by the calling function and
pass the aggregated actions to actions_match_supported().
Aya Levin [Thu, 20 Jan 2022 15:53:06 +0000 (17:53 +0200)]
net/mlx5e: E-Switch, Add support for tx_port_ts in switchdev mode
When turning on tx_port_ts (private flag) a PTP-SQ is created. Consider
this queue when adding rules matching SQs to VPORTs. Otherwise the
traffic on this queue won't reach the wire.
Tariq Toukan [Wed, 19 Jan 2022 16:35:42 +0000 (18:35 +0200)]
net/mlx5e: RX, Restrict bulk size for small Striding RQs
In RQs of type multi-packet WQE (Striding RQ), each WQE is relatively
large (typically 256KB) but their number is relatively small (8 in
default).
Re-mapping the descriptors' buffers before re-posting them is done via
UMR (User-Mode Memory Registration) operations.
On the one hand, posting UMR WQEs in bulks reduces communication overhead
with the HW and better utilizes its processing units.
On the other hand, delaying the WQE repost operations for a small RQ
(say, of 4 WQEs) might drastically hit its performance, causing packet
drops due to no receive buffer, for high or bursty incoming packets rate.
Here we restrict the bulk size for too small RQs. Effectively, with the current
constants, RQ of size 4 (minimum allowed) would have no bulking, while larger
RQs will continue working with bulks of 2.
Tariq Toukan [Tue, 11 Jan 2022 16:28:12 +0000 (18:28 +0200)]
net/mlx5e: Default to Striding RQ when not conflicting with CQE compression
CQE compression is turned on by default on slow pci systems to help
reduce the load on pci.
In this case, Striding RQ was turned off as CQEs of packets that span
several strides were not compressed, significantly reducing the compression
effectiveness.
This issue does not exist when using the newer mini_cqe format "stride_index".
Hence, allow defaulting to Striding RQ in this case.
Petr Machata [Wed, 16 Feb 2022 14:31:36 +0000 (15:31 +0100)]
net: rtnetlink: rtnl_stats_get(): Emit an extack for unset filter_mask
Both get and dump handlers for RTM_GETSTATS require that a filter_mask, a
mask of which attributes should be emitted in the netlink response, is
unset. rtnl_stats_dump() does include an extack in the bounce,
rtnl_stats_get() however does not. Fix the omission.
Geliang Tang [Wed, 16 Feb 2022 02:11:25 +0000 (18:11 -0800)]
mptcp: drop unused sk in mptcp_get_options
The parameter 'sk' became useless since the code using it was dropped
from mptcp_get_options() in the commit 8d548ea1dd15 ("mptcp: do not set
unconditionally csum_reqd on incoming opt"). Let's drop it.
Yang Li [Wed, 16 Feb 2022 01:45:07 +0000 (09:45 +0800)]
net: Fix an ignored error return from dm9051_get_regs()
The return from the call to dm9051_get_regs() is int, it can be
a negative error code, however this is being assigned to an unsigned
int variable 'ret', so making 'ret' an int.
Eliminate the following coccicheck warning:
./drivers/net/ethernet/davicom/dm9051.c:527:5-8: WARNING: Unsigned
expression compared with zero: ret < 0
Jon Maloy [Wed, 16 Feb 2022 02:00:09 +0000 (21:00 -0500)]
tipc: fix wrong notification node addresses
The previous bug fix had an unfortunate side effect that broke
distribution of binding table entries between nodes. The updated
tipc_sock_addr struct is also used further down in the same
function, and there the old value is still the correct one.
Willem de Bruijn [Tue, 15 Feb 2022 16:00:37 +0000 (11:00 -0500)]
ipv6: per-netns exclusive flowlabel checks
Ipv6 flowlabels historically require a reservation before use.
Optionally in exclusive mode (e.g., user-private).
Commit 59c820b2317f ("ipv6: elide flowlabel check if no exclusive
leases exist") introduced a fastpath that avoids this check when no
exclusive leases exist in the system, and thus any flowlabel use
will be granted.
That allows skipping the control operation to reserve a flowlabel
entirely. Though with a warning if the fast path fails:
This is an optimization. Robust applications still have to revert to
requesting leases if the fast path fails due to an exclusive lease.
Still, this is subtle. Better isolate network namespaces from each
other. Flowlabels are per-netns. Also record per-netns whether
exclusive leases are in use. Then behavior does not change based on
activity in other netns.
Changes
v2
- wrap in IS_ENABLED(CONFIG_IPV6) to avoid breakage if disabled
Vladimir Oltean [Tue, 15 Feb 2022 20:47:22 +0000 (22:47 +0200)]
net: dsa: tag_8021q: only call skb_push/skb_pull around __skb_vlan_pop
__skb_vlan_pop() needs skb->data to point at the mac_header, while
skb_vlan_tag_present() and skb_vlan_tag_get() don't, because they don't
look at skb->data at all.
So we can avoid uselessly moving around skb->data for the case where the
VLAN tag was offloaded by the DSA master.
Whenever bridge driver hits the max capacity of MDBs, it disables
the MC processing (by setting corresponding bridge option), but never
notifies switchdev about such change (the notifiers are called only upon
explicit setting of this option, through the registered netlink interface).
This could lead to situation when Software MDB processing gets disabled,
but this event never gets offloaded to the underlying Hardware.
D. Wythe [Tue, 15 Feb 2022 08:24:50 +0000 (16:24 +0800)]
net/smc: return ETIMEDOUT when smc_connect_clc() timeout
When smc_connect_clc() times out, it will return -EAGAIN(tcp_recvmsg
retuns -EAGAIN while timeout), then this value will passed to the
application, which is quite confusing to the applications, makes
inconsistency with TCP.
From the manual of connect, ETIMEDOUT is more suitable, and this patch
try convert EAGAIN to ETIMEDOUT in that case.
Linus Torvalds [Tue, 15 Feb 2022 23:28:00 +0000 (15:28 -0800)]
tty: n_tty: do not look ahead for EOL character past the end of the buffer
Daniel Gibson reports that the n_tty code gets line termination wrong in
very specific cases:
"If you feed a line with exactly 64 chars + terminating newline, and
directly afterwards (without reading) another line into a pseudo
terminal, the the first read() on the other side will return the 64
char line *without* terminating newline, and the next read() will
return the missing terminating newline AND the complete next line (if
it fits in the buffer)"
and bisected the behavior to commit 3b830a9c34d5 ("tty: convert
tty_ldisc_ops 'read()' function to take a kernel pointer").
Now, digging deeper, it turns out that the behavior isn't exactly new:
what changed in commit 3b830a9c34d5 was that the tty line discipline
.read() function is now passed an intermediate kernel buffer rather than
the final user space buffer.
And that intermediate kernel buffer is 64 bytes in size - thus that
special case with exactly 64 bytes plus terminating newline.
The same problem did exist before, but historically the boundary was not
the 64-byte chunk, but the user-supplied buffer size, which is obviously
generally bigger (and potentially bigger than N_TTY_BUF_SIZE, which
would hide the issue entirely).
The reason is that the n_tty canon_copy_from_read_buf() code would look
ahead for the EOL character one byte further than it would actually
copy. It would then decide that it had found the terminator, and unmark
it as an EOL character - which in turn explains why the next read
wouldn't then be terminated by it.
Now, the reason it did all this in the first place is related to some
historical and pretty obscure EOF behavior, see commit ac8f3bf8832a
("n_tty: Fix poll() after buffer-limited eof push read") and commit 40d5e0905a03 ("n_tty: Fix EOF push handling").
And the reason for the EOL confusion is that we treat EOF as a special
EOL condition, with the EOL character being NUL (aka "__DISABLED_CHAR"
in the kernel sources).
So that EOF look-ahead also affects the normal EOL handling.
This patch just removes the look-ahead that causes problems, because EOL
is much more critical than the historical "EOF in the middle of a line
that coincides with the end of the buffer" handling ever was.
Now, it is possible that we should indeed re-introduce the "look at next
character to see if it's a EOF" behavior, but if so, that should be done
not at the kernel buffer chunk boundary in canon_copy_from_read_buf(),
but at a higher level, when we run out of the user buffer.
In particular, the place to do that would be at the top of
'n_tty_read()', where we check if it's a continuation of a previously
started read, and there is no more buffer space left, we could decide to
just eat the __DISABLED_CHAR at that point.
But that would be a separate patch, because I suspect nobody actually
cares, and I'd like to get a report about it before bothering.
The struct perf_event_attr is initialised differently in Arm64 when
recording in call-graph fp mode, so update the relevant tests, and add
two extra arm64-only tests.
Before:
$ perf test 17 -v
17: Setup struct perf_event_attr
[...]
running './tests/attr/test-record-graph-default'
expected sample_type=295, got 4391
expected sample_regs_user=0, got 1073741824
FAILED './tests/attr/test-record-graph-default' - match failure
test child finished with -1
---- end ----
After:
[...]
running './tests/attr/test-record-graph-default-aarch64'
test limitation 'aarch64'
running './tests/attr/test-record-graph-fp-aarch64'
test limitation 'aarch64'
running './tests/attr/test-record-graph-default'
test limitation '!aarch64'
excluded architecture list ['aarch64']
skipped [aarch64] './tests/attr/test-record-graph-default'
running './tests/attr/test-record-graph-fp'
test limitation '!aarch64'
excluded architecture list ['aarch64']
skipped [aarch64] './tests/attr/test-record-graph-fp'
[...]
Kees Cook [Sun, 13 Feb 2022 18:24:43 +0000 (10:24 -0800)]
libsubcmd: Fix use-after-free for realloc(..., 0)
GCC 12 correctly reports a potential use-after-free condition in the
xrealloc helper. Fix the warning by avoiding an implicit "free(ptr)"
when size == 0:
In file included from help.c:12:
In function 'xrealloc',
inlined from 'add_cmdname' at help.c:24:2: subcmd-util.h:56:23: error: pointer may be used after 'realloc' [-Werror=use-after-free]
56 | ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
subcmd-util.h:52:21: note: call to 'realloc' here
52 | void *ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
subcmd-util.h:58:31: error: pointer may be used after 'realloc' [-Werror=use-after-free]
58 | ret = realloc(ptr, 1);
| ^~~~~~~~~~~~~~~
subcmd-util.h:52:21: note: call to 'realloc' here
52 | void *ret = realloc(ptr, size);
| ^~~~~~~~~~~~~~~~~~
James Clark [Thu, 10 Feb 2022 20:06:20 +0000 (20:06 +0000)]
perf cs-etm: Fix corrupt inject files when only last branch option is enabled
'perf inject' with Coresight data generates files that cannot be opened
when only the last branch option is specified:
perf inject -i perf.data --itrace=l -o inject.data
perf script -i inject.data
0x33faa8 [0x8]: failed to process type: 9 [Bad address]
This is because cs_etm__synth_instruction_sample() is called even when
the sample type for instructions hasn't been setup. Last branch records
are attached to instruction samples so it doesn't make sense to generate
them when --itrace=i isn't specified anyway.
This change disables all calls of cs_etm__synth_instruction_sample()
unless --itrace=i is specified, resulting in a file with no samples if
only --itrace=l is provided, rather than a bad file.
James Clark [Thu, 10 Feb 2022 20:06:19 +0000 (20:06 +0000)]
perf cs-etm: No-op refactor of synth opt usage
sample_branches and sample_instructions are already saved in the
synth_opts struct. Other usages like synth_opts.last_branch don't save a
value, so make this more consistent by always going through synth_opts
and not saving duplicate values.
Rob Herring [Tue, 1 Feb 2022 21:39:03 +0000 (15:39 -0600)]
libperf: Fix 32-bit build for tests uint64_t printf
Commit a7f3713f6bf207e6 ("libperf tests: Add test_stat_multiplexing test")
added printf's of 64-bit ints using %lu which doesn't work on 32-bit
builds:
tests/test-evlist.c:529:29: error: format ‘%lu’ expects argument of type \
‘long unsigned int’, but argument 4 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
Use PRIu64 instead which works on both 32-bit and 64-bit systems.
tools headers UAPI: Sync linux/perf_event.h with the kernel sources
To pick the trivial change in:
ddecd22878601a60 ("perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures")
Just adds a comment.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
====================
Replay and offload host VLAN entries in DSA
v2->v3:
- make the bridge stop notifying switchdev for !BRENTRY VLANs
- create precommit and commit wrappers around __vlan_add_flags().
- special-case the BRENTRY transition from false to true, instead of
treating it as a change of flags and letting drivers figure out that
it really isn't.
- avoid setting *changed unless we know that functions will not error
out later.
- drop "old_flags" from struct switchdev_obj_port_vlan, nobody needs it
now, in v2 only DSA needed it to filter out BRENTRY transitions, that
is now solved cleaner.
- no BRIDGE_VLAN_INFO_BRENTRY flag checks and manipulations in DSA
whatsoever, use the "bool changed" bit as-is after changing what it
means.
- merge dsa_slave_host_vlan_{add,del}() with
dsa_slave_foreign_vlan_{add,del}(), since now they do the same thing,
because the host_vlan functions no longer need to mangle the vlan
BRENTRY flags and bool changed.
v1->v2:
- prune switchdev VLAN additions with no actual change differently
- no longer need to revert struct net_bridge_vlan changes on error from
switchdev
- no longer need to first delete a changed VLAN before readding it
- pass 'bool changed' and 'u16 old_flags' through switchdev_obj_port_vlan
so that DSA can do some additional post-processing with the
BRIDGE_VLAN_INFO_BRENTRY flag
- support VLANs on foreign interfaces
- fix the same -EOPNOTSUPP error in mv88e6xxx, this time on removal, due
to VLAN deletion getting replayed earlier than FDB deletion
The motivation behind these patches is that Rafael reported the
following error with mv88e6xxx when the first switch port joins a
bridge:
mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95 (-EOPNOTSUPP)
The FDB entry that's added is the MAC address of the bridge, in VID 1
(the default_pvid), being replayed as part of br_add_if() -> ... ->
nbp_switchdev_sync_objs().
-EOPNOTSUPP is the mv88e6xxx driver's way of saying that VID 1 doesn't
exist in the VTU, so it can't program the ATU with a FID, something
which it needs.
It appears to be a race, but it isn't, since we only end up installing
VID 1 in the VTU by coincidence. DSA's approximation of programming
VLANs on the CPU port together with the user ports breaks down with
host FDB entries on mv88e6xxx, since that strictly requires the VTU to
contain the VID. But the user may freely add VLANs pointing just towards
the bridge, and FDB entries in those VLANs, and DSA will not be aware of
them, because it only listens for VLANs on user ports.
To create a solution that scales properly to cross-chip setups and
doesn't leak entries behind, some changes in the bridge driver are
required. I believe that these are for the better overall, but I may be
wrong. Namely, the same refcounting procedure that DSA has in place for
host FDB and MDB entries can be replicated for VLANs, except that it's
garbage in, garbage out: the VLAN addition and removal notifications
from switchdev aren't balanced. So the first 2 patches attempt to deal
with that.
This patch set has been superficially tested on a board with 3 mv88e6xxx
switches in a daisy chain and appears to produce the primary desired
effect - the driver no longer returns -EOPNOTSUPP when the first port
joins a bridge, and is successful in performing local termination under
a VLAN-aware bridge.
As an additional side effect, it silences the annoying "p%d: already a
member of VLAN %d\n" warning messages that the mv88e6xxx driver produces
when coupled with systemd-networkd, and a few VLANs are configured.
Furthermore, it advances Florian's idea from a few years back, which
never got merged:
https://lore.kernel.org/lkml/20180624153339[email protected]/
v2 has also been tested on the NXP LS1028A felix switch.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:18 +0000 (19:02 +0200)]
net: dsa: offload bridge port VLANs on foreign interfaces
DSA now explicitly handles VLANs installed with the 'self' flag on the
bridge as host VLANs, instead of just replicating every bridge port VLAN
also on the CPU port and never deleting it, which is what it did before.
Forwarding towards a bridge port VLAN installed on a bridge port foreign
to DSA (separate NIC, Wi-Fi AP) used to work by virtue of the fact that
DSA itself needed to have at least one port in that VLAN (therefore, it
also had the CPU port in said VLAN). However, now that the CPU port may
not be member of all VLANs that user ports are members of, we need to
ensure this isn't the case if software forwarding to a foreign interface
is required.
The solution is to treat bridge port VLANs on standalone interfaces in
the exact same way as host VLANs. From DSA's perspective, there is no
difference between local termination and software forwarding; packets in
that VLAN must reach the CPU in both cases.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:17 +0000 (19:02 +0200)]
net: dsa: add explicit support for host bridge VLANs
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it
does so on user ports. This is good for basic functionality but has
several limitations:
- the VLAN group which must reach the CPU may be radically different
from the VLAN group that must be autonomously forwarded by the switch.
In other words, the admin may want to isolate noisy stations and avoid
traffic from them going to the control processor of the switch, where
it would just waste useless cycles. The bridge already supports
independent control of VLAN groups on bridge ports and on the bridge
itself, and when VLAN-aware, it will drop packets in software anyway
if their VID isn't added as a 'self' entry towards the bridge device.
- Replaying host FDB entries may depend, for some drivers like mv88e6xxx,
on replaying the host VLANs as well. The 2 VLAN groups are
approximately the same in most regular cases, but there are corner
cases when timing matters, and DSA's approximation of replicating
VLANs on shared ports simply does not work.
- If a user makes the bridge (implicitly the CPU port) join a VLAN by
accident, there is no way for the CPU port to isolate itself from that
noisy VLAN except by rebooting the system. This is because for each
VLAN added on a user port, DSA will add it on shared ports too, but
for each VLAN deletion on a user port, it will remain installed on
shared ports, since DSA has no good indication of whether the VLAN is
still in use or not.
Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN
addition and removal events, DSA has a simple and straightforward task
of separating the bridge port VLANs (these have an orig_dev which is a
DSA slave interface, or a LAG interface) from the host VLANs (these have
an orig_dev which is a bridge interface), and to keep a simple reference
count of each VID on each shared port.
Forwarding VLANs must be installed on the bridge ports and on all DSA
ports interconnecting them. We don't have a good view of the exact
topology, so we simply install forwarding VLANs on all DSA ports, which
is what has been done until now.
Host VLANs must be installed primarily on the dedicated CPU port of each
bridge port. More subtly, they must also be installed on upstream-facing
and downstream-facing DSA ports that are connecting the bridge ports and
the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB
entry may be absent from VTU) is still addressed even if that switch is
in a cross-chip setup, and it has no local CPU port.
Therefore:
- user ports contain only bridge port (forwarding) VLANs, and no
refcounting is necessary
- DSA ports contain both forwarding and host VLANs. Refcounting is
necessary among these 2 types.
- CPU ports contain only host VLANs. Refcounting is also necessary.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:16 +0000 (19:02 +0200)]
net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign interfaces
The switchdev_handle_port_obj_add() helper is good for replicating a
port object on the lower interfaces of @dev, if that object was emitted
on a bridge, or on a bridge port that is a LAG.
However, drivers that use this helper limit themselves to a box from
which they can no longer intercept port objects notified on neighbor
ports ("foreign interfaces").
One such driver is DSA, where software bridging with foreign interfaces
such as standalone NICs or Wi-Fi APs is an important use case. There, a
VLAN installed on a neighbor bridge port roughly corresponds to a
forwarding VLAN installed on the DSA switch's CPU port.
To support this use case while also making use of the benefits of the
switchdev_handle_* replication helper for port objects, introduce a new
variant of these functions that crawls through the neighbor ports of
@dev, in search of potentially compatible switchdev ports that are
interested in the event.
The strategy is identical to switchdev_handle_fdb_event_to_device():
if @dev wasn't a switchdev interface, then go one step upper, and
recursively call this function on the bridge that this port belongs to.
At the next recursion step, __switchdev_handle_port_obj_add() will
iterate through the bridge's lower interfaces. Among those, some will be
switchdev interfaces, and one will be the original @dev that we came
from. To prevent infinite recursion, we must suppress reentry into the
original @dev, and just call the @add_cb for the switchdev_interfaces.
It looks like this:
br0
/ | \
/ | \
/ | \
swp0 swp1 eth0
1. __switchdev_handle_port_obj_add(eth0)
-> check_cb(eth0) returns false
-> eth0 has no lower interfaces
-> eth0's bridge is br0
-> switchdev_lower_dev_find(br0, check_cb, foreign_dev_check_cb))
finds br0
2. __switchdev_handle_port_obj_add(br0)
-> check_cb(br0) returns false
-> netdev_for_each_lower_dev
-> check_cb(swp0) returns true, so we don't skip this interface
3. __switchdev_handle_port_obj_add(swp0)
-> check_cb(swp0) returns true, so we call add_cb(swp0)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(swp1) returns true, so we don't skip this interface
4. __switchdev_handle_port_obj_add(swp1)
-> check_cb(swp1) returns true, so we call add_cb(swp1)
(back to netdev_for_each_lower_dev from 2)
-> check_cb(eth0) returns false, so we skip this interface to
avoid infinite recursion
Note: eth0 could have been a LAG, and we don't want to suppress the
recursion through its lowers if those exist, so when check_cb() returns
false, we still call switchdev_lower_dev_find() to estimate whether
there's anything worth a recursion beneath that LAG. Using check_cb()
and foreign_dev_check_cb(), switchdev_lower_dev_find() not only figures
out whether the lowers of the LAG are switchdev, but also whether they
actively offload the LAG or not (whether the LAG is "foreign" to the
switchdev interface or not).
The port_obj_info->orig_dev is preserved across recursive calls, so
switchdev drivers still know on which device was this notification
originally emitted.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:14 +0000 (19:02 +0200)]
net: bridge: switchdev: replay all VLAN groups
The major user of replayed switchdev objects is DSA, and so far it
hasn't needed information about anything other than bridge port VLANs,
so this is all that br_switchdev_vlan_replay() knows to handle.
DSA has managed to get by through replicating every VLAN addition on a
user port such that the same VLAN is also added on all DSA and CPU
ports, but there is a corner case where this does not work.
The mv88e6xxx DSA driver currently prints this error message as soon as
the first port of a switch joins a bridge:
mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95
where a6:ef:77:c8:5f:3d vid 1 is a local FDB entry corresponding to the
bridge MAC address in the default_pvid.
The -EOPNOTSUPP is returned by mv88e6xxx_port_db_load_purge() because it
tries to map VID 1 to a FID (the ATU is indexed by FID not VID), but
fails to do so. This is because ->port_fdb_add() is called before
->port_vlan_add() for VID 1.
and the issue is that at the time of (*), the bridge port isn't in VID 1
(nbp_vlan_init hasn't been called), therefore br_switchdev_vlan_replay()
won't have anything to replay, therefore VID 1 won't be in the VTU by
the time mv88e6xxx_port_fdb_add() is called.
This happens only when the first port of a switch joins. For further
ports, the initial mv88e6xxx_port_vlan_add() is sufficient for VID 1 to
be loaded in the VTU (which is switch-wide, not per port).
The problem is somewhat unique to mv88e6xxx by chance, because most
other drivers offload an FDB entry by VID, so FDBs and VLANs can be
added asynchronously with respect to each other, but addressing the
issue at the bridge layer makes sense, since what mv88e6xxx requires
isn't absurd.
To fix this problem, we need to recognize that it isn't the VLAN group
of the port that we're interested in, but the VLAN group of the bridge
itself (so it isn't a timing issue, but rather insufficient information
being passed from switchdev to drivers).
As mentioned, currently nbp_switchdev_sync_objs() only calls
br_switchdev_vlan_replay() for VLANs corresponding to the port, but the
VLANs corresponding to the bridge itself, for local termination, also
need to be replayed. In this case, VID 1 is not (yet) present in the
port's VLAN group but is present in the bridge's VLAN group.
So to fix this bug, DSA is now obligated to explicitly handle VLANs
pointing towards the bridge in order to "close this race" (which isn't
really a race). As Tobias Waldekranz notices, this also implies that it
must explicitly handle port VLANs on foreign interfaces, something that
worked implicitly before:
https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153[email protected]/#24735260
So in the end, br_switchdev_vlan_replay() must replay all VLANs from all
VLAN groups: all the ports, and the bridge itself.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:13 +0000 (19:02 +0200)]
net: bridge: make nbp_switchdev_unsync_objs() follow reverse order of sync()
There may be switchdev drivers that can add/remove a FDB or MDB entry
only as long as the VLAN it's in has been notified and offloaded first.
The nbp_switchdev_sync_objs() method satisfies this requirement on
addition, but nbp_switchdev_unsync_objs() first deletes VLANs, then
deletes MDBs and FDBs. Reverse the order of the function calls to cater
to this requirement.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:12 +0000 (19:02 +0200)]
net: bridge: switchdev: differentiate new VLANs from changed ones
br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD
event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases:
- a struct net_bridge_vlan got created
- an existing struct net_bridge_vlan was modified
This makes it impossible for switchdev drivers to properly balance
PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to
happen, we must provide a way for drivers to distinguish between a
VLAN with changed flags and a new one.
Annotate struct switchdev_obj_port_vlan with a "bool changed" that
distinguishes the 2 cases above.
Vladimir Oltean [Tue, 15 Feb 2022 17:02:11 +0000 (19:02 +0200)]
net: bridge: vlan: notify switchdev only when something changed
Currently, when a VLAN entry is added multiple times in a row to a
bridge port, nbp_vlan_add() calls br_switchdev_port_vlan_add() each
time, even if the VLAN already exists and nothing about it has changed:
bridge vlan add dev lan12 vid 100 master static
Similarly, when a VLAN is added multiple times in a row to a bridge,
br_vlan_add_existing() doesn't filter at all the calls to
br_switchdev_port_vlan_add():
bridge vlan add dev br0 vid 100 self
This behavior makes driver-level accounting of VLANs impossible, since
it is enough for a single deletion event to remove a VLAN, but the
addition event can be emitted an unlimited number of times.
The cause for this can be identified as follows: we rely on
__vlan_add_flags() to retroactively tell us whether it has changed
anything about the VLAN flags or VLAN group pvid. So we'd first have to
call __vlan_add_flags() before calling br_switchdev_port_vlan_add(), in
order to have access to the "bool *changed" information. But we don't
want to change the event ordering, because we'd have to revert the
struct net_bridge_vlan changes we've made if switchdev returns an error.
So to solve this, we need another function that tells us whether any
change is going to occur in the VLAN or VLAN group, _prior_ to calling
__vlan_add_flags().
Split __vlan_add_flags() into a precommit and a commit stage, and rename
it to __vlan_flags_update(). The precommit stage,
__vlan_flags_would_change(), will determine whether there is any reason
to notify switchdev due to a change of flags (note: the BRENTRY flag
transition from false to true is treated separately: as a new switchdev
entry, because we skipped notifying the master VLAN when it wasn't a
brentry yet, and therefore not as a change of flags).
With this lookahead/precommit function in place, we can avoid notifying
switchdev if nothing changed for the VLAN and VLAN group.