Petr Machata [Thu, 16 Jun 2022 10:42:36 +0000 (13:42 +0300)]
mlxsw: Keep track of number of allocated RIFs
In order to expose number of RIFs as a resource, it is going to be handy
to have the number of currently-allocated RIFs as a single number.
Introduce such.
Amit Cohen [Thu, 16 Jun 2022 10:42:35 +0000 (13:42 +0300)]
mlxsw: Trap ARP packets at layer 3 instead of layer 2
Currently, the traps 'ARP_REQUEST' and 'ARP_RESPONSE' occur at layer 2.
To allow the packets to be flooded, they are configured with the action
'MIRROR_TO_CPU' which means that the CPU receives a replica of the packet.
Today, Spectrum ASICs also support trapping ARP packets at layer 3. This
behavior is better, then the packets can just be trapped and there is no
need to mirror them. An additional motivation is that using the traps at
layer 2, the ARP packets are dropped in the router as they do not have an
IP header, then they are counted as error packets, which might confuse
users.
Add the relevant traps for layer 3 and use them instead of the existing
traps. There is no visible change to user space.
David S. Miller [Fri, 17 Jun 2022 09:11:04 +0000 (10:11 +0100)]
Merge branch 'tcp-mem-pressure-fixes'
Eric Dumazet says:
====================
tcp: final (?) round of mem pressure fixes
While working on prior patch series (e10b02ee5b6c "Merge branch
'net-reduce-tcp_memory_allocated-inflation'"), I found that we
could still have frozen TCP flows under memory pressure.
I thought we had solved this in 2015, but the fix was not complete.
v2: deal with zerocopy tx paths.
====================
Eric Dumazet [Tue, 14 Jun 2022 17:17:34 +0000 (10:17 -0700)]
tcp: fix possible freeze in tx path under memory pressure
Blamed commit only dealt with applications issuing small writes.
Issue here is that we allow to force memory schedule for the sk_buff
allocation, but we have no guarantee that sendmsg() is able to
copy some payload in it.
In this patch, I make sure the socket can use up to tcp_wmem[0] bytes.
For example, if we consider tcp_wmem[0] = 4096 (default on x86),
and initial skb->truesize being 1280, tcp_sendmsg() is able to
copy up to 2816 bytes under memory pressure.
Before this patch a sendmsg() sending more than 2816 bytes
would either block forever (if persistent memory pressure),
or return -EAGAIN.
For bigger MTU networks, it is advised to increase tcp_wmem[0]
to avoid sending too small packets.
Eric Dumazet [Tue, 14 Jun 2022 17:17:34 +0000 (10:17 -0700)]
tcp: fix possible freeze in tx path under memory pressure
Blamed commit only dealt with applications issuing small writes.
Issue here is that we allow to force memory schedule for the sk_buff
allocation, but we have no guarantee that sendmsg() is able to
copy some payload in it.
In this patch, I make sure the socket can use up to tcp_wmem[0] bytes.
For example, if we consider tcp_wmem[0] = 4096 (default on x86),
and initial skb->truesize being 1280, tcp_sendmsg() is able to
copy up to 2816 bytes under memory pressure.
Before this patch a sendmsg() sending more than 2816 bytes
would either block forever (if persistent memory pressure),
or return -EAGAIN.
For bigger MTU networks, it is advised to increase tcp_wmem[0]
to avoid sending too small packets.
Andrii Nakryiko [Fri, 17 Jun 2022 04:55:12 +0000 (21:55 -0700)]
selftests/bpf: Don't force lld on non-x86 architectures
LLVM's lld linker doesn't have a universal architecture support (e.g.,
it definitely doesn't work on s390x), so be safe and force lld for
urandom_read and liburandom_read.so only on x86 architectures.
Merge branch 'New BPF helpers to accelerate synproxy'
Maxim Mikityanskiy says:
====================
The first patch of this series is a documentation fix.
The second patch allows BPF helpers to accept memory regions of fixed
size without doing runtime size checks.
The two next patches add new functionality that allows XDP to
accelerate iptables synproxy.
v1 of this series [1] used to include a patch that exposed conntrack
lookup to BPF using stable helpers. It was superseded by series [2] by
Kumar Kartikeya Dwivedi, which implements this functionality using
unstable helpers.
The third patch adds new helpers to issue and check SYN cookies without
binding to a socket, which is useful in the synproxy scenario.
The fourth patch adds a selftest, which includes an XDP program and a
userspace control application. The XDP program uses socketless SYN
cookie helpers and queries conntrack status instead of socket status.
The userspace control application allows to tune parameters of the XDP
program. This program also serves as a minimal example of usage of the
new functionality.
The last two patches expose the new helpers to TC BPF and extend the
selftest.
The draft of the new functionality was presented on Netdev 0x15 [3].
v2 changes:
Split into two series, submitted bugfixes to bpf, dropped the conntrack
patches, implemented the timestamp cookie in BPF using bpf_loop, dropped
the timestamp cookie patch.
v3 changes:
Moved some patches from bpf to bpf-next, dropped the patch that changed
error codes, split the new helpers into IPv4/IPv6, added verifier
functionality to accept memory regions of fixed size.
v4 changes:
Converted the selftest to the test_progs runner. Replaced some
deprecated functions in xdp_synproxy userspace helper.
v5 changes:
Fixed a bug in the selftest. Added questionable functionality to support
new helpers in TC BPF, added selftests for it.
v6 changes:
Wrap the new helpers themselves into #ifdef CONFIG_SYN_COOKIES, replaced
fclose with pclose and fixed the MSS for IPv6 in the selftest.
v7 changes:
Fixed the off-by-one error in indices, changed the section name to
"xdp", added missing kernel config options to vmtest in CI.
v8 changes:
Properly rebased, dropped the first patch (the same change was applied
by someone else), updated the cover letter.
v9 changes:
Fixed selftests for no_alu32.
v10 changes:
Selftests for s390x were blacklisted due to lack of support of kfunc,
rebased the series, split selftests to separate commits, created
ARG_PTR_TO_FIXED_SIZE_MEM and packed arg_size, addressed the rest of
comments.
selftests/bpf: Add selftests for raw syncookie helpers in TC mode
This commit extends selftests for the new BPF helpers
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} to also test the TC BPF
functionality added in the previous commit.
bpf: Allow the new syncookie helpers to work with SKBs
This commit allows the new BPF helpers to work in SKB context (in TC
BPF programs): bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6}.
Using these helpers in TC BPF programs is not recommended, because it's
unlikely that the BPF program will provide any substantional speedup
compared to regular SYN cookies or synproxy, after the SKB is already
created.
selftests/bpf: Add selftests for raw syncookie helpers
This commit adds selftests for the new BPF helpers:
bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6}.
xdp_synproxy_kern.c is a BPF program that generates SYN cookies on
allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
iptables module.
xdp_synproxy.c is a userspace control application that allows to
configure the following options in runtime: list of allowed ports, MSS,
window scale, TTL.
A selftest is added to prog_tests that leverages the above programs to
test the functionality of the new helpers.
bpf: Add helpers to issue and check SYN cookies in XDP
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
program to generate SYN cookies in response to TCP SYN packets and to
check those cookies upon receiving the first ACK packet (the final
packet of the TCP handshake).
Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a
listening socket on the local machine, which allows to use them together
with synproxy to accelerate SYN cookie generation.
bpf: Allow helpers to accept pointers with a fixed size
Before this commit, the BPF verifier required ARG_PTR_TO_MEM arguments
to be followed by ARG_CONST_SIZE holding the size of the memory region.
The helpers had to check that size in runtime.
There are cases where the size expected by a helper is a compile-time
constant. Checking it in runtime is an unnecessary overhead and waste of
BPF registers.
This commit allows helpers to accept pointers to memory without the
corresponding ARG_CONST_SIZE, given that they define the memory region
size in struct bpf_func_proto and use ARG_PTR_TO_FIXED_SIZE_MEM type.
arg_size is unionized with arg_btf_id to reduce the kernel image size,
and it's valid because they are used by different argument types.
bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
bpf_tcp_gen_syncookie expects the full length of the TCP header (with
all options), and bpf_tcp_check_syncookie accepts lengths bigger than
sizeof(struct tcphdr). Fix the documentation that says these lengths
should be exactly sizeof(struct tcphdr).
While at it, fix a typo in the name of struct ipv6hdr.
This patch series continues with the addition of supported features
for the Ethernet function of the PCI11010 / PCI11414 devices to
the LAN743x driver.
====================
====================
net: dsa: realtek: rtl8365mb: improve handling of PHY modes
This series introduces some minor cleanup of the driver and improves the
handling of PHY interface modes to break the assumption that CPU ports
are always over an external interface, and the assumption that user
ports are always using an internal PHY.
====================
Realtek switches in the rtl8365mb family always have at least one port
with a so-called external interface, supporting PHY interface modes such
as RGMII or SGMII. The purpose of this patch is to improve the driver's
handling of these ports.
A new struct rtl8365mb_chip_info is introduced together with a static
array of such structs. An instance of this struct is added for each
supported switch, distinguished by its chip ID and version. Embedded in
each chip_info struct is an array of struct rtl8365mb_extint, describing
the external interfaces available. This is more specific than the old
rtl8365mb_extint_port_map, which was only valid for switches with up to
6 ports.
The struct rtl8365mb_extint also contains a bitmask of supported PHY
interface modes, which allows the driver to distinguish which ports
support RGMII. This corrects a previous mistake in the driver whereby it
was assumed that any port with an external interface supports RGMII.
This is not actually the case: for example, the RTL8367S has two
external interfaces, only the second of which supports RGMII. The first
supports only SGMII and HSGMII. This new design will make it easier to
add support for other interface modes.
Finally, rtl8365mb_phylink_get_caps() is fixed up to return supported
capabilities based on the external interface properties described above.
This addresses Vladimir's point in the linked thread that the
capabilities are not actually a function of the DSA port type: Although
most typical applications will treat the ports with internal PHY as user
ports, there is no actual hardware limitation preventing one from using
them as a CPU port. Equally, ports with external interface(s) may well
be treated as user ports, even though it is typical to use those ports
as CPU ports.
Alvin Šipraga [Wed, 15 Jun 2022 22:51:13 +0000 (00:51 +0200)]
net: dsa: realtek: rtl8365mb: correct the max number of ports
The maximum number of ports is actually 11, according to two
observations:
1. The highest port ID used in the vendor driver is 10. Since port IDs
are indexed from 0, and since DSA follows the same numbering system,
this means up to 11 ports are to be presumed.
2. The registers with port mask fields always amount to a maximum port
mask of 0x7FF, corresponding to a maximum 11 ports.
Alvin Šipraga [Wed, 15 Jun 2022 22:51:12 +0000 (00:51 +0200)]
net: dsa: realtek: rtl8365mb: remove port_mask private data member
There is no real need for this variable: the line change interrupt mask
is sufficiently masked out when getting linkup_ind and linkdown_ind in
the interrupt handler.
The official name of this switch is RTL8367RB-VB, not RTL8367RB. There
is also an RTL8367RB-VC which is rather different. Change the name of
the CHIP_ID/_VER macros for reasons of consistency.
====================
net: ipa: more multi-channel event ring work
This series makes a little more progress toward supporting multiple
channels with a single event ring. The first removes the assumption
that consecutive events are associated with the same RX channel.
The second derives the channel associated with an event from the
event itself, and the next does a small cleanup enabled by that.
The fourth causes updates to occur for every event processed (rather
once). And the final patch does a little more rework to make TX
completion have more in common with RX completion.
====================
Alex Elder [Wed, 15 Jun 2022 16:59:29 +0000 (11:59 -0500)]
net: ipa: move more code out of gsi_channel_update()
Move the processing done for TX channels in gsi_channel_update()
into gsi_evt_ring_rx_update(). The called function is called for
both RX and TX channels, so rename it to be gsi_evt_ring_update().
As a result, this code no longer assumes events in an event ring are
associated with just one channel.
Because all events in a ring are handled in that function, we can
move the call to gsi_trans_move_complete() there, and can ring the
event ring doorbell there as well after all new events in the ring
have been processed.
When an RX transaction completes, we update the trans->len field to
contain the actual number of bytes received. This is done in a loop
in gsi_evt_ring_rx_update().
Change that function so it checks the data transfer direction
recorded in the transaction, and only updates trans->len for RX
transfers.
Then call it unconditionally. This means events for TX endpoints
will run through the loop without otherwise doing anything, but
this will change shortly.
Alex Elder [Wed, 15 Jun 2022 16:59:27 +0000 (11:59 -0500)]
net: ipa: pass GSI pointer to gsi_evt_ring_rx_update()
The only reason the event ring's channel pointer is needed in
gsi_evt_ring_rx_update() is so we can get at its GSI pointer.
We can pass the GSI pointer as an argument, along with the event
ring ID, and thereby avoid using the event ring channel pointer.
This is another step toward no longer assuming an event ring
services a single channel.
Alex Elder [Wed, 15 Jun 2022 16:59:26 +0000 (11:59 -0500)]
net: ipa: don't pass channel when mapping transaction
Change gsi_channel_trans_map() so it derives the channel used from
the transaction. Pass the index of the *first* TRE used by the
transaction, and have the called function account for the fact that
the last one used is what's important.
Alex Elder [Wed, 15 Jun 2022 16:59:25 +0000 (11:59 -0500)]
net: ipa: don't assume one channel per event ring
In gsi_evt_ring_rx_update(), use gsi_event_trans() repeatedly
to find the transaction associated with an event, rather than
assuming consecutive events are associated with the same channel.
This removes the only caller of gsi_trans_pool_next(), so get rid
of it.
====================
dt-bindings: dp83867: add binding for io_impedance_ctrl nvmem cell
We have a board where measurements indicate that the current three
options - leaving IO_IMPEDANCE_CTRL at the reset value (which is
factory calibrated to a value corresponding to approximately 50 ohms)
or using one of the two boolean properties to set it to the min/max
value - are too coarse.
This series adds a device tree binding for an nvmem cell which can be
populated during production with a suitable value calibrated for each
board, and corresponding support in the driver. The second patch adds
a trivial phy wrapper for dev_err_probe(), used in the third.
====================
Rasmus Villemoes [Tue, 14 Jun 2022 08:46:12 +0000 (10:46 +0200)]
net: phy: dp83867: implement support for io_impedance_ctrl nvmem cell
We have a board where measurements indicate that the current three
options - leaving IO_IMPEDANCE_CTRL at the (factory calibrated) reset
value or using one of the two boolean properties to set it to the
min/max value - are too coarse.
Implement support for the newly added binding allowing device tree to
specify an nvmem cell containing an appropriate value for this
specific board.
Rasmus Villemoes [Tue, 14 Jun 2022 08:46:11 +0000 (10:46 +0200)]
linux/phy.h: add phydev_err_probe() wrapper for dev_err_probe()
The dev_err_probe() function is quite useful to avoid boilerplate
related to -EPROBE_DEFER handling. Add a phydev_err_probe() helper to
simplify making use of that from phy drivers which otherwise use the
phydev_* helpers.
Rasmus Villemoes [Tue, 14 Jun 2022 08:46:10 +0000 (10:46 +0200)]
dt-bindings: dp83867: add binding for io_impedance_ctrl nvmem cell
We have a board where measurements indicate that the current three
options - leaving IO_IMPEDANCE_CTRL at the reset value (which is
factory calibrated to a value corresponding to approximately 50 ohms)
or using one of the two boolean properties to set it to the min/max
value - are too coarse.
There is no fixed mapping from register values to values in the range
35-70 ohms; it varies from chip to chip, and even that target range is
approximate. So add a DT binding for an nvmem cell which can be
populated during production with a value suitable for each specific
board.
This series implements support for sleepable uprobe programs.
Key work is in patches 2 and 3, the rest is plumbing and tests.
The main observation is that the only obstacle in the way of sleepable uprobe
programs is not the uprobe infrastructure, which already runs in a user-like
context, but the rcu usage around bpf_prog_array.
Details are in patch 2 but the tl;dr is that we chain trace_tasks and normal rcu
grace periods when releasing to array to accommodate users of either rcu type.
This introduces latency for non-sleepable users (kprobe, tp) but that's deemed
acceptable, given recent benchmarks by Andrii [1]. We're a couple of orders of
magnitude under the rate of bpf_prog_array churn that would raise flags (~1MM/s per Paul).
v2 -> v3:
* Inline uprobe_call_bpf into trace_uprobe.c, it's just a bpf_prog_run_array_sleepable call now.
* Do not disable preemption for uprobe non-sleepable programs.
* Add acks.
v1 -> v2:
* Fix lockdep annotations in bpf_prog_run_array_sleepable
* Chain rcu grace periods only for perf_event-attached programs. This limits
the additional latency on the free path to use cases where we know it won't
be a problem.
* Add tests calling helpers only available in sleepable programs.
* Remove kprobe.s support from libbpf.
====================
Delyan Kratunov [Tue, 14 Jun 2022 23:10:43 +0000 (23:10 +0000)]
bpf: allow sleepable uprobe programs to attach
uprobe and kprobe programs have the same program type, KPROBE, which is
currently not allowed to load sleepable programs.
To avoid adding a new UPROBE type, instead allow sleepable KPROBE
programs to load and defer the is-it-actually-a-uprobe-program check
to attachment time, where there's already validation of the
corresponding perf_event.
A corollary of this patch is that you can now load a sleepable kprobe
program but cannot attach it.
Delyan Kratunov [Tue, 14 Jun 2022 23:10:46 +0000 (23:10 +0000)]
bpf: implement sleepable uprobes by chaining gps
uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.
Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).
Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.
Users who iterate under tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations).
The downside is that the tasks_trace latency affects all perf_event-attached
bpf programs (and not just uprobe ones). This is deemed safe given the
possible attach rates for kprobe/uprobe/tp programs.
Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
an rcu read section, in addition to the overarching tasks_trace section.
Delyan Kratunov [Tue, 14 Jun 2022 23:10:42 +0000 (23:10 +0000)]
bpf: move bpf_prog to bpf.h
In order to add a version of bpf_prog_run_array which accesses the
bpf_prog->aux member, bpf_prog needs to be more than a forward
declaration inside bpf.h.
Given that filter.h already includes bpf.h, this merely reorders
the type declarations for filter.h users. bpf.h users now have access to
bpf_prog internals.
Andrii Nakryiko [Thu, 16 Jun 2022 05:55:43 +0000 (22:55 -0700)]
libbpf: Fix internal USDT address translation logic for shared libraries
Perform the same virtual address to file offset translation that libbpf
is doing for executable ELF binaries also for shared libraries.
Currently libbpf is making a simplifying and sometimes wrong assumption
that for shared libraries relative virtual addresses inside ELF are
always equal to file offsets.
Unfortunately, this is not always the case with LLVM's lld linker, which
now by default generates quite more complicated ELF segments layout.
E.g., for liburandom_read.so from selftests/bpf, here's an excerpt from
readelf output listing ELF segments (a.k.a. program headers):
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R 0x8
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x0005e4 0x0005e4 R 0x1000
LOAD 0x0005f0 0x00000000000015f0 0x00000000000015f0 0x000160 0x000160 R E 0x1000
LOAD 0x000750 0x0000000000002750 0x0000000000002750 0x000210 0x000210 RW 0x1000
LOAD 0x000960 0x0000000000003960 0x0000000000003960 0x000028 0x000029 RW 0x1000
Compare that to what is generated by GNU ld (or LLVM lld's with extra
-znoseparate-code argument which disables this cleverness in the name of
file size reduction):
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x000550 0x000550 R 0x1000
LOAD 0x001000 0x0000000000001000 0x0000000000001000 0x000131 0x000131 R E 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x0000ac 0x0000ac R 0x1000
LOAD 0x002dc0 0x0000000000003dc0 0x0000000000003dc0 0x000262 0x000268 RW 0x1000
You can see from the first example above that for executable (Flg == "R E")
PT_LOAD segment (LOAD #2), Offset doesn't match VirtAddr columns.
And it does in the second case (GNU ld output).
This is important because all the addresses, including USDT specs,
operate in a virtual address space, while kernel is expecting file
offsets when performing uprobe attach. So such mismatches have to be
properly taken care of and compensated by libbpf, which is what this
patch is fixing.
Also patch clarifies few function and variable names, as well as updates
comments to reflect this important distinction (virtaddr vs file offset)
and to ephasize that shared libraries are not all that different from
executables in this regard.
This patch also changes selftests/bpf Makefile to force urand_read and
liburand_read.so to be built with Clang and LLVM's lld (and explicitly
request this ELF file size optimization through -znoseparate-code linker
parameter) to validate libbpf logic and ensure regressions don't happen
in the future. I've bundled these selftests changes together with libbpf
changes to keep the above description tied with both libbpf and
selftests changes.
Linus Torvalds [Thu, 16 Jun 2022 18:51:32 +0000 (11:51 -0700)]
Merge tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Mostly driver fixes.
Current release - regressions:
- Revert "net: Add a second bind table hashed by port and address",
needs more work
- amd-xgbe: use platform_irq_count(), static setup of IRQ resources
had been removed from DT core
- dts: at91: ksz9477_evb: add phy-mode to fix port/phy validation
Current release - new code bugs:
- hns3: modify the ring param print info
Previous releases - always broken:
- axienet: make the 64b addressable DMA depends on 64b architectures
- iavf: fix issue with MAC address of VF shown as zero
- ice: fix PTP TX timestamp offset calculation
- usb: ax88179_178a needs FLAG_SEND_ZLP
Misc:
- document some net.sctp.* sysctls"
* tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
net: axienet: add missing error return code in axienet_probe()
Revert "net: Add a second bind table hashed by port and address"
net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg
net: usb: ax88179_178a needs FLAG_SEND_ZLP
MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS
ARM: dts: at91: ksz9477_evb: fix port/phy validation
net: bgmac: Fix an erroneous kfree() in bgmac_remove()
ice: Fix memory corruption in VF driver
ice: Fix queue config fail handling
ice: Sync VLAN filtering features for DVM
ice: Fix PTP TX timestamp offset calculation
mlxsw: spectrum_cnt: Reorder counter pools
docs: networking: phy: Fix a typo
amd-xgbe: Use platform_irq_count()
octeontx2-vf: Add support for adaptive interrupt coalescing
xilinx: Fix build on x86.
net: axienet: Use iowrite64 to write all 64b descriptor pointers
net: axienet: make the 64b addresable DMA depends on 64b archectures
net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization
net: hns3: fix PF rss size initialization bug
...
Joanne Koong [Wed, 15 Jun 2022 19:32:13 +0000 (12:32 -0700)]
Revert "net: Add a second bind table hashed by port and address"
This reverts:
commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address")
commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/[email protected]/
There are a few things that need to be fixed here:
* Updating bhash2 in cases where the socket's rcv saddr changes
* Adding bhash2 hashbucket locks
Haiyang Zhang [Tue, 14 Jun 2022 20:28:55 +0000 (13:28 -0700)]
net: mana: Add support of XDP_REDIRECT action
Add a handler of the XDP_REDIRECT return code from a XDP program. The
packets will be flushed at the end of each RX/CQ NAPI poll cycle.
ndo_xdp_xmit() is implemented by sharing the code in mana_xdp_tx().
Ethtool per queue counters are added for XDP redirect and xmit operations.
net: ethernet: stmmac: reset force speed bit for ipq806x
Some bootloader may set the force speed regs even if the actual
interface should use autonegotiation between PCS and PHY.
This cause the complete malfuction of the interface.
To fix this correctly reset the force speed regs if a fixed-link is not
defined in the DTS. With a fixed-link node correctly configure the
forced speed regs to handle any misconfiguration by the bootloader.
net: ethernet: stmmac: add missing sgmii configure for ipq806x
The different gmacid require different configuration based on the soc
and on the gmac id. Add these missing configuration taken from the
original driver.
Linus Torvalds [Wed, 15 Jun 2022 21:20:26 +0000 (14:20 -0700)]
Merge tag 'hardening-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Correctly handle vm_map areas in hardened usercopy (Matthew Wilcox)
- Adjust CFI RCU usage to avoid boot splats with cpuidle (Sami Tolvanen)
* tag 'hardening-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: Make usercopy resilient against ridiculously large copies
usercopy: Cast pointer to an integer once
usercopy: Handle vm_map_ram() areas
cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle
Linus Torvalds [Wed, 15 Jun 2022 19:34:19 +0000 (12:34 -0700)]
Merge tag 'tpmdd-next-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"Two fixes for this merge window"
* tag 'tpmdd-next-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
certs: fix and refactor CONFIG_SYSTEM_BLACKLIST_HASH_LIST build
certs/blacklist_hashes.c: fix const confusion in certs blacklist
Linus Torvalds [Wed, 15 Jun 2022 16:04:55 +0000 (09:04 -0700)]
Merge tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull vfs idmapping fix from Christian Brauner:
"This fixes an issue where we fail to change the group of a file when
the caller owns the file and is a member of the group to change to.
This is only relevant on idmapped mounts.
There's a detailed description in the commit message and regression
tests have been added to xfstests"
* tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: account for group membership
Duoming Zhou [Tue, 14 Jun 2022 09:25:57 +0000 (17:25 +0800)]
net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg
The skb_recv_datagram() in ax25_recvmsg() will hold lock_sock
and block until it receives a packet from the remote. If the client
doesn`t connect to server and calls read() directly, it will not
receive any packets forever. As a result, the deadlock will happen.
This patch replaces skb_recv_datagram() with an open-coded variant of it
releasing the socket lock before the __skb_wait_for_more_packets() call
and re-acquiring it after such call in order that other functions that
need socket lock could be executed.
what's more, the socket lock will be released only when recvmsg() will
block and that should produce nicer overall behavior.
bcm63xx_enet: switch to napi_build_skb() to reuse skbuff_heads
napi_build_skb() reuses NAPI skbuff_head cache in order to save some
cycles on freeing/allocating skbuff_heads on every new Rx or completed
Tx.
Use napi_consume_skb() to feed the cache with skbuff_heads of completed
Tx so it's never empty.
Casper Andersson [Tue, 14 Jun 2022 06:32:23 +0000 (08:32 +0200)]
net: bridge: allow add/remove permanent mdb entries on disabled ports
Adding mdb entries on disabled ports allows you to do setup before
accepting any traffic, avoiding any time where the port is not in the
multicast group.
Problems observed:
======================================================================
1) Using ssh/sshfs. The remote sshd daemon can abort with the message:
"message authentication code incorrect"
This happens because the tcp message sent is corrupted during the
USB "Bulk out". The device calculate the tcp checksum and send a
valid tcp message to the remote sshd. Then the encryption detects
the error and aborts.
2) NETDEV WATCHDOG: ... (ax88179_178a): transmit queue 0 timed out
3) Stop normal work without any log message.
The "Bulk in" continue receiving packets normally.
The host sends "Bulk out" and the device responds with -ECONNRESET.
(The netusb.c code tx_complete ignore -ECONNRESET)
Under normal conditions these errors take days to happen and in
intense usage take hours.
A test with ping gives packet loss, showing that something is wrong:
ping -4 -s 462 {destination} # 462 = 512 - 42 - 8
Not all packets fail.
My guess is that the device tries to find another packet starting
at the extra byte and will fail or not depending on the next
bytes (old buffer content).
======================================================================
David S. Miller [Wed, 15 Jun 2022 08:15:33 +0000 (09:15 +0100)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-06-14
This series contains updates to ice driver only.
Michal fixes incorrect Tx timestamp offset calculation for E822 devices.
Roman enforces required VLAN filtering settings for double VLAN mode.
Przemyslaw fixes memory corruption issues with VFs by ensuring
queues are disabled in the error path of VF queue configuration and to
disabled VFs during reset.
====================
The first patch in this series makes the name used for variables
representing a TRE ring be consistent everywhere. The second
renames two structure fields to better represent their purpose.
The last four rework a little code that manages some tranaction and
byte transfer statistics maintained mainly for TX endpoints. For
the most part this series is refactoring. The last one also
includes the first step toward no longer assuming an event ring is
dedicated to a single channel.
====================
Alex Elder [Mon, 13 Jun 2022 17:17:59 +0000 (12:17 -0500)]
net: ipa: rework gsi_channel_tx_update()
Rename gsi_channel_tx_update() to be gsi_trans_tx_completed(), and
pass it just the transaction pointer, deriving the channel from the
transaction. Update the comments above the function to provide a
more concise description of how statistics for TX endpoints are
maintained and used.
Alex Elder [Mon, 13 Jun 2022 17:17:58 +0000 (12:17 -0500)]
net: ipa: stop counting total RX bytes and transactions
In gsi_evt_ring_rx_update(), we update each transaction so its len
field reflects the actual number of bytes received. In the process,
the total number of transactions and bytes processed on the channel
are summed, and added to a running total for the channel.
But we don't actually use those running totals for RX endpoints.
They're maintained for TX channels to support CoDel when they are
associated with a "real" network device.
So stop maintaining these totals for RX endpoints, and update the
comment where the fields are defined to make it clear they're only
valid for TX channels.
Alex Elder [Mon, 13 Jun 2022 17:17:57 +0000 (12:17 -0500)]
net: ipa: simplify TX completion statistics
When a TX request is issued, its channel's accumulated byte and
transaction counts are recorded. This currently does *not* take
into account the transaction being committed.
Later, when the transaction completes, the number of bytes and
transactions that have completed since the transaction was committed
are reported to the network stack. The transaction and its byte
count are accounted for at that time.
Instead, record the transaction and its bytes in the counts recorded
at commit time. This avoids the need to do so when the transaction
completes, and provides a (small) simplification of that code.
Alex Elder [Mon, 13 Jun 2022 17:17:55 +0000 (12:17 -0500)]
net: ipa: rename two transaction fields
There are two fields in a GSI transaction that keep track of TRE
counts. The first represents the number of TREs reserved for the
transaction in the TRE ring; that's currently named "tre_count".
The second is the number of TREs that are actually *used* by the
transaction at the time it is committed.
Rename the "tre_count" field to be "rsvd_count", to make its meaning
a little more specific. The "_count" is present in the name mainly
to avoid interpreting it as a reserved (not-to-be-used) field. This
name also distinguishes it from the "tre_count" field associated
with a channel.
Rename the "used" field to be "used_count", to match the convention
used for reserved TREs.
Alex Elder [Mon, 13 Jun 2022 17:17:54 +0000 (12:17 -0500)]
net: ipa: use "tre_ring" for all TRE ring local variables
All local variables that represent event rings are named "ring".
All but two functions that represent a channel's TRE ring with a
local variable use the name "tre_ring". For consistency, use that
name in the two functions that don't fit the pattern.
Jakub Kicinski [Wed, 15 Jun 2022 05:35:18 +0000 (22:35 -0700)]
Merge branch 'support-mt7531-on-bpi-r2-pro'
Frank Wunderlich says:
====================
Support mt7531 on BPI-R2 Pro
This Series add Support for the mt7531 switch on Bananapi R2 Pro board.
This board uses port5 of the switch to conect to the gmac0 of the
rk3568 SoC.
Currently CPU-Port is hardcoded in the mt7530 driver to port 6.
Compared to v1 the reset-Patch was dropped as it was not needed and
CPU-Port-changes are completely rewriten based on suggestions/code from
Vladimir Oltean (many thanks to this).
In DTS Patch i only dropped the status-property that was not
needed/ignored by driver.
Due to the Changes i also made a regression test on mt7623 bpi-r2
(mt7623 soc + mt7530) and bpi-r64 (mt7622 soc + mt7531) with cpu-
port 6. Tests were done directly (ipv4 config on dsa user port)
and with vlan-aware bridge including vlan that was tagged outgoing
on dsa user port.
====================
Frank Wunderlich [Fri, 10 Jun 2022 17:05:40 +0000 (19:05 +0200)]
dt-bindings: net: dsa: make reset optional and add rgmii-mode to mt7531
A board may have no independent reset-line, so reset cannot be used
inside switch driver.
E.g. on Bananapi-R2 Pro switch and gmac are connected to same reset-line.
Resets should be acquired only to 1 device/driver. This prevents reset to
be bound to switch-driver if reset is already used for gmac. If reset is
only used by switch driver it resets the switch *and* the gmac after the
mdio bus comes up resulting in mdio bus goes down. It takes some time
until all is up again, switch driver tries to read from mdio, will fail
and defer the probe. On next try the reset does the same again.
Make reset optional for such boards.
Allow port 5 as cpu-port and phy-mode rgmii for mt7531.
- MT7530 supports RGMII on port 5 and RGMII/TRGMII on port 6.
- MT7531 supports on port 5 RGMII and SGMII (dual-sgmii) and
SGMII on port 6.
Frank Wunderlich [Fri, 10 Jun 2022 17:05:37 +0000 (19:05 +0200)]
net: dsa: mt7530: rework mt7530_hw_vlan_{add,del}
Rework vlan_add/vlan_del functions in preparation for dynamic cpu port.
Currently BIT(MT7530_CPU_PORT) is added to new_members, even though
mt7530_port_vlan_add() will be called on the CPU port too.
Let DSA core decide when to call port_vlan_add for the CPU port, rather
than doing it implicitly.
We can do autonomous forwarding in a certain VLAN, but not add br0 to that
VLAN and avoid flooding the CPU with those packets, if software knows it
doesn't need to process them.
Jakub Kicinski [Wed, 15 Jun 2022 04:51:07 +0000 (21:51 -0700)]
Merge branch 'mlxsw-remove-xm-support'
Ido Schimmel says:
====================
mlxsw: Remove XM support
The XM was supposed to be an external device connected to the
Spectrum-{2,3} ASICs using dedicated Ethernet ports. Its purpose was to
increase the number of routes that can be offloaded to hardware. This was
achieved by having the ASIC act as a cache that refers cache misses to the
XM where the FIB is stored and LPM lookup is performed.
Testing was done over an emulator and dedicated setups in the lab, but
the product was discontinued before shipping to customers.
Therefore, in order to remove dead code and reduce complexity of the
code base, revert the three patchsets that added XM support.
====================
1) Updated HW bits and definitions for upcoming features
1.1) vport debug counters
1.2) flow meter
1.3) Execute ASO action for flow entry
1.4) enhanced CQE compression
2) Add ICM header-modify-pattern RDMA API
Leon Says
=========
SW steering manipulates packet's header using "modifying header" actions.
Many of these actions do the same operation, but use different data each time.
Currently we create and keep every one of these actions, which use expensive
and limited resources.
Now we introduce a new mechanism - pattern and argument, which splits
a modifying action into two parts:
1. action pattern: contains the operations to be applied on packet's header,
mainly set/add/copy of fields in the packet
2. action data/argument: contains the data to be used by each operation
in the pattern.
This way we reuse same patterns with different arguments to create new
modifying actions, and since many actions share the same operations, we end
up creating a small number of patterns that we keep in a dedicated cache.
These modify header patterns are implemented as new type of ICM memory,
so the following kernel patch series add the support for this new ICM type.
==========
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add bits and fields to support enhanced CQE compression
net/mlx5: Remove not used MLX5_CAP_BITS_RW_MASK
net/mlx5: group fdb cleanup to single function
net/mlx5: Add support EXECUTE_ASO action for flow entry
net/mlx5: Add HW definitions of vport debug counters
net/mlx5: Add IFC bits and enums for flow meter
RDMA/mlx5: Support handling of modify-header pattern ICM area
net/mlx5: Manage ICM of type modify-header pattern
net/mlx5: Introduce header-modify-pattern ICM properties
====================
Yonghong Song [Tue, 14 Jun 2022 05:55:26 +0000 (22:55 -0700)]
selftests/bpf: Avoid skipping certain subtests
Commit 704c91e59fe0 ('selftests/bpf: Test "bpftool gen min_core_btf"')
added a test test_core_btfgen to test core relocation with btf
generated with 'bpftool gen min_core_btf'. Currently,
among 76 subtests, 25 are skipped.
Alexei found that in the above core_reloc_btfgen/enum64val___err_missing
should not be skipped.
Currently, the core_reloc tests have some negative tests.
In Commit 704c91e59fe0, for core_reloc_btfgen, all negative tests
are skipped with the following condition
if (!test_case->btf_src_file || test_case->fails) {
test__skip();
continue;
}
This is too conservative. Negative tests do not fail
mkstemp() and run_btfgen() should not be skipped.
There are a few negative tests indeed failing run_btfgen()
and this patch added 'run_btfgen_fails' to mark these tests
so that they can be skipped for btfgen tests. With this,
we have
...
#46/69 core_reloc_btfgen/enumval:OK
#46/70 core_reloc_btfgen/enumval___diff:OK
#46/71 core_reloc_btfgen/enumval___val3_missing:OK
#46/72 core_reloc_btfgen/enumval___err_missing:OK
#46/73 core_reloc_btfgen/enum64val:OK
#46/74 core_reloc_btfgen/enum64val___diff:OK
#46/75 core_reloc_btfgen/enum64val___val3_missing:OK
#46/76 core_reloc_btfgen/enum64val___err_missing:OK
...
Summary: 1/62 PASSED, 14 SKIPPED, 0 FAILED
The failure is due to
20: (bf) r2 = r1 ; R1_w=scalar(id=2,smax=1099511627776,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32)
21: (c7) r2 s>>= 32 ; R2=scalar(smin=-2147483648,smax=256)
22: (c5) if r2 s< 0x0 goto pc+7 ; R2=scalar(umax=256,var_off=(0x0; 0x1ff))
26: (77) r1 >>= 32 ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
29: (0f) r6 += r1 ; R1_w=Pscalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6_w=map_value(off=1060,ks=4,vs=1572,umax=4294967295,var_off=(0)
where r1 has conservative value range compared to r2 and r1 is used later.
In llvm, commit [1] triggered the above code generation and caused
verification failure.
It may take a while for llvm to address this issue. In the main time,
let us change the variable 'len' type to 'long' and adjust condition properly.
Tested with llvm14 and latest llvm15, both worked fine.
Quentin Monnet [Fri, 10 Jun 2022 11:26:48 +0000 (12:26 +0100)]
bpftool: Do not check return value from libbpf_set_strict_mode()
The function always returns 0, so we don't need to check whether the
return value is 0 or not.
This change was first introduced in commit a777e18f1bcd ("bpftool: Use
libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"), but later reverted to
restore the unconditional rlimit bump in bpftool. Let's re-add it.
In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
kernel has switched to memcg-based memory accounting. Thanks to the
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
with other systems and ask libbpf to raise the limit for us if
necessary.
How do we know if memcg-based accounting is supported? There is a probe
in libbpf to check this. But this probe currently relies on the
availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
landed in the same kernel version as the memory accounting change. This
works in the generic case, but it may fail, for example, if the helper
function has been backported to an older kernel. This has been observed
for Google Cloud's Container-Optimized OS (COS), where the helper is
available but rlimit is still in use. The probe succeeds, the rlimit is
not raised, and probing features with bpftool, for example, fails.
A patch was submitted [0] to update this probe in libbpf, based on what
the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
0, attempt to load a BPF object, and reset the rlimit. But it may induce
some hard-to-debug flakiness if another process starts, or the current
application is killed, while the rlimit is reduced, and the approach was
discarded.
As a workaround to ensure that the rlimit bump does not depend on the
availability of a given helper, we restore the unconditional rlimit bump
in bpftool for now.
Linus Torvalds [Tue, 14 Jun 2022 17:36:11 +0000 (10:36 -0700)]
netfs: fix up netfs_inode_init() docbook comment
Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode
wrapper introduced") changed the argument types and names, and actually
updated the comment too (although that was thanks to David Howells, not
me: my original patch only changed the code).
But the comment fixup didn't go quite far enough, and didn't change the
argument name in the comment, resulting in
include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init'
include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init'
during htmldoc generation.
Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Disable VF's RX/TX queues, when VIRTCHNL_OP_CONFIG_VSI_QUEUES fail.
Not disabling them might lead to scenario, where PF driver leaves VF
queues enabled, when VF's VSI failed queue config.
In this scenario VF should not have RX/TX queues enabled. If PF failed
to set up VF's queues, VF will reset due to TX timeouts in VF driver.
Initialize iterator 'i' to -1, so if error happens prior to configuring
queues then error path code will not disable queue 0. Loop that
configures queues will is using same iterator, so error path code will
only disable queues that were configured.
Fixes: 77ca27c41705 ("ice: add support for virtchnl_queue_select.[tx|rx]_queues bitmap") Suggested-by: Slawomir Laba <[email protected]> Signed-off-by: Przemyslaw Patynowski <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
VLAN filtering features, that is C-Tag and S-Tag, in DVM mode must be
both enabled or disabled.
In case of turning off/on only one of the features, another feature must
be turned off/on automatically with issuing an appropriate message to
the kernel log.
Fixes: 1babaf77f49d ("ice: Advertise 802.1ad VLAN filtering and offloads for PF netdev") Signed-off-by: Roman Storozhenko <[email protected]> Co-developed-by: Anatolii Gerasymenko <[email protected]> Signed-off-by: Anatolii Gerasymenko <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
Michal Michalik [Tue, 10 May 2022 11:03:43 +0000 (13:03 +0200)]
ice: Fix PTP TX timestamp offset calculation
The offset was being incorrectly calculated for E822 - that led to
collisions in choosing TX timestamp register location when more than
one port was trying to use timestamping mechanism.
In E822 one quad is being logically split between ports, so quad 0 is
having trackers for ports 0-3, quad 1 ports 4-7 etc. Each port should
have separate memory location for tracking timestamps. Due to error for
example ports 1 and 2 had been assigned to quad 0 with same offset (0),
while port 1 should have offset 0 and 1 offset 16.
Fix it by correctly calculating quad offset.
Fixes: 3a7496234d17 ("ice: implement basic E822 PTP support") Signed-off-by: Michal Michalik <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>