David S. Miller [Wed, 29 Mar 2023 07:19:38 +0000 (08:19 +0100)]
Merge branch 'vsock-sockmap-support'
Bobby Eshleman says:
====================
Add support for sockmap to vsock.
We're testing usage of vsock as a way to redirect guest-local UDS
requests to the host and this patch series greatly improves the
performance of such a setup.
Compared to copying packets via userspace, this improves throughput by
121% in basic testing.
Tested as follows.
Setup: guest unix dgram sender -> guest vsock redirector -> host vsock
server
Threads: 1
Payload: 64k
No sockmap:
- 76.3 MB/s
- The guest vsock redirector was
"socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
Using sockmap (this patch):
- 168.8 MB/s (+121%)
- The guest redirector was a simple sockmap echo server,
redirecting unix ingress to vsock 2:1234 egress.
- Same sender and server programs
*Note: these numbers are from RFC v1
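The +121% figure follows from the two throughput numbers quoted above. A quick check of the arithmetic (the variable names are only illustrative):

```python
# Throughput figures quoted above, in MB/s.
no_sockmap = 76.3
with_sockmap = 168.8

# Relative improvement of the sockmap redirector over the socat copy path.
gain_pct = (with_sockmap - no_sockmap) / no_sockmap * 100
print(f"improvement: +{gain_pct:.0f}%")
```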
Only the virtio transport has been tested. The loopback transport was
used in writing bpf/selftests, but not thoroughly tested otherwise.
This series requires the skb patch.
Changes in v4:
- af_vsock: fix parameter alignment in vsock_dgram_recvmsg()
- af_vsock: add TCP_ESTABLISHED comment in vsock_dgram_connect()
- vsock/bpf: change ret type to bool
Changes in v3:
- vsock/bpf: Refactor wait logic in vsock_bpf_recvmsg() to avoid
backwards goto
- vsock/bpf: Check psock before acquiring slock
- vsock/bpf: Return bool instead of int of 0 or 1
- vsock/bpf: Wrap macro args __sk/__psock in parens
- vsock/bpf: Place comment trailer */ on separate line
Changes in v2:
- vsock/bpf: rename vsock_dgram_* -> vsock_*
- vsock/bpf: change sk_psock_{get,put} and {lock,release}_sock() order
to minimize slock hold time
- vsock/bpf: use "new style" wait
- vsock/bpf: fix bug in wait log
- vsock/bpf: add check that recvmsg sk_type is one of dgram, seqpacket, or
stream. Return error if not one of the three.
- virtio/vsock: comment __skb_recv_datagram() usage
- virtio/vsock: do not init copied in read_skb()
- vsock/bpf: add ifdef guard around struct proto in dgram_recvmsg()
- selftests/bpf: add vsock loopback config for aarch64
- selftests/bpf: add vsock loopback config for s390x
- selftests/bpf: remove vsock device from vmtest.sh qemu machine
- selftests/bpf: remove CONFIG_VIRTIO_VSOCKETS=y from config.x86_64
- vsock/bpf: move transport-related (e.g., if (!vsk->transport)) checks
out of fast path
====================
Bobby Eshleman [Mon, 27 Mar 2023 19:11:51 +0000 (19:11 +0000)]
vsock: support sockmap
This patch adds sockmap support for vsock sockets. It is intended to be
usable by all transports, but only the virtio and loopback transports
are implemented.
SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported.
====================
ynl: add support for user headers and struct attrs
Add support for user headers and struct attrs to YNL. This patchset adds
features to ynl and adds a partial spec for openvswitch that demonstrates
use of the features.
Patch 1-4 add features to ynl
Patch 5 adds partial openvswitch specs that demonstrate the new features
Patch 6-7 add documentation for legacy structs and for sub-type
====================
Donald Hunter [Mon, 27 Mar 2023 08:31:36 +0000 (09:31 +0100)]
netlink: specs: add partial specification for openvswitch
The openvswitch family has a fixed header, uses struct attrs and has array
values. This partial spec demonstrates these features in the YNL CLI. These
specs are sufficient to create, delete and dump datapaths and to dump vports.
Donald Hunter [Mon, 27 Mar 2023 08:31:35 +0000 (09:31 +0100)]
tools: ynl: Add fixed-header support to ynl
Add support for netlink families that add an optional fixed header structure
after the genetlink header and before any attributes. The fixed-header can be
specified on a per op basis, or once for all operations, which serves as a
default value that can be overridden.
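To make the message layout concrete, here is a minimal sketch of where a fixed header sits in a generic netlink message. It is not the ynl implementation: build_genl_msg() is a hypothetical helper, the family id, command and version values are placeholders, and attribute padding is not handled. The 4-byte fixed header mirrors openvswitch's struct ovs_header, which holds a single int dp_ifindex.

```python
import struct

NLMSG_HDRLEN = 16   # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
GENL_HDRLEN = 4     # struct genlmsghdr: u8 cmd, u8 version, u16 reserved
NLM_F_REQUEST = 0x01


def build_genl_msg(family_id, cmd, version, fixed_header=b"", attrs=b"", seq=1):
    """Pack nlmsghdr + genlmsghdr + optional fixed header + attributes."""
    genl = struct.pack("=BBH", cmd, version, 0)
    payload = genl + fixed_header + attrs
    nlh = struct.pack("=IHHII", NLMSG_HDRLEN + len(payload), family_id,
                      NLM_F_REQUEST, seq, 0)
    return nlh + payload


# Placeholder values: family id 0x20, command 3, version 2, dp_ifindex 7.
ovs_header = struct.pack("=i", 7)   # struct ovs_header { int dp_ifindex; }
msg = build_genl_msg(0x20, 3, 2, fixed_header=ovs_header)

# The fixed header starts right after the two netlink headers.
fixed_offset = NLMSG_HDRLEN + GENL_HDRLEN
print(len(msg), struct.unpack_from("=i", msg, fixed_offset)[0])
```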
Jakub Kicinski [Wed, 29 Mar 2023 06:52:12 +0000 (23:52 -0700)]
Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-20
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Provide external API for allocating vectors
net/mlx5: Use one completion vector if eth is disabled
net/mlx5: Refactor calculation of required completion vectors
net/mlx5: Move devlink registration before mlx5_load
net/mlx5: Use dynamic msix vectors allocation
net/mlx5: Refactor completion irq request/release code
net/mlx5: Improve naming of pci function vectors
net/mlx5: Use newer affinity descriptor
net/mlx5: Modify struct mlx5_irq to use struct msi_map
net/mlx5: Fix wrong comment
net/mlx5e: Coding style fix, add empty line
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
lib: cpu_rmap: Use allocator for rmap entries
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================
clang with W=1 reports
drivers/net/ethernet/8390/axnet_cs.c:653:9: error: variable
'xfer_count' set but not used [-Werror,-Wunused-but-set-variable]
int xfer_count = count;
^
This variable is not used so remove it.
Wolfram Sang [Mon, 27 Mar 2023 15:21:12 +0000 (17:21 +0200)]
Revert "sh_eth: remove open coded netif_running()"
This reverts commit ce1fdb065695f49ef6f126d35c1abbfe645d62d5. It turned
out this actually introduces a race condition. netif_running() is not a
suitable check for get_stats.
Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.
This happens when there are a large number of connections to or from the
same IP address. The memtier benchmark when run on the same host as
memcached amplifies this massively. But even over real network connections
this issue can be observed at an obviously smaller scale (due to the
network bandwidth limitations in my setup, i.e. 1Gb). How to reproduce:
Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100
on the same machine. localhost connections amplify the problem.
Start with the defaults for $N and $M and increase them. Depending on
your machine this will tank at some point. But even in reasonably small
$N, $M scenarios the refcount operations and the resulting false sharing
fallout becomes visible in perf top. At some point it becomes the
dominating issue.
There are two factors which make this reference count a scalability issue:
1) False sharing
dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
it into a separate cacheline vs. the read mostly members located at
the beginning of the struct.
That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also
dst_entry::lwtstate
which is located after the reference count and in the same cache
line. This member is read after a reference count has been acquired.
The other problem is struct rtable, which embeds a struct dst_entry
at offset 0. struct dst_entry has a size of 112 bytes, which means
that the struct members of rtable which follow the dst member share
the same cache line as dst_entry::__refcnt. Especially
rtable::rt_genid
is also read by the contexts which have a reference count acquired
already.
When dst_entry::__refcnt is incremented or decremented via an atomic
operation these read accesses stall and contribute to the performance
problem.
2) atomic_inc_not_zero()
A reference on dst_entry::__refcnt is acquired via
atomic_inc_not_zero() and released via atomic_dec_return().
atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
which exposes O(N^2) behaviour under contention with N concurrent
operations. Scalability degrades with even a small
number of contenders and gets worse from there.
Lightweight instrumentation exposed an average of 8!! retry loops per
atomic_inc_not_zero() invocation in an inc()/dec() loop running
concurrently on 112 CPUs.
There is nothing which can be done to make atomic_inc_not_zero() more
scalable.
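The retry behaviour described in 2) can be illustrated with a small single-threaded model. This is only a sketch: AtomicInt, inc_not_zero() and make_interference() are hypothetical stand-ins, and the interference callback simulates another CPU changing the counter between the read and the compare-and-exchange.

```python
class AtomicInt:
    """Toy stand-in for atomic_t; not thread-safe, for illustration only."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def try_cmpxchg(self, expected, new):
        """Store new if the value equals expected; return (ok, observed)."""
        if self.value == expected:
            self.value = new
            return True, expected
        return False, self.value


def make_interference(times):
    """Return a callback that bumps the counter for the first `times` calls."""
    remaining = [times]

    def interfere(counter):
        if remaining[0] > 0:
            remaining[0] -= 1
            counter.value += 1

    return interfere


def inc_not_zero(counter, interfere=None):
    """Increment unless the count is zero; return (acquired, retries)."""
    retries = 0
    old = counter.read()
    while old != 0:
        if interfere is not None:
            interfere(counter)          # a competing CPU wins the race
        ok, old = counter.try_cmpxchg(old, old + 1)
        if ok:
            return True, retries
        retries += 1                    # lost the race, try again
    return False, retries


counter = AtomicInt(1)
acquired, retries = inc_not_zero(counter, make_interference(3))
print(acquired, retries, counter.read())
```

Every lost race costs one more trip around the loop, which is why the cost grows with the number of contenders.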
The following series addresses these issues:
1) Reorder and pad struct dst_entry to prevent the false sharing.
2) Implement and use a reference count implementation which avoids the
atomic_inc_not_zero() problem.
It is slightly less performant in the case of the final 0 -> -1
transition, but the deconstruction of these objects is a low
frequency event. get()/put() pairs are in the hotpath and that's
what this implementation optimizes for.
The algorithm of this reference count is only suitable for RCU
managed objects. Therefore it cannot replace the refcount_t
algorithm, which is also based on atomic_inc_not_zero(), due to a
subtle race condition related to the 0 -> -1 transition and the final
verdict to mark the reference count dead. See details in patch 2/3.
It might be just my lack of imagination which declares this to be
impossible and I'd be happy to be proven wrong.
As a bonus the new rcuref implementation provides underflow/overflow
detection and mitigation while being performance wise on par with
open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
the non-contended case.
The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.
The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X and from +2% to +5% for networked operations on a 1Gb connection.
A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!
Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.
Though the false sharing issue is obvious and should be addressed
independent of the more focussed reference count changes.
- Fixup kernel doc of generated atomic_add_negative() variants
I want to say thanks to Wangyang who analyzed the issue and provided the
initial fix for the false sharing problem. Further thanks go to Arjan,
Peter, Marc, Will and Borislav for valuable input and providing test
results on machines which I do not have access to, and to Linus and
Eric, Qiuxu and Mark for helpful feedback.
====================
Thomas Gleixner [Thu, 23 Mar 2023 20:55:32 +0000 (21:55 +0100)]
net: dst: Switch to rcuref_t reference counting
Under high contention dst_entry::__refcnt becomes a significant bottleneck.
atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.
Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.
The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.
Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.
Wangyang Guo [Thu, 23 Mar 2023 20:55:29 +0000 (21:55 +0100)]
net: dst: Prevent false sharing vs. dst_entry::__refcnt
dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.
Aside of the unavoidable reference count contention there is another
significant problem which is caused by that: False sharing.
perf top identified two affected read accesses. dst_entry::lwtstate and
rtable::rt_genid.
dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it into
a separate cacheline vs. the read mostly members located at the beginning
of the struct.
That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also
dst_entry::lwtstate
which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.
struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially
rtable::rt_genid
is also read by the contexts which have a reference count acquired
already.
When dst_entry::__refcnt is incremented or decremented via an atomic
operation these read accesses stall. This was found when analysing the
memtier benchmark in 1:100 mode, which amplifies the problem extremely.
Move the rt[6i]_uncached[_list] members out of struct rtable and struct
rt6_info into struct dst_entry to provide padding and move the lwtstate
member after that so it ends up in the same cache line.
The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.
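The cache line reasoning above can be checked with a few lines of arithmetic. This sketch assumes 64-byte cache lines; the offset (64) and struct size (112) are the values given in the text, and the constant names are only illustrative.

```python
CACHE_LINE = 64          # assumed cache line size in bytes

REFCNT_OFFSET = 64       # offset of dst_entry::__refcnt, from the text
DST_ENTRY_SIZE = 112     # sizeof(struct dst_entry), from the text


def cache_line(offset):
    """Index of the cache line that contains the given byte offset."""
    return offset // CACHE_LINE


refcnt_line = cache_line(REFCNT_OFFSET)

# struct rtable embeds dst_entry at offset 0, so its next member starts at
# offset DST_ENTRY_SIZE and lands in the same line as the reference count.
next_member_line = cache_line(DST_ENTRY_SIZE)
shared_bytes = (refcnt_line + 1) * CACHE_LINE - DST_ENTRY_SIZE

print(refcnt_line, next_member_line, shared_bytes)
```

Any rtable member placed in those trailing bytes is invalidated on every reference count update, which is the false sharing the patch removes.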
net: ethernet: ti: am65-cpsw: enable p0 host port rx_vlan_remap
By default, the tagged ingress packets to the switch from the host port
P0 get internal switch priority assigned equal to the DMA CPPI channel
number they came from, unless CPSW_P0_CONTROL_REG.RX_REMAP_VLAN is enabled.
This causes issues with applying QoS policies and mapping packets on
external port fifos, because the default configuration is vlan_aware and
DMA CPPI channels are shared between all external ports.
Hence enable CPSW_P0_CONTROL_REG.RX_REMAP_VLAN so packet will preserve
internal switch priority assigned following the VLAN(priority) tag no
matter through which DMA CPPI Channels packets enter the switch.
Paolo Abeni [Tue, 28 Mar 2023 10:03:54 +0000 (12:03 +0200)]
Merge branch 'allocate-multiple-skbuffs-on-tx'
Arseniy Krasnov says:
====================
allocate multiple skbuffs on tx
This adds a small optimization to the tx path: instead of allocating a single
skbuff on every call to the transport, allocate multiple skbuffs for as long
as credit space allows, thus trying to send as much data as possible without
returning to af_vsock.c.
This patchset also includes a second patch which adds a check and early
return to 'virtio_transport_get_credit()' and
'virtio_transport_put_credit()' when these functions are called with a zero
argument. This is needed because a zero argument turns both functions into
no-ops, yet both of them always try to acquire the spinlock. Moreover, the
first patch always calls 'virtio_transport_put_credit()' with a zero
argument in case of successful packet transmission.
====================
Arseniy Krasnov [Sat, 25 Mar 2023 22:03:52 +0000 (01:03 +0300)]
virtio/vsock: allocate multiple skbuffs on tx
This adds a small optimization to the tx path: instead of allocating a single
skbuff on every call to the transport, allocate multiple skbuffs for as long
as credit space allows, thus trying to send as much data as possible without
returning to af_vsock.c.
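The idea can be sketched as a loop that keeps carving buffers out of the payload while credit remains. This is a simplified model, not the driver code: split_for_tx() and its parameters are hypothetical, and real credit accounting involves locking and peer state that are left out here.

```python
def split_for_tx(total_len, credit, max_skb_payload):
    """Return the payload sizes of the buffers sent in one transport call."""
    chunks = []
    remaining = total_len
    while remaining > 0 and credit > 0:
        # Each buffer is limited by the data left, the credit left and the
        # maximum payload of a single buffer.
        chunk = min(remaining, credit, max_skb_payload)
        chunks.append(chunk)
        remaining -= chunk
        credit -= chunk
    return chunks


# 150000 bytes to send, 100000 bytes of credit, 64 KiB per buffer.
print(split_for_tx(150000, 100000, 65536))
```

With no credit the loop body never runs, which matches the case where the caller has to wait before any data can be sent.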
Thomas Gleixner [Thu, 23 Mar 2023 20:55:31 +0000 (21:55 +0100)]
atomics: Provide rcuref - scalable reference counting
atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with an atomic_try_cmpxchg() loop. High contention of the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.
Provide rcuref as a scalable alternative solution which is suitable for RCU
managed objects. Similar to refcount_t it comes with overflow and underflow
detection and mitigation.
rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:
0x00000000 - 0x7FFFFFFF    valid zone (1 .. (INT_MAX + 1) references)
0x80000000 - 0xBFFFFFFF    saturation zone
0xC0000000 - 0xFFFFFFFE    dead zone
0xFFFFFFFF                 no reference
rcuref_get() unconditionally increments the reference count with
atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_add_negative_release().
This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 0 to -1.
When this transition is detected, an attempt is made to mark the reference
count dead by setting it to the midpoint of the dead zone with a single
atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from -1 to 0 again.
If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
back the reference count to the midpoint of the respective zone. The zones
provide enough tolerance which makes it practically impossible to escape
from a zone.
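The zone scheme can be modelled in a few lines. The sketch below is single-threaded and only illustrates the arithmetic: RcuRefModel is a hypothetical class, the zone boundaries come from the table above, and the midpoint constants 0xA0000000 and 0xE0000000 are assumptions derived from those boundaries. The real put() uses a cmpxchg that can lose against a concurrent get(); here it always succeeds.

```python
MASK = 0xFFFFFFFF
MAXREF = 0x7FFFFFFF      # top of the valid zone
SATURATED = 0xA0000000   # assumed midpoint of the saturation zone
RELEASED = 0xC0000000    # start of the dead zone
DEAD = 0xE0000000        # assumed midpoint of the dead zone
NOREF = 0xFFFFFFFF       # no reference held


def is_negative(value):
    """True if the 32-bit value is negative when read as a signed int."""
    return value > MAXREF


class RcuRefModel:
    def __init__(self, refs=1):
        # The stored value is "references - 1", so one reference is 0.
        self.value = (refs - 1) & MASK

    def get(self):
        self.value = (self.value + 1) & MASK    # unconditional increment
        if not is_negative(self.value):
            return True
        if self.value >= RELEASED:              # object already dead
            self.value = DEAD
            return False
        self.value = SATURATED                  # saturated: stay pinned
        return True

    def put(self):
        self.value = (self.value - 1) & MASK    # unconditional decrement
        if not is_negative(self.value):
            return False                        # other references remain
        if self.value == NOREF:                 # the 0 -> -1 transition
            self.value = DEAD
            return True                         # caller may free the object
        # Underflow re-centres in the dead zone, saturation stays saturated.
        self.value = DEAD if self.value >= RELEASED else SATURATED
        return False


ref = RcuRefModel()
print(ref.get(), ref.put(), ref.put(), ref.get())
```

Only the put() that returns True dropped the last reference; a later get() on the dead object fails and re-centres the value in the dead zone.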
The racy implementation of rcuref_put() requires rcuref_put() to be protected
against a grace period ending in order to prevent a subtle use after
free. As RCU is the only mechanism which can protect against that, it
is not possible to fully replace the atomic_inc_not_zero() based
implementation of refcount_t with this scheme.
The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is on the high frequency get()/put() pairs and their
scalability.
The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.
The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:
- Micro benchmark: All CPUs running an increment/decrement loop on an
elevated reference count, which means the 0 to -1 transition never
happens.
The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.3X to 4.7X
- Conversion of dst_entry::__refcnt to rcuref and testing with the
localhost memtier/memcached benchmark. That benchmark shows the
reference count contention prominently.
The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.1X to 2.6X over the
previous fix for the false sharing issue vs. struct
dst_entry::__refcnt.
When memtier is run over a real 1Gb network connection, there is a
small gain on top of the false sharing fix. The two changes combined
result in a 2%-5% total gain for that networked test.
The first 2 patches by Geert Uytterhoeven add transceiver support and
improve the error messages in the rcar_canfd driver.
Cai Huoqing contributes 3 patches which remove a redundant call to
pci_clear_master() in the c_can, ctucanfd and kvaser_pciefd driver.
Frank Jungclaus's patch replaces the struct esd_usb_msg with a union
in the esd_usb driver to improve readability.
Markus Schneider-Pargmann contributes 5 patches to improve the
performance in the m_can driver, especially for SPI attached
controllers like the tcan4x5x.
* tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: m_can: Keep interrupts enabled during peripheral read
can: m_can: Disable unused interrupts
can: m_can: Remove double interrupt enable
can: m_can: Always acknowledge all interrupts
can: m_can: Remove repeated check for is_peripheral
can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union
can: kvaser_pciefd: Remove redundant pci_clear_master
can: ctucanfd: Remove redundant pci_clear_master
can: c_can: Remove redundant pci_clear_master
can: rcar_canfd: Improve error messages
can: rcar_canfd: Add transceiver support
====================
====================
Add tx push buf len param to ethtool
This patchset adds a new sub-configuration to ethtool get/set queue
params (ethtool -g) called 'tx-push-buf-len'.
This configuration specifies the maximum number of bytes of a
transmitted packet a driver can push directly to the underlying
device ('push' mode). Pushing some of the bytes to
the device has the advantages of
- Allowing a smart device to take fast actions based on the packet's
header
- Reducing latency for small packets that can be copied completely into
the device
This new param is practically similar to tx-copybreak value that can be
set using ethtool's tunable but conceptually serves a different purpose.
While tx-copybreak is used to reduce the overhead of DMA mapping and
makes no sense to use if less than the whole segment gets copied,
tx-push-buf-len allows improving performance by analyzing the packet's
data (usually headers) before performing the DMA operation.
The configuration can be queried and set using the commands:
$ ethtool -g [interface]
# ethtool -G [interface] tx-push-buf-len [number of bytes]
This patchset also adds support for the new configuration in the ENA driver,
for which this parameter ensures efficient resource management on the
device side.
====================
Shay Agroskin [Thu, 23 Mar 2023 16:36:08 +0000 (18:36 +0200)]
net: ena: Recalculate TX state variables every device reset
With the ability to modify LLQ entry size, the size of packet's
payload that can be written directly to the device changes.
This patch makes the driver recalculate this information on every device
negotiation (also called device reset).
David Arinzon [Thu, 23 Mar 2023 16:36:07 +0000 (18:36 +0200)]
net: ena: Add an option to configure large LLQ headers
Allow configuring the device with large LLQ headers. The Low Latency
Queue (LLQ) allows the driver to write the first N bytes of the packet,
along with the rest of the TX descriptors directly into device (N can be
either 96 or 224 for large LLQ headers configuration).
Having L4 TCP/UDP headers contained in the first 96 bytes of the packet
is required to get maximum performance from the device.
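A rough way to see when the larger header size matters is to add up the protocol headers of a packet. The sketch below is illustrative only: headers_fit() is a hypothetical helper, the two LLQ sizes are the ones named above, and the header lengths are the standard sizes for Ethernet, IPv4, IPv6 and TCP.

```python
LLQ_DEFAULT_HEADER_LEN = 96    # bytes, default LLQ header size from the text
LLQ_LARGE_HEADER_LEN = 224     # bytes, large LLQ header size from the text


def headers_fit(header_lens, llq_header_len):
    """True if all protocol headers fit into the LLQ header area."""
    return sum(header_lens) <= llq_header_len


# Ethernet (14) + IPv4 without options (20) + TCP without options (20).
plain_tcp = (14, 20, 20)

# Ethernet (14) + IPv6 (40) + 40 bytes of extension headers + TCP with the
# maximum amount of options (60).
heavy_tcp = (14, 40, 40, 60)

print(headers_fit(plain_tcp, LLQ_DEFAULT_HEADER_LEN),
      headers_fit(heavy_tcp, LLQ_DEFAULT_HEADER_LEN),
      headers_fit(heavy_tcp, LLQ_LARGE_HEADER_LEN))
```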
Shay Agroskin [Thu, 23 Mar 2023 16:36:06 +0000 (18:36 +0200)]
net: ena: Make few cosmetic preparations to support large LLQ
Move ena_calc_io_queue_size() implementation closer to the file's
beginning so that it can be later called from ena_device_init()
function without adding a function declaration.
Also add an empty line at some spots to separate logical blocks in
functions.
Shay Agroskin [Thu, 23 Mar 2023 16:36:05 +0000 (18:36 +0200)]
ethtool: Add support for configuring tx_push_buf_len
This attribute, which is part of ethtool's ring param configuration,
allows the user to specify the maximum number of bytes of the packet's
payload that can be written directly to the device.
Example usage:
# ethtool -G [interface] tx-push-buf-len [number of bytes]
clang with W=1 reports
drivers/net/ethernet/qlogic/qed/qed_ll2.c:649:6: error: variable
'num_ooo_add_to_peninsula' set but not used [-Werror,-Wunused-but-set-variable]
u32 num_ooo_add_to_peninsula = 0, cid;
^
This variable is not used so remove it.
Eric Dumazet [Thu, 23 Mar 2023 16:28:42 +0000 (09:28 -0700)]
net: introduce a config option to tweak MAX_SKB_FRAGS
Currently, MAX_SKB_FRAGS value is 17.
For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
attempts order-3 allocations, stuffing 32768 bytes per frag.
But with zero copy, we use order-0 pages.
For BIG TCP to show its full potential, we add a config option
to be able to fit up to 45 segments per skb.
This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.
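The numbers behind this can be worked out directly. The sketch assumes a 4096-byte page size, which is an assumption and not stated above; max_frag_payload() is a hypothetical helper that multiplies the fragment count by the size of one fragment.

```python
PAGE_SIZE = 4096   # assumed page size in bytes


def max_frag_payload(max_frags, order):
    """Bytes that fit into max_frags fragments of 2**order pages each."""
    return max_frags * (PAGE_SIZE << order)


legacy_order3 = max_frag_payload(17, 3)   # regular sendmsg(), 32768 bytes per frag
legacy_order0 = max_frag_payload(17, 0)   # zero copy, one page per frag
bigtcp_order0 = max_frag_payload(45, 0)   # zero copy with MAX_SKB_FRAGS=45

print(legacy_order3, legacy_order0, bigtcp_order0)
```

With order-0 pages the legacy limit of 17 fragments caps a single skb at roughly 68 KB, while 45 fragments raise that to 180 KB.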
We have used MAX_SKB_FRAGS=45 value for years at Google before
we deployed 4K MTU, with no adverse effect, other than
a recent issue in mlx4, fixed in commit 26782aad00cc
("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
Back then, the goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.
Note that /proc/sys/net/core/max_skb_frags can also be used to limit
the number of fragments TCP can use in tx packets.
By default we keep the old/legacy value of 17 until we get
more coverage for the updated values.
This inflation might cause problems for drivers assuming they could pack
both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
using build_skb().
v3: fix build error when CONFIG_NET=n
v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
Jakub Kicinski [Fri, 24 Mar 2023 19:03:56 +0000 (12:03 -0700)]
tools: ynl: default to treating enums as flags for mask generation
I was a bit too optimistic in commit bf51d27704c9 ("tools: ynl: fix
get_mask utility routine"), not every mask we use is necessarily
coming from an enum of type "flags". We also allow flipping an
enum into flags on per-attribute basis. That's done by
the 'enum-as-flags' property of an attribute.
Restore this functionality, it's not currently used by any in-tree
family.
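The difference between the two interpretations can be shown with a small sketch. It is loosely modelled on the behaviour described above and is not the ynl code: user_value() and get_mask() here are hypothetical standalone functions, and enum values are assumed to be consecutive integers that act as bit positions when treated as flags.

```python
def user_value(value, enum_type, as_flags=False):
    """Return the value as seen by the user: a bit for flags, else as-is."""
    if enum_type == "flags" or as_flags:
        return 1 << value
    return value


def get_mask(values, enum_type, as_flags=False):
    """OR together the user values of all entries."""
    mask = 0
    for value in values:
        mask |= user_value(value, enum_type, as_flags)
    return mask


entries = [0, 1, 2]
print(bin(get_mask(entries, "enum", as_flags=True)))
```

Without as_flags the same entries would be OR-ed as plain integers, which does not produce a usable bit mask for an enum that an attribute treats as flags.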
Jakub Kicinski [Fri, 24 Mar 2023 18:17:57 +0000 (11:17 -0700)]
selftests: tls: add a test for queuing data before setting the ULP
Other tests set up the connection fully on both ends before
communicating any data. Add a test which will queue up TLS
records to TCP before the TLS ULP is installed.
Well, I've had these patches kicking around in my tree since last
October, so I guess I had better get around to posting them. This
series is mainly a cleanup/consolidation of the probe process, with
some interrupt changes as well. Some of these changes are SBUS- (AKA
SPARC-) specific, so this should really get some testing there as well
to ensure nothing breaks. I've CC'd a few SPARC mailing lists in hopes
that someone there can try this out. I also have an SBUS card I
ordered by mistake if anyone has a SPARC computer but lacks this card.
Changes in v4:
- Tweak variable order for yuletide
- Move uninitialized return to its own commit
- Use correct SBUS/PCI accessors
- Rework hme_version to set the default in pci/sbus_probe and override it (if
necessary) in common_probe
Changes in v3:
- Incorporate a fix from another series into this commit
Changes in v2:
- Move happy_meal_begin_auto_negotiation earlier and remove forward declaration
- Make some more includes common
- Clean up mac address init
- Inline error returns
====================
Sean Anderson [Fri, 24 Mar 2023 17:51:33 +0000 (13:51 -0400)]
net: sunhme: Consolidate mac address initialization
The mac address initialization is broadly the same between PCI and SBUS,
and one was clearly copied from the other. Consolidate them. We still have
to have some ifdefs because pci_(un)map_rom is only implemented for PCI,
and idprom is only implemented for SPARC.
Sean Anderson [Fri, 24 Mar 2023 17:51:30 +0000 (13:51 -0400)]
net: sunhme: Unify IRQ requesting
Instead of registering one interrupt handler for all four SBUS Quattro
HMEs, let each HME register its own handler. To make this work, we don't
handle the IRQ if none of the status bits are set. This reduces the
complexity of the driver, and makes it easier to ensure things happen
before/after enabling IRQs.
I'm not really sure why we request IRQs in two different places (and leave
them running after removing the driver!). A lot of things in this driver
seem to just be crusty, and not necessarily intentional. I'm assuming
that's the case here as well.
This really needs to be tested by someone with an SBUS Quattro card.
Sean Anderson [Fri, 24 Mar 2023 17:51:29 +0000 (13:51 -0400)]
net: sunhme: Remove residual polling code
The sunhme driver never used the hardware MII polling feature. Even the
if-def'd out happy_meal_poll_start was removed by 2002 [1]. Remove the
various places in the driver which needlessly guard against MII interrupts
which will never be enabled.
Sean Anderson [Fri, 24 Mar 2023 17:51:27 +0000 (13:51 -0400)]
net: sunhme: Fix uninitialized return code
Fix an uninitialized return code if we never found a qfe slot. It would be
a bug if we ever got into this situation, but it's good to return something
traceable.
Fixes: acb3f35f920b ("sunhme: forward the error code from pci_enable_device()")
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Sean Anderson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
v3 -> v4:
- addressed review comments on v3
https://lore.kernel.org/all/20230214051422[email protected]/
- 0004-xxx.patch v3 is split into 0004-xxx.patch and 0005-xxx.patch
in v4.
- API changes to accept function ID are moved to 0005-xxx.patch.
- fixed rct violations.
- reverted newly added changes that do not yet have use cases.
v2 -> v3:
- removed SRIOV VF support changes from v2, as new drivers which use
ndo_get_vf_xxx() and ndo_set_vf_xxx() are not accepted.
https://lore.kernel.org/all/20221207200204.6819575a@kernel.org/
Will implement VF representors and submit again.
- 0007-xxx.patch and 0008-xxx.patch from v2 are removed and
0009-xxx.patch in v2 is now 0007-xxx.patch in v3.
- accordingly, changed title for cover letter.
v1 -> v2:
- remove separate workqueue task to wait for firmware ready.
instead defer probe when firmware is not ready.
Reported-by: Leon Romanovsky <[email protected]>
- This change has resulted in update of 0001-xxx.patch and
all other patches in the patchset.
====================
Monitor periodic heartbeat messages from device firmware.
Presence of heartbeat indicates the device is active and running.
If the heartbeat is missed for the configured interval, the firmware
has crashed and the device is unusable; in this case, the PF driver
stops and uninitializes the device.
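The monitoring logic amounts to a watchdog on the time of the last heartbeat. A minimal sketch, assuming a single timeout value; HeartbeatMonitor and its methods are hypothetical names and do not reflect the driver's actual structures or timers:

```python
class HeartbeatMonitor:
    """Track the last heartbeat and report when the firmware looks dead."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_seen = now

    def on_heartbeat(self, now):
        """Record a heartbeat message received at time `now`."""
        self.last_seen = now

    def firmware_alive(self, now):
        """False once no heartbeat arrived within the configured timeout."""
        return (now - self.last_seen) <= self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.on_heartbeat(now=1.0)
print(monitor.firmware_alive(now=3.5), monitor.firmware_alive(now=4.5))
```

When firmware_alive() turns False, the caller would take the recovery action described above, that is, stop and uninitialize the device.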
octeon_ep: add separate mailbox command and response queues
Enhance control mailbox protocol to support the following:
- separate command and response queues
* command queue to send control commands to firmware.
* response queue to receive responses and notifications from
firmware.
- variable size messages using scatter/gather
Poll for control messages until interrupts are enabled.
All the interrupts are enabled in ndo_open().
Add ability to listen for notifications from firmware before ndo_open().
Once interrupts are enabled, this polling is disabled and all the
messages are processed by bottom half of interrupt handler.
David S. Miller [Mon, 27 Mar 2023 07:29:54 +0000 (08:29 +0100)]
Merge branch 'bcm53134-support'
Álvaro Fernández Rojas says:
====================
net: dsa: b53: mdio: add support for BCM53134
This is based on the initial work from Paul Geurts that was sent to the
incorrect linux development lists and recipients.
I've modified it by removing BCM53134_DEVICE_ID from is531x5() and therefore
adding is53134() where needed.
I also added a separate RGMII handling block for is53134() since according to
Paul, BCM53134 doesn't support RGMII_CTRL_TIMING_SEL as opposed to is531x5().
====================
Michal Michalik [Thu, 23 Mar 2023 19:08:02 +0000 (20:08 +0100)]
tools: ynl: add the Python requirements.txt file
It is good practice to state explicitly which Python packages a
particular project needs in order to run. The most commonly used way is
to store them in the `requirements.txt` file*.
Currently the user has to figure out on their own that Python needs the
`PyYAML` and `jsonschema` packages (and their requirements) to use the tool.
Add the `requirements.txt` for user convenience.
How to use it:
1) (optional) Create and activate empty virtual environment:
python3.X -m venv venv3X
source ./venv3X/bin/activate
2) Install all the required packages:
pip install -r requirements.txt
or
python -m pip install -r requirements.txt
3) Run the script!
The `requirements.txt` file was tested for:
* Python 3.6
* Python 3.8
* Python 3.10
Eli Cohen [Tue, 3 Jan 2023 07:37:23 +0000 (09:37 +0200)]
net/mlx5: Use one completion vector if eth is disabled
If eth is disabled by devlink, use only a single completion vector to
provide minimal performance for all users of completion vectors. This
also affects Infiniband performance.
The rest of the vectors can be used by other consumers on a first come
first served basis.
mlx5_vdpa will make use of this to allocate dedicated vectors for its
own use.
Eli Cohen [Tue, 3 Jan 2023 06:48:14 +0000 (08:48 +0200)]
net/mlx5: Move devlink registration before mlx5_load
In order to allow reference to devlink parameters during driver load,
move the devlink registration before mlx5_load. A subsequent patch will
use it to control the number of completion vectors required based on
whether eth is enabled or not.
Eli Cohen [Sun, 1 Jan 2023 06:16:23 +0000 (08:16 +0200)]
net/mlx5: Use dynamic msix vectors allocation
The current implementation calculates the number and the partitioning of
available interrupt vectors and then allocates all the interrupt
vectors.
Here, whenever dynamic msix allocation is supported, we change this to
use msix vectors dynamically so a vector is actually allocated only
when needed. The current pool logic is kept in place to take care of
partitioning the vectors between the consumers and of reference
counting.
Subsequent patches will make use of this to allocate vectors for VDPA.
Break the request and release functions into PCI-device and
sub-function-device handling for better readability, eventually making
the code symmetric in terms of request/release.
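The allocate-on-first-use behaviour with reference counting can be
sketched in userspace C as below. This is a simplified model, not the
mlx5 implementation: vec_pool, vec_request() and vec_release() are
hypothetical names, and the "allocation" is only a flag standing in for
the real MSI-X vector allocation.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_VECS 4	/* hypothetical pool size */

struct vec_slot {
	bool allocated;		/* stands in for a real MSI-X allocation */
	unsigned int refcount;	/* number of current users of this vector */
};

struct vec_pool {
	struct vec_slot slot[POOL_VECS];
	unsigned int live;	/* vectors currently allocated */
};

/* Take a reference on vector idx, allocating it on first use. */
static int vec_request(struct vec_pool *p, int idx)
{
	struct vec_slot *s;

	if (idx < 0 || idx >= POOL_VECS)
		return -1;
	s = &p->slot[idx];
	if (!s->allocated) {
		s->allocated = true;
		p->live++;
	}
	s->refcount++;
	return idx;
}

/* Drop a reference; the vector is freed when the last user is gone. */
static void vec_release(struct vec_pool *p, int idx)
{
	struct vec_slot *s;

	if (idx < 0 || idx >= POOL_VECS)
		return;
	s = &p->slot[idx];
	if (s->refcount == 0)
		return;
	if (--s->refcount == 0) {
		s->allocated = false;
		p->live--;
	}
}
```

Requesting the same index twice leaves a single live vector with a
reference count of two; only the second release frees it, which keeps
request and release symmetric.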
Eli Cohen [Sun, 1 Jan 2023 07:22:34 +0000 (09:22 +0200)]
net/mlx5: Improve naming of pci function vectors
The variable pf_vec is used to denote the number of vectors required for
the pci function's own use. To avoid confusion interpreting pf as
physical function, change the name to pcif_vec.
Same reasoning goes for pf_pool which is really pci function pool.
Eli Cohen [Thu, 29 Dec 2022 09:02:19 +0000 (11:02 +0200)]
net/mlx5: Use newer affinity descriptor
Use the more refined struct irq_affinity_desc to describe the required
IRQ affinity. For the async IRQs, request unmanaged affinity, and for
completion queues, use managed affinity.
No functional changes are introduced. It will be used in a subsequent
patch when we use dynamic MSIX allocation.
Eli Cohen [Wed, 28 Dec 2022 13:51:29 +0000 (15:51 +0200)]
net/mlx5: Fix wrong comment
A control irq may be allocated from the parent device's pool in case
there is no SF dedicated pool. This could happen when there are not
enough vectors available for SFs.
Eli Cohen [Tue, 14 Feb 2023 09:05:46 +0000 (11:05 +0200)]
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
Add a function to complement irq_cpu_rmap_add(). It removes the irq from
the reverse mapping by setting the notifier to NULL. The function calls
irq_set_affinity_notifier() with NULL as the notify argument, which then
cancels any pending notifier work and decrements the reference on the
notifier. When the ref count reaches zero, the glue pointer is freed with
kfree() and the rmap entry is set to NULL, serving both to avoid a second
attempt to release it and to make the rmap entry available for subsequent
mapping.
It should be noted that drivers usually create the reverse mapping at
initialization time and remove it at unload time, so we do not expect
failures in allocating rmap due to a kref holding the glue entry.
Eli Cohen [Tue, 14 Feb 2023 07:29:46 +0000 (09:29 +0200)]
lib: cpu_rmap: Use allocator for rmap entries
Use a proper allocator for rmap entries using a naive for loop. The
allocator relies on whether an entry is NULL to be considered free.
Remove the used field of rmap which is not needed.
Also, avoid crashing the kernel if an entry is not available.
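A naive allocator of this kind can be sketched in userspace C as follows.
It is an illustration only: rmap_sketch, rmap_add() and rmap_remove() are
hypothetical names, and the fixed RMAP_SIZE is an assumption made for the
example, not the layout of struct cpu_rmap.

```c
#include <assert.h>
#include <stddef.h>

#define RMAP_SIZE 4	/* hypothetical number of entries */

struct rmap_sketch {
	void *obj[RMAP_SIZE];	/* a NULL entry is considered free */
};

/*
 * Store obj in the first free entry and return its index.
 * Returns -1 instead of crashing when no entry is available.
 */
static int rmap_add(struct rmap_sketch *r, void *obj)
{
	int i;

	if (obj == NULL)
		return -1;	/* NULL would be indistinguishable from "free" */
	for (i = 0; i < RMAP_SIZE; i++) {
		if (r->obj[i] == NULL) {
			r->obj[i] = obj;
			return i;
		}
	}
	return -1;
}

/* Mark an entry free again so a later rmap_add() can reuse it. */
static void rmap_remove(struct rmap_sketch *r, int idx)
{
	if (idx >= 0 && idx < RMAP_SIZE)
		r->obj[idx] = NULL;
}
```

Because "free" is encoded as NULL, no separate used counter is needed,
and an entry released in the middle of the array is reused by the next
add.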
Eli Cohen [Wed, 8 Feb 2023 05:51:02 +0000 (07:51 +0200)]
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
When calling irq_set_affinity_notifier() with NULL at the notify
argument, it will cause freeing of the glue pointer in the
corresponding array entry but will leave the pointer in the array. A
subsequent call to free_irq_cpu_rmap() will try to free this entry again
leading to possible use after free.
Fix that by setting the array entry to NULL and checking that the
array entry is non-NULL when iterating over the array in
free_irq_cpu_rmap().
The current code does not suffer from this since there are no cases
where irq_set_affinity_notifier(irq, NULL) (note the NULL passed for the
notify arg) is called, followed by a call to free_irq_cpu_rmap() so we
don't hit an issue. Subsequent patches in this series exercise this
flow, hence the required fix.
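The pattern behind this fix (clear the slot when an entry is released,
and skip empty slots during the final teardown) can be shown with a small
userspace C sketch. The names glue_table, table_release() and
table_free() are hypothetical, and malloc()/free() stand in for the
kernel allocator; this is not the cpu_rmap code itself.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_ENTRIES 3	/* hypothetical table size */

struct glue_table {
	int *entry[NR_ENTRIES];	/* NULL once an entry has been released */
	int frees;		/* counts free() calls, for illustration */
};

/* Release one entry early and clear its slot so it is not freed twice. */
static void table_release(struct glue_table *t, int i)
{
	if (t->entry[i] == NULL)
		return;
	free(t->entry[i]);
	t->entry[i] = NULL;
	t->frees++;
}

/* Final teardown: only free entries that are still present. */
static void table_free(struct glue_table *t)
{
	int i;

	for (i = 0; i < NR_ENTRIES; i++) {
		if (t->entry[i] == NULL)
			continue;	/* already released earlier */
		free(t->entry[i]);
		t->entry[i] = NULL;
		t->frees++;
	}
}

/* Allocate every entry; returns 0 on success, -1 on allocation failure. */
static int table_fill(struct glue_table *t)
{
	int i;

	t->frees = 0;
	for (i = 0; i < NR_ENTRIES; i++)
		t->entry[i] = NULL;
	for (i = 0; i < NR_ENTRIES; i++) {
		t->entry[i] = malloc(sizeof(*t->entry[i]));
		if (t->entry[i] == NULL) {
			table_free(t);
			return -1;
		}
	}
	return 0;
}
```

Without the NULL assignment in table_release(), the later table_free()
would free the same pointer a second time, which is the flow the commit
guards against.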
Third version of part 2. Functionally, I had to move from spin_lock to
spin_lock_irqsave because of an interrupt that was calling start_xmit;
see the attached stack. This is tested on tcan455x, but I don't have the
integrated hardware myself, so any testing is appreciated.
The series implements many small and bigger throughput improvements.
Changes in v3:
- Remove parentheses in error messages
- Use memcpy_and_pad for buffer copy in 'can: m_can: Write transmit
header and data in one transaction'.
- Replace spin_lock with spin_lock_irqsave. I got a report of an
interrupt that was calling start_xmit just after the netqueue was
woken up, before the locked region was exited. spin_lock_irqsave should
fix this. I attached the full stack at the end of the mail in case
someone wants to see it.
- Rebased to v6.3-rc1.
- Removed tcan4x5x patches from this series.
Changes in v2: https://lore.kernel.org/all/20230125195059[email protected]
- Rebased on v6.2-rc5
- Fixed missing/broken accounting for non peripheral m_can devices.
can: m_can: Keep interrupts enabled during peripheral read
Interrupts currently get disabled if the interrupt status shows new
received data. Non-peripheral chips handle receiving in a worker thread,
but peripheral chips handle the receive process in the threaded
interrupt routine itself without scheduling it for a different worker.
So there is no need to disable interrupts for peripheral chips.
Frank Jungclaus [Wed, 22 Feb 2023 16:37:54 +0000 (17:37 +0100)]
can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union
As suggested by Vincent Mailhol, declare struct esd_usb_msg as a union
instead of a struct. Then replace all msg->msg.something constructs,
that make use of esd_usb_msg, with simpler and prettier looking
msg->something variants.
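The shape of this refactoring can be sketched in plain C as follows. The
types below are hypothetical stand-ins with made-up fields, not the real
esd_usb message definitions; they only show how dropping the wrapping
struct shortens each member access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical message layouts sharing a common two-byte header. */
struct hdr_sketch {
	uint8_t cmd;
	uint8_t len;
};

struct tx_sketch {
	uint8_t cmd;
	uint8_t len;
	uint8_t net;
	uint8_t dlc;
};

/* Before: a struct whose only member is a union named "msg". */
struct wrapped_msg {
	union {
		struct hdr_sketch hdr;
		struct tx_sketch tx;
	} msg;
};

/* After: the union is the message type itself. */
union flat_msg {
	struct hdr_sketch hdr;
	struct tx_sketch tx;
};

/* Access through the wrapper needs the extra ".msg" hop. */
static uint8_t wrapped_net(const struct wrapped_msg *msg)
{
	return msg->msg.tx.net;
}

/* Access through the union is one level shorter. */
static uint8_t flat_net(const union flat_msg *msg)
{
	return msg->tx.net;
}
```

Both accessors read the same field; the only difference is the
msg->msg.tx.net versus msg->tx.net spelling at every call site.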
Remove pci_clear_master to simplify the code; bus mastering is also
cleared in do_pci_disable_device, as shown here:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;
Cai Huoqing [Thu, 23 Mar 2023 11:33:16 +0000 (19:33 +0800)]
can: ctucanfd: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code; bus mastering is also
cleared in do_pci_disable_device, as shown here:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;
Cai Huoqing [Thu, 23 Mar 2023 11:33:15 +0000 (19:33 +0800)]
can: c_can: Remove redundant pci_clear_master
Remove pci_clear_master to simplify the code; bus mastering is also
cleared in do_pci_disable_device, as shown here:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;