Git Repo - linux.git/log

net/mlx5: qos: Store rate groups in a qos domain

Groups are currently maintained as a list in their corresponding
eswitch, protected by the esw state_lock.
The upcoming cross-eswitch scheduling feature cannot work with this
approach, as it would require acquiring multiple eswitch locks (in the
correct order) in order to maintain group membership.

This commit moves the rate groups into a new 'qos domain' struct and
adds explicit qos init/cleanup steps to the eswitch init/cleanup.
Upcoming patches will expand the qos domain struct and allow it to be
shared between eswitches. For now, qos domains are private to each esw
so there's only an extra indirection.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Rename rate group 'list' as 'parent_entry'

'list' is not very descriptive, I prefer list membership to clearly
specify which list the entry belongs to. This commit renames the list
entry into the esw groups list as 'parent_entry' to make the code more
readable. This is a no-op change.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Add an explicit 'dev' to vport trace calls

vport qos trace calls used vport->dev implicitly as the device to which
the command was sent (and thus the device logged in traces).
But that will no longer be the case for cross-esw scheduling, where the
commands have to be sent to the group esw device instead.

This commit corrects that.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Store the eswitch in a mlx5_esw_rate_group

The rate groups are about to be moved out of eswitches, so store a
reference to the eswitch they belong to so things can still work
later.

This allows dropping the esw parameter from a couple of functions and
simplifying some of the code. Use this opportunity to make sure that
vport scheduling element commands are always sent to the group eswitch,
because that will be relevant for cross-esw scheduling. For now though,
the eswitches are not different.

There is no functionality change here.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Drop 'esw' param from vport qos functions

The vport has a pointer to its own eswitch in vport->dev->priv.eswitch,
so passing the same eswitch as a parameter to the various functions
manipulating vport qos is superfluous at best and prone to errors at
worst.

More importantly, with the upcoming cross-esw scheduling changes, the
eswitch that should receive the various scheduling element commands is
NOT the same as the vport's eswitch, so the current code's assumptions
will break.

To avoid confusion and bugs, this commit drops the 'esw' parameter from
all vport qos functions and uses the vport's own eswitch pointer
instead.

Signed-off-by: Cosmin Ratiu <[email protected]>
Reviewed-by: Carolina Jubran <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Always create group0

All vports not explicitly members of a group with QoS enabled are part
of the internal esw group0, except when the hw reports that groups
aren't supported (log_esw_max_sched_depth == 0). This creates corner
cases in the code, which has to make sure that this case is supported.
Additionally, the groups are about to be moved out of eswitches, and
group0 being NULL creates additional complications there.

This patch makes sure to always create group0, even if max sched depth
is 0. In that case, a software-only group0 is created referencing the
root TSAR. Vports can point to this group when their QoS is enabled and
they'll be attached to the root TSAR directly. This eliminates corner
cases in the code by offering the guarantee that if qos is enabled,
vport->qos.group is non-NULL.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Maintain rate group vport members in a list

Previously, finding group members was done by iterating over all vports
of an eswitch and comparing their group with the required one, but that
approach will break down when a group can contain vports from multiple
eswitches.

Solve that by maintaining a list of vport members.
Instead of iterating over esw vports, loop over the members list.
Use this opportunity to provide two new functions to allocate and free a
group, so that the number of state transitions is smaller. This will
also be used in a future patch.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Refactor and document bw_share calculation

The previous function (esw_qos_calculate_group_min_rate_divider) had two
completely different modes of execution, depending on the 'group_level'
parameter. Split it into two separate functions:
- esw_qos_calculate_min_rate_divider - computes min across groups.
- esw_qos_calculate_group_min_rate_divider - computes min in a group.

Fold the divider calculation into the corresponding normalize functions
to avoid having the caller compute the corresponding divider.
Also rename the normalize functions to better indicate what level
they're operating on.
Finally, document everything so that this topic can more easily be
understood by future maintainers.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Consistently name vport vars as 'vport'

The current mixture of 'vport' and 'evport' can be improved.
There is no functional change.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Rename vport 'tsar' into 'sched_elem'.

Vports do not use TSARs (Transmit Scheduling ARbiters), which are used
for grouping multiple entities together. Use the correct name in
variables and functions for clarity.
Also move the scheduling context to a local variable in the
esw_qos_sched_elem_config function instead of an empty parameter that
needs to be provided by all callers.
There is no functional change here.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/mlx5: qos: Flesh out element_attributes in mlx5_ifc.h

This is used for multiple purposes, depending on the scheduling element
created. There are a few helper struct defined a long time ago, but they
are not easy to find in the file and they are about to get new members.
This commit cleans up this area a bit by:
- moving the helper structs closer to where they are relevant.
- defining a helper union to include all of them to help
discoverability.
- making use of it everywhere element_attributes is used.
- using a consistent 'attr' name.

Signed-off-by: Cosmin Ratiu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'eth-fbnic-add-timestamping-support'

Vadim Fedorenko says:

====================
eth: fbnic: add timestamping support

The series is to add timestamping support for Meta's NIC driver.

Changelog:
v3 -> v4:
- use adjust_by_scaled_ppm() instead of open coding it
- adjust cached value of high bits of timestamp to be sure it
is older then incoming timestamps
v2 -> v3:
- rebase on top of net-next
- add doc to describe retur value of fbnic_ts40_to_ns()
v1 -> v2:
- adjust comment about using u64 stats locking primitive
- fix typo in the first patch
- Cc Richard
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

eth: fbnic: add ethtool timestamping statistics

Add counters of packets with HW timestamps requests and lost timestamps
with no associated skbs. Use ethtool interface to report these counters.

Signed-off-by: Vadim Fedorenko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

eth: fbnic: add TX packets timestamping support

Add TX configuration to ethtool interface. Add processing of TX
timestamp completions as well as configuration to request HW to create
TX timestamp completion.

Signed-off-by: Vadim Fedorenko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

eth: fbnic: add RX packets timestamping support

Add callbacks to support timestamping configuration via ethtool.
Add processing of RX timestamps.

Signed-off-by: Vadim Fedorenko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

eth: fbnic: add initial PHC support

Create PHC device and provide callbacks needed for ptp_clock device.

Signed-off-by: Vadim Fedorenko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

eth: fbnic: add software TX timestamping support

Add software TX timestamping support. RX software timestamping is
implemented in the core and there is no need to provide special flag
in the driver anymore.

Signed-off-by: Vadim Fedorenko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: Remove likely from l3mdev_master_ifindex_by_index

The likely() annotation in l3mdev_master_ifindex_by_index() has been
found to be incorrect 100% of the time in real-world workloads (e.g.,
web servers).

Annotated branches shows the following in these servers:

correct incorrect % Function File Line
0 169053813 100 l3mdev_master_ifindex_by_index l3mdev.h 81

This is happening because l3mdev_master_ifindex_by_index() is called
from __inet_check_established(), which calls
l3mdev_master_ifindex_by_index() passing the socked bounded interface.

l3mdev_master_ifindex_by_index(net, sk->sk_bound_dev_if);

Since most sockets are not going to be bound to a network device,
the likely() is giving the wrong assumption.

Remove the likely() annotation to ensure more accurate branch
prediction.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'ipv4-namespacify-ipv4-address-hash-table'

Kuniyuki Iwashima says:

====================
ipv4: Namespacify IPv4 address hash table.

This is a prep of per-net RTNL conversion for RTM_(NEW|DEL|SET)ADDR.

Currently, each IPv4 address is linked to the global hash table, and
this needs to be protected by another global lock or namespacified to
support per-net RTNL.

Adding a global lock will cause deadlock in the rtnetlink path and GC,

  rtnetlink                      check_lifetime
  |- rtnl_net_lock(net)          |- acquire the global lock
  |- acquire the global lock     |- check ifa's netns
  `- put ifa into hash table     `- rtnl_net_lock(net)

so we need to namespacify the hash table.

The IPv6 one is already namespacified, let's follow that.

v2: https://lore.kernel.org/netdev/20241004195958 [email protected]/
v1: https://lore.kernel.org/netdev/20241001024837 [email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Retire global IPv4 hash table inet_addr_lst.

No one uses inet_addr_lst anymore, so let's remove it.

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Namespacify IPv4 address GC.

Each IPv4 address could have a lifetime, which is useful for DHCP,
and GC is periodically executed as check_lifetime_work.

check_lifetime() does the actual GC under RTNL.

  1. Acquire RTNL
  2. Iterate inet_addr_lst
  3. Remove IPv4 address if expired
  4. Release RTNL

Namespacifying the GC is required for per-netns RTNL, but using the
per-netns hash table will shorten the time on the hash bucket iteration
under RTNL.

Let's add per-netns GC work and use the per-netns hash table.

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Use per-netns hash table in inet_lookup_ifaddr_rcu().

Now, all IPv4 addresses are put in the per-netns hash table.

Let's use it in inet_lookup_ifaddr_rcu().

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Link IPv4 address to per-netns hash table.

As a prep for per-netns RTNL conversion, we want to namespacify
the IPv4 address hash table and the GC work.

Let's allocate the per-netns IPv4 address hash table to
net->ipv4.inet_addr_lst and link IPv4 addresses into it.

The actual users will be converted later.

Note that the IPv6 address hash table is already namespacified.

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-10-08 (ice, iavf, igb, e1000e, e1000)

This series contains updates to ice, iavf, igb, e1000e, and e1000
drivers.

For ice:

Wojciech adds support for ethtool reset.

Paul adds support for hardware based VF mailbox limits for E830 devices.

Jake adjusts to a common iterator in ice_vc_cfg_qs_msg() and moves
storing of max_frame and rx_buf_len from VSI struct to the ring
structure.

Hongbo Li uses assign_bit() to replace an open-coded instance.

Markus Elfring adjusts a couple of PTP error paths to use a common,
shared exit point.

Yue Haibing removes unused declarations.

For iavf:

Yue Haibing removes unused declarations.

For igb:

Yue Haibing removes unused declarations.

For e1000e:

Takamitsu Iwai removes unneccessary writel() calls.

Joe Damato adds support for netdev-genl support to query IRQ, NAPI,
and queue information.

For e1000:

Joe Damato adds support for netdev-genl support to query IRQ, NAPI,
and queue information.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  e1000: Link NAPI instances to queues and IRQs
  e1000e: Link NAPI instances to queues and IRQs
  e1000e: Remove duplicated writel() in e1000_configure_tx/rx()
  igb: Cleanup unused declarations
  iavf: Remove unused declarations
  ice: Cleanup unused declarations
  ice: Use common error handling code in two functions
  ice: Make use of assign_bit() API
  ice: store max_frame and rx_buf_len only in ice_rx_ring
  ice: consistently use q_idx in ice_vc_cfg_qs_msg()
  ice: add E830 HW VF mailbox message limit support
  ice: Implement ethtool reset support
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Fix misspelling of "accept*" in net

Several files have "accept*" misspelled as "accpet*" in the comments.
Fix all such occurrences.

Signed-off-by: Alexander Zubkov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net_sched: sch_sfq: handle bigger packets

SFQ has an assumption on dealing with packets smaller than 64KB.

Even before BIG TCP, TCA_STAB can provide arbitrary big values
in qdisc_pkt_len(skb)

It is time to switch (struct sfq_slot)->allot to a 32bit field.

sizeof(struct sfq_slot) is now 64 bytes, giving better cache locality.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: stmmac: Add DW QoS Eth v4/v5 ip payload error statistics

Add DW QoS Eth v4/v5 ip payload error statistics, and rename descriptor
bit macro because v4/v5 descriptor IPCE bit claims ip checksum
error or TCP/UDP/ICMP segment length error.

Here is bit description from DW QoS Eth data book(Part 19.6.2.2)

bit7 IPCE: IP Payload Error
When this bit is programmed, it indicates either of the following:
1).The 16-bit IP payload checksum (that is, the TCP, UDP, or ICMP
   checksum) calculated by the MAC does not match the corresponding
   checksum field in the received segment.
2).The TCP, UDP, or ICMP segment length does not match the payload
   length value in the IP  Header field.
3).The TCP, UDP, or ICMP segment length is less than minimum allowed
   segment length for TCP, UDP, or ICMP.

Signed-off-by: Minda Chen <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Serge Semin <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv6: Remove redundant unlikely()

IS_ERR_OR_NULL() already implies unlikely().

Signed-off-by: Tobias Klauser <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv6: switch inet6_acaddr_hash() to less predictable hash

commit 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable")
added inet6_acaddr_hash(), using ipv6_addr_hash() and net_hash_mix()
to get hash spreading for typical users.

However ipv6_addr_hash() is highly predictable and a malicious user
could abuse a specific hash bucket.

Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv6: switch inet6_addr_hash() to less predictable hash

In commit 3f27fb23219e ("ipv6: addrconf: add per netns perturbation
in inet6_addr_hash()"), I added net_hash_mix() in inet6_addr_hash()
to get better hash dispersion, at a time all netns were sharing the
hash table.

Since then, commit 21a216a8fc63 ("ipv6/addrconf: allocate a per
netns hash table") made the hash table per netns.

We could remove the net_hash_mix() from inet6_addr_hash(), but
there is still an issue with ipv6_addr_hash().

It is highly predictable and a malicious user can easily create
thousands of IPv6 addresses all stored in the same hash bucket.

Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: airoha: Fix EGRESS_RATE_METER_EN_MASK definition

Fix typo in EGRESS_RATE_METER_EN_MASK mask definition. This bus in not
introducing any user visible problem since, even if we are setting
EGRESS_RATE_METER_EN_MASK bit in REG_EGRESS_RATE_METER_CFG register,
egress QoS metering is not supported yet since we are missing some other
hw configurations (e.g token bucket rate, token bucket size).

Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet support
for EN7581 SoC")

Signed-off-by: Lorenzo Bianconi <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: liquidio: Remove unused cn23xx_dump_pf_initialized_regs

cn23xx_dump_pf_initialized_regs() was added in 2016's commit
72c0091293c0 ("liquidio: CN23XX device init and sriov config")

but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'qca_spi-improvements-to-qca7000-sync'

Stefan Wahren says:

====================
qca_spi: Improvements to QCA7000 sync

This series contains patches which improve the QCA7000 sync behavior.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

qca_spi: Improve reset mechanism

The commit 92717c2356cb ("net: qca_spi: Avoid high load if QCA7000 is not
available") fixed the high load in case the QCA7000 is not available
but introduced sync delays for some corner cases like buffer errors.

So add the reset requests to the atomics flags, which are polled by
the SPI thread. As a result reset requests and sync state are now
separated. This has the nice benefit to make the code easier to
understand.

Signed-off-by: Stefan Wahren <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

qca_spi: Count unexpected WRBUF_SPC_AVA after reset

After a reset of the QCA7000, the amount of available write buffer
space should match QCASPI_HW_BUF_LEN. If this is not the case
this error should be counted as such.

Signed-off-by: Stefan Wahren <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tcp: remove unnecessary update for tp->write_seq in tcp_connect()

Commit 783237e8daf13 ("net-tcp: Fast Open client - sending SYN-data")
introduces tcp_connect_queue_skb() and it would overwrite tcp->write_seq,
so it is no need to update tp->write_seq before invoking
tcp_connect_queue_skb().

Signed-off-by: xin.guo <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

doc: net: Fix .rst rendering of net_cachelines pages

The doc pages under /networking/net_cachelines are unreadable because
they lack .rst formatting for the tabular text.

Add simple table markup and tidy up the table contents:

- remove dashes that represent empty cells because they render
as bullets and are not needed
- replace 'struct_*' with 'struct *' in the first column so that
sphinx can render links for any structs that appear in the docs

Signed-off-by: Donald Hunter <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'ipv4-convert-__fib_validate_source-and-its-callers-to-dscp_t'

Guillaume Nault says:

====================
ipv4: Convert __fib_validate_source() and its callers to dscp_t.

This patch series continues to prepare users of ->flowi4_tos to a
future conversion of this field (__u8 to dscp_t). This time, we convert
__fib_validate_source() and its call chain.

The objective is to eventually make all users of ->flowi4_tos use a
dscp_t value. Making ->flowi4_tos a dscp_t field will help avoiding
regressions where ECN bits are erroneously interpreted as DSCP bits.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert __fib_validate_source() to dscp_t.

Pass a dscp_t variable to __fib_validate_source(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.

Only fib_validate_source() actually calls __fib_validate_source().
Since it already has a dscp_t variable to pass as parameter, we only
need to remove the inet_dscp_to_dsfield() conversion.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/8206b0a64a21a208ed94774e261a251c8d7bc251.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert fib_validate_source() to dscp_t.

Pass a dscp_t variable to fib_validate_source(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.

All callers of fib_validate_source() already have a dscp_t variable to
pass as parameter. We just need to remove the inet_dscp_to_dsfield()
conversions.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/08612a4519bc5a3578bb493fbaad82437ebb73dc.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert ip_mc_validate_source() to dscp_t.

Pass a dscp_t variable to ip_mc_validate_source(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.

Callers of ip_mc_validate_source() to consider are:

  * ip_route_input_mc() which already has a dscp_t variable to pass as
    parameter. We just need to remove the inet_dscp_to_dsfield()
    conversion.

  * udp_v4_early_demux() which gets the DSCP directly from the IPv4
    header and can simply use the ip4h_dscp() helper.

Also, stop including net/inet_dscp.h in udp.c as we don't use any of
its declarations anymore.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/c91b2cca04718b7ee6cf5b9c1d5b40507d65a8d4.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert ip_route_input_mc() to dscp_t.

Pass a dscp_t variable to ip_route_input_mc(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.

Only ip_route_input_rcu() actually calls ip_route_input_mc(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/0cc653ef59bbc0a28881f706d34896c61eba9e01.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert __mkroute_input() to dscp_t.

Pass a dscp_t variable to __mkroute_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.

Only ip_mkroute_input() actually calls __mkroute_input(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.

While there, reorganise the function parameters to fill up horizontal
space.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/40853c720aee4d608e6b1b204982164c3b76697d.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert ip_mkroute_input() to dscp_t.

Pass a dscp_t variable to ip_mkroute_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.

Only ip_route_input_slow() actually calls ip_mkroute_input(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.

While there, reorganise the function parameters to fill up horizontal
space.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/6aa71e28f9ff681cbd70847080e1ab6b526f94f1.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

ipv4: Convert ip_route_use_hint() to dscp_t.

Pass a dscp_t variable to ip_route_use_hint(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.

Only ip_rcv_finish_core() actually calls ip_route_use_hint(). Use the
ip4h_dscp() helper to get the DSCP from the IPv4 header.

While there, modify the declaration of ip_route_use_hint() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.

Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/c40994fdf804db7a363d04fdee01bf48dddda676.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>

r8169: add support for the temperature sensor being available from RTL8125B

This adds support for the temperature sensor being available from
RTL8125B. Register information was taken from r8125 vendor driver.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-improve-multicast-group-join-performance'

Jonas Rebmann says:

====================
improve multicast join group performance

This series seeks to improve performance on updating igmp group
memberships such as with IP_ADD_MEMBERSHIP or MCAST_JOIN_SOURCE_GROUP.

Our use case was to add 2000 multicast memberships on a TQMLS1046A which
took about 3.6 seconds for the membership additions alone. Our userspace
reproducer tool was instrumented to log runtimes of the individual
setsockopt invocations which clearly indicated quadratic complexity of
setting up the membership with regard to the total number of multicast
groups to be joined. We used perf to locate the hotspots and
subsequently optimized the most costly sections of code.

This series includes a patch to Linux igmp handling as well as a patch
to the DPAA/Freescale driver. With both patches applied, our memberships can
be set up in only about 87 miliseconds, which corresponds to a speedup
of around 40.

While we have acheived practically linear run-time complexity on the
kernel side, a small quadratic factor remains in parts of the freescale
driver code which we haven't yet optimized. We have by now payed little
attention to the optimization potential in dropping group memberships,
yet the dpaa patch applies to joining and leaving groups alike.

Overall, this patch series brings great improvements in use cases
involving large numbers of multicast groups, particularly when using the
fsl_dpa driver, without noteworthy drawbacks in other scenarios.
====================

Signed-off-by: Jonas Rebmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dpaa: use __dev_mc_sync in dpaa_set_rx_mode()

The original driver first unregisters then re-registers all multicast
addresses in the struct net_device_ops::ndo_set_rx_mode() callback.

As the networking stack calls ndo_set_rx_mode() if a single multicast
address change occurs, a significant amount of time may be used to first
unregister and then re-register unchanged multicast addresses. This
leads to performance issues when tracking large numbers of multicast
addresses.

Replace the unregister and register loop and the hand crafted
mc_addr_list list handling with __dev_mc_sync(), to only update entries
which have changed.

On profiling with an fsl_dpa NIC, this patch presented a speedup of
around 40 when successively setting up 2000 multicast groups using
setsockopt(), without drawbacks on smaller numbers of multicast groups.

Signed-off-by: Jonas Rebmann <[email protected]>
Reviewed-by: Sean Anderson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv4: igmp: optimize ____ip_mc_inc_group() using mc_hash

The runtime cost of joining a single multicast group in the current
implementation of ____ip_mc_inc_group grows linearly with the number of
existing memberships. This is caused by the linear search for an
existing group record in the multicast address list.

This linear complexity results in quadratic complexity when successively
adding memberships, which becomes a performance bottleneck when setting
up large numbers of multicast memberships.

If available, use the existing multicast hash map mc_hash to quickly
search for an existing group membership record. This leads to
near-constant complexity on the addition of a new multicast record,
significantly improving performance for workloads involving many
multicast memberships.

On profiling with a loopback device, this patch presented a speedup of
around 6 when successively setting up 2000 multicast groups using
setsockopt without measurable drawbacks on smaller numbers of
multicast groups.

Signed-off-by: Jonas Rebmann <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ptp: Add support for the AMZNC10C 'vmclock' device

The vmclock device addresses the problem of live migration with
precision clocks. The tolerances of a hardware counter (e.g. TSC) are
typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that
counter against an external source of 'real' time, and track the precise
frequency of the counter as it changes with environmental conditions.

When a guest is live migrated, anything it knows about the frequency of
the underlying counter becomes invalid. It may move from a host where
the counter running at -50PPM of its nominal frequency, to a host where
it runs at +50PPM. There will also be a step change in the value of the
counter, as the correctness of its absolute value at migration is
limited by the accuracy of the source and destination host's time
synchronization.

In its simplest form, the device merely advertises a 'disruption_marker'
which indicates that the guest should throw away any NTP synchronization
it thinks it has, and start again.

Because the shared memory region can be exposed all the way to userspace
through the /dev/vmclock0 node, applications can still use time from a
fast vDSO 'system call', and check the disruption marker to be sure that
their timestamp is indeed truthful.

The structure also allows for the precise time, as known by the host, to
be exposed directly to guests so that they don't have to wait for NTP to
resync from scratch. The PTP driver consumes this information if present.
Like the KVM PTP clock, this PTP driver can convert TSC-based cross
timestamps into KVM clock values. Unlike the KVM PTP clock, it does so
only when such is actually helpful.

The values and fields are based on the nascent virtio-rtc specification,
and the intent is that a version (hopefully precisely this version) of
this structure will be included as an optional part of that spec. In the
meantime, this driver supports the simple ACPI form of the device which
is being shipped in certain commercial hypervisors (and submitted for
inclusion in QEMU).

Signed-off-by: David Woodhouse <[email protected]>
Acked-by: Richard Cochran <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'pcs-xpcs-cleanups-batch-2'

Russell King says:

====================
net: pcs: xpcs: cleanups batch 2

This is the second cleanup series for XPCS.

Patch 1 removes the enum indexing the dw_xpcs_compat array. The index is
never used except to place entries in the array and to size the array.

Patch 2 removes the interface arrays - each of which only contain one
interface.

Patch 3 makes xpcs_find_compat() take the xpcs structure rather than the
ID - the previous series removed the reason for xpcs_find_compat needing
to take the ID.

Patch 4 provides a helper to convert xpcs structure to a regular
phylink_pcs structure, which leads to patch 5.

Patch 5 moves the definition of struct dw_xpcs to the private xpcs
header - with patch 4 in place, nothing outside of the xpcs driver
accesses the contents of the dw_xpcs structure.

Patch 6 renames xpcs_get_id() to xpcs_read_id() since it's reading the
ID, rather than doing anything further with it. (Prior versions of this
series renamed it to xpcs_read_phys_id() since that more accurately
described that it was reading the physical ID registers.)

Patch 7 moves the searching of the ID list out of line as this is a
separate functional block.

Patch 8 converts xpcs to use the bitmap macros, which eliminates the
need for _SHIFT definitions.

Patch 9 adds and uses _modify() accessors as there are a large amount
of read-modify-write operations in this driver. This conversion found
a bug in xpcs-wx code that has been reported and already fixed.

Patch 10 converts xpcs to use read_poll_timeout() rather than open
coding that.

Patch 11 converts all printed messages to use the dev_*() functions so
the driver and devie name are always printed.

Patch 12 moves DW_VR_MII_DIG_CTRL1_2G5_EN to the correct place in the
header file, rather than amongst another register's definitions.

Patch 13 moves the Wangxun workaround to a common location rather than
duplicating it in two places. We also reformat this to fit within
80 columns.

====================

Tested-by: Serge Semin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: move Wangxun VR_XS_PCS_DIG_CTRL1 configuration

According to commits 2a22b7ae2fa3 ("net: pcs: xpcs: adapt Wangxun NICs
for SGMII mode") and 2deea43f386d ("net: pcs: xpcs: add 1000BASE-X AN
interrupt support"), Wangxun devices need special VR_XS_PCS_DIG_CTRL1
settings for SGMII and 1000BASE-X. Both SGMII and 1000BASE-X use the
same settings.

Rather than placing these in the individual xpcs_config_*() functions,
move it to where we already test for the Wangxun devices in
xpcs_do_config().

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: correctly place DW_VR_MII_DIG_CTRL1_2G5_EN

Place DW_VR_MII_DIG_CTRL1_2G5_EN with the other DW_VR_MII_DIG_CTRL1
definitions rather than in the middle of a register list.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: use dev_*() to print messages

Use the dev_*() family of functions to print all messages from the XPCS
driver so we know which instance issues the messages.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: convert to use read_poll_timeout()

Convert the xpcs driver to use read_poll_timeout() when waiting for
reset to complete, rather than open-coding this.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: add _modify() accessors

The xpcs driver does a lot of read-modify-write operations on
registers, which leads to long-winded code to read the register, check
whether the read was successful, modify the value in some way, and then
write it back.

We have a mdiodev _modify() accessor that encapsulates this, and does
the register modification under the MDIO bus lock ensuring that the
modification is atomic with respect to other bus operations. Convert
the xpcs driver to use this accessor.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: use FIELD_PREP() and FIELD_GET()

Convert xpcs to use the bitfield macros rather than definining the
bitfield shifts and open-coding the insertion and extraction of these
bitfields.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: move searching ID list out of line

Move the searching of the physical ID out of xpcs_create() and into
its own xpcs_identify() function, which makes it self contained.
This reduces the complexity in xpcs_craete(), making it easier to
follow, rather than having a lot of once-run code in the big for()
loop.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: rename xpcs_get_id()

Rename xpcs_get_id() to xpcs_read_id() which more closely reflects
the purpose of this function.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: move definition of struct dw_xpcs to private header

There should be no reason for anything outside the XPCS code to know
the contents of struct dw_xpcs - this is a private structure to XPCS.
Move the definition to the private pcs-xpcs.h header, leaving a
declaration in the global pcs/pcs-xpcs.h

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: provide a helper to get the phylink pcs given xpcs

Provide a helper to provide the pointer to the phylink_pcs struct
given a valid xpcs pointer. This will be necessary when we make
struct dw_xpcs private to pcs-xpcs.c

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: pass xpcs instead of xpcs->id to xpcs_find_compat()

xpcs_find_compat() is now always passed xpcs->id. Rather than always
dereferencing this in the caller, move it into xpcs_find_compat(),
thus making this function consistent with most of the other xpcs
functions in taking an xpcs pointer.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: don't use array for interface

Currently, xpcs uses an array of interfaces that each "compat" entry
supports. When looking up the compat entry for an interface, we
iterate over the compat entries and then over each interface.

Since each compat entry only has a single interface in its interfaces
array, replace the array with a single member in the compat structure.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: pcs: xpcs: remove dw_xpcs_compat enum

There is no reason for the struct dw_xpcs_compat arrays to be a fixed
size other than the way we iterate over them. The index into the array
isn't used for anything, and having them fixed size needlessly wastes
space.

Remove the enum that defines their size, and instead use an empty
array entry (with NULL ->supported) to mark the end of the array.

Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: microchip_t1: SQI support for LAN887x

Add support for measuring Signal Quality Index for LAN887x T1 PHY.
Signal Quality Index (SQI) is measure of Link Channel Quality from
0 to 7, with 7 as the best. By default, a link loss event shall
indicate an SQI of 0.

Signed-off-by: Tarun Alle <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-phy-marvell-88q2xxx-enable-auto-negotiation-for-mv88q2110'

Niklas Söderlund says:

====================
net: phy: marvell-88q2xxx: Enable auto negotiation for mv88q2110

This series enables auto negotiation for the mv88q2110 device.
Previously this feature have been disabled for mv88q2110, while enabled
for other devices supported by this driver.

The initial driver implementation states this is due to the
configuration sequence provided by the vendor did not work. By comparing
the initialization sequence of other devices this driver supports and
the out-of-tree PHY driver for mv88q2110 found in the Renesas BSP [1]
I was able to figure out a working configuration.

As I have no access to the datasheets of either of these devices it
would be super if someone who has could sanity check the initialization
sequence.

With this series I'm able to auto negotiate both 1000Mbps and 100Mbps
links without issue.

    # ethtool eth0
    Settings for eth0:
            Supported ports: [  ]
            Supported link modes:   100baseT1/Full
                                    1000baseT1/Full
            Supported pause frame use: Symmetric Receive-only
            Supports auto-negotiation: Yes
            Supported FEC modes: Not reported
            Advertised link modes:  100baseT1/Full
                                    1000baseT1/Full
            Advertised pause frame use: No
            Advertised auto-negotiation: Yes
            Advertised FEC modes: Not reported
            Link partner advertised link modes:  100baseT1/Full
                                                 1000baseT1/Full
            Link partner advertised pause frame use: No
            Link partner advertised auto-negotiation: Yes
            Link partner advertised FEC modes: Not reported
            Speed: 1000Mb/s
            Duplex: Full
            Auto-negotiation: on
            master-slave cfg: preferred master
            master-slave status: slave
            Port: Twisted Pair
            PHYAD: 0
            Transceiver: external
            MDI-X: Unknown
            Link detected: yes
            SQI: 15/15

And the performance is good too. Without this change I was not able to
manually configure a 1000Mbps link, only 100Mbps ones. So this gives a
huge performance boost for my use-case.

    [  5] local 10.1.0.2 port 5201 connected to 10.1.0.1 port 38346
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  96.8 MBytes   812 Mbits/sec    0    469 KBytes
    [  5]   1.00-2.00   sec  94.3 MBytes   791 Mbits/sec    0    469 KBytes
    [  5]   2.00-3.00   sec  96.1 MBytes   806 Mbits/sec    0    469 KBytes
    [  5]   3.00-4.00   sec  98.3 MBytes   825 Mbits/sec    0    469 KBytes
    [  5]   4.00-5.00   sec  98.4 MBytes   825 Mbits/sec    0    469 KBytes
    [  5]   5.00-6.00   sec  98.4 MBytes   826 Mbits/sec    0    469 KBytes
    [  5]   6.00-7.00   sec  98.9 MBytes   830 Mbits/sec    0    469 KBytes
    [  5]   7.00-8.00   sec  91.7 MBytes   769 Mbits/sec    0    469 KBytes
    [  5]   8.00-9.00   sec  99.4 MBytes   834 Mbits/sec    0    747 KBytes
    [  5]   9.00-10.00  sec   101 MBytes   851 Mbits/sec    0    747 KBytes

Patch 1/3 and 2/3 are preparation patches that align and move functions
around as the mv88q2110 code paths can now reuses much of what is done
for mv88q2220. While patch 3/3 adds the new initialization sequence and
removes the auto negotiation limit for mv88q2110.

1.  https://github.com/renesas-rcar/linux-bsp/commit/2a1f07d0e722a18188cfe62842b61f2fbc0ba812
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: marvell-88q2xxx: Enable auto negotiation for mv88q2110

The initial marvell-88q2xxx driver only supported the Marvell 88Q2110
PHY without auto negotiation support. The reason documented states that
the provided initialization sequence did not to work. Now a method to
enable auto negotiation have been found by comparing the initialization
of other supported devices and an out-of-tree PHY driver.

Perform the minimal needed initialization of the PHY to get auto
negotiation working and remove the limitation that disables the auto
negotiation feature for the mv88q2110 device.

With this change a 1000Mbps full duplex link is able to be negotiated
between two mv88q2110 and the link works perfectly. The other side also
reflects the manually configure settings of the master device.

    # ethtool eth0
    Settings for eth0:
            Supported ports: [  ]
            Supported link modes:   100baseT1/Full
                                    1000baseT1/Full
            Supported pause frame use: Symmetric Receive-only
            Supports auto-negotiation: Yes
            Supported FEC modes: Not reported
            Advertised link modes:  100baseT1/Full
                                    1000baseT1/Full
            Advertised pause frame use: No
            Advertised auto-negotiation: Yes
            Advertised FEC modes: Not reported
            Link partner advertised link modes:  100baseT1/Full
                                                 1000baseT1/Full
            Link partner advertised pause frame use: No
            Link partner advertised auto-negotiation: Yes
            Link partner advertised FEC modes: Not reported
            Speed: 1000Mb/s
            Duplex: Full
            Auto-negotiation: on
            master-slave cfg: preferred master
            master-slave status: slave
            Port: Twisted Pair
            PHYAD: 0
            Transceiver: external
            MDI-X: Unknown
            Link detected: yes
            SQI: 15/15

Before this change I was not able to manually configure 1000Mbps link,
only a 100Mpps link so this change providers an improvement in
performance for this device.

    [  5] local 10.1.0.2 port 5201 connected to 10.1.0.1 port 38346
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  96.8 MBytes   812 Mbits/sec    0    469 KBytes
    [  5]   1.00-2.00   sec  94.3 MBytes   791 Mbits/sec    0    469 KBytes
    [  5]   2.00-3.00   sec  96.1 MBytes   806 Mbits/sec    0    469 KBytes
    [  5]   3.00-4.00   sec  98.3 MBytes   825 Mbits/sec    0    469 KBytes
    [  5]   4.00-5.00   sec  98.4 MBytes   825 Mbits/sec    0    469 KBytes
    [  5]   5.00-6.00   sec  98.4 MBytes   826 Mbits/sec    0    469 KBytes
    [  5]   6.00-7.00   sec  98.9 MBytes   830 Mbits/sec    0    469 KBytes
    [  5]   7.00-8.00   sec  91.7 MBytes   769 Mbits/sec    0    469 KBytes
    [  5]   8.00-9.00   sec  99.4 MBytes   834 Mbits/sec    0    747 KBytes
    [  5]   9.00-10.00  sec   101 MBytes   851 Mbits/sec    0    747 KBytes

Signed-off-by: Niklas Söderlund <[email protected]>
Tested-by: Stefan Eichenberger <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: marvell-88q2xxx: Make register writer function generic

In preparation to adding auto negotiation support to mv88q2110 move and
rename the helper function used to write an array of register values to
the PHY.

Just as for mv88q2220 devices this helper will be needed to for the
initial configuration of the mv88q2110 to support auto negotiation.

The function is moved verbatim, there is no change in behavior.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Tested-by: Dimitri Fedrau <[email protected]>
Tested-by: Stefan Eichenberger <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: marvell-88q2xxx: Align soft reset for mv88q2110 and mv88q2220

The soft reset implementations for mv88q2110 and mv88q2220 differ as the
later need to consider that auto negation is supported on mv88q2220
devices. In preparation of enabling auto negotiation on mv88q2110 merge
the two rest functions into a device generic one.

The mv88q2220 behavior is kept as is but extended to wait for the reset
bit to be clears before continuing, as was done previously on mv88q2220.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Tested-by: Dimitri Fedrau <[email protected]>
Tested-by: Stefan Eichenberger <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

fsl/fman: Fix a typo

Fix a typo in comments: bellow -> below.

Reported-by: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Kreimer <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: aquantia: allow forcing order of MDI pairs

Despite supporting Auto MDI-X, it looks like Aquantia only supports
swapping pair (1,2) with pair (3,6) like it used to be for MDI-X on
100MBit/s networks.

When all 4 pairs are in use (for 1000MBit/s or faster) the link does not
come up with pair order is not configured correctly, either using
MDI_CFG pin or using the "PMA Receive Reserved Vendor Provisioning 1"
register.

Normally, the order of MDI pairs being either ABCD or DCBA is configured
by pulling the MDI_CFG pin.

However, some hardware designs require overriding the value configured
by that bootstrap pin. The PHY allows doing that by setting a bit in
"PMA Receive Reserved Vendor Provisioning 1" register which allows
ignoring the state of the MDI_CFG pin and another bit configuring
whether the order of MDI pairs should be normal (ABCD) or reverse
(DCBA). Pair polarity is not affected and remains identical in both
settings.

Introduce property "marvell,mdi-cfg-order" which allows forcing either
normal or reverse order of the MDI pairs from DT.

If the property isn't present, the behavior is unchanged and MDI pair
order configuration is untouched (ie. either the result of MDI_CFG pin
pull-up/pull-down, or pair order override already configured by the
bootloader before Linux is started).

Forcing normal pair order is required on the Adtran SDG-8733A Wi-Fi 7
residential gateway.

Signed-off-by: Daniel Golle <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/9ed760ff87d5fc456f31e407ead548bbb754497d.1728058550.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <[email protected]>

dt-bindings: net: marvell,aquantia: add property to override MDI_CFG

Usually the MDI pair order reversal configuration is defined by
bootstrap pin MDI_CFG. Some designs, however, require overriding the MDI
pair order and force either normal or reverse order.

Add property 'marvell,mdi-cfg-order' to allow forcing either normal or
reverse order of the MDI pairs.

Signed-off-by: Daniel Golle <[email protected]>
Reviewed-by: Rob Herring (Arm) <[email protected]>
Link: https://patch.msgid.link/7ccf25d6d7859f1ce9983c81a2051cfdfb0e0a99.1728058550.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'selftests-mlxsw-stabilize-red-tests'

Petr Machata says:

====================
selftests: mlxsw: Stabilize RED tests

Tweak the mlxsw-specific RED selftests to increase stability on
Spectrum-3 and Spectrum-4 machines.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: mlxsw: sch_red_core: Lower TBF rate

The RED test uses a pair of TBF shapers. The first to get predictably-sized
stream of traffic, and second to get a 100% saturated chokepoint. To this
chokepoint it injects individual packets. Because the chokepoint is
saturated, these additional packets go straight to the backlog. This allows
the test to check RED behavior across various queue sizes.

The shapers are rated at 1Gbps, for historical reasons (before mlxsw
supported TBF offload, the test used port speed to create the chokepoints).
Machines with a low-power CPU may have trouble consistently generating
1Gbps of traffic, and the test then spuriously fails.

Instead, drop the rate to 200Mbps (Spectrum has a guaranteed shaper rate
granularity of 200Mbps, so anything lower is not guaranteed to work well).
Because that means fewer packets will be mirrored in the ECN-mark test,
adjust the passing condition accordingly.

Signed-off-by: Petr Machata <[email protected]>
Link: https://patch.msgid.link/c6712f9c5de75ae0bc2ab3d8ea7d92aaaf93af95.1728316370.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: mlxsw: sch_red_core: Send more packets for drop tests

This test works by injecting into a port with a maxed-out queue a couple
packets and checks if a corresponding number of packets were dropped. This
has worked well on Spectrum<4, but on Spectrum-4 it has been noisy. This
is in line with the observation that on Spectrum-4, queue size tends to
fluctuate more. A handful of packets could then still be accepted to the
queue even though it was nominally full just recently.

In order to accommodate this behavior, send many more packets. The buffer
can fit N extra packets, but not N% packets. This therefore allows us to
set wider absolute margins, while actually narrowing them relatively.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://patch.msgid.link/abc869b9f6003d400d6293ddd5edb2f4517f44d5.1728316370.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: mlxsw: sch_red_core: Sleep before querying queue depth

The qdisc stats are taken from the port's periodic HW stats, which are
updated once a second. We try to accommodate the latency by using busywait
in build_backlog().

The issue in that seems to be that when do_mark_test() builds the backlog,
it makes the decision whether to send more packets based on the first
instance of the queue depth stat exceeding the current value, when in fact
more traffic is on the way and the queue depth would increase further. This
leads to failures in TC 1 of mark-mirror test, where we see the following
failure:

TEST: TC 0: marked packets mirror'd                                 [ OK ]
TEST: TC 1: marked packets mirror'd                                 [FAIL]
        Spurious packets (1680 -> 2290) observed without buffer pressure

Fix by waiting for the full second before reading the queue depth for the
first time, to make sure it reflects all in-flight traffic.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Link: https://patch.msgid.link/321dcf8b3e9a1f0766429c8cf3e3f1746f1bc375.1728316370.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: mlxsw: sch_red_core: Increase backlog size tolerance

Backlog fluctuates on Spectrum-4 much more than on <4. In practice we can
sample queue depth values going from about -12% to about +7% of the
configured RED limit. The test which checks the queue size has a limit of
+-10%, and as a result often fails. We attempted to fix the issue by
busywaiting for several seconds hoping to get within the bounds, but that
still proved to be too noisy (or the wait time would be impractically
long). Unfortunately we have to bump the value tolerance from 10% to 15%,
which in this patch do.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Link: https://patch.msgid.link/f54950df2a8fcba46c3ddc1053376352fa2e592b.1728316370.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: mlxsw: sch_red_ets: Increase required backlog

Backlog fluctuates on Spectrum-4 much more than on <4. Increasing the
desired backlog seems to help, as the constant fluctuations do not overlap
into the territory where packets are marked.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Link: https://patch.msgid.link/0821fb3aa8bb6a6c0d3000baab04995517c9a0cc.1728316370.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: smsc: use devm_clk_get_optional_enabled_with_rate()

Fold the separate call to clk_set_rate() into the clock getter.

Signed-off-by: Bartosz Golaszewski <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

chelsio/chtls: Remove unused chtls_set_tcb_tflag

chtls_set_tcb_tflag() has been unused since 2021's commit
827d329105bf ("chtls: Remove invalid set_tcb call")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

caif: Remove unused cfsrvl_getphyid

cfsrvl_getphyid() has been unused since 2011's commit
f36214408470 ("caif: Use RCU and lists in cfcnfg.c for managing caif link layers")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net-timestamp: namespacify the sysctl_tstamp_allow_data

Let it be tuned in per netns by admins.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: dsa: mv88e6xxx: Add FID map cache

Add a cached FID bitmap. This mitigates the need to walk all VTU entries
to find the next free FID.

When flushing the VTU (during init), zero the FID bitmap. Use and
manipulate this bitmap from now on, instead of reading HW for the FID
map.

The repeated VTU walks are costly and can take ~40 mins if ~4000 vlans
are added. Caching the FID map reduces this time to <2 mins.

Signed-off-by: Aryan Srivastava <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

e1000: Link NAPI instances to queues and IRQs

Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
information.

After this patch is applied, note the IRQ assigned to my NIC:

$ cat /proc/interrupts | grep enp0s8 | cut -f1 --delimiter=':'
18

Note the output from the cli:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'id': 513, 'ifindex': 2, 'irq': 18}]

This device supports only 1 rx and 1 tx queue, so querying that:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'tx'}]

Signed-off-by: Joe Damato <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

e1000e: Link NAPI instances to queues and IRQs

Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
information.

After this patch is applied, note the IRQs assigned to my NIC:

$ cat /proc/interrupts | grep ens | cut -f1 --delimiter=':'
50
51
52

While e1000e allocates 3 IRQs (RX, TX, and other), it looks like e1000e
only has a single NAPI, so I've associated the NAPI with the RX IRQ (50
on my system, seen above).

Note the output from the cli:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'id': 145, 'ifindex': 2, 'irq': 50}]

This device supports only 1 rx and 1 tx queue. so querying that:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'tx'}]

Signed-off-by: Joe Damato <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Vitaly Lifshits <[email protected]>
Tested-by: Avigail Dahan <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

e1000e: Remove duplicated writel() in e1000_configure_tx/rx()

Duplicated register initialization codes exist in e1000_configure_tx()
and e1000_configure_rx().

For example, writel(0, tx_ring->head) writes 0 to tx_ring->head, which
is adapter->hw.hw_addr + E1000_TDH(0).

This initialization is already done in ew32(TDH(0), 0).

ew32(TDH(0), 0) is equivalent to __ew32(hw, E1000_TDH(0), 0). It
executes writel(0, hw->hw_addr + E1000_TDH(0)). Since variable hw is
set to &adapter->hw, it is equal to writel(0, tx_ring->head).

We can remove similar four writel() in e1000_configure_tx() and
e1000_configure_rx().

commit 0845d45e900c ("e1000e: Modify Tx/Rx configurations to avoid
null pointer dereferences in e1000_open") has introduced these
writel(). This commit moved register writing to
e1000_configure_tx/rx(), and as result, it caused duplication in
e1000_configure_tx/rx().

This patch modifies the sequence of register writing, but removing
these writes is safe because the same writes were already there before
the commit.

I also have checked the datasheets [0] [1] and have not found any
description that we need to write RDH, RDT, TDH and TDT registers
twice at initialization. Furthermore, we have tested this patch on an
I219-V device physically.

Link: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82577-gbe-phy-datasheet.pdf
Link: https://www.intel.com/content/www/us/en/content-details/613460/intel-82583v-gbe-controller-datasheet.html
Tested-by: Kohei Enju <[email protected]>
Signed-off-by: Takamitsu Iwai <[email protected]>
Tested-by: Mor Bar-Gabay <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

igb: Cleanup unused declarations

e1000_init_function_pointers_82575() is never implemented and used since
commit 9d5c824399de ("igb: PCI-Express 82575 Gigabit Ethernet driver").
And commit 9835fd7321a6 ("igb: Add new function to read part number from
EEPROM in string format") removed igb_read_part_num() implementation.

Signed-off-by: Yue Haibing <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

iavf: Remove unused declarations

There is no caller and implementation in tree.

Signed-off-by: Yue Haibing <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Cleanup unused declarations

Since commit fff292b47ac1 ("ice: add VF representors one by one")
ice_eswitch_configure() is not used anymore.
Commit 1b8f15b64a00 ("ice: refactor filter functions") removed
ice_vsi_cfg_mac_fltr() but leave declaration.
Commit a24b4c6e9aab ("ice: xsk: Do not convert to buff to frame for
XDP_TX") leave ice_xmit_xdp_buff() declaration.

Commit 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C
products") declared ice_phy_cfg_{rx,tx}_offset_eth56g(),
commit a1ffafb0b4a4 ("ice: Support configuring the device to Double
VLAN Mode") declared ice_pkg_buf_get_free_space(), and
commit 8a3a565ff210 ("ice: add admin commands to access cgu
configuration") declared ice_is_pca9575_present(), but all these never
be implemented.

Signed-off-by: Yue Haibing <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Use common error handling code in two functions

Add jump targets so that a bit of exception handling can be better reused
at the end of two function implementations.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>

ice: Make use of assign_bit() API

We have for some time the assign_bit() API to replace open coded

    if (foo)
            set_bit(n, bar);
    else
            clear_bit(n, bar);

Use this API to clean the code. No functional change intended.

Signed-off-by: Hongbo Li <[email protected]>
Reviewed-by: Gerhard Engleder <[email protected]>
Tested-by: George Kuruvinakunnel <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: store max_frame and rx_buf_len only in ice_rx_ring

The max_frame and rx_buf_len fields of the VSI set the maximum frame size
for packets on the wire, and configure the size of the Rx buffer. In the
hardware, these are per-queue configuration. Most VSI types use a simple
method to determine the size of the buffers for all queues.

However, VFs may potentially configure different values for each queue.
While the Linux iAVF driver does not do this, it is allowed by the virtchnl
interface.

The current virtchnl code simply sets the per-VSI fields inbetween calls to
ice_vsi_cfg_single_rxq(). This technically works, as these fields are only
ever used when programming the Rx ring, and otherwise not checked again.
However, it is confusing to maintain.

The Rx ring also already has an rx_buf_len field in order to access the
buffer length in the hotpath. It also has extra unused bytes in the ring
structure which we can make use of to store the maximum frame size.

Drop the VSI max_frame and rx_buf_len fields. Add max_frame to the Rx ring,
and slightly re-order rx_buf_len to better fit into the gaps in the
structure layout.

Change the ice_vsi_cfg_frame_size function so that it writes to the ring
fields. Call this function once per ring in ice_vsi_cfg_rxqs(). This is
done over calling it inside the ice_vsi_cfg_rxq(), because
ice_vsi_cfg_rxq() is called in the virtchnl flow where the max_frame and
rx_buf_len have already been configured.

Change the accesses for rx_buf_len and max_frame to all point to the ring
structure. This has the added benefit that ice_vsi_cfg_rxq() no longer has
the surprise side effect of updating ring->rx_buf_len based on the VSI
field.

Update the virtchnl ice_vc_cfg_qs_msg() function to set the ring values
directly, and drop references to the removed VSI fields.

This now makes the VF logic clear, as the ring fields are obviously
per-queue. This reduces the required cognitive load when reasoning about
this logic.

Note that removing the VSI fields does leave a 4 byte gap, but the ice_vsi
structure has many gaps, and its layout is not as critical in the hot path.
The structure may benefit from a more thorough repacking, but no attempt
was made in this change.

Signed-off-by: Jacob Keller <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: consistently use q_idx in ice_vc_cfg_qs_msg()

The ice_vc_cfg_qs_msg() function is used to configure VF queues in response
to a VIRTCHNL_OP_CONFIG_VSI_QUEUES command.

The virtchnl command contains an array of queue pair data for configuring
Tx and Rx queues. This data includes a queue ID. When configuring the
queues, the driver generally uses this queue ID to determine which Tx and
Rx ring to program. However, a handful of places use the index into the
queue pair data from the VF. While most VF implementations appear to send
this data in order, it is not mandated by the virtchnl and it is not
verified that the queue pair data comes in order.

Fix the driver to consistently use the q_idx field instead of the 'i'
iterator value when accessing the rings. For the Rx case, introduce a local
ring variable to keep lines short.

Fixes: 7ad15440acf8 ("ice: Refactor VIRTCHNL_OP_CONFIG_VSI_QUEUES handling")
Signed-off-by: Jacob Keller <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: add E830 HW VF mailbox message limit support

E830 adds hardware support to prevent the VF from overflowing the PF
mailbox with VIRTCHNL messages. E830 will use the hardware feature
(ICE_F_MBX_LIMIT) instead of the software solution ice_is_malicious_vf().

To prevent a VF from overflowing the PF, the PF sets the number of
messages per VF that can be in the PF's mailbox queue
(ICE_MBX_OVERFLOW_WATERMARK). When the PF processes a message from a VF,
the PF decrements the per VF message count using the E830_MBX_VF_DEC_TRIG
register.

Signed-off-by: Paul Greenwalt <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Implement ethtool reset support

Enable ethtool reset support. Ethtool reset flags are mapped to the
E810 reset type:
PF reset:
  $ ethtool --reset <ethX> irq dma filter offload
CORE reset:
  $ ethtool --reset <ethX> irq-shared dma-shared filter-shared \
    offload-shared ram-shared
GLOBAL reset:
  $ ethtool --reset <ethX> irq-shared dma-shared filter-shared \
    offload-shared mac-shared phy-shared ram-shared

Calling the same set of flags as in PF reset case on port representor
triggers VF reset.

Reviewed-by: Michal Swiatkowski <[email protected]>
Reviewed-by: Marcin Szycik <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Wojciech Drewek <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>

tools: ynl-gen: refactor check validation for TypeBinary

We only support a single check at a time for TypeBinary.
Refactor the code to cover 'exact-len' and make adding
new checks easier.

Link: https://lore.kernel.org/[email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

idpf: Don't hard code napi_struct size

The sizeof(struct napi_struct) can change. Don't hardcode the size to
400 bytes and instead use "sizeof(struct napi_struct)".

Suggested-by: Alexander Lobakin <[email protected]>
Signed-off-by: Joe Damato <[email protected]>
Acked-by: Alexander Lobakin <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'rtnetlink-per-netns-rtnl'

Kuniyuki Iwashima says:

====================
rtnetlink: Per-netns RTNL.

rtnl_lock() is a "Big Kernel Lock" in the networking slow path and
serialised all rtnetlink requests until 4.13.

Since RTNL_FLAG_DOIT_UNLOCKED and RTNL_FLAG_DUMP_UNLOCKED have been
introduced in 4.14 and 6.9, respectively, rtnetlink message handlers
are ready to be converted to RTNL-less/free.

15 out of 44 dumpit()s have been converted to RCU so far, and the
progress is pretty good.  We can now dump various major network
resources without RTNL.

12 out of 87 doit()s have been converted, but most of the converted
doit()s are also on the reader side of RTNL; their message types are
RTM_GET*.

So, most of RTM_(NEW|DEL|SET)* operations are still serialised by RTNL.

For example, one of our services creates 2K netns and a small number
of network interfaces in each netns that require too many writer-side
rtnetlink requests, and setting up a single host takes 10+ minutes.

RTNL is still a huge pain for network configuration paths, and we need
more granular locking, given converting all doit()s would be unfeasible.

Actually, most RTNL users do not need to freeze multiple netns, and such
users can be protected by per-netns RTNL mutex.  The exceptions would be
RTM_NEWLINK, RTM_DELLINK, and RTM_SETLINK.  (See [0] and [1])

This series is the first step of the per-netns RTNL conversion that
gradually replaces rtnl_lock() with rtnl_net_lock(net) under
CONFIG_DEBUG_NET_SMALL_RTNL.

[0]: https://netdev.bots.linux.dev/netconf/2024/index.html
[1]: https://lpc.events/event/18/contributions/1959/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

rtnetlink: Add ASSERT_RTNL_NET() placeholder for netdev notifier.

The global and per-netns netdev notifier depend on RTNL, and its
dependency is not so clear due to nested calls.

Let's add a placeholder to place ASSERT_RTNL_NET() for each event.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>