Git Repo - linux.git/log

xsk: Add TX timestamp and TX checksum offload support

This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).

The struct is forward-compatible and can be extended in the future
by appending more fields.

Reviewed-by: Song Yoong Siang <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

xsk: Support tx_metadata_len

For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
timestamp in the completion)
- non-zero

This data is not interpreted in any way right now.

Reviewed-by: Song Yoong Siang <[email protected]>
Signed-off-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'selftests-bpf-use-pkg-config-to-determine-ld-flags'

Akihiko Odaki says:

====================
selftests/bpf: Use pkg-config to determine ld flags

When linking statically, libraries may require other dependencies to be
included to ld flags. In particular, libelf may require libzstd. Use
pkg-config to determine such dependencies.

V4 -> V5: Introduced variables LIBELF_CFLAGS and LIBELF_LIBS.
(Daniel Borkmann)
Added patch "selftests/bpf: Choose pkg-config for the target".
V3 -> V4: Added "2> /dev/null".
V2 -> V3: Added missing "echo".
V1 -> V2: Implemented fallback, referring to HOSTPKG_CONFIG.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>

selftests/bpf: Use pkg-config for libelf

When linking statically, libraries may require other dependencies to be
included to ld flags. In particular, libelf may require libzstd. Use
pkg-config to determine such dependencies.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Override PKG_CONFIG for static builds

A library may need to depend on additional archive files for static
builds so pkg-config should be instructed to list them.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Choose pkg-config for the target

pkg-config is used to build sign-file executable. It should use the
library for the target instead of the host as it is called during tests.

Signed-off-by: Akihiko Odaki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge branch 'bpf-add-link_info-support-for-uprobe-multi-link'

Jiri Olsa says:

====================
bpf: Add link_info support for uprobe multi link

hi,
this patchset adds support to get bpf_link_info details for
uprobe_multi links and adding support for bpftool link to
display them.

v4 changes:
  - move flags field up in bpf_uprobe_multi_link [Andrii]
  - include zero terminating byte in path_size [Andrii]
  - return d_path error directly [Yonghong]
  - use SEC(".probes") for semaphores [Yonghong]
  - fix ref_ctr_offsets leak in test [Yonghong]
  - other smaller fixes [Yonghong]

thanks,
jirka
---
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>

bpftool: Add support to display uprobe_multi links

Adding support to display details for uprobe_multi links,
both plain:

  # bpftool link -p
  ...
  24: uprobe_multi  prog 126
          uprobe.multi  path /home/jolsa/bpf/test_progs  func_cnt 3  pid 4143
          offset             ref_ctr_offset     cookies
          0xd1f88            0xf5d5a8           0xdead
          0xd1f8f            0xf5d5aa           0xbeef
          0xd1f96            0xf5d5ac           0xcafe

and json:

  # bpftool link -p
  [{
  ...
      },{
          "id": 24,
          "type": "uprobe_multi",
          "prog_id": 126,
          "retprobe": false,
          "path": "/home/jolsa/bpf/test_progs",
          "func_cnt": 3,
          "pid": 4143,
          "funcs": [{
                  "offset": 860040,
                  "ref_ctr_offset": 16111016,
                  "cookie": 57005
              },{
                  "offset": 860047,
                  "ref_ctr_offset": 16111018,
                  "cookie": 48879
              },{
                  "offset": 860054,
                  "ref_ctr_offset": 16111020,
                  "cookie": 51966
              }
          ]
      }
  ]

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Add link_info test for uprobe_multi link

Adding fill_link_info test for uprobe_multi link.

Setting up uprobes with bogus ref_ctr_offsets and cookie values
to test all the bpf_link_info::uprobe_multi fields.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Use bpf_link__destroy in fill_link_info tests

The fill_link_info test keeps skeleton open and just creates
various links. We are wrongly calling bpf_link__detach after
each test to close them, we need to call bpf_link__destroy.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Add link_info support for uprobe multi link

Adding support to get uprobe_link details through bpf_link_info
interface.

Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.

The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.

All the non-array fields (path/count/flags/pid) are always set.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Store ref_ctr_offsets values in bpf_uprobe array

We will need to return ref_ctr_offsets values through link_info
interface in following change, so we need to keep them around.

Storing ref_ctr_offsets values directly into bpf_uprobe array.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

libbpf: Add st_type argument to elf_resolve_syms_offsets function

We need to get offsets for static variables in following changes,
so making elf_resolve_syms_offsets to take st_type value as argument
and passing it to elf_sym_iter_new.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: update test_offload to use new orphaned property

- filter orphaned programs by default
- when trying to query orphaned program, don't expect bpftool failure

Cc: [email protected]
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpftool: mark orphaned programs during prog show

Commit ef01f4e25c17 ("bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD
and PERF_BPF_EVENT_PROG_UNLOAD") stopped removing program's id from
idr when the offloaded/bound netdev goes away. I was supposed to
take a look and check in [0], but apparently I did not.

Martin points out it might be useful to keep it that way for
observability sake, but we at least need to mark those programs as
unusable.

Mark those programs as 'orphaned' and keep printing the list when
we encounter ENODEV.

0: unspec  tag 0000000000000000
        xlated 0B  not jited  memlock 4096B  orphaned

[0]: https://lore.kernel.org/all/CAKH8qBtyR20ZWAc11z1-6pGb3Hd47AQUTbE_cfoktG59TqaJ7Q@mail.gmail.com/

v3:
* use two spaces for "  orphaned" (Quentin)

Cc: [email protected]
Fixes: ef01f4e25c17 ("bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD")
Signed-off-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpf: Fix a few selftest failures due to llvm18 change

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  #77      fentry_fexit:FAIL
  #78/1    fentry_test/fentry:FAIL
  #78      fentry_test:FAIL
  #82/1    fexit_test/fexit:FAIL
  #82      fexit_test:FAIL
  #112/1   kprobe_multi_test/skel_api:FAIL
  #112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  #112     kprobe_multi_test:FAIL
  #356/17  test_global_funcs/global_func17:FAIL
  #356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c)), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] https://github.com/llvm/llvm-project/pull/72903

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Add lazy global subprog validation tests

Add a few test that validate BPF verifier's lazy approach to validating
global subprogs.

We check that global subprogs that are called transitively through
another global subprog is validated.

We also check that invalid global subprog is not validated, if it's not
called from the main program.

And we also check that main program is always validated first, before
any of the subprogs.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Validate global subprogs lazily

Slightly change BPF verifier logic around eagerness and order of global
subprog validation. Instead of going over every global subprog eagerly
and validating it before main (entry) BPF program is verified, turn it
around. Validate main program first, mark subprogs that were called from
main program for later verification, but otherwise assume it is valid.
Afterwards, go over marked global subprogs and validate those,
potentially marking some more global functions as being called. Continue
this process until all (transitively) callable global subprogs are
validated. It's a BFS traversal at its heart and will always converge.

This is an important change because it allows to feature-gate some
subprograms that might not be verifiable on some older kernel, depending
on supported set of features.

E.g., at some point, global functions were allowed to accept a pointer
to memory, which size is identified by user-provided type.
Unfortunately, older kernels don't support this feature. With BPF CO-RE
approach, the natural way would be to still compile BPF object file once
and guard calls to this global subprog with some CO-RE check or using
.rodata variables. That's what people do to guard usage of new helpers
or kfuncs, and any other new BPF-side feature that might be missing on
old kernels.

That's currently impossible to do with global subprogs, unfortunately,
because they are eagerly and unconditionally validated. This patch set
aims to change this, so that in the future when global funcs gain new
features, those can be guarded using BPF CO-RE techniques in the same
fashion as any other new kernel feature.

Two selftests had to be adjusted in sync with these changes.

test_global_func12 relied on eager global subprog validation failing
before main program failure is detected (unknown return value). Fix by
making sure that main program is always valid.

verifier_subprog_precision's parent_stack_slot_precise subtest relied on
verifier checkpointing heuristic to do a checkpoint at instruction #5,
but that's no longer true because we don't have enough jumps validated
before reaching insn #5 due to global subprogs being validated later.

Other than that, no changes, as one would expect.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Emit global subprog name in verifier logs

We have the name, instead of emitting just func#N to identify global
subprog, augment verifier log messages with actual function name to make
it more user-friendly.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

libbpf: Start v1.4 development cycle

Bump libbpf.map to v1.4.0 to start a new libbpf version cycle.

Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  c9663f79cd82 ("ice: adjust switchdev rebuild path")
  7758017911a4 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630 [email protected]/

Adjacent changes:

kernel/bpf/verifier.c
  bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
  5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization")

Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.

  Current release - regressions:

   - Revert "net: r8169: Disable multicast filter for RTL8168H and
     RTL8107E"

   - kselftest: rtnetlink: fix ip route command typo

  Current release - new code bugs:

   - s390/ism: make sure ism driver implies smc protocol in kconfig

   - two build fixes for tools/net

  Previous releases - regressions:

   - rxrpc: couple of ACK/PING/RTT handling fixes

  Previous releases - always broken:

   - bpf: verify bpf_loop() callbacks as if they are called unknown
     number of times

   - improve stability of auto-bonding with Hyper-V

   - account BPF-neigh-redirected traffic in interface statistics

  Misc:

   - net: fill in some more MODULE_DESCRIPTION()s"

* tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  tools: ynl: fix duplicate op name in devlink
  tools: ynl: fix header path for nfsd
  net: ipa: fix one GSI register field width
  tls: fix NULL deref on tls_sw_splice_eof() with empty record
  net: axienet: Fix check for partial TX checksum
  vsock/test: fix SEQPACKET message bounds test
  i40e: Fix adding unsupported cloud filters
  ice: restore timestamp configuration after device reset
  ice: unify logic for programming PFINT_TSYN_MSK
  ice: remove ptp_tx ring parameter flag
  amd-xgbe: propagate the correct speed and duplex status
  amd-xgbe: handle the corner-case during tx completion
  amd-xgbe: handle corner-case during sfp hotplug
  net: veth: fix ethtool stats reporting
  octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF
  net: usb: qmi_wwan: claim interface 4 for ZTE MF290
  Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
  net/smc: avoid data corruption caused by decline
  nfc: virtual_ncidev: Add variable to check if ndev is running
  dpll: Fix potential msg memleak when genlmsg_put_reply failed
  ...

tools: ynl: fix duplicate op name in devlink

We don't support CRUD-inspired message types in YNL too well.
One aspect that currently trips us up is the fact that single
message ID can be used in multiple commands (as the response).
This leads to duplicate entries in the id-to-string tables:

devlink-user.c:19:34: warning: initialized field overwritten [-Woverride-init]
19 | [DEVLINK_CMD_PORT_NEW] = "port-new",
| ^~~~~~~~~~
devlink-user.c:19:34: note: (near initialization for ‘devlink_op_strmap[7]’)

Fixes tag points at where the code was generated, the "real" problem
is that the code generator does not support CRUD.

Fixes: f2f9dd164db0 ("netlink: specs: devlink: add the remaining command to generate complete split_ops")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tools: ynl: fix header path for nfsd

The makefile dependency is trying to include the wrong header:

<command-line>: fatal error: ../../../../include/uapi//linux/nfsd.h: No such file or directory

The guard also looks wrong.

Fixes: f14122b2c2ac ("tools: ynl: Add source files for nfsd netlink protocol")
Reviewed-by: Chuck Lever <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ipa: fix one GSI register field width

The width of the R_LENGTH field of the EV_CH_E_CNTXT_1 GSI register
is 24 bits (not 20 bits) starting with IPA v5.0. Fix this.

Fixes: faf0678ec8a0 ("net: ipa: add IPA v5.0 GSI register definitions")
Signed-off-by: Alex Elder <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tls: fix NULL deref on tls_sw_splice_eof() with empty record

syzkaller discovered that if tls_sw_splice_eof() is executed as part of
sendfile() when the plaintext/ciphertext sk_msg are empty, the send path
gets confused because the empty ciphertext buffer does not have enough
space for the encryption overhead. This causes tls_push_record() to go on
the `split = true` path (which is only supposed to be used when interacting
with an attached BPF program), and then get further confused and hit the
tls_merge_open_record() path, which then assumes that there must be at
least one populated buffer element, leading to a NULL deref.

It is possible to have empty plaintext/ciphertext buffers if we previously
bailed from tls_sw_sendmsg_locked() via the tls_trim_both_msgs() path.
tls_sw_push_pending_record() already handles this case correctly; let's do
the same check in tls_sw_splice_eof().

Fixes: df720d288dbb ("tls/sw: Use splice_eof() to flush")
Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Jann Horn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: axienet: Fix check for partial TX checksum

Due to a typo, the code checked the RX checksum feature in the TX path.

Fixes: 8a3b7a252dca ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Samuel Holland <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Radhey Shyam Pandey <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vsock/test: fix SEQPACKET message bounds test

Tune message length calculation to make this test work on machines
where 'getpagesize()' returns >32KB. Now maximum message length is not
hardcoded (on machines above it was smaller than 'getpagesize()' return
value, thus we get negative value and test fails), but calculated at
runtime and always bigger than 'getpagesize()' result. Reproduced on
aarch64 with 64KB page size.

Fixes: 5c338112e48a ("test/vsock: rework message bounds test")
Signed-off-by: Arseniy Krasnov <[email protected]>
Reported-by: Bogdan Marcynkov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

i40e: Fix adding unsupported cloud filters

If a VF tries to add unsupported cloud filter through virtchnl
then i40e_add_del_cloud_filter(_big_buf) returns -ENOTSUPP but
this error code is stored in 'ret' instead of 'aq_ret' that
is used as error code sent back to VF. In this scenario where
one of the mentioned functions fails the value of 'aq_ret'
is zero so the VF will incorrectly receive a 'success'.

Use 'aq_ret' to store return value and remove 'ret' local
variable. Additionally fix the issue when filter allocation
fails, in this case no notification is sent back to the VF.

Fixes: e284fc280473 ("i40e: Add and delete cloud filter")
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Ivan Vecera <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'ice-restore-timestamp-config-after-reset'

Tony Nguyen says:

====================
ice: restore timestamp config after reset

Jake Keller says:

We recently discovered during internal validation that the ice driver has
not been properly restoring Tx timestamp configuration after a device reset,
which resulted in application failures after a device reset.

After some digging, it turned out this problem is two-fold. Since the
introduction of the PTP support the driver has been clobbering the storage
of the current timestamp configuration during reset. Thus after a reset, the
driver will no longer perform Tx or Rx timestamps, and will report
timestamp configuration as disabled if SIOCGHWTSTAMP ioctl is issued.

In addition, the recently merged auxiliary bus support code missed that
PFINT_TSYN_MSK must be reprogrammed on the clock owner for E822 devices.
Failure to restore this register configuration results in the driver no
longer responding to interrupts from other ports. Depending on the traffic
pattern, this can either result in increased latency responding to
timestamps on the non-owner ports, or it can result in the driver never
reporting any timestamps. The configuration of PFINT_TSYN_MSK was only done
during initialization. Due to this, the Tx timestamp issue persists even if
userspace reconfigures timestamping.

This series fixes both issues, as well as removes a redundant Tx ring field
since we can rely on the skb flag as the primary detector for a Tx timestamp
request.

Note that I don't think this series will directly apply to older stable
releases (even v6.6) as we recently refactored a lot of the PTP code to
support auxiliary bus. Patch 2/3 only matters for the post-auxiliary bus
implementation. The principle of patch 1/3 and 3/3 could apply as far back
as the initial PTP support, but I don't think it will apply cleanly as-is.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

ice: restore timestamp configuration after device reset

The driver calls ice_ptp_cfg_timestamp() during ice_ptp_prepare_for_reset()
to disable timestamping while the device is resetting. This operation
destroys the user requested configuration. While the driver does call
ice_ptp_cfg_timestamp in ice_rebuild() to restore some hardware settings
after a reset, it unconditionally passes true or false, resulting in
failure to restore previous user space configuration.

This results in a device reset forcibly disabling timestamp configuration
regardless of current user settings.

This was not detected previously due to a quirk of the LinuxPTP ptp4l
application. If ptp4l detects a missing timestamp, it enters a fault state
and performs recovery logic which includes executing SIOCSHWTSTAMP again,
restoring the now accidentally cleared configuration.

Not every application does this, and for these applications, timestamps
will mysteriously stop after a PF reset, without being restored until an
application restart.

Fix this by replacing ice_ptp_cfg_timestamp() with two new functions:

1) ice_ptp_disable_timestamp_mode() which unconditionally disables the
   timestamping logic in ice_ptp_prepare_for_reset() and ice_ptp_release()

2) ice_ptp_restore_timestamp_mode() which calls
   ice_ptp_restore_tx_interrupt() to restore Tx timestamping configuration,
   calls ice_set_rx_tstamp() to restore Rx timestamping configuration, and
   issues an immediate TSYN_TX interrupt to ensure that timestamps which
   may have occurred during the device reset get processed.

Modify the ice_ptp_set_timestamp_mode to directly save the user
configuration and then call ice_ptp_restore_timestamp_mode. This way, reset
no longer destroys the saved user configuration.

This obsoletes the ice_set_tx_tstamp() function which can now be safely
removed.

With this change, all devices should now restore Tx and Rx timestamping
functionality correctly after a PF reset without application intervention.

Fixes: 77a781155a65 ("ice: enable receive hardware timestamping")
Fixes: ea9b847cda64 ("ice: enable transmit timestamps for E810 devices")
Signed-off-by: Jacob Keller <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

ice: unify logic for programming PFINT_TSYN_MSK

Commit d938a8cca88a ("ice: Auxbus devices & driver for E822 TS") modified
how Tx timestamps are handled for E822 devices. On these devices, only the
clock owner handles reading the Tx timestamp data from firmware. To do
this, the PFINT_TSYN_MSK register is modified from the default value to one
which enables reacting to a Tx timestamp on all PHY ports.

The driver currently programs PFINT_TSYN_MSK in different places depending
on whether the port is the clock owner or not. For the clock owner, the
PFINT_TSYN_MSK value is programmed during ice_ptp_init_owner just before
calling ice_ptp_tx_ena_intr to program the PHY ports.

For the non-clock owner ports, the PFINT_TSYN_MSK is programmed during
ice_ptp_init_port.

If a large enough device reset occurs, the PFINT_TSYN_MSK register will be
reset to the default value in which only the PHY associated directly with
the PF will cause the Tx timestamp interrupt to trigger.

The driver lacks logic to reprogram the PFINT_TSYN_MSK register after a
device reset. For the E822 device, this results in the PF no longer
responding to interrupts for other ports. This results in failure to
deliver Tx timestamps to user space applications.

Rename ice_ptp_configure_tx_tstamp to ice_ptp_cfg_tx_interrupt, and unify
the logic for programming PFINT_TSYN_MSK and PFINT_OICR_ENA into one place.
This function will program both registers according to the combination of
user configuration and device requirements.

This ensures that PFINT_TSYN_MSK is always restored when we configure the
Tx timestamp interrupt.

Fixes: d938a8cca88a ("ice: Auxbus devices & driver for E822 TS")
Signed-off-by: Jacob Keller <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

ice: remove ptp_tx ring parameter flag

Before performing a Tx timestamp in ice_stamp(), the driver checks a ptp_tx
ring variable to see if timestamping is enabled on that ring. This value is
set for all rings whenever userspace configures Tx timestamping.

Ostensibly this was done to avoid wasting cycles checking other fields when
timestamping has not been enabled. However, for Tx timestamps we already
get an individual per-SKB flag indicating whether userspace wants to
request a timestamp on that packet. We do not gain much by also having
a separate flag to check for whether timestamping was enabled.

In fact, the driver currently fails to restore the field after a PF reset.
Because of this, if a PF reset occurs, timestamps will be disabled.

Since this flag doesn't add value in the hotpath, remove it and always
provide a timestamp if the SKB flag has been set.

A following change will fix the reset path to properly restore user
timestamping configuration completely.

This went unnoticed for some time because one of the most common
applications using Tx timestamps, ptp4l, will reconfigure the socket as
part of its fault recovery logic.

Fixes: ea9b847cda64 ("ice: enable transmit timestamps for E810 devices")
Signed-off-by: Jacob Keller <[email protected]>
Reviewed-by: Jesse Brandeburg <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'amd-xgbe-fixes-to-handle-corner-cases'

Raju Rangoju says:

====================
amd-xgbe: fixes to handle corner-cases

This series include bug fixes to amd-xgbe driver.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

amd-xgbe: propagate the correct speed and duplex status

xgbe_get_link_ksettings() does not propagate correct speed and duplex
information to ethtool during cable unplug. Due to which ethtool reports
incorrect values for speed and duplex.

Address this by propagating correct information.

Fixes: 7c12aa08779c ("amd-xgbe: Move the PHY support into amd-xgbe")
Acked-by: Shyam Sundar S K <[email protected]>
Signed-off-by: Raju Rangoju <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

amd-xgbe: handle the corner-case during tx completion

The existing implementation uses software logic to accumulate tx
completions until the specified time (1ms) is met and then poll them.
However, there exists a tiny gap which leads to a race between
resetting and checking the tx_activate flag. Due to this the tx
completions are not reported to upper layer and tx queue timeout
kicks-in restarting the device.

To address this, introduce a tx cleanup mechanism as part of the
periodic maintenance process.

Fixes: c5aa9e3b8156 ("amd-xgbe: Initial AMD 10GbE platform driver")
Acked-by: Shyam Sundar S K <[email protected]>
Signed-off-by: Raju Rangoju <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

amd-xgbe: handle corner-case during sfp hotplug

Force the mode change for SFI in Fixed PHY configurations. Fixed PHY
configurations needs PLL to be enabled while doing mode set. When the
SFP module isn't connected during boot, driver assumes AN is ON and
attempts auto-negotiation. However, if the connected SFP comes up in
Fixed PHY configuration the link will not come up as PLL isn't enabled
while the initial mode set command is issued. So, force the mode change
for SFI in Fixed PHY configuration to fix link issues.

Fixes: e57f7a3feaef ("amd-xgbe: Prepare for working with more than one type of phy")
Acked-by: Shyam Sundar S K <[email protected]>
Signed-off-by: Raju Rangoju <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: veth: fix ethtool stats reporting

Fix a possible misalignment between page_pool stats and tx xdp_stats
reported in veth_get_ethtool_stats routine.
The issue can be reproduced configuring the veth pair with the
following tx/rx queues:

$ip link add v0 numtxqueues 2 numrxqueues 4 type veth peer name v1 \
numtxqueues 1 numrxqueues 1

and loading a simple XDP program on v0 that just returns XDP_PASS.
In this case on v0 the page_pool stats overwrites tx xdp_stats for queue 1.
Fix the issue incrementing pp_idx of dev->real_num_tx_queues * VETH_TQ_STATS_LEN
since we always report xdp_stats for all tx queues in ethtool.

Fixes: 4fc418053ec7 ("net: veth: add page_pool stats")
Signed-off-by: Lorenzo Bianconi <[email protected]>
Link: https://lore.kernel.org/r/c5b5d0485016836448453f12846c7c4ab75b094a.1700593593.git.lorenzo@kernel.org
Signed-off-by: Paolo Abeni <[email protected]>

dt-bindings: net: renesas,ethertsn: Add Ethernet TSN

Add bindings for Renesas R-Car Ethernet TSN End-station IP. The RTSN
device provides Ethernet network.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF

It is possible to add a ntuple rule which would like to direct packet to
a VF whose number of queues are greater/less than its PF's queue numbers.
For example a PF can have 2 Rx queues but a VF created on that PF can have
8 Rx queues. As of today, ntuple rule will reject rule because it is
checking the requested queue number against PF's number of Rx queues.
As a part of this fix if the action of a ntuple rule is to move a packet
to a VF's queue then the check is removed. Also, a debug information is
printed to aware user that it is user's responsibility to cross check if
the requested queue number on that VF is a valid one.

Fixes: f0a1913f8a6f ("octeontx2-pf: Add support for ethtool ntuple filters")
Signed-off-by: Suman Ghosh <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'net-ethernet-renesas-rcar_gen4_ptp-add-v4h-support'

Niklas Söderlund says:

====================
net: ethernet: renesas: rcar_gen4_ptp: Add V4H support

This small series prepares the rcar_gen4_ptp to be useable both on both
R-Car S4 and V4H. The only in-tree driver that make use of this is
rswtich on S4. A new Ethernet (R-Car Ethernet TSN) driver for V4H is on
it's way that also will make use of rcar_gen4_ptp functionality.

Patch 1-2 are small improvements to the existing driver. While patch 3-4
adds V4H support. Finally patch 5 turns rcar_gen4_ptp into a separate
module to allow the gPTP functionality to be shared between the two
users without having to duplicate the code in each.

See each patch for changelog.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: ethernet: renesas: rcar_gen4_ptp: Break out to module

The Gen4 gPTP support will be shared between the existing Renesas
Ethernet Switch driver and the upcoming Renesas Ethernet-TSN driver. In
preparation for this break out the gPTP support to its own module.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ethernet: renesas: rcar_gen4_ptp: Get clock increment from clock rate

Instead of using hard coded clock increment values for each SoC derive
the clock increment from the module clock. This is done in preparation
to support a second platform, R-Car V4H that uses a 200Mhz clock
compared with the 320Mhz clock used on R-Car S4.

Tested on both SoCs,

S4 reports a clock of 320000000Hz which gives a value of 0x19000000.
Documentation says a 320Mhz clock is used and the correct increment for
that clock is 0x19000000.

V4H reports a clock of 199999992Hz which gives a value of 0x2800001a.
Documentation says a 200Mhz clock is used and the correct increment for
that clock is 0x28000000.

Suggested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ethernet: renesas: rcar_gen4_ptp: Prepare for shared register layout

All known R-Car Gen4 SoC share the same register layout, rename the
R-Car S4 specific identifiers so they can be shared with the upcoming
R-Car V4H support.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ethernet: renesas: rcar_gen4_ptp: Fail on unknown register layout

Instead of printing a warning and proceeding with an unknown register
layout return an error. The only call site is already prepared to
propagate the error.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ethernet: renesas: rcar_gen4_ptp: Remove incorrect comment

The comments intent was to indicates which function uses the enum. While
upstreaming rcar_gen4_ptp the function was renamed but this comment was
left with the old function name.

Instead of correcting the comment remove it, it adds little value.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: usb: qmi_wwan: claim interface 4 for ZTE MF290

Interface 4 is used by for QMI interface in stock firmware of MF28D, the
router which uses MF290 modem. Rebind it to qmi_wwan after freeing it up
from option driver.
The proper configuration is:

Interface mapping is:
0: QCDM, 1: (unknown), 2: AT (PCUI), 2: AT (Modem), 4: QMI

T:  Bus=01 Lev=02 Prnt=02 Port=00 Cnt=01 Dev#=  4 Spd=480  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=19d2 ProdID=0189 Rev= 0.00
S:  Manufacturer=ZTE, Incorporated
S:  Product=ZTE LTE Technologies MSM
C:* #Ifs= 5 Cfg#= 1 Atr=e0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms
I:* If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms
I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=84(I) Atr=03(Int.) MxPS=  64 Ivl=2ms
E:  Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms
I:* If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
E:  Ad=86(I) Atr=03(Int.) MxPS=  64 Ivl=2ms
E:  Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=05(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms

Cc: Bjørn Mork <[email protected]>
Signed-off-by: Lech Perczak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge tag 'loongarch-fixes-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
"Fix several build errors, a potential kernel panic, a cpu hotplug
  issue and update links in documentations"

* tag 'loongarch-fixes-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  Docs/zh_CN/LoongArch: Update links in LoongArch introduction.rst
  Docs/LoongArch: Update links in LoongArch introduction.rst
  LoongArch: Implement constant timer shutdown interface
  LoongArch: Mark {dmw,tlb}_virt_to_page() exports as non-GPL
  LoongArch: Silence the boot warning about 'nokaslr'
  LoongArch: Add __percpu annotation for __percpu_read()/__percpu_write()
  LoongArch: Record pc instead of offset in la_abs relocation
  LoongArch: Explicitly set -fdirect-access-external-data for vmlinux
  LoongArch: Add dependency between vmlinuz.efi and vmlinux.efi

Merge tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

- One fix for the KVP daemon (Ani Sinha)

- Fix for the detection of E820_TYPE_PRAM in a Gen2 VM (Saurabh Sengar)

- Micro-optimization for hv_nmi_unknown() (Uros Bizjak)

* tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Use atomic_try_cmpxchg() to micro-optimize hv_nmi_unknown()
  x86/hyperv: Fix the detection of E820_TYPE_PRAM in a Gen2 VM
  hv/hv_kvp_daemon: Some small fixes for handling NM keyfiles

asm-generic: qspinlock: fix queued_spin_value_unlocked() implementation

We really don't want to do atomic_read() or anything like that, since we
already have the value, not the lock. The whole point of this is that
we've loaded the lock from memory, and we want to check whether the
value we loaded was a locked one or not.

The main use of this is the lockref code, which loads both the lock and
the reference count in one atomic operation, and then works on that
combined value. With the atomic_read(), the compiler would pointlessly
spill the value to the stack, in order to then be able to read it back
"atomically".

This is the qspinlock version of commit c6f4a9002252 ("asm-generic:
ticket-lock: Optimize arch_spin_value_unlocked()") which fixed this same
bug for ticket locks.

Cc: Guo Ren <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lore.kernel.org/all/CAHk-=whNRv0v6kQiV5QO6DJhjH4KEL36vWQ6Re8Csrnh4zbRkQ@mail.gmail.com/
Signed-off-by: Linus Torvalds <[email protected]>

Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"

This reverts commit efa5f1311c4998e9e6317c52bc5ee93b3a0f36df.

I couldn't reproduce the reported issue. What I did, based on a pcap
packet log provided by the reporter:
- Used same chip version (RTL8168h)
- Set MAC address to the one used on the reporters system
- Replayed the EAPOL unicast packet that, according to the reporter,
was filtered out by the mc filter.
The packet was properly received.

Therefore the root cause of the reported issue seems to be somewhere
else. Disabling mc filtering completely for the most common chip
version is a quite big hammer. Therefore revert the change and wait
for further analysis results from the reporter.

Cc: [email protected]
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: avoid data corruption caused by decline

We found a data corruption issue during testing of SMC-R on Redis
applications.

The benchmark has a low probability of reporting a strange error as
shown below.

"Error: Protocol error, got "\xe2" as reply type byte"

Finally, we found that the retrieved error data was as follows:

0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2

It is quite obvious that this is a SMC DECLINE message, which means that
the applications received SMC protocol message.
We found that this was caused by the following situations:

client                  server
        ¦  clc proposal
        ------------->
        ¦  clc accept
        <-------------
        ¦  clc confirm
        ------------->
wait llc confirm
send llc confirm
        ¦failed llc confirm
        ¦   x------
(after 2s)timeout
                        wait llc confirm rsp

wait decline

(after 1s) timeout
                        (after 2s) timeout
        ¦   decline
        -------------->
        ¦   decline
        <--------------

As a result, a decline message was sent in the implementation, and this
message was read from TCP by the already-fallback connection.

This patch double the client timeout as 2x of the server value,
With this simple change, the Decline messages should never cross or
collide (during Confirm link timeout).

This issue requires an immediate solution, since the protocol updates
involve a more long-term solution.

Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow")
Signed-off-by: D. Wythe <[email protected]>
Reviewed-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfc: virtual_ncidev: Add variable to check if ndev is running

syzbot reported an memory leak that happens when an skb is add to
send_buff after virtual nci closed.
This patch adds a variable to track if the ndev is running before
handling new skb in send function.

Signed-off-by: Nguyen Dinh Phi <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]
Reviewed-by: Bongsu Jeon
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Add support for HW-accelerated VLAN stripping

Current implementation supports driver level VLAN tag stripping only.
The features is always on if CONFIG_VLAN_8021Q is enabled in kernel
config and is not user configurable.

This patch add support to MAC level VLAN tag stripping and can be
configured through ethtool. If the rx-vlan-offload is off, the VLAN tag
will be stripped by driver. If the rx-vlan-offload is on, the VLAN tag
will be stripped by MAC.

Command: ethtool -K <interface> rx-vlan-offload off | on

Signed-off-by: Lai Peter Jun Ann <[email protected]>
Signed-off-by: Gan, Yi Fang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hsr: Add support for MC filtering at the slave device

When MC (multicast) list is updated by the networking layer due to a
user command and as well as when allmulti flag is set, it needs to be
passed to the enslaved Ethernet devices. This patch allows this
to happen by implementing ndo_change_rx_flags() and ndo_set_rx_mode()
API calls that in turns pass it to the slave devices using
existing API calls.

Signed-off-by: Murali Karicheri <[email protected]>
Signed-off-by: Ravi Gunasekaran <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

x86/hyperv: Use atomic_try_cmpxchg() to micro-optimize hv_nmi_unknown()

Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
in hv_nmi_unknown(). On x86 the CMPXCHG instruction returns success in
the ZF flag, so this change saves a compare after CMPXCHG. The generated
asm code improves from:

  3e: 65 8b 15 00 00 00 00 mov    %gs:0x0(%rip),%edx
  45: b8 ff ff ff ff        mov    $0xffffffff,%eax
  4a: f0 0f b1 15 00 00 00 lock cmpxchg %edx,0x0(%rip)
  51: 00
  52: 83 f8 ff              cmp    $0xffffffff,%eax
  55: 0f 95 c0              setne  %al

to:

  3e: 65 8b 15 00 00 00 00 mov    %gs:0x0(%rip),%edx
  45: b8 ff ff ff ff        mov    $0xffffffff,%eax
  4a: f0 0f b1 15 00 00 00 lock cmpxchg %edx,0x0(%rip)
  51: 00
  52: 0f 95 c0              setne  %al

No functional change intended.

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
Message-ID: <20231114170038 [email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic
   and range support along with a large test suite, and verifier log
   improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within
   a specific cgroup v1 hierarchy where the latter is identified by its id,
   from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field
   obtained via direct load which is a use-case needed in sched_ext,
   from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check
   for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread()
   was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
   fix for certain BPF UAPI structs to fix verifier failures seen
   in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
   memory allocator not being able to allocate per-CPU pointer successfully,
   from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which
   is later going to be used by file verification via BPF LSM and fsverity,
   from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
   macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie,
    from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file
    given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(),
    from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

dpll: Fix potential msg memleak when genlmsg_put_reply failed

We should clean the skb resource if genlmsg_put_reply failed.

Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Signed-off-by: Hao Ge <[email protected]>
Reviewed-by: Vadim Fedorenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'bnxt_en-prepare-to-support-new-p7-chips'

Michael Chan says:

====================
bnxt_en: Prepare to support new P7 chips

This patchset is to prepare the driver to support the new P7 chips by
refactoring and modifying the code.  The P7 chip is built on the P5
chip and many code paths can be modified to support both chips.  The
whole patchset to have basic support for P7 chips is about 20 patches so
a follow-on patchset will complete the support and add the new PCI IDs.

The first 8 patches are changes to the backing store logic to support
both chips with mostly common code paths and datastructures.  Both
chips require host backing store memory but the relevant firmware APIs
have been modified to make it easier to support new backing store
memory types.

The next 4 patches are changes to TX and RX ring indexing logic and NAPI
logic.  The main changes are to increment the TX and RX producers
unbounded and to do any masking only when needed.  These changes are
needed to support the P7 chips which require additional higher bits in
these producer indices.  The NAPI logic is also slightly modifed.

The last patch is a rename of BNXT_FLAG_CHIP_P5 to BNXT_FLAG_P5_PLUS and
other related macro changes to make it clear that the P5_PLUS macro
applies to P5 and newer chips.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Rename some macros for the P5 chips

In preparation to support a new P7 chip which has a lot of similarities
with the P5 chip, rename the BNXT_FLAG_CHIP_P5 flag to
BNXT_FLAG_CHIP_P5_PLUS. This will make it clear that the flag is for
P5 and newer chips.

Also, since there are no additional P5 variants in production, rename
BNXT_FLAG_CHIP_P5_THOR() to BNXT_FLAG_CHIP_P5() to keep the naming
more simple.

Reviewed-by: Andy Gospodarek <[email protected]>
Reviewed-by: Ajit Khaparde <[email protected]>
Signed-off-by: Randy Schacher <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Modify the NAPI logic for the new P7 chips

Modify the NAPI logic for the new doorbell mechanism on P7 chips.
These changes are compatible with the current P5 chips.

In the current logic, bnxt_poll_p5() services 1 or more CQs for each
MSIX.  Each MSIX has an associated NQ and each NQ has 1 or more
associated CQs.  If any CQ reaches NAPI budget, we'll stay in polling
mode and will unconditionally check and service all CQs until we exit
polling.  We always re-arm all CQs when we exit polling.

To be compatible with the new Toggle bit mechanism in P7 chips, we need
to modify the logic so that we service and re-arm the CQ only if we
receive an NQE notification for work for that CQ.  We add a new
had_nqe_notify bit to the cp_ring_info structure and it gets set when we
see the NQE notification for that CQ anytime during polling.  We'll
service and re-arm only the CQs with the had_nqe_notify bits set.

Reviewed-by: Somnath Kotur <[email protected]>
Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Modify RX ring indexing logic.

Modify the RX indexing logic for both RX ring and RX aggregation ring just
like the TX logic. Change it so that the index increments unbounded and
mask it only when needed.

Modify the existing RX macros so that the index is not masked. Add new
macros RING_RX()/RING_RX_AGG() to mask it only when needed to get the
index of rxr->rx_buf_ring[] and rxr->rx_agg_ring[].

Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Modify TX ring indexing logic.

Change the TX ring logic so that the index increments unbounded and
mask it only when needed.

Modify the existing macros so that the index is not masked. Add a
new macro RING_TX() to mask it only when needed to get the index of
txr->tx_buf_ring[].

Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Add db_ring_mask and related macro to bnxt_db_info struct.

This allows the doorbell related logic to mask the doorbell index
to the proper range before writing the doorbell.

The current code masks the doorbell index immediately to keep it in the
legal ranges for the most part.  Subsequent patches will change the
logic so that the index increments unbounded and it only gets masked
before use.  This is preparation work for the new chip that requires an
additional Epoch bit in the doorbell that needs to toggle when the index
has wrapped around.

This patch just adds the basic infrastructure and the logic is largely
unchanged.  We now replace RING_CMP() with the new DB_RING_IDX() at
appropriate places where we mask the completion ring index before
writing the doorbell.

Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Add support for HWRM_FUNC_BACKING_STORE_CFG_V2 firmware calls

Newer chips starting with 57600 will use this new firmware HWRM call to
configure backing store memory. Add this new call if it is supported
by the firmware.

Reviewed-by: Hongguang Gao <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Add support for new backing store query firmware API

Use the new v2 firmware API if supported by the firmware. We now have the
infrastructure to support the v2 API.

Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Add bnxt_setup_ctxm_pg_tbls() helper function

In bnxt_alloc_ctx_mem(), the logic to set up the context memory entries
and to allocate the context memory tables is done repetitively. Add
a helper function to simplify the code.

The setup of the Fast Path TQM entries relies on some information from
the Slow Path TQM entries. Copy the SP_TQM entries to the FP_TQM
entries to simplify the logic.

Reviewed-by: Andy Gospodarek <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Use the pg_info field in bnxt_ctx_mem_type struct

Use the newly added pg_info field in bnxt_ctx_mem_type struct and
remove the standalone page info structures in bnxt_ctx_mem_info.
This now completes the reorganization of the context memory
structures to work better with the new and more flexible firmware
interface for newer chips.

Reviewed-by: Somnath Kotur <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Add page info to struct bnxt_ctx_mem_type

This will further improve the organization of the bnxt_ctx_mem_info
structure by moving the standalone page info structures into the
bnxt_ctx_mem_type array. Add the allocation and free logic first and
the next patch will migrate to use the new infrastructure.

Reviewed-by: Somnath Kotur <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Restructure context memory data structures

The current code uses a flat bnxt_ctx_mem_info structure to store 8
types of context memory for the NIC.  All the context memory types
are very similar and have similar parameters.  They can all share a
common structure to improve the organization.  Also, new firmware
interface will provide a new API to retrieve each type of context
memory by calling the API repeatedly.

This patch reorganizes the bnxt_ctx_mem_info structure to fit better
with the new firmware interface.  It will also work with the legacy
firmware interface.  The flat fields in bnxt_ctx_mem_info are replaced
by the bnxt_ctx_mem_type array.  The bnxt_mem_init array info will no
longer be needed.

Reviewed-by: Somnath Kotur <[email protected]>
Reviewed-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: Free bp->ctx inside bnxt_free_ctx_mem()

We always free bp->ctx right after calling bnxt_free_ctx_mem(), so just
free it at the end of that function to make things simpler.

Reviewed-by: Somnath Kotur <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

bnxt_en: The caller of bnxt_alloc_ctx_mem() should always free bp->ctx

bnxt_alloc_ctx_mem() calls bnxt_hwrm_func_backing_store_qcaps() to
allocate the memory for bp->ctx.  Initialize bp->ctx with the allocated
memory and let the caller free it during unwind.  The unwind logic is
already there, we just need to always set bp->ctx to the allocated
memory so the caller will always free it.  This simplifies the logic
and makes it easier to expand on the backing store logic.

Reviewed-by: Pavan Chebbi <[email protected]>
Reviewed-by: Somnath Kotur <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-page_pool-add-netlink-based-introspection-part1'

Jakub Kicinski says:

====================
net: page_pool: plit the page_pool_params into fast and slow

Small refactoring in prep for adding more page pool params
which won't be needed on the fast path.

v1: https://lore.kernel.org/all/20231024160220.3973311 [email protected]/
RFC: https://lore.kernel.org/all/20230816234303.3786178 [email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: page_pool: avoid touching slow on the fastpath

To fully benefit from previous commit add one byte of state
in the first cache line recording if we need to look at
the slow part.

The packing isn't all that impressive right now, we create
a 7B hole. I'm expecting Olek's rework will reshuffle this,
anyway.

Acked-by: Jesper Dangaard Brouer <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: page_pool: split the page_pool_params into fast and slow

struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.

It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.

Acked-by: Jesper Dangaard Brouer <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-11-21

We've added 19 non-merge commits during the last 4 day(s) which contain
a total of 18 files changed, 1043 insertions(+), 416 deletions(-).

The main changes are:

1) Fix BPF verifier to validate callbacks as if they are called an unknown
   number of times in order to fix not detecting some unsafe programs,
   from Eduard Zingerman.

2) Fix bpf_redirect_peer() handling which missed proper stats accounting
   for veth and netkit and also generally fix missing stats for the latter,
   from Peilin Ye, Daniel Borkmann et al.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: check if max number of bpf_loop iterations is tracked
  bpf: keep track of max number of bpf_loop callback iterations
  selftests/bpf: test widening for iterating callbacks
  bpf: widening for callback iterators
  selftests/bpf: tests for iterating callbacks
  bpf: verify callbacks as if they are called unknown number of times
  bpf: extract setup_func_entry() utility function
  bpf: extract __check_reg_arg() utility function
  selftests/bpf: fix bpf_loop_bench for new callback verification scheme
  selftests/bpf: track string payload offset as scalar in strobemeta
  selftests/bpf: track tcp payload offset as scalar in xdp_synproxy
  selftests/bpf: Add netkit to tc_redirect selftest
  selftests/bpf: De-veth-ize the tc_redirect test case
  bpf, netkit: Add indirect call wrapper for fetching peer dev
  bpf: Fix dev's rx stats for bpf_redirect_peer traffic
  veth: Use tstats per-CPU traffic counters
  netkit: Add tstats per-CPU traffic counters
  net: Move {l,t,d}stats allocation to core and convert veth & vrf
  net, vrf: Move dstats structure to core
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'mlxsw-preparations-for-support-of-cff-flood-mode'

Petr Machata says:

====================
mlxsw: Preparations for support of CFF flood mode

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Among other allocations, the PGT currently contains two large blocks for
bridge flooding: one for 802.1q and one for 802.1d. Within each of these
blocks are three tables, for unknown-unicast, multicast and broadcast
flooding:

      . . . |    802.1q    |    802.1d    | . . .
            | UC | MC | BC | UC | MC | BC |
             \______ _____/ \_____ ______/
                    v             v
                   FID flood vectors

Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an
802.1q bridge) uses three flood vectors spread across a fairly large region
of PGT.

This way of organizing the flood table (called "controlled") is not very
flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one
would need to completely rewrite the bridge PGT blocks, or resort to hacks
such as storing individual MC flood vectors into unused part of the bridge
table.

In order to address these shortcomings, Spectrum-2 and above support what
is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode,
each FID has a little table of its own, with three entries adjacent to each
other, one for unknown-UC, one for MC, one for BC. This allows for a much
more fine-grained approach to PGT management, where bits of it are
allocated on demand.

      . . . | FID | FID | FID | FID | FID | . . .
            |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B|
             \_____________ _____________/
                           v
                   FID flood vectors

Besides the FID table organization, the CFF flood mode also impacts Router
Subport (RSP) table. This table contains flood vectors for rFIDs, which are
FIDs that reference front panel ports or LAGs. The RSP table contains two
entries per front panel port and LAG, one for unknown-UC traffic, and one
for everything else. Currently, the FW allocates and manages the table in
its own part of PGT. rFIDs are marked with flood_rsp bit and managed
specially. In CFF mode, rFIDs are managed as all other FIDs. The driver
therefore has to allocate and maintain the flood vectors. Like with bridge
FIDs, this is more work, but increases flexibility of the system.

The FW currently supports both the controlled and CFF flood modes. To shed
complexity, in the future it should only support CFF flood mode. Hence this
patchset, which is the first in series of two to add CFF flood mode support
to mlxsw.

There are FW versions out there that do not support CFF flood mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both controlled flood mode as well as CFF.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support CFF flood mode, but then reject or ignore
configurations enabling the same. The driver thus has to have a say in
whether an attempt to configure CFF flood mode should even be made.

Much like with the LAG mode, the feature is therefore expressed in terms of
"does the driver prefer CFF flood mode?", and "what flood mode the PCI
module managed to configure the FW with". This gives to the driver a chance
to determine whether CFF flood mode configuration should be attempted.

In this patchset, we lay the ground with new definitions, registers and
their fields, and some minor code shaping. The next patchset will be more
focused on introducing necessary abstractions and implementation.

- Patches #1 and #2 add CFF-related items to the command interface.

- Patch #3 adds a new resource, for maximum number of flood profiles
  supported. (A flood profile is a mapping between traffic type and offset
  in the per-FID flood vector table.)

- Patches #4 to #8 adjust reg.h. The SFFP register is added, which is used
  for configuring the abovementioned traffic-type-to-offset mapping. The
  SFMR, register, which serves for FID configuration, is extended with
  fields specific to CFF mode. And other minor adjustments.

- Patches #9 and #10 add the plumbing for CFF mode: a way to request that
  CFF flood mode be configured, and a way to query the flood mode that was
  actually configured.

- Patch #11 removes dead code.

- Patches #12 and #13 add helpers that the next patchset will make use of.
  Patch #14 moves RIF setup ahead so that FID code can make use of it.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_router: Call RIF setup before obtaining FID

For subport RIFs, the setup initializes, among other things, RIF port and
LAG numbers. Those are important to determine where in the PGT the RIF FID
will be stored. Therefore, call the RIF setup before fid_get.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/f24d8cad7e4748b8e8e0e16894ca6a20704dea32.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_router: Add a helper to get subport number from a RIF

In the CFF flood mode, responsibility for management of the PGT entries for
rFIDs is moved from FW to the driver. All rFIDs are based off either a
front panel port, or a LAG port. The flood vectors for port-based rFIDs
enable just the port itself, the ones for LAG-based rFIDs enable all member
ports of the LAG in question.

Since all rFIDs based off the same port have the same flood vector, and
similarly for LAG-based rFIDs, the flood entries are shared. The PGT
address of the flood vector is therefore determined based on the port (or
LAG) number of the RIF connected with the rFID.

Add a helper to determine subport number given a RIF, to be used in these
calculations.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/d7ab43cf5b021f785f363f236e4b6780d10eea93.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Extract SFMR packing into a helper

Both mlxsw_sp_fid_op() and mlxsw_sp_fid_edit_op() pack the core of SFMR the
same way. Extract the common code into a helper and call that. Extract out
of that a wrapper that just calls mlxsw_reg_sfmr_pack(), because it will
be useful for the dummy family later on.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/31f32b4d767183f6cb197148d0792feab2efadba.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: spectrum_fid: Drop unnecessary conditions

The caller already only calls mlxsw_sp_fid_flood_tables_init() and
mlxsw_sp_fid_flood_tables_fini() if (fid_family->flood_tables). There
is no configuration where the pointer is non-NULL, but the number of
tables is zero. So drop the conditions.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/897c6841bc756ac632b797bf67ac83c6a66ba359.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: pci: Permit enabling CFF mode

There are FW versions out there that do not support CFF flood mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both controlled flood mode as well as CFF. There
are also FW versions out there that claim to support CFF flood mode, but
then reject or ignore configurations enabling the same. The driver thus has
to have a say in whether an attempt to configure CFF flood mode should even
be made, and what to use as a fallback.

Hence express the feature in terms of "does the driver prefer CFF flood
mode?", and "what flood mode the PCI module managed to configure the FW
with". This gives to the driver a chance to determine whether CFF flood
mode configuration should be attempted.

The latter bit was added in previous patches. In this patch, add the bit
that allows the driver to determine whether CFF enablement should be
attempted, and the enablement code itself.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/41640a0ee58e0a9538f820f7b601a0e35f6449e4.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: core, pci: Add plumbing related to CFF mode

CFF mode, for Compressed FID Flooding, is a way of organizing flood vectors
in the PGT table. The bus module determines whether CFF is supported, can
configure flood mode to CFF if it is, and knows what flood mode has been
configured. Therefore add a bus callback to determine the configured flood
mode. Also add to core an API to query it.

Since after this patch, we rely on mlxsw_pci->flood_mode being set, it
becomes a coding error if a driver invokes this function with a set of
fields that misses the initialization. Warn and bail out in that case.

The CFF mode is not used as of this patch. The code to actually use it will
be added later.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/889d58759dd40f5037f2206b9fc4a78a9240da80.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: reg: Add to SFMR register the fields related to CFF flood mode

Add the field cff_mid_base, which specifies at which point in PGT the
per-FID flood table is stored. Add cff_prf_id, the profile ID, which
determines on which row of the flood table a flood vector can be found for
a given traffic type.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/3ad7ae38cf6534bedcd876f16090d109a814b3e3.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: reg: Extract flood-mode specific part of mlxsw_reg_sfmr_pack()

In CFF mode, it is necessary to set a different set of SFMR fields. Leave
in mlxsw_reg_sfmr_pack() only the common bits, and move the parts relevant
to controlled flood mode directly to the call site.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/6f29639ebc3ca0722272e6c644ca910096469413.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: reg: Drop unnecessary writes from mlxsw_reg_sfmr_pack()

The MLXSW_REG_ZERO at the beginning of the function wipes the whole
payload. There's no need to set vtfp and vv to false explicitly.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/04a51ea7cf31eea0ef7707311d8e864e2d9ef307.1700503644.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: reg: Mark SFGC & some SFMR fields as reserved in CFF mode

Some existing fields and the whole register of SFGC are reserved in CFF
mode. Backport the reservation note to these fields.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/e1d5977a8cb778227e4ea2fd1515529957ce5de7.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: reg: Add Switch FID Flooding Profiles Register

The SFFP register populates the fid flooding profile tables used for the
NVE flooding and Compressed-FID Flooding (CFF).

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/ca42eb67763bd0c7cf035afc62ef73632f3f61a6.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: resources: Add max_cap_nve_flood_prf

max_cap_nve_flood_prf describes maximum number of NVE flooding profiles.
The same value then applies for flooding profiles for flooding in CFF mode.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/064a2e013d879e5f5494167a6c120c4bb85a2204.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: cmd: Add MLXSW_CMD_MBOX_CONFIG_PROFILE_FLOOD_MODE_CFF

PGT, a port-group table is an in-HW block of specialized memory that holds
sets of ports. Allocated within the PGT are series of flood tables that
describe to which ports traffic of various types (unknown UC, BC, MC)
should be flooded from which FID. The hitherto-used layout of these flood
tables is being replaced with a more flexible scheme, called compressed FID
flooding (CFF). CFF can be configured through CONFIG_PROFILE.flood_mode.

In this patch, add MLXSW_CMD_MBOX_CONFIG_PROFILE_FLOOD_MODE_CFF, the value
to use to enable the CFF mode.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/fc2e063742856492f8f22b0b87abf431ea6d53d0.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

mlxsw: cmd: Add cmd_mbox.query_fw.cff_support

PGT, a port-group table is an in-HW block of specialized memory that holds
sets of ports. Allocated within the PGT are series of flood tables that
describe to which ports traffic of various types (unknown UC, BC, MC)
should be flooded from which FID. The hitherto-used layout of these flood
tables is being replaced with a more flexible scheme, called compressed FID
flooding (CFF). CFF can be configured through CONFIG_PROFILE.flood_mode.

cff_support determines whether CONFIG_PROFILE.flood_mode can be set to CFF.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Link: https://lore.kernel.org/r/af727d0e1095e30fa45c7e60404637cdc491aeec.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>

net: do not send a MOVE event when netdev changes netns

Networking supports changing netdevice's netns and name
at the same time. This allows avoiding name conflicts
and having to rename the interface in multiple steps.
E.g. netns1={eth0, eth1}, netns2={eth1} - we want
to move netns1:eth1 to netns2 and call it eth0 there.
If we can't rename "in flight" we'd need to (1) rename
eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0.

To rename the underlying struct device we have to call
device_rename(). The rename()'s MOVE event, however, doesn't
"belong" to either the old or the new namespace.
If there are conflicts on both sides it's actually impossible
to issue a real MOVE (old name -> new name) without confusing
user space. And Daniel reports that such confusions do in fact
happen for systemd, in real life.

Since we already issue explicit REMOVE and ADD events
manually - suppress the MOVE event completely. Move
the ADD after the rename, so that the REMOVE uses
the old name, and the ADD the new one.

If there is no rename this changes the picture as follows:

Before:

old ns | KERNEL[213.399289] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401302] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401397] move     /devices/virtual/net/eth0 (net)

After:

old ns | KERNEL[266.774257] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[266.774509] add      /devices/virtual/net/eth0 (net)

If there is a rename and a conflict (using the exact eth0/eth1
example explained above) we get this:

Before:

old ns | KERNEL[224.316833] remove   /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.318551] add      /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.319662] move     /devices/virtual/net/eth0 (net)

After:

old ns | KERNEL[333.033166] remove   /devices/virtual/net/eth1 (net)
new ns | KERNEL[333.035098] add      /devices/virtual/net/eth0 (net)

Note that "in flight" rename is only performed when needed.
If there is no conflict for old name in the target netns -
the rename will be performed separately by dev_change_name(),
as if the rename was a different command, and there will still
be a MOVE event for the rename:

Before:

old ns | KERNEL[194.416429] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418809] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418869] move     /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.420866] move     /devices/virtual/net/eth1 (net)

After:

old ns | KERNEL[71.917520] remove   /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.919155] add      /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.920729] move     /devices/virtual/net/eth1 (net)

If deleting the MOVE event breaks some user space we should insert
an explicit kobject_uevent(MOVE) after the ADD, like this:

@@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);

+ /* User space wants an explicit MOVE event, issue one unless
+ * dev_change_name() will get called later and issue one.
+ */
+ if (!pat || new_name[0])
+ kobject_uevent(&dev->dev.kobj, KOBJ_MOVE);
+
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/

Reported-by: Daniel Gröber <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Acked-by: Greg Kroah-Hartman <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Jakub Kicinski <[email protected]>

docs: netdev: try to guide people on dealing with silence

There has been more than a few threads which went idle before
the merge window and now people came back to them and started
asking about next steps.

We currently tell people to be patient and not to repost too
often. Our "not too often", however, is still a few orders of
magnitude faster than other subsystems. Or so I feel after
hearing people talk about review rates at LPC.

Clarify in the doc that if the discussion went idle for a week
on netdev, 95% of the time there's no point waiting longer.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: usb: ax88179_178a: avoid two consecutive device resets

The device is always reset two consecutive times (ax88179_reset is called
twice), one from usbnet_probe during the device binding and the other from
usbnet_open.

Remove the non-necessary reset during the device binding and let the reset
operation from open to keep the normal behavior (tested with generic ASIX
Electronics Corp. AX88179 Gigabit Ethernet device).

Reported-by: Herb Wei <[email protected]>
Tested-by: Herb Wei <[email protected]>
Signed-off-by: Jose Ignacio Tornos Martinez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: usb: ax88179_178a: fix failed operations during ax88179_reset

Using generic ASIX Electronics Corp. AX88179 Gigabit Ethernet device,
the following test cycle has been implemented:
    - power on
    - check logs
    - shutdown
    - after detecting the system shutdown, disconnect power
    - after approximately 60 seconds of sleep, power is restored
Running some cycles, sometimes error logs like this appear:
    kernel: ax88179_178a 2-9:1.0 (unnamed net_device) (uninitialized): Failed to write reg index 0x0001: -19
    kernel: ax88179_178a 2-9:1.0 (unnamed net_device) (uninitialized): Failed to read reg index 0x0001: -19
    ...
These failed operation are happening during ax88179_reset execution, so
the initialization could not be correct.

In order to avoid this, we need to increase the delay after reset and
clock initial operations. By using these larger values, many cycles
have been run and no failed operations appear.

It would be better to check some status register to verify when the
operation has finished, but I do not have found any available information
(neither in the public datasheets nor in the manufacturer's driver). The
only available information for the necessary delays is the maufacturer's
driver (original values) but the proposed values are not enough for the
tested devices.

Fixes: e2ca90c276e1f ("ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver")
Reported-by: Herb Wei <[email protected]>
Tested-by: Herb Wei <[email protected]>
Signed-off-by: Jose Ignacio Tornos Martinez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'platform-drivers-x86-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform drivers fixes from Ilpo Järvinen:
"Just a few fixes (one with two non-fix deps) plus tidying up
  MAINTAINERS"

* tag 'platform-drivers-x86-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: intel_telemetry: Fix kernel doc descriptions
  MAINTAINERS: Drop Mark Gross as maintainer for x86 platform drivers
  platform/x86/amd/pmc: adjust getting DRAM size behavior
  platform/x86: hp-bioscfg: Remove unused obj in hp_add_other_attributes()
  platform/x86: hp-bioscfg: Fix error handling in hp_add_other_attributes()
  platform/x86: hp-bioscfg: move mutex_lock() down in hp_add_other_attributes()
  platform/x86: hp-bioscfg: Simplify return check in hp_add_other_attributes()
  platform/x86: ideapad-laptop: Set max_brightness before using it
  MAINTAINERS: Remove stale entry for SBL platform driver

Merge tag 'erofs-for-6.7-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:

- Tidy up erofs_read_inode() for simplicity

- Fix broken fscache mode due to NULL dereference of dif->bdev_handle

- Add the EROFS webpage to MAINTAINERS, documentation, and Kconfig

* tag 'erofs-for-6.7-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  MAINTAINERS: erofs: add EROFS webpage
  erofs: fix NULL dereference of dif->bdev_handle in fscache mode
  erofs: simplify erofs_read_inode()

Merge branch 'selftests-bpf-update-multiple-prog_tests-to-use-assert_-macros'

Yuran Pereira says:

====================
selftests/bpf: Update multiple prog_tests to use ASSERT_ macros

Multiple files/programs in `tools/testing/selftests/bpf/prog_tests/` still
heavily use the `CHECK` macro, even when better `ASSERT_` alternatives are
available.

As it was already pointed out by Yonghong Song [1] in the bpf selftests the use
of the ASSERT_* series of macros is preferred over the CHECK macro.

This patchset replaces the usage of `CHECK(` macros to the equivalent `ASSERT_`
family of macros in the following prog_tests:
- bind_perm.c
- bpf_obj_id.c
- bpf_tcp_ca.c
- vmlinux.c

[1] https://lore.kernel.org/lkml/0a142924-633c-44e6-9a92-2dc019656bf2@linux.dev

Changes in v3:
- Addressed the following points mentioned by Yonghong Song
- Improved `bpf_map_lookup_elem` assertion in bpf_tcp_ca.
- Replaced assertion introduced in v2 with one that checks `thread_ret`
  instead of `pthread_join`. This ensures that `server`'s return value
  (thread_ret) is the one being checked, as oposed to `pthread_join`'s
  return value, since the latter one is less likely to fail.

Changes in v2:
- Fixed pthread_join assertion that broke the previous test

Previous version:
v2 - https://lore.kernel.org/lkml/GV1PR10MB6563AECF8E94798A1E5B36A4E8B6A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
v1 - https://lore.kernel.org/lkml/GV1PR10MB6563FCFF1C5DEBE84FEA985FE8B0A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
====================

Link: https://lore.kernel.org/r/GV1PR10MB6563BEFEA4269E1DDBC264B1E8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM
Signed-off-by: Andrii Nakryiko <[email protected]>

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux

vmlinux.c uses the `CHECK` calls even though the use of ASSERT_ series
of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/GV1PR10MB6563ED1023A2A3AEF30BDA5DE8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id

bpf_obj_id uses the `CHECK` calls even though the use of
ASSERT_ series of macros is preferred in the bpf selftests.

This patch replaces all `CHECK` calls for equivalent `ASSERT_`
macro calls.

Signed-off-by: Yuran Pereira <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/GV1PR10MB65639AA3A10B4BBAA79952C7E8BBA@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM