Git Repo - linux.git/log
5 months ago xsk: Use xsk_buff_pool directly for cq functions
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:58 +0000 (14:24 +0200)]
xsk: Use xsk_buff_pool directly for cq functions

Currently xsk_cq_{reserve_addr,submit,cancel}_locked() take xdp_sock as
an input argument but it is only used for pulling out xsk_buff_pool
pointer from it.

Change mentioned functions to take pool pointer as an input argument to
avoid unnecessary dereferences.
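
The shape of this change can be sketched as follows. This is a minimal model, not the kernel code: the struct layouts, the cq_reserved/cq_size fields and both function names are hypothetical stand-ins chosen only to show the extra dereference going away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct xsk_buff_pool {
	uint32_t cq_reserved;	/* completion queue slots reserved so far */
	uint32_t cq_size;	/* total completion queue slots */
};

struct xdp_sock {
	struct xsk_buff_pool *pool;
};

/* After: the caller already holds the pool pointer and passes it directly. */
static int cq_reserve_via_pool(struct xsk_buff_pool *pool)
{
	assert(pool != NULL);
	if (pool->cq_reserved == pool->cq_size)
		return -1;	/* queue full */
	pool->cq_reserved++;
	return 0;
}

/* Before: the socket is passed in only so that the pool can be pulled out. */
static int cq_reserve_via_sock(struct xdp_sock *xs)
{
	return cq_reserve_via_pool(xs->pool);	/* extra dereference */
}
```

Both variants behave identically; the second simply skips one pointer load when the caller has the pool at hand.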

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago xsk: Wrap duplicated code to function
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:57 +0000 (14:24 +0200)]
xsk: Wrap duplicated code to function

Both allocation paths have exactly the same code responsible for getting
and initializing xskb. Pull it out to common function.

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago xsk: Carry a copy of xdp_zc_max_segs within xsk_buff_pool
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:56 +0000 (14:24 +0200)]
xsk: Carry a copy of xdp_zc_max_segs within xsk_buff_pool

This is so that we avoid dereferencing struct net_device within the hot path.

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago xsk: Get rid of xdp_buff_xsk::orig_addr
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:55 +0000 (14:24 +0200)]
xsk: Get rid of xdp_buff_xsk::orig_addr

Continue the process of dieting xdp_buff_xsk by removing orig_addr
member. It can be calculated from xdp->data_hard_start where it was
previously used, so it is not anything that has to be carried around in
struct used widely in hot path.

This has been used for initializing xdp_buff_xsk::frame_dma during pool
setup and as a shortcut in xp_get_handle() to retrieve address provided
to xsk Rx queue.
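
As a rough illustration of "it can be calculated from xdp->data_hard_start", the sketch below recovers the original address by pointer arithmetic. It assumes data_hard_start is laid out as pool base + original address + headroom; that layout, the struct names and the helper names are assumptions made for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the pool and the buffer. */
struct pool_model {
	uint8_t *addrs;		/* base of the umem mapping */
	uint32_t headroom;	/* bytes reserved in front of each frame */
};

struct xskb_model {
	uint8_t *data_hard_start;	/* assumed: addrs + orig_addr + headroom */
};

/* Set up the buffer for the frame at offset 'addr' within the umem. */
static void xskb_init(struct xskb_model *xskb, const struct pool_model *pool,
		      uint64_t addr)
{
	xskb->data_hard_start = pool->addrs + addr + pool->headroom;
}

/* Recompute the original address instead of storing it in the buffer. */
static uint64_t xskb_orig_addr(const struct xskb_model *xskb,
			       const struct pool_model *pool)
{
	assert(xskb->data_hard_start >= pool->addrs + pool->headroom);
	return (uint64_t)(xskb->data_hard_start - pool->addrs) - pool->headroom;
}
```

The trade-off is one subtraction at the few places the address is needed, in exchange for a smaller structure on the hot path.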

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago xsk: s/free_list_node/list_node/
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:54 +0000 (14:24 +0200)]
xsk: s/free_list_node/list_node/

Now that free_list_node's purpose is twofold, make it just a
'list_node'.

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago xsk: Get rid of xdp_buff_xsk::xskb_list_node
Maciej Fijalkowski [Mon, 7 Oct 2024 12:24:53 +0000 (14:24 +0200)]
xsk: Get rid of xdp_buff_xsk::xskb_list_node

Let's bring xdp_buff_xsk back to occupying 2 cachelines by removing
xskb_list_node - for the purpose of gathering the xskb frags
free_list_node can be used, head of the list (xsk_buff_pool::xskb_list)
stays as-is, just reuse the node ptr.

It is safe to do as a single xdp_buff_xsk can never reside in two
pool's lists simultaneously.
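
The safety argument can be illustrated with a small list model. This is a sketch under stated assumptions: the list helpers are simplified re-implementations, the struct names are hypothetical, and the assert merely encodes the invariant stated above (a buffer is never linked on two lists at once).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
	n->prev = n->next = n;
}

static int list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical pool with two list heads, and a buffer with ONE node. */
struct pool_model {
	struct list_node free_list;
	struct list_node xskb_list;
};

struct xskb_model {
	struct list_node free_list_node;
};

/* Link the buffer on 'head'; it must not be linked anywhere else. */
static void xskb_enqueue(struct list_node *head, struct xskb_model *xskb)
{
	assert(list_is_empty(&xskb->free_list_node));
	list_add_tail(&xskb->free_list_node, head);
}
```

Because the single node is unlinked from one list before it is linked on the other, a second node member buys nothing except extra bytes in the structure.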

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago Merge branch 'selftests/bpf: add coverage for xdp_features in test_progs'
Martin KaFai Lau [Fri, 11 Oct 2024 00:53:55 +0000 (17:53 -0700)]
Merge branch 'selftests/bpf: add coverage for xdp_features in test_progs'

Alexis Lothoré says:

====================
This small series aims to increase coverage of xdp features in
test_progs. The initial versions proposed to rework test_xdp_features.sh
to make it fit in test_progs, but some discussions in v1 and v2 showed
that the script is still needed as a standalone tool. So this new
revision leaves test_xdp_features.sh as-is, and instead adds the missing
coverage to an existing test (cpu map). The new revision is now also a
follow-up to the update performed by Florian Kauer in [1] for devmap
programs testing.

[1] https://lore.kernel.org/bpf/20240911-devel-koalo-fix-ingress-ifindex-v4-2-5c643ae10258@linutronix.de/
---
Changes in v3:
- Drop xdp_features rework commit
- update xdp_cpumap_attach to extend its coverage
- Link to v2: https://lore.kernel.org/r/20240910-convert_xdp_tests-v2-1-a46367c9d038@bootlin.com

Changes in v2:
- fix endianness management in userspace packet parsing (call htonl on
  constant rather than packet part)

The new test has been run in a local x86 environment and in CI:
 #560/1   xdp_cpumap_attach/CPUMAP with programs in entries:OK
 #560/2   xdp_cpumap_attach/CPUMAP with frags programs in entries:OK
 #560     xdp_cpumap_attach:OK
 Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
====================

Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: check program redirect in xdp_cpumap_attach
Alexis Lothoré (eBPF Foundation) [Wed, 9 Oct 2024 10:12:09 +0000 (12:12 +0200)]
selftests/bpf: check program redirect in xdp_cpumap_attach

xdp_cpumap_attach, in its current form, only checks that an xdp cpumap
program can be executed, but not that it correctly performs the cpu
redirect as configured by userspace (bpf_prog_test_run_opts will return
success even if the redirect program returns an error).

Add a check to ensure that the program performs the configured redirect
as well. The check is based on a global variable incremented by a
chained program executed only if the redirect program properly executes.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: make xdp_cpumap_attach keep redirect prog attached
Alexis Lothoré (eBPF Foundation) [Wed, 9 Oct 2024 10:12:08 +0000 (12:12 +0200)]
selftests/bpf: make xdp_cpumap_attach keep redirect prog attached

The current test only checks attach/detach on a cpu map type program, and
so does not check that it can be properly executed, nor that it
redirects correctly.

Update the existing test to extend its coverage:
- keep the redirected program loaded
- try to execute it through bpf_prog_test_run_opts with some dummy
  context

While at it, bring the following minor improvements:
- isolate test interface in its own namespace

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: fix bpf_map_redirect call for cpu map test
Alexis Lothoré (eBPF Foundation) [Wed, 9 Oct 2024 10:12:07 +0000 (12:12 +0200)]
selftests/bpf: fix bpf_map_redirect call for cpu map test

xdp_redir_prog currently redirects packets based on the entry at index 1
in cpu_map, but the corresponding test only manipulates the entry at
index 0. This does not really affect the test in its current form since
the program is detached before having the opportunity to execute, but it
needs to be fixed before being able to improve the corresponding test (i.e.,
not only test attach/detach but also the redirect feature).

Fix this XDP program by making it redirect packets based on entry 0 in
cpu_map instead of entry 1.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: add tcx netns cookie tests
Mahe Tardy [Mon, 7 Oct 2024 09:59:58 +0000 (09:59 +0000)]
selftests/bpf: add tcx netns cookie tests

Add a netns cookie test that verifies the helper is now supported and works
in the context of tc programs.

Signed-off-by: Mahe Tardy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago bpf: add get_netns_cookie helper to tc programs
Mahe Tardy [Mon, 7 Oct 2024 09:59:57 +0000 (09:59 +0000)]
bpf: add get_netns_cookie helper to tc programs

This is needed in the context of Cilium and Tetragon to retrieve netns
cookie from hostns when traffic leaves Pod, so that we can correlate
skb->sk's netns cookie.

Signed-off-by: Mahe Tardy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: add missing header include for htons
Alexis Lothoré (eBPF Foundation) [Tue, 8 Oct 2024 14:50:57 +0000 (16:50 +0200)]
selftests/bpf: add missing header include for htons

Including the network_helpers.h header in tests can lead to the following
build error:

./network_helpers.h: In function ‘csum_tcpudp_magic’:
./network_helpers.h:116:14: error: implicit declaration of function \
  ‘htons’ [-Werror=implicit-function-declaration]
  116 |         s += htons(proto + len);

The error is avoided in many cases thanks to some other headers included
earlier and bringing in arpa/inet.h (ie: test_progs.h).

Make sure that test_progs build success does not depend on header ordering
by adding the missing header include in network_helpers.h
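
A minimal illustration of the rule being applied: a header or source file that calls htons() should include arpa/inet.h itself rather than rely on an earlier include. The helper below is hypothetical and is not the csum_tcpudp_magic() implementation; it only reproduces the htons(proto + len) expression from the error message.

```c
#include <arpa/inet.h>	/* declares htons(); do not rely on include order */
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: protocol number plus length, in network byte order. */
static uint16_t proto_len_be(uint8_t proto, uint16_t len)
{
	assert(len <= UINT16_MAX - proto);	/* the sum must fit in 16 bits */
	return htons((uint16_t)(proto + len));
}
```

With the include in place, the file builds regardless of which other headers happen to precede it.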

Fixes: f6642de0c3e9 ("selftests/bpf: Add csum helpers")
Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago Merge branch 'netkit: Add option for scrubbing skb meta data'
Martin KaFai Lau [Tue, 8 Oct 2024 01:42:40 +0000 (18:42 -0700)]
Merge branch 'netkit: Add option for scrubbing skb meta data'

Daniel Borkmann says:

=====================
This series is to add a NETKIT_SCRUB_NONE mode such that
the netkit device will not scrub the skb->{mark, priority} before
running the netkit bpf prog. This will allow the netkit bpf prog to
implement different policies based on the skb->{mark, priority}.

The default mode NETKIT_SCRUB_DEFAULT will always scrub
the skb->{mark, priority} before calling the netkit bpf prog. This
is the existing behavior of the netkit device and this change
will not affect the existing netkit users.
=====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: Extend netkit tests to validate skb meta data
Daniel Borkmann [Fri, 4 Oct 2024 10:13:35 +0000 (12:13 +0200)]
selftests/bpf: Extend netkit tests to validate skb meta data

Add a small netkit test to validate skb mark and priority under the
default scrubbing as well as with mark and priority scrubbing off.

  # ./vmtest.sh -- ./test_progs -t netkit
  [...]
  ./test_progs -t netkit
  [    1.419662] tsc: Refined TSC clocksource calibration: 3407.993 MHz
  [    1.420151] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcd52370, max_idle_ns: 440795242006 ns
  [    1.420897] clocksource: Switched to clocksource tsc
  [    1.447996] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.448447] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #357     tc_netkit_basic:OK
  #358     tc_netkit_device:OK
  #359     tc_netkit_multi_links:OK
  #360     tc_netkit_multi_opts:OK
  #361     tc_netkit_neigh_links:OK
  #362     tc_netkit_pkt_type:OK
  #363     tc_netkit_scrub:OK
  Summary: 7/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago tools: Sync if_link.h uapi tooling header
Daniel Borkmann [Fri, 4 Oct 2024 10:13:34 +0000 (12:13 +0200)]
tools: Sync if_link.h uapi tooling header

Sync if_link uapi header to the latest version as we need the refresher
in tooling for netkit device. Given it's been a while since the last sync
and the diff is fairly big, it has been done as its own commit.

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago netkit: Add add netkit scrub support to rt_link.yaml
Daniel Borkmann [Fri, 4 Oct 2024 10:13:33 +0000 (12:13 +0200)]
netkit: Add add netkit scrub support to rt_link.yaml

Add netkit scrub attribute support to the rt_link.yaml spec file.

Example:

  # ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
   --do getlink --json '{"ifname": "nk0"}' --output-json | jq
  [...]
  "linkinfo": {
    "kind": "netkit",
    "data": {
      "primary": 0,
      "policy": "forward",
      "mode": "l3",
      "scrub": "default",
      "peer-policy": "forward",
      "peer-scrub": "default"
    }
  },
  [...]

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago netkit: Simplify netkit mode over to use NLA_POLICY_MAX
Daniel Borkmann [Fri, 4 Oct 2024 10:13:32 +0000 (12:13 +0200)]
netkit: Simplify netkit mode over to use NLA_POLICY_MAX

Jakub suggested relying on netlink policy validation via NLA_POLICY_MAX()
instead of open-coding it. netkit_check_mode() is a candidate which can
be simplified through this as well aside from the netkit scrubbing one.

Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago netkit: Add option for scrubbing skb meta data
Daniel Borkmann [Fri, 4 Oct 2024 10:13:31 +0000 (12:13 +0200)]
netkit: Add option for scrubbing skb meta data

Jordan reported that when running Cilium with netkit in per-endpoint-routes
mode, network policy misclassifies traffic. In this direct routing mode
of Cilium which is used in case of GKE/EKS/AKS, the Pod's BPF program to
enforce policy sits on the netkit primary device's egress side.

The issue here is that in case of netkit's netkit_prep_forward(), it will
clear meta data such as skb->mark and skb->priority before executing the
BPF program. Thus, identity data stored in there from earlier BPF programs
(e.g. from tcx ingress on the physical device) gets cleared instead of
being made available for the primary's program to process. While for traffic
egressing the Pod via the peer device this might be desired, this is
different for the primary one where compared to tcx egress on the host
veth this information would be available.

To address this, add a new parameter for the device orchestration to
allow control of skb->mark and skb->priority scrubbing, to make the two
accessible from BPF (and eventually leave it up to the program to scrub).
By default, the current behavior is retained. For netkit peer this also
enables the use case where applications could cooperate/signal intent to
the BPF program.
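
The behaviour described above can be modelled in a few lines. This is only a sketch: the enum names and values, the skb_model struct and prep_forward() are hypothetical stand-ins, and the real netkit_prep_forward() does more than clear these two fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode names; the numeric values are placeholders. */
enum scrub_model {
	MODEL_SCRUB_NONE,	/* leave mark and priority for the BPF program */
	MODEL_SCRUB_DEFAULT,	/* existing behaviour: clear both fields */
};

/* Hypothetical stand-in for the two skb fields of interest. */
struct skb_model {
	uint32_t mark;
	uint32_t priority;
};

/* Runs before the BPF program attached to the device sees the packet. */
static void prep_forward(struct skb_model *skb, enum scrub_model scrub)
{
	assert(skb != NULL);
	if (scrub == MODEL_SCRUB_DEFAULT) {
		skb->mark = 0;
		skb->priority = 0;
	}
}
```

With the "none" mode, identity data written into the mark by an earlier program survives until the netkit program runs, which can then decide whether to clear it.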

Note that struct netkit has a 4 byte hole between policy and bundle which
is used here, in other words, struct netkit's first cacheline content used
in fast-path does not get moved around.

Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device")
Reported-by: Jordan Rife <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Link: https://github.com/cilium/cilium/issues/34042
Acked-by: Jakub Kicinski <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago bpf: Remove unused macro
Maciej Fijalkowski [Tue, 1 Oct 2024 20:06:05 +0000 (22:06 +0200)]
bpf: Remove unused macro

Commit 7aebfa1b3885 ("bpf: Support narrow loads from bpf_sock_addr.user_port")
removed one and only SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD callsite but kept
the macro. Remove it to clean up the code base. Found while getting lost in
the BPF code.

Signed-off-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
5 months ago Merge branch 'selftests/bpf: new MPTCP subflow subtest'
Martin KaFai Lau [Tue, 1 Oct 2024 00:20:42 +0000 (17:20 -0700)]
Merge branch 'selftests/bpf: new MPTCP subflow subtest'

Matthieu Baerts says:

====================
In this series from Geliang, modifying MPTCP BPF selftests, we have:

- A new MPTCP subflow BPF program setting socket options per subflow: it
  looks better to have this old test program in the BPF selftests to
  track regressions and to serve as example.

  Note: Nicolas is no longer working at Tessares, but he did this work
  while working for them, and his email address is no longer available.

- A new hook in the same BPF program to do the verification step.

- A new MPTCP BPF subtest validating the new BPF program added in the
  first patch, with the help of the new hook added in the second patch.

---
Changes in v7:
- Patch 2/3: use 'can_loop' instead of 'cond_break'. (Martin)
- Patch 3/3: use bpf_program__attach_cgroup(). (Martin)
- Link to v6: https://lore.kernel.org/r/20240911-upstream-bpf-next-20240506-mptcp-subflow-test-v6-0-7872294c466b@kernel.org

Changes in v6:
- Patch 3/3: use usleep() instead of sleep()
- Series: rebased on top of bpf-next/net
- Link to v5: https://lore.kernel.org/r/20240910-upstream-bpf-next-20240506-mptcp-subflow-test-v5-0-2c664a7da47c@kernel.org

Changes in v5:
- See the individual changelog for more details about them
- Patch 1/3: set TCP on the 2nd subflow
- Patch 2/3: new
- Patch 3/3: use the BPF program from patch 2/3 to do the validation
             instead of using ss.
- Series: rebased on top of bpf-next/net
- Link to v4: https://lore.kernel.org/r/20240805-upstream-bpf-next-20240506-mptcp-subflow-test-v4-0-2b4ca6994993@kernel.org

Changes in v4:
- Drop former patch 2/3: MPTCP's pm_nl_ctl requires a new header file:
  - I will check later if it is possible to avoid having duplicated
    header files in tools/include/uapi, but no need to block this series
    for that. Patch 2/3 can be added later if needed.
- Patch 2/2: skip the test if 'ip mptcp' is not available.
- Link to v3: https://lore.kernel.org/r/20240703-upstream-bpf-next-20240506-mptcp-subflow-test-v3-0-ebdc2d494049@kernel.org

Changes in v3:
- Sorry for the delay between v2 and v3, this series was conflicting
  with the "add netns helpers", but it looks like it is on hold:
  https://lore.kernel.org/cover.1715821541[email protected]
- Patch 1/3 includes "bpf_tracing_net.h", introduced in between.
- New patch 2/3: "selftests/bpf: Add mptcp pm_nl_ctl link".
- Patch 3/3: use the tool introduced in patch 2/3 + SYS_NOFAIL() helper.
- Link to v2: https://lore.kernel.org/r/20240509-upstream-bpf-next-20240506-mptcp-subflow-test-v2-0-4048c2948665@kernel.org

Changes in v2:
- Previous patches 1/4 and 2/4 have been dropped from this series:
  - 1/4: "selftests/bpf: Handle SIGINT when creating netns":
    - A new version, more generic and no longer specific to MPTCP BPF
      selftest will be sent later, as part of a new series. (Alexei)
  - 2/4: "selftests/bpf: Add RUN_MPTCP_TEST macro":
    - Removed, not to hide helper functions in macros. (Alexei)
- The commit message of patch 1/2 has been clarified to avoid some
  possible confusion spotted by Alexei.
- Link to v1: https://lore.kernel.org/r/20240507-upstream-bpf-next-20240506-mptcp-subflow-test-v1-0-e2bcbdf49857@kernel.org

---
Geliang Tang (2):
      selftests/bpf: Add getsockopt to inspect mptcp subflow
      selftests/bpf: Add mptcp subflow subtest
====================

Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: Add mptcp subflow subtest
Geliang Tang [Thu, 26 Sep 2024 17:30:24 +0000 (19:30 +0200)]
selftests/bpf: Add mptcp subflow subtest

This patch adds a subtest named test_subflow in test_mptcp to load and
verify the newly added MPTCP subflow BPF program. The goal is to make
sure it is possible to set different socket options per subflow, while
the userspace socket interface only lets the application set the same
socket options for the whole MPTCP connection and its multiple subflows.

To check that, a client and a server are started in a dedicated netns,
with veth interfaces to simulate multiple paths. They will exchange data
to allow the creation of an additional subflow.

When the different subflows are being created, the new MPTCP subflow BPF
program will set some socket options: marks and TCP CC. The validation
is done by the same program, when the userspace checks the value of the
modified socket options. On the userspace side, it will see that the
default values are still being used on the MPTCP connection, while the
BPF program will see different options set per subflow of the same MPTCP
connection.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/76
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240926-upstream-bpf-next-20240506-mptcp-subflow-test-v7-3-d26029e15cdd@kernel.org
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: Add getsockopt to inspect mptcp subflow
Geliang Tang [Thu, 26 Sep 2024 17:30:23 +0000 (19:30 +0200)]
selftests/bpf: Add getsockopt to inspect mptcp subflow

This patch adds a "cgroup/getsockopt" way to inspect the subflows of an
MPTCP socket, and verify the modifications done by the same BPF program
in the previous commit: a different mark per subflow, and a different
TCP CC set on the second one. This new hook will be used by the next
commit to verify the socket options set on each subflow.

This extra "cgroup/getsockopt" prog walks the msk->conn_list and uses
bpf_core_cast to cast a pointer for read-only access. This allows
inspecting all the fields of a structure.

Note that on the kernel side, the MPTCP socket stores a list of subflows
under 'msk->conn_list'. They can be iterated using the generic 'list'
helpers. They have been imported here, with a small difference:
list_for_each_entry() uses 'can_loop' to limit the number of iterations,
and ease its use. Because only data need to be read here, it is enough
to use this technique. It is planned to use bpf_iter, when BPF programs
will be used to modify data from the different subflows.
The mptcp_subflow_tcp_sock() and mptcp_for_each_subflow() helpers have also
been imported.
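
The idea of a read-only walk with a bounded number of iterations can be sketched in plain C. This is a userspace model only: MODEL_MAX_ITER is a hypothetical cap standing in for the bound that 'can_loop' provides in a BPF program, and the singly linked subflow_model list is a simplification of msk->conn_list.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ITER 8	/* hypothetical iteration cap */

/* Hypothetical, simplified subflow entry. */
struct subflow_model {
	struct subflow_model *next;
	unsigned int mark;
};

/* Count subflows, giving up once MODEL_MAX_ITER entries were visited. */
static unsigned int count_subflows(const struct subflow_model *head)
{
	unsigned int n = 0;

	for (const struct subflow_model *sf = head;
	     sf != NULL && n < MODEL_MAX_ITER; sf = sf->next)
		n++;

	assert(n <= MODEL_MAX_ITER);
	return n;
}
```

The cap guarantees termination even on a corrupted or cyclic list, which is sufficient when the data is only being read.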

Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240926-upstream-bpf-next-20240506-mptcp-subflow-test-v7-2-d26029e15cdd@kernel.org
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago selftests/bpf: Add mptcp subflow example
Nicolas Rybowski [Thu, 26 Sep 2024 17:30:22 +0000 (19:30 +0200)]
selftests/bpf: Add mptcp subflow example

Move Nicolas' patch into bpf selftests directory. This example adds a
different mark (SO_MARK) on each subflow, and changes the TCP CC only on
the first subflow.

From the userspace, an application can do a setsockopt() on an MPTCP
socket, and typically the same value will be propagated to all subflows
(paths). If someone wants to have different values per subflow, the
recommended way is to use BPF. So it is good to add such an example here,
and make sure there are no regressions.

This example shows how it is possible to:

    - Identify the parent msk of an MPTCP subflow.
    - Set different socket options for each subflow of the same MPTCP
      connection.

Here especially, two different behaviours are implemented:

    - A socket mark (SOL_SOCKET SO_MARK) is put on each subflow of the same
      MPTCP connection. The order of creation of the current subflow defines
      its mark.
    - The TCP CC algorithm of the very first subflow of an MPTCP
      connection is set to "reno".
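
The "order of creation defines its mark" rule can be modelled as a per-connection counter. This is a plain C sketch, not the BPF program: the fixed-size table, MODEL_MAX_CONNS and next_subflow_mark() are hypothetical, and the congestion control part is left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_CONNS 16	/* hypothetical table size */

/* One counter per MPTCP connection, keyed by a connection token. */
struct conn_counter {
	uint32_t token;
	uint32_t count;
	int used;
};

static struct conn_counter counters[MODEL_MAX_CONNS];

/*
 * Return the mark for a new subflow of connection 'token': 1 for the
 * first subflow, 2 for the second, and so on. Returns 0 if the table
 * is full.
 */
static uint32_t next_subflow_mark(uint32_t token)
{
	struct conn_counter *slot = NULL;

	for (size_t i = 0; i < MODEL_MAX_CONNS; i++) {
		if (counters[i].used && counters[i].token == token)
			return ++counters[i].count;
		if (!counters[i].used && slot == NULL)
			slot = &counters[i];
	}
	if (slot == NULL)
		return 0;

	assert(slot->count == 0);	/* unused slots start zeroed */
	slot->used = 1;
	slot->token = token;
	slot->count = 1;
	return slot->count;
}
```

Each new subflow of a known connection gets the next integer as its mark, while a subflow of an unknown connection starts a fresh counter at 1.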

This is just to show it is possible to identify an MPTCP connection, and
set socket options, from different SOL levels, per subflow. "reno" has
been picked because it is built-in and usually not set as default one.
It is easy to verify with 'ss' that these modifications have been
applied correctly. That's what the next patch is going to do.

Nicolas' code comes from:

    commit 4d120186e4d6 ("bpf:examples: update mptcp_set_mark_kern.c")

from the MPTCP repo https://github.com/multipath-tcp/mptcp_net-next (the
"scripts" branch), and it has been adapted by Geliang.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/76
Co-developed-by: Geliang Tang <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Nicolas Rybowski <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240926-upstream-bpf-next-20240506-mptcp-subflow-test-v7-1-d26029e15cdd@kernel.org
Signed-off-by: Martin KaFai Lau <[email protected]>
5 months ago cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"
Colin Ian King [Mon, 23 Sep 2024 12:26:00 +0000 (13:26 +0100)]
cxgb4: clip_tbl: Fix spelling mistake "wont" -> "won't"

There are spelling mistakes in dev_err and dev_info messages. Fix them.

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 months ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Paolo Abeni [Fri, 27 Sep 2024 06:13:52 +0000 (08:13 +0200)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts and no adjacent changes.

Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Merge tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 26 Sep 2024 17:27:10 +0000 (10:27 -0700)]
Merge tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  It looks like most people are still traveling: both the ML volume
  and the processing capacity are low.

  Previous releases - regressions:

    - netfilter:
        - nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put()
        - nf_tables: keep deleted flowtable hooks until after RCU

    - tcp: check skb is non-NULL in tcp_rto_delta_us()

    - phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not
      present

    - eth: virtio_net: fix mismatched buf address when unmapping for
      small packets

    - eth: stmmac: fix zero-division error when disabling tc cbs

    - eth: bonding: fix unnecessary warnings and logs from
      bond_xdp_get_xmit_slave()

  Previous releases - always broken:

    - netfilter:
        - fix clash resolution for bidirectional flows
        - fix allocation with no memcg accounting

    - eth: r8169: add tally counter fields added with RTL8125

    - eth: ravb: fix rx and tx frame size limit"

* tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
  selftests: netfilter: Avoid hanging ipvs.sh
  kselftest: add test for nfqueue induced conntrack race
  netfilter: nfnetlink_queue: remove old clash resolution logic
  netfilter: nf_tables: missing objects with no memcg accounting
  netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
  netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
  netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
  netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
  docs: tproxy: ignore non-transparent sockets in iptables
  netfilter: ctnetlink: Guard possible unused functions
  selftests: netfilter: nft_tproxy.sh: add tcp tests
  selftests: netfilter: add reverse-clash resolution test case
  netfilter: conntrack: add clash resolution for reverse collisions
  netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
  selftests/net: packetdrill: increase timing tolerance in debug mode
  usbnet: fix cyclical race on disconnect with work queue
  net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled
  virtio_net: Fix mismatched buf address when unmapping for small packets
  bonding: Fix unnecessary warnings and logs from bond_xdp_get_xmit_slave()
  r8169: add missing MODULE_FIRMWARE entry for RTL8126A rev.b
  ...

5 months ago Merge tag 'char-misc-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Thu, 26 Sep 2024 17:13:08 +0000 (10:13 -0700)]
Merge tag 'char-misc-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char / misc driver updates from Greg KH:
 "Here is the "big" set of char/misc and other driver subsystem changes
  for 6.12-rc1.

  Lots of changes in here, primarily dominated by the usual IIO driver
  updates and additions, but there are also small driver subsystem
  updates all over the place. Included in here are:

   - lots and lots of new IIO drivers and updates to existing ones

   - interconnect subsystem updates and new drivers

   - nvmem subsystem updates and new drivers

   - mhi driver updates

   - power supply subsystem updates

   - kobj_type const work for many different small subsystems

   - comedi driver fix

   - coresight subsystem and driver updates

   - fpga subsystem improvements

   - slimbus fixups

   - binder new feature addition for "frozen" notifications

   - lots and lots of other small driver updates and cleanups

  All of these have been in linux-next for a long time with no reported
  problems"

* tag 'char-misc-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (354 commits)
  greybus: gb-beagleplay: Add firmware upload API
  arm64: dts: ti: k3-am625-beagleplay: Add bootloader-backdoor-gpios to cc1352p7
  dt-bindings: net: ti,cc1352p7: Add bootloader-backdoor-gpios
  MAINTAINERS: Update path for U-Boot environment variables YAML
  nvmem: layouts: add U-Boot env layout
  comedi: ni_routing: tools: Check when the file could not be opened
  ocxl: Remove the unused declarations in headr file
  hpet: Fix the wrong format specifier
  uio: Constify struct kobj_type
  cxl: Constify struct kobj_type
  binder: modify the comment for binder_proc_unlock
  iio: adc: axp20x_adc: add support for AXP717 ADC
  dt-bindings: iio: adc: Add AXP717 compatible
  iio: adc: axp20x_adc: Add adc_en1 and adc_en2 to axp_data
  w1: ds2482: Drop explicit initialization of struct i2c_device_id::driver_data to 0
  tools: iio: rm .*.cmd when make clean
  iio: adc: standardize on formatting for id match tables
  iio: proximity: aw96103: Add support for aw96103/aw96105 proximity sensor
  bus: mhi: host: pci_generic: Enable EDL trigger for Foxconn modems
  bus: mhi: host: pci_generic: Update EDL firmware path for Foxconn modems
  ...

5 months ago Merge tag 'staging-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Thu, 26 Sep 2024 17:04:35 +0000 (10:04 -0700)]
Merge tag 'staging-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver updates from Greg KH:
 "Here is the big set of staging driver cleanups and removals for
  6.12-rc1.

  Nothing exciting here, just slow, constant, forward progress in
  removing code and cleaning up some old drivers, along with removing
  one of them that was not being used anymore at all. In discussions
  with some developers this past week, even more deletions will be
  happening for the next major merge window, as we seem to have code
  here that obviously no one is using anymore.

  Along with the normal cleanups is the good vme_user code forward
  progress, the one major bright spot in the staging subsystem for code
  that people rely on, and is getting good development behind it.
  Hopefully it can graduate out of staging "soon".

  All of these changes have been in linux-next for a long time with no
  reported problems"

* tag 'staging-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (141 commits)
  staging: vt6655: Rename variable apTD1Rings
  staging: vt6655: Rename variable apTD0Rings
  staging: rtl8723bs: remove unused 'poll_cnt' from rtw_set_rpwm()
  staging: rtl8723bs: remove unused cnt from recv_func()
  staging: rtl8723bs: remove unused efuseValue from efuse_OneByteWrite()
  staging: rtl8712: remove unused drvinfo_sz from update_recvframe_attrib
  staging: vt6655: mac.h: Fix possible precedence issue in macros
  staging: rtl8723bs: include: Remove spaces before tabs in rtw_security.h
  staging: rtl8723bs: include: Fix trailing */ position in rtw_security.h
  staging: rtl8723bs: include: Fix indent for else block struct in rtw_security.h
  staging: rtl8723bs: include: Fix indent for struct _byte_ in rtw_security.h
  staging: rtl8723bs: include: Fix use of tabs for indent in rtw_security.h
  staging: rtl8723bs: include: Fix indent for switch block in rtw_security.h
  staging: rtl8723bs: include: Fix indent for switch case in rtw_security.h
  staging: rtl8723bs: include: Fix open brace position in rtw_security.h
  staging: nvec: Use IRQF_NO_AUTOEN flag in request_irq()
  staging: rtl8723bs: Remove unused file rtw_rf.c
  staging: rtl8723bs: Remove unused function rtw_ch2freq
  staging: rtl8723bs: Remove unused files rtw_debug.c and rtw_debug.h
  staging: rtl8723bs: Remove unused function dump_4_regs
  ...

5 months ago Merge tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Thu, 26 Sep 2024 16:59:50 +0000 (09:59 -0700)]
Merge tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty / serial driver updates from Greg KH:
 "Here is the "big" set of tty/serial driver updates for 6.12-rc1.

  Nothing major in here, just nice forward progress in the slow cleanup
  of the serial apis, and lots of other driver updates and fixes.

  Included in here are:

   - serial api updates from Jiri to make things more uniform and sane

   - 8250_platform driver cleanups

   - samsung serial driver fixes and updates

   - qcom-geni serial driver fixes from Johan for the bizarre UART
     engine that that chip seems to have. Hopefully it's in a better
     state now, but hardware designers still seem to come up with more
     ways to make broken UARTS 40+ years after this all should have
     finished.

   - sc16is7xx driver updates

   - omap 8250 driver updates

   - 8250_bcm2835aux driver updates

   - a few new serial driver bindings added

   - other serial minor driver updates

  All of these have been in linux-next for a long time with no reported
  problems"

* tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
  tty: serial: samsung: Fix serial rx on Apple A7-A9
  tty: serial: samsung: Fix A7-A11 serial earlycon SError
  tty: serial: samsung: Use bit manipulation macros for APPLE_S5L_*
  tty: rp2: Fix reset with non forgiving PCIe host bridges
  serial: 8250_aspeed_vuart: Enable module autoloading
  serial: qcom-geni: fix polled console corruption
  serial: qcom-geni: disable interrupts during console writes
  serial: qcom-geni: fix console corruption
  serial: qcom-geni: introduce qcom_geni_serial_poll_bitfield()
  serial: qcom-geni: fix arg types for qcom_geni_serial_poll_bit()
  soc: qcom: geni-se: add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
  serial: qcom-geni: fix false console tx restart
  serial: qcom-geni: fix fifo polling timeout
  tty: hvc: convert comma to semicolon
  mxser: convert comma to semicolon
  serial: 8250_bcm2835aux: Fix clock imbalance in PM resume
  serial: sc16is7xx: convert bitmask definitions to use BIT() macro
  serial: sc16is7xx: fix copy-paste errors in EFR_SWFLOWx_BIT constants
  serial: sc16is7xx: remove SC16IS7XX_MSR_DELTA_MASK
  serial: xilinx_uartps: Make cdns_rs485_supported static
  ...

5 months ago Merge tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Thu, 26 Sep 2024 16:45:36 +0000 (09:45 -0700)]
Merge tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB/Thunderbolt updates from Greg KH:
 "Here is the large set of USB and Thunderbolt changes for 6.12-rc1.

  Nothing "major" in here, except for a new 9p network gadget that has
  been worked on for a long time (all of the needed acks are here)

  Other than that, it's the usual set of:

   - Thunderbolt / USB4 driver updates and additions for new hardware

   - dwc3 driver updates and new features added

   - xhci driver updates

   - typec driver updates

   - USB gadget updates and api additions to make some gadgets more
     configurable by userspace

   - dwc2 driver updates

   - usb phy driver updates

   - usbip feature additions

   - other minor USB driver updates

  All of these have been in linux-next for a long time with no reported
  issues"

* tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (145 commits)
  sub: cdns3: Use predefined PCI vendor ID constant
  sub: cdns2: Use predefined PCI vendor ID constant
  USB: misc: yurex: fix race between read and write
  USB: misc: cypress_cy7c63: check for short transfer
  USB: appledisplay: close race between probe and completion handler
  USB: class: CDC-ACM: fix race between get_serial and set_serial
  usb: r8a66597-hcd: make read-only const arrays static
  usb: typec: ucsi: Fix busy loop on ASUS VivoBooks
  usb: dwc3: rtk: Clean up error code in __get_dwc3_maximum_speed()
  usb: storage: ene_ub6250: Fix right shift warnings
  usb: roles: Improve the fix for a false positive recursive locking complaint
  locking/mutex: Introduce mutex_init_with_key()
  locking/mutex: Define mutex_init() once
  net/9p/usbg: fix CONFIG_USB_GADGET dependency
  usb: xhci: fix loss of data on Cadence xHC
  usb: xHCI: add XHCI_RESET_ON_RESUME quirk for Phytium xHCI host
  usb: dwc3: imx8mp: disable SS_CON and U3 wakeup for system sleep
  usb: dwc3: imx8mp: add 2 software managed quirk properties for host mode
  usb: host: xhci-plat: Parse xhci-missing_cas_quirk and apply quirk
  usb: misc: onboard_usb_dev: add Microchip usb5744 SMBus programming support
  ...

5 months ago Merge tag 'hid-for-linus-2024092601' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 26 Sep 2024 16:25:28 +0000 (09:25 -0700)]
Merge tag 'hid-for-linus-2024092601' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID fix from Jiri Kosina:
 "A revert of Device Tree binding for Goodix SPI HID driver (while
  keeping ACPI still available), as it conflicted with already existing
  binding and the original submitter didn't respond in time with a fix.

  We will be looking into ways how to reintroduce it properly (we have
  to agree on a way how to handle cases where vendor uses the very same
  product ID for I2C and SPI parts, leading to this kind of conflict). But
  before that is settled, let's revert it to unbreak everybody else
  (Krzysztof Kozlowski)"

* tag 'hid-for-linus-2024092601' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  dt-bindings: input: Revert "dt-bindings: input: Goodix SPI HID Touchscreen"
  HID: hid-goodix: drop unsupported and undocumented DT part

5 months ago Merge tag 'v6.12-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Thu, 26 Sep 2024 16:20:19 +0000 (09:20 -0700)]
Merge tag 'v6.12-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Most are from the recent SMB3.1.1 test event, and also an important
  netfs fix for a cifs mtime write regression

   - fix mode reported by stat of readonly directories and files

   - DFS (global namespace) related fixes

   - fixes for special file support via reparse points

   - mount improvement and reconnect fix

   - fix for noisy log message on umount

   - two netfs related fixes, one fixing a recent regression, and add
     new write tracepoint"

* tag 'v6.12-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  netfs, cifs: Fix mtime/ctime update for mmapped writes
  cifs: update internal version number
  smb: client: print failed session logoffs with FYI
  cifs: Fix reversion of the iter in cifs_readv_receive().
  smb3: fix incorrect mode displayed for read-only files
  smb: client: fix parsing of device numbers
  smb: client: set correct device number on nfs reparse points
  smb: client: propagate error from cifs_construct_tcon()
  smb: client: fix DFS failover in multiuser mounts
  cifs: Make the write_{enter,done,err} tracepoints display netfs info
  smb: client: fix DFS interlink failover
  smb: client: improve purging of cached referrals
  smb: client: avoid unnecessary reconnects when refreshing referrals

5 months ago Merge tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux...
Linus Torvalds [Thu, 26 Sep 2024 15:55:36 +0000 (08:55 -0700)]
Merge tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - uprobes: make trace_uprobe->nhit counter a per-CPU one

   This makes uprobe event's hit counter per-CPU for improving
   scalability on multi-core environment

 - kprobes: Remove obsoleted declaration for init_test_probes

   Remove unused init_test_probes() from header

 - Raw tracepoint probe supports raw tracepoint events on modules:
     - add a function for iterating over all tracepoints in all modules
     - add a function for iterating over tracepoints in a module
     - support raw tracepoint events on modules
     - support raw tracepoints on future loaded modules
     - add a test for tracepoint events on modules
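
The per-CPU hit counter from the first item can be pictured with a small
model. This is only an illustrative Python sketch, not the kernel code:
PerCpuCounter, inc() and total() are hypothetical names, and the CPU index
is passed in explicitly instead of being taken from the running CPU.

```python
class PerCpuCounter:
    """Toy model of a per-CPU counter: one slot per CPU, summed on read."""

    def __init__(self, nr_cpus):
        self.slots = [0] * nr_cpus

    def inc(self, cpu):
        # Each CPU touches only its own slot, so increments on different
        # CPUs never contend for the same memory location.
        self.slots[cpu] += 1

    def total(self):
        # The (rare) reader pays the cost of summing every slot.
        return sum(self.slots)


nhit = PerCpuCounter(nr_cpus=4)
for cpu in (0, 1, 1, 3):
    nhit.inc(cpu)
print(nhit.total())
```

The trade-off is that reads become more expensive than a single shared
counter, which is acceptable when hits are frequent and reads are rare.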

* tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  sefltests/tracing: Add a test for tracepoint events on modules
  tracing/fprobe: Support raw tracepoints on future loaded modules
  tracing/fprobe: Support raw tracepoint events on modules
  tracepoint: Support iterating tracepoints in a loading module
  tracepoint: Support iterating over tracepoints on modules
  kprobes: Remove obsoleted declaration for init_test_probes
  uprobes: turn trace_uprobe's nhit counter to be per-CPU one

5 months ago Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Thu, 26 Sep 2024 15:43:17 +0000 (08:43 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "Several new features here:

   - virtio-balloon supports new stats

   - vdpa supports setting mac address

   - vdpa/mlx5 suspend/resume as well as MKEY ops are now faster

   - virtio_fs supports new sysfs entries for queue info

   - virtio/vsock performance has been improved

  And fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
  vsock/virtio: avoid queuing packets when intermediate queue is empty
  vsock/virtio: refactor virtio_transport_send_pkt_work
  fw_cfg: Constify struct kobj_type
  vdpa/mlx5: Postpone MR deletion
  vdpa/mlx5: Introduce init/destroy for MR resources
  vdpa/mlx5: Rename mr_mtx -> lock
  vdpa/mlx5: Extract mr members in own resource struct
  vdpa/mlx5: Rename function
  vdpa/mlx5: Delete direct MKEYs in parallel
  vdpa/mlx5: Create direct MKEYs in parallel
  MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section
  virtio_fs: add sysfs entries for queue information
  virtio_fs: introduce virtio_fs_put_locked helper
  vdpa: Remove unused declarations
  vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
  vdpa/mlx5: Small improvement for change_num_qps()
  vdpa/mlx5: Keep notifiers during suspend but ignore
  vdpa/mlx5: Parallelize device resume
  vdpa/mlx5: Parallelize device suspend
  vdpa/mlx5: Use async API for vq modify commands
  ...

5 months ago Merge tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Paolo Abeni [Thu, 26 Sep 2024 13:47:10 +0000 (15:47 +0200)]
Merge tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

v2: with kdoc fixes per Paolo Abeni.

The following patchset contains Netfilter fixes for net:

Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.

1) NAT engine can falsely detect a port collision if it happens to pick
   up a reply packet as NEW rather than ESTABLISHED. Add extra code to
   detect this and suppress port reallocation in this case.

2) To complete the clash resolution in the reply direction, extend conntrack
   logic to detect clashing conntrack in the reply direction to existing entry.

3) Adds a test case.

Then, an assorted list of fixes follow:

4) Add a selftest for tproxy, from Antonio Ojea.

5) Guard ctnetlink_*_size() functions under
   #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
   From Andy Shevchenko.

6) Use -m socket --transparent in iptables tproxy documentation.
   From XIE Zhibang.

7) Call kfree_rcu() when releasing flowtable hooks to address race with
   netlink dump path, from Phil Sutter.

8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
   From Simon Horman.

9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
   is its only user, to address a compilation warning. From Simon Horman.

10) Use rcu-protected list iteration over basechain hooks from netlink
    dump path.

11) Fix memcg accounting for nf_tables: the use of GFP_KERNEL_ACCOUNT was not complete.

12) Remove old nfqueue conntrack clash resolution. Instead of trying to
    use the same destination address consistently, which requires double
    DNAT, use the existing clash resolution, which allows clashing packets
    to go through with a different destination. Antonio Ojea originally
    reported an issue from the postrouting chain, I proposed a fix:
    https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
    which he reported did not work for him.

13) Adds a selftest for patch 12.

14) Fixes ipvs.sh selftest.

netfilter pull request 24-09-26

* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: Avoid hanging ipvs.sh
  kselftest: add test for nfqueue induced conntrack race
  netfilter: nfnetlink_queue: remove old clash resolution logic
  netfilter: nf_tables: missing objects with no memcg accounting
  netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
  netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
  netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
  netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
  docs: tproxy: ignore non-transparent sockets in iptables
  netfilter: ctnetlink: Guard possible unused functions
  selftests: netfilter: nft_tproxy.sh: add tcp tests
  selftests: netfilter: add reverse-clash resolution test case
  netfilter: conntrack: add clash resolution for reverse collisions
  netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago selftests: netfilter: Avoid hanging ipvs.sh
Phil Sutter [Thu, 19 Sep 2024 12:40:00 +0000 (14:40 +0200)]
selftests: netfilter: Avoid hanging ipvs.sh

If the client can't reach the server, the latter remains listening
forever. Kill it after 5s of waiting.
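
The pattern of giving a listening server a deadline can be sketched as
follows. The actual fix lives in the ipvs.sh shell script; this Python
version is only an illustration, and run_server_with_deadline() as well as
the stand-in "server" command are hypothetical.

```python
import subprocess
import sys


def run_server_with_deadline(cmd, timeout_s):
    """Start cmd and kill it if it has not exited after timeout_s seconds.

    Returns True if the process had to be killed, False if it exited by itself.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the child so it does not linger as a zombie
        return True


# Stand-in for a server that would otherwise keep listening forever.
stuck_server = [sys.executable, "-c", "import time; time.sleep(60)"]
killed = run_server_with_deadline(stuck_server, timeout_s=0.5)
```

With timeout_s=5 this corresponds to the 5s limit described above; the
shorter value here only keeps the example quick to run.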

Fixes: 867d2190799a ("selftests: netfilter: add ipvs test script")
Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago kselftest: add test for nfqueue induced conntrack race
Florian Westphal [Wed, 18 Sep 2024 13:16:33 +0000 (15:16 +0200)]
kselftest: add test for nfqueue induced conntrack race

The netfilter race happens when two packets with the same tuple are DNATed
and enqueued with nfqueue in the postrouting hook.

Once one of the packets is reinjected it may be DNATed again to a different
destination, but the conntrack entry remains the same and the return packet
is dropped.

Based on earlier patch from Antonio Ojea.

Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1766
Co-developed-by: Antonio Ojea <[email protected]>
Signed-off-by: Antonio Ojea <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nfnetlink_queue: remove old clash resolution logic
Florian Westphal [Wed, 18 Sep 2024 13:13:39 +0000 (15:13 +0200)]
netfilter: nfnetlink_queue: remove old clash resolution logic

For historical reasons there are two clash resolution spots in
netfilter, one in nfnetlink_queue and one in conntrack core.

nfnetlink_queue one was added first: If a colliding entry is found, the
NAT transformation is reversed by calling the NAT engine again with an
altered tuple.

See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for
unconfirmed conntracks") for details.

One problem is that nf_reroute() won't take an action if the queueing
doesn't occur in the OUTPUT hook, i.e. when queueing in forward or
postrouting, packet will be sent via the wrong path.

Another problem is that the scenario addressed (2nd UDP packet sent with
identical addresses while first packet is still being processed) can also
occur without any nfqueue involvement due to threaded resolvers doing
A and AAAA requests back-to-back.

This led us to add clash resolution logic to the conntrack core, see
commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing
entries").  Instead of fixing the nfqueue based logic, let's remove it
and let conntrack core handle this instead.

Retain the ->update hook for sake of nfqueue based conntrack helpers.
We could axe this hook completely but we'd have to split confirm and
helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make
conntrack userspace helpers work again").

This SHOULD NOT be backported to kernels earlier than v5.6; they lack
adequate clash resolution handling.

Patch was originally written by Pablo Neira Ayuso.

Reported-by: Antonio Ojea <[email protected]>
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766
Signed-off-by: Florian Westphal <[email protected]>
Tested-by: Antonio Ojea <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nf_tables: missing objects with no memcg accounting
Pablo Neira Ayuso [Wed, 18 Sep 2024 12:19:45 +0000 (14:19 +0200)]
netfilter: nf_tables: missing objects with no memcg accounting

Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for
memory accounting, update them. This includes:

- catchall elements
- compat match large info area
- log prefix
- meta secctx
- numgen counters
- pipapo set backend datastructure
- tunnel private objects

Fixes: 33758c891479 ("memcg: enable accounting for nft objects")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
Pablo Neira Ayuso [Tue, 17 Sep 2024 21:07:46 +0000 (23:07 +0200)]
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path

Lockless iteration over hook list is possible from netlink dump path,
use rcu variant to iterate over the hook list as is done with flowtable
hooks.

Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Reported-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
Simon Horman [Mon, 16 Sep 2024 15:14:41 +0000 (16:14 +0100)]
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS

Only provide ctnetlink_label_size when it is used,
which is when CONFIG_NF_CONNTRACK_EVENTS is configured.

Flagged by clang-18 W=1 builds as:

.../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function]
  385 | static inline int ctnetlink_label_size(const struct nf_conn *ct)
      |                   ^~~~~~~~~~~~~~~~~~~~

The condition on CONFIG_NF_CONNTRACK_LABELS being removed by
this patch guards compilation of non-trivial implementations
of ctnetlink_dump_labels() and ctnetlink_label_size().

However, this is not necessary as each of these functions
will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined
as each function starts with the equivalent of:

	struct nf_conn_labels *labels = nf_ct_labels_find(ct);

	if (!labels)
		return 0;

And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS
is not enabled.  So I believe that the compiler optimises the code away
in such cases anyway.

Found by inspection.
Compile tested only.

Originally split into two patches, Pablo Neira Ayuso collapsed them and
added Fixes: tag.

Fixes: 0ceabd83875b ("netfilter: ctnetlink: deliver labels to userspace")
Link: https://lore.kernel.org/netfilter-devel/[email protected]/
Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
Simon Horman [Mon, 16 Sep 2024 09:50:34 +0000 (10:50 +0100)]
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n

If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
using gcc-14 results in the following warnings, which are treated as
errors:

net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
  243 |         struct iphdr *niph;
      |                       ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
  286 |         struct ipv6hdr *ip6h;
      |                         ^~~~
cc1: all warnings being treated as errors

Address this by reducing the scope of these local variables to where
they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
is enabled.

Compile tested and run through netfilter selftests.

Reported-by: Andy Shevchenko <[email protected]>
Closes: https://lore.kernel.org/netfilter-devel/[email protected]/
Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
Phil Sutter [Thu, 12 Sep 2024 12:21:33 +0000 (14:21 +0200)]
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU

Documentation of list_del_rcu() warns callers to not immediately free
the deleted list item. While it seems not necessary to use the
RCU-variant of list_del() here in the first place, doing so seems to
require calling kfree_rcu() on the deleted item as well.

Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables")
Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago docs: tproxy: ignore non-transparent sockets in iptables
谢致邦 (XIE Zhibang) [Thu, 12 Sep 2024 11:59:33 +0000 (11:59 +0000)]
docs: tproxy: ignore non-transparent sockets in iptables

The iptables example was added in commit d2f26037a38a (netfilter: Add
documentation for tproxy, 2008-10-08), but xt_socket 'transparent'
option was added in commit a31e1ffd2231 (netfilter: xt_socket: added new
revision of the 'socket' match supporting flags, 2009-06-09).

Now add the 'transparent' option to the iptables example to ignore
non-transparent sockets, which is also consistent with the nft example.

Signed-off-by: 谢致邦 (XIE Zhibang) <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: ctnetlink: Guard possible unused functions
Andy Shevchenko [Tue, 10 Sep 2024 08:35:33 +0000 (11:35 +0300)]
netfilter: ctnetlink: Guard possible unused functions

Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n
and CONFIG_NF_CONNTRACK_EVENTS=n), which prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:

net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function]
  657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct)
      |                      ^~~~~~~~~~~~~~~~~~~
net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function]
  667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct)
      |                   ^~~~~~~~~~~~~~~~~~~~~
net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function]
  683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct)
      |                      ^~~~~~~~~~~~~~~~~~~~~~~~

Fix this by guarding possible unused functions with ifdeffery.

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago selftests: netfilter: nft_tproxy.sh: add tcp tests
Antonio Ojea [Thu, 12 Sep 2024 06:17:54 +0000 (06:17 +0000)]
selftests: netfilter: nft_tproxy.sh: add tcp tests

The TPROXY functionality is widely used, however, there are only mptcp
selftests covering this feature.

The selftests represent the most common scenarios and can also be used
as self-documentation of the feature.

UDP and TCP testcases are split in different files because of the
different nature of the protocols, especially the challenge of testing
UDP reliably given the connectionless nature of the protocol. UDP only
covers the scenarios involving the prerouting hook.

The UDP tests are significantly slower than the TCP ones, hence they
use a larger timeout: it takes 20 seconds to run the full UDP suite
on a 48 vCPU Intel(R) Xeon(R) CPU @2.60GHz.

Signed-off-by: Antonio Ojea <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago selftests: netfilter: add reverse-clash resolution test case
Florian Westphal [Tue, 10 Sep 2024 09:38:16 +0000 (11:38 +0200)]
selftests: netfilter: add reverse-clash resolution test case

Add a test program that sends UDP packets in both directions
and checks that packets arrive without source port modification.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: conntrack: add clash resolution for reverse collisions
Florian Westphal [Tue, 10 Sep 2024 09:38:15 +0000 (11:38 +0200)]
netfilter: conntrack: add clash resolution for reverse collisions

Given existing entry:
ORIGIN: a:b -> c:d
REPLY:  c:d -> a:b

And colliding entry:
ORIGIN: c:d -> a:b
REPLY:  a:b -> c:d

The colliding ct (and the associated skb) get dropped on insert.
Permit this by checking if the colliding entry matches the reply
direction.

This happens when both ends send packets at the same time: both requests
are picked up as NEW, rather than NEW for the 'first' and 'ESTABLISHED'
for the second packet.

This is an esoteric condition, as ruleset must permit NEW connections
in either direction and both peers must already have a bidirectional
traffic flow at the time conntrack gets enabled.

Allow the 'reverse' skb to pass and assign the existing (clashing)
entry.

While at it, also drop the extra 'dying' check, this is already
tested earlier by the calling function.
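
The reverse-collision check described above can be modelled in a few
lines. This is a hedged Python sketch of the idea only, not the conntrack
implementation: FlowTuple, Conn, invert() and is_reverse_clash() are
hypothetical names, and endpoints are plain strings such as "a:b".

```python
from collections import namedtuple

# A tuple is modelled as (source endpoint, destination endpoint).
FlowTuple = namedtuple("FlowTuple", ["src", "dst"])


def invert(t):
    """Return the tuple of the opposite direction."""
    return FlowTuple(src=t.dst, dst=t.src)


class Conn:
    """Toy conntrack entry: an ORIGIN tuple and its REPLY tuple."""

    def __init__(self, origin):
        self.origin = origin
        self.reply = invert(origin)  # no NAT in this model


def is_reverse_clash(existing, colliding):
    # The new entry's ORIGIN direction equals the existing entry's REPLY
    # direction: both ends sent a packet at the same time.
    return colliding.origin == existing.reply


existing = Conn(FlowTuple("a:b", "c:d"))
colliding = Conn(FlowTuple("c:d", "a:b"))
unrelated = Conn(FlowTuple("e:f", "c:d"))
```

In this model is_reverse_clash(existing, colliding) holds, so the skb
could be assigned to the existing entry instead of being dropped, while
the unrelated entry does not qualify.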

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
Florian Westphal [Tue, 10 Sep 2024 09:38:14 +0000 (11:38 +0200)]
netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash

A conntrack entry can be inserted to the connection tracking table if there
is no existing entry with an identical tuple in either direction.

Example:
INITIATOR -> NAT/PAT -> RESPONDER

Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
Then, later, NAT/PAT machine itself also wants to connect to RESPONDER.

This will not work if the SNAT done earlier has same IP:PORT source pair.

Conntrack table has:
ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

This is handled by the NAT engine which will do a source port reallocation
for the locally originating connection that is colliding with an existing
tuple by attempting a source port rewrite.

This is done even if this new connection attempt did not go through a
masquerade/snat rule.

There is a rare race condition with connection-less protocols like UDP,
where we do the port reallocation even though it's not needed.

This happens when new packets from the same, pre-existing flow are received
in both directions at the exact same time on different CPUs after the
conntrack table was flushed (or conntrack becomes active for first time).

With strict ordering/single cpu, the first packet creates new ct entry and
second packet is resolved as established reply packet.

With parallel processing, both packets are picked up as new and both get
their own ct entry.

In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
NAT engine because a port collision is detected.

This change isn't enough to prevent a packet drop later during
nf_conntrack_confirm(), the existing clash resolution strategy will not
detect such reverse clash case.  This is resolved by a followup patch.
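
The difference between a genuine port collision and the reverse-direction
case can be illustrated with a toy table. This Python sketch rests on
assumptions: it does not mirror the nf_nat code, Conn and
needs_port_realloc() are hypothetical names, and the addresses are
documentation examples.

```python
from collections import namedtuple

# Each direction is ((src_ip, src_port), (dst_ip, dst_port)).
Conn = namedtuple("Conn", ["original", "reply"])


def is_reverse_of_same_flow(existing, new):
    # The "new" entry is just the other direction of an existing flow.
    return existing.original == new.reply and existing.reply == new.original


def needs_port_realloc(table, new):
    """True if new's reply tuple is taken by a genuinely different flow."""
    for entry in table:
        if new.reply in (entry.original, entry.reply):
            if is_reverse_of_same_flow(entry, new):
                continue  # reverse clash: rewriting the port would be wrong
            return True
    return False


INITIATOR = ("10.0.0.2", 40000)
NAT = ("198.51.100.1", 40000)
RESPONDER = ("192.0.2.1", 53)

# Flow that was SNATed on its way through the NAT/PAT machine.
snat_flow = Conn(original=(INITIATOR, RESPONDER), reply=(RESPONDER, NAT))
# Locally originating connection that wants the same IP:PORT source pair.
local_flow = Conn(original=(NAT, RESPONDER), reply=(RESPONDER, NAT))

# Un-NATed flow and the entry created by its reply packet being seen as new.
plain_flow = Conn(original=(INITIATOR, RESPONDER), reply=(RESPONDER, INITIATOR))
reverse_flow = Conn(original=(RESPONDER, INITIATOR), reply=(INITIATOR, RESPONDER))
```

Here needs_port_realloc([snat_flow], local_flow) is true, which is the
genuine collision from the example, whereas
needs_port_realloc([plain_flow], reverse_flow) is false, which is the
case where reallocation should be suppressed.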

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 months ago selftests/net: packetdrill: increase timing tolerance in debug mode
Willem de Bruijn [Thu, 19 Sep 2024 12:43:42 +0000 (08:43 -0400)]
selftests/net: packetdrill: increase timing tolerance in debug mode

Some packetdrill tests are flaky in debug mode. As discussed, increase
tolerance.

We have been doing this for debug builds outside ksft too.

Previous setting was 10000. A manual 50 runs in virtme-ng showed two
failures that needed 12000. To be on the safe side, increase to 14000.
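
The effect of the numbers above can be checked with a small calculation.
This is an illustrative Python sketch, assuming the tolerance values are
microseconds; within_tolerance() and the scripted event time are
hypothetical and not part of packetdrill.

```python
def within_tolerance(expected_us, actual_us, tolerance_us):
    """True if an event's timing error is within the allowed tolerance."""
    return abs(actual_us - expected_us) <= tolerance_us


OLD_TOLERANCE = 10000
NEW_TOLERANCE = 14000
observed_error = 12000  # worst case seen in the 50 manual runs

expected = 100000  # hypothetical scripted event time
actual = expected + observed_error

print(within_tolerance(expected, actual, OLD_TOLERANCE))  # False
print(within_tolerance(expected, actual, NEW_TOLERANCE))  # True
```

The new value leaves 2000 of headroom over the worst observed case.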

Link: https://lore.kernel.org/netdev/Zuhhe4-MQHd3EkfN@mini-arch/
Fixes: 1e42f73fd3c2 ("selftests/net: packetdrill: import tcp/zerocopy")
Reported-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Acked-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago usbnet: fix cyclical race on disconnect with work queue
Oliver Neukum [Thu, 19 Sep 2024 12:33:42 +0000 (14:33 +0200)]
usbnet: fix cyclical race on disconnect with work queue

The work can submit URBs and the URBs can schedule the work.
This cycle needs to be broken when a device is to be stopped.
Use a flag to do so.
This is a design issue as old as the driver.
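
For illustration only, here is a minimal userspace C sketch of how a
"stopping" flag can break a work/URB cycle. Every name in it is
hypothetical (a model, not the usbnet code): both the work scheduling
path and the URB submission path check the flag and refuse to start
new activity once it is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from usbnet. */
struct model_dev {
	bool stopping;       /* set once, before teardown starts */
	bool work_pending;   /* the work item is queued */
	int  urbs_in_flight; /* URBs submitted but not yet completed */
	int  work_runs;      /* how many times the work function ran */
};

/* Queue the work unless the device is being stopped. */
static bool model_schedule_work(struct model_dev *d)
{
	if (d->stopping)
		return false;
	d->work_pending = true;
	return true;
}

/* Submit an URB unless the device is being stopped. */
static bool model_submit_urb(struct model_dev *d)
{
	if (d->stopping)
		return false;
	d->urbs_in_flight++;
	return true;
}

/* Work function: it tries to submit an URB. */
static void model_run_work(struct model_dev *d)
{
	d->work_pending = false;
	d->work_runs++;
	model_submit_urb(d);
}

/* URB completion: it tries to schedule the work again. */
static void model_complete_urb(struct model_dev *d)
{
	d->urbs_in_flight--;
	model_schedule_work(d);
}

/*
 * Drive the model until nothing is left to do or the step budget is
 * used up. Returns the number of steps taken.
 */
static int model_drain(struct model_dev *d, int max_steps)
{
	int steps = 0;

	while (steps < max_steps &&
	       (d->work_pending || d->urbs_in_flight > 0)) {
		if (d->work_pending)
			model_run_work(d);
		else
			model_complete_urb(d);
		steps++;
	}
	return steps;
}
```

Without the flag, model_drain() always exhausts its step budget because
each side keeps re-arming the other. Once stopping is set, the pending
work runs one last time and the model goes idle.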

Signed-off-by: Oliver Neukum <[email protected]>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
CC: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled
Furong Xu [Thu, 19 Sep 2024 12:10:28 +0000 (20:10 +0800)]
net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled

Commit 5fabb01207a2 ("net: stmmac: Add initial XDP support") sets the
PP_FLAG_DMA_SYNC_DEV flag for page_pool unconditionally, so
page_pool_recycle_direct() will call page_pool_dma_sync_for_device()
on every page even if the page is not going to be reused by an XDP program.

When XDP is not enabled, the page which holds the received buffer
will be recycled once the buffer is copied into new SKB by
skb_copy_to_linear_data(), then the MAC core will never reuse this
page any longer. Always setting PP_FLAG_DMA_SYNC_DEV wastes CPU cycles
on unnecessary calling of page_pool_dma_sync_for_device().

After this patch, up to 9% noticeable performance improvement was observed
on certain platforms.
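
The idea can be sketched in a few lines of userspace C. This is a model
under stated assumptions, not the driver change itself: the MODEL_*
constants are stand-ins for the kernel flags, and the assumption that
DMA mapping is always requested is made for the example only.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for this sketch; the kernel defines the real flags. */
#define MODEL_PP_FLAG_DMA_MAP      (1u << 0)
#define MODEL_PP_FLAG_DMA_SYNC_DEV (1u << 1)

/*
 * Always ask the pool to map pages (an assumption of this model), but
 * ask for sync-for-device on recycle only when XDP is enabled.
 */
static unsigned int model_pool_flags(bool xdp_enabled)
{
	unsigned int flags = MODEL_PP_FLAG_DMA_MAP;

	if (xdp_enabled)
		flags |= MODEL_PP_FLAG_DMA_SYNC_DEV;
	return flags;
}
```

With XDP disabled the sync flag stays clear, so the recycle path in the
model would have no reason to sync pages for the device.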

Fixes: 5fabb01207a2 ("net: stmmac: Add initial XDP support")
Signed-off-by: Furong Xu <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago virtio_net: Fix mismatched buf address when unmapping for small packets
Wenbo Li [Thu, 19 Sep 2024 08:13:51 +0000 (16:13 +0800)]
virtio_net: Fix mismatched buf address when unmapping for small packets

Currently, the virtio-net driver will perform a pre-dma-mapping for
small or mergeable RX buffer. But for small packets, a mismatched address
without VIRTNET_RX_PAD and xdp_headroom is used for unmapping.

That will result in unsynchronized buffers when SWIOTLB is enabled, for
example, when running as a TDX guest.

This patch unifies the address passed to the virtio core as the address of
the virtnet header and fixes the mismatched buffer address.

Changes from v2: unify the buf that is passed to the virtio core in small
and merge mode.
Changes from v1: Use ctx to get xdp_headroom.

Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Signed-off-by: Wenbo Li <[email protected]>
Signed-off-by: Jiahui Cen <[email protected]>
Signed-off-by: Ying Fang <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
5 months ago Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux
Linus Torvalds [Wed, 25 Sep 2024 21:56:40 +0000 (14:56 -0700)]
Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - Improve blk-integrity segment counting and merging (Keith)

 - NVMe pull request via Keith:
      - Multipath fixes (Hannes)
      - Sysfs attribute list NULL terminate fix (Shin'ichiro)
      - Remove problematic read-back (Keith)

 - Fix for a regression with the IO scheduler switching freezing from
   6.11 (Damien)

 - Use a raw spinlock for sbitmap, as it may get called from preempt
   disabled context (Ming)

 - Cleanup for bd_claiming waiting, using var_waitqueue() rather than
   the bit waitqueues, as that more accurately describes what it does
   (Neil)

 - Various cleanups (Kanchan, Qiu-ji, David)

* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device
  lib/sbitmap: define swap_lock as raw_spinlock_t
  block: Remove unused blk_limits_io_{min,opt}
  drbd: Fix atomicity violation in drbd_uuid_set_bm()
  block: Fix elv_iosched_local_module handling of "none" scheduler
  block: remove bogus union
  block: change wait on bd_claiming to use a var_waitqueue
  blk-integrity: improved sg segment mapping
  block: unexport blk_rq_count_integrity_sg
  nvme-rdma: use request to get integrity segments
  scsi: use request to get integrity segments
  block: provide a request helper for user integrity segments
  blk-integrity: consider entire bio list for merging
  blk-integrity: properly account for segments
  blk-mq: set the nr_integrity_segments from bio
  blk-mq: unconditional nr_integrity_segments

5 months ago Merge tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 25 Sep 2024 21:49:34 +0000 (14:49 -0700)]
Merge tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "Some driver specific fixes that came in during the merge window.

  Lorenzo Bianconi did some extra testing on the recently added airoha
  driver and found some issues, Alexander Dahl fixed some issues with
  signal delays in the Atmel QSPI driver and Jinjie Ruan has been fixing
  some nits with runtime PM cleanup"

* tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: atmel-quadspi: Avoid overwriting delay register settings
  spi: airoha: remove read cache in airoha_snand_dirmap_read()
  spi: spi-fsl-lpspi: Undo runtime PM changes at driver exit time
  spi: atmel-quadspi: Undo runtime PM changes at driver exit time
  spi: airoha: fix airoha_snand_{write,read}_data data_len estimation
  spi: airoha: fix dirmap_{read,write} operations

5 months ago Merge tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Linus Torvalds [Wed, 25 Sep 2024 21:38:37 +0000 (14:38 -0700)]
Merge tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "More conversions of DT bindings to yaml. There is one new driver, for
  the DFRobot SD2405AL and support for important features of the stm32
  RTC. Summary:

  New driver:
   - DFRobot SD2405AL

  Drivers:
   - stm32: add alarm A out and LSCO support
   - sun6i: disable automatic clock input switching
   - m48t59: set range"

* tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: rc5t619: use proper module tables
  rtc: m48t59: set range
  dt-bindings: rtc: microcrystal,rv3028: add #clock-cells property
  rtc: m48t59: Remove division condition with direct comparison
  rtc: at91sam9: fix OF node leak in probe() error path
  rtc: sun6i: disable automatic clock input switching
  dt-bindings: rtc: Drop non-trivial duplicate compatibles
  dt-bindings: vendor-prefixes: Add DFRobot.
  dt-bindings: rtc: Add support for SD2405AL.
  rtc: Add driver for SD2405AL
  rtc: s35390a: Drop vendorless compatible string from match table
  rtc: twl: convert comma to semicolon
  dt-bindings: rtc: sprd,sc2731-rtc: convert to YAML
  rtc: stm32: add alarm A out feature
  rtc: stm32: add Low Speed Clock Output (LSCO) support
  rtc: stm32: add pinctrl and pinmux interfaces
  dt-bindings: rtc: stm32: describe pinmux nodes

5 months ago dt-bindings: input: Revert "dt-bindings: input: Goodix SPI HID Touchscreen"
Krzysztof Kozlowski [Wed, 25 Sep 2024 19:49:21 +0000 (21:49 +0200)]
dt-bindings: input: Revert "dt-bindings: input: Goodix SPI HID Touchscreen"

This reverts commit 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID
Touchscreen") because it duplicates an existing binding, leading to errors:

  goodix,gt7986u.example.dtb:
  touchscreen@0: compatible: 'oneOf' conditional failed, one must be fixed:
        ['goodix,gt7986u'] is too short
        'goodix,gt7375p' was expected

This was reported on the mailing list on the 6th of September, but neither
the contributor nor the maintainer reacted to fix it.

Therefore let's drop the binding, which breaks and duplicates the existing one.

Fixes: 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID Touchscreen")
Reported-by: Rob Herring <[email protected]>
Closes: https://lore.kernel.org/all/CAL_Jsq+QfTtRj_JCqXzktQ49H8VUnztVuaBjvvkg3fwEHniUHw@mail.gmail.com/
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
5 months ago HID: hid-goodix: drop unsupported and undocumented DT part
Krzysztof Kozlowski [Wed, 25 Sep 2024 19:49:20 +0000 (21:49 +0200)]
HID: hid-goodix: drop unsupported and undocumented DT part

Drop support for Devicetree from the driver, because the binding is being
reverted (on the basis of duplicating an existing binding) and the property
was not added to the original binding.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
5 months ago Merge tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt...
Linus Torvalds [Wed, 25 Sep 2024 18:35:19 +0000 (11:35 -0700)]
Merge tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - new memblock_estimated_nr_free_pages() helper to replace
   totalram_pages() which is less accurate when
   CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

 - fixes for memblock tests

* tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  s390/mm: get estimated free pages by memblock api
  kernel/fork.c: get estimated free pages by memblock api
  mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
  memblock test: fix implicit declaration of function 'strscpy'
  memblock test: fix implicit declaration of function 'isspace'
  memblock test: fix implicit declaration of function 'memparse'
  memblock test: add the definition of __setup()
  memblock test: fix implicit declaration of function 'virt_to_phys'
  tools/testing: abstract two init.h into common include directory
  memblock tests: include export.h in linkage.h as kernel dose
  memblock tests: include memory_hotplug.h in mmzone.h as kernel dose

5 months ago Merge tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 25 Sep 2024 18:21:06 +0000 (11:21 -0700)]
Merge tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc

Pull sparc32 update from Andreas Larsson:

 - Remove an unused variable for sparc32

* tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
  arch/sparc: remove unused varible paddrbase in function leon_swprobe()

5 months ago Merge tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Wed, 25 Sep 2024 18:17:25 +0000 (11:17 -0700)]
Merge tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix build error in vdso32 when building 64-bit with COMPAT=y and -Os

 - Fix build error in pseries EEH when CONFIG_DEBUG_FS is not set

Thanks to Christophe Leroy, Narayana Murty N, Christian Zigotzky, and
Ritesh Harjani.

* tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block
  powerpc/vdso32: Fix use of crtsavres for PPC64

5 months ago Merge tag 'clang-format-6.12' of https://github.com/ojeda/linux
Linus Torvalds [Wed, 25 Sep 2024 18:10:39 +0000 (11:10 -0700)]
Merge tag 'clang-format-6.12' of https://github.com/ojeda/linux

Pull clang-format updates from Miguel Ojeda:
 "A routine update of the 'for_each' macro list"

* tag 'clang-format-6.12' of https://github.com/ojeda/linux:
  clang-format: Update with v6.11-rc1's `for_each` macro list

5 months ago Kbuild: make MODVERSIONS support depend on not being a compile test build
Linus Torvalds [Wed, 25 Sep 2024 18:08:28 +0000 (11:08 -0700)]
Kbuild: make MODVERSIONS support depend on not being a compile test build

Currently the Rust support is gated on not having MODVERSIONS enabled,
and as a result an "allmodconfig" build will disable Rust build tests.

While MODVERSIONS configurations are worth build testing, the feature is
not actually meaningful unless you run the result, and I'd rather get
build coverage of Rust than MODVERSIONS.  So let's disable MODVERSIONS
for build testing until the Rust side clears up.

Signed-off-by: Linus Torvalds <[email protected]>
5 months ago Merge tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux
Linus Torvalds [Wed, 25 Sep 2024 17:25:40 +0000 (10:25 -0700)]
Merge tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux

Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Support 'MITIGATION_{RETHUNK,RETPOLINE,SLS}' (which cleans up
     objtool warnings), teach objtool about 'noreturn' Rust symbols and
     mimic '___ADDRESSABLE()' for 'module_{init,exit}'. With that, we
     should be objtool-warning-free, so enable it to run for all Rust
     object files.

   - KASAN (no 'SW_TAGS'), KCFI and shadow call sanitizer support.

   - Support 'RUSTC_VERSION', including re-config and re-build on
     change.

   - Split helpers file into several files in a folder, to avoid
     conflicts in it. Eventually those files will be moved to the right
     places with the new build system. In addition, remove the need to
     manually export the symbols defined there, reusing existing
     machinery for that.

   - Relax restriction on configurations with Rust + GCC plugins to just
     the RANDSTRUCT plugin.

  'kernel' crate:

   - New 'list' module: doubly-linked linked list for use with reference
     counted values, which is heavily used by the upcoming Rust Binder.

     This includes 'ListArc' (a wrapper around 'Arc' that is guaranteed
     unique for the given ID), 'AtomicTracker' (tracks whether a
     'ListArc' exists using an atomic), 'ListLinks' (the prev/next
     pointers for an item in a linked list), 'List' (the linked list
     itself), 'Iter' (an iterator over a 'List'), 'Cursor' (a cursor
     into a 'List' that allows to remove elements), 'ListArcField' (a
     field exclusively owned by a 'ListArc'), as well as support for
     heterogeneous lists.

   - New 'rbtree' module: red-black tree abstractions used by the
     upcoming Rust Binder.

     This includes 'RBTree' (the red-black tree itself), 'RBTreeNode' (a
     node), 'RBTreeNodeReservation' (a memory reservation for a node),
     'Iter' and 'IterMut' (immutable and mutable iterators), 'Cursor'
     (bidirectional cursor that allows to remove elements), as well as
     an entry API similar to the Rust standard library one.

   - 'init' module: add 'write_[pin_]init' methods and the
     'InPlaceWrite' trait. Add the 'assert_pinned!' macro.

   - 'sync' module: implement the 'InPlaceInit' trait for 'Arc' by
     introducing an associated type in the trait.

   - 'alloc' module: add 'drop_contents' method to 'BoxExt'.

   - 'types' module: implement the 'ForeignOwnable' trait for
     'Pin<Box<T>>' and improve the trait's documentation. In addition,
     add the 'into_raw' method to the 'ARef' type.

   - 'error' module: in preparation for the upcoming Rust support for
     32-bit architectures, like arm, locally allow Clippy lint for
     those.

  Documentation:

   - https://rust.docs.kernel.org has been announced, so link to it.

   - Enable rustdoc's "jump to definition" feature, making its output a
     bit closer to the experience in a cross-referencer.

   - Debian Testing now also provides recent Rust releases (outside of
     the freeze period), so add it to the list.

  MAINTAINERS:

   - Trevor is joining as reviewer of the "RUST" entry.

  And a few other small bits"

* tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux: (54 commits)
  kasan: rust: Add KASAN smoke test via UAF
  kbuild: rust: Enable KASAN support
  rust: kasan: Rust does not support KHWASAN
  kbuild: rust: Define probing macros for rustc
  kasan: simplify and clarify Makefile
  rust: cfi: add support for CFI_CLANG with Rust
  cfi: add CONFIG_CFI_ICALL_NORMALIZE_INTEGERS
  rust: support for shadow call stack sanitizer
  docs: rust: include other expressions in conditional compilation section
  kbuild: rust: replace proc macros dependency on `core.o` with the version text
  kbuild: rust: rebuild if the version text changes
  kbuild: rust: re-run Kconfig if the version text changes
  kbuild: rust: add `CONFIG_RUSTC_VERSION`
  rust: avoid `box_uninit_write` feature
  MAINTAINERS: add Trevor Gross as Rust reviewer
  rust: rbtree: add `RBTree::entry`
  rust: rbtree: add cursor
  rust: rbtree: add mutable iterator
  rust: rbtree: add iterator
  rust: rbtree: add red-black tree implementation backed by the C version
  ...

5 months ago sefltests/tracing: Add a test for tracepoint events on modules
Masami Hiramatsu (Google) [Sun, 18 Aug 2024 10:43:35 +0000 (19:43 +0900)]
sefltests/tracing: Add a test for tracepoint events on modules

Add a test case for tracepoint events on modules. This checks if it can add
and remove the events correctly.

Link: https://lore.kernel.org/all/172397781494.286558.7581515061075998225.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago tracing/fprobe: Support raw tracepoints on future loaded modules
Masami Hiramatsu (Google) [Sun, 18 Aug 2024 10:43:26 +0000 (19:43 +0900)]
tracing/fprobe: Support raw tracepoints on future loaded modules

Support raw tracepoint events on future loaded (unloaded) modules.
This allows user to create raw tracepoint events which can be used from
module's __init functions.

Note: since the kernel does not have any information about the tracepoints
in the unloaded modules, fprobe events can not check whether the tracepoint
exists nor extend the BTF based arguments.

Link: https://lore.kernel.org/all/172397780593.286558.18360375226968537828.stgit@devnote2/
Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago tracing/fprobe: Support raw tracepoint events on modules
Masami Hiramatsu (Google) [Sun, 18 Aug 2024 10:43:16 +0000 (19:43 +0900)]
tracing/fprobe: Support raw tracepoint events on modules

Support raw tracepoint event on module by fprobe events.
Since it only uses for_each_kernel_tracepoint() to find a tracepoint,
the tracepoints on modules are not handled. Thus if user specified a
tracepoint on a module, it shows an error.
This adds new for_each_module_tracepoint() API to tracepoint subsystem,
and uses it to find tracepoints on modules.

Link: https://lore.kernel.org/all/172397779651.286558.15903703620679186867.stgit@devnote2/
Reported-by: don <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago tracepoint: Support iterating tracepoints in a loading module
Masami Hiramatsu (Google) [Sun, 18 Aug 2024 10:43:07 +0000 (19:43 +0900)]
tracepoint: Support iterating tracepoints in a loading module

Add for_each_tracepoint_in_module() function to iterate tracepoints in
a module. This API is needed for handling tracepoints in a loading
module from tracepoint_module_notifier callback function.
This also updates for_each_module_tracepoint() to pass the module to the
callback function so that it can find the module easily.

Link: https://lore.kernel.org/all/172397778740.286558.15781131277732977643.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago tracepoint: Support iterating over tracepoints on modules
Masami Hiramatsu (Google) [Sun, 18 Aug 2024 10:42:58 +0000 (19:42 +0900)]
tracepoint: Support iterating over tracepoints on modules

Add for_each_module_tracepoint() for iterating over tracepoints
on modules. This is similar to the for_each_kernel_tracepoint()
but only for the tracepoints on modules (not including kernel
built-in tracepoints).

Link: https://lore.kernel.org/all/172397777800.286558.14554748203446214056.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago kprobes: Remove obsoleted declaration for init_test_probes
Gaosheng Cui [Mon, 26 Aug 2024 03:25:52 +0000 (11:25 +0800)]
kprobes: Remove obsoleted declaration for init_test_probes

init_test_probes() was removed in
commit e44e81c5b90f ("kprobes: convert tests to kunit"), so its
declaration is now useless; remove it.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Gaosheng Cui <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago uprobes: turn trace_uprobe's nhit counter to be per-CPU one
Andrii Nakryiko [Tue, 13 Aug 2024 20:34:09 +0000 (13:34 -0700)]
uprobes: turn trace_uprobe's nhit counter to be per-CPU one

The trace_uprobe->nhit counter is not incremented atomically, so its value
is questionable when a uprobe is hit on multiple CPUs simultaneously.

Also, doing this shared counter increment across many CPUs causes heavy
cache line bouncing, limiting uprobe/uretprobe performance scaling with
number of CPUs.

Solve both problems by making this a per-CPU counter.
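
As a rough illustration of why a per-CPU counter helps, here is a
userspace C model with hypothetical names (it does not use the kernel
per-CPU API, and the CPU count and cache line size are assumptions).
Each CPU increments only its own padded slot, so writers never share a
cache line, and the reader pays the cost of summing the slots.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS   4   /* assumed CPU count for the model */
#define MODEL_CACHELINE 64  /* assumed cache line size in bytes */

/* One slot per CPU, padded so that slots are a full cache line apart. */
struct model_slot {
	unsigned long count;
	char pad[MODEL_CACHELINE - sizeof(unsigned long)];
};

struct model_percpu_counter {
	struct model_slot slot[MODEL_NR_CPUS];
};

/* Hot path: a CPU touches only its own slot, so no atomics are needed. */
static void model_hit(struct model_percpu_counter *c, int cpu)
{
	c->slot[cpu].count++;
}

/* Slow path: the reader sums every slot to get the total. */
static unsigned long model_read(const struct model_percpu_counter *c)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < MODEL_NR_CPUS; i++)
		sum += c->slot[i].count;
	return sum;
}
```

The trade-off is that reads become more expensive than a single load,
which is acceptable when hits are frequent and reads are rare.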

Link: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
5 months ago vsock/virtio: avoid queuing packets when intermediate queue is empty
Luigi Leonardi [Tue, 30 Jul 2024 19:47:32 +0000 (21:47 +0200)]
vsock/virtio: avoid queuing packets when intermediate queue is empty

When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.

This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).
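
The fast-path condition described above can be modelled in userspace C
as follows. This is only a sketch: the struct and its fields are
hypothetical, and a plain boolean stands in for the tx lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the transmit state; field names are illustrative. */
struct model_tx {
	int  queued;       /* packets waiting in the intermediate queue */
	int  vq_free;      /* free slots in the virtqueue */
	bool tx_locked;    /* another sender currently holds the tx lock */
	int  sent_direct;  /* packets placed straight into the virtqueue */
	int  worker_kicks; /* times the worker was scheduled */
};

static void model_send(struct model_tx *tx)
{
	/*
	 * Fast path: nothing is queued ahead of us (so ordering is kept),
	 * the virtqueue has room, and nobody else is sending.
	 */
	if (tx->queued == 0 && tx->vq_free > 0 && !tx->tx_locked) {
		tx->vq_free--;
		tx->sent_direct++;
		return;
	}

	/* Slow path: queue the packet and let the worker deliver it. */
	tx->queued++;
	tx->worker_kicks++;
}
```

Note that a packet is never sent directly while older packets are still
queued, even if the virtqueue has space, because that would reorder the
stream.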

The following benchmarks were run to check improvements in latency and
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz
and L1 guest running on QEMU/KVM with vhost process and all vCPUs
pinned individually to pCPUs.

- Latency
   Tool: Fio version 3.37-56
   Mode: pingpong (h-g-h)
   Test runs: 50
   Runtime-per-test: 50s
   Type: SOCK_STREAM

In the following fio benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

fio process pinned both inside the host and the guest system.

Before: Linux 6.9.8

Payload 64B:

        1st perc.  overall  99th perc.
Before      12.91    16.78       42.24  us
After        9.77    13.57       39.17  us

Payload 512B:

        1st perc.  overall  99th perc.
Before      13.35    17.35       41.52  us
After       10.25    14.11       39.58  us

Payload 4K:

        1st perc.  overall  99th perc.
Before      14.71    19.87       41.52  us
After       10.51    14.96       40.81  us

- Throughput
   Tool: iperf-vsock

The size represents the buffer length (-l) to read/write
P represents the number of parallel streams

P=1
          4K    64K   128K
Before  6.87   29.3   29.5  Gb/s
After   10.5   39.4   39.9  Gb/s

P=2
          4K    64K   128K
Before  10.5   32.8   33.2  Gb/s
After   17.8   47.7   48.5  Gb/s

P=4
          4K    64K   128K
Before  12.7   33.6   34.2  Gb/s
After   16.9   48.1   50.5  Gb/s

The performance improvement comes from this optimization: I used
an eBPF kretprobe on virtio_transport_send_skb to check
that each packet was sent directly to the virtqueue.

Co-developed-by: Marco Pinna <[email protected]>
Signed-off-by: Marco Pinna <[email protected]>
Signed-off-by: Luigi Leonardi <[email protected]>
Message-Id: <20240730-pinna-v4-2-5c9179164db5@outlook.com>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
5 months ago vsock/virtio: refactor virtio_transport_send_pkt_work
Marco Pinna [Tue, 30 Jul 2024 19:47:31 +0000 (21:47 +0200)]
vsock/virtio: refactor virtio_transport_send_pkt_work

Preliminary patch to introduce an optimization to the
enqueue system.

All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.

Co-developed-by: Luigi Leonardi <[email protected]>
Signed-off-by: Luigi Leonardi <[email protected]>
Signed-off-by: Marco Pinna <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Message-Id: <20240730-pinna-v4-1-5c9179164db5@outlook.com>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago fw_cfg: Constify struct kobj_type
Hongbo Li [Wed, 4 Sep 2024 01:17:43 +0000 (09:17 +0800)]
fw_cfg: Constify struct kobj_type

This 'struct kobj_type' is not modified. It is only used in
kobject_init_and_add() which takes a 'const struct kobj_type *ktype'
parameter.

Constifying this structure moves it to a read-only section,
which can increase overall security.

```
[Before]
   text   data    bss    dec    hex    filename
   5974   1008     96   7078   1ba6    drivers/firmware/qemu_fw_cfg.o

[After]
   text   data    bss    dec    hex    filename
   6038    944     96   7078   1ba6    drivers/firmware/qemu_fw_cfg.o
```

Signed-off-by: Hongbo Li <[email protected]>
Message-Id: <20240904011743.2010319[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Postpone MR deletion
Dragos Tatulea [Fri, 30 Aug 2024 10:58:38 +0000 (13:58 +0300)]
vdpa/mlx5: Postpone MR deletion

Currently, when a new MR is set up, the old MR is deleted. MR deletion
takes about 30-40% of the time of MR creation. As deleting the old MR is not
important for the process of setting up the new MR, this operation
can be postponed.

This series adds a workqueue that does MR garbage collection at a later
point. If the MR lock is taken, the handler will back off and
reschedule. The exception is during shutdown: then the handler must
not postpone the work.

Note that this is only a speculative optimization: if there is some
mapping operation that is triggered while the garbage collector handler
has the lock taken, this operation will have to wait for the handler
to finish.
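
The back-off rule can be sketched with a small userspace C model. All
names here are hypothetical, a boolean stands in for the MR lock, and
the shutdown branch simply proceeds where a real handler would have to
wait for the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the MR resources; not the mlx5_vdpa structures. */
struct model_mr_res {
	bool lock_held;   /* someone else holds the MR lock */
	bool shutdown;    /* the device is going away */
	int  garbage;     /* old MRs waiting to be deleted */
	int  reschedules; /* times the handler backed off */
};

/* Returns true if the garbage list was emptied on this run. */
static bool model_gc_handler(struct model_mr_res *r)
{
	/* Normal operation: never make a mapping operation wait for GC. */
	if (r->lock_held && !r->shutdown) {
		r->reschedules++;
		return false;
	}

	/*
	 * Lock is free, or we are shutting down and must not postpone.
	 * A real handler would block on the lock here; the model proceeds.
	 */
	r->garbage = 0;
	return true;
}
```

In the model, a busy lock only delays deletion during normal operation;
at shutdown the handler always finishes the work.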

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Introduce init/destroy for MR resources
Dragos Tatulea [Fri, 30 Aug 2024 10:58:37 +0000 (13:58 +0300)]
vdpa/mlx5: Introduce init/destroy for MR resources

There's currently not a lot of action happening during
the init/destroy of MR resources. But more will be added
in the upcoming patches.

As the mr mutex lock init/destroy has been moved to these
new functions, the lifetime has now shifted away from
mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources()
into these new functions. However, the lifetime at the
outer scope remains the same:
mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free()

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Rename mr_mtx -> lock
Dragos Tatulea [Fri, 30 Aug 2024 10:58:36 +0000 (13:58 +0300)]
vdpa/mlx5: Rename mr_mtx -> lock

Now that the mr resources have their own namespace in the
struct, give the lock a clearer name.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Extract mr members in own resource struct
Dragos Tatulea [Fri, 30 Aug 2024 10:58:35 +0000 (13:58 +0300)]
vdpa/mlx5: Extract mr members in own resource struct

Group all mapping related resources into their own structure.

Upcoming patches will add more members in this new structure.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Rename function
Dragos Tatulea [Fri, 30 Aug 2024 10:58:34 +0000 (13:58 +0300)]
vdpa/mlx5: Rename function

A followup patch will use this name for something else.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Delete direct MKEYs in parallel
Dragos Tatulea [Fri, 30 Aug 2024 10:58:33 +0000 (13:58 +0300)]
vdpa/mlx5: Delete direct MKEYs in parallel

Use the async interface to issue MTT MKEY deletion.

This makes destroy_user_mr() on average 8x faster.
This number is also dependent on the size of the MR being
deleted.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago vdpa/mlx5: Create direct MKEYs in parallel
Dragos Tatulea [Fri, 30 Aug 2024 10:58:32 +0000 (13:58 +0300)]
vdpa/mlx5: Create direct MKEYs in parallel

Use the async interface to issue MTT MKEY creation.
Extra care is taken at the allocation of FW input commands
due to the MTT tables having variable sizes depending on
MR.

The indirect MKEY is still created synchronously at the
end as the direct MKEYs need to be filled in.

This makes create_user_mr() 3-5x faster, depending on
the size of the MR.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <20240830105838.2666587[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section
Stefano Garzarella [Thu, 29 Aug 2024 14:37:57 +0000 (16:37 +0200)]
MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section

The virtio-vsock driver is already under VM SOCKETS (AF_VSOCK),
managed principally with the net tree, and VIRTIO AND VHOST
VSOCK DRIVER. However, changes that only affect the virtio part
usually go with Michael's tree, so let's also put the driver in
the VIRTIO CORE section to have its maintainers in CC for changes
to the virtio-vsock driver.

Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <20240829143757[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Jason Wang <[email protected]>
5 months ago virtio_fs: add sysfs entries for queue information
Max Gurtovoy [Sun, 25 Aug 2024 13:07:16 +0000 (16:07 +0300)]
virtio_fs: add sysfs entries for queue information

Introduce sysfs entries to provide visibility to the multiple queues
used by the Virtio FS device. This enhancement allows users to query
information about these queues.

Specifically, add two sysfs entries:
1. Queue name: Provides the name of each queue (e.g. hiprio/requests.8).
2. CPU list: Shows the list of CPUs that can process requests for each
queue.

The CPU list feature is inspired by similar functionality in the block
MQ layer, which provides analogous sysfs entries for block devices.

These new sysfs entries will improve observability and aid in debugging
and performance tuning of Virtio FS devices.

Reviewed-by: Idan Zach <[email protected]>
Reviewed-by: Shai Malin <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Message-Id: <20240825130716[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
5 months ago virtio_fs: introduce virtio_fs_put_locked helper
Max Gurtovoy [Sun, 25 Aug 2024 13:07:15 +0000 (16:07 +0300)]
virtio_fs: introduce virtio_fs_put_locked helper

Introduce a new helper function virtio_fs_put_locked to encapsulate the
common pattern of releasing a virtio_fs reference while holding a lock.
The existing virtio_fs_put helper will be used to release a virtio_fs
reference while not holding a lock.

Also add an assertion in case the lock is not taken when it should be.
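
The locked/unlocked helper split described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct obj`, `obj_lock`, `obj_put()` and `obj_put_locked()` are hypothetical names, the `obj_lock_held` flag stands in for a kernel lock-held assertion, and the unlocked variant is assumed to take the lock itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical refcounted object; not the kernel's struct virtio_fs. */
struct obj {
	int refcount;
};

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;
static bool obj_lock_held; /* userspace substitute for a lock-held assertion */

/* Drop a reference. The caller must already hold obj_lock. */
static void obj_put_locked(struct obj *o)
{
	assert(obj_lock_held); /* catch callers that forgot to take the lock */
	o->refcount--;
}

/* Drop a reference. The caller must NOT hold obj_lock. */
static void obj_put(struct obj *o)
{
	pthread_mutex_lock(&obj_lock);
	obj_lock_held = true;
	obj_put_locked(o);
	obj_lock_held = false;
	pthread_mutex_unlock(&obj_lock);
}
```

Keeping the assertion in the locked variant means a caller that picks the wrong helper fails loudly instead of racing silently.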

Reviewed-by: Idan Zach <[email protected]>
Reviewed-by: Shai Malin <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Message-Id: <20240825130716[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
5 months ago vdpa: Remove unused declarations
Yue Haibing [Mon, 19 Aug 2024 14:09:30 +0000 (22:09 +0800)]
vdpa: Remove unused declarations

There is no caller or implementation in the tree.

Signed-off-by: Yue Haibing <[email protected]>
Message-Id: <20240819140930[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Zhu Lingshan <[email protected]>
5 months ago vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
Dragos Tatulea [Fri, 16 Aug 2024 09:01:59 +0000 (12:01 +0300)]
vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command

change_num_qps() is still suspending/resuming VQs one by one.
This change switches to parallel suspend/resume.

When increasing the number of queues the flow has changed a bit for
simplicity: the setup_vq() function will always be called before
resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If
the VQ is not initialized, setup_vq() will create it and resume_vqs()
will resume it.
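
A minimal sketch of the flow above, assuming a simplified `struct vq` with two flags. The names `setup_vq()` and `resume_vqs()` come from the commit message, but the signatures, the `grow_vqs()` helper and the sequential loop (standing in for the parallel firmware commands) are assumptions made for illustration.

```c
#include <stdbool.h>

/* Simplified, hypothetical VQ state. */
struct vq {
	bool initialized;
	bool running;
};

/* Create the VQ only if needed. Returns 1 if created, 0 on early exit. */
static int setup_vq(struct vq *vq)
{
	if (vq->initialized)
		return 0; /* already set up: nothing to do */
	vq->initialized = true;
	return 1;
}

/* Resume the range [start, start + num) in one call. */
static void resume_vqs(struct vq *vqs, int start, int num)
{
	for (int i = start; i < start + num; i++)
		vqs[i].running = true;
}

/* Grow from cur_vqs to new_vqs. Returns how many VQs had to be created. */
static int grow_vqs(struct vq *vqs, int cur_vqs, int new_vqs)
{
	int created = 0;

	/* setup_vq() is called unconditionally; it decides whether to act. */
	for (int i = cur_vqs; i < new_vqs; i++)
		created += setup_vq(&vqs[i]);

	resume_vqs(vqs, cur_vqs, new_vqs - cur_vqs);
	return created;
}
```

Because `setup_vq()` is idempotent, the caller does not need a separate branch for VQs that already exist.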

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Small improvement for change_num_qps()
Dragos Tatulea [Fri, 16 Aug 2024 09:01:58 +0000 (12:01 +0300)]
vdpa/mlx5: Small improvement for change_num_qps()

change_num_qps() has a lot of multiplications by 2 to convert
the number of VQ pairs to number of VQs. This patch simplifies
the code by doing the VQP -> VQ count conversion at the beginning
in a variable.
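
As an illustration of the conversion, a small sketch with hypothetical helper names. It assumes each VQ pair contributes two VQs (one receive, one transmit) and converts once up front instead of multiplying by 2 at every use.

```c
/* Each virtqueue pair (VQP) contributes two VQs. */
static int vqp_to_vq_count(int num_vqps)
{
	return 2 * num_vqps;
}

/* Number of VQs affected when going from cur_vqps to new_vqps pairs. */
static int vqs_to_change(int cur_vqps, int new_vqps)
{
	/* Convert once; the rest of the function works in VQ units. */
	int cur_vqs = vqp_to_vq_count(cur_vqps);
	int new_vqs = vqp_to_vq_count(new_vqps);

	return new_vqs > cur_vqs ? new_vqs - cur_vqs : cur_vqs - new_vqs;
}
```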

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Keep notifiers during suspend but ignore
Dragos Tatulea [Fri, 16 Aug 2024 09:01:57 +0000 (12:01 +0300)]
vdpa/mlx5: Keep notifiers during suspend but ignore

Unregistering notifiers is a costly operation. Instead of removing
the notifiers during device suspend and adding them back at resume,
simply ignore the call when the device is suspended.

At resume time call queue_link_work() to make sure that the device state
is propagated in case there were changes.
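
The idea can be sketched as follows. Everything except the name `queue_link_work()` is hypothetical (`struct dev`, `event_handler()`, `dev_suspend()`, `dev_resume()`), and the counter only models the work being queued.

```c
#include <stdbool.h>

/* Hypothetical device state for the sketch. */
struct dev {
	bool suspended;
	int link_work_queued; /* how many times link work was queued */
};

static void queue_link_work(struct dev *d)
{
	d->link_work_queued++;
}

/* The notifier stays registered. Returns 1 if handled, 0 if ignored. */
static int event_handler(struct dev *d)
{
	if (d->suspended)
		return 0; /* cheap early exit instead of unregistering */
	queue_link_work(d);
	return 1;
}

static void dev_suspend(struct dev *d)
{
	d->suspended = true;
}

static void dev_resume(struct dev *d)
{
	d->suspended = false;
	queue_link_work(d); /* pick up any change missed while suspended */
}
```

The explicit `queue_link_work()` call at resume is what makes ignoring events safe: state changes during suspend are not lost, only deferred.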

For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device suspend time is reduced from
~13 ms to ~2.5 ms.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Parallelize device resume
Dragos Tatulea [Fri, 16 Aug 2024 09:01:56 +0000 (12:01 +0300)]
vdpa/mlx5: Parallelize device resume

Currently device resume works on vqs serially. Building on previous
changes that converted vq operations to the async API, this patch
parallelizes the device resume.

For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device resume time is reduced from
~16 ms to ~4.5 ms.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Parallelize device suspend
Dragos Tatulea [Fri, 16 Aug 2024 09:01:55 +0000 (12:01 +0300)]
vdpa/mlx5: Parallelize device suspend

Currently device suspend works on vqs serially. Building on previous
changes that converted vq operations to the async API, this patch
parallelizes the device suspend:
1) Suspend all active vqs in parallel.
2) Query suspended vqs in parallel.

For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device suspend time is reduced from
~37 ms to ~13 ms.

A later patch will remove the link unregister operation which will make
it even faster.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Use async API for vq modify commands
Dragos Tatulea [Fri, 16 Aug 2024 09:01:54 +0000 (12:01 +0300)]
vdpa/mlx5: Use async API for vq modify commands

Switch firmware vq modify command to be issued via the async API to
allow future parallelization. The new refactored function applies the
modify on a range of vqs and waits for their execution to complete.

For now the command is still used in a serial fashion. A later patch
will switch to modifying multiple vqs in parallel.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Use async API for vq query command
Dragos Tatulea [Fri, 16 Aug 2024 09:01:53 +0000 (12:01 +0300)]
vdpa/mlx5: Use async API for vq query command

Switch firmware vq query command to be issued via the async API to
allow future parallelization.

For now the command is still serial but the infrastructure is there
to issue commands in parallel, including ratelimiting the number
of issued async commands to firmware.

A later patch will switch to issuing more commands at a time.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Introduce async fw command wrapper
Dragos Tatulea [Fri, 16 Aug 2024 09:01:52 +0000 (12:01 +0300)]
vdpa/mlx5: Introduce async fw command wrapper

Introduce a new function mlx5_vdpa_exec_async_cmds() which
wraps the mlx5_core async firmware command API in a way
that will be used to parallelize certain operations in this
driver.

The wrapper deals with the case when mlx5_cmd_exec_cb() returns
EBUSY due to the command being throttled.
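
One way such a wrapper might handle EBUSY is to wait for an in-flight command to complete and then retry the submission. The sketch below is a single-threaded model under that assumption, not the driver's code: `struct fw`, `fw_submit()`, `fw_wait_one()` and `exec_async_cmds()` are hypothetical names.

```c
#include <errno.h>

/* Hypothetical firmware model: a throttle budget plus counters. */
struct fw {
	int free_slots; /* commands that may still be issued */
	int in_flight;  /* issued but not yet completed */
	int completed;
};

/* Non-blocking submit: 0 on success, -EBUSY when throttled. */
static int fw_submit(struct fw *fw)
{
	if (fw->free_slots == 0)
		return -EBUSY;
	fw->free_slots--;
	fw->in_flight++;
	return 0;
}

/* Stand-in for blocking until one in-flight command completes. */
static void fw_wait_one(struct fw *fw)
{
	fw->in_flight--;
	fw->free_slots++;
	fw->completed++;
}

/* Issue num_cmds commands, backing off whenever submission is throttled. */
static int exec_async_cmds(struct fw *fw, int num_cmds)
{
	int i = 0;

	while (i < num_cmds) {
		int err = fw_submit(fw);

		if (err == -EBUSY && fw->in_flight > 0) {
			fw_wait_one(fw); /* free a slot, then retry the same command */
			continue;
		}
		if (err)
			return err; /* throttled with nothing to wait for, or other error */
		i++;
	}

	while (fw->in_flight > 0) /* drain everything before returning */
		fw_wait_one(fw);
	return 0;
}
```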

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago vdpa/mlx5: Introduce error logging function
Dragos Tatulea [Fri, 16 Aug 2024 09:01:51 +0000 (12:01 +0300)]
vdpa/mlx5: Introduce error logging function

mlx5_vdpa_err() was missing. This patch adds it and uses it in the
necessary places.

Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago net/mlx5: Support throttled commands from async API
Dragos Tatulea [Fri, 16 Aug 2024 09:01:50 +0000 (12:01 +0300)]
net/mlx5: Support throttled commands from async API

Currently, commands that qualify as throttled can't be used via the
async API. That's due to the fact that the throttle semaphore can sleep
but the async API can't.

This patch allows throttling in the async API by using the tentative
variant of the semaphore and upon failure (semaphore at 0) returns EBUSY
to signal to the caller that they need to wait for the completion of
previously issued commands.

Furthermore, make sure that the semaphore is released in the callback.
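
The mechanism can be modelled with a plain counter in place of the semaphore. This is a single-threaded sketch with hypothetical names (`struct throttle`, `throttle_try_acquire()`, `cmd_exec_async()`, `cmd_complete()`); a real implementation would use a proper semaphore with a non-sleeping acquire.

```c
#include <errno.h>

/* Hypothetical throttle: counts free command slots, like a semaphore count. */
struct throttle {
	int free_slots;
};

/* Tentative (non-sleeping) acquire: 0 on success, -EBUSY when the count is 0. */
static int throttle_try_acquire(struct throttle *t)
{
	if (t->free_slots == 0)
		return -EBUSY;
	t->free_slots--;
	return 0;
}

/* Issue an async command without ever sleeping. */
static int cmd_exec_async(struct throttle *t)
{
	int err = throttle_try_acquire(t);

	if (err)
		return err; /* caller must wait for earlier commands to complete */
	/* ... the command would be handed to firmware here ... */
	return 0;
}

/* Completion callback: release the slot taken at submit time. */
static void cmd_complete(struct throttle *t)
{
	t->free_slots++;
}
```

Releasing the slot in the completion callback, rather than at submit time, is what bounds the number of commands outstanding at once.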

Signed-off-by: Dragos Tatulea <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <20240816090159.1967650[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
5 months ago Merge tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme into for-6.12/block
Jens Axboe [Wed, 25 Sep 2024 09:29:17 +0000 (03:29 -0600)]
Merge tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme into for-6.12/block

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.12

 - Multipath fixes (Hannes)
 - Sysfs attribute list NULL terminate fix (Shin'ichiro)
 - Remove problematic read-back (Keith)"

* tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device

5 months ago nvme: remove CC register read-back during enabling
Keith Busch [Wed, 4 Sep 2024 21:48:50 +0000 (14:48 -0700)]
nvme: remove CC register read-back during enabling

Any non-posted read should flush the previous write, so we don't
necessarily need to read back the value we just wrote. I've found at
least some controllers that respond with 0 for short moments after
writing the CC register with EN (enable) cleared, so the read-back is
overwriting our valid ctrl_config value and ends up breaking on the
subsequent enabling.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
5 months ago nvme: null terminate nvme_tls_attrs
Shin'ichiro Kawasaki [Tue, 24 Sep 2024 09:01:34 +0000 (18:01 +0900)]
nvme: null terminate nvme_tls_attrs

Commit 1e48b34c9bc7 ("nvme: split off TLS sysfs attributes into a
separate group") introduced the struct attribute array nvme_tls_attrs.
However, the array was not null terminated, which caused a KASAN
global-out-of-bounds BUG. To avoid the BUG, null terminate the array.
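
Why the terminator matters can be shown with a small sketch. The `struct attribute` here is a simplified stand-in, and `attr_a`, `attr_b`, `example_attrs` and `count_attrs()` are hypothetical names. Code that walks such an array has no length to consult, so it stops only at the NULL sentinel.

```c
#include <stddef.h>

/* Simplified stand-in for an attribute descriptor. */
struct attribute {
	const char *name;
};

static struct attribute attr_a = { "attr_a" };
static struct attribute attr_b = { "attr_b" };

static struct attribute *example_attrs[] = {
	&attr_a,
	&attr_b,
	NULL, /* sentinel: without it the walk below reads past the array */
};

/* Walk the array the way a consumer without a length field would. */
static size_t count_attrs(struct attribute **attrs)
{
	size_t n = 0;

	while (attrs[n] != NULL)
		n++;
	return n;
}
```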

Reported-by: Yi Zhang <[email protected]>
Closes: https://lore.kernel.org/linux-nvme/jhllwfxcedrcxcnbajwl4x2l2ujcqowqcd4ps574zrafrqhjna@f4icvecutekm/
Fixes: 1e48b34c9bc7 ("nvme: split off TLS sysfs attributes into a separate group")
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
Tested-by: Yi Zhang <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
5 months ago nvme-multipath: avoid hang on inaccessible namespaces
Hannes Reinecke [Sat, 14 Sep 2024 12:01:23 +0000 (14:01 +0200)]
nvme-multipath: avoid hang on inaccessible namespaces

During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:

[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]

This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.

This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed to properly terminate pending I/O.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>