Git Repo - linux.git/log

bpf: get type information with BTF_ID_LIST

Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information while building
instead of runtime.

Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpf: refactory struct_ops type initialization to a function.

Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.

Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

Merge branch 'bpf-add-cookies-retrieval-for-perf-kprobe-multi-links'

Jiri Olsa says:

====================
bpf: Add cookies retrieval for perf/kprobe multi links

hi,
this patchset adds support to retrieve cookies from existing tracing
links that still did not support it plus changes to bpftool to display
them. It's leftover we discussed some time ago [1].

thanks,
jirka

v2 changes:
- added review/ack tags
- fixed memory leak [Quentin]
- align the uapi fields properly [Yafang Shao]

[1] https://lore.kernel.org/bpf/CALOAHbAZ6=A9j3VFCLoAC_WhgQKU7injMf06=cM2sU4Hi4Sx+Q@mail.gmail.com/
Reviewed-by: Quentin Monnet <[email protected]>
---
====================

Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpftool: Display cookie for kprobe multi link

Displaying cookies for kprobe multi link, in plain mode:

  # bpftool link
  ...
  1397: kprobe_multi  prog 47532
          kretprobe.multi  func_cnt 3
          addr             cookie           func [module]
          ffffffff82b370c0 3                bpf_fentry_test1
          ffffffff82b39780 1                bpf_fentry_test2
          ffffffff82b397a0 2                bpf_fentry_test3

And in json mode:

  # bpftool link -j | jq
  ...
    {
      "id": 1397,
      "type": "kprobe_multi",
      "prog_id": 47532,
      "retprobe": true,
      "func_cnt": 3,
      "missed": 0,
      "funcs": [
        {
          "addr": 18446744071607382208,
          "func": "bpf_fentry_test1",
          "module": null,
          "cookie": 3
        },
        {
          "addr": 18446744071607392128,
          "func": "bpf_fentry_test2",
          "module": null,
          "cookie": 1
        },
        {
          "addr": 18446744071607392160,
          "func": "bpf_fentry_test3",
          "module": null,
          "cookie": 2
        }
      ]
    }

Cookie is attached to specific address, and because we sort addresses
before printing, we need to sort cookies the same way, hence adding
the struct addr_cookie to keep and sort them together.

Also adding missing dd.sym_count check to show_kprobe_multi_json.

Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpftool: Display cookie for perf event link probes

Displaying cookie for perf event link probes, in plain mode:

  # bpftool link
  17: perf_event  prog 90
          kprobe ffffffff82b1c2b0 bpf_fentry_test1  cookie 3735928559
  18: perf_event  prog 90
          kretprobe ffffffff82b1c2b0 bpf_fentry_test1  cookie 3735928559
  20: perf_event  prog 92
          tracepoint sched_switch  cookie 3735928559
  21: perf_event  prog 93
          event software:page-faults  cookie 3735928559
  22: perf_event  prog 91
          uprobe /proc/self/exe+0xd703c  cookie 3735928559

And in json mode:

  # bpftool link -j | jq

  {
    "id": 30,
    "type": "perf_event",
    "prog_id": 160,
    "retprobe": false,
    "addr": 18446744071607272112,
    "func": "bpf_fentry_test1",
    "offset": 0,
    "missed": 0,
    "cookie": 3735928559
  }

  {
    "id": 33,
    "type": "perf_event",
    "prog_id": 162,
    "tracepoint": "sched_switch",
    "cookie": 3735928559
  }

  {
    "id": 34,
    "type": "perf_event",
    "prog_id": 163,
    "event_type": "software",
    "event_config": "page-faults",
    "cookie": 3735928559
  }

  {
    "id": 35,
    "type": "perf_event",
    "prog_id": 161,
    "retprobe": false,
    "file": "/proc/self/exe",
    "offset": 880700,
    "cookie": 3735928559
  }

Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add fill_link_info test for perf event

Adding fill_link_info test for perf event and testing we
get its values back through the bpf_link_info interface.

Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add cookies check for perf_event fill_link_info test

Now that we get cookies for perf_event probes, adding tests
for cookie for kprobe/uprobe/tracepoint.

The perf_event test needs to be added completely and is coming
in following change.

Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add cookies check for kprobe_multi fill_link_info test

Adding cookies check for kprobe_multi fill_link_info test,
plus tests for invalid values related to cookies.

Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpftool: Fix wrong free call in do_show_link

The error path frees wrong array, it should be ref_ctr_offsets.

Acked-by: Yafang Shao <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Fixes: a7795698f8b6 ("bpftool: Add support to display uprobe_multi links")
Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Store cookies in kprobe_multi bpf_link_info data

Storing cookies in kprobe_multi bpf_link_info data. The cookies
field is optional and if provided it needs to be an array of
__u64 with kprobe_multi.count length.

Acked-by: Yafang Shao <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Add cookie to perf_event bpf_link_info records

At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.

Adding cookie fields to struct bpf_link_info perf event
probe records:

  perf_event.uprobe
  perf_event.kprobe
  perf_event.tracepoint
  perf_event.perf_event

And the code to store that in bpf_link_info struct.

Signed-off-by: Jiri Olsa <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Use r constraint instead of p constraint in selftests

Some of the BPF selftests use the "p" constraint in inline assembly
snippets, for input operands for MOV (rN = rM) instructions.

This is mainly done via the __imm_ptr macro defined in
tools/testing/selftests/bpf/progs/bpf_misc.h:

  #define __imm_ptr(name) [name]"p"(&name)

Example:

  int consume_first_item_only(void *ctx)
  {
        struct bpf_iter_num iter;
        asm volatile (
                /* create iterator */
                "r1 = %[iter];"
                [...]
                :
                : __imm_ptr(iter)
                : CLOBBERS);
        [...]
  }

The "p" constraint is a tricky one.  It is documented in the GCC manual
section "Simple Constraints":

  An operand that is a valid memory address is allowed.  This is for
  ``load address'' and ``push address'' instructions.

  p in the constraint must be accompanied by address_operand as the
  predicate in the match_operand.  This predicate interprets the mode
  specified in the match_operand as the mode of the memory reference for
  which the address would be valid.

There are two problems:

1. It is questionable whether that constraint was ever intended to be
   used in inline assembly templates, because its behavior really
   depends on compiler internals.  A "memory address" is not the same
   than a "memory operand" or a "memory reference" (constraint "m"), and
   in fact its usage in the template above results in an error in both
   x86_64-linux-gnu and bpf-unkonwn-none:

     foo.c: In function ‘bar’:
     foo.c:6:3: error: invalid 'asm': invalid expression as operand
        6 |   asm volatile ("r1 = %[jorl]" : : [jorl]"p"(&jorl));
          |   ^~~

   I would assume the same happens with aarch64, riscv, and most/all
   other targets in GCC, that do not accept operands of the form A + B
   that are not wrapped either in a const or in a memory reference.

   To avoid that error, the usage of the "p" constraint in internal GCC
   instruction templates is supposed to be complemented by the 'a'
   modifier, like in:

     asm volatile ("r1 = %a[jorl]" : : [jorl]"p"(&jorl));

   Internally documented (in GCC's final.cc) as:

     %aN means expect operand N to be a memory address
        (not a memory reference!) and print a reference
        to that address.

   That works because when the modifier 'a' is found, GCC prints an
   "operand address", which is not the same than an "operand".

   But...

2. Even if we used the internal 'a' modifier (we shouldn't) the 'rN =
   rM' instruction really requires a register argument.  In cases
   involving automatics, like in the examples above, we easily end with:

     bar:
        #APP
            r1 = r10-4
        #NO_APP

   In other cases we could conceibly also end with a 64-bit label that
   may overflow the 32-bit immediate operand of `rN = imm32'
   instructions:

        r1 = foo

   All of which is clearly wrong.

clang happens to do "the right thing" in the current usage of __imm_ptr
in the BPF tests, because even with -O2 it seems to "reload" the
fp-relative address of the automatic to a register like in:

  bar:
r1 = r10
r1 += -4
#APP
r1 = r1
#NO_APP

Which is what GCC would generate with -O0.  Whether this is by chance
or by design, the compiler shouln't be expected to do that reload
driven by the "p" constraint.

This patch changes the usage of the "p" constraint in the BPF
selftests macros to use the "r" constraint instead.  If a register is
what is required, we should let the compiler know.

Previous discussion in bpf@vger:
https://lore.kernel.org/bpf/[email protected]/T/#ef0df83d6975c34dff20bf0dd52e078f5b8ca2767

Tested in bpf-next master.
No regressions.

Signed-off-by: Jose E. Marchesi <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: fix constraint in test_tcpbpf_kern.c

GCC emits a warning:

  progs/test_tcpbpf_kern.c:60:9: error: ‘op’ is used uninitialized [-Werror=uninitialized]

when an uninialized op is used with a "+r" constraint.  The + modifier
means a read-write operand, but that operand in the selftest is just
written to.

This patch changes the selftest to use a "=r" constraint.  This
pacifies GCC.

Tested in bpf-next master.
No regressions.

Signed-off-by: Jose E. Marchesi <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: avoid VLAs in progs/test_xdp_dynptr.c

VLAs are not supported by either the BPF port of clang nor GCC.  The
selftest test_xdp_dynptr.c contains the following code:

  const size_t tcphdr_sz = sizeof(struct tcphdr);
  const size_t udphdr_sz = sizeof(struct udphdr);
  const size_t ethhdr_sz = sizeof(struct ethhdr);
  const size_t iphdr_sz = sizeof(struct iphdr);
  const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);

  [...]

  static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
  {
__u8 eth_buffer[ethhdr_sz + iphdr_sz + ethhdr_sz];
__u8 iph_buffer_tcp[iphdr_sz + tcphdr_sz];
__u8 iph_buffer_udp[iphdr_sz + udphdr_sz];
[...]
  }

The eth_buffer, iph_buffer_tcp and other automatics are fixed size
only if the compiler optimizes away the constant global variables.
clang does this, but GCC does not, turning these automatics into
variable length arrays.

This patch removes the global variables and turns these values into
preprocessor constants.  This makes the selftest to build properly
with GCC.

Tested in bpf-next master.
No regressions.

Signed-off-by: Jose E. Marchesi <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

libbpf: call dup2() syscall directly

We've ran into issues with using dup2() API in production setting, where
libbpf is linked into large production environment and ends up calling
unintended custom implementations of dup2(). These custom implementations
don't provide atomic FD replacement guarantees of dup2() syscall,
leading to subtle and hard to debug issues.

To prevent this in the future and guarantee that no libc implementation
will do their own custom non-atomic dup2() implementation, call dup2()
syscall directly with syscall(SYS_dup2).

Note that some architectures don't seem to provide dup2 and have dup3
instead. Try to detect and pick best syscall.

Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'enable-the-inline-of-kptr_xchg-for-arm64'

Hou Tao says:

====================
Enable the inline of kptr_xchg for arm64

From: Hou Tao <[email protected]>

Hi,

The patch set is just a follow-up for "bpf: inline bpf_kptr_xchg()". It
enables the inline of bpf_kptr_xchg() and kptr_xchg_inline test for
arm64.

Please see individual patches for more details. And comments are always
welcome.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Enable kptr_xchg_inline test for arm64

Now arm64 bpf jit has enable bpf_jit_supports_ptr_xchg(), so enable
the test for arm64 as well.

Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf, arm64: Enable the inline of bpf_kptr_xchg()

ARM64 bpf jit satisfies the following two conditions:
1) support BPF_XCHG() on pointer-sized word.
2) the implementation of xchg is the same as atomic_xchg() on
pointer-sized words. Both of these two functions use arch_xchg() to
implement the exchange.

So enable the inline of bpf_kptr_xchg() for arm64 bpf jit.

Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf, docs: Clarify that MOVSX is only for BPF_X not BPF_K

Per discussion on the mailing list at
https://mailarchive.ietf.org/arch/msg/bpf/uQiqhURdtxV_ZQOTgjCdm-seh74/
the MOVSX operation is only defined to support register extension.

The document didn't previously state this and incorrectly implied
that one could use an immediate value.

Signed-off-by: Dave Thaler <[email protected]>
Acked-by: David Vernet <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Define struct bpf_tcp_req_attrs when CONFIG_SYN_COOKIES=n.

kernel test robot reported the warning below:

  >> net/core/filter.c:11842:13: warning: declaration of 'struct bpf_tcp_req_attrs' will not be visible outside of this function [-Wvisibility]
      11842 |                                         struct bpf_tcp_req_attrs *attrs, int attrs__sz)
            |                                                ^
     1 warning generated.

struct bpf_tcp_req_attrs is defined under CONFIG_SYN_COOKIES
but used in kfunc without the config.

Let's move struct bpf_tcp_req_attrs definition outside of
CONFIG_SYN_COOKIES guard.

Fixes: e472f88891ab ("bpf: tcp: Support arbitrary SYN Cookie.")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Refactor ptr alu checking rules to allow alu explicitly

Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restruture it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:

        | Pointer type        | Arithmetics allowed |
        |---------------------+---------------------|
        | PTR_TO_CTX          | yes                 |
        | CONST_PTR_TO_MAP    | conditionally       |
        | PTR_TO_MAP_VALUE    | yes                 |
        | PTR_TO_MAP_KEY      | yes                 |
        | PTR_TO_STACK        | yes                 |
        | PTR_TO_PACKET_META  | yes                 |
        | PTR_TO_PACKET       | yes                 |
        | PTR_TO_PACKET_END   | no                  |
        | PTR_TO_FLOW_KEYS    | conditionally       |
        | PTR_TO_SOCKET       | no                  |
        | PTR_TO_SOCK_COMMON  | no                  |
        | PTR_TO_TCP_SOCK     | no                  |
        | PTR_TO_TP_BUFFER    | yes                 |
        | PTR_TO_XDP_SOCK     | no                  |
        | PTR_TO_BTF_ID       | yes                 |
        | PTR_TO_MEM          | yes                 |
        | PTR_TO_BUF          | yes                 |
        | PTR_TO_FUNC         | yes                 |
        | CONST_PTR_TO_DYNPTR | yes                 |

The refactored rules are equivalent to the original one. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not reject here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passing to calls are rejected check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs earily.

Signed-off-by: Hao Sun <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftest/bpf: Add map_in_maps with BPF_MAP_TYPE_PERF_EVENT_ARRAY values

Check that bpf_object__load() successfully creates map_in_maps
with BPF_MAP_TYPE_PERF_EVENT_ARRAY values.
These changes cover fix in the previous patch
"libbpf: Apply map_set_def_max_entries() for inner_maps on creation".

A command line output is:
- w/o fix
$ sudo ./test_maps
libbpf: map 'mim_array_pe': failed to create inner map: -22
libbpf: map 'mim_array_pe': failed to create: Invalid argument(-22)
libbpf: failed to load object './test_map_in_map.bpf.o'
Failed to load test prog

- with fix
$ sudo ./test_maps
...
test_maps: OK, 0 SKIPPED

Fixes: 646f02ffdd49 ("libbpf: Add BTF-defined map-in-map support")
Signed-off-by: Andrey Grafin <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

libbpf: Apply map_set_def_max_entries() for inner_maps on creation

This patch allows to auto create BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS with values of BPF_MAP_TYPE_PERF_EVENT_ARRAY
by bpf_object__load().

Previous behaviour created a zero filled btf_map_def for inner maps and
tried to use it for a map creation but the linux kernel forbids to create
a BPF_MAP_TYPE_PERF_EVENT_ARRAY map with max_entries=0.

Fixes: 646f02ffdd49 ("libbpf: Add BTF-defined map-in-map support")
Signed-off-by: Andrey Grafin <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Sync uapi bpf.h header for the tooling infra

Both commit 91051f003948 ("tcp: Dump bound-only sockets in inet_diag.")
and commit 985b8ea9ec7e ("bpf, docs: Fix bpf_redirect_peer header doc")
missed the tooling header sync. Fix it.

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf, docs: Fix bpf_redirect_peer header doc

Amend the bpf_redirect_peer() header documentation to also mention
support for the netkit device type.

Signed-off-by: Victor Stewart <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'bpf: tcp: Support arbitrary SYN Cookie at TC.'

Kuniyuki Iwashima says:

====================
Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
for the connection request until a valid ACK is responded to the SYN+ACK.

The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so only can it be validated by the generator.  It means SYN
Cookie consumes network resources between the client and the server;
intermediate nodes must remember which nodes to route ACK for the cookie.

SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS.  However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
node goes down, all the connections through it are terminated.  Keeping
a state at proxy is painful from that perspective.

At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
Our SYN Proxy consists of the front proxy layer and the backend kernel
module.  (See slides of LPC2023 [0], p37 - p48)

The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) (i) shared by all the proxy
nodes so that any node can validate ACK and (ii) updated periodically so
that old cookies cannot be validated and we need not encode a timestamp for
the cookie.  Also, ISN contains WScale, SACK, and ECN, not in TS val.  This
is not to sacrifice any connection quality, where some customers turn off
TCP timestamps option due to retro CVE.

After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forward
the TCP-in-TCP packet to the backend server.  Our kernel module works at
Netfilter input/output hooks and first feeds SYN to the TCP stack to
initiate 3WHS.  When the module is triggered for SYN+ACK, it looks up the
corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
proxy's cookie.  Then, the module can complete 3WHS with the original ACK
as is.

This way, our SYN Proxy does not manage the ISN mappings nor wait for
SYN+ACK from the backend thus can remain stateless.  It's working very
well for high-bandwidth services like multiple Tbps, but we are looking
for a way to drop the dirty hack and further optimise the sequences.

If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would need not restore SYN nor pass it.  After validating
ACK, the proxy node just needs to forward it, and then the server can do
the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
and create a connection from the ACK.

This series allows us to create a full sk from an arbitrary SYN Cookie,
which is done in 3 steps.

  1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
     it based on the argument populated from SYN Cookie.  The reqsk has
     its listener as req->rsk_listener and is passed to the TCP stack as
     skb->sk.

  2) During TCP socket lookup for the skb, skb_steal_sock() returns a
     listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
     belongs to.

  3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
     a full sk is created.

The kfunc usage is as follows:

    struct bpf_tcp_req_attrs attrs = {
        .mss = mss,
        .wscale_ok = wscale_ok,
        .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
        .snd_wscale = snd_wscale, /* Client's WScale < 15 */
        .tstamp_ok = tstamp_ok,
        .rcv_tsval = tsval,
        .rcv_tsecr = tsecr, /* Server's Initial TSval */
        .usec_ts_ok = usec_ts_ok,
        .sack_ok = sack_ok,
        .ecn_ok = ecn_ok,
    }

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
    bpf_sk_release(skc);

[0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf

Changes:
  v8
    * Rebase on Yonghong's cpuv4 fix
    * Patch 5
      * Fill the trailing 3-bytes padding in struct bpf_tcp_req_attrs
        and test it as null
    * Patch 6
      * Remove unused IPPROTP_MPTCP definition

  v7: https://lore.kernel.org/bpf/20231221012806 [email protected]/
    * Patch 5 & 6
      * Drop MPTCP support

  v6: https://lore.kernel.org/bpf/20231214155424 [email protected]/
    * Patch 5 & 6
      * /struct /s/tcp_cookie_attributes/bpf_tcp_req_attrs/
      * Don't reuse struct tcp_options_received and use u8 for each attrs
    * Patch 6
      * Check retval of test__start_subtest()

  v5: https://lore.kernel.org/netdev/20231211073650 [email protected]/
    * Split patch 1-3
    * Patch 3
      * Clear req->rsk_listener in skb_steal_sock()
    * Patch 4 & 5
      * Move sysctl validation and tsoff init from cookie_bpf_check() to kfunc
    * Patch 5
      * Do not increment LINUX_MIB_SYNCOOKIES(RECV|FAILED)
    * Patch 6
      * Remove __always_inline
      * Test if tcp_handle_{syn,ack}() is executed
      * Move some definition to bpf_tracing_net.h
      * s/BPF_F_CURRENT_NETNS/-1/

  v4: https://lore.kernel.org/bpf/20231205013420 [email protected]/
    * Patch 1 & 2
      * s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
    * Patch 1
      * Don't set rcv_wscale for BPF SYN Cookie case.
    * Patch 2
      * Add test for tcp_opt.{unused,rcv_wscale} in kfunc
      * Modify skb_steal_sock() to avoid resetting skb-sk
      * Support SO_REUSEPORT lookup
    * Patch 3
      * Add CONFIG_SYN_COOKIES to Kconfig for CI
      * Define BPF_F_CURRENT_NETNS

  v3: https://lore.kernel.org/netdev/20231121184245 [email protected]/
    * Guard kfunc and req->syncookie part in inet6?_steal_sock() with
      CONFIG_SYN_COOKIE

  v2: https://lore.kernel.org/netdev/20231120222341 [email protected]/
    * Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
    * Add cleanup patches to reduce discrepancy between cookie_v[46]_check()

  v1: https://lore.kernel.org/bpf/20231013220433 [email protected]/
====================

Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftest: bpf: Test bpf_sk_assign_tcp_reqsk().

This commit adds a sample selftest to demonstrate how we can use
bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.

The test creates IPv4/IPv6 x TCP connections and transfer messages
over them on lo with BPF tc prog attached.

The tc prog will process SYN and returns SYN+ACK with the following
ISN and TS.  In a real use case, this part will be done by other
hosts.

        MSB                                   LSB
  ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
        |   Hash_1 | MSS | ECN | SACK |  WScale |

  TS:   | 31 ... 8 |          7 ... 0           |
        |   Random |           Hash_2           |

  WScale in SYN is reused in SYN+ACK.

The client returns ACK, and tc prog will recalculate ISN and TS
from ACK and validate SYN Cookie.

If it's valid, the prog calls kfunc to allocate a reqsk for skb and
configure the reqsk based on the argument created from SYN Cookie.

Later, the reqsk will be processed in cookie_v[46]_check() to create
a connection.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: tcp: Support arbitrary SYN Cookie.

This patch adds a new kfunc available at TC hook to support arbitrary
SYN Cookie.

The basic usage is as follows:

    struct bpf_tcp_req_attrs attrs = {
        .mss = mss,
        .wscale_ok = wscale_ok,
        .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
        .snd_wscale = snd_wscale, /* Client's WScale < 15 */
        .tstamp_ok = tstamp_ok,
        .rcv_tsval = tsval,
        .rcv_tsecr = tsecr, /* Server's Initial TSval */
        .usec_ts_ok = usec_ts_ok,
        .sack_ok = sack_ok,
        .ecn_ok = ecn_ok,
    }

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
    bpf_sk_release(skc);

bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
bpf_tcp_req_attrs and allocates reqsk and configures it.  Then,
bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.

The notable thing here is that we do not hold refcnt for both reqsk
and listener.  To differentiate that, we mark reqsk->syncookie, which
is only used in TX for now.  So, if reqsk->syncookie is 1 in RX, it
means that the reqsk is allocated by kfunc.

When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
and in that case, we set NULL to reqsk->rsk_listener before calling
reqsk_free() as reqsk does not hold a refcnt of the listener.

When the TCP stack looks up a socket from the skb, we steal the
listener from the reqsk in skb_steal_sock() and create a full sk
in cookie_v[46]_check().

The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.

Note that we can extend struct bpf_tcp_req_attrs in the future when
we add a new attribute that is determined in 3WHS.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().

We will support arbitrary SYN Cookie with BPF in the following
patch.

If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to cookie_[46]_check() as skb->sk. If skb->sk is not
NULL, we call cookie_bpf_check().

Then, we clear skb->sk and skb->destructor, which are needed not
to hold refcnt for reqsk and the listener. See the following patch
for details.

After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().

Note that the server side WScale is set only for non-BPF SYN Cookie.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().

We will support arbitrary SYN Cookie with BPF.

If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to TCP stack as skb->sk with req->syncookie 1. Also,
the reqsk has its listener as req->rsk_listener with no refcnt
taken.

When the TCP stack looks up a socket from the skb, we steal
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that
the skb will be processed in cookie_v[46]_check() with the
listener.

Note that we do not clear skb->sk and skb->destructor so that we
can carry the reqsk to cookie_v[46]_check().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Fix potential premature unload in bpf_testmod

It is possible for bpf_kfunc_call_test_release() to be called from
bpf_map_free_deferred() when bpf_testmod is already unloaded and
perf_test_stuct.cnt which it tries to decrease is no longer in memory.
This patch tries to fix the issue by waiting for all references to be
dropped in bpf_testmod_exit().

The issue can be triggered by running 'test_progs -t map_kptr' in 6.5,
but is obscured in 6.6 by d119357d07435 ("rcu-tasks: Treat only
synchronous grace periods urgently").

Fixes: 65eb006d85a2 ("bpf: Move kernel test kfuncs to bpf_testmod")
Signed-off-by: Artem Savkov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Cc: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

tcp: Move skb_steal_sock() to request_sock.h

We will support arbitrary SYN Cookie with BPF.

If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to TCP stack as skb->sk with req->syncookie 1.

In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie
to see if the reqsk is created by kfunc. However, inet_reqsk()
is not available in sock.h.

Let's move skb_steal_sock() to request_sock.h.

While at it, we refactor skb_steal_sock() so it returns early if
skb->sk is NULL to minimise the following patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpftool: Silence build warning about calloc()

There exists the following warning when building bpftool:

  CC      prog.o
prog.c: In function ‘profile_open_perf_events’:
prog.c:2301:24: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
2301 |                 sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
      |                        ^~~
prog.c:2301:24: note: earlier argument should specify number of elements, later size of each element

Tested with the latest upstream GCC which contains a new warning option
-Wcalloc-transposed-args. The first argument to calloc is documented to
be number of elements in array, while the second argument is size of each
element, just switch the first and second arguments of calloc() to silence
the build warning, compile tested only.

Fixes: 47c09d6a9f67 ("bpftool: Introduce "prog profile" command")
Signed-off-by: Tiezhu Yang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

tcp: Move tcp_ns_to_ts() to tcp.h

We will support arbitrary SYN Cookie with BPF.

When BPF prog validates ACK and kfunc allocates a reqsk, we need
to call tcp_ns_to_ts() to calculate an offset of TSval for later
use:

  time
  t0 : Send SYN+ACK
       -> tsval = Initial TSval (Random Number)

  t1 : Recv ACK of 3WHS
       -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
                = Initial TSval - t1

  t2 : Send ACK
       -> tsval = t2 + tsoff
                = Initial TSval + (t2 - t1)
                = Initial TSval + Time Delta (x)

  (x) Note that the time delta does not include the initial RTT
      from t0 to t1.

Let's move tcp_ns_to_ts() to tcp.h.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Minor improvements for bpf_cmp.

Few minor improvements for bpf_cmp() macro:
. reduce number of args in __bpf_cmp()
. rename NOFLIP to UNLIKELY
. add a comment about 64-bit truncation in "i" constraint
. use "ri" constraint for sizeof(rhs) <= 4
. improve error message for bpf_cmp_likely()

Before:
progs/iters_task_vma.c:31:7: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../bpf/bpf_experimental.h:325:3: note: expanded from macro 'bpf_cmp_likely'
  325 |                 ret;
      |                 ^~~
progs/iters_task_vma.c:31:7: note: variable 'ret' is declared here
../bpf/bpf_experimental.h:310:3: note: expanded from macro 'bpf_cmp_likely'
  310 |                 bool ret;
      |                 ^

After:
progs/iters_task_vma.c:31:7: error: invalid operand for instruction
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^
../bpf/bpf_experimental.h:324:17: note: expanded from macro 'bpf_cmp_likely'
  324 |                         asm volatile("r0 " #OP " invalid compare");
      |                                      ^
<inline asm>:1:5: note: instantiated into assembly here
    1 |         r0 <== invalid compare
      |            ^

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

docs/bpf: Fix an incorrect statement in verifier.rst

In verifier.rst, I found an incorrect statement (maybe a typo) in section
'Liveness marks tracking'. Basically, the wrong register is attributed
to have a read mark. This may confuse the user.

Signed-off-by: Yonghong Song <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add a selftest with not-8-byte aligned BPF_ST

Add a selftest with a 4 bytes BPF_ST of 0 where the store is not
8-byte aligned. The goal is to ensure that STACK_ZERO is properly
marked in stack slots and the STACK_ZERO value can propagate
properly during the load.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Track aligned st store as imprecise spilled registers

With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
  r1 = 0
  *(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.

For cpuv4, the initialization could be
  *(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().

Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.

The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.

I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.

For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File                                        Program              Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------------  -------------------  ---------  ---------  ---------------  ----------  ----------  -------------
local_storage_bench.bpf.linked3.o           get_local                  228        168    -60 (-26.32%)          17          14   -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o            on_event                  6066       4889  -1177 (-19.40%)         403         321  -82 (-20.35%)
test_cls_redirect.bpf.linked3.o             cls_redirect             35483      35387     -96 (-0.27%)        2179        2177    -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o            balancer_ingress          4494       4522     +28 (+0.62%)         217         219    +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o     balancer_ingress          1432       1455     +23 (+1.61%)          92          94    +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o             balancer_ingress_v6       3462       3458      -4 (-0.12%)         216         216    +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o  widening                    52         41    -11 (-21.15%)           4           3   -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_tc             12412      11719    -693 (-5.58%)         345         330   -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_xdp            12478      11794    -684 (-5.48%)         346         331   -15 (-4.34%)

test_l4lb_noinline and test_l4lb_noinline_dynptr has minor regression, but
pyperf600_bpf_loop and local_storage_bench gets pretty good improvement.

  [1] https://lore.kernel.org/all/20231205184248.1502704 [email protected]/
  [2] https://lore.kernel.org/all/20231205184248.1502704 [email protected]/

Cc: Kuniyuki Iwashima <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Tested-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Test assigning ID to scalars on spill

The previous commit implemented assigning IDs to registers holding
scalars before spill. Add the test cases to check the new functionality.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Assign ID to scalars on spill

Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.

Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.

One test is adjusted to reflect the change in register IDs.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Add the get_reg_width function

Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Add the assign_scalar_id_before_mov function

Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Add a test case for 32-bit spill tracking

When a range check is performed on a register that was 32-bit spilled to
the stack, the IDs of the two instances of the register are the same, so
the range should also be the same.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Make bpf_for_each_spilled_reg consider narrow spills

Adjust the check in bpf_get_spilled_reg to take into account spilled
registers narrower than 64 bits. That allows find_equal_scalars to
properly adjust the range of all spilled registers that have the same
ID. Before this change, it was possible for a register and a spilled
register to have the same IDs but different ranges if the spill was
narrower than 64 bits and a range check was performed on the register.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: check if imprecise stack spills confuse infinite loop detection

Verify that infinite loop detection logic separates states with
identical register states but different imprecise scalars spilled to
stack.

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: make infinite loop detection in is_state_visited() exact

Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
  for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
  states_equal() would compare scalars for equality only if in old
  state scalar has precision mark.

Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:

        r0 = ... unknown scalar ...
        r0 &= 0xff;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
    loop:
        r0 = *(u64 *)(r10 - 8);
        if r0 > 10 goto exit_;
        r0 += 1;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
        goto loop;

This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Fix the u64_offset_to_skb_data test

The u64_offset_to_skb_data test is supposed to make a 64-bit fill, but
instead makes a 16-bit one. Fix the test according to its intention and
update the comments accordingly (umax is no longer 0xffff). The 16-bit
fill is covered by u16_offset_to_skb_data.

Signed-off-by: Maxim Mikityanskiy <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Update LLVM Phabricator links

reviews.llvm.org was LLVM's Phabricator instances for code review. It
has been abandoned in favor of GitHub pull requests. While the majority
of links in the kernel sources still work because of the work Fangrui
has done turning the dynamic Phabricator instance into a static archive,
there are some issues with that work, so preemptively convert all the
links in the kernel sources to point to the commit on GitHub.

Most of the commits have the corresponding differential review link in
the commit message itself so there should not be any loss of fidelity in
the relevant information.

Additionally, fix a typo in the xdpwall.c print ("LLMV" -> "LLVM") while
in the area.

Link: https://discourse.llvm.org/t/update-on-github-pull-requests/71540/172
Acked-by: Yonghong Song <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Link: https://lore.kernel.org/r/20240111-bpf-update-llvm-phabricator-links-v2-1-9a7ae976bd64@kernel.org
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: detect testing prog flags support

Various tests specify extra testing prog_flags when loading BPF
programs, like BPF_F_TEST_RND_HI32, and more recently also
BPF_F_TEST_REG_INVARIANTS. While BPF_F_TEST_RND_HI32 is old enough to
not cause much problem on older kernels, BPF_F_TEST_REG_INVARIANTS is
very fresh and unconditionally specifying it causes selftests to fail on
even slightly outdated kernels.

This breaks libbpf CI test against 4.9 and 5.15 kernels, it can break
some local development (done outside of VM), etc.

To prevent this, and guard against similar problems in the future, do
runtime detection of supported "testing flags", and only provide those
that host kernel recognizes.

Acked-by: Song Liu <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Introduce concept of conformance groups

The discussion of what the actual conformance groups should be
is still in progress, so this is just part 1 which only uses
"legacy" for deprecated instructions and "basic" for everything
else. Subsequent patches will add more groups as discussion
continues.

Signed-off-by: Dave Thaler <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

net: filter: fix spelling mistakes

Fix spelling errors as reported by codespell.

Signed-off-by: Randy Dunlap <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: [email protected]
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: support multiple tags per argument

Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.

We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.

Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: prepare btf_prepare_func_args() for multiple tags per argument

Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.

Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: make sure scalar args don't accept __arg_nonnull tag

Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.

Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: fix test_loader check message

Seeing:

process_subtest:PASS:Can't alloc specs array 0 nsec

... in verbose successful test log is very confusing. Use smaller
identifier-like test tag to denote that we are asserting specs array
allocation success.

Now it's much less distracting:

process_subtest:PASS:specs_alloc 0 nsec

Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge branch 'bpf-inline-bpf_kptr_xchg'

Hou Tao says:

====================
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark [1]. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. However the performance
gain comes with costs: both the kasan and kcsan checks on the pointer
will be unavailable. Initially the inline is implemented in do_jit() for
x86-64 directly, but I think it will more portable to implement the
inline in verifier.

Patch #1 supports inlining bpf_kptr_xchg() helper and enables it on
x86-4. Patch #2 factors out a helper for newly-added test in patch #3.
Patch #3 tests whether the inlining of bpf_kptr_xchg() is expected.
Please see individual patches for more details. And comments are always
welcome.

Change Log:
v3:
  * rebased on bpf-next tree
  * patch 1 & 2: Add Rvb-by and Ack-by tags from Eduard
  * patch 3: use inline assembly and naked function instead of c code
             (suggested by Eduard)

v2: https://lore.kernel.org/bpf/20231223104042.1432300 [email protected]/
  * rebased on bpf-next tree
  * drop patch #1 in v1 due to discussion in [2]
  * patch #1: add the motivation in the commit message, merge patch #1
              and #3 into the new patch in v2. (Daniel)
  * patch #2/#3: newly-added patch to test the inlining of
                 bpf_kptr_xchg() (Eduard)

v1: https://lore.kernel.org/bpf/95b8c2cd-44d5-5fe1-60b5-7e8218779566@huaweicloud.com/

[1]: https://lore.kernel.org/bpf/20231221141501.3588586 [email protected]/
[2]: https://lore.kernel.org/bpf/fd94efb9-4a56-c982-dc2e-c66be5202cb7@huaweicloud.com/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Test the inlining of bpf_kptr_xchg()

The test uses bpf_prog_get_info_by_fd() to obtain the xlated
instructions of the program first. Since these instructions have
already been rewritten by the verifier, the tests then checks whether
the rewritten instructions are as expected. And to ensure LLVM generates
code exactly as expected, use inline assembly and a naked function.

Suggested-by: Eduard Zingerman <[email protected]>
Signed-off-by: Hou Tao <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

selftests/bpf: Factor out get_xlated_program() helper

Both test_verifier and test_progs use get_xlated_program(), so moving
the helper into testing_helpers.h to reuse it.

Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Support inlining bpf_kptr_xchg() helper

The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.

bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.

It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.

For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.

Reviewed-by: Eduard Zingerman <[email protected]>
Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops

We encountered a kernel crash triggered by the bpf_tcp_ca testcase as
show below:

Unable to handle kernel paging request at virtual address ff60000088554500
Oops [#1]
...
CPU: 3 PID: 458 Comm: test_progs Tainted: G OE 6.8.0-rc1-kselftest_plain #1
Hardware name: riscv-virtio,qemu (DT)
epc : 0xff60000088554500
ra : tcp_ack+0x288/0x1232
epc : ff60000088554500 ra : ffffffff80cc7166 sp : ff2000000117ba50
gp : ffffffff82587b60 tp : ff60000087be0040 t0 : ff60000088554500
t1 : ffffffff801ed24e t2 : 0000000000000000 s0 : ff2000000117bbc0
s1 : 0000000000000500 a0 : ff20000000691000 a1 : 0000000000000018
a2 : 0000000000000001 a3 : ff60000087be03a0 a4 : 0000000000000000
a5 : 0000000000000000 a6 : 0000000000000021 a7 : ffffffff8263f880
s2 : 000000004ac3c13b s3 : 000000004ac3c13a s4 : 0000000000008200
s5 : 0000000000000001 s6 : 0000000000000104 s7 : ff2000000117bb00
s8 : ff600000885544c0 s9 : 0000000000000000 s10: ff60000086ff0b80
s11: 000055557983a9c0 t3 : 0000000000000000 t4 : 000000000000ffc4
t5 : ffffffff8154f170 t6 : 0000000000000030
status: 0000000200000120 badaddr: ff60000088554500 cause: 000000000000000c
Code: c796 67d7 0000 0000 0052 0002 c13b 4ac3 0000 0000 (0001) 0000
---[ end trace 0000000000000000 ]---

The reason is that commit 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops
CFI") changes the func_addr of arch_prepare_bpf_trampoline in struct_ops
from NULL to non-NULL, while we use func_addr on RV64 to differentiate
between struct_ops and regular trampoline. When the struct_ops testcase
is triggered, it emits wrong prologue and epilogue, and lead to
unpredictable issues. After commit 2cd3e3772e41, we can use
BPF_TRAMP_F_INDIRECT to distinguish them as it always be set in
struct_ops.

Fixes: 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops CFI")
Signed-off-by: Pu Lehui <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Björn Töpel <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Kalle Valo says:

====================
wireless fixes for v6.8-rc2

The most visible fix here is the ath11k crash fix which was introduced
in v6.7. We also have a fix for iwlwifi memory corruption and few
smaller fixes in the stack.

* tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: fix race condition on enabling fast-xmit
  wifi: iwlwifi: fix a memory corruption
  wifi: mac80211: fix potential sta-link leak
  wifi: cfg80211/mac80211: remove dependency on non-existing option
  wifi: cfg80211: fix missing interfaces when dumping
  wifi: ath11k: rely on mac80211 debugfs handling for vif
  wifi: p54: fix GCC format truncation warning with wiphy->fw_version
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'netfs-fixes' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfs fixes from David Howells:

* 'netfs-fixes' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix missing/incorrect unlocking of RCU read lock
  afs: Remove afs_dynroot_d_revalidate() as it is redundant
  afs: Fix error handling with lookup via FS.InlineBulkStatus
  afs: Hide silly-rename files from userspace
  cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode
  netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()
  netfs, fscache: Prevent Oops in fscache_put_cache()
  cifs: Don't use certain unnecessary folio_*() functions
  afs: Don't use certain unnecessary folio_*() functions
  netfs: Don't use certain unnecessary folio_*() functions

Signed-off-by: Christian Brauner <[email protected]>

Merge branch 'inet_diag-remove-three-mutexes-in-diag-dumps'

Eric Dumazet says:

====================
inet_diag: remove three mutexes in diag dumps

Surprisingly, inet_diag operations are serialized over a stack
of three mutexes, giving legacy /proc based files an unfair
advantage on modern hosts.

This series removes all of them, making inet_diag operations
(eg iproute2/ss) fully parallel.

1-2) Two first patches are adding data-race annotations
     and can be backported to stable kernels.

3-4) inet_diag_table_mutex can be replaced with RCU protection,
     if we add corresponding protection against module unload.

5-7) sock_diag_table_mutex can be replaced with RCU protection,
     if we add corresponding protection against module unload.

8)  sock_diag_mutex is removed, as the old bug it was
     working around has been fixed more elegantly.

9)  inet_diag_dump_icsk() can skip over empty buckets to reduce
     spinlock contention.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

eventfs: Save directory inodes in the eventfs_inode structure

The eventfs inodes and directories are allocated when referenced. But this
leaves the issue of keeping consistent inode numbers and the number is
only saved in the inode structure itself. When the inode is no longer
referenced, it can be freed. When the file that the inode was representing
is referenced again, the inode is once again created, but the inode number
needs to be the same as it was before.

Just making the inode numbers the same for all files is fine, but that
does not work with directories. The find command will check for loops via
the inode number and having the same inode number for directories triggers:

# find /sys/kernel/tracing
find: File system loop detected;
'/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
'/sys/kernel/debug/tracing/events/initcall'.
[..]

Linus pointed out that the eventfs_inode structure ends with a single
32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
alignment. We can use this hole to store the inode number for the
eventfs_inode. All directories in eventfs are represented by an
eventfs_inode and that data structure can hold its inode number.

That last int was also purposely placed at the end of the structure to
prevent holes from within. Now that there's a 4 byte number to hold the
inode, both the inode number and the last integer can be moved up in the
structure for better cache locality, where the llist and rcu fields can be
moved to the end as they are only used when the eventfs_inode is being
deleted.

Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Linus Torvalds <[email protected]>
Reported-by: Geert Uytterhoeven <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>

inet_diag: skip over empty buckets

After the removal of inet_diag_table_mutex, sock_diag_table_mutex
and sock_diag_mutex, I was able so see spinlock contention from
inet_diag_dump_icsk() when running 100 parallel invocations.

It is time to skip over empty buckets.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

sock_diag: remove sock_diag_mutex

sock_diag_rcv() is still serializing its operations using
a mutex, for no good reason.

This came with commit 0a9c73014415 ("[INET_DIAG]: Fix oops
in netlink_rcv_skb"), but the root cause has been fixed
with commit cd40b7d3983c ("[NET]: make netlink user -> kernel
interface synchronious")

Remove this mutex to let multiple threads run concurrently.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

sock_diag: allow concurrent operation in sock_diag_rcv_msg()

TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag are serialized
on sock_diag_table_mutex.

This is to make sure inet_diag module is not unloaded
while diag was ongoing.

It is time to get rid of this mutex and use RCU protection,
allowing full parallelism.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

sock_diag: allow concurrent operations

sock_diag_broadcast_destroy_work() and __sock_diag_cmd()
are currently using sock_diag_table_mutex to protect
against concurrent sock_diag_handlers[] changes.

This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.

It is time to switch to full RCU protection.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

sock_diag: add module pointer to "struct sock_diag_handler"

Following patch is going to use RCU instead of
sock_diag_table_mutex acquisition.

This patch is a preparation, no change of behavior yet.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

inet_diag: allow concurrent operations

inet_diag_lock_handler() current implementation uses a mutex
to protect inet_diag_table[] array against concurrent changes.

This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.

It is time to switch to full RCU protection.

As a bonus, if a target is statically linked instead of being
modular, inet_diag_lock_handler() & inet_diag_unlock_handler()
reduce to reads only.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

inet_diag: add module pointer to "struct inet_diag_handler"

Following patch is going to use RCU instead of
inet_diag_table_mutex acquisition.

This patch is a preparation, no change of behavior yet.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

inet_diag: annotate data-races around inet_diag_table[]

inet_diag_lock_handler() reads inet_diag_table[proto] locklessly.

Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues.

Fixes: d523a328fb02 ("[INET]: Fix inet_diag dead-lock regression")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

sock_diag: annotate data-races around sock_diag_handlers[family]

__sock_diag_cmd() and sock_diag_bind() read sock_diag_handlers[family]
without a lock held.

Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues.

Fixes: 8ef874bfc729 ("sock_diag: Move the sock_ code to net/core/")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

ipv6: init the accept_queue's spinlocks in inet6_create

In commit 198bc90e0e73("tcp: make sure init the accept_queue's spinlocks
once"), the spinlocks of accept_queue are initialized only when socket is
created in the inet4 scenario. The locks are not initialized when socket
is created in the inet6 scenario. The kernel reports the following error:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:107)
register_lock_class (kernel/locking/lockdep.c:1289)
__lock_acquire (kernel/locking/lockdep.c:5015)
lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
tcp_disconnect (net/ipv4/tcp.c:2981)
inet_shutdown (net/ipv4/af_inet.c:935)
__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
__x64_sys_shutdown (net/socket.c:2445)
do_syscall_64 (arch/x86/entry/common.c:52)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
RIP: 0033:0x7f52ecd05a3d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0

Fixes: 198bc90e0e73 ("tcp: make sure init the accept_queue's spinlocks once")
Signed-off-by: Zhengchao Shao <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

wifi: iwlegacy: Use kcalloc() instead of kzalloc()

As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.

Also, it is preferred to use sizeof(*pointer) instead of sizeof(type)
due to the type of the variable can change and one needs not change the
former (unlike the latter).

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://github.com/KSPP/linux/issues/162
Signed-off-by: Erick Archer <[email protected]>
Reviewed-by: Gustavo A. R. Silva <[email protected]>
Acked-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: fix disabling concurrent mode TX hang issue

When disabling concurrent mode and switching to a single interface, the
TX might stuck. The reason is TBTT prohibit area circuit still enable
to block TX. To disable tbtt prohibit area circuit need to delay 2ms to
make it effective. However, we only delay 2us in original code. So we
fix it.

Signed-off-by: Chih-Kang Chang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: fix HW scan timeout due to TSF sync issue

When STA connects to an AP and doesn't receive any beacon yet, the
hardware scan is triggered. This scan begins with the default TSF
value. Once STA receives a beacon when switches back to the operating
channel, its TSF synchronizes with the AP. However, if there is a
significant difference in TSF values between the default value and
the synchronized value, it will cause firmware fail to trigger
interrupt, and the C2H won't be sent out. As a result, the scan
continues until a timeout occurs. To fix this issue, we disable TSF
synchronization during scanning to prevent drastic TSF changes, and
enable TSF synchronization after scan.

Signed-off-by: Chih-Kang Chang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: add wait/completion for abort scan

When aborting scan, wait until FW is done to keep both states aligned.
This prevents driver modifying channel then gets overwritten by FW.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: fix null pointer access when abort scan

During cancel scan we might use vif that weren't scanning.
Fix this by using the actual scanning vif.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: disable RTS when broadcast/multicast

RTS switch should not be enabled for broadcast and multicast. This
could cause incorrect behavior during AP mode, so we fix it.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: Set default CQM config if not present

When wpa_supplicant is initiated by users and not by NetworkManager,
the CQM configuration might not be set. Without this setting, ICs
with connection monitor handled by driver won't detect connection
loss. To fix this we prepare a default setting upon associated at
first, then update again if any is given later.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: refine hardware scan C2H events

Define struct for scan offload C2H events and update each elements'
bitfield. This patch does not change original behavior, just style
conversion and naming changes.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: refine add_chan H2C command to encode_bits

Use struct filling style instead of pointer casting.
This does not change the original behavior.

Signed-off-by: Po-Hao Huang <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: 8922a: add BTG functions to assist BT coexistence to control TX/RX

These functions are to control baseband AGC while BT coexists with WiFi.
Among these functions, ctrl_btg_bt_rx is used to control AGC related
settings, which is affected by BT RX, while BT shares the same path with
WiFi; ctrl_nbtg_bt_tx is used to control AGC settings under non-shared
path condition, which is affected by BT TX.

Signed-off-by: Chung-Hsuan Hung <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: 8922a: add TX power related ops

The ::power_trim is to write bias value programmed in efuse to normalize
TX power, and then using ::set_txpwr_ctrl to set reference TX power value.
The ::set_txpwr is to set final TX power according to regulation of current
country.

Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: 8922a: add register definitions of H2C, C2H, page, RRSR and EDCCA

Firmware H2C commands and C2H events can go via registers, so define them
accordingly. The page registers are to arrange local buffer of WiFi chip.
RRSR is to define rate selection to transmit BA or ACK. EDCCA is to set
threshold of engine detection mechanism by BB hardware.

Like other chips, define these registers and we can share the same flow.

Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: 8922a: add chip_ops related to BB init

The chip_ops::bb_preinit and ::bb_postinit are called before and after
loading BB parameters from tables of firmware file. The ::bb_reset is
used to reset hardware state, and currently it is not needed by 8922AE so
leave it as empty. The ::bb_sethw is to implement conditional parameters.

Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: 8922a: add chip_ops::{enable,disable}_bb_rf

When we are going to up interface to make connection, turn on BB and RF
hardware power by enable_bb_rf ops. Oppositely, using disable_bb_rf to
turn them off.

Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

wifi: rtw89: add mlo_dbcc_mode for WiFi 7 chips

WiFi 7 chips can operate in various MLO applications, such as 1 link (2SS)
and 2 links (1SS + 1SS), and we should configure different PHY mode for
each of them.

For example,
- MLO_2_PLUS_0_1RF is 1 link with 2SS rate, and enable one RF component.
- MLO_1_PLUS_1_1RF is 2 links with 1SS rate for each, and enable one RF
component that can support two paths.

By default, we set the mode to legacy MLO_DBCC_NOT_SUPPORT (don't support
MLO and DBCC yet), and later we will introduce logic to change the mode.

Signed-off-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://msgid.link/[email protected]

ovl: mark xwhiteouts directory with overlay.opaque='x'

An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').

This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.

This also prevents unnecessary checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.

Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.

Fixes: bc8df7a3dc03 ("ovl: Add an alternative type of whiteout")
Cc: <[email protected]> # v6.7
Reviewed-by: Alexander Larsson <[email protected]>
Tested-by: Alexander Larsson <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>

netlink: fix potential sleeping issue in mqueue_flush_file

I analyze the potential sleeping issue of the following processes:
Thread A                                Thread B
...                                     netlink_create  //ref = 1
do_mq_notify                            ...
  sock = netlink_getsockbyfilp          ...     //ref = 2
  info->notify_sock = sock;             ...
...                                     netlink_sendmsg
...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
...                                       netlink_unicast
...                                         sk = netlink_getsockbyportid //ref = 3
...                                         netlink_sendskb
...                                           __netlink_sendskb
...                                             skb_queue_tail //put skb to sk_receive_queue
...                                         sock_put //ref = 2
...                                     ...
...                                     netlink_release
...                                       deferred_put_nlk_sk //ref = 1
mqueue_flush_file
  spin_lock
  remove_notification
    netlink_sendskb
      sock_put  //ref = 0
        sk_free
          ...
          __sk_destruct
            netlink_sock_destruct
              skb_queue_purge  //get skb from sk_receive_queue
                ...
                __skb_queue_purge_reason
                  kfree_skb_reason
                    __kfree_skb
                    ...
                    skb_release_all
                      skb_release_head_state
                        netlink_skb_destructor
                          vfree(skb->head)  //sleeping while holding spinlock

In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
vmalloc, and is put to sk_receive_queue queue, also the skb is not freed.
When the mqueue executes flush, the sleeping bug will occur. Use
vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.

Fixes: c05cdb1b864f ("netlink: allow large data transfers from user-space")
Signed-off-by: Zhengchao Shao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

selftest: Don't reuse port for SO_INCOMING_CPU test.

Jakub reported that ASSERT_EQ(cpu, i) in so_incoming_cpu.c seems to
fire somewhat randomly.

  # #  RUN           so_incoming_cpu.before_reuseport.test3 ...
  # # so_incoming_cpu.c:191:test3:Expected cpu (32) == i (0)
  # # test3: Test terminated by assertion
  # #          FAIL  so_incoming_cpu.before_reuseport.test3
  # not ok 3 so_incoming_cpu.before_reuseport.test3

When the test failed, not-yet-accepted CLOSE_WAIT sockets received
SYN with a "challenging" SEQ number, which was sent from an unexpected
CPU that did not create the receiver.

The test basically does:

  1. for each cpu:
    1-1. create a server
    1-2. set SO_INCOMING_CPU

  2. for each cpu:
    2-1. set cpu affinity
    2-2. create some clients
    2-3. let clients connect() to the server on the same cpu
    2-4. close() clients

  3. for each server:
    3-1. accept() all child sockets
    3-2. check if all children have the same SO_INCOMING_CPU with the server

The root cause was the close() in 2-4. and net.ipv4.tcp_tw_reuse.

In a loop of 2., close() changed the client state to FIN_WAIT_2, and
the peer transitioned to CLOSE_WAIT.

In another loop of 2., connect() happened to select the same port of
the FIN_WAIT_2 socket, and it was reused as the default value of
net.ipv4.tcp_tw_reuse is 2.

As a result, the new client sent SYN to the CLOSE_WAIT socket from
a different CPU, and the receiver's sk_incoming_cpu was overwritten
with unexpected CPU ID.

Also, the SYN had a different SEQ number, so the CLOSE_WAIT socket
responded with Challenge ACK.  The new client properly returned RST
and effectively killed the CLOSE_WAIT socket.

This way, all clients were created successfully, but the error was
detected later by 3-2., ASSERT_EQ(cpu, i).

To avoid the failure, let's make sure that (i) the number of clients
is less than the number of available ports and (ii) such reuse never
happens.

Fixes: 6df96146b202 ("selftest: Add test for SO_INCOMING_CPU.")
Reported-by: Jakub Kicinski <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Tested-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

tcp: Add memory barrier to tcp_push()

On CPUs with weak memory models, reads and updates performed by tcp_push
to the sk variables can get reordered leaving the socket throttled when
it should not. The tasklet running tcp_wfree() may also not observe the
memory updates in time and will skip flushing any packets throttled by
tcp_push(), delaying the sending. This can pathologically cause 40ms
extra latency due to bad interactions with delayed acks.

Adding a memory barrier in tcp_push removes the bug, similarly to the
previous commit bf06200e732d ("tcp: tsq: fix nonagle handling").
smp_mb__after_atomic() is used to not incur in unnecessary overhead
on x86 since not affected.

Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
22.04 and Apache Tomcat 9.0.83 running the basic servlet below:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}

Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.

No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
50.000%    0.91ms
75.000%    1.13ms
90.000%    1.46ms
99.000%    1.74ms
99.900%    1.89ms
99.990%   41.95ms  <<< 40+ ms extra latency
99.999%   48.32ms
100.000%   48.96ms

With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
50.000%    0.90ms
75.000%    1.13ms
90.000%    1.45ms
99.000%    1.72ms
99.900%    1.83ms
99.990%    2.11ms  <<< no 40+ ms extra latency
99.999%    2.53ms
100.000%    2.62ms

Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.

Fixes: 7aa5470c2c09 ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
Signed-off-by: Salvatore Dipietro <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

fbdev: stifb: Fix crash in stifb_blank()

Avoid a kernel crash in stifb by providing the correct pointer to the fb_info
struct. Prior to commit e2e0b838a184 ("video/sticore: Remove info field from
STI struct") the fb_info struct was at the beginning of the fb struct.

Fixes: e2e0b838a184 ("video/sticore: Remove info field from STI struct")
Signed-off-by: Helge Deller <[email protected]>
Cc: Thomas Zimmermann <[email protected]>

drm/ttm: fix ttm pool initialization for no-dma-device drivers

The QXL driver doesn't use any device for DMA mappings or allocations so
dev_to_node() will panic inside ttm_device_init() on NUMA systems:

  general protection fault, probably for non-canonical address 0xdffffc000000007a: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x00000000000003d0-0x00000000000003d7]
  CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.7.0+ #9
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  RIP: 0010:ttm_device_init+0x10e/0x340
  Call Trace:
    qxl_ttm_init+0xaa/0x310
    qxl_device_init+0x1071/0x2000
    qxl_pci_probe+0x167/0x3f0
    local_pci_probe+0xe1/0x1b0
    pci_device_probe+0x29d/0x790
    really_probe+0x251/0x910
    __driver_probe_device+0x1ea/0x390
    driver_probe_device+0x4e/0x2e0
    __driver_attach+0x1e3/0x600
    bus_for_each_dev+0x12d/0x1c0
    bus_add_driver+0x25a/0x590
    driver_register+0x15c/0x4b0
    qxl_pci_driver_init+0x67/0x80
    do_one_initcall+0xf5/0x5d0
    kernel_init_freeable+0x637/0xb10
    kernel_init+0x1c/0x2e0
    ret_from_fork+0x48/0x80
    ret_from_fork_asm+0x1b/0x30
  RIP: 0010:ttm_device_init+0x10e/0x340

Fall back to NUMA_NO_NODE if there is no device for DMA.

Found by Linux Verification Center (linuxtesting.org).

Fixes: b0a7ce53d494 ("drm/ttm: Schedule delayed_delete worker closer")
Signed-off-by: Fedor Pchelkin <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reported-by: Steven Rostedt <[email protected]>
Cc: Rajneesh Bhardwaj <[email protected]>
Cc: Felix Kuehling <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Revert "btrfs: zstd: fix and simplify the inline extent decompression"

This reverts commit 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b.

It causes my machine to not even boot, and Klara Modin reports that the
cause is that small zstd-compressed files return garbage when read.

Reported-by: Klara Modin <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/
Reported-and-bisected-by: Linus Torvalds <[email protected]>
Acked-by: David Sterba <[email protected]>
Cc: Qu Wenruo <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

afs: Fix missing/incorrect unlocking of RCU read lock

In afs_proc_addr_prefs_show(), we need to unlock the RCU read lock in both
places before returning (and not lock it again).

Fixes: f94f70d39cc2 ("afs: Provide a way to configure address priorities")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: David Howells <[email protected]>
cc: [email protected]
cc: [email protected]

afs: Remove afs_dynroot_d_revalidate() as it is redundant

Remove afs_dynroot_d_revalidate() as it is redundant as all it does is
return 1 and the caller assumes that if the op is not given.

Suggested-by: Alexander Viro <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
cc: [email protected]

afs: Fix error handling with lookup via FS.InlineBulkStatus

When afs does a lookup, it tries to use FS.InlineBulkStatus to preemptively
look up a bunch of files in the parent directory and cache this locally, on
the basis that we might want to look at them too (for example if someone
does an ls on a directory, they may want want to then stat every file
listed).

FS.InlineBulkStatus can be considered a compound op with the normal abort
code applying to the compound as a whole.  Each status fetch within the
compound is then given its own individual abort code - but assuming no
error that prevents the bulk fetch from returning the compound result will
be 0, even if all the constituent status fetches failed.

At the conclusion of afs_do_lookup(), we should use the abort code from the
appropriate status to determine the error to return, if any - but instead
it is assumed that we were successful if the op as a whole succeeded and we
return an incompletely initialised inode, resulting in ENOENT, no matter
the actual reason.  In the particular instance reported, a vnode with no
permission granted to be accessed is being given a UAEACCES abort code
which should be reported as EACCES, but is instead being reported as
ENOENT.

Fix this by abandoning the inode (which will be cleaned up with the op) if
file[1] has an abort code indicated and turn that abort code into an error
instead.

Whilst we're at it, add a tracepoint so that the abort codes of the
individual subrequests of FS.InlineBulkStatus can be logged.  At the moment
only the container abort code can be 0.

Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Reported-by: Jeffrey Altman <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Marc Dionne <[email protected]>
cc: [email protected]

afs: Hide silly-rename files from userspace

There appears to be a race between silly-rename files being created/removed
and various userspace tools iterating over the contents of a directory,
leading to such errors as:

find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
tar: ./include/linux/greybus/.__afs3C95: File removed before we read it

when building a kernel.

Fix afs_readdir() so that it doesn't return .__afsXXXX silly-rename files
to userspace. This doesn't stop them being looked up directly by name as
we need to be able to look them up from within the kernel as part of the
silly-rename algorithm.

Fixes: 79ddbfa500b3 ("afs: Implement sillyrename for unlink and rename")
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]