Kui-Feng Lee [Fri, 19 Jan 2024 22:49:53 +0000 (14:49 -0800)]
bpf: get type information with BTF_ID_LIST
Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information while building
instead of runtime.
Kui-Feng Lee [Fri, 19 Jan 2024 22:49:52 +0000 (14:49 -0800)]
bpf: refactory struct_ops type initialization to a function.
Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.
====================
bpf: Add cookies retrieval for perf/kprobe multi links
hi,
this patchset adds support to retrieve cookies from existing tracing
links that still did not support it plus changes to bpftool to display
them. It's leftover we discussed some time ago [1].
Cookie is attached to specific address, and because we sort addresses
before printing, we need to sort cookies the same way, hence adding
the struct addr_cookie to keep and sort them together.
Also adding missing dd.sym_count check to show_kprobe_multi_json.
Jiri Olsa [Fri, 19 Jan 2024 11:04:59 +0000 (12:04 +0100)]
bpf: Store cookies in kprobe_multi bpf_link_info data
Storing cookies in kprobe_multi bpf_link_info data. The cookies
field is optional and if provided it needs to be an array of
__u64 with kprobe_multi.count length.
The "p" constraint is a tricky one. It is documented in the GCC manual
section "Simple Constraints":
An operand that is a valid memory address is allowed. This is for
``load address'' and ``push address'' instructions.
p in the constraint must be accompanied by address_operand as the
predicate in the match_operand. This predicate interprets the mode
specified in the match_operand as the mode of the memory reference for
which the address would be valid.
There are two problems:
1. It is questionable whether that constraint was ever intended to be
used in inline assembly templates, because its behavior really
depends on compiler internals. A "memory address" is not the same
than a "memory operand" or a "memory reference" (constraint "m"), and
in fact its usage in the template above results in an error in both
x86_64-linux-gnu and bpf-unkonwn-none:
foo.c: In function ‘bar’:
foo.c:6:3: error: invalid 'asm': invalid expression as operand
6 | asm volatile ("r1 = %[jorl]" : : [jorl]"p"(&jorl));
| ^~~
I would assume the same happens with aarch64, riscv, and most/all
other targets in GCC, that do not accept operands of the form A + B
that are not wrapped either in a const or in a memory reference.
To avoid that error, the usage of the "p" constraint in internal GCC
instruction templates is supposed to be complemented by the 'a'
modifier, like in:
%aN means expect operand N to be a memory address
(not a memory reference!) and print a reference
to that address.
That works because when the modifier 'a' is found, GCC prints an
"operand address", which is not the same than an "operand".
But...
2. Even if we used the internal 'a' modifier (we shouldn't) the 'rN =
rM' instruction really requires a register argument. In cases
involving automatics, like in the examples above, we easily end with:
bar:
#APP
r1 = r10-4
#NO_APP
In other cases we could conceibly also end with a 64-bit label that
may overflow the 32-bit immediate operand of `rN = imm32'
instructions:
r1 = foo
All of which is clearly wrong.
clang happens to do "the right thing" in the current usage of __imm_ptr
in the BPF tests, because even with -O2 it seems to "reload" the
fp-relative address of the automatic to a register like in:
bar:
r1 = r10
r1 += -4
#APP
r1 = r1
#NO_APP
Which is what GCC would generate with -O0. Whether this is by chance
or by design, the compiler shouln't be expected to do that reload
driven by the "p" constraint.
This patch changes the usage of the "p" constraint in the BPF
selftests macros to use the "r" constraint instead. If a register is
what is required, we should let the compiler know.
The eth_buffer, iph_buffer_tcp and other automatics are fixed size
only if the compiler optimizes away the constant global variables.
clang does this, but GCC does not, turning these automatics into
variable length arrays.
This patch removes the global variables and turns these values into
preprocessor constants. This makes the selftest to build properly
with GCC.
Andrii Nakryiko [Fri, 19 Jan 2024 21:02:01 +0000 (13:02 -0800)]
libbpf: call dup2() syscall directly
We've ran into issues with using dup2() API in production setting, where
libbpf is linked into large production environment and ends up calling
unintended custom implementations of dup2(). These custom implementations
don't provide atomic FD replacement guarantees of dup2() syscall,
leading to subtle and hard to debug issues.
To prevent this in the future and guarantee that no libc implementation
will do their own custom non-atomic dup2() implementation, call dup2()
syscall directly with syscall(SYS_dup2).
Note that some architectures don't seem to provide dup2 and have dup3
instead. Try to detect and pick best syscall.
Hou Tao [Fri, 19 Jan 2024 10:25:28 +0000 (18:25 +0800)]
bpf, arm64: Enable the inline of bpf_kptr_xchg()
ARM64 bpf jit satisfies the following two conditions:
1) support BPF_XCHG() on pointer-sized word.
2) the implementation of xchg is the same as atomic_xchg() on
pointer-sized words. Both of these two functions use arch_xchg() to
implement the exchange.
So enable the inline of bpf_kptr_xchg() for arm64 bpf jit.
Dave Thaler [Thu, 18 Jan 2024 23:29:54 +0000 (15:29 -0800)]
bpf, docs: Clarify that MOVSX is only for BPF_X not BPF_K
Per discussion on the mailing list at
https://mailarchive.ietf.org/arch/msg/bpf/uQiqhURdtxV_ZQOTgjCdm-seh74/
the MOVSX operation is only defined to support register extension.
The document didn't previously state this and incorrectly implied
that one could use an immediate value.
bpf: Define struct bpf_tcp_req_attrs when CONFIG_SYN_COOKIES=n.
kernel test robot reported the warning below:
>> net/core/filter.c:11842:13: warning: declaration of 'struct bpf_tcp_req_attrs' will not be visible outside of this function [-Wvisibility]
11842 | struct bpf_tcp_req_attrs *attrs, int attrs__sz)
| ^
1 warning generated.
struct bpf_tcp_req_attrs is defined under CONFIG_SYN_COOKIES
but used in kfunc without the config.
Let's move struct bpf_tcp_req_attrs definition outside of
CONFIG_SYN_COOKIES guard.
Hao Sun [Wed, 17 Jan 2024 09:40:12 +0000 (10:40 +0100)]
bpf: Refactor ptr alu checking rules to allow alu explicitly
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restruture it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:
The refactored rules are equivalent to the original one. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not reject here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passing to calls are rejected check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs earily.
Andrey Grafin [Wed, 17 Jan 2024 13:06:19 +0000 (16:06 +0300)]
selftest/bpf: Add map_in_maps with BPF_MAP_TYPE_PERF_EVENT_ARRAY values
Check that bpf_object__load() successfully creates map_in_maps
with BPF_MAP_TYPE_PERF_EVENT_ARRAY values.
These changes cover fix in the previous patch
"libbpf: Apply map_set_def_max_entries() for inner_maps on creation".
A command line output is:
- w/o fix
$ sudo ./test_maps
libbpf: map 'mim_array_pe': failed to create inner map: -22
libbpf: map 'mim_array_pe': failed to create: Invalid argument(-22)
libbpf: failed to load object './test_map_in_map.bpf.o'
Failed to load test prog
Andrey Grafin [Wed, 17 Jan 2024 13:06:18 +0000 (16:06 +0300)]
libbpf: Apply map_set_def_max_entries() for inner_maps on creation
This patch allows to auto create BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS with values of BPF_MAP_TYPE_PERF_EVENT_ARRAY
by bpf_object__load().
Previous behaviour created a zero filled btf_map_def for inner maps and
tried to use it for a map creation but the linux kernel forbids to create
a BPF_MAP_TYPE_PERF_EVENT_ARRAY map with max_entries=0.
Daniel Borkmann [Wed, 17 Jan 2024 09:16:11 +0000 (10:16 +0100)]
bpf: Sync uapi bpf.h header for the tooling infra
Both commit 91051f003948 ("tcp: Dump bound-only sockets in inet_diag.")
and commit 985b8ea9ec7e ("bpf, docs: Fix bpf_redirect_peer header doc")
missed the tooling header sync. Fix it.
Martin KaFai Lau [Tue, 16 Jan 2024 22:42:40 +0000 (14:42 -0800)]
Merge branch 'bpf: tcp: Support arbitrary SYN Cookie at TC.'
Kuniyuki Iwashima says:
====================
Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
for the connection request until a valid ACK is responded to the SYN+ACK.
The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so only can it be validated by the generator. It means SYN
Cookie consumes network resources between the client and the server;
intermediate nodes must remember which nodes to route ACK for the cookie.
SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network. After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS. However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection. If a proxy
node goes down, all the connections through it are terminated. Keeping
a state at proxy is painful from that perspective.
At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
Our SYN Proxy consists of the front proxy layer and the backend kernel
module. (See slides of LPC2023 [0], p37 - p48)
The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) (i) shared by all the proxy
nodes so that any node can validate ACK and (ii) updated periodically so
that old cookies cannot be validated and we need not encode a timestamp for
the cookie. Also, ISN contains WScale, SACK, and ECN, not in TS val. This
is not to sacrifice any connection quality, where some customers turn off
TCP timestamps option due to retro CVE.
After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forward
the TCP-in-TCP packet to the backend server. Our kernel module works at
Netfilter input/output hooks and first feeds SYN to the TCP stack to
initiate 3WHS. When the module is triggered for SYN+ACK, it looks up the
corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
proxy's cookie. Then, the module can complete 3WHS with the original ACK
as is.
This way, our SYN Proxy does not manage the ISN mappings nor wait for
SYN+ACK from the backend thus can remain stateless. It's working very
well for high-bandwidth services like multiple Tbps, but we are looking
for a way to drop the dirty hack and further optimise the sequences.
If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would need not restore SYN nor pass it. After validating
ACK, the proxy node just needs to forward it, and then the server can do
the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
and create a connection from the ACK.
This series allows us to create a full sk from an arbitrary SYN Cookie,
which is done in 3 steps.
1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
it based on the argument populated from SYN Cookie. The reqsk has
its listener as req->rsk_listener and is passed to the TCP stack as
skb->sk.
2) During TCP socket lookup for the skb, skb_steal_sock() returns a
listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
belongs to.
3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
a full sk is created.
Changes:
v8
* Rebase on Yonghong's cpuv4 fix
* Patch 5
* Fill the trailing 3-bytes padding in struct bpf_tcp_req_attrs
and test it as null
* Patch 6
* Remove unused IPPROTP_MPTCP definition
v6: https://lore.kernel.org/bpf/20231214155424[email protected]/
* Patch 5 & 6
* /struct /s/tcp_cookie_attributes/bpf_tcp_req_attrs/
* Don't reuse struct tcp_options_received and use u8 for each attrs
* Patch 6
* Check retval of test__start_subtest()
v5: https://lore.kernel.org/netdev/20231211073650[email protected]/
* Split patch 1-3
* Patch 3
* Clear req->rsk_listener in skb_steal_sock()
* Patch 4 & 5
* Move sysctl validation and tsoff init from cookie_bpf_check() to kfunc
* Patch 5
* Do not increment LINUX_MIB_SYNCOOKIES(RECV|FAILED)
* Patch 6
* Remove __always_inline
* Test if tcp_handle_{syn,ack}() is executed
* Move some definition to bpf_tracing_net.h
* s/BPF_F_CURRENT_NETNS/-1/
v4: https://lore.kernel.org/bpf/20231205013420[email protected]/
* Patch 1 & 2
* s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
* Patch 1
* Don't set rcv_wscale for BPF SYN Cookie case.
* Patch 2
* Add test for tcp_opt.{unused,rcv_wscale} in kfunc
* Modify skb_steal_sock() to avoid resetting skb-sk
* Support SO_REUSEPORT lookup
* Patch 3
* Add CONFIG_SYN_COOKIES to Kconfig for CI
* Define BPF_F_CURRENT_NETNS
v3: https://lore.kernel.org/netdev/20231121184245[email protected]/
* Guard kfunc and req->syncookie part in inet6?_steal_sock() with
CONFIG_SYN_COOKIE
v2: https://lore.kernel.org/netdev/20231120222341[email protected]/
* Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
* Add cleanup patches to reduce discrepancy between cookie_v[46]_check()
bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
bpf_tcp_req_attrs and allocates reqsk and configures it. Then,
bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
The notable thing here is that we do not hold refcnt for both reqsk
and listener. To differentiate that, we mark reqsk->syncookie, which
is only used in TX for now. So, if reqsk->syncookie is 1 in RX, it
means that the reqsk is allocated by kfunc.
When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
and in that case, we set NULL to reqsk->rsk_listener before calling
reqsk_free() as reqsk does not hold a refcnt of the listener.
When the TCP stack looks up a socket from the skb, we steal the
listener from the reqsk in skb_steal_sock() and create a full sk
in cookie_v[46]_check().
The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.
Note that we can extend struct bpf_tcp_req_attrs in the future when
we add a new attribute that is determined in 3WHS.
bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
We will support arbitrary SYN Cookie with BPF in the following
patch.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to cookie_[46]_check() as skb->sk. If skb->sk is not
NULL, we call cookie_bpf_check().
Then, we clear skb->sk and skb->destructor, which are needed not
to hold refcnt for reqsk and the listener. See the following patch
for details.
After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().
Note that the server side WScale is set only for non-BPF SYN Cookie.
bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().
We will support arbitrary SYN Cookie with BPF.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to TCP stack as skb->sk with req->syncookie 1. Also,
the reqsk has its listener as req->rsk_listener with no refcnt
taken.
When the TCP stack looks up a socket from the skb, we steal
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that
the skb will be processed in cookie_v[46]_check() with the
listener.
Note that we do not clear skb->sk and skb->destructor so that we
can carry the reqsk to cookie_v[46]_check().
Artem Savkov [Wed, 10 Jan 2024 08:57:37 +0000 (09:57 +0100)]
selftests/bpf: Fix potential premature unload in bpf_testmod
It is possible for bpf_kfunc_call_test_release() to be called from
bpf_map_free_deferred() when bpf_testmod is already unloaded and
perf_test_stuct.cnt which it tries to decrease is no longer in memory.
This patch tries to fix the issue by waiting for all references to be
dropped in bpf_testmod_exit().
The issue can be triggered by running 'test_progs -t map_kptr' in 6.5,
but is obscured in 6.6 by d119357d07435 ("rcu-tasks: Treat only
synchronous grace periods urgently").
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to TCP stack as skb->sk with req->syncookie 1.
In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie
to see if the reqsk is created by kfunc. However, inet_reqsk()
is not available in sock.h.
Let's move skb_steal_sock() to request_sock.h.
While at it, we refactor skb_steal_sock() so it returns early if
skb->sk is NULL to minimise the following patch.
Tiezhu Yang [Tue, 16 Jan 2024 06:19:20 +0000 (14:19 +0800)]
bpftool: Silence build warning about calloc()
There exists the following warning when building bpftool:
CC prog.o
prog.c: In function ‘profile_open_perf_events’:
prog.c:2301:24: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
2301 | sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
| ^~~
prog.c:2301:24: note: earlier argument should specify number of elements, later size of each element
Tested with the latest upstream GCC which contains a new warning option
-Wcalloc-transposed-args. The first argument to calloc is documented to
be number of elements in array, while the second argument is size of each
element, just switch the first and second arguments of calloc() to silence
the build warning, compile tested only.
Few minor improvements for bpf_cmp() macro:
. reduce number of args in __bpf_cmp()
. rename NOFLIP to UNLIKELY
. add a comment about 64-bit truncation in "i" constraint
. use "ri" constraint for sizeof(rhs) <= 4
. improve error message for bpf_cmp_likely()
Before:
progs/iters_task_vma.c:31:7: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
31 | if (bpf_cmp_likely(seen, <==, 1000))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../bpf/bpf_experimental.h:325:3: note: expanded from macro 'bpf_cmp_likely'
325 | ret;
| ^~~
progs/iters_task_vma.c:31:7: note: variable 'ret' is declared here
../bpf/bpf_experimental.h:310:3: note: expanded from macro 'bpf_cmp_likely'
310 | bool ret;
| ^
After:
progs/iters_task_vma.c:31:7: error: invalid operand for instruction
31 | if (bpf_cmp_likely(seen, <==, 1000))
| ^
../bpf/bpf_experimental.h:324:17: note: expanded from macro 'bpf_cmp_likely'
324 | asm volatile("r0 " #OP " invalid compare");
| ^
<inline asm>:1:5: note: instantiated into assembly here
1 | r0 <== invalid compare
| ^
Yonghong Song [Thu, 11 Jan 2024 05:21:36 +0000 (21:21 -0800)]
docs/bpf: Fix an incorrect statement in verifier.rst
In verifier.rst, I found an incorrect statement (maybe a typo) in section
'Liveness marks tracking'. Basically, the wrong register is attributed
to have a read mark. This may confuse the user.
Yonghong Song [Wed, 10 Jan 2024 05:13:55 +0000 (21:13 -0800)]
selftests/bpf: Add a selftest with not-8-byte aligned BPF_ST
Add a selftest with a 4 bytes BPF_ST of 0 where the store is not
8-byte aligned. The goal is to ensure that STACK_ZERO is properly
marked in stack slots and the STACK_ZERO value can propagate
properly during the load.
Yonghong Song [Wed, 10 Jan 2024 05:13:48 +0000 (21:13 -0800)]
bpf: Track aligned st store as imprecise spilled registers
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
r1 = 0
*(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.
For cpuv4, the initialization could be
*(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().
Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.
The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.
I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.
Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.
One test is adjusted to reflect the change in register IDs.
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.
selftests/bpf: Add a test case for 32-bit spill tracking
When a range check is performed on a register that was 32-bit spilled to
the stack, the IDs of the two instances of the register are the same, so
the range should also be the same.
bpf: Make bpf_for_each_spilled_reg consider narrow spills
Adjust the check in bpf_get_spilled_reg to take into account spilled
registers narrower than 64 bits. That allows find_equal_scalars to
properly adjust the range of all spilled registers that have the same
ID. Before this change, it was possible for a register and a spilled
register to have the same IDs but different ranges if the spill was
narrower than 64 bits and a range check was performed on the register.
bpf: make infinite loop detection in is_state_visited() exact
Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
states_equal() would compare scalars for equality only if in old
state scalar has precision mark.
Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:
selftests/bpf: Fix the u64_offset_to_skb_data test
The u64_offset_to_skb_data test is supposed to make a 64-bit fill, but
instead makes a 16-bit one. Fix the test according to its intention and
update the comments accordingly (umax is no longer 0xffff). The 16-bit
fill is covered by u16_offset_to_skb_data.
reviews.llvm.org was LLVM's Phabricator instances for code review. It
has been abandoned in favor of GitHub pull requests. While the majority
of links in the kernel sources still work because of the work Fangrui
has done turning the dynamic Phabricator instance into a static archive,
there are some issues with that work, so preemptively convert all the
links in the kernel sources to point to the commit on GitHub.
Most of the commits have the corresponding differential review link in
the commit message itself so there should not be any loss of fidelity in
the relevant information.
Additionally, fix a typo in the xdpwall.c print ("LLMV" -> "LLVM") while
in the area.
Andrii Nakryiko [Tue, 9 Jan 2024 23:17:38 +0000 (15:17 -0800)]
selftests/bpf: detect testing prog flags support
Various tests specify extra testing prog_flags when loading BPF
programs, like BPF_F_TEST_RND_HI32, and more recently also
BPF_F_TEST_REG_INVARIANTS. While BPF_F_TEST_RND_HI32 is old enough to
not cause much problem on older kernels, BPF_F_TEST_REG_INVARIANTS is
very fresh and unconditionally specifying it causes selftests to fail on
even slightly outdated kernels.
This breaks libbpf CI test against 4.9 and 5.15 kernels, it can break
some local development (done outside of VM), etc.
To prevent this, and guard against similar problems in the future, do
runtime detection of supported "testing flags", and only provide those
that host kernel recognizes.
Dave Thaler [Mon, 8 Jan 2024 21:42:31 +0000 (13:42 -0800)]
Introduce concept of conformance groups
The discussion of what the actual conformance groups should be
is still in progress, so this is just part 1 which only uses
"legacy" for deprecated instructions and "basic" for everything
else. Subsequent patches will add more groups as discussion
continues.
Andrii Nakryiko [Fri, 5 Jan 2024 00:09:05 +0000 (16:09 -0800)]
bpf: support multiple tags per argument
Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.
We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.
Andrii Nakryiko [Fri, 5 Jan 2024 00:09:03 +0000 (16:09 -0800)]
bpf: make sure scalar args don't accept __arg_nonnull tag
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
... in verbose successful test log is very confusing. Use smaller
identifier-like test tag to denote that we are asserting specs array
allocation success.
====================
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark [1]. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. However the performance
gain comes with costs: both the kasan and kcsan checks on the pointer
will be unavailable. Initially the inline is implemented in do_jit() for
x86-64 directly, but I think it will more portable to implement the
inline in verifier.
Patch #1 supports inlining bpf_kptr_xchg() helper and enables it on
x86-4. Patch #2 factors out a helper for newly-added test in patch #3.
Patch #3 tests whether the inlining of bpf_kptr_xchg() is expected.
Please see individual patches for more details. And comments are always
welcome.
Change Log:
v3:
* rebased on bpf-next tree
* patch 1 & 2: Add Rvb-by and Ack-by tags from Eduard
* patch 3: use inline assembly and naked function instead of c code
(suggested by Eduard)
v2: https://lore.kernel.org/bpf/20231223104042.1432300[email protected]/
* rebased on bpf-next tree
* drop patch #1 in v1 due to discussion in [2]
* patch #1: add the motivation in the commit message, merge patch #1
and #3 into the new patch in v2. (Daniel)
* patch #2/#3: newly-added patch to test the inlining of
bpf_kptr_xchg() (Eduard)
Hou Tao [Fri, 5 Jan 2024 10:48:19 +0000 (18:48 +0800)]
selftests/bpf: Test the inlining of bpf_kptr_xchg()
The test uses bpf_prog_get_info_by_fd() to obtain the xlated
instructions of the program first. Since these instructions have
already been rewritten by the verifier, the tests then checks whether
the rewritten instructions are as expected. And to ensure LLVM generates
code exactly as expected, use inline assembly and a naked function.
Hou Tao [Fri, 5 Jan 2024 10:48:17 +0000 (18:48 +0800)]
bpf: Support inlining bpf_kptr_xchg() helper
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.
bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.
It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.
For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
The reason is that commit 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops
CFI") changes the func_addr of arch_prepare_bpf_trampoline in struct_ops
from NULL to non-NULL, while we use func_addr on RV64 to differentiate
between struct_ops and regular trampoline. When the struct_ops testcase
is triggered, it emits wrong prologue and epilogue, and lead to
unpredictable issues. After commit 2cd3e3772e41, we can use
BPF_TRAMP_F_INDIRECT to distinguish them as it always be set in
struct_ops.
Jakub Kicinski [Tue, 23 Jan 2024 16:38:13 +0000 (08:38 -0800)]
Merge tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.8-rc2
The most visible fix here is the ath11k crash fix which was introduced
in v6.7. We also have a fix for iwlwifi memory corruption and few
smaller fixes in the stack.
* tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: fix race condition on enabling fast-xmit
wifi: iwlwifi: fix a memory corruption
wifi: mac80211: fix potential sta-link leak
wifi: cfg80211/mac80211: remove dependency on non-existing option
wifi: cfg80211: fix missing interfaces when dumping
wifi: ath11k: rely on mac80211 debugfs handling for vif
wifi: p54: fix GCC format truncation warning with wiphy->fw_version
====================
Merge branch 'netfs-fixes' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
* 'netfs-fixes' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix missing/incorrect unlocking of RCU read lock
afs: Remove afs_dynroot_d_revalidate() as it is redundant
afs: Fix error handling with lookup via FS.InlineBulkStatus
afs: Hide silly-rename files from userspace
cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode
netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()
netfs, fscache: Prevent Oops in fscache_put_cache()
cifs: Don't use certain unnecessary folio_*() functions
afs: Don't use certain unnecessary folio_*() functions
netfs: Don't use certain unnecessary folio_*() functions
eventfs: Save directory inodes in the eventfs_inode structure
The eventfs inodes and directories are allocated when referenced. But this
leaves the issue of keeping consistent inode numbers and the number is
only saved in the inode structure itself. When the inode is no longer
referenced, it can be freed. When the file that the inode was representing
is referenced again, the inode is once again created, but the inode number
needs to be the same as it was before.
Just making the inode numbers the same for all files is fine, but that
does not work with directories. The find command will check for loops via
the inode number and having the same inode number for directories triggers:
# find /sys/kernel/tracing
find: File system loop detected;
'/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
'/sys/kernel/debug/tracing/events/initcall'.
[..]
Linus pointed out that the eventfs_inode structure ends with a single
32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
alignment. We can use this hole to store the inode number for the
eventfs_inode. All directories in eventfs are represented by an
eventfs_inode and that data structure can hold its inode number.
That last int was also purposely placed at the end of the structure to
prevent holes from within. Now that there's a 4 byte number to hold the
inode, both the inode number and the last integer can be moved up in the
structure for better cache locality, where the llist and rcu fields can be
moved to the end as they are only used when the eventfs_inode is being
deleted.
Eric Dumazet [Mon, 22 Jan 2024 11:26:03 +0000 (11:26 +0000)]
inet_diag: skip over empty buckets
After the removal of inet_diag_table_mutex, sock_diag_table_mutex
and sock_diag_mutex, I was able so see spinlock contention from
inet_diag_dump_icsk() when running 100 parallel invocations.
Eric Dumazet [Mon, 22 Jan 2024 11:26:02 +0000 (11:26 +0000)]
sock_diag: remove sock_diag_mutex
sock_diag_rcv() is still serializing its operations using
a mutex, for no good reason.
This came with commit 0a9c73014415 ("[INET_DIAG]: Fix oops
in netlink_rcv_skb"), but the root cause has been fixed
with commit cd40b7d3983c ("[NET]: make netlink user -> kernel
interface synchronious")
Remove this mutex to let multiple threads run concurrently.
Eric Dumazet [Mon, 22 Jan 2024 11:26:00 +0000 (11:26 +0000)]
sock_diag: allow concurrent operations
sock_diag_broadcast_destroy_work() and __sock_diag_cmd()
are currently using sock_diag_table_mutex to protect
against concurrent sock_diag_handlers[] changes.
This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.
Erick Archer [Fri, 19 Jan 2024 17:16:55 +0000 (18:16 +0100)]
wifi: iwlegacy: Use kcalloc() instead of kzalloc()
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.
Also, it is preferred to use sizeof(*pointer) instead of sizeof(type)
due to the type of the variable can change and one needs not change the
former (unlike the latter).
Chih-Kang Chang [Fri, 19 Jan 2024 08:15:01 +0000 (16:15 +0800)]
wifi: rtw89: fix disabling concurrent mode TX hang issue
When disabling concurrent mode and switching to a single interface, the
TX might stuck. The reason is TBTT prohibit area circuit still enable
to block TX. To disable tbtt prohibit area circuit need to delay 2ms to
make it effective. However, we only delay 2us in original code. So we
fix it.
Chih-Kang Chang [Fri, 19 Jan 2024 08:15:00 +0000 (16:15 +0800)]
wifi: rtw89: fix HW scan timeout due to TSF sync issue
When STA connects to an AP and doesn't receive any beacon yet, the
hardware scan is triggered. This scan begins with the default TSF
value. Once STA receives a beacon when switches back to the operating
channel, its TSF synchronizes with the AP. However, if there is a
significant difference in TSF values between the default value and
the synchronized value, it will cause firmware fail to trigger
interrupt, and the C2H won't be sent out. As a result, the scan
continues until a timeout occurs. To fix this issue, we disable TSF
synchronization during scanning to prevent drastic TSF changes, and
enable TSF synchronization after scan.
Po-Hao Huang [Fri, 19 Jan 2024 08:14:56 +0000 (16:14 +0800)]
wifi: rtw89: Set default CQM config if not present
When wpa_supplicant is initiated by users and not by NetworkManager,
the CQM configuration might not be set. Without this setting, ICs
with connection monitor handled by driver won't detect connection
loss. To fix this we prepare a default setting upon associated at
first, then update again if any is given later.
Po-Hao Huang [Fri, 19 Jan 2024 08:14:55 +0000 (16:14 +0800)]
wifi: rtw89: refine hardware scan C2H events
Define struct for scan offload C2H events and update each elements'
bitfield. This patch does not change original behavior, just style
conversion and naming changes.
Chung-Hsuan Hung [Sat, 20 Jan 2024 00:38:31 +0000 (08:38 +0800)]
wifi: rtw89: 8922a: add BTG functions to assist BT coexistence to control TX/RX
These functions are to control baseband AGC while BT coexists with WiFi.
Among these functions, ctrl_btg_bt_rx is used to control AGC related
settings, which is affected by BT RX, while BT shares the same path with
WiFi; ctrl_nbtg_bt_tx is used to control AGC settings under non-shared
path condition, which is affected by BT TX.
Ping-Ke Shih [Sat, 20 Jan 2024 00:38:30 +0000 (08:38 +0800)]
wifi: rtw89: 8922a: add TX power related ops
The ::power_trim is to write bias value programmed in efuse to normalize
TX power, and then using ::set_txpwr_ctrl to set reference TX power value.
The ::set_txpwr is to set final TX power according to regulation of current
country.
Ping-Ke Shih [Sat, 20 Jan 2024 00:38:29 +0000 (08:38 +0800)]
wifi: rtw89: 8922a: add register definitions of H2C, C2H, page, RRSR and EDCCA
Firmware H2C commands and C2H events can go via registers, so define them
accordingly. The page registers are to arrange local buffer of WiFi chip.
RRSR is to define rate selection to transmit BA or ACK. EDCCA is to set
threshold of engine detection mechanism by BB hardware.
Like other chips, define these registers and we can share the same flow.
Ping-Ke Shih [Sat, 20 Jan 2024 00:38:28 +0000 (08:38 +0800)]
wifi: rtw89: 8922a: add chip_ops related to BB init
The chip_ops::bb_preinit and ::bb_postinit are called before and after
loading BB parameters from tables of firmware file. The ::bb_reset is
used to reset hardware state, and currently it is not needed by 8922AE so
leave it as empty. The ::bb_sethw is to implement conditional parameters.
When we are going to up interface to make connection, turn on BB and RF
hardware power by enable_bb_rf ops. Oppositely, using disable_bb_rf to
turn them off.
Ping-Ke Shih [Sat, 20 Jan 2024 00:38:26 +0000 (08:38 +0800)]
wifi: rtw89: add mlo_dbcc_mode for WiFi 7 chips
WiFi 7 chips can operate in various MLO applications, such as 1 link (2SS)
and 2 links (1SS + 1SS), and we should configure different PHY mode for
each of them.
For example,
- MLO_2_PLUS_0_1RF is 1 link with 2SS rate, and enable one RF component.
- MLO_1_PLUS_1_1RF is 2 links with 1SS rate for each, and enable one RF
component that can support two paths.
By default, we set the mode to legacy MLO_DBCC_NOT_SUPPORT (don't support
MLO and DBCC yet), and later we will introduce logic to change the mode.
Amir Goldstein [Sat, 20 Jan 2024 10:18:39 +0000 (12:18 +0200)]
ovl: mark xwhiteouts directory with overlay.opaque='x'
An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').
This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.
This also prevents unnecessary checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.
Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.
Zhengchao Shao [Mon, 22 Jan 2024 01:18:07 +0000 (09:18 +0800)]
netlink: fix potential sleeping issue in mqueue_flush_file
I analyze the potential sleeping issue of the following processes:
Thread A Thread B
... netlink_create //ref = 1
do_mq_notify ...
sock = netlink_getsockbyfilp ... //ref = 2
info->notify_sock = sock; ...
... netlink_sendmsg
... skb = netlink_alloc_large_skb //skb->head is vmalloced
... netlink_unicast
... sk = netlink_getsockbyportid //ref = 3
... netlink_sendskb
... __netlink_sendskb
... skb_queue_tail //put skb to sk_receive_queue
... sock_put //ref = 2
... ...
... netlink_release
... deferred_put_nlk_sk //ref = 1
mqueue_flush_file
spin_lock
remove_notification
netlink_sendskb
sock_put //ref = 0
sk_free
...
__sk_destruct
netlink_sock_destruct
skb_queue_purge //get skb from sk_receive_queue
...
__skb_queue_purge_reason
kfree_skb_reason
__kfree_skb
...
skb_release_all
skb_release_head_state
netlink_skb_destructor
vfree(skb->head) //sleeping while holding spinlock
In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
vmalloc, and is put to sk_receive_queue queue, also the skb is not freed.
When the mqueue executes flush, the sleeping bug will occur. Use
vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
selftest: Don't reuse port for SO_INCOMING_CPU test.
Jakub reported that ASSERT_EQ(cpu, i) in so_incoming_cpu.c seems to
fire somewhat randomly.
# # RUN so_incoming_cpu.before_reuseport.test3 ...
# # so_incoming_cpu.c:191:test3:Expected cpu (32) == i (0)
# # test3: Test terminated by assertion
# # FAIL so_incoming_cpu.before_reuseport.test3
# not ok 3 so_incoming_cpu.before_reuseport.test3
When the test failed, not-yet-accepted CLOSE_WAIT sockets received
SYN with a "challenging" SEQ number, which was sent from an unexpected
CPU that did not create the receiver.
The test basically does:
1. for each cpu:
1-1. create a server
1-2. set SO_INCOMING_CPU
2. for each cpu:
2-1. set cpu affinity
2-2. create some clients
2-3. let clients connect() to the server on the same cpu
2-4. close() clients
3. for each server:
3-1. accept() all child sockets
3-2. check if all children have the same SO_INCOMING_CPU with the server
The root cause was the close() in 2-4. and net.ipv4.tcp_tw_reuse.
In a loop of 2., close() changed the client state to FIN_WAIT_2, and
the peer transitioned to CLOSE_WAIT.
In another loop of 2., connect() happened to select the same port of
the FIN_WAIT_2 socket, and it was reused as the default value of
net.ipv4.tcp_tw_reuse is 2.
As a result, the new client sent SYN to the CLOSE_WAIT socket from
a different CPU, and the receiver's sk_incoming_cpu was overwritten
with unexpected CPU ID.
Also, the SYN had a different SEQ number, so the CLOSE_WAIT socket
responded with Challenge ACK. The new client properly returned RST
and effectively killed the CLOSE_WAIT socket.
This way, all clients were created successfully, but the error was
detected later by 3-2., ASSERT_EQ(cpu, i).
To avoid the failure, let's make sure that (i) the number of clients
is less than the number of available ports and (ii) such reuse never
happens.
On CPUs with weak memory models, reads and updates performed by tcp_push
to the sk variables can get reordered leaving the socket throttled when
it should not. The tasklet running tcp_wfree() may also not observe the
memory updates in time and will skip flushing any packets throttled by
tcp_push(), delaying the sending. This can pathologically cause 40ms
extra latency due to bad interactions with delayed acks.
Adding a memory barrier in tcp_push removes the bug, similarly to the
previous commit bf06200e732d ("tcp: tsq: fix nonagle handling").
smp_mb__after_atomic() is used to not incur in unnecessary overhead
on x86 since not affected.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
public class HelloWorldServlet extends HttpServlet {
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html;charset=utf-8");
OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
String s = "a".repeat(3096);
osw.write(s,0,s.length());
osw.flush();
}
}
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.
No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.91ms
75.000% 1.13ms
90.000% 1.46ms
99.000% 1.74ms
99.900% 1.89ms
99.990% 41.95ms <<< 40+ ms extra latency
99.999% 48.32ms
100.000% 48.96ms
With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.90ms
75.000% 1.13ms
90.000% 1.45ms
99.000% 1.72ms
99.900% 1.83ms
99.990% 2.11ms <<< no 40+ ms extra latency
99.999% 2.53ms
100.000% 2.62ms
Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.
Helge Deller [Mon, 22 Jan 2024 20:26:33 +0000 (21:26 +0100)]
fbdev: stifb: Fix crash in stifb_blank()
Avoid a kernel crash in stifb by providing the correct pointer to the fb_info
struct. Prior to commit e2e0b838a184 ("video/sticore: Remove info field from
STI struct") the fb_info struct was at the beginning of the fb struct.
Fixes: e2e0b838a184 ("video/sticore: Remove info field from STI struct") Signed-off-by: Helge Deller <[email protected]> Cc: Thomas Zimmermann <[email protected]>
David Howells [Tue, 2 Jan 2024 14:02:37 +0000 (14:02 +0000)]
afs: Fix error handling with lookup via FS.InlineBulkStatus
When afs does a lookup, it tries to use FS.InlineBulkStatus to preemptively
look up a bunch of files in the parent directory and cache this locally, on
the basis that we might want to look at them too (for example if someone
does an ls on a directory, they may want want to then stat every file
listed).
FS.InlineBulkStatus can be considered a compound op with the normal abort
code applying to the compound as a whole. Each status fetch within the
compound is then given its own individual abort code - but assuming no
error that prevents the bulk fetch from returning the compound result will
be 0, even if all the constituent status fetches failed.
At the conclusion of afs_do_lookup(), we should use the abort code from the
appropriate status to determine the error to return, if any - but instead
it is assumed that we were successful if the op as a whole succeeded and we
return an incompletely initialised inode, resulting in ENOENT, no matter
the actual reason. In the particular instance reported, a vnode with no
permission granted to be accessed is being given a UAEACCES abort code
which should be reported as EACCES, but is instead being reported as
ENOENT.
Fix this by abandoning the inode (which will be cleaned up with the op) if
file[1] has an abort code indicated and turn that abort code into an error
instead.
Whilst we're at it, add a tracepoint so that the abort codes of the
individual subrequests of FS.InlineBulkStatus can be logged. At the moment
only the container abort code can be 0.
David Howells [Mon, 8 Jan 2024 17:22:36 +0000 (17:22 +0000)]
afs: Hide silly-rename files from userspace
There appears to be a race between silly-rename files being created/removed
and various userspace tools iterating over the contents of a directory,
leading to such errors as:
find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
tar: ./include/linux/greybus/.__afs3C95: File removed before we read it
when building a kernel.
Fix afs_readdir() so that it doesn't return .__afsXXXX silly-rename files
to userspace. This doesn't stop them being looked up directly by name as
we need to be able to look them up from within the kernel as part of the
silly-rename algorithm.