Simon Horman [Mon, 20 Nov 2023 08:28:40 +0000 (08:28 +0000)]
MAINTAINERS: Add indirect_call_wrapper.h to NETWORKING [GENERAL]
indirect_call_wrapper.h is not, strictly speaking, networking specific.
However, it's git history indicates that in practice changes go through
netdev and thus the netdev maintainers have effectively been taking
responsibility for it.
Formalise this by adding it to the NETWORKING [GENERAL] section in the
MAINTAINERS file.
It is not clear how many other files under include/linux fall into this
category and it would be interesting, as a follow-up, to audit that and
propose further updates to the MAINTAINERS file as appropriate.
Long Li [Sun, 19 Nov 2023 16:23:43 +0000 (08:23 -0800)]
hv_netvsc: Mark VF as slave before exposing it to user-mode
When a VF is being exposed form the kernel, it should be marked as "slave"
before exposing to the user-mode. The VF is not usable without netvsc
running as master. The user-mode should never see a VF without the "slave"
flag.
This commit moves the code of setting the slave flag to the time before
VF is exposed to user-mode.
Haiyang Zhang [Sun, 19 Nov 2023 16:23:41 +0000 (08:23 -0800)]
hv_netvsc: fix race of netvsc and VF register_netdevice
The rtnl lock also needs to be held before rndis_filter_device_add()
which advertises nvsp_2_vsc_capability / sriov bit, and triggers
VF NIC offering and registering. If VF NIC finished register_netdev()
earlier it may cause name based config failure.
To fix this issue, move the call to rtnl_lock() before
rndis_filter_device_add(), so VF will be registered later than netvsc
/ synthetic NIC, and gets a name numbered (ethX) after netvsc.
Zhengchao Shao [Sat, 18 Nov 2023 08:16:53 +0000 (16:16 +0800)]
bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk
If failed to allocate "tags" or could not find the final upper device from
start_dev's upper list in bond_verify_device_path(), only the loopback
detection of the current upper device should be affected, and the system is
no need to be panic.
So return -ENOMEM in alb_upper_dev_walk to stop walking, print some warn
information when failed to allocate memory for vlan tags in
bond_verify_device_path.
I also think that the following function calls
netdev_walk_all_upper_dev_rcu
---->>>alb_upper_dev_walk
---------->>>bond_verify_device_path
From this way, "end device" can eventually be obtained from "start device"
in bond_verify_device_path, IS_ERR(tags) could be instead of
IS_ERR_OR_NULL(tags) in alb_upper_dev_walk.
Shinas Rasheed [Fri, 17 Nov 2023 10:38:10 +0000 (02:38 -0800)]
octeon_ep: support Octeon CN10K devices
Add PCI Endpoint NIC support for Octeon CN10K devices.
CN10K devices are part of Octeon 10 family products with
similar PCI NIC characteristics. These include:
- CN10KA
- CNF10KA
- CNF10KB
- CN10KB
core.c:116: warning: Function parameter or member 'ioss_evtconfig' not described in 'telemetry_update_events'
core.c:188: warning: Function parameter or member 'ioss_evtconfig' not described in 'telemetry_get_eventconfig'
It looks like it were copy'n'paste typos when these descriptions
had been introduced. Fix the typos.
Hans de Goede [Mon, 20 Nov 2023 15:45:45 +0000 (16:45 +0100)]
MAINTAINERS: Drop Mark Gross as maintainer for x86 platform drivers
Mark has not really been active as maintainer for x86 platform drivers
lately, drop Mark from the MAINTAINERS entries for drivers/platform/x86,
drivers/platform/mellanox and drivers/platform/surface.
When a cpu is hot-unplugged, it is put in idle state and the function
arch_cpu_idle_dead() is called. The timer interrupt for this processor
should be disabled, otherwise there will be pending timer interrupt for
the unplugged cpu, so that vcpu is prevented from giving up scheduling
when system is running in vm mode.
This patch implements the timer shutdown interface so that the constant
timer will be properly disabled when a CPU is hot-unplugged.
Huacai Chen [Tue, 21 Nov 2023 07:03:25 +0000 (15:03 +0800)]
LoongArch: Silence the boot warning about 'nokaslr'
The kernel parameter 'nokaslr' is handled before start_kernel(), so we
don't need early_param() to mark it technically. But it can cause a boot
warning as follows:
Unknown kernel command line parameters "nokaslr", will be passed to user space.
When we use 'init=/bin/bash', 'nokaslr' which passed to user space will
even cause a kernel panic. So we use early_param() to mark 'nokaslr',
simply print a notice and silence the boot warning (also fix a potential
panic). This logic is similar to RISC-V.
Huacai Chen [Tue, 21 Nov 2023 07:03:25 +0000 (15:03 +0800)]
LoongArch: Add __percpu annotation for __percpu_read()/__percpu_write()
When build kernel with C=1, we get:
arch/loongarch/kernel/process.c:234:46: warning: incorrect type in argument 1 (different address spaces)
arch/loongarch/kernel/process.c:234:46: expected void *ptr
arch/loongarch/kernel/process.c:234:46: got unsigned long [noderef] __percpu *
arch/loongarch/kernel/process.c:234:46: warning: incorrect type in argument 1 (different address spaces)
arch/loongarch/kernel/process.c:234:46: expected void *ptr
arch/loongarch/kernel/process.c:234:46: got unsigned long [noderef] __percpu *
arch/loongarch/kernel/process.c:234:46: warning: incorrect type in argument 1 (different address spaces)
arch/loongarch/kernel/process.c:234:46: expected void *ptr
arch/loongarch/kernel/process.c:234:46: got unsigned long [noderef] __percpu *
arch/loongarch/kernel/process.c:234:46: warning: incorrect type in argument 1 (different address spaces)
arch/loongarch/kernel/process.c:234:46: expected void *ptr
arch/loongarch/kernel/process.c:234:46: got unsigned long [noderef] __percpu *
Add __percpu annotation for __percpu_read()/__percpu_write() can avoid
such warnings. __percpu_xchg() and other functions don't need annotation
because their wrapper, i.e. _pcp_protect(), already suppresses warnings.
WANG Rui [Tue, 21 Nov 2023 07:03:25 +0000 (15:03 +0800)]
LoongArch: Record pc instead of offset in la_abs relocation
To clarify, the previous version functioned flawlessly. However, it's
worth noting that the LLVM's LoongArch backend currently lacks support
for cross-section label calculations. With this patch, we enable the use
of clang to compile relocatable kernels.
WANG Rui [Tue, 21 Nov 2023 07:03:25 +0000 (15:03 +0800)]
LoongArch: Explicitly set -fdirect-access-external-data for vmlinux
After this llvm commit [1], The -fno-pic does not imply direct access
external data. Explicitly set -fdirect-access-external-data for vmlinux
that can avoids GOT entries.
====================
verify callbacks as if they are called unknown number of times
This series updates verifier logic for callback functions handling.
Current master simulates callback body execution exactly once,
which leads to verifier not detecting unsafe programs like below:
This was reported previously in [0].
The basic idea of the fix is to schedule callback entry state for
verification in env->head until some identical, previously visited
state in current DFS state traversal is found. Same logic as with open
coded iterators, and builds on top recent fixes [1] for those.
The series is structured as follows:
- patches #1,2,3 update strobemeta, xdp_synproxy selftests and
bpf_loop_bench benchmark to allow convergence of the bpf_loop
callback states;
- patches #4,5 just shuffle the code a bit;
- patch #6 is the main part of the series;
- patch #7 adds test cases for #6;
- patch #8 extend patch #6 with same speculative scalar widening
logic, as used for open coded iterators;
- patch #9 adds test cases for #8;
- patch #10 extends patch #6 to track maximal number of callback
executions specifically for bpf_loop();
- patch #11 adds test cases for #10.
Veristat results comparing this series to master+patches #1,2,3 using selftests
show the following difference:
Veristat results comparing this series to master using Tetragon BPF
files [2] also show some differences.
States diff varies from +2% to +15% on 23 programs out of 186,
no new failures.
Changelog:
- V3 [5] -> V4, changes suggested by Andrii:
- validate mark_chain_precision() result in patch #10;
- renaming s/cumulative_callback_depth/callback_unroll_depth/.
- V2 [4] -> V3:
- fixes in expected log messages for test cases:
- callback_result_precise;
- parent_callee_saved_reg_precise_with_callback;
- parent_stack_slot_precise_with_callback;
- renamings (suggested by Alexei):
- s/callback_iter_depth/cumulative_callback_depth/
- s/is_callback_iter_next/calls_callback/
- s/mark_callback_iter_next/mark_calls_callback/
- prepare_func_exit() updated to exit with -EFAULT when
callee->in_callback_fn is true but calls_callback() is not true
for callsite;
- test case 'bpf_loop_iter_limit_nested' rewritten to use return
value check instead of verifier log message checks
(suggested by Alexei).
- V1 [3] -> V2, changes suggested by Andrii:
- small changes for error handling code in __check_func_call();
- callback body processing log is now matched in relevant
verifier_subprog_precision.c tests;
- R1 passed to bpf_loop() is now always marked as precise;
- log level 2 message for bpf_loop() iteration termination instead of
iteration depth messages;
- __no_msg macro removed;
- bpf_loop_iter_limit_nested updated to avoid using __no_msg;
- commit message for patch #3 updated according to Alexei's request.
Eduard Zingerman [Tue, 21 Nov 2023 02:07:01 +0000 (04:07 +0200)]
selftests/bpf: check if max number of bpf_loop iterations is tracked
Check that even if bpf_loop() callback simulation does not converge to
a specific state, verification could proceed via "brute force"
simulation of maximal number of callback calls.
Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.
This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.
For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.
Eduard Zingerman [Tue, 21 Nov 2023 02:06:58 +0000 (04:06 +0200)]
bpf: widening for callback iterators
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:
Eduard Zingerman [Tue, 21 Nov 2023 02:06:57 +0000 (04:06 +0200)]
selftests/bpf: tests for iterating callbacks
A set of test cases to check behavior of callback handling logic,
check if verifier catches the following situations:
- program not safe on second callback iteration;
- program not safe on zero callback iterations;
- infinite loop inside a callback.
Verify that callback logic works for bpf_loop, bpf_for_each_map_elem,
bpf_user_ringbuf_drain, bpf_find_vma.
Eduard Zingerman [Tue, 21 Nov 2023 02:06:56 +0000 (04:06 +0200)]
bpf: verify callbacks as if they are called unknown number of times
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
some identical parent state is found.
Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
calls to avoid previously masked unrelated error reports:
- cb_refs.c:underflow_prog
- exceptions_fail.c:reject_rbtree_add_throw
- exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
log trace:
- verifier_subprog_precision.c:callback_result_precise
(note: r0 precision is no longer propagated inside callback and
I think this is a correct behavior)
- verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
- verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
Eduard Zingerman [Tue, 21 Nov 2023 02:06:55 +0000 (04:06 +0200)]
bpf: extract setup_func_entry() utility function
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.
Eduard Zingerman [Tue, 21 Nov 2023 02:06:54 +0000 (04:06 +0200)]
bpf: extract __check_reg_arg() utility function
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
a parameter;
The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.
Eduard Zingerman [Tue, 21 Nov 2023 02:06:53 +0000 (04:06 +0200)]
selftests/bpf: fix bpf_loop_bench for new callback verification scheme
This is a preparatory change. A follow-up patch "bpf: verify callbacks
as if they are called unknown number of times" changes logic for
callbacks handling. While previously callbacks were verified as a
single function call, new scheme takes into account that callbacks
could be executed unknown number of times.
This has dire implications for bpf_loop_bench:
SEC("fentry/" SYS_PREFIX "sys_getpgid")
int benchmark(void *ctx)
{
for (int i = 0; i < 1000; i++) {
bpf_loop(nr_loops, empty_callback, NULL, 0);
__sync_add_and_fetch(&hits, nr_loops);
}
return 0;
}
W/o callbacks change verifier sees it as a 1000 calls to
empty_callback(). However, with callbacks change things become
exponential:
- i=0: state exploring empty_callback is scheduled with i=0 (a);
- i=1: state exploring empty_callback is scheduled with i=1;
...
- i=999: state exploring empty_callback is scheduled with i=999;
- state (a) is popped from stack;
- i=1: state exploring empty_callback is scheduled with i=1;
...
Avoid this issue by rewriting outer loop as bpf_loop().
Unfortunately, this adds a function call to a loop at runtime, which
negatively affects performance:
throughput latency
before: 149.919 ± 0.168 M ops/s, 6.670 ns/op
after : 137.040 ± 0.187 M ops/s, 7.297 ns/op
Eduard Zingerman [Tue, 21 Nov 2023 02:06:52 +0000 (04:06 +0200)]
selftests/bpf: track string payload offset as scalar in strobemeta
This change prepares strobemeta for update in callbacks verification
logic. To allow bpf_loop() verification converge when multiple
callback iterations are considered:
- track offset inside strobemeta_payload->payload directly as scalar
value;
- at each iteration make sure that remaining
strobemeta_payload->payload capacity is sufficient for execution of
read_{map,str}_var functions;
- make sure that offset is tracked as unbound scalar between
iterations, otherwise verifier won't be able infer that bpf_loop
callback reaches identical states.
Eduard Zingerman [Tue, 21 Nov 2023 02:06:51 +0000 (04:06 +0200)]
selftests/bpf: track tcp payload offset as scalar in xdp_synproxy
This change prepares syncookie_{tc,xdp} for update in callbakcs
verification logic. To allow bpf_loop() verification converge when
multiple callback itreations are considered:
- track offset inside TCP payload explicitly, not as a part of the
pointer;
- make sure that offset does not exceed MAX_PACKET_OFF enforced by
verifier;
- make sure that offset is tracked as unbound scalar between
iterations, otherwise verifier won't be able infer that bpf_loop
callback reaches identical states.
Pedro Tammela [Fri, 17 Nov 2023 17:12:07 +0000 (14:12 -0300)]
selftests: tc-testing: timeout on unbounded loops
In the spirit of failing early, timeout on unbounded loops that take
longer than 20 ticks to complete. Such loops are to ensure that objects
created are already visible so tests can proceed without any issues.
If a test setup takes more than 20 ticks to see an object, there's
definetely something wrong.
Pedro Tammela [Fri, 17 Nov 2023 17:12:05 +0000 (14:12 -0300)]
selftests: tc-testing: use netns delete from pyroute2
When pyroute2 is available, use the native netns delete routine instead
of calling iproute2 to do it. As forks are expensive with some kernel
configs, minimize its usage to avoid kselftests timeouts.
Pedro Tammela [Fri, 17 Nov 2023 17:12:04 +0000 (14:12 -0300)]
selftests: tc-testing: move back to per test ns setup
Surprisingly in kernel configs with most of the debug knobs turned on,
pre-allocating the test resources makes tdc run much slower overall than
when allocating resources on a per test basis.
As these knobs are used in kselftests in downstream CIs, let's go back
to the old way of doing things to avoid kselftests timeouts.
Pedro Tammela [Fri, 17 Nov 2023 17:12:03 +0000 (14:12 -0300)]
selftests: tc-testing: cap parallel tdc to 4 cores
We have observed a lot of lock contention and test instability when running with >8 cores.
Enough to actually make the tests run slower than with fewer cores.
Cap the maximum cores of parallel tdc to 4 which showed in testing to
be a reasonable number for efficiency and stability in different kernel
config scenarios.
Jakub Kicinski [Tue, 21 Nov 2023 02:04:32 +0000 (18:04 -0800)]
Merge branch 'nfp-add-flow-steering-support'
Louis Peens says:
====================
nfp: add flow-steering support
This short series adds flow steering support for the nfp driver.
The first patch adds the part to communicate with ethtool but
stubs out the HW offload parts. The second patch implements the
HW communication and offloads flow steering.
After this series the user can now use 'ethtool -N/-n' to configure
and display rx classification rules.
====================
Yinjun Zhang [Fri, 17 Nov 2023 07:11:13 +0000 (09:11 +0200)]
nfp: add ethtool flow steering callbacks
This is the first part to implement flow steering. The communication
between ethtool and driver is done. User can use following commands
to display and set flows:
The axiethernet driver can use the dmaengine framework to communicate
with the xilinx DMAengine driver(AXIDMA, MCDMA). The inspiration behind
this dmaengine adoption is to reuse the in-kernel xilinx dma engine
driver[1] and remove redundant dma programming sequence[2] from the
ethernet driver. This simplifies the ethernet driver and also makes
it generic to be hooked to any complaint dma IP i.e AXIDMA, MCDMA
without any modification.
The dmaengine framework was extended for metadata API support during
the axidma RFC[3] discussion. However, it still needs further
enhancements to make it well suited for ethernet usecases.
Comments, suggestions, thoughts to implement remaining functional
features are very welcome!
Add dmaengine framework to communicate with the xilinx DMAengine
driver(AXIDMA).
Axi ethernet driver uses separate channels for transmit and receive.
Add support for these channels to handle TX and RX with skb and
appropriate callbacks. Also add axi ethernet core interrupt for
dmaengine framework support.
The dmaengine framework was extended for metadata API support.
However it still needs further enhancements to make it well suited for
ethernet usecases. The ethernet features i.e ethtool set/get of DMA IP
properties, ndo_poll_controller,(mentioned in TODO) are not supported
and it requires follow-up discussions.
dmaengine support has a dependency on xilinx_dma as it uses
xilinx_vdma_channel_set_config() API to reset the DMA IP
which internally reset MAC prior to accessing MDIO.
Benchmark with netperf:
xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 10.02 886.69
xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t UDP_STREAM
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
net: axienet: Preparatory changes for dmaengine support
The axiethernet driver has inbuilt dma programming. In order to add
dmaengine support and make it's integration seamless the current axidma
inbuilt programming code is put under use_dmaengine check.
It also performs minor code reordering to minimize conditional
use_dmaengine checks and there is no functional change. It uses
"dmas" property to identify whether it should use a dmaengine
framework or inbuilt axidma programming.
dt-bindings: net: xlnx,axi-ethernet: Introduce DMA support
Xilinx 1G/2.5G Ethernet Subsystem provides 32-bit AXI4-Stream buses to
move transmit and receive Ethernet data to and from the subsystem.
These buses are designed to be used with an AXI Direct Memory Access(DMA)
IP or AXI Multichannel Direct Memory Access (MCDMA) IP core, AXI4-Stream
Data FIFO, or any other custom logic in any supported device.
Primary high-speed DMA data movement between system memory and stream
target is through the AXI4 Read Master to AXI4 memory-mapped to stream
(MM2S) Master, and AXI stream to memory-mapped (S2MM) Slave to AXI4
Write Master. AXI DMA/MCDMA enables channel of data movement on both
MM2S and S2MM paths in scatter/gather mode.
AXI DMA has two channels where as MCDMA has 16 Tx and 16 Rx channels.
To uniquely identify each channel use 'chan' suffix. Depending on the
usecase AXI ethernet driver can request any combination of multichannel
DMA channels using generic dmas, dma-names properties.
Willem de Bruijn [Thu, 16 Nov 2023 20:34:43 +0000 (15:34 -0500)]
selftests: net: verify fq per-band packet limit
Commit 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR
scheduling") introduces multiple traffic bands, and per-band maximum
packet count.
Per-band limits ensures that packets in one class cannot fill the
entire qdisc and so cause DoS to the traffic in the other classes.
Verify this behavior:
1. set the limit to 10 per band
2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
Packets must remain queued for a period to trigger this behavior.
Use SO_TXTIME to store packets for 100 msec.
The test reuses existing upstream test infra. The script is a fork of
cmsg_time.sh. The scripts call cmsg_sender.
The test extends cmsg_sender with two arguments:
* '-P' SO_PRIORITY
There is a subtle difference between IPv4 and IPv6 stack behavior:
PF_INET/IP_TOS sets IP header bits and sk_priority
PF_INET6/IPV6_TCLASS sets IP header bits BUT NOT sk_priority
* '-n' num pkts
Send multiple packets in quick succession.
I first attempted a for loop in the script, but this is too slow in
virtualized environments, causing flakiness as the 100ms timeout is
reached and packets are dequeued.
Also do not wait for timestamps to be queued unless timestamps are
requested.
The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
device supports placing the descriptors in memory back to back or
reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
configurable hardware setting. Currently DSPACE is unnecessarily set to
match the host's L1 cache line size, resulting in space reserved in
between descriptors in most platforms and causing a suboptimal behavior
(single PCIe Mem transaction per descriptor). By changing the setting
to DSPACE=16 many descriptors can be packed in a single PCIe Mem
transaction resulting in a massive performance improvement in
bidirectional tests without any negative effects.
Tested and verified improvements on x64 PC and several ARM platforms
(typical data below)
RK3399 Spec:
The SOM-RK3399 is ARM module designed and developed by FriendlyElec.
Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode
Andrii Nakryiko [Mon, 20 Nov 2023 18:04:52 +0000 (10:04 -0800)]
selftests/bpf: reduce verboseness of reg_bounds selftest logs
Reduce verboseness of test_progs' output in reg_bounds set of tests with
two changes.
First, instead of each different operator (<, <=, >, ...) being it's own
subtest, combine all different ops for the same (x, y, init_t, cond_t)
values into single subtest. Instead of getting 6 subtests, we get one
generic one, e.g.:
Second, for random generated test cases, treat all of them as a single
test to eliminate very verbose output with random values in them. So now
we'll just get one line per each combination of (init_t, cond_t),
instead of 6 x 25 = 150 subtests before this change:
#225 reg_bounds_rand_consts_s32_s32:OK
Given we reduce verboseness so much, it makes sense to do a bit more
random testing, so we also bump default number of random tests to 100,
up from 25. This doesn't increase runtime significantly, especially in
parallelized mode.
With all the above changes we still make sure that we have all the
information necessary for reproducing test case if it happens to fail.
That includes reporting random seed and specific operator that is
failing. Those will only be printed to console if related test/subtest
fails, so it doesn't have any added verboseness implications.
Martin KaFai Lau [Mon, 20 Nov 2023 18:15:17 +0000 (10:15 -0800)]
Merge branch 'bpf_redirect_peer fixes'
Daniel Borkmann says:
====================
This fixes bpf_redirect_peer stats accounting for veth and netkit,
and adds tstats in the first place for the latter. Utilise indirect
call wrapper for bpf_redirect_peer, and improve test coverage of the
latter also for netkit devices. Details in the patches, thanks!
The series was targeted at bpf originally, and is done here as well,
so it can trigger BPF CI. Jakub, if you think directly going via net
is better since the majority of the diff touches net anyway, that is
fine, too.
Thanks!
v2 -> v3:
- Add kdoc for pcpu_stat_type (Simon)
- Reject invalid type value in netdev_do_alloc_pcpu_stats (Simon)
- Add Reviewed-by tags from list
v1 -> v2:
- Move stats allocation/freeing into net core (Jakub)
- As prepwork for the above, move vrf's dstats over into the core
- Add a check into stats alloc to enforce tstats upon
implementing ndo_get_peer_dev
- Add Acked-by tags from list
Daniel Borkmann (6):
net, vrf: Move dstats structure to core
net: Move {l,t,d}stats allocation to core and convert veth & vrf
netkit: Add tstats per-CPU traffic counters
bpf, netkit: Add indirect call wrapper for fetching peer dev
selftests/bpf: De-veth-ize the tc_redirect test case
selftests/bpf: Add netkit to tc_redirect selftest
====================
Daniel Borkmann [Tue, 14 Nov 2023 00:42:20 +0000 (01:42 +0100)]
selftests/bpf: Add netkit to tc_redirect selftest
Extend the existing tc_redirect selftest to also cover netkit devices
for exercising the bpf_redirect_peer() code paths, so that we have both
veth as well as netkit covered, all tests still pass after this change.
Daniel Borkmann [Tue, 14 Nov 2023 00:42:19 +0000 (01:42 +0100)]
selftests/bpf: De-veth-ize the tc_redirect test case
No functional changes to the test case, but just renaming various functions,
variables, etc, to remove veth part of their name for making it more generic
and reusable later on (e.g. for netkit).
Daniel Borkmann [Tue, 14 Nov 2023 00:42:18 +0000 (01:42 +0100)]
bpf, netkit: Add indirect call wrapper for fetching peer dev
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
indirect call wrapper and therefore optimize the bpf_redirect_peer()
internal handling a bit. Add a small skb_get_peer_dev() wrapper which
utilizes the INDIRECT_CALL_1() macro instead of open coding.
Future work could potentially add a peer pointer directly into struct
net_device in future and convert veth and netkit over to use it so
that eventually ndo_get_peer_dev can be removed.
Peilin Ye [Tue, 14 Nov 2023 00:42:17 +0000 (01:42 +0100)]
bpf: Fix dev's rx stats for bpf_redirect_peer traffic
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Peilin Ye [Tue, 14 Nov 2023 00:42:16 +0000 (01:42 +0100)]
veth: Use tstats per-CPU traffic counters
Currently veth devices use the lstats per-CPU traffic counters, which only
cover TX traffic. veth_get_stats64() actually populates RX stats of a veth
device from its peer's TX counters, based on the assumption that a veth
device can _only_ receive packets from its peer, which is no longer true:
For example, recent CNIs (like Cilium) can use the bpf_redirect_peer() BPF
helper to redirect traffic from NIC's tc ingress to veth's tc ingress (in
a different netns), skipping veth's peer device. Unfortunately, this kind
of traffic isn't currently accounted for in veth's RX stats.
In preparation for the fix, use tstats (instead of lstats) to maintain
both RX and TX counters for each veth device. We'll use RX counters for
bpf_redirect_peer() traffic, and keep using TX counters for the usual
"peer-to-peer" traffic. In veth_get_stats64(), calculate RX stats by
_adding_ RX count to peer's TX count, in order to cover both kinds of
traffic.
veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
for less confusion, but let's leave it to another patch to keep the fix
minimal.
Daniel Borkmann [Tue, 14 Nov 2023 00:42:15 +0000 (01:42 +0100)]
netkit: Add tstats per-CPU traffic counters
Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
RX and TX counters.
The dev's TX counters are bumped upon pass/unspec as well as redirect
verdicts, in other words, on everything except for drops.
The dev's RX counters are bumped upon successful __netif_rx(), as well
as from skb_do_redirect() (not part of this commit here).
Using dev->lstats with having just a single packets/bytes counter and
inferring one another's RX counters from the peer dev's lstats is not
possible given skb_do_redirect() can also bump the device's stats.
Daniel Borkmann [Tue, 14 Nov 2023 00:42:14 +0000 (01:42 +0100)]
net: Move {l,t,d}stats allocation to core and convert veth & vrf
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Daniel Borkmann [Tue, 14 Nov 2023 00:42:13 +0000 (01:42 +0100)]
net, vrf: Move dstats structure to core
Just move struct pcpu_dstats out of the vrf into the core, and streamline
the field names slightly, so they better align with the {t,l}stats ones.
No functional change otherwise. A conversion of the u64s to u64_stats_t
could be done at a separate point in future. This move is needed as we are
moving the {t,l,d}stats allocation/freeing to the core.
Linus Torvalds [Sun, 19 Nov 2023 21:54:28 +0000 (13:54 -0800)]
Merge tag 'kbuild-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix section mismatch warning messages for riscv and loongarch
- Remove CONFIG_IA64 left-over from linux/export-internal.h
- Fix the location of the quotes for UIMAGE_NAME
- Fix a memory leak bug in Kconfig
* tag 'kbuild-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: fix memory leak from range properties
kbuild: Move the single quotes for image name
linux/export: clean up the IA-64 KSYM_FUNC macro
modpost: fix section mismatch message for RELA
Linus Torvalds [Sun, 19 Nov 2023 21:49:32 +0000 (13:49 -0800)]
Merge tag 'irq_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Flush the translation service tables to prevent unpredictable
behavior on non-coherent GIC devices
* tag 'irq_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Flush ITS tables correctly in non-coherent GIC designs
Linus Torvalds [Sun, 19 Nov 2023 21:46:17 +0000 (13:46 -0800)]
Merge tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Ignore invalid x2APIC entries in order to not waste per-CPU data
- Fix a back-to-back signals handling scenario when shadow stack is in
use
- A documentation fix
- Add Kirill as TDX maintainer
* tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi: Ignore invalid x2APIC entries
x86/shstk: Delay signal entry SSP write until after user accesses
x86/Documentation: Indent 'note::' directive for protocol version number note
MAINTAINERS: Add Intel TDX entry
Linus Torvalds [Sun, 19 Nov 2023 21:35:07 +0000 (13:35 -0800)]
Merge tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Do the push of pending hrtimers away from a CPU which is being
offlined earlier in the offlining process in order to prevent a
deadlock
* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimers: Push pending hrtimers away from outgoing CPU earlier
Linus Torvalds [Sun, 19 Nov 2023 21:32:00 +0000 (13:32 -0800)]
Merge tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix virtual runtime calculation when recomputing a sched entity's
weights
- Fix wrongly rejected unprivileged poll requests to the cgroup psi
pressure files
- Make sure the load balancing is done by only one CPU
* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the decision for load balance
sched: psi: fix unprivileged polling against cgroups
sched/eevdf: Fix vruntime adjustment on reweight
Suman Ghosh [Fri, 17 Nov 2023 10:40:18 +0000 (16:10 +0530)]
octeontx2-pf: Fix memory leak during interface down
During 'ifconfig <netdev> down' one RSS memory was not getting freed.
This patch fixes the same.
Fixes: 81a4362016e7 ("octeontx2-pf: Add RSS multi group support") Signed-off-by: Suman Ghosh <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Sun, 19 Nov 2023 19:46:40 +0000 (19:46 +0000)]
Merge branch 'am65-cpsw-ethtool-mac-stats'
Roger Quadros says:
===================
net: eth: am65-cpsw: add ethtool MAC stats
Gets 'ethtool -S eth0 --groups eth-mac' command to work.
Also set default TX channels to maximum available and does
cleanup in am65_cpsw_nuss_common_open() error path.
Changelog:
v2:
- add __iomem to *stats, to prevent sparse warning
- clean up RX descriptors and free up SKB in error handling of
am65_cpsw_nuss_common_open()
- Re-arrange some funcitons to avoid forward declaration
====================
Roger Quadros [Fri, 17 Nov 2023 12:17:55 +0000 (14:17 +0200)]
net: ethernet: ti: am65-cpsw: Fix error handling in am65_cpsw_nuss_common_open()
k3_udma_glue_enable_rx/tx_chn returns error code on failure.
Bail out on error while enabling TX/RX channel.
In the error path, clean up the RX descriptors and SKBs.
Get rid of kmemleak_not_leak() as it seems unnecessary now.
Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Roger Quadros <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Roger Quadros [Fri, 17 Nov 2023 12:17:54 +0000 (14:17 +0200)]
net: ethernet: am65-cpsw: Set default TX channels to maximum
am65-cpsw supports 8 TX hardware queues. Set this as default.
The rationale is that some am65-cpsw devices can have up to 4 ethernet
ports. If the number of TX channels have to be changed then all
interfaces have to be brought down and up as the old default of 1
TX channel is too restrictive for any mqprio/taprio usage.
Another reason for this change is to allow testing using
kselftest:net/forwarding:ethtool_mm.sh out of the box.
Jiawen Wu [Fri, 17 Nov 2023 10:11:08 +0000 (18:11 +0800)]
net: wangxun: fix kernel panic due to null pointer
When the device uses a custom subsystem vendor ID, the function
wx_sw_init() returns before the memory of 'wx->mac_table' is allocated.
The null pointer will causes the kernel panic.
Fixes: 79625f45ca73 ("net: wangxun: Move MAC address handling to libwx") Signed-off-by: Jiawen Wu <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Li RongQing [Wed, 15 Nov 2023 12:01:08 +0000 (20:01 +0800)]
rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlink
if a PF has 256 or more VFs, ip link command will allocate an order 3
memory or more, and maybe trigger OOM due to memory fragment,
the VFs needed memory size is computed in rtnl_vfinfo_size.
so introduce nlmsg_new_large which calls netlink_alloc_large_skb in
which vmalloc is used for large memory, to avoid the failure of
allocating memory
Jakub Kicinski [Sun, 19 Nov 2023 03:46:32 +0000 (19:46 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: one by one port representors creation
Michal Swiatkowski says:
Currently ice supports creating port representors only for VFs. For that
use case they can be created and removed in one step.
This patchset is refactoring current flow to support port representor
creation also for subfunctions and SIOV. In this case port representors
need to be created and removed one by one. Also, they can be added and
removed while other port representors are running.
To achieve that we need to change the switchdev configuration flow.
Three first patches are only cosmetic (renaming, removing not used code).
Next few ones are preparation for new flow. The most important one
is "add VF representor one by one". It fully implements new flow.
New type of port representor (for subfunction) will be introduced in
follow up patchset.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: reserve number of CP queues
ice: adjust switchdev rebuild path
ice: add VF representors one by one
ice: realloc VSI stats arrays
ice: set Tx topology every time new repr is added
ice: allow changing SWITCHDEV_CTRL VSI queues
ice: return pointer to representor
ice: make representor code generic
ice: remove VF pointer reference in eswitch code
ice: track port representors in xarray
ice: use repr instead of vf->repr
ice: track q_id in representor
ice: remove unused control VSI parameter
ice: remove redundant max_vsi_num variable
ice: rename switchdev to eswitch
====================
Jakub Kicinski [Sun, 19 Nov 2023 03:42:29 +0000 (19:42 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
igc: Add support for physical + free-running timers
Vinicius Costa Gomes says:
The objective is to allow having functionality that depends on the
physical timer (taprio and ETF offloads, for example) and vclocks
operating together.
The "big" missing piece is the implementation of the .getcyclesx64()
function in igc, as i225/i226 have multiple timers, we use one of
those timers (timer 1) as a free-running (non adjustable) timer.
The complication is that only implementing .getcyclesx64() and nothing
else will break synchronization when using vclocks, as reading the clock
will retrieve the free-running value but timnestamps will come from the
adjustable timer. The solution is to modify "in one go" the timestamping
code to be able to retrieve the timestamp from the correct timer (if a
socket is "phc_bound" to a vclock the timestamp will come from the
free-running timer).
I was debating whether or not to do the adjustments for the internal latencies
for the free-running timestamps, decided to do the adjustments so the path
delay when using vclocks is similar to the one when using the physical clock.
One future improvement is to implement the .getcrosscycles() function.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igc: Add support for PTP .getcyclesx64()
igc: Simplify setting flags in the TX data descriptor
====================
====================
net/sched: cls_u32: use proper refcounts
In u32 we are open coding refcounts of hashtables with integers which is
far from ideal. Update those with proper refcount and add a couple of
tests to tdc that exercise the refcounts explicitly.
====================
Pedro Tammela [Tue, 14 Nov 2023 14:18:56 +0000 (11:18 -0300)]
selftests/tc-testing: add hashtable tests for u32
Add tests to specifically check for the refcount interactions of
hashtables created by u32. These tables should not be deleted when
referenced and the flush order should respect a tree like composition.
Pedro Tammela [Tue, 14 Nov 2023 14:18:55 +0000 (11:18 -0300)]
net/sched: cls_u32: replace int refcounts with proper refcounts
Proper refcounts will always warn splat when something goes wrong,
be it underflow, saturation or object resurrection. As these are always
a source of bugs, use it in cls_u32 as a safeguard to prevent/catch issues.
Another benefit is that the refcount API self documents the code, making
clear when transitions to dead are expected.
For such an update we had to make minor adaptations on u32 to fit the refcount
API. First we set explicitly to '1' when objects are created, then the
objects are alive until a 1 -> 0 happens, which is then released appropriately.
The above made clear some redundant operations in the u32 code
around the root_ht handling that were removed. The root_ht is created
with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2.
Throughout the entire code the root_ht is an exceptional case and can never be referenced,
therefore the refcnt never incremented/decremented.
Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto
the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did
a double decrement to free it, which is not a fit for the refcount API.
Even though refcount_t is implemented using atomics, we should observe
a negligible control plane impact.
Jakub Kicinski [Sun, 19 Nov 2023 02:38:05 +0000 (18:38 -0800)]
net: partial revert of the "Make timestamping selectable: series
Revert following commits:
commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")
Linus Torvalds [Sat, 18 Nov 2023 23:20:58 +0000 (15:20 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Seven small fixes, six in drivers and one in sd.
The sd fix is so large because it changes a struct pointer to a struct
but otherwise is fairly simple"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: qcom-ufs: dt-bindings: Document the SM8650 UFS Controller
scsi: sd: Fix sshdr use in sd_suspend_common()
scsi: scsi_debug: Delete some bogus error checking
scsi: scsi_debug: Fix some bugs in sdebug_error_write()
scsi: ufs: core: Fix racing issue between ufshcd_mcq_abort() and ISR
scsi: ufs: core: Expand MCQ queue slot to DeviceQueueDepth + 1
scsi: qla2xxx: Fix system crash due to bad pointer access
Linus Torvalds [Sat, 18 Nov 2023 23:13:10 +0000 (15:13 -0800)]
Merge tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"On parisc we still sometimes need writeable stacks, e.g. if programs
aren't compiled with gcc-14. To avoid issues with the upcoming
systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
(for parisc only).
The other two patches are minor: a bugfix for the soft power-off on
qemu with 64-bit kernel and prefer strscpy() over strlcpy():
- Fix power soft-off on qemu
- Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
writeable stacks
- Use strscpy instead of strlcpy in show_cpuinfo()"
* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
prctl: Disable prctl(PR_SET_MDWE) on parisc
parisc/power: Fix power soft-off when running on qemu
parisc: Replace strlcpy() with strscpy()
Heiner Kallweit [Mon, 13 Nov 2023 19:13:26 +0000 (20:13 +0100)]
r8169: improve RTL8411b phy-down fixup
Mirsad proposed a patch to reduce the number of spinlock lock/unlock
operations and the function code size. This can be further improved
because the function sets a consecutive register block.
This patch set moves a big chunk of verifier log related code from gigantic
verifier.c file into more focused kernel/bpf/log.c. This is not essential to
the rest of functionality in this patch set, so I can undo it, but it felt
like it's good to start chipping away from 20K+ verifier.c whenever we can.
The main purpose of the patch set, though, is in improving verifier log
further.
Patches #3-#4 start printing out register state even if that register is
spilled into stack slot. Previously we'd get only spilled register type, but
no additional information, like SCALAR_VALUE's ranges. Super limiting during
debugging. For cases of register spills smaller than 8 bytes, we also print
out STACK_MISC/STACK_ZERO/STACK_INVALID markers. This, among other things,
will make it easier to write tests for these mixed spill/misc cases.
Patch #5 prints map name for PTR_TO_MAP_VALUE/PTR_TO_MAP_KEY/CONST_PTR_TO_MAP
registers. In big production BPF programs, it's important to map assembly to
actual map, and it's often non-trivial. Having map name helps.
Patch #6 just removes visual noise in form of ubiquitous imm=0 and off=0. They
are default values, omit them.
Patch #7 is probably the most controversial, but it reworks how verifier log
prints numbers. For small valued integers we use decimals, but for large ones
we switch to hexadecimal. From personal experience this is a much more useful
convention. We can tune what consitutes "small value", for now it's 16-bit
range.
Patch #8 prints frame number for PTR_TO_CTX registers, if that frame is
different from the "current" one. This removes ambiguity and confusion,
especially in complicated cases with multiple subprogs passing around
pointers.
v2->v3:
- adjust reg_bounds tester to parse hex form of reg state as well;
- print reg->range as unsigned (Alexei);
v1->v2:
- use verbose_snum() for range and offset in register state (Eduard);
- fixed typos and added acks from Eduard and Stanislav.
====================
Andrii Nakryiko [Sat, 18 Nov 2023 03:46:23 +0000 (19:46 -0800)]
bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
It's possible to pass a pointer to parent's stack to child subprogs. In
such case verifier state output is ambiguous not showing whether
register container a pointer to "current" stack, belonging to current
subprog (frame), or it's actually a pointer to one of parent frames.
So emit this information if frame number differs between the state which
register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.
Andrii Nakryiko [Sat, 18 Nov 2023 03:46:22 +0000 (19:46 -0800)]
bpf: smarter verifier log number printing logic
Instead of always printing numbers as either decimals (and in some
cases, like for "imm=%llx", in hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.
For large values hex form is often times way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067966 and 18446744071562067967, as one
particular example.
Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.
Adjust reg_bounds register state parsing logic to take into account this
change.
Andrii Nakryiko [Sat, 18 Nov 2023 03:46:21 +0000 (19:46 -0800)]
bpf: omit default off=0 and imm=0 in register state log
Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often a visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.
Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.
Andrii Nakryiko [Sat, 18 Nov 2023 03:46:20 +0000 (19:46 -0800)]
bpf: emit map name in register state if applicable and available
In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.
Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.
Current output:
R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)
But we'll get rid of useless off=0 and imm=0 parts in the next patch.