Lorenz Bauer [Sun, 29 Mar 2020 22:53:41 +0000 (15:53 -0700)]
selftests: bpf: Add test for sk_assign
Attach a tc direct-action classifier to lo in a fresh network
namespace, and rewrite all connection attempts to localhost:4321
to localhost:1234 (for port tests) and connections to unreachable
IPv4/IPv6 IPs to the local socket (for address tests). Includes
implementations for both TCP and UDP.
Keep in mind that both client to server and server to client traffic
passes the classifier.
Joe Stringer [Sun, 29 Mar 2020 22:53:40 +0000 (15:53 -0700)]
bpf: Don't refcount LISTEN sockets in sk_assign()
Avoid taking a reference on listen sockets by checking the socket type
in the sk_assign and in the corresponding skb_steal_sock() code in the
transport layer, and by ensuring that the prefetch free (sock_pfree)
function uses the same logic to check whether the socket is refcounted.
Joe Stringer [Sun, 29 Mar 2020 22:53:39 +0000 (15:53 -0700)]
net: Track socket refcounts in skb_steal_sock()
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
the determination of whether the socket is reference counted in the case
where it is prefetched by earlier logic such as early_demux.
Joe Stringer [Sun, 29 Mar 2020 22:53:38 +0000 (15:53 -0700)]
bpf: Add socket assign support
Add support for TPROXY via a new bpf helper, bpf_sk_assign().
This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via a local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (e.g. by CIDR):
$ ip route add local default dev lo
This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.
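A minimal tc-BPF sketch of the intended usage (tuple population from
packet headers elided; error handling simplified):

  SEC("classifier")
  int tproxy(struct __sk_buff *skb)
  {
  	struct bpf_sock_tuple tuple = {};	/* fill from packet headers */
  	struct bpf_sock *sk;
  	long err;

  	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
  			       BPF_F_CURRENT_NETNS, 0);
  	if (!sk)
  		return TC_ACT_SHOT;
  	err = bpf_sk_assign(skb, sk, 0);
  	bpf_sk_release(sk);
  	return err ? TC_ACT_SHOT : TC_ACT_OK;
  }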
Daniel Borkmann [Mon, 30 Mar 2020 20:38:54 +0000 (22:38 +0200)]
bpf, doc: Add John as official reviewer to BPF subsystem
We've added John Fastabend to our weekly BPF patch review rotation over
the last months, where he provided excellent and timely feedback on BPF
patches. Therefore, add him to the BPF core reviewer team in the MAINTAINERS
file to reflect that.
KP Singh [Mon, 30 Mar 2020 14:42:46 +0000 (16:42 +0200)]
bpf: btf: Fix arg verification in btf_ctx_access()
The bounds checking for the arguments accessed in the BPF program breaks
when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or
BPF_MODIFY_RETURN, resulting in no check being done for the default case
(programs which do not receive the return value of the attached
function in their arguments) when the index of the argument being accessed
is equal to the number of arguments (nr_args).
This was a result of a misplaced "else if" block introduced by
commit 6ba43b761c41 ("bpf: Attachment verification for
BPF_MODIFY_RETURN").
Jann Horn [Mon, 30 Mar 2020 16:03:24 +0000 (18:03 +0200)]
bpf: Simplify reg_set_min_max_inv handling
reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(),
just flipped around. While this makes sense in a cBPF verifier (where ALU
operations are not symmetric), it does not make sense for eBPF.
Replace reg_set_min_max_inv() with a helper that flips the opcode around,
then lets reg_set_min_max() do the complicated work.
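The flip itself can be expressed as a small lookup table, roughly along
these lines (a sketch of the approach, not the verbatim patch):

  /* How can we transform "a <op> b" into "b <op> a"? */
  static const u8 opcode_flip[16] = {
  	[BPF_JEQ  >> 4] = BPF_JEQ,	/* symmetric */
  	[BPF_JNE  >> 4] = BPF_JNE,	/* symmetric */
  	[BPF_JSET >> 4] = BPF_JSET,	/* symmetric */
  	[BPF_JGE  >> 4] = BPF_JLE,
  	[BPF_JGT  >> 4] = BPF_JLT,
  	[BPF_JLE  >> 4] = BPF_JGE,
  	[BPF_JLT  >> 4] = BPF_JGT,
  	[BPF_JSGE >> 4] = BPF_JSLE,
  	[BPF_JSGT >> 4] = BPF_JSLT,
  	[BPF_JSLE >> 4] = BPF_JSGE,
  	[BPF_JSLT >> 4] = BPF_JSGT,
  };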
Jann Horn [Mon, 30 Mar 2020 16:03:23 +0000 (18:03 +0200)]
bpf: Fix tnum constraints for 32-bit comparisons
The BPF verifier tried to track values based on 32-bit comparisons by
(ab)using the tnum state via 581738a681b6 ("bpf: Provide better register
bounds after jmp32 instructions"). The idea is that after a check like
this:
if ((u32)r0 > 3)
exit
We can't meaningfully constrain the arithmetic-range-based tracking, but
we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003).
However, the implementation from 581738a681b6 didn't compute the tnum
constraint based on the fixed operand, but instead derived it from the
arithmetic-range-based tracking. This means that after the following
sequence of operations:
if (r0 >= 0x1'0000'0001)
exit
if ((u32)r0 > 7)
exit
The verifier assumed that the lower half of r0 is in the range (0, 0)
and applied the tnum constraint (value=0,mask=0xffff'ffff'0000'0000), thus
causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was
incorrect. Provide a fixed implementation.
The verifier rewrote original instructions it recognized as dead code with
'goto pc-1', but reality differs from verifier simulation in that we're
actually able to trigger a hang due to hitting the 'goto pc-1' instructions.
Taking different examples to make the issue more obvious: in this example
we're probing bounds on a completely unknown scalar variable in r1:
We're first probing lower/upper bounds via jmp64, later we do a similar
check via jmp32 and examine the resulting var_off there. After fall-through
in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked
bits in the variable section.
Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000:
The lower/upper bounds haven't changed since they have high bits set in
u64 space and the jmp32 tests can only refine bounds in the low bits.
However, for the var part the expectation would have been 0x7f000007ff
or something less precise up to 0x7fffffffff. An outcome of 0x7f00000000
is not correct since it would contradict the earlier probed bounds
where we know that the result should have been in [0x200,0x400] in u32
space. Therefore, tests with such info will lead to wrong verifier
assumptions later on like falsely predicting conditional jumps to be
always taken, etc.
The issue here is that __reg_bound_offset32()'s implementation from
commit 581738a681b6 ("bpf: Provide better register bounds after jmp32
instructions") makes an incorrect range assumption:
In the above walk-through example, __reg_bound_offset32() as-is chose
a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000
and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as
well. However, in the umin:0x2000000000 and umax:0x4000000000 range above
we'd end up with an actual possible interval of [0x0,0xffffffff] for u32
space instead.
In case of the original reproducer, the situation looked as follows at
insn 5 for r0:
After the fall-through, we similarly forced the var_off result into
the wrong range [0x30303030,0x3030302f] suggesting later on that fixed
bits must only be of 0x30303020 with 0x10000001f unknowns, whereas such an
assumption can only be made when both bounds in the hi32 range match.
Originally, I was thinking to fix this by moving the reg into a temp reg and
using the proper coerce_reg_to_size() helper on the temp reg, where we can
then, based on that, define the range tnum for later intersection:
In the case of the concrete example, this gives us a more conservative unknown
section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
w1 <= 0x400 and w1 >= 0x200:
However, the above new __reg_bound_offset32() has no effect on refining the
knowledge of the register contents. Meaning, if the bounds in the hi32 range
mismatch, we'll get the identity function given the range reg spans
[0x0,0xffffffff] and we cast var_off into lo32 only to later on binary-OR
it again with the hi32.
Likewise, if the bounds in the hi32 range match, then we mask both bounds
with 0xffffffff and use the resulting umin/umax for the range to later
intersect the lo32 with it. However, the previously called __reg_bound_offset()
already did such an intersection on the full reg, and we would therefore only
repeat the same operation on the lo32 part.
Given this has no effect and the original commit had false assumptions,
this patch reverts the code entirely, which is also more straightforward
for stable trees: apparently 581738a681b6 got auto-selected by Sasha's
ML system and misclassified as a fix, so it got sucked into v5.4 where
it should never have landed. A revert is low-risk also from a user PoV
since it requires a recent kernel and llc to opt into the -mcpu=v3 BPF CPU
to generate jmp32 instructions. A proper bounds refinement would need a
significantly more complex approach which is currently being worked on, but
is not stable material [0]. Hence the revert is the best option for stable.
After the revert, the original reported program gets rejected as follows:
Daniel Borkmann [Sun, 29 Mar 2020 23:35:55 +0000 (01:35 +0200)]
Merge branch 'bpf-lsm'
KP Singh says:
====================
** Motivation
Google does analysis of rich runtime security data to detect and thwart
threats in real-time. Currently, this is done in custom kernel modules
but we would like to replace this with something that's upstream and
useful to others.
The current kernel infrastructure for providing telemetry (Audit, Perf
etc.) is disjoint from access enforcement (i.e. LSMs). Augmenting the
information provided by audit requires kernel changes to audit, its
policy language and user-space components. Furthermore, building a MAC
policy based on the newly added telemetry data requires changes to
various LSMs and their respective policy languages.
This patchset allows BPF programs to be attached to LSM hooks. This
facilitates a unified and dynamic (not requiring re-compilation of the
kernel) audit and MAC policy.
** Why an LSM?
Linux Security Modules target security behaviours rather than the
kernel's API. For example, it's easy to miss a newly added system
call for executing processes (e.g. execve, execveat, etc.), but the LSM
framework ensures that all process executions trigger the relevant hooks
irrespective of how the process was executed.
Allowing users to implement LSM hooks at runtime also benefits the LSM
eco-system by enabling a quick feedback loop from the security community
about the kind of behaviours that the LSM Framework should be targeting.
** How does it work?
The patchset introduces a new eBPF (https://docs.cilium.io/en/v1.6/bpf/)
program type BPF_PROG_TYPE_LSM which can only be attached to LSM hooks.
Loading and attachment of BPF programs requires CAP_SYS_ADMIN.
The new LSM registers nop functions (bpf_lsm_<hook_name>) as LSM hook
callbacks. Their purpose is to provide a definite point where BPF
programs can be attached as BPF_TRAMP_MODIFY_RETURN trampoline programs
for hooks that return an int, and BPF_TRAMP_FEXIT trampoline programs
for void LSM hooks.
Audit logs can be written using a format chosen by the eBPF program to
the perf events buffer or to global eBPF variables or maps and can be
further processed in user-space.
The hook arguments are exposed via BTF type information, which allows
verifiable read-only structure accesses by field names rather than fixed
offsets. This allows accessing the hook parameters using a dynamically
created context which provides a certain degree of ABI stability:
  // Only declare the structure and fields intended to be used
  // in the program
  struct vm_area_struct {
  	unsigned long vm_start;
  } __attribute__((preserve_access_index));

  // Declare the eBPF program mprotect_audit which attaches to
  // the file_mprotect LSM hook and accepts three arguments.
  SEC("lsm/file_mprotect")
  int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
  	     unsigned long reqprot, unsigned long prot, int ret)
  {
  	unsigned long vm_start = vma->vm_start;

  	return 0;
  }
By relocating field offsets, BTF makes a large portion of kernel data
structures readily accessible across kernel versions without requiring a
large corpus of BPF helper functions or recompilation with
every kernel version. The BTF type information is also used by the BPF
verifier to validate memory accesses within the BPF program and to
prevent arbitrary writes to kernel memory.
The limitations of BTF compatibility are described in BPF CO-RE
(http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf, i.e. field
renames, #defines and changes to the signature of LSM hooks). This
design imposes that the MAC policy (eBPF programs) be updated when the
inspected kernel structures change outside of BTF compatibility
guarantees. In practice, this is only required when a structure field
used by a current policy is removed (or renamed) or when the used LSM
hooks change. We expect the maintenance cost of these changes to be
acceptable as compared to the design presented in the RFC.
A simple example and some documentation is included in the patchset.
In order to better illustrate the capabilities of the framework, some
more advanced prototype (not ready for review) code has also been
published separately:
* Logging execution events (including environment variables and
arguments)
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
* Detecting deletion of running executables:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c
* Detection of writes to /proc/<pid>/mem:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
We have updated Google's internal telemetry infrastructure and have
started deploying this LSM on our Linux Workstations. This gives us more
confidence in the real-world applications of such a system.
** Changelog:
- v8 -> v9:
https://lore.kernel.org/bpf/20200327192854[email protected]/
* Fixed a selftest crash when CONFIG_LSM doesn't have "bpf".
* Added James' Ack.
* Rebase.
- v7 -> v8:
https://lore.kernel.org/bpf/20200326142823[email protected]/
* Removed CAP_MAC_ADMIN check from bpf_lsm_verify_prog. LSMs can add it
in their own bpf_prog hook. This can be revisited as a separate patch.
* Added Andrii and James' Ack/Review tags.
* Fixed an indentation issue and missing newlines in selftest error
cases.
* Updated a comment as suggested by Alexei.
* Updated the documentation to use the newer libbpf API and some other
fixes.
* Rebase
- v6 -> v7:
https://lore.kernel.org/bpf/20200325152629[email protected]/
* Removed __weak from the LSM attachment nops per Kees' suggestion.
Will send a separate patch (if needed) to update the noinline
definition in include/linux/compiler_attributes.h.
* waitpid to wait specifically for the forked child in selftests.
* Comment format fixes in security/... as suggested by Casey.
* Added Acks from Kees and Andrii and Casey's Reviewed-by: tags to
the respective patches.
* Rebase
- v5 -> v6:
https://lore.kernel.org/bpf/20200323164415[email protected]/
* Updated LSM_HOOK macro to define a default value and cleaned up the
BPF LSM hook declarations.
* Added Yonghong's Acks and Kees' Reviewed-by tags.
* Simplification of the selftest code.
* Rebase and fixes suggested by Andrii and Yonghong and some other minor
fixes noticed in internal review.
- v4 -> v5:
https://lore.kernel.org/bpf/20200220175250[email protected]/
* Removed static keys and special casing of BPF calls from the LSM
framework.
* Initialized the BPF callbacks (nops) as proper LSM hooks.
* Updated to using the newly introduced BPF_TRAMP_MODIFY_RETURN
trampolines in https://lkml.org/lkml/2020/3/4/877
* Addressed Andrii's feedback and rebased.
- v3 -> v4:
* Moved away from allocating a separate security_hook_heads and adding a
new special case for arch_prepare_bpf_trampoline to using BPF fexit
trampolines called from the right place in the LSM hook and toggled by
static keys based on the discussion in:
https://lore.kernel.org/bpf/CAG48ez25mW+_oCxgCtbiGMX07g_ph79UOJa07h=o_6B6+Q-u5g@mail.gmail.com/
* Since the code does not deal with security_hook_heads anymore, it goes
from "being a BPF LSM" to "BPF program attachment to LSM hooks".
* Added a new test case which ensures that the BPF programs' return value
is reflected by the LSM hook.
- v2 -> v3 does not change the overall design and has some minor fixes:
* LSM_ORDER_LAST is introduced to represent the behaviour of the BPF LSM
* Fixed the inadvertent clobbering of the LSM Hook error codes
* Added GPL license requirement to the commit log
* The lsm_hook_idx is now the more conventional 0-based index
* Some changes were split into a separate patch ("Load btf_vmlinux only
once per object")
https://lore.kernel.org/bpf/20200117212825[email protected]/
* Addressed Andrii's feedback on the BTF implementation
* Documentation update for using generated vmlinux.h to simplify
programs
* Rebase
- Changes since v1:
https://lore.kernel.org/bpf/20191220154208[email protected]
* Eliminate the requirement to maintain LSM hooks separately in
security/bpf/hooks.h. Use BPF trampolines to dynamically allocate
security hooks.
* Drop the use of securityfs as bpftool provides the required
introspection capabilities. Update the tests to use the bpf_skeleton
and global variables
* Use O_CLOEXEC anonymous fds to represent BPF attachment in line with
the other BPF programs with the possibility to use bpf program pinning
in the future to provide "permanent attachment".
* Drop the logic based on prog names for handling re-attachment.
* Drop bpf_lsm_event_output from this series and send it as a separate
patch.
====================
KP Singh [Sun, 29 Mar 2020 00:43:55 +0000 (01:43 +0100)]
bpf: lsm: Add selftests for BPF_PROG_TYPE_LSM
* Load/attach a BPF program that hooks to file_mprotect (int)
and bprm_committed_creds (void).
* Perform an action that triggers the hook.
* Verify that the audit event was received using the shared global
variables for the process executed.
* Verify that mprotect returns -EPERM.
KP Singh [Sun, 29 Mar 2020 00:43:54 +0000 (01:43 +0100)]
tools/libbpf: Add support for BPF_PROG_TYPE_LSM
Since BPF_PROG_TYPE_LSM uses the same attaching mechanism as
BPF_PROG_TYPE_TRACING, the common logic is refactored into a static
function bpf_program__attach_btf_id.
A new API call bpf_program__attach_lsm is still added to avoid userspace
conflicts if this ever changes in the future.
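Userspace attachment could then look roughly like this (a sketch; error
handling abbreviated):

  struct bpf_link *link;

  link = bpf_program__attach_lsm(prog);
  if (libbpf_get_error(link))
  	/* handle attach failure */;
  ...
  bpf_link__destroy(link);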
KP Singh [Sun, 29 Mar 2020 00:43:52 +0000 (01:43 +0100)]
bpf: lsm: Implement attach, detach and execution
JITed BPF programs are dynamically attached to the LSM hooks
using BPF trampolines. The trampoline prologue generates code to handle
conversion of the signature of the hook to the appropriate BPF context.
The allocated trampoline programs are attached to the nop functions
initialized as LSM hooks.
BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
need CAP_SYS_ADMIN (required for loading eBPF programs).
Upon attachment:
* A BPF fexit trampoline is used for LSM hooks with a void return type.
* A BPF fmod_ret trampoline is used for LSM hooks which return an
int. The attached programs can override the return value of the
bpf LSM hook to indicate a MAC Policy decision.
KP Singh [Sun, 29 Mar 2020 00:43:51 +0000 (01:43 +0100)]
bpf: lsm: Provide attachment points for BPF LSM programs
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
generated for each LSM hook. These functions are initialized as LSM
hooks in a subsequent patch.
KP Singh [Sun, 29 Mar 2020 00:43:50 +0000 (01:43 +0100)]
security: Refactor declaration of LSM hooks
The information about the different types of LSM hooks is scattered
across two locations, i.e. union security_list_options and
struct security_hook_heads. Rather than duplicating this information
even further for BPF_PROG_TYPE_LSM, define all the hooks with the
LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all
the data structures required by the LSM framework.
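The pattern, slightly simplified, is that each hook is declared once in
lsm_hook_defs.h and the macro is redefined at each use site to generate
the desired structure, e.g.:

  /* include/linux/lsm_hook_defs.h: one LSM_HOOK() line per hook */
  LSM_HOOK(int, 0, file_mprotect, struct vm_area_struct *vma,
  	 unsigned long reqprot, unsigned long prot)

  /* include/linux/lsm_hooks.h: generate union security_list_options */
  union security_list_options {
  	#define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
  	#include "lsm_hook_defs.h"
  	#undef LSM_HOOK
  };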
selftests: Add test for overriding global data value before load
This adds a test to exercise the new bpf_map__set_initial_value() function.
The test simply overrides the global data section with all zeroes, and
checks that the new value makes it into the kernel map on load.
libbpf: Add setter for initial value for internal maps
For internal maps (most notably the maps backing global variables), libbpf
uses an internal mmaped area to store the data after opening the object.
This data is subsequently copied into the kernel map when the object is
loaded.
This adds a function to set a new value for that data, which can be used to
change it before it is loaded into the kernel. This is especially relevant for RODATA
maps, since those are frozen on load.
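A sketch of the intended usage (the rodata map name lookup and the
assumption that the section holds a single __u32 are illustrative only):

  struct bpf_object *obj = bpf_object__open("prog.o");
  struct bpf_map *map = bpf_object__find_map_by_name(obj, "prog.rodata");
  __u32 new_val = 42;	/* assumes .rodata is exactly one __u32 */

  /* must happen after open but before load; RODATA is frozen on load */
  bpf_map__set_initial_value(map, &new_val, sizeof(new_val));
  bpf_object__load(obj);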
Daniel Borkmann [Sun, 29 Mar 2020 19:59:02 +0000 (21:59 +0200)]
bpf, net: Fix build issue when net ns not configured
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.
Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
====================
This series adds support for atomically replacing the XDP program loaded on an
interface. This is achieved by means of a new netlink attribute that can specify
the expected previous program to replace on the interface. If set, the kernel
will compare this "expected fd" attribute with the program currently loaded on
the interface, and reject the operation if it does not match.
With this primitive, userspace applications can avoid stepping on each other's
toes when simultaneously updating the loaded XDP program.
Changelog:
v4:
- Switch back to passing FD instead of ID (Andrii)
- Rename flag to XDP_FLAGS_REPLACE (for consistency with other similar uses)
v3:
- Pass existing ID instead of FD (Jakub)
- Use opts struct for new libbpf function (Andrii)
v2:
- Fix checkpatch nits and add .strict_start_type to netlink policy (Jakub)
====================
libbpf: Add function to set link XDP fd while specifying old program
This adds a new function to set the XDP fd while specifying the FD of the
program to replace, using the newly added IFLA_XDP_EXPECTED_FD netlink
parameter. The new function uses the opts struct mechanism to be extendable
in the future.
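A sketch of the new API in use (flag and field names as introduced in this
series):

  DECLARE_LIBBPF_OPTS(bpf_xdp_set_link_opts, opts,
  	.old_fd = old_prog_fd);	/* program we expect to be attached */

  err = bpf_set_link_xdp_fd_opts(ifindex, new_prog_fd,
  				 XDP_FLAGS_REPLACE, &opts);
  if (err)
  	/* e.g. another program replaced ours in the meantime */;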
xdp: Support specifying expected existing program when attaching XDP
While it is currently possible for userspace to specify that an existing
XDP program should not be replaced when attaching to an interface, there is
no mechanism to safely replace a specific XDP program with another.
This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be
set along with IFLA_XDP_FD. If set, the kernel will check that the program
currently loaded on the interface matches the expected one, and fail the
operation if it does not. This corresponds to a 'cmpxchg' memory operation.
Setting the new attribute with a negative value means that no program is
expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST
flag.
A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly
request checking of the EXPECTED_FD attribute. This is needed for userspace
to discover whether the kernel supports the new attribute.
Fletcher Dunn [Fri, 27 Mar 2020 03:24:07 +0000 (03:24 +0000)]
libbpf, xsk: Init all ring members in xsk_umem__create and xsk_socket__create
Fix a sharp edge in xsk_umem__create and xsk_socket__create. Almost all of
the members of the ring buffer structs are initialized, but the "cached_xxx"
variables are not all initialized. The caller is required to zero them.
This is needlessly dangerous. The results if you don't do it can be very bad.
For example, they can cause xsk_prod_nb_free and xsk_cons_nb_avail to return
values greater than the size of the queue. xsk_ring_cons__peek can return an
index that does not refer to an item that has been queued.
I have confirmed that without this change, my program misbehaves unless I
memset the ring buffers to zero before calling the function. Afterwards,
my program works without (or with) the memset.
====================
This adds various straightforward helper improvements and additions to BPF
cgroup based connect(), sendmsg(), recvmsg() and bind-related hooks which
allow implementing more fine-grained policies and address current load
balancer limitations we're seeing. For details please see the individual patches.
I've tested them on Kubernetes & Cilium and also added selftests for the small
verifier extension. Thanks!
====================
Daniel Borkmann [Fri, 27 Mar 2020 15:58:55 +0000 (16:58 +0100)]
bpf: Enable retrieval of pid/tgid/comm from bpf cgroup hooks
We already have the bpf_get_current_uid_gid() helper enabled, and
given we now have perf event RB output available for connect(),
sendmsg(), recvmsg() and bind-related hooks, add a trivial change
to enable bpf_get_current_pid_tgid() and bpf_get_current_comm()
as well.
Daniel Borkmann [Fri, 27 Mar 2020 15:58:54 +0000 (16:58 +0100)]
bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context, 'current' is
always valid and points to the app that is performing the mentioned
syscalls if it's subject to a v2 cgroup. Also, with the same motivation as
commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"),
enable retrieval of the ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.
Daniel Borkmann [Fri, 27 Mar 2020 15:58:53 +0000 (16:58 +0100)]
bpf: Allow to retrieve cgroup v1 classid from v2 hooks
Today, Kubernetes is still operating on cgroups v1. However, it is
possible to retrieve the task's classid based on 'current' out of
connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators
which attach to the root cgroup v2 hook in a mixed env, as in the case
of Cilium, for example, in order to then correlate certain pod traffic
and use it as part of the key for BPF map lookups.
Daniel Borkmann [Fri, 27 Mar 2020 15:58:52 +0000 (16:58 +0100)]
bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.
In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between the initial network namespace and non-initial ones. For example,
ExternalIP services mandate that non-local service IPs are not to be
translated from the host (initial) network namespace. Right now, we have an
ugly work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.
On top of determining whether we're in the initial or a non-initial network
namespace, we also need a socket-cookie like mechanism for network namespace
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.
(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
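A sketch of how both use cases look from a hook (the policy logic is
illustrative only):

  SEC("cgroup/connect4")
  int sock4_connect(struct bpf_sock_addr *ctx)
  {
  	__u64 cookie = bpf_get_netns_cookie(ctx);	/* app's netns */
  	__u64 init_cookie = bpf_get_netns_cookie(NULL);	/* initial netns */

  	if (cookie == init_cookie) {
  		/* host (initial) namespace: e.g. skip ExternalIP xlation */
  	}
  	/* cookie can also be part of a BPF LRU map key */
  	return 1;
  }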
Daniel Borkmann [Fri, 27 Mar 2020 15:58:51 +0000 (16:58 +0100)]
bpf: Enable perf event rb output for bpf cgroup progs
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks
are all lacking perf event rb output in order to push notifications
or monitoring events up to user space. Back in commit a5a3a828cd00
("bpf: add perf event notificaton support for sock_ops"), I've worked
with Sowmini to enable them for sock_ops where the context part is
not used (as opposed to skbs for example where the packet data can
be appended). Make the bpf_sockopt_event_output() helper generic and
enable it for mentioned hooks.
Daniel Borkmann [Fri, 27 Mar 2020 15:58:50 +0000 (16:58 +0100)]
bpf: Enable retrieval of socket cookie for bind/post-bind hook
We currently make heavy use of the socket cookie in BPF's connect(),
sendmsg() and recvmsg() hooks for load-balancing decisions. However,
it is currently not enabled/implemented in BPF {post-}bind hooks
where it can later be used in combination for correlation in the tc
egress path, for example.
YueHaibing [Thu, 26 Mar 2020 03:16:13 +0000 (11:16 +0800)]
bpf: Remove unused variable 'bpf_xdp_link_lops'
kernel/bpf/syscall.c:2263:34: warning: 'bpf_xdp_link_lops' defined but not used [-Wunused-const-variable=]
static const struct bpf_link_ops bpf_xdp_link_lops;
^~~~~~~~~~~~~~~~~
commit 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction")
introduced this unused variable; remove it.
Andrii Nakryiko [Wed, 25 Mar 2020 06:57:42 +0000 (23:57 -0700)]
bpf: Factor out attach_type to prog_type mapping for attach/detach
Factor out logic mapping expected program attach type to program type and
subsequent handling of program attach/detach. Also list out all supported
cgroup BPF program types explicitly to prevent accidental bugs once more
program types are added to a mapping. Do the same for prog_query API.
Andrii Nakryiko [Wed, 25 Mar 2020 06:57:41 +0000 (23:57 -0700)]
bpf: Factor out cgroup storages operations
Refactor cgroup attach/detach code to abstract away common operations
performed on all types of cgroup storages. This makes the high-level logic
more apparent, plus allows reusing more code across multiple functions.
7: (b7) r1 = 2
8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv2 R10=fp0 fp-8_w=mmmmmmmm
8: (67) r1 <<= 31
9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967296 R10=fp0 fp-8_w=mmmmmmmm
9: (74) w1 >>= 31
10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv0 R10=fp0 fp-8_w=mmmmmmmm
10: (14) w1 -= 2
11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967294 R10=fp0 fp-8_w=mmmmmmmm
11: (0f) r0 += r1
last_idx 11 first_idx 0
regs=2 stack=0 before 10: (14) w1 -= 2
regs=2 stack=0 before 9: (74) w1 >>= 31
regs=2 stack=0 before 8: (67) r1 <<= 31
regs=2 stack=0 before 7: (b7) r1 = 2
math between map_value pointer and 4294967294 is not allowed
Before this series we did not trip the "math between map_value pointer..."
error because check_reg_sane_offset is never called in
adjust_ptr_min_max_vals(). Instead we have a register state that looks
like this at line 11*,
In R1 'smin_val != smax_val' yet we have a tnum_const as seen
by 'var_off=(0xfffffffe; 0x0)' with a 0x0 mask. So we hit this check
in adjust_ptr_min_max_vals()
  if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
      smin_val > smax_val || umin_val > umax_val) {
  	/* Taint dst register if offset had invalid bounds derived from
  	 * e.g. dead branches.
  	 */
  	__mark_reg_unknown(env, dst_reg);
  	return 0;
  }
So we don't throw an error here and instead only throw an error
later in the verification when the memory access is made.
The root cause in the verifier without alu32 bounds tracking is having
'umin_value = 0' and 'umax_value = U64_MAX' from BPF_SUB, which we set
when 'umin_value < umax_val' here,
Later in adjust_scalar_min_max_vals we previously did a
coerce_reg_to_size() which will clamp the U64_MAX to U32_MAX by
truncating to 32bits. But either way without a call to update_reg_bounds
the less precise bounds tracking will fall out of the alu op
verification.
After the latest changes we now exit adjust_scalar_min_max_vals with the
more precise umin value, due to zero extension propagating bounds from
alu32 bounds into alu64 bounds and then calling update_reg_bounds.
This then causes the verifier to trigger an earlier error and we get
the error in the output above.
This patch updates tests to reflect new error message.
* I have a local patch to print entire verifier state regardless if we
believe it is a constant so we can get a full picture of the state.
Usually if tnum_is_const() then bounds are also smin=smax, etc. but
this is not always true and is a bit subtle. Being able to see these
states helps understand dataflow imo. Let me know if we want something
similar upstream.
John Fastabend [Tue, 24 Mar 2020 17:38:37 +0000 (10:38 -0700)]
bpf: Verifier, adjust_scalar_min_max_vals to always call update_reg_bounds()
Currently, for all op verification we call __reg_deduce_bounds() and
__reg_bound_offset() but we only call __update_reg_bounds() in bitwise
ops. However, we could benefit from calling __update_reg_bounds() in
BPF_ADD, BPF_SUB, and BPF_MUL cases as well.
For example, a register with state 'R1_w=invP0' when we subtract from
it,
w1 -= 2
Before coerce we will now have an smin_value=S64_MIN, smax_value=U64_MAX
and unsigned bounds umin_value=0, umax_value=U64_MAX. These will then
be clamped to S32_MIN, U32_MAX values by coerce in the case of an alu32 op
as done in the above example. However, the tnum will be a constant because
the ALU op is done on a constant.
Without update_reg_bounds() we have a scenario where tnum is a const
but our unsigned bounds do not reflect this. By calling update_reg_bounds
after coerce to 32bit we further refine the umin_value to U64_MAX in the
alu64 case or U32_MAX in the alu32 case above.
John Fastabend [Tue, 24 Mar 2020 17:38:15 +0000 (10:38 -0700)]
bpf: Verifier, refactor adjust_scalar_min_max_vals
Pull per-op ALU logic into individual functions. We are about to add
u32 versions of each of these; by pulling them out, the code gets a bit
more readable here and nicer in the next patch.
libbpf: Don't allocate 16M for log buffer by default
For each prog/btf load we allocate and free 16 megs of verifier buffer.
On production systems this doesn't really make sense because the
programs/btf have gone through extensive testing and are (mostly)
guaranteed to load successfully.
Let's assume successful case by default and skip buffer allocation
on the first try. If there is an error, start with BPF_LOG_BUF_SIZE
and double it on each ENOSPC iteration.
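The retry strategy boils down to something like this sketch (simplified;
not the exact libbpf code):

  size_t log_size = 0;
  char *log_buf = NULL;
  int fd;

  for (;;) {
  	fd = bpf_load_program_xattr(&load_attr, log_buf, log_size);
  	if (fd >= 0 || (log_buf && errno != ENOSPC))
  		break;
  	/* first failure: allocate BPF_LOG_BUF_SIZE; then keep doubling */
  	log_size = log_size ? log_size * 2 : BPF_LOG_BUF_SIZE;
  	free(log_buf);
  	log_buf = malloc(log_size);
  	if (!log_buf)
  		break;
  }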
Andrey Ignatov [Tue, 24 Mar 2020 18:51:35 +0000 (11:51 -0700)]
bpf: Document bpf_inspect drgn tool
It's a follow-up for discussion in [1].
drgn tool bpf_inspect.py was merged to drgn repo in [2]. Document it
in kernel tree to make BPF developers aware that the tool exists and
can help with getting BPF state unavailable via UAPI.
For now it's just one tool, but the doc is written in a way that allows
covering more tools in the future if needed.
Please refer to the doc itself for more details.
The patch was tested by `make htmldocs` and sanity-checking that
resulting html looks good.
v2 -> v3:
- two sections: "Description" and "Getting started" (Daniel);
- add examples in "Getting started" section (Daniel);
- add "Customization" section to show how tool can be customized.
Daniel T. Lee [Sat, 21 Mar 2020 10:04:24 +0000 (19:04 +0900)]
samples, bpf: Refactor perf_event user program with libbpf bpf_link
The bpf_program__attach of libbpf(using bpf_link) is much more intuitive
than the previous method using ioctl.
bpf_program__attach_perf_event manages enabling the perf_event and
attaching BPF programs to it, so there's no need to do this
directly with ioctl.
In addition, bpf_link provides consistency in the use of API because it
allows disable (detach, destroy) for multiple events to be treated as
one bpf_link__destroy. Also, bpf_link__destroy manages the close() of
perf_event fd.
This commit refactors samples that attach the bpf program to perf_event
by using libbpf instead of ioctl. Also the bpf_load in the samples was
removed and migrated to use the libbpf API.
Daniel T. Lee [Sat, 21 Mar 2020 10:04:23 +0000 (19:04 +0900)]
samples, bpf: Move read_trace_pipe to trace_helpers
To reduce the reliance of trace samples (trace*_user) on bpf_load,
move read_trace_pipe to trace_helpers. By moving this bpf_load helper
elsewhere, trace functions can be easily migrated to libbpf.
Martin KaFai Lau [Fri, 20 Mar 2020 15:21:07 +0000 (08:21 -0700)]
bpf: Add tests for bpf_sk_storage to bpf_tcp_ca
This patch adds test to exercise the bpf_sk_storage_get()
and bpf_sk_storage_delete() helper from the bpf_dctcp.c.
The setup and check on the sk_storage is done immediately
before and after the connect().
This patch also takes this chance to move the pthread_create()
after the connect() has been done. That will remove the need of
the "wait_thread" label.
Martin KaFai Lau [Fri, 20 Mar 2020 15:21:01 +0000 (08:21 -0700)]
bpf: Add bpf_sk_storage support to bpf_tcp_ca
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete()
helper to the bpf_tcp_ca's struct_ops. That would allow
bpf-tcp-cc to:
1) share sk private data with other bpf progs.
2) use bpf_sk_storage as a private storage for a bpf-tcp-cc
if the existing icsk_ca_priv is not big enough.
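A bpf-tcp-cc could then keep private per-socket state roughly like this
(a sketch; map name and stored value are made up):

  struct {
  	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
  	__uint(map_flags, BPF_F_NO_PREALLOC);
  	__type(key, int);
  	__type(value, __u64);
  } sk_stg_map SEC(".maps");

  SEC("struct_ops/dctcp_init")
  void BPF_PROG(dctcp_init, struct sock *sk)
  {
  	__u64 *stg;

  	stg = bpf_sk_storage_get(&sk_stg_map, sk, NULL,
  				 BPF_SK_STORAGE_GET_F_CREATE);
  	if (stg)
  		*stg = 1;	/* shareable with other bpf progs via the map */
  }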
YueHaibing [Fri, 20 Mar 2020 02:34:26 +0000 (10:34 +0800)]
bpf, tcp: Make tcp_bpf_recvmsg static
After commit f747632b608f ("bpf: sockmap: Move generic sockmap
hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of
tcp_bpf.c, so make it static and remove it from tcp.h. Also move
it to BPF_STREAM_PARSER #ifdef to fix unused function warnings.
YueHaibing [Fri, 20 Mar 2020 02:34:25 +0000 (10:34 +0800)]
bpf, tcp: Fix unused function warnings
If BPF_STREAM_PARSER is not set, gcc warns:
net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function]
net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function]
net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function]
Move the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER block.
Martin KaFai Lau [Wed, 18 Mar 2020 17:16:56 +0000 (10:16 -0700)]
bpftool: Add struct_ops support
This patch adds struct_ops support to the bpftool.
To recap a bit on the recent bpf_struct_ops feature on the kernel side:
It currently supports "struct tcp_congestion_ops" to be implemented
in bpf. At a high level, bpf_struct_ops is a struct_ops map populated
with a number of bpf progs. bpf_struct_ops currently supports the
"struct tcp_congestion_ops". However, the bpf_struct_ops design is
generic enough that other kernel struct ops can be supported in
the future.
Although struct_ops is map+progs at a high level, there are differences
in details. For example,
1) After registering a struct_ops, the struct_ops is held by the kernel
subsystem (e.g. tcp-cc). Thus, there is no need to pin a
struct_ops map or its progs in order to keep them around.
2) To iterate all struct_ops in a system, it iterates all maps
of type BPF_MAP_TYPE_STRUCT_OPS. BPF_MAP_TYPE_STRUCT_OPS is
currently the usual filter. In the future, it may need to
filter by other struct_ops specific properties, e.g. filter by
tcp_congestion_ops or other kernel subsystem ops.
3) struct_ops requires the running kernel to have BTF info. That allows
more flexibility in handling other kernel structs. e.g. it can
always dump the latest bpf_map_info.
4) Also, "struct_ops" command is not intended to repeat all features
already provided by "map" or "prog". For example, if there really
is a need to pin the struct_ops map, the user can use the "map" cmd
to do that.
While the first attempt was to reuse parts from map/prog.c, it turned out
there was not a lot to share. The only obvious item is map_parse_fds(), but
that still requires modifications to accommodate struct_ops map specific
filtering (for the immediate and the future needs). Together with the
earlier mentioned differences, it is better to part ways with map/prog.c.
The initial set of subcmds is: register, unregister, show, and dump.
For register, it registers all struct_ops maps that can be found in an
obj file. An option can be added in the future to specify a particular
struct_ops map. Also, the common bpf_tcp_cc is stateless (e.g.
bpf_cubic.c and bpf_dctcp.c). The "reuse map" feature is not
implemented in this patch and can be considered later as well.
For other subcmds, please see the man doc for details.
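For example, typical usage could look like this (object and map names are
illustrative):

  # bpftool struct_ops register bpf_cubic.o
  # bpftool struct_ops show
  # bpftool struct_ops dump name cubic
  # bpftool struct_ops unregister name cubic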
Martin KaFai Lau [Wed, 18 Mar 2020 17:16:50 +0000 (10:16 -0700)]
bpftool: Translate prog_id to its bpf prog_name
The kernel struct_ops obj has kernel's func ptrs implemented by bpf_progs.
The bpf prog_id is stored as the value of the func ptr for introspection
purpose. In a later patch, a struct_ops dump subcmd will be added
to introspect these func ptrs. It is desired to print the actual bpf
prog_name instead of only printing the prog_id.
Since struct_ops is the only usecase storing prog_id in the func ptr,
this patch adds a prog_id_as_func_ptr bool (default is false) to
"struct btf_dumper" in order not to mis-interpret the ptr value
for the other existing use-cases.
While printing a func_ptr as a bpf prog_name,
this patch also prefixes the bpf prog_name with the ptr's func_proto.
[ Note that it is the ptr's func_proto instead of the bpf prog's
func_proto ]
It reuses the current btf_dump_func() to obtain the ptr's func_proto
string.
Here is an example from the bpf_cubic.c:
"void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140"
Martin KaFai Lau [Wed, 18 Mar 2020 17:16:43 +0000 (10:16 -0700)]
bpftool: Print as a string for char array
A char[] is currently printed as an integer array.
This patch will print it as a string when:
1) The array element type is a one-byte int,
2) The array element type has a BTF_INT_CHAR encoding or
the array element type's name is "char", and
3) All characters are between (0x1f, 0x7f) and the array is terminated
by a null character.
Martin KaFai Lau [Wed, 18 Mar 2020 17:16:37 +0000 (10:16 -0700)]
bpftool: Print the enum's name instead of value
This patch prints the enum's name if there is one found in
the array of btf_enum.
The commit 9eea98497951 ("bpf: fix BTF verification of enums")
has details about an enum could have any power-of-2 size (up to 8 bytes).
This patch also takes this chance to accommodate these non-4-byte
enums.
Fangrui Song [Wed, 18 Mar 2020 22:27:46 +0000 (15:27 -0700)]
bpf: Support llvm-objcopy for vmlinux BTF
Simplify gen_btf logic to make it work with llvm-objcopy. The existing
'file format' and 'architecture' parsing logic is brittle and does not
work with llvm-objcopy/llvm-objdump.
'file format' output of llvm-objdump>=11 will match GNU objdump, but
'architecture' (bfdarch) may not.
.BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag
because it is part of vmlinux image used for introspection. C code
can reference the section via linker script defined __start_BTF and
__stop_BTF. This fixes a small problem where the previous .BTF had the
SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data).
Additionally, `objcopy -I binary` synthesized symbols
_binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not
used elsewhere) are replaced with more commonplace __start_BTF and
__stop_BTF.
Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns
"empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?"
We use a dd command to change the e_type field in the ELF header from
ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting
ET_EXEC as an input file is an extremely rare GNU ld feature that lld
does not intend to support, because this is error-prone.
The output section description .BTF in include/asm-generic/vmlinux.lds.h
avoids potential subtle orphan section placement issues and suppresses
--orphan-handling=warn warnings.
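For reference, kernel C code can then find the image's BTF data through
the linker-provided symbols, e.g. (a sketch):

  extern char __start_BTF[], __stop_BTF[];

  /* the .BTF section contents, delimited by the linker script */
  static size_t btf_vmlinux_size(void)
  {
  	return __stop_BTF - __start_BTF;
  }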
Andrii Nakryiko [Sat, 14 Mar 2020 01:39:32 +0000 (18:39 -0700)]
selftests/bpf: Reset process and thread affinity after each test/sub-test
Some tests and sub-tests are setting "custom" thread/process affinity and
don't reset it back. Instead of requiring each test to undo all this, ensure
that thread affinity is restored by test_progs test runner itself.
Andrii Nakryiko [Sat, 14 Mar 2020 01:39:31 +0000 (18:39 -0700)]
selftests/bpf: Fix test_progs's parsing of test numbers
When specifying disjoint set of tests, test_progs doesn't set skipped test's
array elements to false. This leads to spurious execution of tests that should
have been skipped. Fix it by explicitly initializing them to false.
Andrii Nakryiko [Sat, 14 Mar 2020 01:39:30 +0000 (18:39 -0700)]
selftests/bpf: Fix race in tcp_rtt test
Previous attempt to make tcp_rtt more robust introduced a new race, in which
server_done might be set to true before server can actually accept any
connection. Fix this by unconditionally waiting for accept(). Given the socket
is non-blocking, if there are any problems with the client side, it should
eventually close the listening FD and let the server thread exit with failure.
Andrii Nakryiko [Sat, 14 Mar 2020 00:27:43 +0000 (17:27 -0700)]
selftests/bpf: Fix nanosleep for real this time
Amazingly, some libc implementations don't call __NR_nanosleep syscall from
their nanosleep() APIs. Hammer it down with explicit syscall() call and never
get back to it again. Also simplify code for timespec initialization.
I verified that nanosleep is called with printk and in exactly the same Linux
image that is used in Travis CI. So it should both sleep and call the correct
syscall.
v1->v2:
- math is too hard, fix usec -> nsec conversion (Martin);
- test_vmlinux has an explicit nanosleep() call, convert that one as well.
XDP-redirect is broken in the sfc driver. XDP_REDIRECT requires
tailroom for skb_shared_info when creating an SKB based on the
redirected xdp_frame (both in cpumap and veth).
The fix requires some initial explaining. The driver uses RX page-split
when possible. It reserves the top 64 bytes in the RX-page for storing
dma_addr (struct efx_rx_page_state). It also has the XDP recommended
headroom of XDP_PACKET_HEADROOM (256 bytes). As it doesn't reserve any
tailroom, it can still fit two standard MTU (1500) frames into one page.
The sizeof(struct skb_shared_info) is 320 bytes. Thus drivers like ixgbe
and i40e, reduce their XDP headroom to 192 bytes, which allows them to
fit two frames with max 1536 bytes into a 4K page (192+1536+320=2048).
The fix is to reduce this driver's headroom to 128 bytes and add the 320
bytes of tailroom. This accounts for the reserved top 64 bytes in the page,
and still fits two frames in a page for normal MTUs.
We must never go below 128 bytes of headroom for XDP, as one cacheline
is for xdp_frame area and next cacheline is reserved for metadata area.
Fixes: eb9a36be7f3e ("sfc: perform XDP processing on received packets")
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Alex Elder [Mon, 16 Mar 2020 22:51:21 +0000 (17:51 -0500)]
remoteproc: clean up notification config
Rearrange the config files for remoteproc and IPA to fix their
interdependencies.
First, have CONFIG_QCOM_Q6V5_MSS select QCOM_Q6V5_IPA_NOTIFY so the
notification code is built regardless of whether IPA needs it.
Next, represent QCOM_IPA as being dependent on QCOM_Q6V5_MSS rather
than setting its value to match QCOM_Q6V5_COMMON (which is selected
by QCOM_Q6V5_MSS).
Drop all dependencies from QCOM_Q6V5_IPA_NOTIFY. The notification
code will be built whenever QCOM_Q6V5_MSS is set, and it has no other
dependencies.
Zheng Zengkai [Mon, 16 Mar 2020 13:05:24 +0000 (21:05 +0800)]
qede: remove some unused code in function qede_selftest_receive_traffic
Remove set but not used variables 'sw_comp_cons' and 'hw_comp_cons'
to fix gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:20:
warning: variable sw_comp_cons set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic:
drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:6:
warning: variable hw_comp_cons set but not used [-Wunused-but-set-variable]
After removing 'hw_comp_cons', the memory barrier 'rmb()' and its comments
become useless, so remove them as well.
Jiri Pirko [Mon, 16 Mar 2020 08:03:25 +0000 (09:03 +0100)]
net: sched: set the hw_stats_type in pedit loop
For a single pedit action, multiple offload entries may be used. Set the
hw_stats_type to all of them.
Fixes: 44f865801741 ("sched: act: allow user to specify type of HW stats for a filter")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
====================
net: stmmac: Use readl_poll_timeout() to simplify the code
This patch set just replaces the open-coded loop with the
readl_poll_timeout() helper macro to simplify the code in the
stmmac driver.
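The conversion pattern is the usual one (a generic sketch; the register
and bit names are placeholders, not the actual stmmac hunk):

  /* before: open-coded polling loop */
  while (limit--) {
  	if (!(readl(ioaddr + REG) & BUSY_BIT))
  		return 0;
  	mdelay(1);
  }
  return -EBUSY;

  /* after: helper from <linux/iopoll.h>; returns 0 or -ETIMEDOUT */
  u32 val;

  return readl_poll_timeout(ioaddr + REG, val, !(val & BUSY_BIT),
  			  1000, 100000);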
v2 -> v3:
- return whatever error code by readl_poll_timeout() returned.
v1 -> v2:
- no changes. I am a newbie and sent this patch a month
ago (February 6th). So far, I have not received any comments or
suggestions. I think it may be lost somewhere in the world, so
I am resending it.
====================
YueHaibing [Sat, 14 Mar 2020 10:51:20 +0000 (18:51 +0800)]
chcr: remove set but not used variable 'status'
drivers/crypto/chelsio/chcr_ktls.c: In function chcr_ktls_cpl_set_tcb_rpl:
drivers/crypto/chelsio/chcr_ktls.c:662:11: warning:
variable status set but not used [-Wunused-but-set-variable]
commit 8a30923e1598 ("cxgb4/chcr: Save tx keys and handle HW response")
introduced this unused variable; remove it.
Era Mayflower [Mon, 9 Mar 2020 19:47:02 +0000 (19:47 +0000)]
macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)
Netlink support of extended packet number cipher suites,
allows adding and updating XPN macsec interfaces.
Added support in:
* Creating interfaces with GCM-AES-XPN-128 and GCM-AES-XPN-256 suites.
* Setting and getting 64-bit packet numbers of SAs.
* Setting (only on SA creation) and getting the SSCI of SAs.
* Setting the salt when installing a SAK.
Added 2 cipher suite identifiers according to 802.1AE-2018 table 14-1:
* MACSEC_CIPHER_ID_GCM_AES_XPN_128
* MACSEC_CIPHER_ID_GCM_AES_XPN_256
In addition, added 2 new netlink attribute types:
* MACSEC_SA_ATTR_SSCI
* MACSEC_SA_ATTR_SALT
Depends on: macsec: Support XPN frame handling - IEEE 802.1AEbw.
Era Mayflower [Mon, 9 Mar 2020 19:47:01 +0000 (19:47 +0000)]
macsec: Support XPN frame handling - IEEE 802.1AEbw
Support extended packet number cipher suites (802.1AEbw) frames handling.
This does not include the needed netlink patches.
* Added xpn boolean field to `struct macsec_secy`.
* Added ssci field to `struct macsec_tx_sa` (802.1AE figure 10-5).
* Added ssci field to `struct macsec_rx_sa` (802.1AE figure 10-5).
* Added salt field to `struct macsec_key` (802.1AE 10.7 NOTE 1).
* Created pn_t type for easy access to lower and upper halves.
* Created salt_t type for easy access to the "ssci" and "pn" parts.
* Created `macsec_fill_iv_xpn` function to create IV in XPN mode.
* Support in PN recovery and preliminary replay check in XPN mode.
In addition, according to IEEE 802.1AEbw figure 10-5, the PN of an incoming
frame can be 0 when an XPN cipher suite is used, so the function
`macsec_validate_skb` was fixed to fail on PN=0 only if XPN is off.
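For orientation, the XPN IV construction is essentially the following
(a sketch; the struct layout and names are assumptions, not the exact
driver code):

  /* 96-bit GCM-AES-XPN IV = (32-bit SSCI || 64-bit PN) XOR salt */
  struct gcm_iv_xpn {
  	__be32 ssci;
  	__be64 pn;
  } __packed;

  static void fill_iv_xpn(unsigned char *iv, __be32 ssci, u64 pn,
  			const struct gcm_iv_xpn *salt)
  {
  	struct gcm_iv_xpn *gcm_iv = (struct gcm_iv_xpn *)iv;

  	gcm_iv->ssci = ssci ^ salt->ssci;
  	gcm_iv->pn = cpu_to_be64(pn) ^ salt->pn;
  }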
Andrew Lunn mentioned that the Serdes PCS found in Marvell DSA switches
does not automatically update the switch MACs with the link parameters.
Currently, the DSA code implements a work-around for this.
This series improves the Serdes integration, making use of the recent
phylink changes to support split MAC/PCS setups. One noticeable
improvement for userspace is that ethtool can now report the link
partner's advertisement.
This repost has no changes compared to the previous posting; however,
the regression Andrew had found which exists even without this patch
set has now been fixed by Andrew and merged into the net-next tree.
====================
The port_link_state method is only used by mv88e6xxx_port_setup_mac(),
which is now only called during port setup, rather than also being
called via phylink's mac_config method.
Remove this now unnecessary optimisation, which allows us to remove the
port_link_state methods as well.
Russell King [Sat, 14 Mar 2020 10:15:53 +0000 (10:15 +0000)]
net: dsa: mv88e6xxx: combine port_set_speed and port_set_duplex
Setting the speed independently of duplex makes little sense; the two
parameters result from negotiation or fixed setup, and may have
interdependencies. Moreover, they are always controlled via the same
register - having them split means we have to read-modify-write this
register twice.
Combine the two operations into a single port_set_speed_duplex()
operation. Not only is this more efficient, it reduces the size of the
code as well.
Russell King [Sat, 14 Mar 2020 10:15:48 +0000 (10:15 +0000)]
net: dsa: mv88e6xxx: fix Serdes link changes
phylink_mac_change() is supposed to be called with a 'false' argument
if the link has gone down since it was last reported up; this is to
ensure that link events along with renegotiation events are always
correctly reported to userspace.
Read the BMSR once when we have an interrupt, and report the link
latched status to phylink via phylink_mac_change(). phylink will deal
automatically with re-reading the link state once it has processed the
link-down event.
Russell King [Sat, 14 Mar 2020 10:15:43 +0000 (10:15 +0000)]
net: dsa: mv88e6xxx: extend phylink to Serdes PHYs
Extend the mv88e6xxx phylink implementation down to Serdes PHYs, which
handle the PCS layer of such links.
- Implement phylink PCS link state reading, so that we can provide
ethtool with the linkmodes and link speed in the expected manner.
Note: this will only be called for in-band negotiation, which is
only supported by the serdes interfaces.
- Implement phylink PCS configuration, so that the in-band AN and
advertisement can be configured.
- Implement phylink PCS negotiation restart, so that the in-band AN
can be restarted.
- Implement phylink PCS link up, so that when operating out-of-band,
the Serdes can be configured for the appropriate fixed speed mode.
Russell King [Sat, 14 Mar 2020 10:15:33 +0000 (10:15 +0000)]
net: dsa: mv88e6xxx: use BMCR definitions for serdes control register
The SGMII/1000base-X serdes register set is a clause 22 register set
offset at 0x2000 in the PHYXS device. Rather than inventing our own
definitions, use those that already exist, and name the register
MV88E6390_SGMII_BMCR. Also remove the unused MV88E6390_SGMII_STATUS
definitions.
David S. Miller [Mon, 16 Mar 2020 00:10:14 +0000 (17:10 -0700)]
Merge branch 'net-mii-clause-37-helpers'
Russell King says:
====================
net: mii clause 37 helpers
This is a re-post of two patches that are common to two series that
I've sent in recent weeks; I'm re-posting them separately in the hope
that they can be merged. No changes from either of the previous
postings.
These patches:
1. convert the existing (unused) mii_lpa_to_ethtool_lpa_x() function
to a linkmode variant.
2. add a helper for clause 37 advertisements, supporting both the
1000baseX and defacto 2500baseX variants. Note that ethtool does
not support half duplex for either of these, and we make no effort
to do so.
====================
Russell King [Sat, 14 Mar 2020 10:09:53 +0000 (10:09 +0000)]
net: mii: convert mii_lpa_to_ethtool_lpa_x() to linkmode variant
Add a LPA to linkmode decoder for 1000BASE-X protocols; this decoder
only provides the modify semantics similar to other such decoders.
This replaces the unused mii_lpa_to_ethtool_lpa_x() helper.
The ndp32->wLength is two bytes long, so replace cpu_to_le32 with cpu_to_le16.
Fixes: 0fa81b304a79 ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
Signed-off-by: Alexander Bersenev <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Currently we allocate the MPTCP master socket at accept time.
The above makes mptcp_accept() quite complex, and requires checks in several
places for a NULL MPTCP master socket.
This series simplifies the MPTCP accept implementation, moving the master socket
allocation to syn-ack time, so that we can drop unneeded checks in the follow-up
patch.
Paolo Abeni [Fri, 13 Mar 2020 15:52:41 +0000 (16:52 +0100)]
mptcp: create msk early
This change moves the mptcp socket allocation from mptcp_accept() to
subflow_syn_recv_sock(), so that subflow->conn is now always set
for the non fallback scenario.
It allows cleaning up mptcp_accept() a bit, reducing the additional
locking, and will allow further cleanup in the next patch.
Vladimir Oltean [Fri, 13 Mar 2020 13:46:51 +0000 (15:46 +0200)]
net: mscc: ocelot: adjust maxlen on NPI port, not CPU
Being a non-physical port, the CPU port does not have an ocelot_port
structure, so the ocelot_port_writel call inside the
ocelot_port_set_maxlen() function would access data behind a NULL
pointer.
This is a patch for net-next only, the net tree boots fine, the bug was
introduced during the net -> net-next merge.
Fixes: 1d3435793123 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Fixes: a8015ded89ad ("net: mscc: ocelot: properly account for VLAN header length when setting MRU")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Hoang Le [Fri, 13 Mar 2020 03:18:03 +0000 (10:18 +0700)]
tipc: add NULL pointer check to prevent kernel oops
Calling:
tipc_node_link_down()->
- tipc_node_write_unlock()->tipc_mon_peer_down()
- tipc_mon_peer_down()
just after disabling a bearer can cause a kernel oops.
Fix this by adding a sanity check to make sure the memory
access is valid.
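A minimal sketch of such a check (assuming the monitor lookup can return
NULL once the bearer is disabled; not the exact patch):

  void tipc_mon_peer_down(struct net *net, u32 addr, int bearer_id)
  {
  	struct tipc_monitor *mon = tipc_monitor(net, bearer_id);

  	if (!mon) {
  		pr_warn("Mon: unknown bearer %d\n", bearer_id);
  		return;
  	}
  	/* ... proceed with a valid monitor reference ... */
  }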
====================
ethtool: consolidate irq coalescing - part 5
Convert more drivers following the groundwork laid in a recent
patch set [1] and continued in [2], [3], [4]. The aim of the effort
is to consolidate irq coalescing parameter validation in the core.
This set converts a further 15 drivers in drivers/net/ethernet.
One more conversion set is to come.