Lorenz Bauer [Mon, 29 Jun 2020 09:56:26 +0000 (10:56 +0100)]
bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACH
Using BPF_PROG_DETACH on a flow dissector program supports neither
attach_flags nor attach_bpf_fd. Yet no value is enforced for them.
Enforce that attach_flags are zero, and require the current program
to be passed via attach_bpf_fd. This allows us to remove the check
for CAP_SYS_ADMIN, since userspace can now no longer remove
arbitrary flow dissector programs.
Lorenz Bauer [Mon, 29 Jun 2020 09:56:25 +0000 (10:56 +0100)]
bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACH
Using BPF_PROG_ATTACH on a flow dissector program supports neither
target_fd, attach_flags or replace_bpf_fd but accepts any value.
Enforce that all of them are zero. This is fine for replace_bpf_fd
since its presence is indicated by BPF_F_REPLACE. It's more
problematic for target_fd, since zero is a valid fd. Should we
want to use the flag later on we'd have to add an exception for
fd 0. The alternative is to force a value like -1. This requires
more changes to tests. There is also precedent for using 0,
since bpf_iter uses this for target_fd as well.
====================
This patch set prepares ground for link-based multi-prog attachment for
future netns attach types, with BPF_SK_LOOKUP attach type in mind [0].
Two changes are needed in order to attach and run a series of BPF programs:
1) an bpf_prog_array of programs to run (patch #2), and
2) a list of attached links to keep track of attachments (patch #3).
Nothing changes for BPF flow_dissector. Just as before only one program can
be attached to netns.
In v3 I've simplified patch #2 that introduces bpf_prog_array to take
advantage of the fact that it will hold at most one program for now.
In particular, I'm no longer using bpf_prog_array_copy. It turned out to be
less suitable for link operations than I thought as it fails to append the
same BPF program.
bpf_prog_array_replace_item is also gone, because we know we always want to
replace the first element in prog_array.
Naturally the code that handles bpf_prog_array will need change once
more when there is a program type that allows multi-prog attachment. But I
feel it will be better to do it gradually and present it together with
tests that actually exercise multi-prog code paths.
v2 -> v3:
- Don't check if run_array is null in link update callback. (Martin)
- Allow updating the link with the same BPF program. (Andrii)
- Add patch #4 with a test for the above case.
- Kill bpf_prog_array_replace_item. Access the run_array directly.
- Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...).
- Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer.
- Drop Andrii's Ack from patch #2. Code changed.
v1 -> v2:
- Show with a (void) cast that bpf_prog_array_replace_item() return value
is ignored on purpose. (Andrii)
- Explain why bpf-cgroup cannot replace programs in bpf_prog_array based
on bpf_prog pointer comparison in patch #2 description. (Andrii)
====================
Jakub Sitnicki [Thu, 25 Jun 2020 14:13:57 +0000 (16:13 +0200)]
selftests/bpf: Test updating flow_dissector link with same program
This case, while not particularly useful, is worth covering because we
expect the operation to succeed as opposed when re-attaching the same
program directly with PROG_ATTACH.
While at it, update the tests summary that fell out of sync when tests
extended to cover links.
Jakub Sitnicki [Thu, 25 Jun 2020 14:13:56 +0000 (16:13 +0200)]
bpf, netns: Keep a list of attached bpf_link's
To support multi-prog link-based attachments for new netns attach types, we
need to keep track of more than one bpf_link per attach type. Hence,
convert net->bpf.links into a list, that currently can be either empty or
have just one item.
Instead of reusing bpf_prog_list from bpf-cgroup, we link together
bpf_netns_link's themselves. This makes list management simpler as we don't
have to allocate, initialize, and later release list elements. We can do
this because multi-prog attachment will be available only for bpf_link, and
we don't need to build a list of programs attached directly and indirectly
via links.
Jakub Sitnicki [Thu, 25 Jun 2020 14:13:55 +0000 (16:13 +0200)]
bpf, netns: Keep attached programs in bpf_prog_array
Prepare for having multi-prog attachments for new netns attach types by
storing programs to run in a bpf_prog_array, which is well suited for
iterating over programs and running them in sequence.
After this change bpf(PROG_QUERY) may block to allocate memory in
bpf_prog_array_copy_to_user() for collected program IDs. This forces a
change in how we protect access to the attached program in the query
callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from
an RCU read lock to holding a mutex that serializes updaters.
Because we allow only one BPF flow_dissector program to be attached to
netns at all times, the bpf_prog_array pointed by net->bpf.run_array is
always either detached (null) or one element long.
Jakub Sitnicki [Thu, 25 Jun 2020 14:13:54 +0000 (16:13 +0200)]
flow_dissector: Pull BPF program assignment up to bpf-netns
Prepare for using bpf_prog_array to store attached programs by moving out
code that updates the attached program out of flow dissector.
Managing bpf_prog_array is more involved than updating a single bpf_prog
pointer. This will let us do it all from one place, bpf/net_namespace.c, in
the subsequent patch.
Andrii Nakryiko [Tue, 30 Jun 2020 06:15:00 +0000 (23:15 -0700)]
bpf: Enforce BPF ringbuf size to be the power of 2
BPF ringbuf assumes the size to be a multiple of page size and the power of
2 value. The latter is important to avoid division while calculating position
inside the ring buffer and using (N-1) mask instead. This patch fixes omission
to enforce power-of-2 size rule.
====================
Fix a splat introduced by recent changes to avoid skipping ingress policy
when kTLS is enabled. The RCU splat was introduced because in the non-TLS
case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS
case we have a reference to the psock and the caller did not wrap its
call in rcu_read_lock/unlock.
To fix extend the RCU section to include the redirect case which was
missed. From v1->v2 I changed the location a bit to simplify the code
some. See patch 1.
But, then Martin asked why it was not needed in the non-TLS case. The
answer for patch 1 was, as stated above, because the caller has the
rcu read lock. However, there was still a missing case where a BPF
user could in-theory line up a set of parameters to hit a case
where the code was entered from strparser side from a different context
then the initial caller. To hit this user would need a parser program
to return value greater than skb->len then an ENOMEM error could happen
in the strparser codepath triggering strparser to retry from a workqueue
and without rcu_read_lock original caller used. See patch 2 for details.
Finally, we don't actually have any selftests for parser returning a
value geater than skb->len so add one in patch 3. This is especially
needed because at least I don't have any code that uses the parser
to return value greater than skb->len. So I wouldn't have caught any
errors here in my own testing.
Thanks, John
v1->v2: simplify code in patch 1 some and add patches 2 and 3.
====================
John Fastabend [Thu, 25 Jun 2020 23:13:18 +0000 (16:13 -0700)]
bpf, sockmap: RCU dereferenced psock may be used outside RCU block
If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.
But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.
There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.
This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.
John Fastabend [Thu, 25 Jun 2020 23:12:59 +0000 (16:12 -0700)]
bpf, sockmap: RCU splat with redirect and strparser error or TLS
There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.
However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,
If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)
To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.
[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191] dump_stack+0x8f/0xd0
[ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]
v2: Improve commit message to identify non-TLS redirect plus error case
condition as well as more common TLS case. In the process I decided
doing the rcu_read_unlock followed by the lock/unlock inside branches
was unnecessarily complex. We can just extend the current rcu block
and get the same effeective without the shuffling and branching.
Thanks Martin!
libbpf: Adjust SEC short cut for expected attach type BPF_XDP_DEVMAP
Adjust the SEC("xdp_devmap/") prog type prefix to contain a
slash "/" for expected attach type BPF_XDP_DEVMAP. This is consistent
with other prog types like tracing.
John Fastabend [Wed, 24 Jun 2020 22:20:39 +0000 (15:20 -0700)]
bpf: Do not allow btf_ctx_access with __int128 types
To ensure btf_ctx_access() is safe the verifier checks that the BTF
arg type is an int, enum, or pointer. When the function does the
BTF arg lookup it uses the calculation 'arg = off / 8' using the
fact that registers are 8B. This requires that the first arg is
in the first reg, the second in the second, and so on. However,
for __int128 the arg will consume two registers by default LLVM
implementation. So this will cause the arg layout assumed by the
'arg = off / 8' calculation to be incorrect.
Because __int128 is uncommon this patch applies the easiest fix and
will force int types to be sizeof(u64) or smaller so that they will
fit in a single register.
bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking bpf programs
This is a fix for a regression in commit 2c78ee898d8f ("bpf: Implement CAP_BPF").
Before the above commit it was possible to load network bpf programs
with just the CAP_SYS_ADMIN privilege.
The Android bpfloader happens to run in such a configuration (it has
SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs
for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN).
Yonghong Song [Wed, 24 Jun 2020 00:10:54 +0000 (17:10 -0700)]
bpf: Set the number of exception entries properly for subprograms
Currently, if a bpf program has more than one subprograms, each program will be
jitted separately. For programs with bpf-to-bpf calls the
prog->aux->num_exentries is not setup properly. For example, with
bpf_iter_netlink.c modified to force one function to be not inlined and with
CONFIG_BPF_JIT_ALWAYS_ON the following error is seen:
$ ./test_progs -n 3/3
...
libbpf: failed to load program 'iter/netlink'
libbpf: failed to load object 'bpf_iter_netlink'
libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007
test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed
#3/3 netlink:FAIL
The dmesg shows the following errors:
ex gen bug
which is triggered by the following code in arch/x86/net/bpf_jit_comp.c:
if (excnt >= bpf_prog->aux->num_exentries) {
pr_err("ex gen bug\n");
return -EFAULT;
}
This patch fixes the issue by computing proper num_exentries for each
subprogram before calling JIT.
Andrii Nakryiko [Fri, 19 Jun 2020 23:04:22 +0000 (16:04 -0700)]
libbpf: Fix CO-RE relocs against .text section
bpf_object__find_program_by_title(), used by CO-RE relocation code, doesn't
return .text "BPF program", if it is a function storage for sub-programs.
Because of that, any CO-RE relocation in helper non-inlined functions will
fail. Fix this by searching for .text-corresponding BPF program manually.
Adjust one of bpf_iter selftest to exhibit this pattern.
Andrii Nakryiko [Sun, 21 Jun 2020 03:11:59 +0000 (20:11 -0700)]
libbpf: Forward-declare bpf_stats_type for systems with outdated UAPI headers
Systems that doesn't yet have the very latest linux/bpf.h header, enum
bpf_stats_type will be undefined, causing compilation warnings. Prevents this
by forward-declaring enum.
David Howells [Fri, 19 Jun 2020 22:38:16 +0000 (23:38 +0100)]
rxrpc: Fix notification call on completion of discarded calls
When preallocated service calls are being discarded, they're passed to
->discard_new_call() to have the caller clean up any attached higher-layer
preallocated pieces before being marked completed. However, the act of
marking them completed now invokes the call's notification function - which
causes a problem because that function might assume that the previously
freed pieces of memory are still there.
Fix this by setting a dummy notification function on the socket after
calling ->discard_new_call().
This results in the following kasan message when the kafs module is
removed.
==================================================================
BUG: KASAN: use-after-free in afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707
Write of size 1 at addr ffff8880946c39e4 by task kworker/u4:1/21
Hangbin Liu [Fri, 19 Jun 2020 03:24:45 +0000 (11:24 +0800)]
tc-testing: update geneve options match in tunnel_key unit tests
Since iproute2 commit f72c3ad00f3b ("tc: m_tunnel_key: add options
support for vxlan"), the geneve opt output use key word "geneve_opts"
instead of "geneve_opt". To make compatibility for both old and new
iproute2, let's accept both "geneve_opt" and "geneve_opts".
Heiner Kallweit [Thu, 18 Jun 2020 21:25:50 +0000 (23:25 +0200)]
r8169: fix firmware not resetting tp->ocp_base
Typically the firmware takes care that tp->ocp_base is reset to its
default value. That's not the case (at least) for RTL8117.
As a result subsequent PHY access reads/writes the wrong page and
the link is broken. Fix this be resetting tp->ocp_base explicitly.
Dany Madden [Thu, 18 Jun 2020 19:24:13 +0000 (15:24 -0400)]
ibmvnic: continue to init in CRQ reset returns H_CLOSED
Continue the reset path when partner adapter is not ready or H_CLOSED is
returned from reset crq. This patch allows the CRQ init to proceed to
establish a valid CRQ for traffic to flow after reset.
Shannon Nelson [Thu, 18 Jun 2020 17:29:04 +0000 (10:29 -0700)]
ionic: tame the watchdog timer on reconfig
Even with moving netif_tx_disable() to an earlier point when
taking down the queues for a reconfiguration, we still end
up with the occasional netdev watchdog Tx Timeout complaint.
The old method of using netif_trans_update() works fine for
queue 0, but has no effect on the remaining queues. Using
netif_device_detach() allows us to signal to the watchdog to
ignore us for the moment.
Fixes: beead698b173 ("ionic: Add the basic NDO callbacks for netdev support") Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jonathan Toppins <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Willem de Bruijn [Thu, 18 Jun 2020 16:40:43 +0000 (12:40 -0400)]
selftests/net: report etf errors correctly
The ETF qdisc can queue skbs that it could not pace on the errqueue.
Address a few issues in the selftest
- recv buffer size was too small, and incorrectly calculated
- compared errno to ee_code instead of ee_errno
- missed invalid request error type
v2:
- fix a few checkpatch --strict indentation warnings
Fixes: ea6a547669b3 ("selftests/net: make so_txtime more robust to timer variance") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Thomas Falcon [Thu, 18 Jun 2020 15:43:46 +0000 (10:43 -0500)]
ibmveth: Fix max MTU limit
The max MTU limit defined for ibmveth is not accounting for
virtual ethernet buffer overhead, which is twenty-two additional
bytes set aside for the ethernet header and eight additional bytes
of an opaque handle reserved for use by the hypervisor. Update the
max MTU to reflect this overhead.
Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers") Fixes: 110447f8269a ("ethernet: fix min/max MTU typos") Signed-off-by: Thomas Falcon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
several fixes for indirect flow_blocks offload
v2:
patch2: store the cb_priv of representor to the flow_block_cb->indr.cb_priv
in the driver. And make the correct check with the statments
this->indr.cb_priv == cb_priv
patch4: del the driver list only in the indriect cleanup callbacks
v3:
add the cover letter and changlogs.
v4:
collapsed 1/4, 2/4, 4/4 in v3 to one fix
Add the prepare patch 1 and 2
v5:
patch1: place flow_indr_block_cb_alloc() right before
flow_indr_dev_setup_offload() to avoid moving flow_block_indr_init()
This series fixes commit 1fac52da5942 ("net: flow_offload: consolidate
indirect flow_block infrastructure") that revists the flow_block
infrastructure.
patch #1 #2: prepare for fix patch #3
add and use flow_indr_block_cb_alloc/remove function
patch #3: fix flow_indr_dev_unregister path
If the representor is removed, then identify the indirect flow_blocks
that need to be removed by the release callback and the port representor
structure. To identify the port representor structure, a new
indr.cb_priv field needs to be introduced. The flow_block also needs to
be removed from the driver list from the cleanup path
patch#4 fix block->nooffloaddevcnt warning dmesg log.
When a indr device add in offload success. The block->nooffloaddevcnt
should be 0. After the representor go away. When the dir device go away
the flow_block UNBIND operation with -EOPNOTSUPP which lead the warning
demesg log.
The block->nooffloaddevcnt should always count for indr block.
even the indr block offload successful. The representor maybe
gone away and the ingress qdisc can work in software mode.
====================
The block->nooffloaddevcnt should always count for indr block.
even the indr block offload successful. The representor maybe
gone away and the ingress qdisc can work in software mode.
block->nooffloaddevcnt warning with following dmesg log:
Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()") Signed-off-by: wenxu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
If the representor is removed, then identify the indirect flow_blocks
that need to be removed by the release callback and the port representor
structure. To identify the port representor structure, a new
indr.cb_priv field needs to be introduced. The flow_block also needs to
be removed from the driver list from the cleanup path.
Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") Signed-off-by: wenxu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Sabrina Dubroca [Thu, 18 Jun 2020 10:13:22 +0000 (12:13 +0200)]
geneve: allow changing DF behavior after creation
Currently, trying to change the DF parameter of a geneve device does
nothing:
# ip -d link show geneve1
14: geneve1: <snip>
link/ether <snip>
geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip>
# ip link set geneve1 type geneve id 1 df unset
# ip -d link show geneve1
14: geneve1: <snip>
link/ether <snip>
geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip>
We just need to update the value in geneve_changelink.
Fixes: a025fb5f49ad ("geneve: Allow configuration of DF behaviour") Signed-off-by: Sabrina Dubroca <[email protected]> Reviewed-by: Stefano Brivio <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Claudiu Manoil [Thu, 18 Jun 2020 09:16:52 +0000 (12:16 +0300)]
enetc: Fix HW_VLAN_CTAG_TX|RX toggling
VLAN tag insertion/extraction offload is correctly
activated at probe time but deactivation of this feature
(i.e. via ethtool) is broken. Toggling works only for
Tx/Rx ring 0 of a PF, and is ignored for the other rings,
including the VF rings.
To fix this, the existing VLAN offload toggling code
was extended to all the rings assigned to a netdevice,
instead of the default ring 0 (likely a leftover from the
early validation days of this feature). And the code was
moved to the common set_features() function to fix toggling
for the VF driver too.
Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Claudiu Beznea [Thu, 18 Jun 2020 08:37:40 +0000 (11:37 +0300)]
net: macb: undo operations in case of failure
Undo previously done operation in case macb_phylink_connect()
fails. Since macb_reset_hw() is the 1st undo operation the
napi_exit label was renamed to reset_hw.
David S. Miller [Fri, 19 Jun 2020 20:39:01 +0000 (13:39 -0700)]
Merge branch 'net-phy-MDIO-bus-scanning-fixes'
Florian Fainelli says:
====================
net: phy: MDIO bus scanning fixes
This patch series fixes two problems with the current MDIO bus scanning
logic which was identified while moving from 4.9 to 5.4 on devices that
do rely on scanning the MDIO bus at runtime because they use pluggable
cards.
Changes in v2:
- added comment explaining the special value of -ENODEV
- added Andrew's Reviewed-by tag
====================
Florian Fainelli [Fri, 19 Jun 2020 18:47:47 +0000 (11:47 -0700)]
net: phy: Check harder for errors in get_phy_id()
Commit 02a6efcab675 ("net: phy: allow scanning busses with missing
phys") added a special condition to return -ENODEV in case -ENODEV or
-EIO was returned from the first read of the MII_PHYSID1 register.
In case the MDIO bus data line pull-up is not strong enough, the MDIO
bus controller will not flag this as a read error. This can happen when
a pluggable daughter card is not connected and weak internal pull-ups
are used (since that is the only option, otherwise the pins are
floating).
The second read of MII_PHYSID2 will be correctly flagged an error
though, but now we will return -EIO which will be treated as a hard
error, thus preventing MDIO bus scanning loops to continue succesfully.
Apply the same logic to both register reads, thus allowing the scanning
logic to proceed.
Fixes: 02a6efcab675 ("net: phy: allow scanning busses with missing phys") Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Florian Fainelli [Fri, 19 Jun 2020 18:47:46 +0000 (11:47 -0700)]
of: of_mdio: Correct loop scanning logic
Commit 209c65b61d94 ("drivers/of/of_mdio.c:fix of_mdiobus_register()")
introduced a break of the loop on the premise that a successful
registration should exit the loop. The premise is correct but not to
code, because rc && rc != -ENODEV is just a special error condition,
that means we would exit the loop even with rc == -ENODEV which is
absolutely not correct since this is the error code to indicate to the
MDIO bus layer that scanning should continue.
Fix this by explicitly checking for rc = 0 as the only valid condition
to break out of the loop.
Fixes: 209c65b61d94 ("drivers/of/of_mdio.c:fix of_mdiobus_register()") Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
1) Fix double ESP trailer insertion in IPsec crypto offload if
netif_xmit_frozen_or_stopped is true. From Huy Nguyen.
2) Merge fixup for "remove output_finish indirection from
xfrm_state_afinfo". From Stephen Rothwell.
3) Select CRYPTO_SEQIV for ESP as this is needed for GCM and several
other encryption algorithms. Also modernize the crypto algorithm
selections for ESP and AH, remove those that are maked as "MUST NOT"
and add those that are marked as "MUST" be implemented in RFC 8221.
From Eric Biggers.
Please note the merge conflict between commit:
a7f7f6248d97 ("treewide: replace '---help---' in Kconfig files with 'help'")
from Linus' tree and commits:
7d4e39195925 ("esp, ah: consolidate the crypto algorithm selections") be01369859b8 ("esp, ah: modernize the crypto algorithm selections")
David S. Miller [Fri, 19 Jun 2020 19:55:12 +0000 (12:55 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2020-06-18
This series contains fixes to ixgbe, i40e and ice driver.
Ciara fixes up the ixgbe, i40e and ice drivers to protect access when
allocating and freeing the rings. In addition, made use of READ_ONCE
when reading the rings prior to accessing the statistics pointer.
Björn fixes a crash where the receive descriptor ring allocation was
moved to a different function, which broke the ethtool set_ringparam()
hook.
====================
Björn Töpel [Fri, 12 Jun 2020 11:47:31 +0000 (13:47 +0200)]
i40e: fix crash when Rx descriptor count is changed
When the AF_XDP buffer allocator was introduced, the Rx SW ring
"rx_bi" allocation was moved from i40e_setup_rx_descriptors()
function, and was instead done in the i40e_configure_rx_ring()
function.
This broke the ethtool set_ringparam() hook for changing the Rx
descriptor count, which was relying on i40e_setup_rx_descriptors() to
handle the allocation.
Fix this by adding an explicit i40e_alloc_rx_bi() call to
i40e_set_ringparam().
Fixes: be1222b585fd ("i40e: Separate kernel allocated rx_bi rings from AF_XDP rings") Signed-off-by: Björn Töpel <[email protected]> Tested-by: Andrew Bowers <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
Ciara Loftus [Tue, 9 Jun 2020 13:19:45 +0000 (13:19 +0000)]
ice: protect ring accesses with WRITE_ONCE
The READ_ONCE macro is used when reading rings prior to accessing the
statistics pointer. The corresponding WRITE_ONCE usage when allocating and
freeing the rings to ensure protected access was not in place. Introduce
this.
Ciara Loftus [Tue, 9 Jun 2020 13:19:44 +0000 (13:19 +0000)]
i40e: protect ring accesses with READ- and WRITE_ONCE
READ_ONCE should be used when reading rings prior to accessing the
statistics pointer. Introduce this as well as the corresponding WRITE_ONCE
usage when allocating and freeing the rings, to ensure protected access.
Ciara Loftus [Tue, 9 Jun 2020 13:19:43 +0000 (13:19 +0000)]
ixgbe: protect ring accesses with READ- and WRITE_ONCE
READ_ONCE should be used when reading rings prior to accessing the
statistics pointer. Introduce this as well as the corresponding WRITE_ONCE
usage when allocating and freeing the rings, to ensure protected access.
Eric Dumazet [Thu, 18 Jun 2020 05:23:25 +0000 (22:23 -0700)]
net: increment xmit_recursion level in dev_direct_xmit()
Back in commit f60e5990d9c1 ("ipv6: protect skb->sk accesses
from recursive dereference inside the stack") Hannes added code
so that IPv6 stack would not trust skb->sk for typical cases
where packet goes through 'standard' xmit path (__dev_queue_xmit())
Alas af_packet had a dev_direct_xmit() path that was not
dealing yet with xmit_recursion level.
Also change sk_mc_loop() to dump a stack once only.
Florian Fainelli [Thu, 18 Jun 2020 03:42:44 +0000 (20:42 -0700)]
net: dsa: bcm_sf2: Fix node reference count
of_find_node_by_name() will do an of_node_put() on the "from" argument.
With CONFIG_OF_DYNAMIC enabled which checks for device_node reference
counts, we would be getting a warning like this:
Commit 3b33583265ed ("net: Add fraglist GRO/GSO feature flags") missed
an entry for NETIF_F_GSO_FRAGLIST in netdev_features_strings array. As
a result, fraglist GSO feature is not shown in 'ethtool -k' output and
can't be toggled on/off.
The fix is trivial.
Fixes: 3b33583265ed ("net: Add fraglist GRO/GSO feature flags") Signed-off-by: Alexander Lobakin <[email protected]> Reviewed-by: Michal Kubecek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
tg3: driver sleeps indefinitely when EEH errors exceed eeh_max_freezes
The driver function tg3_io_error_detected() calls napi_disable twice,
without an intervening napi_enable, when the number of EEH errors exceeds
eeh_max_freezes, resulting in an indefinite sleep while holding rtnl_lock.
Add check for pcierr_recovery which skips code already executed for the
"Frozen" state.
Martin [Wed, 17 Jun 2020 17:00:23 +0000 (22:30 +0530)]
bareudp: Fixed multiproto mode configuration
Code to handle multiproto configuration is missing.
Fixes: 4b5f67232d95 ("net: Special handling for IP & MPLS") Signed-off-by: Martin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 19 Jun 2020 03:27:42 +0000 (20:27 -0700)]
Merge branch 's390-qeth-fixes'
Julian Wiedmann says:
====================
s390/qeth: fixes 2020-06-17
please apply the following patch series for qeth to netdev's net tree.
The first patch fixes a regression in the error handling for a specific
cmd type. I have some follow-ups queued up for net-next to clean this
up properly...
The second patch fine-tunes the HW offload restrictions that went in
with this merge window. In some setups we don't need to apply them.
====================
Julian Wiedmann [Wed, 17 Jun 2020 14:54:53 +0000 (16:54 +0200)]
s390/qeth: let isolation mode override HW offload restrictions
When a device is configured with ISOLATION_MODE_FWD, traffic never goes
through the internal switch. Don't apply the offload restrictions in
this case.
Fixes: c619e9a6f52f ("s390/qeth: don't use restricted offloads for local traffic") Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Julian Wiedmann [Wed, 17 Jun 2020 14:54:52 +0000 (16:54 +0200)]
s390/qeth: fix error handling for isolation mode cmds
Current(?) OSA devices also store their cmd-specific return codes for
SET_ACCESS_CONTROL cmds into the top-level cmd->hdr.return_code.
So once we added stricter checking for the top-level field a while ago,
none of the error logic that rolls back the user's configuration to its
old state is applied any longer.
For this specific cmd, go back to the old model where we peek into the
cmd structure even though the top-level field indicated an error.
Fixes: 686c97ee29c8 ("s390/qeth: fix error handling in adapter command callbacks") Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
mptcp: cope with syncookie on MP_JOINs
Currently syncookies on MP_JOIN connections are not handled correctly: the
connections fallback to TCP and are kept alive instead of resetting them at
fallback time.
The first patch propagates the required information up to syn_recv_sock time,
and the 2nd patch addresses the unifying the error path for all MP_JOIN
requests.
====================
Paolo Abeni [Wed, 17 Jun 2020 10:08:57 +0000 (12:08 +0200)]
mptcp: drop MP_JOIN request sock on syn cookies
Currently any MPTCP socket using syn cookies will fallback to
TCP at 3rd ack time. In case of MP_JOIN requests, the RFC mandate
closing the child and sockets, but the existing error paths
do not handle the syncookie scenario correctly.
Address the issue always forcing the child shutdown in case of
MP_JOIN fallback.
Fixes: ae2dd7164943 ("mptcp: handle tcp fallback when using syn cookies") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Paolo Abeni [Wed, 17 Jun 2020 10:08:56 +0000 (12:08 +0200)]
mptcp: cache msk on MP_JOIN init_req
The msk ownership is transferred to the child socket at
3rd ack time, so that we avoid more lookups later. If the
request does not reach the 3rd ack, the MSK reference is
dropped at request sock release time.
As a side effect, fallback is now tracked by a NULL msk
reference instead of zeroed 'mp_join' field. This will
simplify the next patch.
The arp request address is error, this is because fib_table_lookup in
fib_check_nh lookup the destnation 9.9.9.9 nexthop, the scope of
the fib result is RT_SCOPE_LINK,the correct scope is RT_SCOPE_HOST.
Here I add a check of whether this is RT_TABLE_MAIN to solve this problem.
Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups") Signed-off-by: guodeqing <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 19 Jun 2020 03:20:46 +0000 (20:20 -0700)]
Merge branch 'sja1105-fixes'
Vladimir Oltean says:
====================
Fix VLAN checks for SJA1105 DSA tc-flower filters
This fixes a ridiculous situation where the driver, in VLAN-unaware
mode, would refuse accepting any tc filter:
tc filter replace dev sw1p3 ingress flower skip_sw \
dst_mac 42:be:24:9b:76:20 \
action gate (...)
Error: sja1105: Can only gate based on {DMAC, VID, PCP}.
tc filter replace dev sw1p3 ingress protocol 802.1Q flower skip_sw \
vlan_id 1 vlan_prio 0 dst_mac 42:be:24:9b:76:20 \
action gate (...)
Error: sja1105: Can only gate based on DMAC.
So, without changing the VLAN awareness state, it says it doesn't want
VLAN-aware rules, and it doesn't want VLAN-unaware rules either. One
would say it's in Schrodinger's state...
Now, the situation has been made worse by commit 7f14937facdc ("net:
dsa: sja1105: keep the VLAN awareness state in a driver variable"),
which made VLAN awareness a ternary attribute, but after inspecting the
code from before that patch with a truth table, it looks like the
logical bug was there even before.
While attempting to fix this, I also noticed some leftover debugging
code in one of the places that needed to be fixed. It would have
appeared in the context of patch 3/3 anyway, so I decided to create a
patch that removes it.
====================
Vladimir Oltean [Tue, 16 Jun 2020 23:58:43 +0000 (02:58 +0300)]
net: dsa: sja1105: fix checks for VLAN state in gate action
This action requires the VLAN awareness state of the switch to be of the
same type as the key that's being added:
- If the switch is unaware of VLAN, then the tc filter key must only
contain the destination MAC address.
- If the switch is VLAN-aware, the key must also contain the VLAN ID and
PCP.
But this check doesn't work unless we verify the VLAN awareness state on
both the "if" and the "else" branches.
Fixes: 834f8933d5dd ("net: dsa: sja1105: implement tc-gate using time-triggered virtual links") Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Tue, 16 Jun 2020 23:58:42 +0000 (02:58 +0300)]
net: dsa: sja1105: fix checks for VLAN state in redirect action
This action requires the VLAN awareness state of the switch to be of the
same type as the key that's being added:
- If the switch is unaware of VLAN, then the tc filter key must only
contain the destination MAC address.
- If the switch is VLAN-aware, the key must also contain the VLAN ID and
PCP.
But this check doesn't work unless we verify the VLAN awareness state on
both the "if" and the "else" branches.
Fixes: dfacc5a23e22 ("net: dsa: sja1105: support flow-based redirection via virtual links") Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Tue, 16 Jun 2020 23:58:41 +0000 (02:58 +0300)]
net: dsa: sja1105: remove debugging code in sja1105_vl_gate
This shouldn't be there.
Fixes: 834f8933d5dd ("net: dsa: sja1105: implement tc-gate using time-triggered virtual links") Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 19 Jun 2020 03:17:49 +0000 (20:17 -0700)]
Merge branch 'act_gate-fixes'
Davide Caratti says:
====================
two fixes for 'act_gate' control plane
- patch 1/2 attempts to fix the error path of tcf_gate_init() when users
try to configure 'act_gate' rules with wrong parameters
- patch 2/2 is a follow-up of a recent fix for NULL dereference in
the error path of tcf_gate_init()
further work will introduce a tdc test for 'act_gate'.
changes since v2:
- fix undefined behavior in patch 1/2
- improve comment in patch 2/2
changes since v1:
coding style fixes in patch 1/2 and 2/2
====================
Davide Caratti [Tue, 16 Jun 2020 20:25:21 +0000 (22:25 +0200)]
net/sched: act_gate: fix configuration of the periodic timer
assigning a dummy value of 'clock_id' to avoid cancellation of the cycle
timer before its initialization was a temporary solution, and we still
need to handle the case where act_gate timer parameters are changed by
commands like the following one:
# tc action replace action gate <parameters>
the fix consists in the following items:
1) remove the workaround assignment of 'clock_id', and init the list of
entries before the first error path after IDR atomic check/allocation
2) validate 'clock_id' earlier: there is no need to do IDR atomic
check/allocation if we know that 'clock_id' is a bad value
3) use a dedicated function, 'gate_setup_timer()', to ensure that the
timer is cancelled and re-initialized on action overwrite, and also
ensure we initialize the timer in the error path of tcf_gate_init()
v3: improve comment in the error path of tcf_gate_init() (thanks to
Vladimir Oltean)
v2: avoid 'goto' in gate_setup_timer (thanks to Cong Wang)
CC: Ivan Vecera <[email protected]> Fixes: a01c245438c5 ("net/sched: fix a couple of splats in the error path of tfc_gate_init()") Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Signed-off-by: Davide Caratti <[email protected]> Acked-by: Vladimir Oltean <[email protected]> Tested-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
this is caused by the test on 'cycletime_ext', that is still unassigned
when the action is newly created. This makes the action .init() return 0
without calling tcf_idr_insert(), hence the UAF + crash.
rework the logic that prevents zero values of cycle-time, as follows:
1) 'tcfg_cycletime_ext' seems to be unused in the action software path,
and it was already possible by other means to obtain non-zero
cycletime and zero cycletime-ext. So, removing that test should not
cause any damage.
2) while at it, we must prevent overwriting configuration data with wrong
ones: use a temporary variable for 'tcfg_cycletime', and validate it
preserving the original semantic (that allowed computing the cycle
time as the sum of all intervals, when not specified by
TCA_GATE_CYCLE_TIME).
3) remove the test on 'tcfg_cycletime', no more useful, and avoid
returning -EFAULT, which did not seem an appropriate return value for
a wrong netlink attribute.
v3: fix uninitialized 'cycletime' (thanks to Vladimir Oltean)
v2: remove useless 'return;' at the end of void gate_get_start_time()
Taehee Yoo [Tue, 16 Jun 2020 16:51:51 +0000 (16:51 +0000)]
ip_tunnel: fix use-after-free in ip_tunnel_lookup()
In the datapath, the ip_tunnel_lookup() is used and it internally uses
fallback tunnel device pointer, which is fb_tunnel_dev.
This pointer variable should be set to NULL when a fb interface is deleted.
But there is no routine to set fb_tunnel_dev pointer to NULL.
So, this pointer will be still used after interface is deleted and
it eventually results in the use-after-free problem.
Test commands:
ip netns add A
ip netns add B
ip link add eth0 type veth peer name eth1
ip link set eth0 netns A
ip link set eth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set eth0 up
ip netns exec A ip link add gre1 type gre local 10.0.0.1 \
remote 10.0.0.2
ip netns exec A ip link set gre1 up
ip netns exec A ip a a 10.0.100.1/24 dev gre1
ip netns exec A ip a a 10.0.0.1/24 dev eth0
ip netns exec B ip link set lo up
ip netns exec B ip link set eth1 up
ip netns exec B ip link add gre1 type gre local 10.0.0.2 \
remote 10.0.0.1
ip netns exec B ip link set gre1 up
ip netns exec B ip a a 10.0.100.2/24 dev gre1
ip netns exec B ip a a 10.0.0.2/24 dev eth1
ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 &
ip netns del B
Taehee Yoo [Tue, 16 Jun 2020 16:04:00 +0000 (16:04 +0000)]
ip6_gre: fix use-after-free in ip6gre_tunnel_lookup()
In the datapath, the ip6gre_tunnel_lookup() is used and it internally uses
fallback tunnel device pointer, which is fb_tunnel_dev.
This pointer variable should be set to NULL when a fb interface is deleted.
But there is no routine to set fb_tunnel_dev pointer to NULL.
So, this pointer will be still used after interface is deleted and
it eventually results in the use-after-free problem.
Test commands:
ip netns add A
ip netns add B
ip link add eth0 type veth peer name eth1
ip link set eth0 netns A
ip link set eth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set eth0 up
ip netns exec A ip link add ip6gre1 type ip6gre local fc:0::1 \
remote fc:0::2
ip netns exec A ip -6 a a fc:100::1/64 dev ip6gre1
ip netns exec A ip link set ip6gre1 up
ip netns exec A ip -6 a a fc:0::1/64 dev eth0
ip netns exec A ip link set ip6gre0 up
ip netns exec B ip link set lo up
ip netns exec B ip link set eth1 up
ip netns exec B ip link add ip6gre1 type ip6gre local fc:0::2 \
remote fc:0::1
ip netns exec B ip -6 a a fc:100::2/64 dev ip6gre1
ip netns exec B ip link set ip6gre1 up
ip netns exec B ip -6 a a fc:0::2/64 dev eth1
ip netns exec B ip link set ip6gre0 up
ip netns exec A ping fc:100::2 -s 60000 &
ip netns del B
Taehee Yoo [Tue, 16 Jun 2020 15:52:05 +0000 (15:52 +0000)]
net: core: reduce recursion limit value
In the current code, ->ndo_start_xmit() can be executed recursively only
10 times because of stack memory.
But, in the case of the vxlan, 10 recursion limit value results in
a stack overflow.
In the current code, the nested interface is limited by 8 depth.
There is no critical reason that the recursion limitation value should
be 10.
So, it would be good to be the same value with the limitation value of
nesting interface depth.
Test commands:
ip link add vxlan10 type vxlan vni 10 dstport 4789 srcport 4789 4789
ip link set vxlan10 up
ip a a 192.168.10.1/24 dev vxlan10
ip n a 192.168.10.2 dev vxlan10 lladdr fc:22:33:44:55:66 nud permanent
for i in {9..0}
do
let A=$i+1
ip link add vxlan$i type vxlan vni $i dstport 4789 srcport 4789 4789
ip link set vxlan$i up
ip a a 192.168.$i.1/24 dev vxlan$i
ip n a 192.168.$i.2 dev vxlan$i lladdr fc:22:33:44:55:66 nud permanent
bridge fdb add fc:22:33:44:55:66 dev vxlan$A dst 192.168.$i.2 self
done
hping3 192.168.10.2 -2 -d 60000
Fixes: 11a766ce915f ("net: Increase xmit RECURSION_LIMIT to 10.") Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
If call_netdevice_notifiers() failed, then rollback_registered()
calls netdev_unregister_kobject() which holds the kobject. The
reference cannot be put because the netdev won't be add to todo
list, so it will leads a memleak, we need put the reference to
avoid memleak.
Sascha Hauer [Tue, 16 Jun 2020 08:31:40 +0000 (10:31 +0200)]
net: ethernet: mvneta: Add 2500BaseX support for SoCs without comphy
The older SoCs like Armada XP support a 2500BaseX mode in the datasheets
referred to as DR-SGMII (Double rated SGMII) or HS-SGMII (High Speed
SGMII). This is an upclocked 1000BaseX mode, thus
PHY_INTERFACE_MODE_2500BASEX is the appropriate mode define for it.
adding support for it merely means writing the correct magic value into
the MVNETA_SERDES_CFG register.
Sascha Hauer [Tue, 16 Jun 2020 08:31:39 +0000 (10:31 +0200)]
net: ethernet: mvneta: Fix Serdes configuration for SoCs without comphy
The MVNETA_SERDES_CFG register is only available on older SoCs like the
Armada XP. On newer SoCs like the Armada 38x the fields are moved to
comphy. This patch moves the writes to this register next to the comphy
initialization, so that depending on the SoC either comphy or
MVNETA_SERDES_CFG is configured.
With this we no longer write to the MVNETA_SERDES_CFG on SoCs where it
doesn't exist.
Shannon Nelson [Tue, 16 Jun 2020 15:06:26 +0000 (08:06 -0700)]
ionic: export features for vlans to use
Set up vlan_features for use by any vlans above us.
Fixes: beead698b173 ("ionic: Add the basic NDO callbacks for netdev support") Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jonathan Toppins <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Shannon Nelson [Tue, 16 Jun 2020 01:14:59 +0000 (18:14 -0700)]
ionic: no link check while resetting queues
If the driver is busy resetting queues after a change in
MTU or queue parameters, don't bother checking the link,
wait until the next watchdog cycle.
Fixes: 987c0871e8ae ("ionic: check for linkup in watchdog") Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jonathan Toppins <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David Howells [Wed, 17 Jun 2020 14:46:33 +0000 (15:46 +0100)]
rxrpc: Fix afs large storage transmission performance drop
Commit 2ad6691d988c, which moved the modification of the status annotation
for a packet in the Tx buffer prior to the retransmission moved the state
clearance, but managed to lose the bit that set it to UNACK.
Consequently, if a retransmission occurs, the packet is accidentally
changed to the ACK state (ie. 0) by masking it off, which means that the
packet isn't counted towards the tally of newly-ACK'd packets if it gets
hard-ACK'd. This then prevents the congestion control algorithm from
recovering properly.
Fix by reinstating the change of state to UNACK.
Spotted by the generic/460 xfstest.
Fixes: 2ad6691d988c ("rxrpc: Fix race between incoming ACK parser and retransmitter") Signed-off-by: David Howells <[email protected]>
David Howells [Wed, 17 Jun 2020 22:01:23 +0000 (23:01 +0100)]
rxrpc: Fix handling of rwind from an ACK packet
The handling of the receive window size (rwind) from a received ACK packet
is not correct. The rxrpc_input_ackinfo() function currently checks the
current Tx window size against the rwind from the ACK to see if it has
changed, but then limits the rwind size before storing it in the tx_winsize
member and, if it increased, wake up the transmitting process. This means
that if rwind > RXRPC_RXTX_BUFF_SIZE - 1, this path will always be
followed.
Fix this by limiting rwind before we compare it to tx_winsize.
The effect of this can be seen by enabling the rxrpc_rx_rwind_change
tracepoint.
Fixes: 702f2ac87a9a ("rxrpc: Wake up the transmitter if Rx window size increases on the peer") Signed-off-by: David Howells <[email protected]>
Those last two bytes - 96 1f - aren't part of the original packet.
In the ax88179 RX path, the usbnet rx_fixup function trims a 2-byte
'alignment pseudo header' from the start of the packet, and sets the
length from a per-packet field populated by hardware. It looks like that
length field *includes* the 2-byte header; the current driver assumes
that it's excluded.
This change trims the 2-byte alignment header after we've set the packet
length, so the resulting packet length is correct. While we're moving
the comment around, this also fixes the spelling of 'pseudo'.
selftests/bpf: Make sure optvals > PAGE_SIZE are bypassed
We are relying on the fact, that we can pass > sizeof(int) optvals
to the SOL_IP+IP_FREEBIND option (the kernel will take first 4 bytes).
In the BPF program we check that we can only touch PAGE_SIZE bytes,
but the real optlen is PAGE_SIZE * 2. In both cases, we override it to
some predefined value and trim the optlen.
Also, let's modify exiting IP_TOS usecase to test optlen=0 case
where BPF program just bypasses the data as is.
bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
Attaching to these hooks can break iptables because its optval is
usually quite big, or at least bigger than the current PAGE_SIZE limit.
David also mentioned some SCTP options can be big (around 256k).
For such optvals we expose only the first PAGE_SIZE bytes to
the BPF program. BPF program has two options:
1. Set ctx->optlen to 0 to indicate that the BPF's optval
should be ignored and the kernel should use original userspace
value.
2. Set ctx->optlen to something that's smaller than the PAGE_SIZE.
v5:
* use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov)
* update the docs accordingly
v4:
* use temporary buffer to avoid optval == optval_end == NULL;
this removes the corner case in the verifier that might assume
non-zero PTR_TO_PACKET/PTR_TO_PACKET_END.
v3:
* don't increase the limit, bypass the argument
devmap: Use bpf_map_area_alloc() for allocating hash buckets
Syzkaller discovered that creating a hash of type devmap_hash with a large
number of entries can hit the memory allocator limit for allocating
contiguous memory regions. There's really no reason to use kmalloc_array()
directly in the devmap code, so just switch it to the existing
bpf_map_area_alloc() function that is used elsewhere.
Tobias Klauser [Tue, 16 Jun 2020 11:33:03 +0000 (13:33 +0200)]
tools, bpftool: Add ringbuf map type to map command docs
Commit c34a06c56df7 ("tools/bpftool: Add ringbuf map to a list of known
map types") added the symbolic "ringbuf" name. Document it in the bpftool
map command docs and usage as well.
Andrii Nakryiko [Tue, 16 Jun 2020 05:04:30 +0000 (22:04 -0700)]
bpf: bpf_probe_read_kernel_str() has to return amount of data read on success
During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on
success, instead of amount of data successfully read. This majorly breaks
applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str()
and their results. Fix this by returning actual number of bytes read.
1) Don't get per-cpu pointer with preemption enabled in nft_set_pipapo,
fix from Stefano Brivio.
2) Fix memory leak in ctnetlink, from Pablo Neira Ayuso.
3) Multiple definitions of MPTCP_PM_MAX_ADDR, from Geliang Tang.
4) Accidently disabling NAPI in non-error paths of macb_open(), from
Charles Keepax.
5) Fix races between alx_stop and alx_remove, from Zekun Shen.
6) We forget to re-enable SRIOV during resume in bnxt_en driver, from
Michael Chan.
7) Fix memory leak in ipv6_mc_destroy_dev(), from Wang Hai.
8) rxtx stats use wrong index in mvpp2 driver, from Sven Auhagen.
9) Fix memory leak in mptcp_subflow_create_socket error path, from Wei
Yongjun.
10) We should not adjust the TCP window advertised when sending dup acks
in non-SACK mode, because it won't be counted as a dup by the sender
if the window size changes. From Eric Dumazet.
11) Destroy the right number of queues during remove in mvpp2 driver,
from Sven Auhagen.
12) Various WOL and PM fixes to e1000 driver, from Chen Yu, Vaibhav
Gupta, and Arnd Bergmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
e1000e: fix unused-function warning
e1000: use generic power management
e1000e: Do not wake up the system via WOL if device wakeup is disabled
lan743x: add MODULE_DEVICE_TABLE for module loading alias
mlxsw: spectrum: Adjust headroom buffers for 8x ports
bareudp: Fixed configuration to avoid having garbage values
mvpp2: remove module bugfix
tcp: grow window for OOO packets only for SACK flows
mptcp: fix memory leak in mptcp_subflow_create_socket()
netfilter: flowtable: Make nf_flow_table_offload_add/del_cb inline
net/sched: act_ct: Make tcf_ct_flow_table_restore_skb inline
net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
mvpp2: ethtool rxtx stats fix
MAINTAINERS: switch to my private email for Renesas Ethernet drivers
rocker: fix incorrect error handling in dma_rings_init
test_objagg: Fix potential memory leak in error handling
net: ethernet: mtk-star-emac: simplify interrupt handling
mld: fix memory leak in ipv6_mc_destroy_dev()
bnxt_en: Return from timer if interface is not in open state.
bnxt_en: Fix AER reset logic on 57500 chips.
...
Linus Torvalds [Wed, 17 Jun 2020 00:40:51 +0000 (17:40 -0700)]
Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"I've managed to get xfstests kind of working with afs. Here are a set
of patches that fix most of the bugs found.
There are a number of primary issues:
- Incorrect handling of mtime and non-handling of ctime. It might be
argued, that the latter isn't a bug since the AFS protocol doesn't
support ctime, but I should probably still update it locally.
- Shared-write mmap, truncate and writeback bugs. This includes not
changing i_size under the callback lock, overwriting local i_size
with the reply from the server after a partial writeback, not
limiting the writeback from an mmapped page to EOF.
- Checks for an abort code indicating that the primary vnode in an
operation was deleted by a third-party are done in the wrong place.
- Silly rename bugs. This includes an incomplete conversion to the
new operation handling, duplicate nlink handling, nlink changing
not being done inside the callback lock and insufficient handling
of third-party conflicting directory changes.
And some secondary ones:
- The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.
- Remove a couple of unused or incompletely used bits.
- Remove a couple of redundant success checks.
These seem to fix all the data-corruption bugs found by
./check -afs -g quick
along with the obvious silly rename bugs and time bugs.
There are still some test failures, but they seem to fall into two
classes: firstly, the authentication/security model is different to
the standard UNIX model and permission is arbitrated by the server and
cached locally; and secondly, there are a number of features that AFS
does not support (such as mknod). But in these cases, the tests
themselves need to be adapted or skipped.
Using the in-kernel afs client with xfstests also found a bug in the
AuriStor AFS server that has been fixed for a future release"
* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix silly rename
afs: afs_vnode_commit_status() doesn't need to check the RPC error
afs: Fix use of afs_check_for_remote_deletion()
afs: Remove afs_operation::abort_code
afs: Fix yfs_fs_fetch_status() to honour vnode selector
afs: Remove yfs_fs_fetch_file_status() as it's not used
afs: Fix the mapping of the UAEOVERFLOW abort code
afs: Fix truncation issues and mmap writeback size
afs: Concoct ctimes
afs: Fix EOF corruption
afs: afs_write_end() should change i_size under the right lock
afs: Fix non-setting of mtime when writing into mmap
Randy Dunlap [Mon, 15 Jun 2020 02:59:07 +0000 (19:59 -0700)]
Documentation: remove SH-5 index entries
Remove SH-5 documentation index entries following the removal
of SH-5 source code.
Error: Cannot open file ../arch/sh/mm/tlb-sh5.c
Error: Cannot open file ../arch/sh/mm/tlb-sh5.c
Error: Cannot open file ../arch/sh/include/asm/tlb_64.h
Error: Cannot open file ../arch/sh/include/asm/tlb_64.h
Linus Torvalds [Wed, 17 Jun 2020 00:23:57 +0000 (17:23 -0700)]
Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should no
longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size
for the array declaration entirely:
This is the way the kernel expects dynamically sized trailing elements
to be declared. It allows the compiler to generate errors when the
flexible array does not occur last in the structure, which helps to
prevent some kind of undefined behavior[3] bugs from being
inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via
sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
instance, there is no mechanism that warns us that the following
application of the sizeof() operator to a zero-length array always
results in zero:
At the last line of code above, size turns out to be zero, when one
might have thought it represents the total size in bytes of the
dynamic memory recently allocated for the trailing array items. Here
are a couple examples of this issue[4][5].
Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such
operators will be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through
the use of a flexible array member:
Arvind Sankar [Tue, 16 Jun 2020 22:25:47 +0000 (18:25 -0400)]
x86/purgatory: Add -fno-stack-protector
The purgatory Makefile removes -fstack-protector options if they were
configured in, but does not currently add -fno-stack-protector.
If gcc was configured with the --enable-default-ssp configure option,
this results in the stack protector still being enabled for the
purgatory (absent distro-specific specs files that might disable it
again for freestanding compilations), if the main kernel is being
compiled with stack protection enabled (if it's disabled for the main
kernel, the top-level Makefile will add -fno-stack-protector).
This will break the build since commit e4160b2e4b02 ("x86/purgatory: Fail the build if purgatory.ro has missing symbols")
and prior to that would have caused runtime failure when trying to use
kexec.
Explicitly add -fno-stack-protector to avoid this, as done in other
Makefiles that need to disable the stack protector.