Eric Dumazet [Sat, 15 Jun 2019 23:28:48 +0000 (16:28 -0700)]
neigh: fix use-after-free read in pneigh_get_next
Nine years ago, I added RCU handling to neighbours, not pneighbours.
(pneigh are not commonly used)
Unfortunately I missed that /proc dump operations would use a
common entry and exit point : neigh_seq_start() and neigh_seq_stop()
We need to read_lock(tbl->lock) or risk use-after-free while
iterating the pneigh structures.
We might later convert pneigh to RCU and revert this patch.
sysbot reported :
BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825
Memory state around the buggy address: ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
^ ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Eric Dumazet [Sat, 15 Jun 2019 20:19:55 +0000 (13:19 -0700)]
tcp: fix compile error if !CONFIG_SYSCTL
tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available
even if CONFIG_SYSCTL is not set.
Fixes: 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl") Fixes: ede61ca474a0 ("tcp: add tcp_rx_skb_cache sysctl") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ivan Vecera [Fri, 14 Jun 2019 15:48:36 +0000 (17:48 +0200)]
be2net: Fix number of Rx queues used for flow hashing
Number of Rx queues used for flow hashing returned by the driver is
incorrect and this bug prevents user to use the last Rx queue in
indirection table.
Fixes: 594ad54a2c3b ("be2net: Add support for setting and getting rx flow hash options") Reported-by: Tianhao <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Signed-off-by: David S. Miller <[email protected]>
When stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4],
vlan_do_receive() returns false if it does not find vlan_dev. Later
__netif_receive_skb_core() fails to find packet type handler for
skb->protocol 801.1AD and drops the packet.
801.1P header with vlan id 0 should be handled as untagged packets.
This patch fixes it by checking if vlan_id is 0 and processes next vlan
header.
Russell King espressed some strong opposition to this
change, explaining that this is trying to make phylink
behave outside of how it has been designed.
Matt Mullins [Tue, 11 Jun 2019 21:53:04 +0000 (14:53 -0700)]
bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.
This enables three levels of nesting, to support
- a kprobe or raw tp or perf event,
- another one of the above that irq context happens to call, and
- another one in nmi context
(at most one of which may be a kprobe or perf event).
Arthur Fabre [Sat, 15 Jun 2019 21:36:27 +0000 (14:36 -0700)]
bpf: Fix out of bounds memory access in bpf_sk_storage
bpf_sk_storage maps use multiple spin locks to reduce contention.
The number of locks to use is determined by the number of possible CPUs.
With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.
When updating elements, the correct lock is determined with hash_ptr().
Calling hash_ptr() with 0 bits is undefined behavior, as it does:
x >> (64 - bits)
Using the value results in an out of bounds memory access.
In my case, this manifested itself as a page fault when raw_spin_lock_bh()
is called later, when running the self tests:
./tools/testing/selftests/bpf/test_verifier 773 775
[ 16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8
Force the minimum number of locks to two.
Signed-off-by: Arthur Fabre <[email protected]> Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
Stephen Barber [Sat, 15 Jun 2019 06:42:37 +0000 (23:42 -0700)]
vsock/virtio: set SOCK_DONE on peer shutdown
Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has
shut down and there is nothing left to read.
This fixes the following bug:
1) Peer sends SHUTDOWN(RDWR).
2) Socket enters TCP_CLOSING but SOCK_DONE is not set.
3) read() returns -ENOTCONN until close() is called, then returns 0.
Linus Walleij [Thu, 13 Jun 2019 22:25:20 +0000 (00:25 +0200)]
net: dsa: rtl8366: Fix up VLAN filtering
We get this regression when using RTL8366RB as part of a bridge
with OpenWrt:
WARNING: CPU: 0 PID: 1347 at net/switchdev/switchdev.c:291
switchdev_port_attr_set_now+0x80/0xa4
lan0: Commit of attribute (id=7) failed.
(...)
realtek-smi switch lan0: failed to initialize vlan filtering on this port
This is because it is trying to disable VLAN filtering
on VLAN0, as we have forgot to add 1 to the port number
to get the right VLAN in rtl8366_vlan_filtering(): when
we initialize the VLAN we associate VLAN1 with port 0,
VLAN2 with port 1 etc, so we need to add 1 to the port
offset.
Fixes: d8652956cf37 ("net: dsa: realtek-smi: Add Realtek SMI driver") Signed-off-by: Linus Walleij <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ioana Ciornei [Thu, 13 Jun 2019 06:37:51 +0000 (09:37 +0300)]
net: phylink: set the autoneg state in phylink_phy_change
The phy_state field of phylink should carry only valid information
especially when this can be passed to the .mac_config callback.
Update the an_enabled field with the autoneg state in the
phylink_phy_change function.
Fixes: 9525ae83959b ("phylink: add phylink infrastructure") Signed-off-by: Ioana Ciornei <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Sat, 15 Jun 2019 03:18:28 +0000 (20:18 -0700)]
Merge branch 'tcp-add-three-static-keys'
Eric Dumazet says:
====================
tcp: add three static keys
Recent addition of per TCP socket rx/tx cache brought
regressions for some workloads, as reported by Feng Tang.
It seems better to make them opt-in, before we adopt better
heuristics.
The last patch adds high_order_alloc_disable sysctl
to ask TCP sendmsg() to exclusively use order-0 allocations,
as mm layer has specific optimizations.
====================
>From linux-3.7, (commit 5640f7685831 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.
While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.
This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost (see 8 lines of following benchmark)
The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)
for thr in {1..30}
do
sysctl -wq net.core.high_order_alloc_disable=0
T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
sysctl -wq net.core.high_order_alloc_disable=1
T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
echo $thr:$T0:$T1
done
Eric Dumazet [Fri, 14 Jun 2019 23:22:20 +0000 (16:22 -0700)]
tcp: add tcp_tx_skb_cache sysctl
Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)
There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.
I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.
It seems best to disable the feature by default, and let
admins enabling it.
MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.
Eric Dumazet [Fri, 14 Jun 2019 23:22:19 +0000 (16:22 -0700)]
tcp: add tcp_rx_skb_cache sysctl
Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.
Haiyang Zhang [Thu, 13 Jun 2019 21:06:53 +0000 (21:06 +0000)]
hv_netvsc: Set probe mode to sync
For better consistency of synthetic NIC names, we set the probe mode to
PROBE_FORCE_SYNCHRONOUS. So the names can be aligned with the vmbus
channel offer sequence.
Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv drivers") Signed-off-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vlad Buslov [Thu, 13 Jun 2019 14:54:04 +0000 (17:54 +0300)]
net: sched: flower: don't call synchronize_rcu() on mask creation
Current flower mask creating code assumes that temporary mask that is used
when inserting new filter is stack allocated. To prevent race condition
with data patch synchronize_rcu() is called every time fl_create_new_mask()
replaces temporary stack allocated mask. As reported by Jiri, this
increases runtime of creating 20000 flower classifiers from 4 seconds to
163 seconds. However, this design is no longer necessary since temporary
mask was converted to be dynamically allocated by commit 2cddd2014782
("net/sched: cls_flower: allocate mask dynamically in fl_change()").
Remove synchronize_rcu() calls from mask creation code. Instead, refactor
fl_change() to always deallocate temporary mask with rcu grace period.
Neil Horman [Thu, 13 Jun 2019 10:35:59 +0000 (06:35 -0400)]
sctp: Free cookie before we memdup a new one
Based on comments from Xin, even after fixes for our recent syzbot
report of cookie memory leaks, its possible to get a resend of an INIT
chunk which would lead to us leaking cookie memory.
To ensure that we don't leak cookie memory, free any previously
allocated cookie first.
Change notes
v1->v2
update subsystem tag in subject (davem)
repeat kfree check for peer_random and peer_hmacs (xin)
Robert Hancock [Wed, 12 Jun 2019 20:33:32 +0000 (14:33 -0600)]
net: dsa: microchip: Don't try to read stats for unused ports
If some of the switch ports were not listed in the device tree, due to
being unused, the ksz_mib_read_work function ended up accessing a NULL
dp->slave pointer and causing an oops. Skip checking statistics for any
unused ports.
David S. Miller [Sat, 15 Jun 2019 02:05:58 +0000 (19:05 -0700)]
Merge branch 'qmi_wwan-fix-QMAP-handling'
Reinhard Speyerer says:
====================
qmi_wwan: fix QMAP handling
This series addresses the following issues observed when using the
QMAP support of the qmi_wwan driver:
1. The QMAP code in the qmi_wwan driver is based on the CodeAurora
GobiNet driver ([1], [2]) which does not process QMAP padding
in the RX path correctly. This causes qmimux_rx_fixup() to pass
incorrect data to the IP stack when padding is used.
2. qmimux devices currently lack proper network device usage statistics.
3. RCU stalls on device disconnect with QMAP activated like this
qmi_wwan: avoid RCU stalls on device disconnect when in QMAP mode
Switch qmimux_unregister_device() and qmi_wwan_disconnect() to
use unregister_netdevice_queue() and unregister_netdevice_many()
instead of unregister_netdevice(). This avoids RCU stalls which
have been observed on device disconnect in certain setups otherwise.
Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support") Cc: Daniele Palmas <[email protected]> Signed-off-by: Reinhard Speyerer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
qmi_wwan: add support for QMAP padding in the RX path
The QMAP code in the qmi_wwan driver is based on the CodeAurora GobiNet
driver which does not process QMAP padding in the RX path correctly.
Add support for QMAP padding to qmimux_rx_fixup() according to the
description of the rmnet driver.
Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support") Cc: Daniele Palmas <[email protected]> Signed-off-by: Reinhard Speyerer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Since commit 177366bf7ceb the %rbp stopped pointing to %rbp of the
previous stack frame. That broke frame pointer based stack unwinding.
This commit is a partial revert of it.
Note that the location of tail_call_cnt is fixed, since the verifier
enforces MAX_BPF_STACK stack size for programs with tail calls.
Fixes: 177366bf7ceb ("bpf: change x86 JITed program stack layout") Signed-off-by: Alexei Starovoitov <[email protected]>
Toshiaki Makita [Fri, 14 Jun 2019 08:20:15 +0000 (17:20 +0900)]
bpf, devmap: Add missing RCU read lock on flush
.ndo_xdp_xmit() assumes it is called under RCU. For example virtio_net
uses RCU to detect it has setup the resources for tx. The assumption
accidentally broke when introducing bulk queue in devmap.
Fixes: 5d053f9da431 ("bpf: devmap prepare xdp frames for bulking") Reported-by: David Ahern <[email protected]> Signed-off-by: Toshiaki Makita <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
Toshiaki Makita [Fri, 14 Jun 2019 08:20:13 +0000 (17:20 +0900)]
bpf, devmap: Fix premature entry free on destroying map
dev_map_free() waits for flush_needed bitmap to be empty in order to
ensure all flush operations have completed before freeing its entries.
However the corresponding clear_bit() was called before using the
entries, so the entries could be used after free.
All access to the entries needs to be done before clearing the bit.
It seems commit a5e2da6e9787 ("bpf: netdev is never null in
__dev_map_flush") accidentally changed the clear_bit() and memory access
order.
Note that the problem happens only in __dev_map_flush(), not in
dev_map_flush_old(). dev_map_flush_old() is called only after nulling
out the corresponding netdev_map entry, so dev_map_free() never frees
the entry thus no such race happens there.
Fixes: a5e2da6e9787 ("bpf: netdev is never null in __dev_map_flush") Signed-off-by: Toshiaki Makita <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
David S. Miller [Fri, 14 Jun 2019 16:36:51 +0000 (09:36 -0700)]
Merge tag 'mac80211-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Various fixes, all over:
* a few memory leaks
* fixes for management frame protection security
and A2/A3 confusion (affecting TDLS as well)
* build fix for certificates
* etc.
====================
net: phylink: further mac_config documentation improvements
While reviewing the DPAA2 work, it has become apparent that we need
better documentation about which members of the phylink link state
structure are valid in the mac_config call. Improve this
documentation.
Young Xiao [Fri, 14 Jun 2019 07:13:02 +0000 (15:13 +0800)]
nfc: Ensure presence of required attributes in the deactivate_target handler
Check that the NFC_ATTR_TARGET_INDEX attributes (in addition to
NFC_ATTR_DEVICE_INDEX) are provided by the netlink client prior to
accessing them. This prevents potential unhandled NULL pointer dereference
exceptions which can be triggered by malicious user-mode programs,
if they omit one or both of these attributes.
Eric Biggers [Mon, 10 Jun 2019 20:02:19 +0000 (13:02 -0700)]
cfg80211: fix memory leak of wiphy device name
In wiphy_new_nm(), if an error occurs after dev_set_name() and
device_initialize() have already been called, it's necessary to call
put_device() (via wiphy_free()) to avoid a memory leak.
The bits of Rx MCS Map in VHT capability were enumerated
with index transform - index i -> (i + 1) bit => nss i. BUG!
while it should be - index i -> (i + 1) bit => (i + 1) nss.
The bug was exposed in commit a53b2a0b1245 ("iwlwifi: mvm: implement VHT
extended NSS support in rs.c"), where iwlwifi started using the
function.
Luca Coelho [Wed, 29 May 2019 12:25:29 +0000 (15:25 +0300)]
cfg80211: use BIT_ULL in cfg80211_parse_mbssid_data()
The seen_indices variable is u64 and in other parts of the code we
assume mbssid_index_ie[2] can be up to 45, so we should use the 64-bit
versions of BIT, namely, BIT_ULL().
Yibo Zhao [Fri, 14 Jun 2019 11:01:52 +0000 (19:01 +0800)]
mac80211: only warn once on chanctx_conf being NULL
In multiple SSID cases, it takes time to prepare every AP interface
to be ready in initializing phase. If a sta already knows everything it
needs to join one of the APs and sends authentication to the AP which
is not fully prepared at this point of time, AP's channel context
could be NULL. As a result, warning message occurs.
Even worse, if the AP is under attack via tools such as MDK3 and massive
authentication requests are received in a very short time, console will
be hung due to kernel warning messages.
WARN_ON_ONCE() could be a better way for indicating warning messages
without duplicate messages to flood the console.
Johannes: We still need to address the underlying problem, but we
don't really have a good handle on it yet. Suppress the
worst side-effects for now.
Johannes Berg [Wed, 13 Feb 2019 14:13:30 +0000 (15:13 +0100)]
mac80211: drop robust management frames from unknown TA
When receiving a robust management frame, drop it if we don't have
rx->sta since then we don't have a security association and thus
couldn't possibly validate the frame.
Daniel Borkmann [Thu, 13 Jun 2019 21:07:01 +0000 (23:07 +0200)]
Merge branch 'bpf-ppc-div-fix'
Naveen N. Rao says:
====================
The first patch updates DIV64 overflow tests to properly detect error
conditions. The second patch fixes powerpc64 JIT to generate the proper
unsigned division instruction for BPF_ALU64.
====================
Naveen N. Rao [Wed, 12 Jun 2019 18:51:40 +0000 (00:21 +0530)]
powerpc/bpf: use unsigned division instruction for 64-bit operations
BPF_ALU64 div/mod operations are currently using signed division, unlike
BPF_ALU32 operations. Fix the same. DIV64 and MOD64 overflow tests pass
with this fix.
Fixes: 156d0e290e969c ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF") Cc: [email protected] # v4.8+ Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
Naveen N. Rao [Wed, 12 Jun 2019 18:51:39 +0000 (00:21 +0530)]
bpf: fix div64 overflow tests to properly detect errors
If the result of the division is LLONG_MIN, current tests do not detect
the error since the return value is truncated to a 32-bit value and ends
up being 0.
bpf: simplify definition of BPF_FIB_LOOKUP related flags
Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI
were defined with the help of BIT macro. This had the following issues:
- In order to use any of the flags, a user was required to depend
on <linux/bits.h>.
- No other flag in bpf.h uses the macro, so it seems that an unwritten
convention is to use (1 << (nr)) to define BPF-related flags.
Fixes: 87f5fc7e48dd ("bpf: Provide helper to do forwarding lookups in kernel FIB table") Signed-off-by: Martynas Pumputis <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
====================
net: mvpp2: prs: Fixes for VID filtering
This series fixes some issues with VID filtering offload, mainly due to
the wrong ranges being used in the TCAM header parser.
The first patch fixes a bug where removing a VLAN from a port's
whitelist would also remove it from other port's, if they are on the
same PPv2 instance.
The second patch makes so that we don't invalidate the wrong TCAM
entries when clearing the whole whitelist.
====================
net: mvpp2: prs: Use the correct helpers when removing all VID filters
When removing all VID filters, the mvpp2_prs_vid_entry_remove would be
called with the TCAM id incorrectly used as a VID, causing the wrong
TCAM entries to be invalidated.
Fix this by directly invalidating entries in the VID range.
Fixes: 56beda3db602 ("net: mvpp2: Add hardware offloading for VLAN filtering") Suggested-by: Yuri Chipchev <[email protected]> Signed-off-by: Maxime Chevallier <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 12 Jun 2019 18:08:15 +0000 (11:08 -0700)]
Merge branch 'mlxsw-Various-fixes'
Ido Schimmel says:
====================
mlxsw: Various fixes
This patchset contains various fixes for mlxsw.
Patch #1 fixes an hash polarization problem when a nexthop device is a
LAG device. This is caused by the fact that the same seed is used for
the LAG and ECMP hash functions.
Patch #2 fixes an issue in which the driver fails to refresh a nexthop
neighbour after it becomes dead. This prevents the nexthop from ever
being written to the adjacency table and used to forward traffic. Patch
Patch #4 fixes a wrong extraction of TOS value in flower offload code.
Patch #5 is a test case.
Patch #6 works around a buffer issue in Spectrum-2 by reducing the
default sizes of the shared buffer pools.
Patch #7 prevents prio-tagged packets from entering the switch when PVID
is removed from the bridge port.
Please consider patches #2, #4 and #6 for 5.1.y
====================
Petr Machata [Tue, 11 Jun 2019 07:19:45 +0000 (10:19 +0300)]
mlxsw: spectrum_buffers: Reduce pool size on Spectrum-2
Due to an issue on Spectrum-2, in front-panel ports split four ways, 2 out
of 32 port buffers cannot be used. To work around this, the next FW release
will mark them as unused, and will report correspondingly lower total
shared buffer size. mlxsw will pick up the new value through a query to
cap_total_buffer_size resource. However the initial size for shared buffer
pool 0 is hard-coded and therefore needs to be updated.
Thus reduce the pool size by 2.7 MiB (which corresponds to 2/32 of the
total size of 42 MiB), and round down to the whole number of cells.
Fixes: fe099bf682ab ("mlxsw: spectrum_buffers: Add Spectrum-2 shared buffer configuration") Signed-off-by: Petr Machata <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ido Schimmel [Tue, 11 Jun 2019 07:19:40 +0000 (10:19 +0300)]
mlxsw: spectrum: Use different seeds for ECMP and LAG hash
The same hash function and seed are used for both ECMP and LAG hash.
Therefore, when a LAG device is used as a nexthop device as part of an
ECMP group, hash polarization can occur and all the traffic will be
hashed to a single LAG slave.
Fix this by using a different seed for the LAG hash.
John Fastabend [Wed, 12 Jun 2019 17:23:57 +0000 (17:23 +0000)]
net: tls, correctly account for copied bytes with multiple sk_msgs
tls_sw_do_sendpage needs to return the total number of bytes sent
regardless of how many sk_msgs are allocated. Unfortunately, copied
(the value we return up the stack) is zero'd before each new sk_msg
is allocated so we only return the copied size of the last sk_msg used.
The caller (splice, etc.) of sendpage will then believe only part
of its data was sent and send the missing chunks again. However,
because the data actually was sent the receiver will get multiple
copies of the same data.
To reproduce this do multiple sendfile calls with a length close to
the max record size. This will in turn call splice/sendpage, sendpage
may use multiple sk_msg in this case and then returns the incorrect
number of bytes. This will cause splice to resend creating duplicate
data on the receiver. Andre created a C program that can easily
generate this case so we will push a similar selftest for this to
bpf-next shortly.
The fix is to _not_ zero the copied field so that the total sent
bytes is returned.
Get the ingress interface and increment ICMP counters based on that
instead of skb->dev when the the dev is a VRF device.
This is a follow up on the following message:
https://www.spinics.net/lists/netdev/msg560268.html
v2: Avoid changing skb->dev since it has unintended effect for local
delivery (David Ahern). Signed-off-by: Stephen Suryaputra <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Using ethtool, users can specify a classification action matching on the
full vlan tag, which includes the DEI bit (also previously called CFI).
However, when converting the ethool_flow_spec to a flow_rule, we use
dissector keys to represent the matching patterns.
Since the vlan dissector key doesn't include the DEI bit, this
information was silently discarded when translating the ethtool
flow spec in to a flow_rule.
This commit adds the DEI bit into the vlan dissector key, and allows
propagating the information to the driver when parsing the ethtool flow
spec.
Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator") Reported-by: Michał Mirosław <[email protected]> Signed-off-by: Maxime Chevallier <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Matteo Croce [Wed, 12 Jun 2019 09:50:37 +0000 (11:50 +0200)]
mpls: fix af_mpls dependencies for real
Randy reported that selecting MPLS_ROUTING without PROC_FS breaks
the build, because since commit c1a9d65954c6 ("mpls: fix af_mpls
dependencies"), MPLS_ROUTING selects PROC_SYSCTL, but Kconfig's select
doesn't recursively handle dependencies.
Change the select into a dependency.
Ilya Maximets [Fri, 7 Jun 2019 17:27:32 +0000 (20:27 +0300)]
xdp: check device pointer before clearing
We should not call 'ndo_bpf()' or 'dev_put()' with NULL argument.
Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id") Signed-off-by: Ilya Maximets <[email protected]> Acked-by: Jonathan Lemon <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
Martin KaFai Lau [Tue, 11 Jun 2019 21:45:57 +0000 (14:45 -0700)]
bpf: net: Set sk_bpf_storage back to NULL for cloned sk
The cloned sk should not carry its parent-listener's sk_bpf_storage.
This patch fixes it by setting it back to NULL.
Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
David S. Miller [Tue, 11 Jun 2019 19:07:33 +0000 (12:07 -0700)]
Merge branch 'vxlan-geneve-linear'
Stefano Brivio says:
====================
Don't assume linear buffers in error handlers for VXLAN and GENEVE
Guillaume noticed the same issue fixed by commit 26fc181e6cac ("fou, fou6:
do not assume linear skbs") for fou and fou6 is also present in VXLAN and
GENEVE error handlers: we can't assume linear buffers there, we need to
use pskb_may_pull() instead.
====================
Stefano Brivio [Mon, 10 Jun 2019 22:27:06 +0000 (00:27 +0200)]
geneve: Don't assume linear buffers in error handler
In commit a07966447f39 ("geneve: ICMP error lookup handler") I wrongly
assumed buffers from icmp_socket_deliver() would be linear. This is not
the case: icmp_socket_deliver() only guarantees we have 8 bytes of linear
data.
Eric fixed this same issue for fou and fou6 in commits 26fc181e6cac
("fou, fou6: do not assume linear skbs") and 5355ed6388e2 ("fou, fou6:
avoid uninit-value in gue_err() and gue6_err()").
Use pskb_may_pull() instead of checking skb->len, and take into account
the fact we later access the GENEVE header with udp_hdr(), so we also
need to sum skb_transport_header() here.
Reported-by: Guillaume Nault <[email protected]> Fixes: a07966447f39 ("geneve: ICMP error lookup handler") Signed-off-by: Stefano Brivio <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Stefano Brivio [Mon, 10 Jun 2019 22:27:05 +0000 (00:27 +0200)]
vxlan: Don't assume linear buffers in error handler
In commit c3a43b9fec8a ("vxlan: ICMP error lookup handler") I wrongly
assumed buffers from icmp_socket_deliver() would be linear. This is not
the case: icmp_socket_deliver() only guarantees we have 8 bytes of linear
data.
Eric fixed this same issue for fou and fou6 in commits 26fc181e6cac
("fou, fou6: do not assume linear skbs") and 5355ed6388e2 ("fou, fou6:
avoid uninit-value in gue_err() and gue6_err()").
Use pskb_may_pull() instead of checking skb->len, and take into account
the fact we later access the VXLAN header with udp_hdr(), so we also
need to sum skb_transport_header() here.
Reported-by: Guillaume Nault <[email protected]> Fixes: c3a43b9fec8a ("vxlan: ICMP error lookup handler") Signed-off-by: Stefano Brivio <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Taehee Yoo [Sun, 9 Jun 2019 14:26:21 +0000 (23:26 +0900)]
net: openvswitch: do not free vport if register_netdevice() is failed.
In order to create an internal vport, internal_dev_create() is used and
that calls register_netdevice() internally.
If register_netdevice() fails, it calls dev->priv_destructor() to free
private data of netdev. actually, a private data of this is a vport.
Hence internal_dev_create() should not free and use a vport after failure
of register_netdevice().
Fixes: cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Taehee Yoo <[email protected]> Reviewed-by: Greg Rose <[email protected]> Signed-off-by: David S. Miller <[email protected]>
But it missed that zerocopy need not be passed at the first send. The
right test whether the uarg is newly allocated and thus has extra
refcnt 1 is not !skb, but !skb_zcopy.
Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE") Reported-by: syzbot <[email protected]> Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jonathan Lemon [Sat, 8 Jun 2019 19:54:19 +0000 (12:54 -0700)]
bpf: lpm_trie: check left child of last leftmost node for NULL
If the leftmost parent node of the tree has does not have a child
on the left side, then trie_get_next_key (and bpftool map dump) will
not look at the child on the right. This leads to the traversal
missing elements.
Lookup is not affected.
Update selftest to handle this case.
Reproducer:
bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
value 1 entries 256 name test_lpm flags 1
bpftool map update pinned /sys/fs/bpf/lpm key 8 0 0 0 0 0 value 1
bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0 0 128 value 2
bpftool map dump pinned /sys/fs/bpf/lpm
Returns only 1 element. (2 expected)
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE") Signed-off-by: Jonathan Lemon <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
John Hurley [Sun, 9 Jun 2019 00:48:03 +0000 (17:48 -0700)]
nfp: ensure skb network header is set for packet redirect
Packets received at the NFP driver may be redirected to egress of another
netdev (e.g. in the case of OvS internal ports). On the egress path, some
processes, like TC egress hooks, may expect the network header offset
field in the skb to be correctly set. If this is not the case there is
potential for abnormal behaviour and even the triggering of BUG() calls.
Set the skb network header field before the mac header pull when doing a
packet redirect.
Fixes: 27f54b582567 ("nfp: allow fallback packets from non-reprs") Signed-off-by: John Hurley <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yuchung Cheng [Sat, 8 Jun 2019 01:26:33 +0000 (18:26 -0700)]
tcp: fix undo spurious SYNACK in passive Fast Open
Commit 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK
retransmit") may cause tcp_fastretrans_alert() to warn about pending
retransmission in Open state. This is triggered when the Fast Open
server both sends data and has spurious SYNACK retransmission during
the handshake, and the data packets were lost or reordered.
The root cause is a bit complicated:
(1) Upon receiving SYN-data: a full socket is created with
snd_una = ISN + 1 by tcp_create_openreq_child()
(2) On SYNACK timeout the server/sender enters CA_Loss state.
(3) Upon receiving the final ACK to complete the handshake, sender
does not mark FLAG_SND_UNA_ADVANCED since (1)
Sender then calls tcp_process_loss since state is CA_loss by (2)
(4) tcp_process_loss() does not invoke undo operations but instead
mark REXMIT_LOST to force retransmission
(5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
changes state to CA_Open but has positive tp->retrans_out
(6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()
The step that goes wrong is (4) where the undo operation should
have been invoked because the ACK successfully acknowledged the
SYN sequence. This fixes that by specifically checking undo
when the SYN-ACK sequence is acknowledged. Then after
tcp_process_loss() the state would be further adjusted based
in tcp_fastretrans_alert() to avoid triggering the warning in (6).
Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
ibmvnic: Fixes for device reset handling
This series contains three unrelated fixes to issues seen during
device resets. The first patch fixes an error when the driver requests
to deactivate the link of an uninitialized device, resulting in a
failure to reset. Next, a patch to fix multicast transmission
failures seen after a driver reset. The final patch fixes mishandling
of memory allocation failures during device initialization, which
caused a kernel oops.
====================
Thomas Falcon [Fri, 7 Jun 2019 21:03:55 +0000 (16:03 -0500)]
ibmvnic: Fix unchecked return codes of memory allocations
The return values for these memory allocations are unchecked,
which may cause an oops if the driver does not handle them after
a failure. Fix by checking the function's return code.
Thomas Falcon [Fri, 7 Jun 2019 21:03:54 +0000 (16:03 -0500)]
ibmvnic: Refresh device multicast list after reset
It was observed that multicast packets were no longer received after
a device reset. The fix is to resend the current multicast list to
the backing device after recovery.
Thomas Falcon [Fri, 7 Jun 2019 21:03:53 +0000 (16:03 -0500)]
ibmvnic: Do not close unopened driver during reset
Check driver state before halting it during a reset. If the driver is
not running, do nothing. Otherwise, a request to deactivate a down link
can cause an error and the reset will fail.
Please pull and let me know if there is any problem.
For -stable v4.17
('net/mlx5: Avoid reloading already removed devices')
For -stable v5.0
('net/mlx5e: Avoid detaching non-existing netdev under switchdev mode')
For -stable v5.1
('net/mlx5e: Fix source port matching in fdb peer flow rule')
('net/mlx5e: Support tagged tunnel over bond')
('net/mlx5e: Add ndo_set_feature for uplink representor')
('net/mlx5: Update pci error handler entries and command translation')
====================
David S. Miller [Mon, 10 Jun 2019 02:44:01 +0000 (19:44 -0700)]
Merge tag 'linux-can-fixes-for-5.2-20190607' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2019-06-07
this is a pull reqeust of 9 patches for net/master.
The first patch is by Alexander Dahl and removes a duplicate menu entry from
the Kconfig. The next patch by Joakim Zhang fixes the timeout in the flexcan
driver when setting small bit rates. Anssi Hannula's patch for the xilinx_can
driver fixes the bittiming_const for CAN FD core. The two patches by Sean
Nyekjaer bring mcp25625 to the existing mcp251x driver. The patch by Eugen
Hristev implements an errata for the m_can driver. YueHaibing's patch fixes the
error handling ing can_init(). The patch by Fabio Estevam for the flexcan
driver removes an unneeded registration message during flexcan_probe(). And the
last patch is by Willem de Bruijn and adds the missing purging the socket
error queue on sock destruct.
====================
George Wilkie [Fri, 7 Jun 2019 10:49:41 +0000 (11:49 +0100)]
mpls: fix warning with multi-label encap
If you configure a route with multiple labels, e.g.
ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4
A warning is logged:
kernel: [ 130.561819] netlink: 'ip': attribute type 1 has an invalid
length.
This happens because mpls_iptunnel_policy has set the type of
MPLS_IPTUNNEL_DST to fixed size NLA_U32.
Change it to a minimum size.
nla_get_labels() does the remaining validation.
Michael Schmitz [Fri, 7 Jun 2019 05:37:34 +0000 (17:37 +1200)]
net: phy: rename Asix Electronics PHY driver
[Resent to net instead of net-next - may clash with Anders Roxell's patch
series addressing duplicate module names]
Commit 31dd83b96641 ("net-next: phy: new Asix Electronics PHY driver")
introduced a new PHY driver drivers/net/phy/asix.c that causes a module
name conflict with a pre-existiting driver (drivers/net/usb/asix.c).
The PHY driver is used by the X-Surf 100 ethernet card driver, and loaded
by that driver via its PHY ID. A rename of the driver looks unproblematic.
Rename PHY driver to ax88796b.c in order to resolve name conflict.
Eric Dumazet [Thu, 6 Jun 2019 21:32:34 +0000 (14:32 -0700)]
ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
Before taking a refcount, make sure the object is not already
scheduled for deletion.
Same fix is needed in ipv6_flowlabel_opt()
Fixes: 18367681a10b ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Signed-off-by: Eric Dumazet <[email protected]> Cc: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
CC net/ipv4/fib_semantics.o
net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw':
net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
if (!tbl || err) {
^~
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
32-bits for alu32 ops, from Björn and Luke with selftests covering all
relevant BPF alu ops from Björn and Jiong.
2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
__udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
that skb->data is pointing to transport header, from Martin.
3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
workqueue, and a missing restore of sk_write_space when psock gets dropped,
from Jakub and John.
4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
breaks standard applications like DNS if reverse NAT is not performed upon
receive, from Daniel.
5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
fails to verify that the length of the tuple is long enough, from Lorenz.
6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
{un,}successful probe) as that is expected to be propagated as an fd to
load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
from Michal.
7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.
8) Minor misc fixes in docs, samples and selftests, from various others.
====================
Eli Britstein [Sun, 2 Jun 2019 13:47:59 +0000 (13:47 +0000)]
net/mlx5e: Support tagged tunnel over bond
Stacked devices like bond interface may have a VLAN device on top of
them. Detect lag state correctly under this condition, and return the
correct routed net device, according to it the encap header is built.
Fixes: e32ee6c78efa ("net/mlx5e: Support tunnel encap over tagged Ethernet") Signed-off-by: Eli Britstein <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Alaa Hleihel [Sun, 26 May 2019 08:56:27 +0000 (11:56 +0300)]
net/mlx5e: Avoid detaching non-existing netdev under switchdev mode
After introducing dedicated uplink representor, the netdev instance
set over the esw manager vport (PF) became no longer in use, so it was
removed in the cited commit once we're on switchdev mode.
However, the mlx5e_detach function was not updated accordingly, and it
still tries to detach a non-existing netdev, causing a kernel crash.
Raed Salem [Sun, 2 Jun 2019 09:04:08 +0000 (12:04 +0300)]
net/mlx5e: Fix source port matching in fdb peer flow rule
The cited commit changed the initialization placement of the eswitch
attributes so it is done prior to parse tc actions function call,
including among others the in_rep and in_mdev fields which are mistakenly
reassigned inside the parse actions function.
This breaks the source port matching criteria of the peer redirect rule.
Fix by removing the now redundant reassignment of the already initialized
fields.
net/mlx5e: Replace reciprocal_scale in TX select queue function
The TX queue index returned by the fallback function ranges
between [0,NUM CHANNELS - 1] if QoS isn't set and
[0, (NUM CHANNELS)*(NUM TCs) -1] otherwise.
Our HW uses different TC mapping than the fallback function
(which is denoted as 'up', user priority) so we only need to extract
a channel number out of the returned value.
Since (NUM CHANNELS)*(NUM TCs) is a relatively small number, using
reciprocal scale almost always returns zero.
We instead access the 'txq2sq' table to extract the sq (and with it the
channel number) associated with the tx queue, thus getting
a more evenly distributed channel number.
Perf:
Rx/Tx side with Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz and ConnectX-5.
Used 'iperf' UDP traffic, 10 threads, and priority 5.
Before: 0.566Mpps
After: 2.37Mpps
As expected, releasing the existing bottleneck of steering all traffic
to TX queue zero significantly improves transmission rates.
Chris Mi [Thu, 16 May 2019 09:36:43 +0000 (17:36 +0800)]
net/mlx5e: Add ndo_set_feature for uplink representor
After we have a dedicated uplink representor, the new netdev ops
doesn't support ndo_set_feature. Because of that, we can't change
some features, eg. rxvlan. Now add it back.
In this patch, I also do a cleanup for the features flag handling,
eg. remove duplicate NETIF_F_HW_TC flag setting.
Fixes: aec002f6f82c ("net/mlx5e: Uninstantiate esw manager vport netdev on switchdev mode") Signed-off-by: Chris Mi <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Alaa Hleihel [Sun, 19 May 2019 08:11:49 +0000 (11:11 +0300)]
net/mlx5: Avoid reloading already removed devices
Prior to reloading a device we must first verify that it was not already
removed. Otherwise, the attempt to remove the device will do nothing, and
in that case we will end up proceeding with adding an new device that no
one was expecting to remove, leaving behind used resources such as EQs that
causes a failure to destroy comp EQs and syndrome (0x30f433).
Fix that by making sure that we try to remove and add a device (based on a
protocol) only if the device is already added.
Edward Srouji [Thu, 23 May 2019 16:45:38 +0000 (19:45 +0300)]
net/mlx5: Update pci error handler entries and command translation
Add missing entries for create/destroy UCTX and UMEM commands.
This could get us wrong "unknown FW command" error in flows
where we unbind the device or reset the driver.
Also the translation of these commands from opcodes to string
was missing.
Fixes: 6e3722baac04 ("IB/mlx5: Use the correct commands for UMEM and UCTX allocation") Signed-off-by: Edward Srouji <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
The reason for printing 'ptrval' is explained at
Documentation/core-api/printk-formats.rst:
"Pointers printed without a specifier extension (i.e unadorned %p) are
hashed to prevent leaking information about the kernel memory layout. This
has the added benefit of providing a unique identifier. On 64-bit machines
the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it
gathers enough entropy."
Instead of passing %pK, which can print the correct address, simply
remove the entire message as it is not really that useful.
Eugen Hristev [Mon, 4 Mar 2019 14:44:13 +0000 (14:44 +0000)]
can: m_can: implement errata "Needless activation of MRAF irq"
During frame reception while the MCAN is in Error Passive state and the
Receive Error Counter has thevalue MCAN_ECR.REC = 127, it may happen
that MCAN_IR.MRAF is set although there was no Message RAM access
failure. If MCAN_IR.MRAF is enabled, an interrupt to the Host CPU is
generated.
Work around:
The Message RAM Access Failure interrupt routine needs to check whether
MCAN_ECR.RP = '1' and MCAN_ECR.REC = '127'.
In this case, reset MCAN_IR.MRAF. No further action is required.
This affects versions older than 3.2.0
Errata explained on Sama5d2 SoC which includes this hardware block:
http://ww1.microchip.com/downloads/en/DeviceDoc/SAMA5D2-Family-Silicon-Errata-and-Data-Sheet-Clarification-DS80000803B.pdf
chapter 6.2
Reproducibility: If 2 devices with m_can are connected back to back,
configuring different bitrate on them will lead to interrupt storm on
the receiving side, with error "Message RAM access failure occurred".
Another way is to have a bad hardware connection. Bad wire connection
can lead to this issue as well.
This patch fixes the issue according to provided workaround.
can: xilinx_can: use correct bittiming_const for CAN FD core
Commit 9e5f1b273e6a ("can: xilinx_can: add support for Xilinx CAN FD
core") added a new can_bittiming_const structure for CAN FD cores that
support larger values for tseg1, tseg2, and sjw than previous Xilinx CAN
cores, but the commit did not actually take that into use.
Joakim Zhang [Thu, 31 Jan 2019 09:37:22 +0000 (09:37 +0000)]
can: flexcan: fix timeout when set small bitrate
Current we can meet timeout issue when setting a small bitrate like
10000 as follows on i.MX6UL EVK board (ipg clock = 66MHZ, per clock =
30MHZ):
| root@imx6ul7d:~# ip link set can0 up type can bitrate 10000
A link change request failed with some changes committed already.
Interface can0 may have been left with an inconsistent configuration,
please check.
| RTNETLINK answers: Connection timed out
It is caused by calling of flexcan_chip_unfreeze() timeout.
Originally the code is using usleep_range(10, 20) for unfreeze
operation, but the patch (8badd65 can: flexcan: avoid calling
usleep_range from interrupt context) changed it into udelay(10) which is
only a half delay of before, there're also some other delay changes.
After double to FLEXCAN_TIMEOUT_US to 100 can fix the issue.
Meanwhile, Rasmus Villemoes reported that even with a timeout of 100,
flexcan_probe() fails on the MPC8309, which requires a value of at least
140 to work reliably. 250 works for everyone.