Paolo Abeni [Fri, 7 Jan 2022 00:20:17 +0000 (16:20 -0800)]
mptcp: cleanup accept and poll
After the previous patch, msk->subflow will never be deleted during
the whole msk lifetime. We don't need anymore to acquire references to
it in mptcp_stream_accept() and we can use the listener subflow accept
queue to simplify mptcp_poll() for listener socket.
Overall this removes a lock pair and 4 more atomic operations per
accept().
David S. Miller [Fri, 7 Jan 2022 11:10:57 +0000 (11:10 +0000)]
Merge tag 'mlx5-updates-2022-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-01-06
1) Expose FEC per lane block counters via ethtool
2) Trivial fixes/updates/cleanup to mlx5e netdev driver
3) Fix htmldoc build warning
4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5
Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.
In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.
In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.
In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.
This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.
Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.
Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%
Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.
- The first two entries (#1 and #2) show current state. e.g.: SFs are
using the same CPU. The last two entries (#3 and #4) shows the latency
reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
write the latency of each netperf instances separately.
Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \
-o RT_LATENCY -r8 & done
- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \
1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done
Niklas Schnelle [Wed, 5 Jan 2022 15:10:54 +0000 (16:10 +0100)]
s390/pci: simplify __pciwb_mio() inline asm
The PCI Write Barrier instruction ignores the registers encoded in it.
There is thus no need to explicitly set the register to zero or to
associate it with a variable at all. In the resulting binary this removes
an unnecessary lghi and it makes the code simpler.
Vlastimil Babka [Fri, 7 Jan 2022 10:13:28 +0000 (11:13 +0100)]
Merge branch 'for-5.17/struct-slab' into for-linus
Series "Separate struct slab from struct page" v4
This is originally an offshoot of the folio work by Matthew. One of the more
complex parts of the struct page definition are the parts used by the slab
allocators. It would be good for the MM in general if struct slab were its own
data type, and it also helps to prevent tail pages from slipping in anywhere.
As Matthew requested in his proof of concept series, I have taken over the
development of this series, so it's a mix of patches from him (often modified
by me) and my own.
One big difference is the use of coccinelle to perform the relatively trivial
parts of the conversions automatically and at once, instead of a larger number
of smaller incremental reviewable steps. Thanks to Julia Lawall and Luis
Chamberlain for all their help!
Another notable difference is (based also on review feedback) I don't represent
with a struct slab the large kmalloc allocations which are not really a slab,
but use page allocator directly. When going from an object address to a struct
slab, the code tests first folio slab flag, and only if it's set it converts to
struct slab. This makes the struct slab type stronger.
Finally, although Matthew's version didn't use any of the folio work, the
initial support has been merged meanwhile so my version builds on top of it
where appropriate. This eliminates some of the redundant compound_head()
being performed e.g. when testing the slab flag.
To sum up, after this series, struct page fields used by slab allocators are
moved from struct page to a new struct slab, that uses the same physical
storage. The availability of the fields is further distinguished by the
selected slab allocator implementation. The advantages include:
- Similar to folios, if the slab is of order > 0, struct slab always is
guaranteed to be the head page. Additionally it's guaranteed to be an actual
slab page, not a large kmalloc. This removes uncertainty and potential for
bugs.
- It's not possible to accidentally use fields of the slab implementation that's
not configured.
- Other subsystems cannot use slab's fields in struct page anymore (some
existing non-slab usages had to be adjusted in this series), so slab
implementations have more freedom in rearranging them in the struct slab.
Bluetooth: btintel: Fix broken LED quirk for legacy ROM devices
This patch fixes the broken LED quirk for Intel legacy ROM devices.
To fix the LED issue that doesn't turn off immediately, the host sends
the SW RFKILL command while shutting down the interface and it puts the
devices in SW RFKILL state.
Once the device is in SW RFKILL state, it can only accept HCI_Reset to
exit from the SW RFKILL state. This patch checks the quirk for broken
LED and sends the HCI_Reset before sending the HCI_Intel_Read_Version
command.
The affected legacy ROM devices are
- 8087:07dc
- 8087:0a2a
- 8087:0aa7
Fixes: ffcba827c0a1d ("Bluetooth: btintel: Fix the LED is not turning off immediately") Signed-off-by: Tedd Ho-Jeong An <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
Jakub Kicinski [Fri, 7 Jan 2022 04:06:32 +0000 (20:06 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-01-06
Victor adds restoring of advanced rules after reset.
Wojciech improves usage of switchdev control VSI by utilizing the
device's advanced rules for forwarding.
Christophe Jaillet removes some unneeded calls to zero bitmaps, changes
some bitmap operations that don't need to be atomic, and converts a
kfree() to a more appropriate bitmap_free().
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Use bitmap_free() to free bitmap
ice: Optimize a few bitmap operations
ice: Slightly simply ice_find_free_recp_res_idx
ice: improve switchdev's slow-path
ice: replay advanced rules after reset
====================
Jakub Kicinski [Fri, 7 Jan 2022 04:00:48 +0000 (20:00 -0800)]
Merge branch 'mlxsw-add-spectrum-4-support'
Ido Schimmel says:
====================
mlxsw: Add Spectrum-4 support
This patchset adds Spectrum-4 support in mlxsw. It builds on top of a
previous patchset merged in commit 10184da91666 ("Merge branch
'mlxsw-Spectrum-4-prep'") and makes two additional changes before adding
Spectrum-4 support.
Patchset overview:
Patches #1-#2 add a few Spectrum-4 specific variants of existing ACL
keys. The new variants are needed because the size of certain key
elements (e.g., local port) was increased in Spectrum-4.
Patches #3-#6 are preparations.
Patch #7 implements the Spectrum-4 variant of the Bloom filter hash
function. The Bloom filter is used to optimize ACL lookups by
potentially skipping certain lookups if they are guaranteed not to
match. See additional info in merge commit ae6750e0a5ef ("Merge branch
'mlxsw-spectrum_acl-Add-Bloom-filter-support'").
Amit Cohen [Thu, 6 Jan 2022 16:06:51 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Add support for Spectrum-4 calculation
Spectrum-4 will calculate hash function for bloom filter differently
from the existing ASICs.
First, two hash functions will be used to calculate 16 bits result.
The final result will be combination of the two results - 6 bits which
are result of CRC-6 will be used as MSB and 10 bits which are result of
CRC-10 will be used as LSB.
Second, while in Spectrum{2,3}, there is a padding in each chunk, so the
chunks use a sequence of whole bytes, in Spectrum-4 there is no padding,
so each chunk use 20 bytes minus 2 bits, so it is necessary to align the
chunks to be without holes.
Add dedicated 'mlxsw_sp_acl_bf_ops' for Spectrum-4 and add the required
tables for CRC calculations.
All the details are documented as part of the code for future use.
Amit Cohen [Thu, 6 Jan 2022 16:06:50 +0000 (18:06 +0200)]
mlxsw: Add operations structure for bloom filter calculation
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, add 'ops'
structure to allow handling different calculation for different ASICs.
Amit Cohen [Thu, 6 Jan 2022 16:06:49 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Rename Spectrum-2 specific objects for future use
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, rename CRC table
to include "sp2" prefix and "crc16", as next patch will add two additional
tables. In addition, rename all the dedicated functions and defines for
Spectrum-{2,3} to include "sp2" prefix.
Amit Cohen [Thu, 6 Jan 2022 16:06:48 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Make mlxsw_sp_acl_bf_key_encode() more flexible
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
One of the changes is related to the way that the chunks will be build -
without padding.
As preparation for support of Spectrum-4 bloom filter, make
mlxsw_sp_acl_bf_key_encode() more flexible, so it will be able to use it
for Spectrum-4 as well.
Amit Cohen [Thu, 6 Jan 2022 16:06:46 +0000 (18:06 +0200)]
mlxsw: Introduce flex key elements for Spectrum-4
Spectrum-4 ASIC will support more virtual routers and local ports
compared to the existing ASICs. Therefore, the virtual router and local
port ACL key elements need to be increased.
Introduce new key elements for Spectrum-4 to be aligned with the elements
used already for other Spectrum ASICs.
The key blocks layout is the same for Spectrum-4, so use the existing
code for encode_block() and clear_block(), just create separate blocks.
Note that size of `VIRT_ROUTER_MSB` is 4 bits in Spectrum-4,
therefore declare it using `MLXSW_AFK_ELEMENT_INST_U32()`, in order to
be able to set `.avoid_size_check` to true.
Otherwise, `mlxsw_afk_blocks_check()` will fail and warn.
Amit Cohen [Thu, 6 Jan 2022 16:06:45 +0000 (18:06 +0200)]
mlxsw: Rename virtual router flex key element
In Spectrum-4, the size of the virtual router ACL key element increased
from 11 bits to 12 bits.
In order to reuse the existing virtual router ACL key element
enumerators for Spectrum-4, rename 'VIRT_ROUTER_8_10' and
'VIRT_ROUTER_0_7' to 'VIRT_ROUTER_MSB' and 'VIRT_ROUTER_LSB',
respectively.
Jakub Kicinski [Fri, 7 Jan 2022 03:56:31 +0000 (19:56 -0800)]
Merge branch 'dpaa2-eth-small-cleanup'
Ioana Ciornei says:
====================
dpaa2-eth: small cleanup
These 3 patches are just part of a small cleanup on the dpaa2-eth and
the dpaa2-switch drivers.
In case we are hitting a case in which the fwnode of the root dprc
device we initiate a deferred probe. On the dpaa2-switch side, if we are
on the remove path, make sure that we check for a non-NULL pointer
before accessing the port private structure.
====================
Ioana Ciornei [Thu, 6 Jan 2022 13:59:05 +0000 (15:59 +0200)]
dpaa2-switch: check if the port priv is valid
Before accessing the port private structure make sure that there is
still a non-NULL pointer there. A NULL pointer access can happen when we
are on the remove path, some switch ports are unregistered and some are
in the process of unregistering.
Ioana Ciornei [Thu, 6 Jan 2022 13:59:04 +0000 (15:59 +0200)]
dpaa2-mac: return -EPROBE_DEFER from dpaa2_mac_open in case the fwnode is not set
We could get into a situation when the fwnode of the parent device is
not yet set because its probe didn't yet finish. When this happens, any
caller of the dpaa2_mac_open() will not have the fwnode available, thus
cause problems at the PHY connect time.
Avoid this by just returning -EPROBE_DEFER from the dpaa2_mac_open when
this happens.
The parent pointer node handler must be declared with a NULL
initializer. Before using it, a check must be performed to make
sure that a valid address has been assigned to it.
The following patchset contains Netfilter fixes for net:
1) Refcount leak in ipt_CLUSTERIP rule loading path, from Xin Xiong.
2) Use socat in netfilter selftests, from Hangbin Liu.
3) Skip layer checksum 4 update for IP fragments.
4) Missing allocation of pcpu scratch maps on clone in
nft_set_pipapo, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nft_set_pipapo: allocate pcpu scratch maps on clone
netfilter: nft_payload: do not update layer 4 checksum when mangling fragments
selftests: netfilter: switch to socat for tests using -q option
netfilter: ipt_CLUSTERIP: fix refcount leak in clusterip_tg_check()
====================
Linus Torvalds [Fri, 7 Jan 2022 02:35:17 +0000 (18:35 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Last pull for 5.16, the reversion has been known for a while now but
didn't get a proper fix in time. Looks like we will have several
info-leak bugs to take care of going foward.
- Revert the patch fixing the DM related crash causing a widespread
regression for kernel ULPs. A proper fix just didn't appear this
cycle due to the holidays
- Missing NULL check on alloc in uverbs
- Double free in rxe error paths
- Fix a new kernel-infoleak report when forming ah_attr's without
GRH's in ucma"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/core: Don't infoleak GRH fields
RDMA/uverbs: Check for null return of kmalloc_array
Revert "RDMA/mlx5: Fix releasing unallocated memory in dereg MR flow"
RDMA/rxe: Prevent double freeing rxe_map_set()
We've added 41 non-merge commits during the last 2 day(s) which contain
a total of 36 files changed, 1214 insertions(+), 368 deletions(-).
The main changes are:
1) Various fixes in the verifier, from Kris and Daniel.
2) Fixes in sockmap, from John.
3) bpf_getsockopt fix, from Kuniyuki.
4) INET_POST_BIND fix, from Menglong.
5) arm64 JIT fix for bpf pseudo funcs, from Hou.
6) BPF ISA doc improvements, from Christoph.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
bpf: selftests: Add bind retry for post_bind{4, 6}
bpf: selftests: Use C99 initializers in test_sock.c
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
bpf/selftests: Test bpf_d_path on rdonly_mem.
libbpf: Add documentation for bpf_map batch operations
selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3
xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames
xdp: Move conversion to xdp_frame out of map functions
page_pool: Store the XDP mem id
page_pool: Add callback to init pages when they are allocated
xdp: Allow registering memory model without rxq reference
samples/bpf: xdpsock: Add timestamp for Tx-only operation
samples/bpf: xdpsock: Add time-out for cleaning Tx
samples/bpf: xdpsock: Add sched policy and priority support
samples/bpf: xdpsock: Add cyclic TX operation capability
samples/bpf: xdpsock: Add clockid selection support
samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation
samples/bpf: xdpsock: Add VLAN support for Tx-only operation
libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API
libbpf 1.0: Deprecate bpf_map__is_offload_neutral()
...
====================
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The second patch use C99 initializers in test_sock.c
The third patch is the selftests for this modification.
Changes since v4:
- use C99 initializers in test_sock.c before adding the test case
Changes since v3:
- add the third patch which use C99 initializers in test_sock.c
Changes since v2:
- NULL check for sk->sk_prot->put_port
Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
====================
Menglong Dong [Thu, 6 Jan 2022 13:20:22 +0000 (21:20 +0800)]
bpf: selftests: Add bind retry for post_bind{4, 6}
With previous patch, kernel is able to 'put_port' after sys_bind()
fails. Add the test for that case: rebind another port after
sys_bind() fails. If the bind success, it means previous bind
operation is already undoed.
Menglong Dong [Thu, 6 Jan 2022 13:20:20 +0000 (21:20 +0800)]
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The reverted commit had added a retry mechanism to the command entry
index allocation. The previous patch ensures that there is a free
command entry index once the command work handler holds the command
semaphore. Thus the retry mechanism is not needed.
Fixes: 410bd754cd73 ("net/mlx5: Add retry mechanism to the command entry index allocation") Signed-off-by: Moshe Shemesh <[email protected]> Reviewed-by: Eran Ben Elisha <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Moshe Shemesh [Sun, 5 Dec 2021 10:07:49 +0000 (12:07 +0200)]
net/mlx5: Set command entry semaphore up once got index free
Avoid a race where command work handler may fail to allocate command
entry index, by holding the command semaphore down till command entry
index is being freed.
Fixes: 410bd754cd73 ("net/mlx5: Add retry mechanism to the command entry index allocation") Signed-off-by: Moshe Shemesh <[email protected]> Reviewed-by: Eran Ben Elisha <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Maor Dickman [Mon, 3 Jan 2022 13:04:18 +0000 (15:04 +0200)]
net/mlx5e: Sync VXLAN udp ports during uplink representor profile change
Currently during NIC profile disablement all VXLAN udp ports offloaded to the
HW are flushed and during its enablement the driver send notification to
the stack to inform the core that the entire UDP tunnel port state has been
lost, uplink representor doesn't have the same behavior which can cause
VXLAN udp ports offload to be in bad state while moving between modes while
VXLAN interface exist.
Fixed by aligning the uplink representor profile behavior to the NIC behavior.
Fixes: 84db66124714 ("net/mlx5e: Move set vxlan nic info to profile init") Signed-off-by: Maor Dickman <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Shay Drory [Thu, 30 Dec 2021 06:54:08 +0000 (08:54 +0200)]
net/mlx5: Fix access to sf_dev_table on allocation failure
Even when SF devices are supported, the SF device table allocation
can still fail.
In such case mlx5_sf_dev_supported still reports true, but SF device
table is invalid. This can result in NULL table access.
Cells marked above are changed from original inner packet ip_ecn value.
Tc then matches on the modified inner ip_ecn, but hw offload which
matches the inner ip_ecn value before decap, will fail.
Fix that by mapping all the cases of outer and inner ip_ecn matching,
and only supporting cases where we know inner wouldn't be changed by
decap, or in the outer ip_ecn=CE case, inner ip_ecn didn't matter.
Although the NIC doesn't support offload of outer header CSUM, using
gso_partial_features allows offloading the tunnel's segmentation. The
driver relies on the stack CSUM calculation of the outer header. For
this, NETIF_F_GSO_GRE_CSUM must be a member of the device's features.
Fixes: 54e1217b9048 ("net/mlx5e: Block offload of outer header csum for GRE tunnel") Signed-off-by: Aya Levin <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Although the NIC doesn't support offload of outer header CSUM, using
gso_partial_features allows offloading the tunnel's segmentation. The
driver relies on the stack CSUM calculation of the outer header. For
this, NETIF_F_GSO_UDP_TUNNEL_CSUM must be a member of the device's
features.
Fixes: 6d6727dddc7f ("net/mlx5e: Block offload of outer header csum for UDP tunnels") Signed-off-by: Aya Levin <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Maor Dickman [Thu, 30 Dec 2021 09:20:10 +0000 (11:20 +0200)]
net/mlx5e: Don't block routes with nexthop objects in SW
Routes with nexthop objects is currently not supported by multipath offload
and any attempts to use it is blocked, however this also block adding SW
routes with nexthop.
Resolve this by returning NOTIFY_DONE instead of an error which will allow such
a route to be created in SW but not offloaded.
This fix also solve an issue which block adding such routes on different devices
due to missing check if the route FIB device is one of multipath devices.
Fixes: 6a87afc072c3 ("mlx5: Fail attempts to use routes with nexthop objects") Signed-off-by: Maor Dickman <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Maor Dickman [Wed, 29 Dec 2021 14:10:41 +0000 (16:10 +0200)]
net/mlx5e: Fix wrong usage of fib_info_nh when routes with nexthop objects are used
Creating routes with nexthop objects while in switchdev mode leads to access to
un-allocated memory and trigger bellow call trace due to hitting WARN_ON.
This is caused due to illegal usage of fib_info_nh in TC tunnel FIB event handling to
resolve the FIB device while fib_info built in with nexthop.
Fixed by ignoring attempts to use nexthop objects with routes until support can be
properly added.
Aya Levin [Thu, 23 Dec 2021 12:38:28 +0000 (14:38 +0200)]
net/mlx5e: Fix page DMA map/unmap attributes
Driver initiates DMA sync, hence it may skip CPU sync. Add
DMA_ATTR_SKIP_CPU_SYNC as input attribute both to dma_map_page and
dma_unmap_page to avoid redundant sync with the CPU.
When forcing the device to work with SWIOTLB, the extra sync might cause
data corruption. The driver unmaps the whole page while the hardware
used just a part of the bounce buffer. So syncing overrides the entire
page with bounce buffer that only partially contains real data.
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE") Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Aya Levin <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive:
+uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3).
...
Add the missing item in the first row.
Fixes: 0844fa5f7b89 ("net/mlx5: Let user configure io_eq_size param") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Gal Pressman [Wed, 22 Dec 2021 12:03:39 +0000 (14:03 +0200)]
net/mlx5e: Add recovery flow in case of error CQE
The rep legacy RQ completion handling was missing the appropriate
handling of error CQEs (dump the CQE and queue a recover work), fix it
by calling trigger_report() when needed.
Since all CQE handling flows do the exact same error CQE handling,
extract it to a common helper function.
Roi Dayan [Mon, 3 Jan 2022 08:57:01 +0000 (10:57 +0200)]
net/mlx5e: TC, Remove redundant error logging
Remove redundant and trivial error logging when trying to
offload mirred device with unsupported devices.
Using OVS could hit those a lot and the errors are still
logged in extack.
Gal Pressman [Mon, 29 Nov 2021 08:57:31 +0000 (10:57 +0200)]
net/mlx5e: Move HW-GRO and CQE compression check to fix features flow
Feature dependencies should be resolved in fix features rather than in
set features flow. Move the check that disables HW-GRO in case CQE
compression is enabled from set_feature_hw_gro() to
mlx5e_fix_features().
Maor Dickman [Thu, 9 Dec 2021 12:03:01 +0000 (14:03 +0200)]
net/mlx5e: Unblock setting vid 0 for VF in case PF isn't eswitch manager
When using libvirt to passthrough VF to VM it will always set the VF vlan
to 0 even if user didn’t request it, this will cause libvirt to fail to
boot in case the PF isn't eswitch owner.
Example of such case is the DPU host PF which isn't eswitch manager, so
any attempt to passthrough VF of it using libvirt will fail.
Fix it by not returning error in case set VF vlan is called with vid 0.
Lama Kayal [Mon, 13 Sep 2021 13:06:35 +0000 (16:06 +0300)]
net/mlx5e: Expose FEC counters via ethtool
Add FEC counters' statistics of corrected_blocks and
uncorrectable_blocks, along with their lanes via ethtool.
HW supports corrected_blocks and uncorrectable_blocks counters both for
RS-FEC mode and FC-FEC mode. In FC mode these counters are accumulated
per lane, while in RS mode the correction method crosses lanes, thus
only total corrected_blocks and uncorrectable_blocks are reported in
this mode.
Maher Sanalla [Wed, 5 Jan 2022 12:50:11 +0000 (14:50 +0200)]
net/mlx5: Update log_max_qp value to FW max capability
log_max_qp in driver's default profile #2 was set to 18, but FW actually
supports 17 at the most - a situation that led to the concerning print
when the driver is loaded:
"log_max_qp value in current profile is 18, changing to HCA capabaility
limit (17)"
The expected behavior from mlx5_profile #2 is to match the maximum FW
capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is
initialized to a defined static value (0xff) - which basically means that
when loading this profile, log_max_qp value will be what the currently
installed FW supports at most.
Shay Drory [Tue, 23 Nov 2021 10:50:19 +0000 (12:50 +0200)]
net/mlx5: SF, Use all available cpu for setting cpu affinity
Currently all SFs are using the same CPUs. Spreading SF over CPUs, in
round-robin manner, in order to achieve better distribution of the SFs
over available CPUs.
Shay Drory [Sun, 12 Dec 2021 12:51:27 +0000 (14:51 +0200)]
net/mlx5: Introduce API for bulk request and release of IRQs
Currently IRQs are requested one by one. To balance spreading IRQs
among cpus using such scheme requires remembering cpu mask for the
cpus used for a given device. This complicates the IRQ allocation
scheme in subsequent patch.
Hence, prepare the code for bulk IRQs allocation. This enables
spreading IRQs among cpus in subsequent patch.
Shay Drory [Tue, 23 Nov 2021 08:48:07 +0000 (10:48 +0200)]
net/mlx5: Split irq_pool_affinity logic to new file
The downstream patches add more functionality to irq_pool_affinity.
Move the irq_pool_affinity logic to a new file in order to ease the
coding and maintenance of it.
Shay Drory [Sun, 14 Nov 2021 11:01:21 +0000 (13:01 +0200)]
net/mlx5: Introduce control IRQ request API
Currently, IRQ layer have a separate flow for ctrl and comp IRQs, and
the distinction between ctrl and comp IRQs is done in the IRQ layer.
In order to ease the coding and maintenance of the IRQ layer,
introduce a new API for requesting control IRQs -
mlx5_ctrl_irq_request(struct mlx5_core_dev *dev).
Jann Horn [Mon, 3 Jan 2022 15:59:31 +0000 (16:59 +0100)]
random: don't reset crng_init_cnt on urandom_read()
At the moment, urandom_read() (used for /dev/urandom) resets crng_init_cnt
to zero when it is called at crng_init<2. This is inconsistent: We do it
for /dev/urandom reads, but not for the equivalent
getrandom(GRND_INSECURE).
(And worse, as Jason pointed out, we're only doing this as long as
maxwarn>0.)
crng_init_cnt is only read in crng_fast_load(); it is relevant at
crng_init==0 for determining when to switch to crng_init==1 (and where in
the RNG state array to write).
As far as I understand:
- crng_init==0 means "we have nothing, we might just be returning the same
exact numbers on every boot on every machine, we don't even have
non-cryptographic randomness; we should shove every bit of entropy we
can get into the RNG immediately"
- crng_init==1 means "well we have something, it might not be
cryptographic, but at least we're not gonna return the same data every
time or whatever, it's probably good enough for TCP and ASLR and stuff;
we now have time to build up actual cryptographic entropy in the input
pool"
- crng_init==2 means "this is supposed to be cryptographically secure now,
but we'll keep adding more entropy just to be sure".
The current code means that if someone is pulling data from /dev/urandom
fast enough at crng_init==0, we'll keep resetting crng_init_cnt, and we'll
never make forward progress to crng_init==1. It seems to be intended to
prevent an attacker from bruteforcing the contents of small individual RNG
inputs on the way from crng_init==0 to crng_init==1, but that's misguided;
crng_init==1 isn't supposed to provide proper cryptographic security
anyway, RNG users who care about getting secure RNG output have to wait
until crng_init==2.
This code was inconsistent, and it probably made things worse - just get
rid of it.
random: avoid superfluous call to RDRAND in CRNG extraction
RDRAND is not fast. RDRAND is actually quite slow. We've known this for
a while, which is why functions like get_random_u{32,64} were converted
to use batching of our ChaCha-based CRNG instead.
Yet CRNG extraction still includes a call to RDRAND, in the hot path of
every call to get_random_bytes(), /dev/urandom, and getrandom(2).
This call to RDRAND here seems quite superfluous. CRNG is already
extracting things based on a 256-bit key, based on good entropy, which
is then reseeded periodically, updated, backtrack-mutated, and so
forth. The CRNG extraction construction is something that we're already
relying on to be secure and solid. If it's not, that's a serious
problem, and it's unlikely that mixing in a measly 32 bits from RDRAND
is going to alleviate things.
And in the case where the CRNG doesn't have enough entropy yet, we're
already initializing the ChaCha key row with RDRAND in
crng_init_try_arch_early().
Removing the call to RDRAND improves performance on an i7-11850H by
370%. In other words, the vast majority of the work done by
extract_crng() prior to this commit was devoted to fetching 32 bits of
RDRAND.
Previously, the ChaCha constants for the primary pool were only
initialized in crng_initialize_primary(), called by rand_initialize().
However, some randomness is actually extracted from the primary pool
beforehand, e.g. by kmem_cache_create(). Therefore, statically
initialize the ChaCha constants for the primary pool.
random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
Rather than an awkward combination of ifdefs and __maybe_unused, we can
ensure more source gets parsed, regardless of the configuration, by
using IS_ENABLED for the CONFIG_NUMA conditional code. This makes things
cleaner and easier to follow.
I've confirmed that on !CONFIG_NUMA, we don't wind up with excess code
by accident; the generated object file is the same.
If we're trusting bootloader randomness, crng_fast_load() is called by
add_hwgenerator_randomness(), which sets us to crng_init==1. However,
usually it is only called once for an initial 64-byte push, so bootloader
entropy will not mix any bytes into the input pool. So it's conceivable
that crng_init==1 when crng_initialize_primary() is called later, but
then the input pool is empty. When that happens, the crng state key will
be overwritten with extracted output from the empty input pool. That's
bad.
In contrast, if we're not trusting bootloader randomness, we call
crng_slow_load() *and* we call mix_pool_bytes(), so that later
crng_initialize_primary() isn't drawing on nothing.
In order to prevent crng_initialize_primary() from extracting an empty
pool, have the trusted bootloader case mirror that of the untrusted
bootloader case, mixing the input into the pool.
random: do not throw away excess input to crng_fast_load
When crng_fast_load() is called by add_hwgenerator_randomness(), we
currently will advance to crng_init==1 once we've acquired 64 bytes, and
then throw away the rest of the buffer. Usually, that is not a problem:
When add_hwgenerator_randomness() gets called via EFI or DT during
setup_arch(), there won't be any IRQ randomness. Therefore, the 64 bytes
passed by EFI exactly matches what is needed to advance to crng_init==1.
Usually, DT seems to pass 64 bytes as well -- with one notable exception
being kexec, which hands over 128 bytes of entropy to the kexec'd kernel.
In that case, we'll advance to crng_init==1 once 64 of those bytes are
consumed by crng_fast_load(), but won't continue onward feeding in bytes
to progress to crng_init==2. This commit fixes the issue by feeding
any leftover bytes into the next phase in add_hwgenerator_randomness().
random: do not re-init if crng_reseed completes before primary init
If the bootloader supplies sufficient material and crng_reseed() is called
very early on, but not too early that wqs aren't available yet, then we
might transition to crng_init==2 before rand_initialize()'s call to
crng_initialize_primary() made. Then, when crng_initialize_primary() is
called, if we're trusting the CPU's RDRAND instructions, we'll
needlessly reinitialize the RNG and emit a message about it. This is
mostly harmless, as numa_crng_init() will allocate and then free what it
just allocated, and excessive calls to invalidate_batched_entropy()
aren't so harmful. But it is funky and the extra message is confusing,
so avoid the re-initialization all together by checking for crng_init <
2 in crng_initialize_primary(), just as we already do in crng_reseed().
random: fix crash on multiple early calls to add_bootloader_randomness()
Currently, if CONFIG_RANDOM_TRUST_BOOTLOADER is enabled, multiple calls
to add_bootloader_randomness() are broken and can cause a NULL pointer
dereference, as noted by Ivan T. Ivanov. This is not only a hypothetical
problem, as qemu on arm64 may provide bootloader entropy via EFI and via
devicetree.
On the first call to add_hwgenerator_randomness(), crng_fast_load() is
executed, and if the seed is long enough, crng_init will be set to 1.
On subsequent calls to add_bootloader_randomness() and then to
add_hwgenerator_randomness(), crng_fast_load() will be skipped. Instead,
wait_event_interruptible() and then credit_entropy_bits() will be called.
If the entropy count for that second seed is large enough, that proceeds
to crng_reseed().
However, both wait_event_interruptible() and crng_reseed() depends
(at least in numa_crng_init()) on workqueues. Therefore, test whether
system_wq is already initialized, which is a sufficient indicator that
workqueue_init_early() has progressed far enough.
If we wind up hitting the !system_wq case, we later want to do what
would have been done there when wqs are up, so set a flag, and do that
work later from the rand_initialize() call.
Reported-by: Ivan T. Ivanov <[email protected]> Fixes: 18b915ac6b0a ("efi/random: Treat EFI_RNG_PROTOCOL output as bootloader randomness") Cc: [email protected] Signed-off-by: Dominik Brodowski <[email protected]>
[Jason: added crng_need_done state and related logic.] Signed-off-by: Jason A. Donenfeld <[email protected]>
random: do not sign extend bytes for rotation when mixing
By using `char` instead of `unsigned char`, certain platforms will sign
extend the byte when `w = rol32(*bytes++, input_rotate)` is called,
meaning that bit 7 is overrepresented when mixing. This isn't a real
problem (unless the mixer itself is already broken) since it's still
invertible, but it's not quite correct either. Fix this by using an
explicit unsigned type.
This commit addresses one of the lower hanging fruits of the RNG: its
usage of SHA1.
BLAKE2s is generally faster, and certainly more secure, than SHA1, which
has [1] been [2] really [3] very [4] broken [5]. Additionally, the
current construction in the RNG doesn't use the full SHA1 function, as
specified, and allows overwriting the IV with RDRAND output in an
undocumented way, even in the case when RDRAND isn't set to "trusted",
which means potential malicious IV choices. And its short length means
that keeping only half of it secret when feeding back into the mixer
gives us only 2^80 bits of forward secrecy. In other words, not only is
the choice of hash function dated, but the use of it isn't really great
either.
This commit aims to fix both of these issues while also keeping the
general structure and semantics as close to the original as possible.
Specifically:
a) Rather than overwriting the hash IV with RDRAND, we put it into
BLAKE2's documented "salt" and "personal" fields, which were
specifically created for this type of usage.
b) Since this function feeds the full hash result back into the
entropy collector, we only return from it half the length of the
hash, just as it was done before. This increases the
construction's forward secrecy from 2^80 to a much more
comfortable 2^128.
c) Rather than using the raw "sha1_transform" function alone, we
instead use the full proper BLAKE2s function, with finalization.
This also has the advantage of supplying 16 bytes at a time rather than
SHA1's 10 bytes, which, in addition to having a faster compression
function to begin with, means faster extraction in general. On an Intel
i7-11850H, this commit makes initial seeding around 131% faster.
BLAKE2s itself has the nice property of internally being based on the
ChaCha permutation, which the RNG is already using for expansion, so
there shouldn't be any issue with newness, funkiness, or surprising CPU
behavior, since it's based on something already in use.
In preparation for using blake2s in the RNG, we change the way that it
is wired-in to the build system. Instead of using ifdefs to select the
right symbol, we use weak symbols. And because ARM doesn't need the
generic implementation, we make the generic one default only if an arch
library doesn't need it already, and then have arch libraries that do
need it opt-in. So that the arch libraries can remain tristate rather
than bool, we then split the shash part from the glue code.
Eric Biggers [Mon, 20 Dec 2021 22:41:57 +0000 (16:41 -0600)]
random: fix data race on crng init time
_extract_crng() does plain loads of crng->init_time and
crng_global_init_time, which causes undefined behavior if
crng_reseed() and RNDRESEEDCRNG modify these corrently.
Use READ_ONCE() and WRITE_ONCE() to make the behavior defined.
Don't fix the race on crng->init_time by protecting it with crng->lock,
since it's not a problem for duplicate reseedings to occur. I.e., the
lockless access with READ_ONCE() is fine.
Fixes: d848e5f8e1eb ("random: add new ioctl RNDRESEEDCRNG") Fixes: e192be9d9a30 ("random: replace non-blocking pool with a Chacha20-based CRNG") Cc: [email protected] Signed-off-by: Eric Biggers <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
Eric Biggers [Mon, 20 Dec 2021 22:41:56 +0000 (16:41 -0600)]
random: fix data race on crng_node_pool
extract_crng() and crng_backtrack_protect() load crng_node_pool with a
plain load, which causes undefined behavior if do_numa_crng_init()
modifies it concurrently.
Fix this by using READ_ONCE(). Note: as per the previous discussion
https://lore.kernel.org/lkml/20211219025139[email protected]/T/#u,
READ_ONCE() is believed to be sufficient here, and it was requested that
it be used here instead of smp_load_acquire().
Also change do_numa_crng_init() to set crng_node_pool using
cmpxchg_release() instead of mb() + cmpxchg(), as the former is
sufficient here but is more lightweight.
irq: remove unused flags argument from __handle_irq_event_percpu()
The __IRQF_TIMER bit from the flags argument was used in
add_interrupt_randomness() to distinguish the timer interrupt from other
interrupts. This is no longer the case.
Remove the flags argument from __handle_irq_event_percpu().
Mark Brown [Wed, 1 Dec 2021 17:44:49 +0000 (17:44 +0000)]
random: document add_hwgenerator_randomness() with other input functions
The section at the top of random.c which documents the input functions
available does not document add_hwgenerator_randomness() which might lead
a reader to overlook it. Add a brief note about it.
Signed-off-by: Mark Brown <[email protected]>
[Jason: reorganize position of function in doc comment and also document
add_bootloader_randomness() while we're at it.] Signed-off-by: Jason A. Donenfeld <[email protected]>
Hao Luo [Thu, 6 Jan 2022 20:55:25 +0000 (12:55 -0800)]
bpf/selftests: Test bpf_d_path on rdonly_mem.
The second parameter of bpf_d_path() can only accept writable
memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not
be passed into bpf_d_path for modification. This patch adds
a selftest to verify this behavior.
This also updates the public API for the `keys` parameter
of `bpf_map_delete_batch()`, and both the
`keys` and `values` parameters of `bpf_map_update_batch()`
to be constants.
Linus Torvalds [Thu, 6 Jan 2022 23:00:43 +0000 (15:00 -0800)]
Merge tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three minor tracing fixes:
- Fix missing prototypes in sample module for direct functions
- Fix check of valid buffer in get_trace_buf()
- Fix annotations of percpu pointers"
* tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Tag trace_percpu_buffer as a percpu pointer
tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()
ftrace/samples: Add missing prototypes direct functions
Andrii Nakryiko [Thu, 6 Jan 2022 20:51:56 +0000 (12:51 -0800)]
selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3
PT_REGS*() macro on some architectures force-cast struct pt_regs to
other types (user_pt_regs, etc) and might drop volatile modifiers, if any.
Volatile isn't really required as pt_regs value isn't supposed to change
during the BPF program run, so this is correct behavior.
But progs/loop3.c relies on that volatile modifier to ensure that loop
is preserved. Fix loop3.c by declaring i and sum variables as volatile
instead. It preserves the loop and makes the test pass on all
architectures (including s390x which is currently broken).
Tejun Heo [Thu, 6 Jan 2022 21:02:29 +0000 (11:02 -1000)]
cgroup: Use open-time cgroup namespace for process migration perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's cgroup namespace which is
a potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes cgroup remember the cgroup namespace at the time of open
and uses it for migration permission checks instad of current's. Note that
this only applies to cgroup2 as cgroup1 doesn't have namespace support.
This also fixes a use-after-free bug on cgroupns reported in
Tejun Heo [Thu, 6 Jan 2022 21:02:29 +0000 (11:02 -1000)]
cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv
of->priv is currently used by each interface file implementation to store
private information. This patch collects the current two private data usages
into struct cgroup_file_ctx which is allocated and freed by the common path.
This allows generic private data which applies to multiple files, which will
be used to in the following patch.
Note that cgroup_procs iterator is now embedded as procs.iter in the new
cgroup_file_ctx so that it doesn't need to be allocated and freed
separately.
v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
cgroup_file_ctx as suggested by Linus.
v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
Converted. Didn't change to embedded allocation as cgroup1 pidlists get
stored for caching.
Tejun Heo [Thu, 6 Jan 2022 21:02:28 +0000 (11:02 -1000)]
cgroup: Use open-time credentials for process migraton perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's credentials which is a
potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes both cgroup2 and cgroup1 process migration interfaces to
use the credentials saved at the time of open (file->f_cred) instead of
current's.
Reported-by: "Eric W. Biederman" <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
The 'possible_idx' bitmap is set just after it is zeroed, so we can save
the first step.
The 'free_idx' bitmap is used only at the end of the function as the
result of a bitmap xor operation. So there is no need to explicitly
zero it before.
So, slightly simply the code and remove 2 useless 'bitmap_zero()' call
Wojciech Drewek [Tue, 26 Oct 2021 10:38:40 +0000 (12:38 +0200)]
ice: improve switchdev's slow-path
In current switchdev implementation, every VF PR is assigned to
individual ring on switchdev ctrl VSI. For slow-path traffic, there
is a mapping VF->ring done in software based on src_vsi value (by
calling ice_eswitch_get_target_netdev function).
With this change, HW solution is introduced which is more
efficient. For each VF, src MAC (VF's MAC) filter will be created,
which forwards packets to the corresponding switchdev ctrl VSI queue
based on src MAC address.
This filter has to be removed and then replayed in case of
resetting one VF. Keep information about this rule in repr->mac_rule,
thanks to that we know which rule has to be removed and replayed
for a given VF.
In case of CORE/GLOBAL all rules are removed
automatically. We have to take care of readding them. This is done
by ice_replay_vsi_adv_rule.
When driver leaves switchdev mode, remove all advanced rules
from switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi.
Flag repr->rule_added is needed because in some cases reset
might be triggered before VF sends request to add MAC.
Victor Raj [Tue, 26 Oct 2021 10:38:39 +0000 (12:38 +0200)]
ice: replay advanced rules after reset
ice_replay_vsi_adv_rule will replay advanced rules for a given VSI.
Exit this function when list of rules for given recipe is empty.
Do not add rule when given vsi_handle does not match vsi_handle
from the rule info.
Use ICE_MAX_NUM_RECIPES instead of ICE_SW_LKUP_LAST in order to find
advanced rules as well.
Wen Gu [Thu, 6 Jan 2022 12:42:08 +0000 (20:42 +0800)]
net/smc: Reset conn->lgr when link group registration fails
SMC connections might fail to be registered in a link group due to
unable to find a usable link during its creation. As a result,
smc_conn_create() will return a failure and most resources related
to the connection won't be applied or initialized, such as
conn->abort_work or conn->lnk.
If smc_conn_free() is invoked later, it will try to access the
uninitialized resources related to the connection, thus causing
a warning or crash.
This patch tries to fix this by resetting conn->lgr to NULL if an
abnormal exit occurs in smc_lgr_register_conn(), thus avoiding the
access to uninitialized resources in smc_conn_free().
Meanwhile, the new created link group should be terminated if smc
connections can't be registered in it. So smc_lgr_cleanup_early() is
modified to take care of link group only and invoked to terminate
unusable link group by smc_conn_create(). The call to smc_conn_free()
is moved out from smc_lgr_cleanup_early() to smc_conn_abort().
Miaoqian Lin [Fri, 24 Dec 2021 08:02:49 +0000 (08:02 +0000)]
Bluetooth: hci_qca: Fix NULL vs IS_ERR_OR_NULL check in qca_serdev_probe
The function devm_gpiod_get_index() return error pointers on error.
Thus devm_gpiod_get_index_optional() could return NULL and error pointers.
The same as devm_gpiod_get_optional() function. Using IS_ERR_OR_NULL()
check to catch error pointers.
Fixes: 77131dfe ("Bluetooth: hci_qca: Replace devm_gpiod_get() with devm_gpiod_get_optional()") Signed-off-by: Miaoqian Lin <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
Jiasheng Jiang [Fri, 24 Dec 2021 02:53:18 +0000 (10:53 +0800)]
Bluetooth: hci_bcm: Check for error irq
For the possible failure of the platform_get_irq(), the returned irq
could be error number and will finally cause the failure of the
request_irq().
Consider that platform_get_irq() can now in certain cases return
-EPROBE_DEFER, and the consequences of letting request_irq() effectively
convert that into -EINVAL, even at probe time rather than later on.
So it might be better to check just now.
Jiasheng Jiang [Thu, 6 Jan 2022 10:04:10 +0000 (18:04 +0800)]
fsl/fman: Check for null pointer after calling devm_ioremap
As the possible failure of the allocation, the devm_ioremap() may return
NULL pointer.
Take tgec_initialization() as an example.
If allocation fails, the params->base_addr will be NULL pointer and will
be assigned to tgec->regs in tgec_config().
Then it will cause the dereference of NULL pointer in set_mac_address(),
which is called by tgec_init().
Therefore, it should be better to add the sanity check after the calling
of the devm_ioremap().
Fixes: 3933961682a3 ("fsl/fman: Add FMan MAC driver") Signed-off-by: Jiasheng Jiang <[email protected]> Signed-off-by: David S. Miller <[email protected]>