Biju Das [Fri, 1 Oct 2021 15:06:36 +0000 (16:06 +0100)]
ravb: Initialize GbEthernet E-MAC
Initialize GbEthernet E-MAC found on RZ/G2L SoC.
This patch also renames ravb_set_rate to ravb_set_rate_rcar and
ravb_rcar_emac_init to ravb_emac_init_rcar to be consistent with
the naming convention used in sh_eth driver.
Biju Das [Fri, 1 Oct 2021 15:06:34 +0000 (16:06 +0100)]
ravb: Add magic_pkt to struct ravb_hw_info
E-MAC on R-Car supports magic packet detection, whereas RZ/G2L
does not support this feature. Add magic_pkt to struct ravb_hw_info
and enable this feature only for R-Car.
Biju Das [Fri, 1 Oct 2021 15:06:30 +0000 (16:06 +0100)]
ravb: Add support for RZ/G2L SoC
RZ/G2L SoC has Gigabit Ethernet IP consisting of Ethernet controller
(E-MAC), Internal TCP/IP Offload Engine (TOE) and Dedicated Direct
memory access controller (DMAC).
This patch adds compatible string for RZ/G2L and fills up the
ravb_hw_info struct. Function stubs are added which will be used by
gbeth_hw_info and will be filled incrementally.
Biju Das [Fri, 1 Oct 2021 15:06:29 +0000 (16:06 +0100)]
ravb: Add nc_queue to struct ravb_hw_info
R-Car supports network control queue whereas RZ/G2L does not support
it. Add nc_queue to struct ravb_hw_info, so that NC queue is handled
only by R-Car.
This patch also renames ravb_rcar_dmac_init to ravb_dmac_init_rcar
to be consistent with the naming convention used in sh_eth driver.
prevent_wake logic is backward since what it is really checking is
if the device may wakeup the system or not, not that it will prevent
the to be awaken.
Also looking on how other subsystems have the entry as power/wakeup
this also renames the force_prevent_wake to force_wakeup in vhci driver.
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire
sleeping locks with disabled preemption which would be required while
__napi_poll() invokes the callback of the driver.
A threaded interrupt performing the NAPI-poll can be preempted on PREEMPT_RT.
A RT thread on another CPU may observe NAPIF_STATE_SCHED bit set and busy-spin
until it is cleared or its spin time runs out. Given it is the task with the
highest priority it will never observe the NEED_RESCHED bit set.
In this case the time is better spent by simply sleeping.
The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for
poll/read are set to zero). Disabling NET_RX_BUSY_POLL on PREEMPT_RT to avoid
wrong locking context in case it is used.
Andrii Nakryiko [Fri, 1 Oct 2021 22:31:51 +0000 (15:31 -0700)]
Merge branch 'libbpf: Support uniform BTF-defined key/value specification across all BPF maps'
Hengqi Chen says:
====================
Currently a bunch of (usually pretty specialized) BPF maps do not support
specifying BTF types for they key and value. For such maps, specifying
their definition like this:
Would actually produce warnings about retrying BPF map creation without BTF.
Users are forced to know such nuances and use __uint(key_size, 4) instead.
This is non-uniform, annoying, and inconvenient.
This patch set teaches libbpf to recognize those specialized maps and removes
BTF type IDs when creating BPF map. Also, update existing BPF selftests to
exericse this change.
====================
selftests/bpf: Use BTF-defined key/value for map definitions
Change map definitions in BPF selftests to use BTF-defined
key/value types. This unifies the map definitions and ensures
libbpf won't emit warning about retrying map creation.
libbpf: Support uniform BTF-defined key/value specification across all BPF maps
A bunch of BPF maps do not support specifying BTF types for key and value.
This is non-uniform and inconvenient[0]. Currently, libbpf uses a retry
logic which removes BTF type IDs when BPF map creation failed. Instead
of retrying, this commit recognizes those specialized maps and removes
BTF type IDs when creating BPF map.
Jakub Kicinski [Fri, 1 Oct 2021 21:41:38 +0000 (14:41 -0700)]
Merge tag 'mlx5-updates-2021-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-09-30
1) From Yevgeny Kliteynik:
This patch series deals with vport handling in SW steering.
For every vport, SW steering queries FW for this vport's properties,
such as RX/TX ICM addresses to be able to add this vport as dest action.
The following patches rework vport capabilities managements and add support
for Scalable Functions (SFs).
- Patch 1 fixes the vport number data type all over the DR code to 16 bits
in accordance with HW spec.
- Patch 2 replaces local SW steering WIRE_PORT macro with the existing
mlx5 define.
- Patch 3 adds missing query for vport 0 and and handles eswitch manager
capabilities for ECPF (BlueField in embedded CPU mode).
- Patch 4 fixes error messages for failure to obtain vport caps from
different locations in the code to have the same verbosity level and
similar wording.
- Patch 5 adds support for csum recalculation flow tables on SFs: it
implements these FTs management in XArray instead of the fixed size array,
thus adding support for csum recalculation table for any valid vport.
- Patch 6 is the main patch of this whole series: it refactors vports
capabilities handling and adds SFs support.
2) Minor and trivial updates and cleanups
* tag 'mlx5-updates-2021-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Use array_size() helper
net/mlx5: Use struct_size() helper in kvzalloc()
net/mlx5: Use kvcalloc() instead of kvzalloc()
net/mlx5: Tolerate failures in debug features while driver load
net/mlx5: Warn for devlink reload when there are VFs alive
net/mlx5: DR, Add missing string for action type SAMPLER
net/mlx5: DR, init_next_match only if needed
net/mlx5: DR, Fix typo 'offeset' to 'offset'
net/mlx5: DR, Increase supported num of actions to 32
net/mlx5: DR, Add support for SF vports
net/mlx5: DR, Support csum recalculation flow table on SFs
net/mlx5: DR, Align error messages for failure to obtain vport caps
net/mlx5: DR, Add missing query for vport 0
net/mlx5: DR, Replace local WIRE_PORT macro with the existing MLX5_VPORT_UPLINK
net/mlx5: DR, Fix vport number data type to u16
====================
Johan Almbladh [Fri, 1 Oct 2021 13:03:48 +0000 (15:03 +0200)]
bpf/tests: Add test of LDX_MEM with operand aliasing
This patch adds a set of tests of BPF_LDX_MEM where both operand registers
are the same register. Mainly testing 32-bit JITs that may load a 64-bit
value in two 32-bit loads, and must not overwrite the address register.
Johan Almbladh [Fri, 1 Oct 2021 13:03:47 +0000 (15:03 +0200)]
bpf/tests: Add test of ALU shifts with operand register aliasing
This patch adds a tests of ALU32 and ALU64 LSH/RSH/ARSH operations for the
case when the two operands are the same register. Mainly intended to test
JITs that implement ALU64 shifts with 32-bit CPU instructions.
Also renamed related helper functions for consistency with the new tests.
Johan Almbladh [Fri, 1 Oct 2021 13:03:45 +0000 (15:03 +0200)]
bpf/tests: Add exhaustive tests of ALU register combinations
This patch replaces the current register combination test with new
exhaustive tests. Before, only a subset of register combinations was
tested for ALU64 DIV. Now, all combinatons of operand registers are
tested, including the case when they are the same, and for all ALU32
and ALU64 operations.
Johan Almbladh [Fri, 1 Oct 2021 13:03:44 +0000 (15:03 +0200)]
bpf/tests: Minor restructuring of ALU tests
This patch moves the ALU LSH/RSH/ARSH reference computations into the
common reference value function. Also fix typo in constants so they
now have the intended values.
Johan Almbladh [Fri, 1 Oct 2021 13:03:43 +0000 (15:03 +0200)]
bpf/tests: Add more tests for ALU and ATOMIC register clobbering
This patch expands the register-clobbering-during-function-call tests
to cover more all ALU32/64 MUL, DIV and MOD operations and all ATOMIC
operations. In short, if a JIT implements a complex operation with
a call to an external function, it must make sure to save and restore
all its caller-saved registers that may be clobbered by the call.
Johan Almbladh [Fri, 1 Oct 2021 13:03:42 +0000 (15:03 +0200)]
bpf/tests: Add tests to check source register zero-extension
This patch adds tests to check that the source register is preserved when
zero-extending a 32-bit value. In particular, it checks that the source
operand is not zero-extended in-place.
Johan Almbladh [Fri, 1 Oct 2021 13:03:41 +0000 (15:03 +0200)]
bpf/tests: Add exhaustive tests of BPF_ATOMIC magnitudes
This patch adds a series of test to verify the operation of BPF_ATOMIC
with BPF_DW and BPF_W sizes, for all power-of-two magnitudes of the
register value operand.
Also fixes a confusing typo in the comment for a related test.
Johan Almbladh [Fri, 1 Oct 2021 13:03:40 +0000 (15:03 +0200)]
bpf/tests: Add zero-extension checks in BPF_ATOMIC tests
This patch updates the existing tests of BPF_ATOMIC operations to verify
that a 32-bit register operand is properly zero-extended. In particular,
it checks the operation on archs that require 32-bit operands to be
properly zero-/sign-extended or the result is undefined, e.g. MIPS64.
Johan Almbladh [Fri, 1 Oct 2021 13:03:39 +0000 (15:03 +0200)]
bpf/tests: Add tests of BPF_LDX and BPF_STX with small sizes
This patch adds a series of tests to verify the behavior of BPF_LDX and
BPF_STX with BPF_B//W sizes in isolation. In particular, it checks that
BPF_LDX zero-extendeds the result, and that BPF_STX does not overwrite
adjacent bytes in memory.
BPF_ST and operations on BPF_DW size are deemed to be sufficiently
tested by existing tests.
fix up for "net: add new socket option SO_RESERVE_MEM"
Some architectures do not include uapi/asm/socket.h
Fixes: 2bb2f5fb21b0 ("net: add new socket option SO_RESERVE_MEM") Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jacob Keller [Thu, 30 Sep 2021 21:21:04 +0000 (14:21 -0700)]
devlink: report maximum number of snapshots with regions
Each region has an independently configurable number of maximum
snapshots. This information is not reported to userspace, making it not
very discoverable. Fix this by adding a new
DEVLINK_ATTR_REGION_MAX_SNAPSHOST attribute which is used to report this
maximum.
Ex:
$devlink region
pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1
pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10
pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1
pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10
This information enables users to understand why a new region command
may fail due to having too many existing snapshots.
David S. Miller [Fri, 1 Oct 2021 13:19:01 +0000 (14:19 +0100)]
Merge branch 'mctp-kunit-tests'
Jeremy Kerr says:
====================
MCTP kunit tests
This change adds some initial kunit tests for the MCTP core. We'll
expand the coverage in a future series, and augment with a few
selftests, but this establishes a baseline set of tests for now.
Thanks to the kunit folks for the framework!
====================
Jeremy Kerr [Fri, 1 Oct 2021 08:18:42 +0000 (16:18 +0800)]
mctp: Add packet rx tests
Add a few tests for the initial packet ingress through
mctp_pkttype_receive function; mainly packet header sanity checks. Full
input routing checks will be added as a separate change.
Some un-support wakeup platforms keep USB power and suspend signal
is coming late, this makes Realtek some chip keep its firmware,
and make it never load new firmware.
So use vendor specific HCI command to ask them drop its firmware after
system shutdown or resume.
net: sched: Use struct_size() helper in kvmalloc()
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worse scenario, could lead to heap overflows.
Lama Kayal [Sun, 1 Aug 2021 11:57:16 +0000 (14:57 +0300)]
net/mlx5: Warn for devlink reload when there are VFs alive
When performing PF reload, VF can't communicate with FW until
it recovers and reloads as well.
Add a warning message when performing devlink reload while
VFs are still present. Thus, giving a notice of an unfavorable
behavior that might occur as a result of a consequential reloads
and cause interruption of VF recovery.
Move all the vport capabilities to a separate struct and store vport caps
in XArray: SFs vport numbers will not come in the same range as VF vports,
so the existing implementation of vport capabilities as a fixed size array
is not suitable here.
XArray is a perfect fit: it is efficient when the indices used are densely
clustered. In addition to being a perfect fit as a dynamic data structure,
XArray also provides locking - it uses RCU and an internal spinlock to
synchronise access, so no additional protection needed.
Now except for the eswitch manager vport, all other vports (including the
uplink vport) are handled in the same way: when a new go-to-vport action
is added, this vport's caps are loaded from the xarray. If it is the first
time for this particular vport number, then its capabilities are queried
from FW and filled in into the appropriate entry.
net/mlx5: DR, Support csum recalculation flow table on SFs
Implement csum recalculation flow tables in XAarray instead of a fixed
array, thus adding support for csum recalc table on any valid vport
number, which enables this support for SFs.
Currently, vport 0 capabilities are not set.
To fix this, we now querying both eswitch manager and vport 0.
Eswitch manager has an access to all the vports - for eswitch manager PF, all
vports can be referred as other vports. The exception is embedded CPU mode,
where there is vport 0 of ECPF and the PF vport 0.
Here is how vport are queried:
For Connect-X5/6:
PF vport (0) and vports 1..n: vport number, other = true
esw_manager is vport 0 (PF)
For BlueField (in embedded CPU mode):
ECPF vport: vport = 0, other = false
PF vport (0) and 1..n: vport number, other = true
esw_manager = vport 0 (ECPF)
Also, note that there's no need for other_vport function parameter
in dr_domain_query_vport - this value is now deduced locally in the
function.
net/sched/sch_api.c b193e15ac69d ("net: prevent user from passing illegal stab size") 69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers")
Merge tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from mac80211, netfilter and bpf.
Current release - regressions:
- bpf, cgroup: assign cgroup in cgroup_sk_alloc when called from
interrupt
- mdio: revert mechanical patches which broke handling of optional
resources
- dev_addr_list: prevent address duplication
Previous releases - regressions:
- sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
(NULL deref)
- Revert "mac80211: do not use low data rates for data frames with no
ack flag", fixing broadcast transmissions
- mac80211: fix use-after-free in CCMP/GCMP RX
- netfilter: include zone id in tuple hash again, minimize collisions
- netfilter: nf_tables: unlink table before deleting it (race -> UAF)
- netfilter: log: work around missing softdep backend module
- mptcp: don't return sockets in foreign netns
- sched: flower: protect fl_walk() with rcu (race -> UAF)
- ixgbe: fix NULL pointer dereference in ixgbe_xdp_setup
- smsc95xx: fix stalled rx after link change
- enetc: fix the incorrect clearing of IF_MODE bits
- ipv4: fix rtnexthop len when RTA_FLOW is present
- dsa: mv88e6xxx: 6161: use correct MAX MTU config method for this
SKU
- e100: fix length calculation & buffer overrun in ethtool::get_regs
Previous releases - always broken:
- mac80211: fix using stale frag_tail skb pointer in A-MSDU tx
- mac80211: drop frames from invalid MAC address in ad-hoc mode
- af_unix: fix races in sk_peer_pid and sk_peer_cred accesses (race
-> UAF)
- bpf, x86: Fix bpf mapping of atomic fetch implementation
- bpf: handle return value of BPF_PROG_TYPE_STRUCT_OPS prog
- netfilter: ip6_tables: zero-initialize fragment offset
- mhi: fix error path in mhi_net_newlink
- af_unix: return errno instead of NULL in unix_create1() when over
the fs.file-max limit
Misc:
- bpf: exempt CAP_BPF from checks against bpf_jit_limit
- netfilter: conntrack: make max chain length random, prevent
guessing buckets by attackers
- netfilter: nf_nat_masquerade: make async masq_inet6_event handling
generic, defer conntrack walk to work queue (prevent hogging RTNL
lock)"
* tag 'net-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
net: stmmac: fix EEE init issue when paired with EEE capable PHYs
net: dev_addr_list: handle first address in __hw_addr_add_ex
net: sched: flower: protect fl_walk() with rcu
net: introduce and use lock_sock_fast_nested()
net: phy: bcm7xxx: Fixed indirect MMD operations
net: hns3: disable firmware compatible features when uninstall PF
net: hns3: fix always enable rx vlan filter problem after selftest
net: hns3: PF enable promisc for VF when mac table is overflow
net: hns3: fix show wrong state when add existing uc mac address
net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE
net: hns3: don't rollback when destroy mqprio fail
net: hns3: remove tc enable checking
net: hns3: do not allow call hns3_nic_net_open repeatedly
ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup
net: bridge: mcast: Associate the seqcount with its protecting lock.
net: mdio-ipq4019: Fix the error for an optional regs resource
net: hns3: fix hclge_dbg_dump_tm_pg() stack usage
net: mdio: mscc-miim: Fix the mdio controller
af_unix: Return errno instead of NULL in unix_create1().
...
Po-Hsu Lin [Wed, 29 Sep 2021 05:12:50 +0000 (13:12 +0800)]
selftests/bpf: Use kselftest skip code for skipped tests
There are several test cases in the bpf directory are still using
exit 0 when they need to be skipped. Use kselftest framework skip
code instead so it can help us to distinguish the return status.
Criterion to filter out what should be fixed in bpf directory:
grep -r "exit 0" -B1 | grep -i skip
This change might cause some false-positives if people are running
these test scripts directly and only checking their return codes,
which will change from 0 to 4. However I think the impact should be
small as most of our scripts here are already using this skip code.
And there will be no such issue if running them with the kselftest
framework.
Aya Levin [Mon, 13 Sep 2021 13:49:47 +0000 (16:49 +0300)]
net/mlx5e: Mutually exclude setting of TX-port-TS and MQPRIO in channel mode
TX-port-TS hijacks the PTP traffic to a specific HW TX-queue. This
conflicts with MQPRIO in channel mode, which specifies explicitly which
TC accepts the packet. This patch mutually excludes the above
configuration.
Lama Kayal [Sun, 29 Aug 2021 08:26:03 +0000 (11:26 +0300)]
net/mlx5e: Fix the presented RQ index in PTP stats
PTP-RQ counters title format contains PTP-RQ identifier, which is
mistakenly not passed to sprinft().
This leads to unexpected garbage values instead.
This patch fixes it.
When setting number of completion EQs of the SF, consider number of
online CPUs.
Without this consideration, when number of online cpus are less than 8,
unnecessary 8 completion EQs are allocated.
Aya Levin [Thu, 23 Sep 2021 12:30:01 +0000 (15:30 +0300)]
net/mlx5: Avoid generating event after PPS out in Real time mode
When in Real-time mode, HW clock is synced with the PTP daemon. Hence
driver should not re-calibrate the next pulse (via MTPPSE repetitive
events mechanism).
This patch arms repetitive events only in free-running mode.
Fixes: 432119de33d9 ("net/mlx5: Add cyc2time HW translation mode support") Signed-off-by: Aya Levin <[email protected]> Reviewed-by: Eran Ben Elisha <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Aya Levin [Thu, 23 Sep 2021 13:56:09 +0000 (16:56 +0300)]
net/mlx5: Force round second at 1PPS out start time
Allow configuration of 1PPS start time only with time-stamp representing
a round second. Prior to this patch driver allowed setting of a
non-round-second which is not supported by the device. Avoid unexpected
behavior by restricting start-time configuration to a round-second.
net/mlx5: E-Switch, Fix double allocation of acl flow counter
Flow counter is allocated in eswitch legacy acl setting functions
without checking if already allocated by previous setting. Add a check
to avoid such double allocation.
* Add netdev->tc_to_txq rollback in case of failure in
mlx5e_update_netdev_queues().
* Fix broken transition between the two modes:
MQPRIO DCB mode with tc==8, and MQPRIO channel mode.
* Disable MQPRIO channel mode if re-attaching with a different number
of channels.
* Improve code sharing.
net/mlx5e: Keep the value for maximum number of channels in-sync
The value for maximum number of channels is first calculated based
on the netdev's profile and current function resources (specifically,
number of MSIX vectors, which depends among other things on the number
of online cores in the system).
This value is then used to calculate the netdev's number of rxqs/txqs.
Once created (by alloc_etherdev_mqs), the number of netdev's rxqs/txqs
is constant and we must not exceed it.
To achieve this, keep the maximum number of channels in sync upon any
netdevice re-attach.
Use mlx5e_get_max_num_channels() for calculating the number of netdev's
rxqs/txqs. After netdev is created, use mlx5e_calc_max_nch() (which
coinsiders core device resources, profile, and netdev) to init or
update priv->max_nch.
Before this patch, the value of priv->max_nch might get out of sync,
mistakenly allowing accesses to out-of-bounds objects, which would
crash the system.
Track the number of channels stats structures used in a separate
field, as they are persistent to suspend/resume operations. All the
collected stats of every channel index that ever existed should be
preserved. They are reset only when struct mlx5e_priv is,
in mlx5e_priv_cleanup(), which is part of the profile changing flow.
There is no point anymore in blocking a profile change due to max_nch
mismatch in mlx5e_netdev_change_profile(). Remove the limitation.
Fixes: a1f240f18017 ("net/mlx5e: Adjust to max number of channles when re-attaching") Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Aya Levin <[email protected]> Reviewed-by: Maxim Mikityanskiy <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Raed Salem [Thu, 26 Aug 2021 14:07:17 +0000 (17:07 +0300)]
net/mlx5e: IPSEC RX, enable checksum complete
Currently in Rx data path IPsec crypto offloaded packets uses
csum_none flag, so checksum is handled by the stack, this naturally
have some performance/cpu utilization impact on such flows. As Nvidia
NIC starting from ConnectX6DX provides checksum complete value out of
the box also for such flows there is no sense in taking csum_none path,
furthermore the stack (xfrm) have the method to handle checksum complete
corrections for such flows i.e. IPsec trailer removal and consequently
checksum value adjustment.
Because of the above and in addition the ConnectX6DX is the first HW
which supports IPsec crypto offload then it is safe to report csum
complete for IPsec offloaded traffic.
Merge tag 'gpio-fixes-for-v5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"A single fix for the gpio-pca953x driver and two commits updating the
MAINTAINERS entries for Mun Yew Tham (GPIO specific) and myself
(treewide after a change in professional situation).
Summary:
- don't ignore I2C errors in gpio-pca953x
- update MAINTAINERS entries for Mun Yew Tham and myself"
* tag 'gpio-fixes-for-v5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Update Mun Yew Tham as Altera Pio Driver maintainer
MAINTAINERS: update my email address
gpio: pca953x: do not ignore i2c errors
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Not much too exciting here, although two syzkaller bugs that seem to
have 9 lives may have finally been squashed.
Several core bugs and a batch of driver bug fixes:
- Fix compilation problems in qib and hfi1
- Do not corrupt the joined multicast group state when using
SEND_ONLY
- Several CMA bugs, a reference leak for listening and two syzkaller
crashers
- Various bug fixes for irdma
- Fix a Sleeping while atomic bug in usnic
- Properly sanitize kernel pointers in dmesg
- Two bugs in the 64b CQE support for hns"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/hns: Add the check of the CQE size of the user space
RDMA/hns: Fix the size setting error when copying CQE in clean_cq()
RDMA/hfi1: Fix kernel pointer leak
RDMA/usnic: Lock VF with mutex instead of spinlock
RDMA/hns: Work around broken constant propagation in gcc 8
RDMA/cma: Ensure rdma_addr_cancel() happens before issuing more requests
RDMA/cma: Do not change route.addr.src_addr.ss_family
RDMA/irdma: Report correct WC error when there are MW bind errors
RDMA/irdma: Report correct WC error when transport retry counter is exceeded
RDMA/irdma: Validate number of CQ entries on create CQ
RDMA/irdma: Skip CQP ring during a reset
MAINTAINERS: Update Broadcom RDMA maintainers
RDMA/cma: Fix listener leak in rdma_cma_listen_on_all() failure
IB/cma: Do not send IGMP leaves for sendonly Multicast groups
IB/qib: Fix clang confusion of NULL pointer comparison
Eric Dumazet [Wed, 29 Sep 2021 22:57:50 +0000 (15:57 -0700)]
af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
In order to fix this issue, this patch adds a new spinlock that needs
to be used whenever these fields are read or written.
Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
reading sk->sk_peer_pid which makes no sense, as this field
is only possibly set by AF_UNIX sockets.
We will have to clean this in a separate patch.
This could be done by reverting b48596d1dc25 "Bluetooth: L2CAP: Add get_peer_pid callback"
or implementing what was truly expected.
David S. Miller [Thu, 30 Sep 2021 13:17:10 +0000 (14:17 +0100)]
Merge branch 'snmp-optimizations'
Eric Dumazet says:
====================
net: snmp: minor optimizations
Fetching many SNMP counters on hosts with large number of cpus
takes a lot of time. mptcp still uses the old non-batched
fashion which is not cache friendly.
====================
Wong Vee Khee [Thu, 30 Sep 2021 06:44:36 +0000 (14:44 +0800)]
net: stmmac: fix EEE init issue when paired with EEE capable PHYs
When STMMAC is paired with Energy-Efficient Ethernet(EEE) capable PHY,
and the PHY is advertising EEE by default, we need to enable EEE on the
xPCS side too, instead of having user to manually trigger the enabling
config via ethtool.
Fixed this by adding xpcs_config_eee() call in stmmac_eee_init().
Fixes: 7617af3d1a5e ("net: pcs: Introducing support for DWC xpcs Energy Efficient Ethernet") Cc: Michael Sit Wei Hong <[email protected]> Signed-off-by: Wong Vee Khee <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jason Xing [Wed, 29 Sep 2021 17:56:05 +0000 (10:56 -0700)]
ixgbe: let the xdpdrv work with more than 64 cpus
Originally, ixgbe driver doesn't allow the mounting of xdpdrv if the
server is equipped with more than 64 cpus online. So it turns out that
the loading of xdpdrv causes the "NOMEM" failure.
Actually, we can adjust the algorithm and then make it work through
mapping the current cpu to some xdp ring with the protect of @tx_lock.
Here are some numbers before/after applying this patch with xdp-example
loaded on the eth0X:
As server (rx path):
Before After
TCP_STREAM send-64 1416.49 1383.12
TCP_STREAM send-128 3141.49 3055.50
TCP_STREAM send-512 9488.73 9487.44
TCP_STREAM send-1k 9491.17 9356.22 (not stable)
TCP_RR send-1 23617.74 23601.60
...
Notice: the TCP_RR mode is unstable as the official document explains.
I tested many times with different parameters combined through netperf.
Though the result is not that accurate, I cannot see much influence on
this patch. The static key is places on the hot path, but it actually
shouldn't cause a huge regression theoretically.
David S. Miller [Thu, 30 Sep 2021 12:36:46 +0000 (13:36 +0100)]
Merge branch 'SO_RESEVED_MEM'
Wei Wang says:
====================
net: add new socket option SO_RESERVE_MEM
This patch series introduces a new socket option SO_RESERVE_MEM.
This socket option provides a mechanism for users to reserve a certain
amount of memory for the socket to use. When this option is set, kernel
charges the user specified amount of memory to memcg, as well as
sk_forward_alloc. This amount of memory is not reclaimable and is
available in sk_forward_alloc for this socket.
With this socket option set, the networking stack spends less cycles
doing forward alloc and reclaim, which should lead to better system
performance, with the cost of an amount of pre-allocated and
unreclaimable memory, even under memory pressure.
With a tcp_stream test with 10 flows running on a simulated 100ms RTT
link, I can see the cycles spent in __sk_mem_raise_allocated() dropping
by ~0.02%. Not a whole lot, since we already have logic in
sk_mem_uncharge() to only reclaim 1MB when sk_forward_alloc has more
than 2MB free space. But on a system suffering memory pressure
constently, the savings should be more.
The first patch is the implementation of this socket option. The
following 2 patches change the tcp stack to make use of this reserved
memory when under memory pressure. This makes the tcp stack behavior
more flexible when under memory pressure, and provides a way for user to
control the distribution of the memory among its sockets.
With a TCP connection on a simulated 100ms RTT link, the default
throughput under memory pressure is ~500Kbps. With SO_RESERVE_MEM set to
100KB, the throughput under memory pressure goes up to ~3.5Mbps.
Change since v2:
- Added description for new field added in struct sock in patch 1
Change since v1:
- Added performance stats in cover letter and rebased
====================
Wei Wang [Wed, 29 Sep 2021 17:25:13 +0000 (10:25 -0700)]
tcp: adjust rcv_ssthresh according to sk_reserved_mem
When user sets SO_RESERVE_MEM socket option, in order to utilize the
reserved memory when in memory pressure state, we adjust rcv_ssthresh
according to the available reserved memory for the socket, instead of
using 4 * advmss always.
Wei Wang [Wed, 29 Sep 2021 17:25:12 +0000 (10:25 -0700)]
tcp: adjust sndbuf according to sk_reserved_mem
If user sets SO_RESERVE_MEM socket option, in order to fully utilize the
reserved memory in memory pressure state on the tx path, we modify the
logic in sk_stream_moderate_sndbuf() to set sk_sndbuf according to
available reserved memory, instead of MIN_SOCK_SNDBUF, and adjust it
when new data is acked.
Wei Wang [Wed, 29 Sep 2021 17:25:11 +0000 (10:25 -0700)]
net: add new socket option SO_RESERVE_MEM
This socket option provides a mechanism for users to reserve a certain
amount of memory for the socket to use. When this option is set, kernel
charges the user specified amount of memory to memcg, as well as
sk_forward_alloc. This amount of memory is not reclaimable and is
available in sk_forward_alloc for this socket.
With this socket option set, the networking stack spends less cycles
doing forward alloc and reclaim, which should lead to better system
performance, with the cost of an amount of pre-allocated and
unreclaimable memory, even under memory pressure.
Note:
This socket option is only available when memory cgroup is enabled and we
require this reserved memory to be charged to the user's memcg. We hope
this could avoid mis-behaving users to abused this feature to reserve a
large amount on certain sockets and cause unfairness for others.
Jakub Kicinski [Wed, 29 Sep 2021 15:32:24 +0000 (08:32 -0700)]
net: dev_addr_list: handle first address in __hw_addr_add_ex
struct dev_addr_list is used for device addresses, unicast addresses
and multicast addresses. The first of those needs special handling
of the main address - netdev->dev_addr points directly the data
of the entry and drivers write to it freely, so we can't maintain
it in the rbtree (for now, at least, to be fixed in net-next).
Current work around sprinkles special handling of the first
address on the list throughout the code but it missed the case
where address is being added. First address will not be visible
during subsequent adds.
Syzbot found a warning where unicast addresses are modified
without holding the rtnl lock, tl;dr is that team generates
the same modification multiple times, not necessarily when
right locks are held.
In the repro we have:
macvlan -> team -> veth
macvlan adds a unicast address to the team. Team then pushes
that address down to its memebers (veths). Next something unrelated
makes team sync member addrs again, and because of the bug
the addr entries get duplicated in the veths. macvlan gets
removed, removes its addr from team which removes only one
of the duplicated addresses from veths. This removal is done
under rtnl. Next syzbot uses iptables to add a multicast addr
to team (which does not hold rtnl lock). Team syncs veth addrs,
but because veths' unicast list still has the duplicate it will
also get sync, even though this update is intended for mc addresses.
Again, uc address updates need rtnl lock, boom.
Reported-by: [email protected] Fixes: 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.") Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Patch that refactored fl_walk() to use idr_for_each_entry_continue_ul()
also removed rcu protection of individual filters which causes following
use-after-free when filter is deleted concurrently. Fix fl_walk() to obtain
rcu read lock while iterating and taking the filter reference and temporary
release the lock while calling arg->fn() callback that can sleep.
KASAN trace:
[ 352.773640] ==================================================================
[ 352.775041] BUG: KASAN: use-after-free in fl_walk+0x159/0x240 [cls_flower]
[ 352.776304] Read of size 4 at addr ffff8881c8251480 by task tc/2987
Colin Ian King [Wed, 29 Sep 2021 13:27:53 +0000 (14:27 +0100)]
octeontx2-af: Remove redundant initialization of variable pin
The variable pin is being initialized with a value that is never
read, it is being updated later on in only one case of a switch
statement. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The macb PTP support currently implements the `gettime64` callback to allow
to retrieve the hardware clock time. Update the implementation to provide
the `gettimex64` callback instead.
The difference between the two is that with `gettime64` a snapshot of the
system clock is taken before and after invoking the callback. Whereas
`gettimex64` expects the callback itself to take the snapshots.
To get the time from the macb Ethernet core multiple register accesses have
to be done. Only one of which will happen at the time reported by the
function. This leads to a non-symmetric delay and adds a slight offset
between the hardware and system clock time when using the `gettime64`
method. This offset can be a few 100 nanoseconds. Switching to the
`gettimex64` method allows for a more precise correlation of the hardware
and system clocks and results in a lower offset between the two.
On a Xilinx ZynqMP system `phc2sys` reports a delay of 1120 ns before and
300 ns after the patch. With the latter being mostly symmetric.
the problem originates from uncorrect lock annotation in the mptcp
code and is only visible since commit 2dcb96bacce3 ("net: core: Correct
the sock::sk_lock.owned lockdep annotations"), but is present since
the port-based endpoint support initial implementation.
This patch addresses the issue introducing a nested variant of
lock_sock_fast() and using it in the relevant code path.
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port") Fixes: 2dcb96bacce3 ("net: core: Correct the sock::sk_lock.owned lockdep annotations") Suggested-by: Thomas Gleixner <[email protected]> Reported-and-tested-by: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
octeontx2-af: Adjust LA pointer for cpt parse header
In case of ltype NPC_LT_LA_CPT_HDR, LA pointer is pointing to the
start of cpt parse header. Since cpt parse header has veriable
length padding, this will be a problem for DMAC extraction. Adding
KPU profile changes to adjust the LA pointer to start at ether header
in case of cpt parse header by
- Adding ptr advance in pkind 58 to a fixed value 40
- Adding variable length offset 7 and mask 7 (pad len in
CPT_PARSE_HDR).
Also added the missing static declaration for npc_set_var_len_offset_pkind
function.
libbpf: Fix skel_internal.h to set errno on loader retval < 0
When the loader indicates an internal error (result of a checked bpf
system call), it returns the result in attr.test.retval. However, tests
that rely on ASSERT_OK_PTR on NULL (returned from light skeleton) may
miss that NULL denotes an error if errno is set to 0. This would result
in skel pointer being NULL, while ASSERT_OK_PTR returning 1, leading to
a SEGV on dereference of skel, because libbpf_get_error relies on the
assumption that errno is always set in case of error for ptr == NULL.
In particular, this was observed for the ksyms_module test. When
executed using `./test_progs -t ksyms`, prior tests manipulated errno
and the test didn't crash when it failed at ksyms_module load, while
using `./test_progs -t ksyms_module` crashed due to errno being
untouched.
net_sched: Use struct_size() and flex_array_size() helpers
Make use of the struct_size() and flex_array_size() helpers instead of
an open-coded version, in order to avoid any potential type mistakes
or integer overflows that, in the worse scenario, could lead to heap
overflows.
libbpf: Properly ignore STT_SECTION symbols in legacy map definitions
The previous patch to ignore STT_SECTION symbols only added the ignore
condition in one of them. This fails if there's more than one map
definition in the 'maps' section, because the subsequent modulus check will
fail, resulting in error messages like:
libbpf: elf: unable to determine legacy map definition size in ./xdpdump_xdp.o
Fix this by also ignoring STT_SECTION in the first loop.
bpf: Do not invoke the XDP dispatcher for PROG_RUN with single repeat
We have a unit test that invokes an XDP program with 1m different
inputs, aka 1m BPF_PROG_RUN syscalls. We run this test concurrently
with slight variations in how we generated the input.
Since commit f23c4b3924d2 ("bpf: Start using the BPF dispatcher in BPF_TEST_RUN")
the unit test has slowed down significantly. Digging deeper reveals that
the concurrent tests are serialised in the kernel on the XDP dispatcher.
This is a global resource that is protected by a mutex, on which we contend.
Fix this by not calling into the XDP dispatcher if we only want to perform
a single run of the BPF program.