Currently, link_mode parameter derives 3 other link parameters, speed,
lanes and duplex, and the derived information is sent to user space.
Few bugs were found in that functionality.
First, some drivers clear the 'ethtool_link_ksettings' struct in their
get_link_ksettings() callback and cause receiving wrong link mode
information in user space. And also, some drivers can report random
values in the 'link_mode' field and cause general protection fault.
Second, the link parameters are only derived in netlink path so in ioctl
path, we don't any reasonable values.
Third, setting 'speed 10000 lanes 1' fails since the lanes parameter
wasn't set for ETHTOOL_LINK_MODE_10000baseR_FEC_BIT.
Patch #1 solves the first two problems by removing link_mode parameter
and deriving the link parameters in driver instead of ethtool.
Patch #2 solves the third one, by setting the lanes parameter for the
link_mode.
v3:
* Remove the link_mode parameter in the first patch to solve
both two issues from patch#1 and patch#2.
* Add the second patch to solve the third issue.
v2:
* Add patch #2.
* Introduce 'cap_link_mode_supported' instead of adding a
validity field to 'ethtool_link_ksettings' struct in patch #1.
====================
ethtool: Add lanes parameter for ETHTOOL_LINK_MODE_10000baseR_FEC_BIT
Lanes field is missing for ETHTOOL_LINK_MODE_10000baseR_FEC_BIT
link mode and it causes a failure when trying to set
'speed 10000 lanes 1' on Spectrum-2 machines when autoneg is set to on.
Add the lanes parameter for ETHTOOL_LINK_MODE_10000baseR_FEC_BIT
link mode.
Fixes: c8907043c6ac9 ("ethtool: Get link mode in use instead of speed and duplex parameters") Signed-off-by: Danielle Ratson <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
ethtool: Remove link_mode param and derive link params from driver
Some drivers clear the 'ethtool_link_ksettings' struct in their
get_link_ksettings() callback, before populating it with actual values.
Such drivers will set the new 'link_mode' field to zero, resulting in
user space receiving wrong link mode information given that zero is a
valid value for the field.
Another problem is that some drivers (notably tun) can report random
values in the 'link_mode' field. This can result in a general protection
fault when the field is used as an index to the 'link_mode_params' array
[1].
This happens because such drivers implement their set_link_ksettings()
callback by simply overwriting their private copy of
'ethtool_link_ksettings' struct with the one they get from the stack,
which is not always properly initialized.
Fix these problems by removing 'link_mode' from 'ethtool_link_ksettings'
and instead have drivers call ethtool_params_from_link_mode() with the
current link mode. The function will derive the link parameters (e.g.,
speed) from the link mode and fill them in the 'ethtool_link_ksettings'
struct.
v3:
* Remove link_mode parameter and derive the link parameters in
the driver instead of passing link_mode parameter to ethtool
and derive it there.
v2:
* Introduce 'cap_link_mode_supported' instead of adding a
validity field to 'ethtool_link_ksettings' struct.
Fixes: c8907043c6ac9 ("ethtool: Get link mode in use instead of speed and duplex parameters") Signed-off-by: Danielle Ratson <[email protected]> Reported-by: Eric Dumazet <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
this is a pull request of 6 patches for net-next/master.
The first patch targets the CAN driver infrastructure, it improves the
alloc_can{,fd}_skb() function to set the pointer to the CAN frame to
NULL if skb allocation fails.
The next patch adds missing error handling to the m_can driver's RX
path (the code was introduced in -next, no need to backport).
In the next patch an unused constant is removed from an enum in the
c_can driver.
The last 3 patches target the mcp251xfd driver. They add BQL support
and try to work around a sometimes broken CRC when reading the TBC
register.
====================
Andrei Vagin [Wed, 7 Apr 2021 06:40:51 +0000 (23:40 -0700)]
net: remove the new_ifindex argument from dev_change_net_namespace
Here is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.
Fixes: eeb85a14ee34 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Andrei Vagin [Wed, 7 Apr 2021 06:40:03 +0000 (23:40 -0700)]
net: introduce nla_policy for IFLA_NEW_IFINDEX
In this case, we don't need to check that new_ifindex is positive in
validate_linkmsg.
Fixes: eeb85a14ee34 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 7 Apr 2021 21:38:24 +0000 (14:38 -0700)]
Merge tag 'mlx5-updates-2021-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-04-06
Introduce TC sample offload
Background
----------
The tc sample action allows user to sample traffic matched by tc
classifier. The sampling consists of choosing packets randomly and
sampling them using psample module.
The tc sample parameters include group id, sampling rate and packet's
truncation (to save kernel-user traffic).
Sample in TC SW
---------------
User must specify rate and group id for sample action, truncate is
optional.
tc filter add dev enp4s0f0_0 ingress protocol ip prio 1 flower \
src_mac 02:25:d0:14:01:02 dst_mac 02:25:d0:14:01:03 \
action sample rate 10 group 5 trunc 60 \
action mirred egress redirect dev enp4s0f0_1
The tc sample action kernel module 'act_sample' will call another
kernel module 'psample' to send sampled packets to userspace.
The sample action is translated to a goto flow table object
destination which samples packets according to the provided
sample ratio. Sampled packets are duplicated. One copy is
processed by a termination table, named the sample table,
which sends the packet to the eswitch manager port (that will
be processed by software).
The second copy is processed by the default table which executes
the subsequent actions. The default table is created per <vport,
chain, prio> tuple as rules with different prios and chains may
overlap.
For example, for the following typical flow table:
+-------------------------------+
+ original flow table +
+-------------------------------+
+ original match +
+-------------------------------+
+ sample action + other actions +
+-------------------------------+
We translate the tc filter with sample action to the following HW model:
+---------------------+
+ original flow table +
+---------------------+
+ original match +
+---------------------+
|
v
+------------------------------------------------+
+ Flow Sampler Object +
+------------------------------------------------+
+ sample ratio +
+------------------------------------------------+
+ sample table id | default table id +
+------------------------------------------------+
| |
v v
+-----------------------------+ +----------------------------------------+
+ sample table + + default table per <vport, chain, prio> +
+-----------------------------+ +----------------------------------------+
+ forward to management vport + + original match +
+-----------------------------+ +----------------------------------------+
+ other actions +
+----------------------------------------+
Flow sampler object
-------------------
Hardware introduces flow sampler object to do sample. It is a new
destination type. Driver needs to specify two flow table ids in it.
One is sample table id. The other one is the default table id.
Sample table samples the packets according to the sample rate and
forward the sampled packets to eswitch manager port. Default table
finishes the subsequent actions.
Group id and reg_c0
-------------------
Userspace program will take different actions for sampled packets
according to tc sample action group id. So hardware must pass group
id to software for each sampled packets. In Paul Blakey's "Introduce
connection tracking offload" patch set, reg_c0 lower 16 bits are used
for miss packet chain id restore. We convert reg_c0 lower 16 bits to
a common object pool, so other features can also use it.
Since sample group id is 32 bits, create a 16 bits object id to map
the group id and write the object id to reg_c0 lower 16 bits. reg_c0
can only be used for matching. Write reg_c0 to flow_tag, so software
can get the object id via flow_tag and find group id via the common
object pool.
Sampler restore handle
----------------------
Use common object pool to create an object id to map sample parameters.
Allocate a modify header action to write the object id to reg_c0 lower
16 bits. Create a restore rule to pass the object id to software. So
software can identify sampled packets via the object id and send it to
userspace.
Aggregate the modify header action, restore rule and object id to a
sample restore handle. Re-use identical sample restore handle for
the same object id.
Send sampled packets to userspace
---------------------------------
The destination for sampled packets is eswitch manager port, so
representors can receive sampled packets together with the group id.
Driver will send sampled packets and group id to userspace via psample.
====================
In function fdp_nci_patch_otp and fdp_nci_patch_ram,many goto
out statements are used, and out label just return variable r.
in some places,just jump to the out label, and in other places,
assign a value to the variable r,then jump to the out label.
It is unnecessary, we just use return sentences to replace goto
sentences and delete out label.
mlxsw: core: Remove critical trip points from thermal zones
Disable software thermal protection by removing critical trip points
from all thermal zones.
The software thermal protection is redundant given there are two layers
of protection below it in firmware and hardware. The first layer is
performed by firmware, the second, in case firmware was not able to
perform protection, by hardware.
The temperature threshold set for hardware protection is always higher
than for firmware.
Kurt Kanzenbach [Tue, 6 Apr 2021 07:35:09 +0000 (09:35 +0200)]
net: hsr: Reset MAC header for Tx path
Reset MAC header in HSR Tx path. This is needed, because direct packet
transmission, e.g. by specifying PACKET_QDISC_BYPASS does not reset the MAC
header.
This has been observed using the following setup:
|$ ip link add name hsr0 type hsr slave1 lan0 slave2 lan1 supervision 45 version 1
|$ ifconfig hsr0 up
|$ ./test hsr0
The test binary is using mmap'ed sockets and is specifying the
PACKET_QDISC_BYPASS socket option.
This patch resolves the following warning on a non-patched kernel:
The warning can be safely removed, because the other call sites of
hsr_forward_skb() make sure that the skb is prepared correctly.
Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Kurt Kanzenbach <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
mptcp: drop all sub-options except ADD_ADDR when the echo bit is set
Current Linux carries echo-ed ADD_ADDR over pure TCP ACKs, so there is no
need to add a DSS element that would fit only ADD_ADDR with IPv4 address.
Drop the DSS from echo-ed ADD_ADDR, regardless of the IP version.
selftests: mptcp: add the net device name testcase
This patch added a new testcase for setting the net device name. In it,
pass the net device name to pm_nl_ctl to set the ifindex field of struct
mptcp_pm_addr_entry.
Since the type of the address family in struct mptcp_options_received
became sa_family_t, we should set AF_INET/AF_INET6 to it, instead of
using MPTCP_ADDR_IPVERSION_4/6.
mptcp: use mptcp_addr_info in mptcp_options_received
This patch added a new struct mptcp_addr_info member addr in struct
mptcp_options_received, and dropped the original family, addr_id, addr,
addr6 and port fields in it. Then we can pass the parameter mp_opt.addr
directly to mptcp_pm_add_addr_received and mptcp_pm_add_addr_echoed.
Since the port number became big-endian now, use htons to convert the
incoming port number to it. Also use ntohs to convert it when passing
it to add_addr_generate_hmac or printing it out.
This patch moved the mptcp_addr_info struct from protocol.h to mptcp.h,
added a new struct mptcp_addr_info member addr in struct mptcp_out_options,
and dropped the original addr, addr6, addr_id and port fields in it. Then
we can use opts->addr to get the adding address from PM directly using
mptcp_pm_add_addr_signal.
Since the port number became big-endian now, use ntohs to convert it
before sending it out with the ADD_ADDR suboption. Also convert it
when passing it to add_addr_generate_hmac or printing it out.
mptcp: move flags and ifindex out of mptcp_addr_info
This patch moved the flags and ifindex fields from struct mptcp_addr_info
to struct mptcp_pm_addr_entry. Add the flags and ifindex values as two new
parameters to __mptcp_subflow_connect.
In mptcp_pm_create_subflow_or_signal_addr, pass the local address entry's
flags and ifindex fields to __mptcp_subflow_connect.
In mptcp_pm_nl_add_addr_received, just pass two zeros to it.
net/rds: Avoid potential use after free in rds_send_remove_from_sock
In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource
is freed and later under spinlock, causing potential use-after-free.
Set the free pointer to NULL to avoid undefined behavior.
Merge tag 'arc-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixlets from Vineet Gupta:
"A few straggler fixes for ARC"
* tag 'arc-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: treewide: avoid the pointer addition with NULL pointer
arc: kernel: Return -EFAULT if copy_to_user() fails
ARC: haps: bump memory to 1 GB
The issue happens when pcibus_to_node() returns NO_NUMA_NODE.
Fix this issue by moving the initialization of dd->node to hfi1_devdata
allocation and remove the other pcibus_to_node() calls in the probe path
and use dd->node instead.
Affinity logic is adjusted to use a new field dd->affinity_entry as a
guard instead of dd->node.
RDMA/cxgb4: check for ipv6 address properly while destroying listener
ipv6 bit is wrongly set by the below which causes fatal adapter lookup
engine errors for ipv4 connections while destroying a listener. Fix it to
properly check the local address for ipv6.
Frank Rowand [Mon, 5 Apr 2021 03:28:45 +0000 (22:28 -0500)]
of: properly check for error returned by fdt_get_name()
fdt_get_name() returns error values via a parameter pointer
instead of in function return. Fix check for this error value
in populate_node() and callers of populate_node().
Chasing up the caller tree showed callers of various functions
failing to initialize the value of pointer parameters that
can return error values. Initialize those values to NULL.
The bug was introduced by
commit e6a6928c3ea1 ("of/fdt: Convert FDT functions to use libfdt")
but this patch can not be backported directly to that commit
because the relevant code has further been restructured by
commit dfbd4c6eff35 ("drivers/of: Split unflatten_dt_node()")
ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182bd7 ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails.
The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the later from assembly) but it seems putting the whole pattern into a
new function and exporting it instead is better.
Merge tag 'arm-fixes-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Most of the changes again are devicetree fixes, but there are also
five trivial build fixes for issues I found when test building with
gcc-11 or when running 'make W=1', and some OMAP platform specific
code fixups.
Broadcom:
- One revert for a Raspberry pi interrupt controller change that
caused a regression.
TI OMAP:
- Remove unused duplicate sha2md5_fck clock node that can race with
the OMAP4_SHA2MD5_CLKCTRL clock node for disable for unused clocks
- Add aliases for omap4/5 mmc to put the slots back into the right
order again
- Fix typo for bionic voltage controllers that accidentally use mpu
for all instances instead of mpu, core and iva
- Fix random hangs for droid4 caused by missing fix from TI Android
kernel tree to do a dummy smc call on cpuidle wakeup path
NXP i.MX:
- Fix a system failure on imx6qdl-phytec-pfla02 board when booting
from SD, by adding missing vmmc supply for SD interfaces.
- Fix address typo in i.MX8MM/Q IOMUXC_SD1_DATA0_GPIO2_IO2
definition.
Marvell mvebu:
- Fix storm interrupt on Turris Omnia
- Enable hardware buffer management as it should be
... and build fixes for PXA, Freescale, Marvell, OMAP1 and Keystone"
* tag 'arm-fixes-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: dts: turris-omnia: configure LED[2]/INTn pin as interrupt pin
ARM: dts: turris-omnia: fix hardware buffer management
Revert "arm64: dts: marvell: armada-cp110: Switch to per-port SATA interrupts"
ARM: mvebu: avoid clang -Wtautological-constant warning
ARM: pxa: mainstone: avoid -Woverride-init warning
ARM: omap1: fix building with clang IAS
soc/fsl: qbman: fix conflicting alignment attributes
ARM: keystone: fix integer overflow warning
ARM: dts: imx6: pbab01: Set vmmc supply for both SD interfaces
arm64: dts: imx8mm/q: Fix pad control of SD1_DATA0
ARM: OMAP4: PM: update ROM return address for OSWR and OFF
ARM: OMAP4: Fix PMIC voltage domains for bionic
ARM: dts: Fix moving mmc devices with aliases for omap4 & 5
ARM: dts: Drop duplicate sha2md5_fck to fix clk_disable race
Revert "ARM: dts: bcm2711: Add the BSC interrupt controller"
Merge branch 'parisc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"One link error fix found by the kernel test robot, one sparse warning
fix, remove a duplicate declaration and some spelling fixes"
* 'parisc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: math-emu: Few spelling fixes in the file fpu.h
parisc: avoid a warning on u8 cast for cmpxchg on u8 pointers
parisc: parisc-agp requires SBA IOMMU driver
parisc: Remove duplicate struct task_struct declaration
Merge tag 'platform-drivers-x86-v5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fix from Hans de Goede:
"A single bugfix to fix spurious wakeups from suspend caused by recent
intel-hid driver changes"
* tag 'platform-drivers-x86-v5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: intel-hid: Fix spurious wakeups caused by tablet-mode events during suspend
Merge tag 'regulator-fix-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"bd9571mwv regulator fixes for v5.12.
A set of driver specific fixes here, the main one is a fix to not try
to set unsupported voltages on this device. The other two patches
clean up the error handling and eliminate the possibility that we
could overflow the page when writing sysfs output (which AFAICT wasn't
an issue but better to be sure)"
* tag 'regulator-fix-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: bd9571mwv: Convert device attribute to sysfs_emit()
regulator: bd9571mwv: Fix regulator name printed on registration failure
regulator: bd9571mwv: Fix AVS and DVFS voltage range
xen/evtchn: Change irq_info lock to raw_spinlock_t
Unmask operation must be called with interrupt disabled,
on preempt_rt spin_lock_irqsave/spin_unlock_irqrestore
don't disable/enable interrupts, so use raw_* implementation
and change lock variable in struct irq_info from spinlock_t
to raw_spinlock_t
Merge tag 'asoc-fix-v5.12-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.12
A fairly small batch of driver specific fixes, mainly for various x86
systems with the biggest set being fixes to power down DSPs properly on
x86 SOF systems.
Jonas Holmberg [Wed, 7 Apr 2021 07:54:28 +0000 (09:54 +0200)]
ALSA: aloop: Fix initialization of controls
Add a control to the card before copying the id so that the numid field
is initialized in the copy. Otherwise the numid field of active_id,
format_id, rate_id and channels_id will be the same (0) and
snd_ctl_notify() will not queue the events properly.
can: mcp251xfd: mcp251xfd_regmap_crc_read(): work around broken CRC on TBC register
MCP251XFD_REG_TBC is the time base counter register. It increments
once per SYS clock tick, which is 20 or 40 MHz. Observation shows that
if the lowest byte (which is transferred first on the SPI bus) of that
register is 0x00 or 0x80 the calculated CRC doesn't always match the
transferred one.
To reproduce this problem let the driver read the TBC register in a
high frequency. This can be done by attaching only the mcp251xfd CAN
controller to a valid terminated CAN bus and send a single CAN frame.
As there are no other CAN controller on the bus, the sent CAN frame is
not ACKed and the mcp251xfd repeats it. If user space enables the bus
error reporting, each of the NACK errors is reported with a time
stamp (which is read from the TBC register) to user space.
$ ip link set can0 down
$ ip link set can0 up type can bitrate 500000 berr-reporting on
$ cansend can0 4FF#ff.01.00.00.00.00.00.00
If the highest bit in the lowest byte is flipped the transferred CRC
matches the calculated one. We assume for now the CRC calculation in
the chip works on wrong data and the transferred data is correct.
This patch implements the following workaround:
- If a CRC read error on the TBC register is detected and the lowest
byte is 0x00 or 0x80, the highest bit of the lowest byte is flipped
and the CRC is calculated again.
- If the CRC now matches, the _original_ data is passed to the reader.
For now we assume transferred data was OK.
can: m_can: m_can_receive_skb(): add missing error handling to can_rx_offload_queue_sorted() call
In commit 1be37d3b0414 ("can: m_can: fix periph RX path: use
rx-offload to ensure skbs are sent from softirq context") the RX path
for peripherals (i.e. SPI based m_can controllers) was converted to
the rx-offload infrastructure. However, the error handling for
can_rx_offload_queue_sorted() was forgotten.
can_rx_offload_queue_sorted() will return with an error if the
internal queue is full.
This patch adds the missing error handling, by increasing the
rx_fifo_errors.
can: skb: alloc_can{,fd}_skb(): set "cf" to NULL if skb allocation fails
The handling of CAN bus errors typically consist of allocating a CAN
error SKB using alloc_can_err_skb() followed by stats handling and
filling the error details in the newly allocated CAN error SKB. Even
if the allocation of the SKB fails the stats handling should not be
skipped.
The common pattern in CAN drivers is to allocate the skb and work on
the struct can_frame pointer "cf", if it has been assigned by
alloc_can_err_skb().
In case of an OOM alloc_can_err_skb() returns NULL, but doesn't set
"cf" to NULL as well. For the above pattern to work the "cf" has to be
initialized to NULL, which is easily forgotten.
To solve this kind of problems, set "cf" to NULL if
alloc_can_err_skb() returns NULL.
Chris Mi [Mon, 21 Sep 2020 08:45:07 +0000 (16:45 +0800)]
net/mlx5e: TC, Add support to offload sample action
The following diagram illustrates the hardware model for tc sample action:
+---------------------+
+ original flow table +
+---------------------+
+ original match +
+---------------------+
|
v
+------------------------------------------------+
+ Flow Sampler Object +
+------------------------------------------------+
+ sample ratio +
+------------------------------------------------+
+ sample table id | default table id +
+------------------------------------------------+
| |
v v
+-----------------------------+ +----------------------------------------+
+ sample table + + default table per <vport, chain, prio> +
+-----------------------------+ +----------------------------------------+
+ forward to management vport + + original match +
+-----------------------------+ +----------------------------------------+
+ other actions +
+----------------------------------------+
The sample action is translated to a goto flow table object
destination which samples packets according to the provided
sample ratio. Sampled packets are duplicated. One copy is
processed by a termination table, named the sample table,
which sends the packet to the eswitch manager port (that will
be processed by software).
The second copy is processed by the default table which executes
the subsequent actions. The default table is created per <vport,
chain, prio> tuple as rules with different prios and chains may
overlap.
Chris Mi [Tue, 26 Jan 2021 03:08:28 +0000 (11:08 +0800)]
net/mlx5e: TC, Add sampler restore handle API
Use common object pool to create an object ID to map sample parameters.
Allocate a modify header action to write the object ID to reg_c0 lower
16 bits. Create a restore rule to pass the object ID to software. So
software can identify sampled packets via the object ID and send it to
userspace.
Aggregate the modify header action, restore rule and object ID to a
sample restore handle. Re-use identical sample restore handle for
the same object ID.
Chris Mi [Mon, 21 Sep 2020 05:25:54 +0000 (13:25 +0800)]
net/mlx5e: TC, Add sampler object API
In order to offload sample action, HW introduces sampler object. The
sampler object samples packets according to the provided sample ratio.
Sampled packets are duplicated. One copy is processed by a termination
table, named the sample table, which sends the packet up to software.
The second copy is processed by the default table.
Instantiate sampler object. Re-use identical sampler object for
the same sample ratio, sample table and default table as a prestep for
offloading tc sample actions.
Chris Mi [Fri, 18 Sep 2020 09:46:57 +0000 (17:46 +0800)]
net/mlx5e: TC, Add sampler termination table API
Sampled packets are sent to software using termination tables. There
is only one rule in that table that is to forward sampled packets to
the e-switch management vport.
Create a sampler termination table and rule for each eswitch.
Chris Mi [Mon, 31 Aug 2020 05:27:53 +0000 (13:27 +0800)]
net/mlx5: Instantiate separate mapping objects for FDB and NIC tables
Currently, the u32 chain id is mapped to u16 value which is stored on
the lower 16 bits of reg_c0 for FDB and reg_b for NIC tables. The
mapping is internally maintained by the chains object. However, with
the introduction of reg_c0 objects the fdb may store more than just
the chain id on reg_c0. This is not relevant for NIC tables.
Separate the chains mapping instantiation for FDB and NIC tables.
Remove the mapping from the chains object. For FDB tables, create
the mapping per eswitch. For NIC tables, create the mapping per tc
table. Pass the corresponding mapping pointer when creating the
chains object.
Chris Mi [Thu, 10 Sep 2020 07:28:02 +0000 (15:28 +0800)]
net/mlx5: Map register values to restore objects
Currently reg_c0 lower 16 bits and reg_b are used to store the chain
id that missed in FDB and NIC tables accordingly. However, the
registers' values may index a restore object, rather than a single u32
value. Different object types can be used to restore mutually exclusive
contexts such as chain id and sample group id.
Use the mapping object to associate an index with a restore object
as a prestep for supporting additional restore types.
Chris Mi [Fri, 9 Oct 2020 03:06:33 +0000 (11:06 +0800)]
net/mlx5: E-switch, Set per vport table default group number
Different per voprt table is created using a different per vport table
namespace. Because we can't use variable to set the namespace member
value. If max group number is 0 in the namespace, use the eswitch
default max group number.
Chris Mi [Mon, 31 Aug 2020 05:22:20 +0000 (13:22 +0800)]
net/mlx5: E-switch, Move vport table functions to a new file
Currently, the vport table functions are in common eswitch offload
file. This file is too big. Move the vport table create, delete and
lookup functions to a separate file. Put the file in esw directory.
Pre-step for generalizing its functionality for serving both the
mirroring and the sample features.
Make sure to modify uplink port to follow only if the uplink_follow
capability is set as required by the HW spec. Failure to do so causes
traffic to the uplink representor net device to cease after switching to
switchdev mode.
Fixes: 7d0314b11cdd ("net/mlx5e: Modify uplink state on interface up/down") Signed-off-by: Eli Cohen <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Al Viro [Tue, 6 Apr 2021 23:46:51 +0000 (19:46 -0400)]
LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late
That (and traversals in case of umount .) should be done before
complete_walk(). Either a braino or mismerge damage on queue
reorders - either way, I should've spotted that much earlier.
Fucked-up-by: Al Viro <[email protected]>
X-Paperbag: Brown Fixes: 161aff1d93ab "LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()" Cc: [email protected] # v5.7+ Signed-off-by: Al Viro <[email protected]>
Phillip Potter [Tue, 6 Apr 2021 17:45:54 +0000 (18:45 +0100)]
net: tun: set tun->dev->addr_len during TUNSETLINK processing
When changing type with TUNSETLINK ioctl command, set tun->dev->addr_len
to match the appropriate type, using new tun_get_addr_len utility function
which returns appropriate address length for given type. Fixes a
KMSAN-found uninit-value bug reported by syzbot at:
https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
net: hns3: clear VF down state bit before request link status
Currently, the VF down state bit is cleared after VF sending
link status request command. There is problem that when VF gets
link status replied from PF, the down state bit may still set
as 1. In this case, the link status replied from PF will be
ignored and always set VF link status to down.
To fix this problem, clear VF down state bit before VF requests
link status.
Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support") Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: Huazhong Tan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
Netfilter updates for net-next
The following batch contains Netfilter/IPVS updates for your net-next tree:
1) Simplify log infrastructure modularity: Merge ipv4, ipv6, bridge,
netdev and ARP families to nf_log_syslog.c. Add module softdeps.
This fixes a rare deadlock condition that might occur when log
module autoload is required. From Florian Westphal.
2) Moves part of netfilter related pernet data from struct net to
net_generic() infrastructure. All of these users can be modules,
so if they are not loaded there is no need to waste space. Size
reduction is 7 cachelines on x86_64, also from Florian.
2) Update nftables audit support to report events once per table,
to get it aligned with iptables. From Richard Guy Briggs.
3) Check for stale routes from the flowtable garbage collector path.
This is fixing IPv6 which breaks due missing check for the dst_cookie.
4) Add a nfnl_fill_hdr() function to simplify netlink + nfnetlink
headers setup.
5) Remove documentation on several statified functions.
6) Remove printk on netns creation for the FTP IPVS tracker,
from Florian Westphal.
7) Remove unnecessary nf_tables_destroy_list_lock spinlock
initialization, from Yang Yingliang.
7) Remove a duplicated forward declaration in ipset,
from Wan Jiabing.
====================
pcnet32: Use pci_resource_len to validate PCI resource
pci_resource_start() is not a good indicator to determine if a PCI
resource exists or not, since the resource may start at address 0.
This is seen when trying to instantiate the driver in qemu for riscv32
or riscv64.
This currently happens because on redirect case we do skb_set_owner_r()
with the original sock. This increments the fwd_alloc memory accounting
on the original sock. Then on redirect we may push this into the queue
of the psock we are redirecting to. When the skb is flushed from the
queue we give the memory back to the original sock. The problem is if
the original sock is destroyed/closed with skbs on another psocks queue
then the original sock will not have a way to reclaim the memory before
being destroyed. Then above warning will be thrown
At this point we have torn down our own psock, but have the outstanding
skb in psock_other. Note that SK_PASS doesn't have this problem because
the sk_psock_drop() logic releases the skb, its still associated with
our psock.
To resolve lets only account for sockets on the ingress queue that are
still associated with the current socket. On the redirect case we will
check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
or received with sk_psock_skb_ingress memory will be claimed on psock_other.
John Fastabend [Thu, 1 Apr 2021 22:00:19 +0000 (15:00 -0700)]
bpf, sockmap: Fix sk->prot unhash op reset
In '4da6a196f93b1' we fixed a potential unhash loop caused when
a TLS socket in a sockmap was removed from the sockmap. This
happened because the unhash operation on the TLS ctx continued
to point at the sockmap implementation of unhash even though the
psock has already been removed. The sockmap unhash handler when a
psock is removed does the following,
rcu_read_lock();
psock = sk_psock(sk);
if (unlikely(!psock)) {
rcu_read_unlock();
if (sk->sk_prot->unhash)
sk->sk_prot->unhash(sk);
return;
}
[...]
}
The unlikely() case is there to handle the case where psock is detached
but the proto ops have not been updated yet. But, in the above case
with TLS and removed psock we never fixed sk_prot->unhash() and unhash()
points back to sock_map_unhash resulting in a loop. To fix this we added
this bit of code,
This will set the sk_prot->unhash back to its saved value. This is the
correct callback for a TLS socket that has been removed from the sock_map.
Unfortunately, this also overwrites the unhash pointer for all psocks.
We effectively break sockmap unhash handling for any future socks.
Omitting the unhash operation will leave stale entries in the map if
a socket transition through unhash, but does not do close() op.
To fix set unhash correctly before calling into tls_update. This way the
TLS enabled socket will point to the saved unhash() handler.
This is caused by NULL returned by tipc_aead_get(), and then crashed when
dereferencing it later in tipc_crypto_rcv_complete(). This might happen
when tipc_crypto_rcv_complete() is called by two threads at the same time:
the tmp attached by tipc_crypto_key_attach() in one thread may be released
by the one attached by that in the other thread.
This patch is to fix it by incrementing the tmp's refcnt before attaching
it instead of calling tipc_aead_get() after attaching it.
Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Reported-by: Li Shuang <[email protected]> Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
In function s3fwrn5_nci_post_setup, the variable ret is assigned then
goto out label, which just return ret, so we use return to replace it.
Other goto sentences are similar, we use return sentences to replace
goto sentences and delete out label.
David S. Miller [Tue, 6 Apr 2021 23:22:37 +0000 (16:22 -0700)]
Merge branch 'usbnet-speed'
Grant Grundler says:
====================
usbnet: speed reporting for devices without MDIO
This series introduces support for USB network devices that report
speed as a part of their protocol, not emulating an MII to be accessed
over MDIO.
v2: rebased on recent upstream changes
v3: incorporated hints on naming and comments
v4: fix misplaced hunks; reword some commit messages;
add same change for cdc_ether
v4-repost: added "net-next" to subject and Andrew Lunn's Reviewed-by
I'm reposting Oliver Neukum's <[email protected]> patch series with
fix ups for "misplaced hunks" (landed in the wrong patches).
Please fixup the "author" if "git am" fails to attribute the
patches 1-3 (of 4) to Oliver.
I've tested v4 series with "5.12-rc3+" kernel on Intel NUC6i5SYB
and + Sabrent NT-S25G. Google Pixelbook Go (chromeos-4.4 kernel)
+ Alpha Network AUE2500C were connected directly to the NT-S25G
to get 2.5Gbps link rate:
Settings for enx002427880815:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 2500Mb/s
Duplex: Half
Auto-negotiation: off
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
MDI-X: Unknown
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
"Duplex" is a lie since we get no information about it.
I expect "Auto-Negotiation" is always true for cdc_ncm and
cdc_ether devices and perhaps someone knows offhand how
to have ethtool report "true" instead.
Grant Grundler [Mon, 5 Apr 2021 23:13:44 +0000 (16:13 -0700)]
net: cdc_ether: record speed in status method
Until very recently, the usbnet framework only had support functions
for devices which reported the link speed by explicitly querying the
PHY over a MDIO interface. However, the cdc_ether devices send
notifications when the link state or link speeds change and do not
expose the PHY (or modem) directly.
Support funtions (e.g. usbnet_get_link_ksettings_internal()) to directly
query state recorded by the cdc_ether driver were added in a previous patch.
Instead of cdc_ether spewing the link speed into the dmesg buffer,
record the link speed encoded in these notifications and tell the
usbnet framework to use the new functions to get link speed/state.
User space can now get the most recent link speed/state using ethtool.
v4: added to series since cdc_ether uses same notifications
as cdc_ncm driver.
Oliver Neukum [Mon, 5 Apr 2021 23:13:43 +0000 (16:13 -0700)]
net: cdc_ncm: record speed in status method
Until very recently, the usbnet framework only had support functions
for devices which reported the link speed by explicitly querying the
PHY over a MDIO interface. However, the cdc_ncm devices send
notifications when the link state or link speeds change and do not
expose the PHY (or modem) directly.
Support funtions (e.g. usbnet_get_link_ksettings_internal()) to directly
query state recorded by the cdc_ncm driver were added in a previous patch.
So instead of cdc_ncm spewing the link speed into the dmesg buffer,
record the link speed encoded in these notifications and tell the
usbnet framework to use the new functions to get link speed/state.
Link speed/state is now available via ethtool.
This is especially useful given all current RTL8156 devices emit
a connection/speed status notification every 32ms and this would
fill the dmesg buffer. This implementation replaces the one
recently submitted in de658a195ee23ca6aaffe197d1d2ea040beea0a2 :
"net: usb: cdc_ncm: don't spew notifications"
Oliver Neukum [Mon, 5 Apr 2021 23:13:41 +0000 (16:13 -0700)]
usbnet: add _mii suffix to usbnet_set/get_link_ksettings
The generic functions assumed devices provided an MDIO interface (accessed
via older mii code, not phylib). This is true only for genuine ethernet.
Devices with a higher level of abstraction or based on different
technologies do not have MDIO. To support this case, first rename
the existing functions with _mii suffix.
v2: rebased on changed upstream
v3: changed names to clearly say that this does NOT use phylib
v4: moved hunks to correct patch; reworded commmit messages
Eric Dumazet [Fri, 2 Apr 2021 13:26:02 +0000 (06:26 -0700)]
virtio_net: Do not pull payload in skb->head
Xuan Zhuo reported that commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs") brought a ~10% performance drop.
The reason for the performance drop was that GRO was forced
to chain sk_buff (using skb_shinfo(skb)->frag_list), which
uses more memory but also cause packet consumers to go over
a lot of overhead handling all the tiny skbs.
It turns out that virtio_net page_to_skb() has a wrong strategy :
It allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
copies 128 bytes from the page, before feeding the packet to GRO stack.
This was suboptimal before commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
meaning we were not packing MSS with 100% efficiency.
Fix is to pull only the ethernet header in page_to_skb()
Then, we change virtio_net_hdr_to_skb() to pull the missing
headers, instead of assuming they were already pulled by callers.
This fixes the performance regression, but could also allow virtio_net
to accept packets with more than 128bytes of headers.
Many thanks to Xuan Zhuo for his report, and his tests/help.
Userspace sends tcp connection (sock) destroy on network switch
i.e switching the default network of the device between multiple
networks(Cellular/Wifi/Ethernet).
Kernel though doesn't send reset for the connections in SYN-SENT state
and these connections continue to remain.
Even as per RFC 793, there is no hard rule to not send RST on ABORT in
this state.
Modify tcp_abort and tcp_disconnect behavior to send RST for connections
in syn-sent state to avoid lingering connections on network switch.
net: broadcom: bcm4908enet: Fix a double free in bcm4908_enet_dma_alloc
In bcm4908_enet_dma_alloc, if callee bcm4908_dma_alloc_buf_descs() failed,
it will free the ring->cpu_addr by dma_free_coherent() and return error.
Then bcm4908_enet_dma_free() will be called, and free the same cpu_addr
by dma_free_coherent() again.
My patch set ring->cpu_addr to NULL after it is freed in
bcm4908_dma_alloc_buf_descs() to avoid the double free.
Fixes: 4feffeadbcb2e ("net: broadcom: bcm4908enet: add BCM4908 controller driver") Signed-off-by: Lv Yunlong <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Pavel Skripkin [Thu, 4 Mar 2021 15:21:25 +0000 (18:21 +0300)]
net: mac802154: Fix general protection fault
syzbot found general protection fault in crypto_destroy_tfm()[1].
It was caused by wrong clean up loop in llsec_key_alloc().
If one of the tfm array members is in IS_ERR() range it will
cause general protection fault in clean up function [1].
Alexander Aring [Mon, 5 Apr 2021 00:30:54 +0000 (20:30 -0400)]
net: ieee802154: stop dump llsec params for monitors
This patch stops dumping llsec params for monitors which we don't support
yet. Otherwise we will access llsec mib which isn't initialized for
monitors.
Alexander Aring [Mon, 5 Apr 2021 00:30:53 +0000 (20:30 -0400)]
net: ieee802154: forbid monitor for del llsec seclevel
This patch forbids to del llsec seclevel for monitor interfaces which we
don't support yet. Otherwise we will access llsec mib which isn't
initialized for monitors.
Alexander Aring [Mon, 5 Apr 2021 00:30:52 +0000 (20:30 -0400)]
net: ieee802154: forbid monitor for add llsec seclevel
This patch forbids to add llsec seclevel for monitor interfaces which we
don't support yet. Otherwise we will access llsec mib which isn't
initialized for monitors.
Alexander Aring [Mon, 5 Apr 2021 00:30:51 +0000 (20:30 -0400)]
net: ieee802154: stop dump llsec seclevels for monitors
This patch stops dumping llsec seclevels for monitors which we don't
support yet. Otherwise we will access llsec mib which isn't initialized
for monitors.