Daniel Borkmann [Wed, 5 Jan 2022 19:35:13 +0000 (11:35 -0800)]
bpf: Don't promote bogus looking registers after null check.
If we ever get to a point again where we convert a bogus looking <ptr>_or_null
typed register containing a non-zero fixed or variable offset, then lets not
reset these bounds to zero since they are not and also don't promote the register
to a <ptr> type, but instead leave it as <ptr>_or_null. Converting to a unknown
register could be an avenue as well, but then if we run into this case it would
allow to leak a kernel pointer this way.
Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
John Fastabend [Tue, 4 Jan 2022 21:46:45 +0000 (13:46 -0800)]
bpf, sockmap: Fix double bpf_prog_put on error case in map_link
sock_map_link() is called to update a sockmap entry with a sk. But, if the
sock_map_init_proto() call fails then we return an error to the map_update
op against the sockmap. In the error path though we need to cleanup psock
and dec the refcnt on any programs associated with the map, because we
refcnt them early in the update process to ensure they are pinned for the
psock. (This avoids a race where user deletes programs while also updating
the map with new socks.)
In current code we do the prog refcnt dec explicitely by calling
bpf_prog_put() when the program was found in the map. But, after commit
'38207a5e81230' in this error path we've already done the prog to psock
assignment so the programs have a reference from the psock as well. This
then causes the psock tear down logic, invoked by sk_psock_put() in the
error path, to similarly call bpf_prog_put on the programs there.
To be explicit this logic does the prog->psock assignment:
if (msg_*)
psock_set_prog(...)
Then the error path under the out_progs label does a similar check and
dec with:
if (msg_*)
bpf_prog_put(...)
And the teardown logic sk_psock_put() does ...
psock_set_prog(msg_*, NULL)
... triggering another bpf_prog_put(...). Then KASAN gives us this splat,
found by syzbot because we've created an inbalance between bpf_prog_inc and
bpf_prog_put calling put twice on the program.
BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline]
BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829
BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829
Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641
To fix clean up error path so it doesn't try to do the bpf_prog_put in the
error path once progs are assigned then it relies on the normal psock
tear down logic to do complete cleanup.
For completness we also cover the case whereh sk_psock_init_strp() fails,
but this is not expected because it indicates an incorrect socket type
and should be caught earlier.
John Fastabend [Tue, 4 Jan 2022 20:59:18 +0000 (12:59 -0800)]
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Applications can be confused slightly because we do not always return the
same error code as expected, e.g. what the TCP stack normally returns. For
example on a sock err sk->sk_err instead of returning the sock_error we
return EAGAIN. This usually means the application will 'try again'
instead of aborting immediately. Another example, when a shutdown event
is received we should immediately abort instead of waiting for data when
the user provides a timeout.
These tend to not be fatal, applications usually recover, but introduces
bogus errors to the user or introduces unexpected latency. Before
'c5d2177a72a16' we fell back to the TCP stack when no data was available
so we managed to catch many of the cases here, although with the extra
latency cost of calling tcp_msg_wait_data() first.
To fix lets duplicate the error handling in TCP stack into tcp_bpf so
that we get the same error codes.
These were found in our CI tests that run applications against sockmap
and do longer lived testing, at least compared to test_sockmap that
does short-lived ping/pong tests, and in some of our test clusters
we deploy.
Its non-trivial to do these in a shorter form CI tests that would be
appropriate for BPF selftests, but we are looking into it so we can
ensure this keeps working going forward. As a preview one idea is to
pull in the packetdrill testing which catches some of this.
The root cause is the size of BPF_PSEUDO_FUNC instruction increases
from 2 to 3 after the address of called bpf-function is settled and
there are two bpf-to-bpf calls in test_pkt_access. The generated
instructions are shown below:
The four RGMII interface modes take care of the required RGMII delay
configuration at the PHY and should not be limited by the network MAC
driver. Sadly, gemini was only permitting RGMII mode with no delays,
which would require the required delay to be inserted via PCB tracking
or by the MAC.
However, there are designs that require the PHY to add the delay, which
is impossible without Gemini permitting the other three PHY interface
modes. Fix the driver to allow these.
This series fixes the RGMII delays for 88E1118 Marvell PHYs, after
a report by Corentin Labbe that the Marvell driver fails to work.
Patch 1 cleans up the paged register accesses in m88e1118_config_init()
and patch 2 adds the RGMII delay configuration.
This comes with an element of risk as existing DT may need to be fixed
for this in a similar way as we have done in the recent past for other
PHY drivers that have misinterpreted the RGMII interface modes.
====================
net: phy: marvell: configure RGMII delays for 88E1118
Corentin Labbe reports that the SSI 1328 does not work when allowing
the PHY to operate at gigabit speeds, but does work with the generic
PHY driver.
This appears to be because m88e1118_config_init() writes a fixed value
to the MSCR register, claiming that this is to enable 1G speeds.
However, this always sets bits 4 and 5, enabling RGMII transmit and
receive delays. The suspicion is that the original board this was
added for required the delays to make 1G speeds work.
Add the necessary configuration for RGMII delays for the 88E1118 to
bring this into line with the requirements for RGMII support, and thus
make the SSI 1328 work.
Corentin Labbe has tested this on gemini-ssi1328 and gemini-ns2502.
net: phy: marvell: use phy_write_paged() to set MSCR
Use phy_write_paged() in m88e1118_config_init() to set the MSCR value.
We leave the other paged write for the LEDs in case the DT register
parsing is relying on this page.
Eric Dumazet [Wed, 5 Jan 2022 17:08:49 +0000 (09:08 -0800)]
netlink: do not allocate a device refcount tracker in ethnl_default_notify()
As reported by Johannes, the tracker allocated in
ethnl_default_notify() is not really needed, as this
function is not expected to change a device reference count.
Linus Torvalds [Wed, 5 Jan 2022 17:30:10 +0000 (09:30 -0800)]
Merge tag 'gpio-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"Here are two last fixes for this release cycle from the GPIO
subsystem:
- fix irq offset calculation in gpio-aspeed-sgpio
- update the MAINTAINERS entry for gpio-brcmstb"
* tag 'gpio-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: update gpio-brcmstb maintainers
gpio: gpio-aspeed-sgpio: Fix wrong hwirq base in irq handler
Jakub Kicinski [Wed, 5 Jan 2022 17:00:11 +0000 (09:00 -0800)]
Merge tag 'ieee802154-for-net-2022-01-05' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2022-01-05
Below I have a last minute fix for the atusb driver.
Pavel fixes a KASAN uninit report for the driver. This version is the
minimal impact fix to ease backporting. A bigger rework of the driver to
avoid potential similar problems is ongoing and will come through net-next
when ready.
* tag 'ieee802154-for-net-2022-01-05' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
ieee802154: atusb: fix uninit value in atusb_set_extended_addr
====================
This series deletes the no-op cross-chip notifier support for MRP and
HSR, features which were introduced relatively recently and did not get
full review at the time. The new code is functionally equivalent, but
simpler.
Vladimir Oltean [Wed, 5 Jan 2022 13:18:13 +0000 (15:18 +0200)]
net: dsa: remove cross-chip support for HSR
The cross-chip notifiers for HSR are bypass operations, meaning that
even though all switches in a tree are notified, only the switch
specified in the info structure is targeted.
We can eliminate the unnecessary complexity by deleting the cross-chip
notifier logic and calling the ds->ops straight from port.c.
Vladimir Oltean [Wed, 5 Jan 2022 13:18:12 +0000 (15:18 +0200)]
net: dsa: remove cross-chip support for MRP
The cross-chip notifiers for MRP are bypass operations, meaning that
even though all switches in a tree are notified, only the switch
specified in the info structure is targeted.
We can eliminate the unnecessary complexity by deleting the cross-chip
notifier logic and calling the ds->ops straight from port.c.
Vladimir Oltean [Wed, 5 Jan 2022 13:18:11 +0000 (15:18 +0200)]
net: dsa: fix incorrect function pointer check for MRP ring roles
The cross-chip notifier boilerplate code meant to check the presence of
ds->ops->port_mrp_add_ring_role before calling it, but checked
ds->ops->port_mrp_add instead, before calling
ds->ops->port_mrp_add_ring_role.
Therefore, a driver which implements one operation but not the other
would trigger a NULL pointer dereference.
There isn't any such driver in DSA yet, so there is no reason to
backport the change. Issue found through code inspection.
Danielle Ratson [Wed, 5 Jan 2022 10:22:27 +0000 (12:22 +0200)]
mlxsw: pci: Avoid flow control for EMAD packets
Locally generated packets ingress the device through its CPU port. When
the CPU port is congested and there are not enough credits in its
headroom buffer, packets can be dropped.
While this might be acceptable for data packets that traverse the
network, configuration packets exchanged between the host and the device
(EMADs) should not be subjected to this flow control.
The "sdq_lp" bit in the SDQ (Send Descriptor Queue) context allows the
host to instruct the device to treat packets sent on this queue as
"local processing" and always process them, regardless of the state of
the CPU port's headroom.
Add the definition of this bit and set it for the dedicated SDQ reserved
for the transmission of EMAD packets. This makes the "local processing"
bit in the WQE (Work Queue Element) redundant, so clear it.
this is a pull request of 15 patches for net-next/master.
The first patch is by me and removed an unused variable from the
usb_8dev driver.
Andy Shevchenko contributes a patch for the mcp251x driver, which
removes an unneeded assignment.
Jimmy Assarsson's patch for the kvaser_usb makes use of units.h in the
assignment of frequencies.
Lad Prabhakar provides 2 patches, converting the ti_hecc and the
sja1000 driver to make use of platform_get_irq().
The 10 remaining patches are by Vincent Mailhol. First the etas_es58x
driver populates the net_device::dev_port. The next 5 patches cleanup
the handling of CAN error and CAN RTR messages of all drivers. The
remaining 4 patches enhance the CAN controller mode flag handling and
export it via netlink to user space.
====================
David S. Miller [Wed, 5 Jan 2022 14:46:23 +0000 (14:46 +0000)]
Merge branch 'dsa-cleanups'
Vladimir Oltean says:
====================
Cleanup to main DSA structures
This series contains changes that do the following:
- struct dsa_port reduced from 576 to 544 bytes, and first cache line a
bit better organized
- struct dsa_switch from 160 to 136 bytes, and first cache line a bit
better organized
- struct dsa_switch_tree from 112 to 104 bytes, and first cache line a
bit better organized
====================
Vladimir Oltean [Wed, 5 Jan 2022 13:21:41 +0000 (15:21 +0200)]
net: dsa: combine two holes in struct dsa_switch_tree
There is a 7 byte hole after dst->setup and a 4 byte hole after
dst->default_proto. Combining them, we have a single hole of just 3
bytes on 64 bit machines.
Vladimir Oltean [Wed, 5 Jan 2022 13:21:39 +0000 (15:21 +0200)]
net: dsa: make dsa_switch :: num_ports an unsigned int
Currently, num_ports is declared as size_t, which is defined as
__kernel_ulong_t, therefore it occupies 8 bytes of memory.
Even switches with port numbers in the range of tens are exotic, so
there is no need for this amount of storage.
Additionally, because the max_num_bridges member right above it is also
4 bytes, it means the compiler needs to add padding between the last 2
fields. By reducing the size, we don't need that padding and can reduce
the struct size.
Vladimir Oltean [Wed, 5 Jan 2022 13:21:38 +0000 (15:21 +0200)]
net: dsa: merge all bools of struct dsa_switch into a single u32
struct dsa_switch has 9 boolean properties, many of which are in fact
set by drivers for custom behavior (vlan_filtering_is_global,
needs_standalone_vlan_filtering, etc etc). The binary layout of the
structure could be improved. For example, the "bool setup" at the
beginning introduces a gratuitous 7 byte hole in the first cache line.
The change merges all boolean properties into bitfields of an u32, and
places that u32 in the first cache line of the structure, since many
bools are accessed from the data path (untag_bridge_pvid, vlan_filtering,
vlan_filtering_is_global).
We place this u32 after the existing ds->index, which is also 4 bytes in
size. As a positive side effect, ds->tagger_data now fits into the first
cache line too, because 4 bytes are saved.
Vladimir Oltean [Wed, 5 Jan 2022 13:21:37 +0000 (15:21 +0200)]
net: dsa: move dsa_port :: type near dsa_port :: index
Both dsa_port :: type and dsa_port :: index introduce a 4 octet hole
after them, so we can group them together and the holes would be
eliminated, turning 16 octets of storage into just 8. This makes the
cpu_dp pointer fit in the first cache line, which is good, because
dsa_slave_to_master(), called by dsa_enqueue_skb(), uses it.
Vladimir Oltean [Wed, 5 Jan 2022 13:21:36 +0000 (15:21 +0200)]
net: dsa: merge all bools of struct dsa_port into a single u8
struct dsa_port has 5 bool members which create quite a number of 7 byte
holes in the structure layout. By merging them all into bitfields of an
u8, and placing that u8 in the 1-byte hole after dp->mac and dp->stp_state,
we can reduce the structure size from 576 bytes to 552 bytes on arm64.
Vladimir Oltean [Wed, 5 Jan 2022 13:21:35 +0000 (15:21 +0200)]
net: dsa: move dsa_port :: stp_state near dsa_port :: mac
The MAC address of a port is 6 octets in size, and this creates a 2
octet hole after it. There are some other u8 members of struct dsa_port
that we can put in that hole. One such member is the stp_state.
Currently, hns3 PF and VF module have two sets of rss and tqp stats APIs
to provide get and set functions. Most of these APIs are the same. There is
no need to keep these two sets of same functions for double development and
bugfix work.
This series refactor the rss and tqp stats APIs in hns3 PF and VF by
implementing one set of common APIs for PF and VF reuse and deleting the
old APIs.
====================
Jie Wang [Wed, 5 Jan 2022 14:20:15 +0000 (22:20 +0800)]
net: hns3: create new common cmd code for PF and VF modules
Currently PF and VF use two sets of command code for modules to interact
with firmware. These codes values are same espect the macro names. It is
redundent to keep two sets of command code for same functions between PF
and VF.
So this patch firstly creates a unified command code for PF and VF module.
We keep the macro name same with the PF command code name to avoid too many
meaningless modifications. Secondly the new common command codes are used
to replace the old ones in VF and deletes the old ones.
Jie Wang [Wed, 5 Jan 2022 14:20:14 +0000 (22:20 +0800)]
net: hns3: refactor VF tqp stats APIs with new common tqp stats APIs
This patch firstly uses new tqp struct(hclge_comm_tqp) and removes the
old VF tqp struct(hclgevf_tqp). All the tqp stats members used in VF module
are modified according to the new hclge_comm_tqp.
Secondly VF tqp stats APIs are refactored to use new common tqp stats APIs.
The old tqp stats APIs in VF are deleted.
Jie Wang [Wed, 5 Jan 2022 14:20:13 +0000 (22:20 +0800)]
net: hns3: refactor PF tqp stats APIs with new common tqp stats APIs
This patch firstly uses new tqp struct(hclge_comm_tqp) and deletes the
old PF tqp struct(hclge_tqp). All the tqp stats members used in PF module
are modified according to the new hclge_comm_tqp.
Secondly PF tqp stats APIs are refactored to use new common tqp stats APIs.
The old tqp stats APIs in PF are deleted.
Jie Wang [Wed, 5 Jan 2022 14:20:12 +0000 (22:20 +0800)]
net: hns3: create new set of common tqp stats APIs for PF and VF reuse
This patch creates new set of common tqp stats structures and APIs for PF
and VF tqp stats module. Subfunctions such as get tqp stats, update tqp
stats and reset tqp stats are inclued in this patch.
These new common tqp stats APIs will be used to replace the old PF and VF
tqp stats APIs in next patches.
Jie Wang [Wed, 5 Jan 2022 14:20:11 +0000 (22:20 +0800)]
net: hns3: refactor VF rss init APIs with new common rss init APIs
This patch uses common rss init APIs to replace the old APIs in VF rss
module and removes the old VF rss init APIs. Several related Subfunctions
and macros are also modified in this patch.
Jie Wang [Wed, 5 Jan 2022 14:20:10 +0000 (22:20 +0800)]
net: hns3: refactor PF rss init APIs with new common rss init APIs
This patch uses common rss init APIs to replace the old APIs in PF rss
module and deletes the old PF rss init APIs. Some related subfunctions and
macros are also modified in this patch.
Jie Wang [Wed, 5 Jan 2022 14:20:09 +0000 (22:20 +0800)]
net: hns3: create new set of common rss init APIs for PF and VF reuse
This patch creates new set of common rss init APIs for PF and VF rss
module. Subfunctions called by rss init process are also created include
rss tuple configuration and rss indirect table configuration.
These new common rss init APIs will be used to replace the old PF and VF
rss init APIs in next patches.
Jie Wang [Wed, 5 Jan 2022 14:20:08 +0000 (22:20 +0800)]
net: hns3: refactor VF rss set APIs with new common rss set APIs
This patch uses new common rss set APIs to replace the old APIs in VF rss
module and removes those old rss set APIs. The related macros in VF are
also modified.
Jie Wang [Wed, 5 Jan 2022 14:20:07 +0000 (22:20 +0800)]
net: hns3: refactor PF rss set APIs with new common rss set APIs
This patch uses new common rss set APIs to replace the old APIs in PF rss
module and deletes the old rss set APIs. The related macros are also
modified.
Jie Wang [Wed, 5 Jan 2022 14:20:05 +0000 (22:20 +0800)]
net: hns3: refactor VF rss get APIs with new common rss get APIs
This patch firstly uses new rss parameter struct(hclge_comm_rss_cfg) as
child member of hclgevf_dev and deletes the original child rss parameter
member(hclgevf_rss_cfg). All the rss parameter members used in VF rss
module is modified according to the new hclge_comm_rss_cfg.
Secondly VF rss get APIs are refactored to use new common rss get APIs. The
old rss get APIs in VF are deleted.
Jie Wang [Wed, 5 Jan 2022 14:20:04 +0000 (22:20 +0800)]
net: hns3: refactor PF rss get APIs with new common rss get APIs
This patch firstly uses new rss parameter struct(hclge_comm_rss_cfg) as
child member of hclge_dev and deletes the original child rss parameter
members in vport. All the vport child rss parameter members used in PF rss
module is modified according to the new hclge_comm_rss_cfg.
Secondly PF rss get APIs are refactored to use new common rss get APIs. The
old rss get APIs in PF are deleted.
Jie Wang [Wed, 5 Jan 2022 14:20:03 +0000 (22:20 +0800)]
net: hns3: create new set of common rss get APIs for PF and VF rss module
The PF and VF rss get APIs are almost the same espect the suffixes of API
names. These same impementions bring double development and bugfix work.
So this patch creates new common rss get APIs for PF and VF rss module.
Subfunctions called by rss query process are also created(e.g. rss tuple
conversion APIs).
These new common rss get APIs will be used to replace PF and VF old rss
APIs in next patches.
Jie Wang [Wed, 5 Jan 2022 14:20:02 +0000 (22:20 +0800)]
net: hns3: refactor hclge_comm_send function in PF/VF drivers
Currently, there are two different sets of special command codes in PF and
VF cmdq modules, this is because VF driver only uses small part of all the
command codes. In other words, these not used command codes in VF are also
sepcial command codes theoretically.
So this patch unifes the special command codes and deletes the bool param
is_pf of hclge_comm_send. All the related functions are refactored
according to the new hclge_comm_send function prototype.
Jie Wang [Wed, 5 Jan 2022 14:20:01 +0000 (22:20 +0800)]
net: hns3: create new rss common structure hclge_comm_rss_cfg
Currently PF stores its rss parameters in vport structure. VF stores rss
configurations in hclgevf_rss_cfg structure. Actually hns3 rss parameters
are same beween PF and VF. The two set of rss parameters are redundent and
may add extra bugfix work.
So this patch creates new common rss parameter struct(hclge_comm_rss_cfg)
to unify PF and VF rss configurations.
These new structures will be used to unify rss configurations in PF and VF
rss APIs in next patches.
Jiri Olsa [Tue, 4 Jan 2022 12:10:30 +0000 (13:10 +0100)]
bpf/selftests: Fix namespace mount setup in tc_redirect
The tc_redirect umounts /sys in the new namespace, which can be
mounted as shared and cause global umount. The lazy umount also
takes down mounted trees under /sys like debugfs, which won't be
available after sysfs mounts again and could cause fails in other
tests.
Paul Chaignon [Tue, 4 Jan 2022 18:00:13 +0000 (19:00 +0100)]
bpftool: Probe for instruction set extensions
This patch introduces new probes to check whether the kernel supports
instruction set extensions v2 and v3. The first introduced eBPF
instructions BPF_J{LT,LE,SLT,SLE} in commit 92b31a9af73b ("bpf: add
BPF_J{LT,LE,SLT,SLE} instructions"). The second introduces 32-bit
variants of all jump instructions in commit 092ed0968bb6 ("bpf:
verifier support JMP32").
These probes are useful for userspace BPF projects that want to use newer
instruction set extensions on newer kernels, to reduce the programs'
sizes or their complexity. LLVM already provides an mcpu=probe option to
automatically probe the kernel and select the newest-supported
instruction set extension. That is however not flexible enough for all
use cases. For example, in Cilium, we only want to use the v3
instruction set extension on v5.10+, even though it is supported on all
kernels v5.1+.
Paul Chaignon [Tue, 4 Jan 2022 17:59:57 +0000 (18:59 +0100)]
bpftool: Probe for bounded loop support
This patch introduces a new probe to check whether the verifier supports
bounded loops as introduced in commit 2589726d12a1 ("bpf: introduce
bounded loops"). This patch will allow BPF users such as Cilium to probe
for loop support on startup and only unconditionally unroll loops on
older kernels.
The results are displayed as part of the miscellaneous section, as shown
below.
Paul Chaignon [Tue, 4 Jan 2022 17:59:29 +0000 (18:59 +0100)]
bpftool: Refactor misc. feature probe
There is currently a single miscellaneous feature probe,
HAVE_LARGE_INSN_LIMIT, to check for the 1M instructions limit in the
verifier. Subsequent patches will add additional miscellaneous probes,
which follow the same pattern at the existing probe. This patch
therefore refactors the probe to avoid code duplication in subsequent
patches.
The BPF program type and the checked error numbers in the
HAVE_LARGE_INSN_LIMIT probe are changed to better generalize to other
probes. The feature probe retains its current behavior despite those
changes.
====================
net: lan966x: Extend switchdev with mdb support
This patch series extends lan966x with mdb support by implementing
the switchdev callbacks: SWITCHDEV_OBJ_ID_PORT_MDB and
SWITCHDEV_OBJ_ID_HOST_MDB.
It adds support for both ipv4/ipv6 entries and l2 entries.
v2->v3:
- rename PGID_FIRST and PGID_LAST to PGID_GP_START and PGID_GP_END
- don't forget and relearn an entry for the CPU if there are more
references to the cpu.
v1->v2:
- rename lan966x_mac_learn_impl to __lan966x_mac_learn
- rename lan966x_mac_cpu_copy to lan966x_mac_ip_learn
- fix grammar and typos in comments and commit messages
- add reference counter for entries that copy frames to CPU
====================
Horatiu Vultur [Tue, 4 Jan 2022 15:33:38 +0000 (16:33 +0100)]
net: lan966x: Extend switchdev with mdb support
Extend lan966x driver with mdb support by implementing the switchdev
calls: SWITCHDEV_OBJ_ID_PORT_MDB and SWITCHDEV_OBJ_ID_HOST_MDB.
It is allowed to add both ipv4/ipv6 entries and l2 entries. To add
ipv4/ipv6 entries is not required to use the PGID table while for l2
entries it is required. The PGID table is much smaller than MAC table
so only fewer l2 entries can be added.
Horatiu Vultur [Tue, 4 Jan 2022 15:33:37 +0000 (16:33 +0100)]
net: lan966x: Add PGID_GP_START and PGID_GP_END
The first entries in the PGID table are used by the front ports while
the last entries are used for different purposes like flooding mask,
copy to CPU, etc. So add these macros to define which entries can be
used for general purpose.
Horatiu Vultur [Tue, 4 Jan 2022 15:33:36 +0000 (16:33 +0100)]
net: lan966x: Add function lan966x_mac_ip_learn()
Extend mac functionality with the function lan966x_mac_ip_learn. This
function adds an entry in the MAC table for IP multicast addresses.
These entries can copy a frame to the CPU but also can forward on the
front ports.
This functionality is needed for mdb support. In case the CPU and some
of the front ports subscribe to an IP multicast address.
====================
net: ethernet: mtk_eth_soc: refactoring and Clause 45
Rework value and type of mdio read and write functions in mtk_eth_soc
and generally clean up and unify both functions.
Then add support to access Clause 45 phy registers, using newly
introduced helper inline functions added by a patch Russell King has
suggested in a reply to an earlier version of this series [1].
All three commits are tested on the Bananapi BPi-R64 board having
MediaTek MT7531BE DSA gigE switch using clause 22 MDIO and
Ubiquiti UniFi 6 LR access point having Aquantia AQR112C PHY using
clause 45 MDIO.
v11: also address return value of mtk_mdio_busy_wait
v10: correct order of SoB lines in 2/3, change patch order in series
v9: improved formatting and Cc missing maintainer
v8: add patch from Russel King, switch to bitfield helper macros
v7: remove unneeded variables and order OR-ed call parameters
v6: further clean up functions and more cleanly separate patches
v5: fix wrong variable name in first patch covered by follow-up patch
v4: clean-up return values and types, split into two commits
v3: return -1 instead of 0xffff on error in _mtk_mdio_write
v2: use MII_DEVADDR_C45_SHIFT and MII_REGADDR_C45_MASK to extract
device id and register address. Unify read and write functions to
have identical types and parameter names where possible as we are
anyway already replacing both function bodies.
====================
Implement read and write access to IEEE 802.3 Clause 45 Ethernet
phy registers while making use of new mdiobus_c45_regad and
mdiobus_c45_devad helpers.
Tested on the Ubiquiti UniFi 6 LR access point featuring
MediaTek MT7622BV WiSoC with Aquantia AQR112C.
Daniel Golle [Tue, 4 Jan 2022 12:06:22 +0000 (12:06 +0000)]
net: ethernet: mtk_eth_soc: fix return values and refactor MDIO ops
Instead of returning -1 (-EPERM) when MDIO bus is stuck busy
while writing or 0xffff if it happens while reading, return the
appropriate -ETIMEDOUT. Also fix return type to int instead of u32.
Refactor functions to use bitfield helpers instead of having various
masking and shifting constants in the code, which also results in the
register definitions in the header file being more obviously related
to what is stated in the MediaTek's Reference Manual.
Fixes: 656e705243fd0 ("net-next: mediatek: add support for MT7623 ethernet") Signed-off-by: Daniel Golle <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 5 Jan 2022 11:15:16 +0000 (11:15 +0000)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-01-04
This series contains updates to i40e and iavf drivers.
Mateusz adjusts displaying of failed VF MAC message when the failure is
expected as well as modifying an NVM info message to not confuse the user
for i40e.
Di Zhu fixes a use-after-free issue MAC filters for i40e.
Jedrzej fixes an issue with misreporting of Rx and Tx queues during
reinitialization for i40e.
Karen correct checking of channel queue configuration to occur against
active queues for iavf.
====================
Vincent Mailhol [Mon, 13 Dec 2021 16:02:26 +0000 (01:02 +0900)]
can: netlink: report the CAN controller mode supported flags
Currently, the CAN netlink interface provides no easy ways to check
the capabilities of a given controller. The only method from the
command line is to try each CAN_CTRLMODE_* individually to check
whether the netlink interface returns an -EOPNOTSUPP error or not
(alternatively, one may find it easier to directly check the source
code of the driver instead...)
This patch introduces a method for the user to check both the
supported and the static capabilities. The proposed method introduces
a new IFLA nest: IFLA_CAN_CTRLMODE_EXT which extends the current
IFLA_CAN_CTRLMODE. This is done to guaranty a full forward and
backward compatibility between the kernel and the user land
applications.
The IFLA_CAN_CTRLMODE_EXT nest contains one single entry:
IFLA_CAN_CTRLMODE_SUPPORTED. Because this entry is only used in one
direction: kernel to userland, no new struct nla_policy are
introduced.
Below table explains how IFLA_CAN_CTRLMODE_SUPPORTED (hereafter:
"supported") and can_ctrlmode::flags (hereafter: "flags") allow us to
identify both the supported and the static capabilities, when masked
with any of the CAN_CTRLMODE_* bit flags:
Vincent Mailhol [Mon, 13 Dec 2021 16:02:24 +0000 (01:02 +0900)]
can: dev: add sanity check in can_set_static_ctrlmode()
Previous patch removed can_priv::ctrlmode_static to replace it with
can_get_static_ctrlmode().
A condition sine qua non for this to work is that the controller
static modes should never be set in can_priv::ctrlmode_supported
(c.f. the comment on can_priv::ctrlmode_supported which states that it
is for "options that can be *modified* by netlink"). Also, this
condition is already correctly fulfilled by all existing drivers
which rely on the ctrlmode_static feature.
Nonetheless, we added an extra safeguard in can_set_static_ctrlmode()
to return an error value and to warn the developer who would be
adventurous enough to set to static a given feature that is already
set to supported.
The drivers which rely on the static controller mode are then updated
to check the return value of can_set_static_ctrlmode().
As such, there is no need to store this information. This patch remove
the field ctrlmode_static of struct can_priv and provides, in
replacement, the inline function can_get_static_ctrlmode() which
returns the same value.
Vincent Mailhol [Tue, 7 Dec 2021 12:15:31 +0000 (21:15 +0900)]
can: do not increase tx_bytes statistics for RTR frames
The actual payload length of the CAN Remote Transmission Request (RTR)
frames is always 0, i.e. no payload is transmitted on the wire.
However, those RTR frames still use the DLC to indicate the length of
the requested frame.
As such, net_device_stats::tx_bytes should not be increased when
sending RTR frames.
The function can_get_echo_skb() already returns the correct length,
even for RTR frames (c.f. [1]). However, for historical reasons, the
drivers do not use can_get_echo_skb()'s return value and instead, most
of them store a temporary length (or dlc) in some local structure or
array. Using the return value of can_get_echo_skb() solves the
issue. After doing this, such length/dlc fields become unused and so
this patch does the adequate cleaning when needed.
This patch fixes all the CAN drivers.
Finally, can_get_echo_skb() is decorated with the __must_check
attribute in order to force future drivers to correctly use its return
value (else the compiler would emit a warning).
[1] commit ed3320cec279 ("can: dev: __can_get_echo_skb():
fix real payload length return value for RTR frames")
Vincent Mailhol [Tue, 7 Dec 2021 12:15:30 +0000 (21:15 +0900)]
can: do not increase rx_bytes statistics for RTR frames
The actual payload length of the CAN Remote Transmission Request (RTR)
frames is always 0, i.e. no payload is transmitted on the wire.
However, those RTR frames still use the DLC to indicate the length of
the requested frame.
As such, net_device_stats::rx_bytes should not be increased for the
RTR frames.
Vincent Mailhol [Tue, 7 Dec 2021 12:15:29 +0000 (21:15 +0900)]
can: do not copy the payload of RTR frames
The actual payload length of the CAN Remote Transmission Request (RTR)
frames is always 0, i.e. no payload is transmitted on the wire.
However, those RTR frames still use the DLC to indicate the length of
the requested frame.
For this reason, it is incorrect to copy the payload of RTR frames
(the payload buffer would only contain garbage data). This patch
encapsulates the payload copy in a check toward the RTR flag.
Vincent Mailhol [Tue, 7 Dec 2021 12:15:28 +0000 (21:15 +0900)]
can: kvaser_usb: do not increase tx statistics when sending error message frames
The CAN error message frames (i.e. error skb) are an interface
specific to socket CAN. The payload of the CAN error message frames
does not correspond to any actual data sent on the wire. Only an error
flag and a delimiter are transmitted when an error occurs (c.f. ISO
11898-1 section 10.4.4.2 "Error flag").
For this reason, it makes no sense to increment the tx_packets and
tx_bytes fields of struct net_device_stats when sending an error
message frame because no actual payload will be transmitted on the
wire.
N.B. Sending error message frames is a very specific feature which, at
the moment, is only supported by the Kvaser Hydra hardware. Please
refer to [1] for more details on the topic.
Vincent Mailhol [Tue, 7 Dec 2021 12:15:27 +0000 (21:15 +0900)]
can: do not increase rx statistics when generating a CAN rx error message frame
The CAN error message frames (i.e. error skb) are an interface
specific to socket CAN. The payload of the CAN error message frames
does not correspond to any actual data sent on the wire. Only an error
flag and a delimiter are transmitted when an error occurs (c.f. ISO
11898-1 section 10.4.4.2 "Error flag").
For this reason, it makes no sense to increment the rx_packets and
rx_bytes fields of struct net_device_stats because no actual payload
were transmitted on the wire.
The field dev_port of struct net_device indicates the port number of a
network device [1]. This patch populates this field.
This field can be helpful to distinguish between the two network
interfaces of a dual channel device (i.e. ES581.4 or ES582.1). Indeed,
at the moment, all the network interfaces of a same device share the
same static udev attributes c.f. output of:
| udevadm info --attribute-walk /sys/class/net/canX
The dev_port attribute can then be used to write some udev rules to,
for example, assign a permanent name to each network interface based
on the serial/dev_port pair (which is convenient when you have a test
bench with several CAN devices connected simultaneously and wish to
keep consistent interface names upon reboot).
Lad Prabhakar [Tue, 21 Dec 2021 19:45:08 +0000 (19:45 +0000)]
can: ti_hecc: ti_hecc_probe(): use platform_get_irq() to get the interrupt
platform_get_resource(pdev, IORESOURCE_IRQ, ..) relies on static
allocation of IRQ resources in DT core code, this causes an issue when
using hierarchical interrupt domains using "interrupts" property in
the node as this bypasses the hierarchical setup and messes up the irq
chaining.
In preparation for removal of static setup of IRQ resource from DT
core code use platform_get_irq().
Andy Shevchenko [Thu, 2 Dec 2021 20:58:55 +0000 (22:58 +0200)]
can: mcp251x: mcp251x_gpio_setup(): Get rid of duplicate of_node assignment
GPIO library does copy the of_node from the parent device of the GPIO
chip, there is no need to repeat this in the individual drivers.
Remove assignment here.
For the details one may look into the of_gpio_dev_init()
implementation.
M Chetan Kumar [Tue, 4 Jan 2022 15:02:13 +0000 (20:32 +0530)]
Revert "net: wwan: iosm: Keep device at D0 for s2idle case"
Depending on BIOS configuration IOSM driver exchanges
protocol required for putting device into D3L2 or D3L1.2.
ipc_pcie_suspend_s2idle() is implemented to put device to D3L1.2.
This patch forces PCI core know this device should stay at D0.
- pci_save_state()is expensive since it does a lot of slow PCI
config reads.
The reported issue is not observed on x86 platform. The supurios
wake on AMD platform needs to be futher debugged with orignal patch
submitter [1]. Also the impact of adding pci_save_state() needs to be
assessed by testing it on other platforms.
This reverts commit f4dd5174e273("net: wwan: iosm: Keep device
at D0 for s2idle case").
Martin Habets [Sun, 2 Jan 2022 08:41:22 +0000 (08:41 +0000)]
sfc: The RX page_ring is optional
The RX page_ring is an optional feature that improves
performance. When allocation fails the driver can still
function, but possibly with a lower bandwidth.
Guard against dereferencing a NULL page_ring.
iavf: Fix limit of total number of queues to active queues of VF
In the absence of this validation, if the user requests to
configure queues more than the enabled queues, it results in
sending the requested number of queues to the kernel stack
(due to the asynchronous nature of VF response), in which
case the stack might pick a queue to transmit that is not
enabled and result in Tx hang. Fix this bug by
limiting the total number of queues allocated for VF to
active queues of VF.
i40e: Fix incorrect netdev's real number of RX/TX queues
There was a wrong queues representation in sysfs during
driver's reinitialization in case of online cpus number is
less than combined queues. It was caused by stopped
NetworkManager, which is responsible for calling vsi_open
function during driver's initialization.
In specific situation (ex. 12 cpus online) there were 16 queues
in /sys/class/net/<iface>/queues. In case of modifying queues with
value higher, than number of online cpus, then it caused write
errors and other errors.
Add updating of sysfs's queues representation during driver
initialization.
i40e: Fix for displaying message regarding NVM version
When loading the i40e driver, it prints a message like: 'The driver for the
device detected a newer version of the NVM image v1.x than expected v1.y.
Please install the most recent version of the network driver.' This is
misleading as the driver is working as expected.
Fix that by removing the second part of message and changing it from
dev_info to dev_dbg.
Fixes: 4fb29bddb57f ("i40e: The driver now prints the API version in error message") Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Gurucharan G <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Di Zhu [Mon, 29 Nov 2021 13:52:01 +0000 (19:52 +0600)]
i40e: fix use-after-free in i40e_sync_filters_subtask()
Using ifconfig command to delete the ipv6 address will cause
the i40e network card driver to delete its internal mac_filter and
i40e_service_task kernel thread will concurrently access the mac_filter.
These two processes are not protected by lock
so causing the following use-after-free problems.
i40e: Fix to not show opcode msg on unsuccessful VF MAC change
Hide i40e opcode information sent during response to VF in case when
untrusted VF tried to change MAC on the VF interface.
This is implemented by adding an additional parameter 'hide' to the
response sent to VF function that hides the display of error
information, but forwards the error code to VF.
Previously it was not possible to send response with some error code
to VF without displaying opcode information.
Pavel Skripkin [Tue, 4 Jan 2022 18:28:06 +0000 (21:28 +0300)]
ieee802154: atusb: fix uninit value in atusb_set_extended_addr
Alexander reported a use of uninitialized value in
atusb_set_extended_addr(), that is caused by reading 0 bytes via
usb_control_msg().
Fix it by validating if the number of bytes transferred is actually
correct, since usb_control_msg() may read less bytes, than was requested
by caller.
Fail log:
BUG: KASAN: uninit-cmp in ieee802154_is_valid_extended_unicast_addr include/linux/ieee802154.h:310 [inline]
BUG: KASAN: uninit-cmp in atusb_set_extended_addr drivers/net/ieee802154/atusb.c:1000 [inline]
BUG: KASAN: uninit-cmp in atusb_probe.cold+0x29f/0x14db drivers/net/ieee802154/atusb.c:1056
Uninit value used in comparison: 311daa649a2003bd stack handle: 000000009a2003bd
ieee802154_is_valid_extended_unicast_addr include/linux/ieee802154.h:310 [inline]
atusb_set_extended_addr drivers/net/ieee802154/atusb.c:1000 [inline]
atusb_probe.cold+0x29f/0x14db drivers/net/ieee802154/atusb.c:1056
usb_probe_interface+0x314/0x7f0 drivers/usb/core/driver.c:396
Jakub Kicinski [Tue, 4 Jan 2022 16:13:02 +0000 (08:13 -0800)]
Merge tag 'mac80211-next-for-net-next-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Just a few more changes:
- mac80211: allow non-standard VHT MCSes 10/11
- mac80211: add sleepable station iterator for drivers
- nl80211: clarify a comment
- mac80211: small cleanup to use typed element helpers
* tag 'mac80211-next-for-net-next-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
mac80211: use ieee80211_bss_get_elem()
nl80211: clarify comment for mesh PLINK_BLOCKED state
mac80211: Add stations iterator where the iterator function may sleep
mac80211: allow non-standard VHT MCS-10/11
====================
Jakub Kicinski [Tue, 4 Jan 2022 15:18:27 +0000 (07:18 -0800)]
Merge tag 'mac80211-for-net-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two more changes:
- mac80211: initialize a variable to avoid using it uninitialized
- mac80211 mesh: put some data structures into the container to
fix bugs with and not have to deal with allocation failures
* tag 'mac80211-for-net-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_mesh
mac80211: initialize variable have_higher_than_11mbit
====================
Felix Fietkau [Mon, 20 Dec 2021 10:51:47 +0000 (11:51 +0100)]
nl80211: clarify comment for mesh PLINK_BLOCKED state
When a mesh link is in blocked state, it is very useful to still allow
auth requests from the peer to re-establish it.
When a remote node is power cycled, the peer state can easily end up
in blocked state if multiple auth attempts are performed. Since this
can lead to several minutes of downtime, we should accept auth attempts
of the peer after it has come back.
mac80211: Add stations iterator where the iterator function may sleep
ieee80211_iterate_active_interfaces() and
ieee80211_iterate_active_interfaces_atomic() already exist, where the
former allows the iterator function to sleep. Add
ieee80211_iterate_stations() which is similar to
ieee80211_iterate_stations_atomic() but allows the iterator to sleep.
This is needed for adding SDIO support to the rtw88 driver. Some
interators there are reading or writing registers. With the SDIO ops
(sdio_readb, sdio_writeb and friends) this means that the iterator
function may sleep.
Ping-Ke Shih [Mon, 3 Jan 2022 01:36:21 +0000 (09:36 +0800)]
mac80211: allow non-standard VHT MCS-10/11
Some AP can possibly try non-standard VHT rate and mac80211 warns and drops
packets, and leads low TCP throughput.
Rate marked as a VHT rate but data is invalid: MCS: 10, NSS: 2
WARNING: CPU: 1 PID: 7817 at net/mac80211/rx.c:4856 ieee80211_rx_list+0x223/0x2f0 [mac8021
Since commit c27aa56a72b8 ("cfg80211: add VHT rate entries for MCS-10 and MCS-11")
has added, mac80211 adds this support as well.
After this patch, throughput is good and iw can get the bitrate:
rx bitrate: 975.1 MBit/s VHT-MCS 10 80MHz short GI VHT-NSS 2
or
rx bitrate: 1083.3 MBit/s VHT-MCS 11 80MHz short GI VHT-NSS 2
Pavel Skripkin [Thu, 30 Dec 2021 19:55:47 +0000 (22:55 +0300)]
mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_mesh
Syzbot hit NULL deref in rhashtable_free_and_destroy(). The problem was
in mesh_paths and mpp_paths being NULL.
mesh_pathtbl_init() could fail in case of memory allocation failure, but
nobody cared, since ieee80211_mesh_init_sdata() returns void. It led to
leaving 2 pointers as NULL. Syzbot has found null deref on exit path,
but it could happen anywhere else, because code assumes these pointers are
valid.
Since all ieee80211_*_setup_sdata functions are void and do not fail,
let's embedd mesh_paths and mpp_paths into parent struct to avoid
adding error handling on higher levels and follow the pattern of others
setup_sdata functions
mlme.c:5332:7: warning: Branch condition evaluates to a
garbage value
have_higher_than_11mbit)
^~~~~~~~~~~~~~~~~~~~~~~
have_higher_than_11mbit is only set to true some of the time in
ieee80211_get_rates() but is checked all of the time. So
have_higher_than_11mbit needs to be initialized to false.
David S. Miller [Tue, 4 Jan 2022 12:40:22 +0000 (12:40 +0000)]
Merge branch 'namespacify-mtu-ipv4'
xu xin says:
====================
ipv4: Namespaceify two sysctls related with mtu
The following patch series enables the min_pmtu and mtu_expires to
be visible and configurable per net namespace. Different namespace
application might have different requirements on the setting of
min_pmtu and mtu_expires.
If these two patches are applied, inside a net namespace we create,
we can see two more sysctls under /proc/sys/net/ipv4/route:
1. min_pmtu
2. mtu_expires
where min_pmtu and mtu_expires are configurable.
====================
Eric Dumazet [Tue, 4 Jan 2022 09:45:08 +0000 (01:45 -0800)]
sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc
tx_queue_len can be set to ~0U, we need to be more
careful about overflows.
__fls(0) is undefined, as this report shows:
UBSAN: shift-out-of-bounds in net/sched/sch_qfq.c:1430:24
shift exponent 51770272 is too large for 32-bit type 'int'
CPU: 0 PID: 25574 Comm: syz-executor.0 Not tainted 5.16.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x201/0x2d8 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:151 [inline]
__ubsan_handle_shift_out_of_bounds+0x494/0x530 lib/ubsan.c:330
qfq_init_qdisc+0x43f/0x450 net/sched/sch_qfq.c:1430
qdisc_create+0x895/0x1430 net/sched/sch_api.c:1253
tc_modify_qdisc+0x9d9/0x1e20 net/sched/sch_api.c:1660
rtnetlink_rcv_msg+0x934/0xe60 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x200/0x470 net/netlink/af_netlink.c:2496
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x814/0x9f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0xaea/0xe60 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
____sys_sendmsg+0x5b9/0x910 net/socket.c:2409
___sys_sendmsg net/socket.c:2463 [inline]
__sys_sendmsg+0x280/0x370 net/socket.c:2492
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This code used to copy in an unsigned long worth of data before
the sockptr_t conversion, so restore that.
Fixes: a7b75c5a8c41 ("net: pass a sockptr_t into ->setsockopt") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jakub Kicinski [Tue, 4 Jan 2022 03:48:27 +0000 (19:48 -0800)]
net: fixup build after bpf header changes
Recent bpf-next merge brought in header changes which uncovered
includes missing in net-next which were not present in bpf-next.
Build problems happen only on less-popular arches like hppa,
sparc, alpha etc.
I could repro the build problem with ice but not the mlx5 problem
Abdul was reporting. mlx5 does look like it should include filter.h,
anyway.
This patch adds support for scatter gather DMA. DMA in PMAC splits
the packet into several buffers when the MTU on the CPU port is
less than the MTU of the switch. The first buffer starts at an
offset of NET_IP_ALIGN. In subsequent buffers, dma ignores the
offset. Thanks to this patch, the user can still connect to the
device in such a situation. For normal configurations, the patch
has no effect on performance.