Vladimir Oltean [Mon, 19 Jul 2021 17:14:42 +0000 (20:14 +0300)]
net: dsa: sja1105: delete the best_effort_vlan_filtering mode
Simply put, the best-effort VLAN filtering mode relied on VLAN retagging
from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to
decode the source port in the tagger, but the VLAN retagging
implementation inside the sja1105 chips is not the best and we were
relying on marginal operating conditions.
The most notable limitation of the best-effort VLAN filtering mode is
its incapacity to treat this case properly:
ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp4 master br0
bridge vlan del dev swp4 vid 1
bridge vlan add dev swp4 vid 1 pvid
When sending an untagged packet through swp2, the expectation is for it
to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1
on egress). But the switch will send it as egress-untagged.
but it failed miserably because it broke PTP RX timestamping, in a way
that cannot be corrected due to hardware issues related to VLAN
retagging.
So with either PTP broken or pushing VLAN headers on egress for untagged
packets being broken, the sad reality is that the best-effort VLAN
filtering code is broken. Delete it.
Note that this means there will be a temporary loss of functionality in
this driver until it is replaced with something better (network stack
RX/TX capability for "mode 2" as described in
Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware
bridge" case). We simply cannot keep this code until that driver rework
is done, it is super bloated and tangled with tag_8021q.
David S. Miller [Tue, 20 Jul 2021 13:23:13 +0000 (06:23 -0700)]
Merge branch 'veth-flexible-channel-numbers'
Paolo Abeni says:
====================
veth: more flexible channels number configuration
XDP setups can benefit from multiple veth RX/TX queues. Currently
veth allow setting such number only at creation time via the
'numrxqueues' and 'numtxqueues' parameters.
This series introduces support for the ethtool set_channel operation
and allows configuring the queue number via a new module parameter.
The veth default configuration is not changed.
Finally self-tests are updated to check the new features, with both
valid and invalid arguments.
This iteration is a rebase of the most recent RFC, it does not provide
a module parameter to configure the default number of queues, but I
think could be worthy
RFC v1 -> RFC v2:
- report more consistent 'combined' count
- make set_channel as resilient as possible to errors
- drop module parameter - but I would still consider it.
- more self-tests
====================
David S. Miller [Tue, 20 Jul 2021 13:22:14 +0000 (06:22 -0700)]
Merge branch 'bridge-vlan-multicast'
Nikolay Aleksandrov says:
====================
net: bridge: multicast: add vlan support
This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
vlan disabled checks. We need both flags because one is controlled by
user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
Since the latter is also used for synchronization between the multicast
and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
rely on it when checking if a vlan context is disabled. The multicast
fast-path has 3 new bit tests on the cache-hot bridge flags field, I
didn't observe any measurable difference. I haven't forced either
context options to be always disabled when the other type is enabled
because the state consists of timers which either expire (router) or
don't affect the normal operation. Some options, like the mcast querier
one, won't be allowed to change for the disabled context type, that will
come with a future patch-set which adds per-vlan querier control.
Another important addition is the global vlan options, so far we had
only per bridge/port vlan options but in order to control vlan multicast
snooping globally we need to add a new type of global vlan options.
They can be changed only on the bridge device and are dumped only when a
special flag is set in the dump request. The first global option is vlan
mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
private flag. It can be set only on master vlan entries. There will be
many more global vlan options in the future both for multicast config
and other per-vlan options (e.g. STP).
There's a lot of room for improvements, I'll do some of the initial
ones but splitting the state to different contexts opens the door
for a lot more. Also any new multicast options become vlan-supported with
very little to no effort by using the same contexts.
Short patch description:
patches 01-04: initial mcast context add, no functional changes
patch 05: adds vlan mcast init and control helpers and uses them on
vlan create/destroy
patch 06: adds a global bridge mcast vlan snooping knob (default
off)
patches 07-08: add a helper for users which must derive the contexts
based on current bridge and vlan options (e.g. timers)
patch 09: adds checks for disabled vlan contexts in packet
processing and timers
patch 10: adds support for per-vlan querier and tagged queries
patch 11: adds router port vlan id in the notifications
patches 12-14: add global vlan options support (change, dump, notify)
patch 15: adds per-vlan global mcast snooping control
Future patch-sets which build on this one (in order):
- vlan state mcast handling
- user-space mdb contexts (currently only the bridge contexts are used
there)
- all bridge multicast config options added per-vlan global and per
vlan/port
- iproute2 support for all the new uAPIs
- selftests
This set has been stress-tested (deleting/adding ports/vlans while changing
vlan mcast snooping while processing IGMP/MLD packets), and also has
passed all bridge self-tests. I'm sending this set as early as possible
since there're a few more related sets that should go in the same
release to get proper and full mcast vlan snooping support.
====================
qeth uses three device_type structs - a generic one, and one for each
sub-driver (which is used for fixed-layer devices only). Instead of
exporting these device_types back&forth between the driver's modules,
make all the logic self-contained within the sub-drivers.
On disc->setup() they either install their own device_type, or add the
sysfs attributes that are missing in the generic device_type. Later on
disc->remove() these attributes are removed again from any device that
has the generic device_type.
Commit fb64de1bc36c ("s390/qeth: phase out OSN support") spelled out
why the OSN support in qeth is in a bad shape, and put any remaining
interested parties on notice to speak up before it gets ripped out.
It's 2021 now, so make true on that promise and remove all the
OSN-specific parts from qeth. This also means that we no longer need to
export various parts of the cmd & data path internals to the L2 driver.
David S. Miller [Tue, 20 Jul 2021 13:13:37 +0000 (06:13 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2021-07-19
This series contains updates to iavf and i40e drivers.
Stefan Assmann adds locking to a path that does not acquire a spinlock
where needed for i40e. He also adjusts locking of critical sections to
help avoid races and removes overriding of the adapter state during
pending reset for iavf driver.
====================
David S. Miller [Tue, 20 Jul 2021 13:11:28 +0000 (06:11 -0700)]
Merge branch 'veth-flexible-channel-numbers'
Paolo Abeni says:
====================
veth: more flexible channels number configuration
XDP setups can benefit from multiple veth RX/TX queues. Currently
veth allow setting such number only at creation time via the
'numrxqueues' and 'numtxqueues' parameters.
This series introduces support for the ethtool set_channel operation
and allows configuring the queue number via a new module parameter.
The veth default configuration is not changed.
Finally self-tests are updated to check the new features, with both
valid and invalid arguments.
This iteration is a rebase of the most recent RFC, it does not provide
a module parameter to configure the default number of queues, but I
think could be worthy
RFC v1 -> RFC v2:
- report more consistent 'combined' count
- make set_channel as resilient as possible to errors
- drop module parameter - but I would still consider it.
- more self-tests
====================
Paolo Abeni [Tue, 20 Jul 2021 08:41:51 +0000 (10:41 +0200)]
veth: create by default nr_possible_cpus queues
This allows easier XDP usage. The number of default active
queues is not changed: 1 RX and 1 TX so that this does
not introduce overhead on the datapath for queue selection.
v1 -> v2:
- drop the module parameter, force default to nr_possible_cpus - Toke
Paolo Abeni [Tue, 20 Jul 2021 08:41:50 +0000 (10:41 +0200)]
veth: implement support for set_channel ethtool op
This change implements the set_channel() ethtool operation,
preserving the current defaults values and allowing up set
the number of queues in the range set ad device creation
time.
The update operation tries hard to leave the device in a
consistent status in case of errors.
RFC v1 -> RFC v2:
- don't flip device status on set_channel()
- roll-back the changes if possible on error - Jackub
Paolo Abeni [Tue, 20 Jul 2021 08:41:48 +0000 (10:41 +0200)]
veth: always report zero combined channels
veth get_channel currently reports for channels being both RX/TX and
combined. As Jakub noted:
"""
ethtool man page is relatively clear, unfortunately the kernel code
is not and few read the man page. A channel is approximately an IRQ,
not a queue, and IRQ can't be dedicated and combined simultaneously
"""
This patch changes the information exposed by veth_get_channels,
setting max_combined to zero, being more consistent with the above
statement. The ethtool_channels is always cleared by the caller, we just
need to avoid setting the 'combined' fields.
unix sockets allows to send file descriptors via SCM_RIGHTS type messages.
Each such send call forces kernel to allocate up to 2Kb memory for
struct scm_fp_list.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if allocation failed.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. Allocation is temporary and limited by 4GB.
memcg: enable accounting for inet_bin_bucket cache
net namespace can create up to 64K tcp and dccp ports and force kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
memcg: enable accounting for IP address and routing-related objects
An netadmin inside container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.
These objects can be manually removed, though usually they lives
in memory till destroy of its net namespace.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
One of such objects is the 'struct fib6_node' mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
Obsoleted in_interrupt() does not describe real execution context properly.
>From include/linux/preempt.h:
The following macros are deprecated and should not be used in new code:
in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
To verify the current execution context new macro should be used instead:
in_task() - We're in task context
memcg: enable accounting for net_device and Tx/Rx queues
Container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat it again and again.
Net device can request the creation of up to 4096 tx and rx queues,
and force kernel to allocate up to several tens of megabytes memory
per net device.
It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
David S. Miller [Tue, 20 Jul 2021 12:41:29 +0000 (05:41 -0700)]
Merge branch 'bridge-vlan-multicast'
Nikolay Aleksandrov says:
====================
net: bridge: multicast: add vlan support
This patchset adds initial per-vlan multicast support, most of the code
deals with moving to multicast context pointers from bridge/port pointers.
That allows us to switch them with the per-vlan contexts when a multicast
packet is being processed and vlan multicast snooping has been enabled.
That is controlled by a global bridge option added in patch 06 which is
off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
that this option can change only under RTNL and doesn't require
multicast_lock, so we need to be careful when retrieving mcast contexts
in parallel. For packet processing they are switched only once in
br_multicast_rcv() and then used until the packet has been processed.
For the most part we need these contexts only to read config values and
check if they are disabled. The global mcast state which is maintained
consists of querier and router timers, the rest are config options.
The port mcast state which is maintained consists of query timer and
link to router port list if it's ever marked as a router port. Port
multicast contexts _must_ be used only with their respective global
contexts, that is a bridge port's mcast context must be used only with
bridge's global mcast context and a vlan/port's mcast context must be
used only with that vlan's global mcast context due to the router port
lists. This way a bridge port can be marked as a router in multiple
vlans, but might not be a router in some other vlan. Also this allows us
to have per-vlan querier elections, per-vlan queries and basically the
whole multicast state becomes per-vlan when the option is enabled.
One of the hardest parts is synchronization with vlan's memory
management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
which is changed only under multicast_lock. When a vlan is being
destroyed first that flag is removed under the lock, then the multicast
context is torn down which includes waiting for any outstanding context
timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
it must be checked first if the contexts are vlan and the multicast_lock
has been acquired. That is done by all IGMP/MLD packet processing
functions and timers. When processing a packet we have RCU so the vlan
memory won't be freed, but if the flag is missing we must not process it.
The timers are synchronized in the same way with the addition of waiting
for them to finish in case they are running after removing the flag
under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
snooping requires vlan filtering to be enabled, if it's disabled then
snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
vlan disabled checks. We need both flags because one is controlled by
user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
Since the latter is also used for synchronization between the multicast
and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
rely on it when checking if a vlan context is disabled. The multicast
fast-path has 3 new bit tests on the cache-hot bridge flags field, I
didn't observe any measurable difference. I haven't forced either
context options to be always disabled when the other type is enabled
because the state consists of timers which either expire (router) or
don't affect the normal operation. Some options, like the mcast querier
one, won't be allowed to change for the disabled context type, that will
come with a future patch-set which adds per-vlan querier control.
Another important addition is the global vlan options, so far we had
only per bridge/port vlan options but in order to control vlan multicast
snooping globally we need to add a new type of global vlan options.
They can be changed only on the bridge device and are dumped only when a
special flag is set in the dump request. The first global option is vlan
mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
private flag. It can be set only on master vlan entries. There will be
many more global vlan options in the future both for multicast config
and other per-vlan options (e.g. STP).
There's a lot of room for improvements, I'll do some of the initial
ones but splitting the state to different contexts opens the door
for a lot more. Also any new multicast options become vlan-supported with
very little to no effort by using the same contexts.
Short patch description:
patches 01-04: initial mcast context add, no functional changes
patch 05: adds vlan mcast init and control helpers and uses them on
vlan create/destroy
patch 06: adds a global bridge mcast vlan snooping knob (default
off)
patches 07-08: add a helper for users which must derive the contexts
based on current bridge and vlan options (e.g. timers)
patch 09: adds checks for disabled vlan contexts in packet
processing and timers
patch 10: adds support for per-vlan querier and tagged queries
patch 11: adds router port vlan id in the notifications
patches 12-14: add global vlan options support (change, dump, notify)
patch 15: adds per-vlan global mcast snooping control
Future patch-sets which build on this one (in order):
- vlan state mcast handling
- user-space mdb contexts (currently only the bridge contexts are used
there)
- all bridge multicast config options added per-vlan global and per
vlan/port
- iproute2 support for all the new uAPIs
- selftests
This set has been stress-tested (deleting/adding ports/vlans while changing
vlan mcast snooping while processing IGMP/MLD packets), and also has
passed all bridge self-tests. I'm sending this set as early as possible
since there're a few more related sets that should go in the same
release to get proper and full mcast vlan snooping support.
====================
Add a new global vlan option which controls whether multicast snooping
is enabled or disabled for a single vlan. It controls the vlan private
flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.
net: bridge: vlan: notify when global options change
Add support for global options notifications. They use only RTM_NEWVLAN
since global options can only be set and are contained in a separate
vlan global options attribute. Notifications are compressed in ranges
where possible, i.e. the sequential vlan options are equal.
net: bridge: vlan: add support for dumping global vlan options
Add a new vlan options dump flag which causes only global vlan options
to be dumped. The dumps are done only with bridge devices, ports are
ignored. They support vlan compression if the options in sequential
vlans are equal (currently always true).
We can have two types of vlan options depending on context:
- per-device vlan options (split in per-bridge and per-port)
- global vlan options
The second type wasn't supported in the bridge until now, but we need
them for per-vlan multicast support, per-vlan STP support and other
options which require global vlan context. They are contained in the global
bridge vlan context even if the vlan is not configured on the bridge device
itself. This patch adds initial netlink attributes and support for setting
these global vlan options, they can only be set (RTM_NEWVLAN) and the
operation must use the bridge device. Since there are no such options yet
it shouldn't have any functional effect.
net: bridge: multicast: add vlan querier and query support
Add basic vlan context querier support, if the contexts passed to
multicast_alloc_query are vlan then the query will be tagged. Also
handle querier start/stop of vlan contexts.
net: bridge: multicast: check if should use vlan mcast ctx
Add helpers which check if the current bridge/port multicast context
should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD
processing, timers and new group addition. It is important for vlans to
disable processing of timer/packet after the multicast_lock is obtained
if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two
cases when that flag is missing:
- if the vlan is getting destroyed it will be removed and timers will
be stopped
- if the vlan mcast snooping is being disabled
net: bridge: multicast: use the port group to port context helper
We need to use the new port group to port context helper in places where
we cannot pass down the proper context (i.e. functions that can be
called by timers or outside the packet snooping paths).
net: bridge: multicast: add helper to get port mcast context from port group
Add br_multicast_pg_to_port_ctx() which returns the proper port multicast
context from either port or vlan based on bridge option and vlan flags.
As the comment inside explains the locking is a bit tricky, we rely on
the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change
and we also require it to be held to call that helper. If we find the
vlan under rcu and it still has the flag then we can be sure it will be
alive until we unlock multicast_lock which should be enough.
Note that the context might change from vlan to bridge between different
calls to this helper as the mcast vlan knob requires only rtnl so it should
be used carefully and for read-only/check purposes.
Add a global knob that controls if vlan multicast snooping is enabled.
The proper contexts (vlan or bridge-wide) will be chosen based on the knob
when processing packets and changing bridge device state. Note that
vlans have their individual mcast snooping enabled by default, but this
knob is needed to turn on bridge vlan snooping. It is disabled by
default. To enable the knob vlan filtering must also be enabled, it
doesn't make sense to have vlan mcast snooping without vlan filtering
since that would lead to inconsistencies. Disabling vlan filtering will
also automatically disable vlan mcast snooping.
net: bridge: multicast: add vlan state initialization and control
Add helpers to enable/disable vlan multicast based on its flags, we need
two flags because we need to know if the vlan has multicast enabled
globally (user-controlled) and if it has it enabled on the specific device
(bridge or port). The new private vlan flags are:
- BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used
when removing a vlan, toggling vlan mcast snooping and controlling
single vlan (kernel-controlled, valid under RTNL and multicast_lock)
- BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the
vlan, used to control the bridge-wide vlan mcast snooping for a
single vlan (user-controlled, can be checked under any context)
Bridge vlan contexts are created with multicast snooping enabled by
default to be in line with the current bridge snooping defaults. In
order to actually activate per vlan snooping and context usage a
bridge-wide knob will be added later which will default to disabled.
If that knob is enabled then automatically all vlan snooping will be
enabled. All vlan contexts are initialized with the current bridge
multicast context defaults.
net: bridge: multicast: use multicast contexts instead of bridge or port
Pass multicast context pointers to multicast functions instead of bridge/port.
This would make it easier later to switch these contexts to their per-vlan
versions. The patch is basically search and replace, no functional changes.
net: bridge: multicast: factor out bridge multicast context
Factor out the bridge's global multicast context into a separate
structure which will later be used for per-vlan global context.
No functional changes intended.
net: bridge: multicast: factor out port multicast context
Factor out the port's multicast context into a separate structure which
will later be shared for per-port,vlan context. No functional changes
intended. We need the structure even if bridge multicast is not defined
to pass down as pointer to forwarding functions.
Stefan Assmann [Tue, 16 Mar 2021 10:01:41 +0000 (11:01 +0100)]
iavf: fix locking of critical sections
To avoid races between iavf_init_task(), iavf_reset_task(),
iavf_watchdog_task(), iavf_adminq_task() as well as the shutdown and
remove functions more locking is required.
The current protection by __IAVF_IN_CRITICAL_TASK is needed in
additional places.
- The reset task performs state transitions, therefore needs locking.
- The adminq task acts on replies from the PF in
iavf_virtchnl_completion() which may alter the states.
- The init task is not only run during probe but also if a VF gets stuck
to reinitialize it.
- The shutdown function performs a state transition.
- The remove function performs a state transition and also free's
resources.
iavf_lock_timeout() is introduced to avoid waiting infinitely
and cause a deadlock. Rather unlock and print a warning.
Stefan Assmann [Fri, 5 Mar 2021 12:38:56 +0000 (13:38 +0100)]
iavf: do not override the adapter state in the watchdog task
The iavf watchdog task overrides adapter->state to __IAVF_RESETTING
when it detects a pending reset. Then schedules iavf_reset_task() which
takes care of the reset.
The reset task is capable of handling the reset without changing
adapter->state. In fact we lose the state information when the watchdog
task prematurely changes the adapter state. This may lead to a crash if
instead of the reset task the iavf_remove() function gets called before
the reset task.
In that case (if we were in state __IAVF_RUNNING previously) the
iavf_remove() function triggers iavf_close() which fails to close the
device because of the incorrect state information.
This may result in a crash due to pending interrupts.
kernel BUG at drivers/pci/msi.c:357!
[...]
Call Trace:
[<ffffffffbddf24dd>] pci_disable_msix+0x3d/0x50
[<ffffffffc08d2a63>] iavf_reset_interrupt_capability+0x23/0x40 [iavf]
[<ffffffffc08d312a>] iavf_remove+0x10a/0x350 [iavf]
[<ffffffffbddd3359>] pci_device_remove+0x39/0xc0
[<ffffffffbdeb492f>] __device_release_driver+0x7f/0xf0
[<ffffffffbdeb49c3>] device_release_driver+0x23/0x30
[<ffffffffbddcabb4>] pci_stop_bus_device+0x84/0xa0
[<ffffffffbddcacc2>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffffbddf361f>] pci_iov_remove_virtfn+0xaf/0x160
[<ffffffffbddf3bcc>] sriov_disable+0x3c/0xf0
[<ffffffffbddf3ca3>] pci_disable_sriov+0x23/0x30
[<ffffffffc0667365>] i40e_free_vfs+0x265/0x2d0 [i40e]
[<ffffffffc0667624>] i40e_pci_sriov_configure+0x144/0x1f0 [i40e]
[<ffffffffbddd5307>] sriov_numvfs_store+0x177/0x1d0
Code: 00 00 e8 3c 25 e3 ff 49 c7 86 88 08 00 00 00 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 8b 7b 28 e8 0d 44
RIP [<ffffffffbbbf1068>] free_msi_irqs+0x188/0x190
The solution is to not touch the adapter->state in iavf_watchdog_task()
and let the reset task handle the state transition.
Stefan Assmann [Thu, 4 Mar 2021 09:34:30 +0000 (10:34 +0100)]
i40e: improve locking of mac_filter_hash
i40e_config_vf_promiscuous_mode() calls
i40e_getnum_vf_vsi_vlan_filters() without acquiring the
mac_filter_hash_lock spinlock.
This is unsafe because mac_filter_hash may get altered in another thread
while i40e_getnum_vf_vsi_vlan_filters() traverses the hashes.
Simply adding the spinlock in i40e_getnum_vf_vsi_vlan_filters() is not
possible as it already gets called in i40e_get_vlan_list_sync() with the
spinlock held. Therefore adding a wrapper that acquires the spinlock and
call the correct function where appropriate.
It is one use-after-free in ip_check_mc_rcu.
In ip_mc_del_src, the ip_sf_list of pmc has been freed under pmc->lock protection.
But access to ip_sf_list in ip_check_mc_rcu is not protected by the lock.
David S. Miller [Sat, 17 Jul 2021 00:32:15 +0000 (17:32 -0700)]
Merge branch 'vmxnet3-version-6'
Ronak Doshi says:
====================
vmxnet3: upgrade to version 6
vmxnet3 emulation has recently added several new features which includes
increase in queues supported, remove power of 2 limitation on queues,
add RSS for ESP IPv6, etc. This patch series extends the vmxnet3 driver
to leverage these new features.
Compatibility is maintained using existing vmxnet3 versioning mechanism as
follows:
- new features added to vmxnet3 emulation are associated with new vmxnet3
version viz. vmxnet3 version 6.
- emulation advertises all the versions it supports to the driver.
- during initialization, vmxnet3 driver picks the highest version number
supported by both the emulation and the driver and configures emulation
to run at that version.
In particular, following changes are introduced:
Patch 1:
This patch introduces utility macros for vmxnet3 version 6 comparison
and updates Copyright information.
Patch 2:
This patch adds support to increase maximum Tx/Rx queues from 8 to 32.
Patch 3:
This patch removes the limitation of power of 2 on the queues.
Patch 4:
Uses existing get_rss_hash_opts and set_rss_hash_opts methods to add
support for ESP IPv6 RSS.
Patch 5:
This patch reports correct RSS hash type based on the type of RSS
performed.
Patch 6:
This patch updates maximum configurable mtu to 9190.
Patch 7:
With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
with this patch, the driver can configure emulation to run at vmxnet3
version 6.
====================
With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
the driver can configure emulation to run at vmxnet3 version 6, provided
the emulation advertises support for version 6.
Vmxnet3 version 4 added support for ESP RSS. However, only IPv4 was
supported. With vmxnet3 version 6, this patch enables RSS for ESP
IPv6 packets as well.
vmxnet3: remove power of 2 limitation on the queues
With version 6, vmxnet3 relaxes the restriction on queues to
be power of two. This is helpful in cases (Edge VM) where
vcpus are less than 8 and device requires more than 4 queues.
Currently, vmxnet3 supports maximum of 8 Tx/Rx queues. With increase
in number of vcpus on a VM, to achieve better performance and utilize
idle vcpus, we need to increase the max number of queues supported.
This patch enhances vmxnet3 to support maximum of 32 Tx/Rx queues.
Increasing the Rx queues also increases the probability of distrubuting
the traffic from different flows to different queues with RSS.
vmxnet3 is currently at version 4 and this patch initiates the
preparation to accommodate changes for upto version 6. Introduced
utility macros for vmxnet3 version 6 comparison and update Copyright
information.
Xin Long [Fri, 16 Jul 2021 21:44:07 +0000 (17:44 -0400)]
tipc: keep the skb in rcv queue until the whole data is read
Currently, when userspace reads a datagram with a buffer that is
smaller than this datagram, the data will be truncated and only
part of it can be received by users. It doesn't seem right that
users don't know the datagram size and have to use a huge buffer
to read it to avoid the truncation.
This patch to fix it by keeping the skb in rcv queue until the
whole data is read by users. Only the last msg of the datagram
will be marked with MSG_EOR, just as TCP/SCTP does.
Note that this will work as above only when MSG_EOR is set in the
flags parameter of recvmsg(), so that it won't break any old user
applications.
David S. Miller [Sat, 17 Jul 2021 00:25:13 +0000 (17:25 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/t
nguy/next-queue
Tony Nguyen says:
====================
1GbE Intel Wired LAN Driver Updates 2021-07-16
Vinicius Costa Gomes says:
Add support for steering traffic to specific RX queues using Flex Filters.
As the name implies, Flex Filters are more flexible than using
Layer-2, VLAN or MAC address filters, one of the reasons is that they
allow "AND" operations more easily, e.g. when the user wants to steer
some traffic based on the source MAC address and the packet ethertype.
Future work include adding support for offloading tc-u32 filters to
the hardware.
The series is divided as follows:
Patch 1/5, add the low level primitives for configuring Flex filters.
Patch 2/5 and 3/5, allow ethtool to manage Flex filters.
Patch 4/5, when specifying filters that have multiple predicates, use
Flex filters.
Patch 5/5, Adds support for exposing the i225 LEDs using the LED subsystem.
====================
Kurt Kanzenbach [Tue, 29 Jun 2021 04:43:31 +0000 (21:43 -0700)]
igc: Make flex filter more flexible
Currently flex filters are only used for filters containing user data.
However, it makes sense to utilize them also for filters having
multiple conditions, because that's not supported by the driver at the
moment. Add it.
Kurt Kanzenbach [Tue, 29 Jun 2021 04:43:29 +0000 (21:43 -0700)]
igc: Integrate flex filter into ethtool ops
Use the flex filter mechanism to extend the current ethtool filter
operations by intercoperating the user data. This allows to match
eight more bytes within a Ethernet frame in addition to macs, ether
types and vlan.
The matching pattern looks like this:
* dest_mac [6]
* src_mac [6]
* tpid [2]
* vlan tci [2]
* ether type [2]
* user data [8]
This can be used to match Profinet traffic classes by FrameID range.
Kurt Kanzenbach [Tue, 29 Jun 2021 04:43:28 +0000 (21:43 -0700)]
igc: Add possibility to add flex filter
The Intel i225 NIC has the possibility to add flex filters which can
match up to the first 128 byte of a packet. These filters are useful
for all kind of packet matching. One particular use case is Profinet,
as the different traffic classes are distinguished by the frame id
range which cannot be matched by any other means.
Remove un-used "phy-reset-delay" property which found when do dtbs_check
(set additionalProperties: false in fsl,fec.yaml).
Double check current driver and commit history, "phy-reset-delay" never comes
up, so it should be safe to remove it.
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
arch/arm/boot/dts/imx7d-mba7.dt.yaml: ethernet@30be0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'
arch/arm/boot/dts/imx7d-mba7.dt.yaml: ethernet@30bf0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'
/arch/arm/boot/dts/imx7s-mba7.dt.yaml: ethernet@30be0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'
Correct node name for FEC which found when do dtbs_check.
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
arch/arm/boot/dts/imx35-eukrea-mbimxsd35-baseboard.dt.yaml: fec@50038000: $nodename:0: 'fec@50038000' does not match '^ethernet(@.*)?$'
arch/arm/boot/dts/imx35-pdk.dt.yaml: fec@50038000: $nodename:0: 'fec@50038000' does not match '^ethernet(@.*)?$'
Peilin Ye [Fri, 16 Jul 2021 01:52:45 +0000 (18:52 -0700)]
netdevsim: Add multi-queue support
Currently netdevsim only supports a single queue per port, which is
insufficient for testing multi-queue TC schedulers e.g. sch_mq. Extend
the current sysfs interface so that users can create ports with multiple
queues:
As an example, echoing "2 4 8" creates 4 ports, with 8 queues per port.
Note, this is compatible with the current interface, with default number
of queues set to 1. For example, echoing "2 4" creates 4 ports with 1
queue per port; echoing "2" simply creates 1 port with 1 queue.
Mark Gray [Thu, 15 Jul 2021 12:27:54 +0000 (08:27 -0400)]
openvswitch: Introduce per-cpu upcall dispatch
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is a correspondingly
large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
The corresponding user space code can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html
Bill Wendling [Wed, 14 Jul 2021 09:17:46 +0000 (02:17 -0700)]
bnx2x: remove unused variable 'cur_data_offset'
Fix the clang build warning:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1862:13: error: variable 'cur_data_offset' set but not used [-Werror,-Wunused-but-set-variable]
dma_addr_t cur_data_offset;
Yajun Deng [Thu, 15 Jul 2021 12:12:57 +0000 (20:12 +0800)]
rtnetlink: use nlmsg_notify() in rtnetlink_send()
The netlink_{broadcast, unicast} don't deal with 'if (err > 0' statement
but nlmsg_{multicast, unicast} do. The nlmsg_notify() contains them.
so use nlmsg_notify() instead. so that the caller wouldn't deal with
'if (err > 0' statement.
Haiyue Wang [Wed, 14 Jul 2021 07:34:59 +0000 (15:34 +0800)]
gve: fix the wrong AdminQ buffer overflow check
The 'tail' pointer is also free-running count, so it needs to be masked
as 'adminq_prod_cnt' does, to become an index value of AdminQ buffer.
Fixes: 5cdad90de62c ("gve: Batch AQ commands for creating and destroying queues.") Signed-off-by: Haiyue Wang <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This is the last patchset of the original large patchset. In the
previous patchset, a new BPF sockmap program BPF_SK_SKB_VERDICT
was introduced and UDP began to support it too. In this patchset,
we add BPF_SK_SKB_VERDICT support to Unix datagram socket, so that
we can finally splice Unix datagram socket and UDP socket. Please
check each patch description for more details.
and this patchset is available here:
https://github.com/congwang/linux/tree/sockmap3 Acked-by: John Fastabend <[email protected]>
---
v5: lift socket state check for dgram
remove ->unhash() case
add retries for EAGAIN in all test cases
remove an unused parameter of __unix_dgram_recvmsg()
rebase on the latest bpf-next
v4: fix af_unix disconnect case
add unix_unhash()
split out two small patches
reduce u->iolock critical section
remove an unused parameter of __unix_dgram_recvmsg()
v3: fix Kconfig dependency
make unix_read_sock() static
fix a UAF in unix_release()
add a missing header unix_bpf.c
v2: separate out from the original large patchset
rebase to the latest bpf-next
clean up unix_read_sock()
export sock_map_close()
factor out some helpers in selftests for code reuse
====================
Cong Wang [Sun, 4 Jul 2021 19:02:48 +0000 (12:02 -0700)]
af_unix: Implement unix_dgram_bpf_recvmsg()
We have to implement unix_dgram_bpf_recvmsg() to replace the
original ->recvmsg() to retrieve skmsg from ingress_msg.
AF_UNIX is again special here because the lack of
sk_prot->recvmsg(). I simply add a special case inside
unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.
Cong Wang [Sun, 4 Jul 2021 19:02:46 +0000 (12:02 -0700)]
af_unix: Add a dummy ->close() for sockmap
Unlike af_inet, unix_proto is very different, it does not even
have a ->close(). We have to add a dummy implementation to
satisfy sockmap. Normally it is just a nop, it is introduced only
for sockmap to replace it.
Cong Wang [Sun, 4 Jul 2021 19:02:45 +0000 (12:02 -0700)]
af_unix: Set TCP_ESTABLISHED for datagram sockets too
Currently only unix stream socket sets TCP_ESTABLISHED,
datagram socket can set this too when they connect to its
peer socket. At least __ip4_datagram_connect() does the same.
This will be used to determine whether an AF_UNIX datagram
socket can be redirected to in sockmap.
Cong Wang [Sun, 4 Jul 2021 19:02:43 +0000 (12:02 -0700)]
sock_map: Lift socket state restriction for datagram sockets
TCP and other connection oriented sockets have accept()
for each incoming connection on the server side, hence
they can just insert those fd's from accept() to sockmap,
which are of course established.
Now with datagram sockets begin to support sockmap and
redirection, the restriction is no longer applicable to
them, as they have no accept(). So we have to lift this
restriction for them. This is fine, because inside
bpf_sk_redirect_map() we still have another socket status
check, sock_map_redirect_allowed(), as a guard.
This also means they do not have to be removed from
sockmap when disconnecting.
Cong Wang [Sun, 4 Jul 2021 19:02:42 +0000 (12:02 -0700)]
sock_map: Relax config dependency to CONFIG_NET
Currently sock_map still has Kconfig dependency on CONFIG_INET,
but there is no actual functional dependency on it after we
introduce ->psock_update_sk_prot().
We have to extend it to CONFIG_NET now as we are going to
support AF_UNIX.
====================
Add bpf_get_func_ip helper that returns IP address of the
caller function for trampoline and krobe programs.
There're 2 specific implementation of the bpf_get_func_ip
helper, one for trampoline progs and one for kprobe/kretprobe
progs.
The trampoline helper call is replaced/inlined by the verifier
with simple move instruction. The kprobe/kretprobe is actual
helper call that returns prepared caller address.
Also available at:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
bpf/get_func_ip
v4 changes:
- dropped jit/x86 check for get_func_ip tracing check [Alexei]
- added code to bpf_get_func_ip_tracing [Alexei]
and tested that it works without inlining [Alexei]
- changed has_get_func_ip to check_get_func_ip [Andrii]
- replaced test assert loop with explicit asserts [Andrii]
- adde bpf_program__attach_kprobe_opts function
and use it for offset setup [Andrii]
- used bpf_program__set_autoload(false) for test6 [Andrii]
- added Masami's ack
v3 changes:
- resend with Masami in cc and v3 in each patch subject
v2 changes:
- use kprobe_running to get kprobe instead of cpu var [Masami]
- added support to add kprobe on function+offset
and test for that [Alan]
====================
Jiri Olsa [Wed, 14 Jul 2021 09:44:00 +0000 (11:44 +0200)]
selftests/bpf: Add test for bpf_get_func_ip in kprobe+offset probe
Adding test for bpf_get_func_ip in kprobe+ofset probe.
Because of the offset value it's arch specific, enabling
the new test only for x86_64 architecture.
Alan Maguire [Wed, 14 Jul 2021 09:43:59 +0000 (11:43 +0200)]
libbpf: Allow specification of "kprobe/function+offset"
kprobes can be placed on most instructions in a function, not
just entry, and ftrace and bpftrace support the function+offset
notification for probe placement. Adding parsing of func_name
into func+offset to bpf_program__attach_kprobe() allows the
user to specify
SEC("kprobe/bpf_fentry_test5+0x6")
...for example, and the offset can be passed to perf_event_open_probe()
to support kprobe attachment.
Jiri Olsa [Wed, 14 Jul 2021 09:43:55 +0000 (11:43 +0200)]
bpf: Add bpf_get_func_ip helper for tracing programs
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
specifically for all trampoline attach types.
The trampoline's caller IP address is stored in (ctx - 8) address.
so there's no reason to actually call the helper, but rather fixup
the call instruction and return [ctx - 8] value directly.
Jiri Olsa [Wed, 14 Jul 2021 09:43:54 +0000 (11:43 +0200)]
bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ip
Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.
The BPF_TRAMP_F_IP_ARG adds extra 3 instructions to trampoline code
and is used only by programs with bpf_get_func_ip helper, which is
added in following patch and sets call_get_func_ip bit.
This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
trampolines that have programs with call_get_func_ip set.
Jiri Olsa [Wed, 14 Jul 2021 09:43:53 +0000 (11:43 +0200)]
bpf, x86: Store caller's ip in trampoline stack
Storing caller's ip in trampoline's stack. Trampoline programs
can reach the IP in (ctx - 8) address, so there's no change in
program's arguments interface.
The IP address is takes from [fp + 8], which is return address
from the initial 'call fentry' call to trampoline.
This IP address will be returned via bpf_get_func_ip helper
helper, which is added in following patches.
Daniel Borkmann [Thu, 15 Jul 2021 20:33:04 +0000 (22:33 +0200)]
Merge branch 'bpf-timers'
Alexei Starovoitov says:
====================
The first request to support timers in bpf was made in 2013 before sys_bpf
syscall was added. That use case was periodic sampling. It was address with
attaching bpf programs to perf_events. Then during XDP development the timers
were requested to do garbage collection and health checks. They were worked
around by implementing timers in user space and triggering progs with
BPF_PROG_RUN command. The user space timers and perf_event+bpf timers are not
armed by the bpf program. They're done asynchronously vs program execution.
The XDP program cannot send a packet and arm the timer at the same time. The
tracing prog cannot record an event and arm the timer right away. This large
class of use cases remained unaddressed. The jiffy based and hrtimer based
timers are essential part of the kernel development and with this patch set
the hrtimer based timers will be available to bpf programs.
TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
to make sure bpf progs cannot crash the kernel.
v6->v7:
- address Andrii's comments and add his Acks.
v5->v6:
- address code review feedback from Martin and add his Acks.
- add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
second loop in map_free callbacks.
- add cond_resched_rcu.
v4->v5:
- Martin noticed the following issues:
. prog could be reallocated bpf_patch_insn_data().
Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
during insn patching.
. Added missing rcu_read_lock.
. Removed redundant record_map.
- Discovered few bugs with stress testing:
. One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
while another is trying to do something with the timer like bpf_timer_start/set_callback.
Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
so both cpus spin forever. The same problem existed in bpf_timer_cancel().
One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
the timer callback that tries to do bpf_timer_*() helper on the same timer.
The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
Because of this had to add callback_fn != NULL check to bpf_timer_cb().
Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
bpf_timer_cb is in softirq.
. Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
Fixed similar to bpf_map_put which is irq ready.
- Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
make the main logic more clear, since Martin and Yonghong brought up this concern.
v3->v4:
1.
Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
suggested by Martin. That makes bpf timer api match one to one to
kernel hrtimer api and provides greater flexibility.
2.
Martin also discovered the following issue with uref approach:
bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
rm /sys/fs/bpf/xdp_timer
nc -6 ::1 8888
bpftool net detach xdpgeneric dev lo
The timer callback stays active in the kernel though the prog was detached
and map usercnt == 0.
It happened because 'bpftool prog load' pinned the prog only.
The map usercnt went to zero. Subsequent attach and runs didn't
affect map usercnt. The timer was able to start and bpf_prog_inc itself.
When the prog was detached the prog stayed active.
To address this issue added
if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
to the first patch.
Which means that timers are allowed only in the maps that are held
by user space with open file descriptor or maps pinned in bpffs.
3.
Discovered that timers in inner maps were broken.
The inner map pointers are dynamic. Therefore changed bpf_timer_init()
to accept explicit map pointer supplied by the program instead
of hidden map pointer supplied by the verifier.
To make sure that pointer to a timer actually belongs to that map
added the verifier check in patch 3.
4.
Addressed Yonghong's feedback. Improved comments and added
dynamic in_nmi() check.
Added Acks.
v2->v3:
The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
called to make sure callback code doesn't disappear when timer is active and
drop refcnt when timer cb is done. That led to a ton of race conditions between
callback running and concurrent bpf_timer_init/start/cancel on another cpu,
and concurrent bpf_map_update/delete_elem, and map destroy.
Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
timers that bpf prog armed in a link list and canceled them when prog refcnt
went to zero. The race conditions disappeared, but timers in map-in-map could
not be supported cleanly, since timers in inner maps have inner map's life time
and don't match prog's life time.
This v3 approach makes timers to be owned by maps. It allows timers in inner
maps to be supported from the start. This apporach relies on "user refcnt"
scheme used in prog_array that stores bpf programs for bpf_tail_call. The
bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
callback does decrement the refcnt. The ops->map_release_uref is
responsible for cancelling the timers and dropping prog refcnt when user space
reference to a map is dropped. That addressed all the races and simplified
locking.
Andrii presented a use case where specifying callback_fn in bpf_timer_init()
is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
typically is called outside for timer callback, while bpf_timer_start() most
likely will be called from the callback.
timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
infinite loop to the verifier. The verifier had to be made smarter to recognize
such async callbacks. Patches 7,8,9 addressed that.
Patch 1 and 2 refactoring.
Patch 3 implements bpf timer helpers and locking.
Patch 4 implements map side of bpf timer support.
Patch 5 prevent pointer mismatch in bpf_timer_init.
Patch 6 adds support for BTF in inner maps.
Patch 7 teaches check_cfg() pass to understand async callbacks.
Patch 8 teaches do_check() pass to understand async callbacks.
Patch 9 teaches check_max_stack_depth() pass to understand async callbacks.
Patches 10 and 11 are the tests.
v1->v2:
- Addressed great feedback from Andrii and Toke.
- Fixed race between parallel bpf_timer_*() ops.
- Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
- Disallowed mmap and global timers.
- Allow spin_lock and bpf_timer in an element.
- Fixed memory leaks due to map destruction and LRU eviction.
- A ton more tests.
====================
selftests/bpf: Add a test with bpf_timer in inner map.
Check that map-in-map supports bpf timers.
Check that indirect "recursion" of timer callbacks works:
timer_cb1() { bpf_timer_set_callback(timer_cb2); }
timer_cb2() { bpf_timer_set_callback(timer_cb1); }
Check that
bpf_map_release
htab_free_prealloced_timers
bpf_timer_cancel_and_free
hrtimer_cancel
works while timer cb is running.
"while true; do ./test_progs -t timer_mim; done"
is a great stress test. It caught missing timer cancel in htab->extra_elems.
timer_mim_reject.c is a negative test that checks
that timer<->map mismatch is prevented.
Add bpf_timer test that creates timers in preallocated and
non-preallocated hash, in array and in lru maps.
Let array timer expire once and then re-arm it for 35 seconds.
Arm lru timer into the same callback.
Then arm and re-arm hash timers 10 times each.
At the last invocation of prealloc hash timer cancel the array timer.
Force timer free via LRU eviction and direct bpf_map_delete_elem.
bpf: Teach stack depth check about async callbacks.
Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);
bpf: Implement verifier support for validation of async callbacks.
bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.
Special care needs to be taken during state pruning.
The call insn doing bpf_timer_set_callback has to be a prune_point.
Otherwise short timer callbacks might not have prune points in front of
bpf_timer_set_callback() which means is_state_visited() will be called
after this call insn is processed in __check_func_call(). Which means that
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.
In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
bpf_timer_set_callback(.., timer_cb);
}
the 'timer_cb' is a pointer to a function.
ld_imm64 insn is used to carry this pointer.
bpf_pseudo_func() returns true for such ld_imm64 insn.
Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
Relax control flow check to allow such "recursion" that is seen as an infinite
loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
bpf_timer_set_callback() is done in the follow up patch.
BTF is required for 'struct bpf_timer' to be recognized inside map value.
The bpf timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:
bpf_timer_init() arguments are:
1. pointer to a timer (which is embedded in map element).
2. pointer to a map.
Make sure that pointer to a timer actually belongs to that map.
Use map_uid (which is unique id of inner map) to reject:
inner_map1 = bpf_map_lookup_elem(outer_map, key1)
inner_map2 = bpf_map_lookup_elem(outer_map, key2)
if (inner_map1 && inner_map2) {
timer = bpf_map_lookup_elem(inner_map1);
if (timer)
// mismatch would have been allowed
bpf_timer_init(timer, inner_map2);
}