Git Repo - linux.git/log

iwlwifi: api: annotate compressed BA notif array sizes

Annotate the compressed BA notification array sizes and
make both of them 0-length since the length of 1 is just
confusing - it may be different than that and the offset
to the second one needs to be calculated in the C code
anyhow.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: Support TKIP on gen2 data path

Make the adjustments for gen2 TX and RX of TKIP packets. Strip MIC on
RX. Don't add IV space and keep the MIC space zeroed on TX.

Devices that support gen2 data path support TKIP only in station mode.
In all other modes, fall back to SW encryption. Do this early in the
set_key() callback so that the key flags would not be incorrectly set.

Signed-off-by: David Spinadel <[email protected]>
Signed-off-by: Ilan Peer <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: pcie: read correct prph address for newer devices

Newer devices have a higher range of periphery
addresses. Currently the upper bits are masked out, so we
end up reading another address.
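
The effect of a too-narrow mask can be sketched as follows. This is an
illustrative Python model, not the driver code: the two mask widths (20 and
24 bits) and the sample address are assumptions chosen for the example.

```python
# Illustrative model only: mask widths and the sample address are assumptions.
NARROW_MASK = 0x000FFFFF  # hypothetical mask covering a 20-bit address range
WIDE_MASK = 0x00FFFFFF    # hypothetical mask covering a 24-bit address range


def masked_prph_addr(addr: int, mask: int) -> int:
    """Return the address that is actually accessed after masking."""
    return addr & mask


high_addr = 0x00A05C18  # hypothetical address above the 20-bit range

# With the narrow mask the upper bits are dropped, so a different
# (aliased) address is accessed.
aliased = masked_prph_addr(high_addr, NARROW_MASK)

# With the wide mask the address survives unchanged.
intact = masked_prph_addr(high_addr, WIDE_MASK)
```

Selecting the mask per device family keeps older devices on the narrow mask
while letting newer ones reach the higher range.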

Signed-off-by: Sara Sharon <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: enable sending HE_AIR_SNIFFER command via debugfs

In order to receive TB (Trigger Based) PPDU in monitor mode,
the Driver must send the HE_AIR_SNIFFER_CONFIG_CMD host command.
Enable that via debugfs.

Signed-off-by: Liad Kaufman <[email protected]>
Signed-off-by: Ido Yariv <[email protected]>
Signed-off-by: Shaul Triebitz <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: cleanup dead code on resume flow for non unified image.

CDB support has nothing to do with non unified image.

Signed-off-by: Haim Dreyfuss <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: fix non_shared_ant for 22000 devices

The non-shared antenna was wrong for 22000 device series.
Fix it to ANT_B for correct antenna preference by coex in MVM driver.

Fixes: e34d975e40ff ("iwlwifi: Add a000 HW family support")
Signed-off-by: Erel Geron <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: dbg: don't crash if the firmware crashes in the middle of a debug dump

We can dump data from the firmware either when it crashes,
or when the firmware is alive.
Not all the data is available if the firmware is running
(like the Tx / Rx FIFOs which are available only when the
firmware is halted), so we first check that the firmware
is alive to compute the required size for the dump and then
fill the buffer with the data.

When we allocate the buffer, we test the STATUS_FW_ERROR
bit to check if the firmware is alive or not. This bit
can be changed during the course of the dump since it is
modified in the interrupt handler.

We hit a case where we allocate the buffer while the
firmware is still working, and while we start to fill the
buffer, the firmware crashes. Then we test STATUS_FW_ERROR
again and decide to fill the buffer with data like the
FIFOs even if no room was allocated for this data in the
buffer. This means that we overflow the buffer that was
allocated leading to memory corruption.

To fix this, test the STATUS_FW_ERROR bit only once and
rely on local variables to check if we should dump fifos
or other firmware components.
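
The race and the fix can be modelled in a few lines. This is a simplified
Python sketch, not the driver code: the sizes, the function names and the
callable standing in for the STATUS_FW_ERROR test are all hypothetical.

```python
from typing import Callable, Tuple

BASE_LEN = 16  # hypothetical size of the data that is always dumped
FIFO_LEN = 64  # hypothetical size of FIFO data, only dumped once firmware halted


def dump_racy(fw_error: Callable[[], bool]) -> Tuple[int, int]:
    """Read the flag twice: once to size the buffer, once to fill it."""
    allocated = BASE_LEN + (FIFO_LEN if fw_error() else 0)
    written = BASE_LEN + (FIFO_LEN if fw_error() else 0)  # may disagree
    return allocated, written


def dump_snapshot(fw_error: Callable[[], bool]) -> Tuple[int, int]:
    """Read the flag once and reuse the local value for both steps."""
    dump_fifos = fw_error()
    allocated = BASE_LEN + (FIFO_LEN if dump_fifos else 0)
    written = BASE_LEN + (FIFO_LEN if dump_fifos else 0)
    return allocated, written


def crashing_firmware() -> Callable[[], bool]:
    """Flag source that reports 'alive' once and 'crashed' on later reads."""
    reads = iter([False])
    return lambda: next(reads, True)
```

With `crashing_firmware()` as the flag source, `dump_racy` writes more bytes
than it allocated, while `dump_snapshot` stays consistent because both steps
use the same local value.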

Fixes: 04fd2c28226f ("iwlwifi: mvm: add rxf and txf to dump data")
Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: remove ucode error tracepoint

Alexei's patch assumed that all versions of "struct iwl_error_event_table"
are the same, but there are really different versions in different files.

Rather than trying to fix this, or splitting the tracepoint, or anything of
the sort, just remove it entirely - turns out that nobody really uses it.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: report RU offset is known

We already report the RU offset, so we'd better also
report that we know the value.

Fixes: e5721e3f770f ("iwlwifi: mvm: add radiotap data for HE")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: iwlmvm: fix typo when checking for TX Beamforming

Check the actual bit (mask) in Rx notification rate_n_flags.

Signed-off-by: Shaul Triebitz <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: debug flow cleanup

Cleanup of the debug flow by moving several flows to separate
functions to increase readability. Three functions were created:

1. iwl_fw_get_prph_len - returns the size needed for periphery dump.
2. iwl_fw_dump_mem - executes the memory dumping flow.
3. iwl_trans_get_fw_monitor_len - returns the size needed for monitor dump.

Signed-off-by: Shahar S Matityahu <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: RX API: remove unnecessary anonymous struct

There's no value in having an anonymous struct for holding
a few fields, remove it.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: fw: stop and start debugging using host command

In new devices, access to periphery is forbidden. Instead, send a
host command to start and stop debugging.

Memory allocation is written in context info, but in case we
need to update it there is a dedicated command. Add definitions,
currently unused, of the new command.

Signed-off-by: Sara Sharon <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: fw: add a restart FW debug function

Move the restart FW debug code to a function. This avoids code
duplication and lays the infra to support the new start and stop
host commands in some future devices.

Signed-off-by: Sara Sharon <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

iwlwifi: mvm: fix a comment about the SP length

The SP length in the ADD_STA command is an actual number of
frames, and not the SP len as it appears in the WME IE.
Fix that comment. The actual code is fine.

Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>

ice: fix changing of ring descriptor size (ethtool -G)

rx_mini_pending was set to an incorrect value. This was causing EINVAL to
always be returned to 'ethtool -G'. The driver does not support mini or
jumbo rings so the respective settings should be zero.

Also, change the valid range of the number of descriptors in the rings to
make the code simpler and easier for users to understand (this removes the
valid settings of 8 and 16). Add a system log message indicating when the
number is rounded-up from what the user specifies with the 'ethtool -G'
command (i.e. when it is not a multiple of 32), and update the log message
when a user-provided value is out of range to also indicate the stride.
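
The rounding behaviour described above could look roughly like this. It is a
Python sketch under stated assumptions: only the stride of 32 comes from the
text; the function name and the min_desc/max_desc limits are hypothetical and
supplied by the caller.

```python
STRIDE = 32  # from the text: counts are rounded up to a multiple of 32


def round_up_descriptors(requested: int, min_desc: int, max_desc: int) -> int:
    """Validate a requested ring size and round it up to the stride.

    min_desc and max_desc are hypothetical limits supplied by the caller.
    Raises ValueError for out-of-range requests (the analogue of EINVAL).
    """
    if not min_desc <= requested <= max_desc:
        raise ValueError(
            f"descriptor count {requested} out of range "
            f"[{min_desc}-{max_desc}] (increment {STRIDE})"
        )
    rounded = -(-requested // STRIDE) * STRIDE  # ceiling to a multiple of 32
    if rounded != requested:
        # Stand-in for the system log message mentioned in the text.
        print(f"requested {requested} descriptors, rounded up to {rounded}")
    return rounded
```

The sketch assumes max_desc is itself a multiple of the stride, so rounding
up never leaves the valid range.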

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Update to capabilities admin queue command

This patch makes a couple of changes in the way the driver uses the
"get capabilities" command.

1. Get device capabilities in addition to function capabilities

2. Align to latest spec by using cap_count to determine size of the
buffer in case of length error.

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Query the Tx scheduler node before adding it

Query the Tx scheduler tree node information from FW before adding it to
the driver's software database. This will keep the node information current
in driver.

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Update comment for ice_fltr_mgmt_list_entry

Previously the comment stated that VSI lists should be used when a
second VSI becomes a subscriber to the "VLAN address". VSI lists
are always used for VLAN membership, so replace "VLAN address" with
"MAC address". Also note that VLAN(s) always use VSI list rules.

Signed-off-by: Brett Creeley <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: update fw version check logic

We have MAX_FW_API_VER_BRANCH, MAX_FW_API_VER_MAJOR, and
MAX_FW_API_VER_MINOR that we use in ice_controlq.h to test when a
firmware version is newer than expected. This is currently tested by
comparing each field separately. Thus, we compare the branch field
against the MAX_FW_API_VER_BRANCH, and so forth.

This means that currently, if we suppose that the max firmware version
is defined as 0.2.1, then firmware 0.1.3 will fail to load. This is
because the minor version 3 is greater than the max minor version 1.

This is not intuitive, because of the notion that increasing the major
firmware version to 2 should mean any firmware version with a major
version less than 2 should be considered older than 2.

In order to allow both 0.2.1 and 0.1.3 to load, you would have to define
the "max" firmware version as 0.2.3. It is possible that such
a firmware version doesn't even exist yet!

Fix this by replacing the current logic with an updated check that
behaves as follows:

First, we check the major version. If it is greater than the expected
version, then we prevent driver load. Additionally, a warning message is
logged to indicate to the system administrator that they need to update
their driver. This is now the only case where the driver will refuse to
load.

Second, if the major version is less than the expected version, we log
an information message indicating the NVM should be updated.

Third, if the major version is exact, we'll then check the minor
version. If the minor version is more than two versions less than
expected, we log an information message indicating the NVM should be
updated. If it is more than two versions greater than the expected
version, we log an information message that the driver should be
updated.

To support this, the ice_aq_ver_check function needs its signature
updated to pass the HW structure. Since we now pass this structure,
there is no need to pass the firmware API versions separately.
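
The old and new behaviour can be compared with a small model. This is a
Python sketch of the rules as described in this message, not the driver
code; the type, the function names and the message strings are hypothetical.

```python
from typing import NamedTuple, Tuple


class ApiVersion(NamedTuple):
    branch: int
    major: int
    minor: int


def old_check(fw: ApiVersion, max_ver: ApiVersion) -> bool:
    """Old behaviour: every field is compared separately."""
    return (fw.branch <= max_ver.branch
            and fw.major <= max_ver.major
            and fw.minor <= max_ver.minor)


def new_check(fw: ApiVersion, expected: ApiVersion) -> Tuple[bool, str]:
    """New behaviour: returns (may_load, log message)."""
    if fw.major > expected.major:
        # The only case in which the driver refuses to load.
        return False, "warning: firmware is newer than the driver; update the driver"
    if fw.major < expected.major:
        return True, "info: firmware is older than expected; update the NVM"
    # Major versions match: compare the minor version with a tolerance of two.
    if fw.minor + 2 < expected.minor:
        return True, "info: firmware is older than expected; update the NVM"
    if fw.minor > expected.minor + 2:
        return True, "info: firmware is newer than expected; update the driver"
    return True, ""
```

For the example above, `old_check(ApiVersion(0, 1, 3), ApiVersion(0, 2, 1))`
rejects the firmware, while `new_check` with the same arguments allows the
load and only suggests an NVM update.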

Signed-off-by: Jacob Keller <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: update branding strings and supported device ids

Update branding strings and remove device ids 0x1594 and 0x1595.

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: replace unnecessary memcpy with direct assignment

Direct assignment is preferred over a memcpy()

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: use [sr]q.count when checking if queue is initialized

When shutting down the controlqs, we check if they are initialized
before we shut them down and destroy the lock. This is important, as it
prevents attempts to access the lock of an already shutdown queue.

Unfortunately, we checked rq.head and sq.head as the value to determine
if the queue was initialized. This doesn't work, because head is not
reset when the queue is shutdown. In some flows, the adminq will have
already been shut down prior to calling ice_shutdown_all_ctrlqs. This
can result in a crash due to attempting to access the already destroyed
mutex.

Fix this by using rq.count and sq.count instead. Indeed, ice_shutdown_sq
and ice_shutdown_rq already indicate that this is the value we should be
using to determine if the queue was initialized.
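
A toy model of the problem, written in Python rather than driver code. The
class and function names are hypothetical, and it assumes that shutting a
queue down clears count while leaving head untouched.

```python
class ControlQueue:
    """Minimal stand-in for one control queue ring (names are illustrative)."""

    def __init__(self) -> None:
        self.count = 0           # number of entries; zero means "not initialized"
        self.head = 0            # register offset; not reset on shutdown
        self.lock_alive = False  # models the mutex lifetime

    def init(self, count: int, head: int) -> None:
        self.count, self.head, self.lock_alive = count, head, True

    def shutdown(self) -> None:
        if not self.lock_alive:
            raise RuntimeError("accessing an already destroyed lock")
        self.count = 0           # count is cleared (assumption of this model)
        self.lock_alive = False  # the lock is destroyed; head is left as-is


def shutdown_if_initialized(q: ControlQueue, use_count: bool) -> None:
    """Shut the queue down only if the chosen field says it is initialized."""
    initialized = q.count if use_count else q.head
    if initialized:
        q.shutdown()
```

After one `shutdown()`, checking `head` still reports the queue as
initialized and triggers a second shutdown, whereas checking `count` skips it.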

Signed-off-by: Jacob Keller <[email protected]>
Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

Bluetooth: bt3c_cs: Fix obsolete function

simple_strtol and simple_strtoul are obsolete; use kstrtouint
instead in both places.

V2: fix error tmp += tn
V3: fix compile error

Signed-off-by: Ding Xiang <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: btrsi: fix bt tx timeout issue

Sometimes data arrives from the kernel BT stack at an unaligned
address. If an unaligned address is passed, some data in the payload is
stripped when the packet is loaded to the firmware, which results in a
BT connection timeout.

sh# hciconfig hci0 up
Can't init device hci0: hci0 command 0x0c03 tx timeout

Fix this by moving the data to an aligned address.

Signed-off-by: Sanjay Kumar Konduri <[email protected]>
Signed-off-by: Siva Rebbagondla <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: L2CAP: Detect if remote is not able to use the whole MPS

If the remote is not able to fully utilize the chosen MPS, recalculate
the credits based on the actual amount it is sending, so that it can
still send packets of MTU size without credits dropping to 0.

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: L2CAP: Derive rx credits from MTU and MPS

Give enough rx credits for a full packet instead of using an arbitrary
number, which may not be enough depending on the MTU and MPS and can
cause interruptions while waiting for more credits. Also remove the
debugfs entry for l2cap_le_max_credits.

With these changes the credits are restored after each SDU is received
instead of using a fixed threshold. This way it is guaranteed that there
will always be enough credits to send a packet without waiting for more
credits to arrive.
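
One way to compute "enough credits for a full packet" is sketched below. It
is a Python illustration, not the kernel formula, which may differ: it
assumes one credit per PDU and ceiling division, and the function name is
hypothetical. The 2-byte SDU length field carried in the first PDU is
included in the total.

```python
L2CAP_SDU_LEN_SIZE = 2  # bytes of SDU length carried in the first PDU


def rx_credits_for_full_sdu(mtu: int, mps: int) -> int:
    """Credits needed to receive one SDU of MTU size, one credit per PDU."""
    total = mtu + L2CAP_SDU_LEN_SIZE
    return -(-total // mps)  # ceiling division
```

For example, with an MTU of 512 and an MPS of 247 the SDU plus its length
field spans three PDUs, so three credits are needed.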

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: L2CAP: Derive MPS from connection MTU

This ensures the MPS can fit in a single HCI fragment so each
segment doesn't have to be reassembled at HCI level. In addition,
remove the debugfs entry to configure the MPS.
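
A minimal Python sketch of such a derivation, assuming that a PDU fits in
one HCI fragment when its payload plus the 4-byte basic L2CAP header does
not exceed the HCI buffer size. The function and parameter names are
hypothetical.

```python
L2CAP_HDR_SIZE = 4  # basic L2CAP header: 2-byte length + 2-byte channel ID


def derive_mps(channel_mtu: int, hci_mtu: int) -> int:
    """Largest PDU payload that still fits in one HCI fragment."""
    return min(channel_mtu, hci_mtu - L2CAP_HDR_SIZE)
```

With an HCI buffer of 251 bytes and a channel MTU of 512, this yields an MPS
of 247; a channel MTU smaller than that is used as-is.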

Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: btbcm: Add entry for BCM4335C0 UART bluetooth

This patch adds the device ID for the AMPAK AP6335 combo module used
in the 1st generation WeTek Hub Android/LibreELEC HTPC box. The WiFi
chip identifies itself as BCM4339, while Bluetooth identifies itself
as BCM4335 (rev C0):

```
[    4.864248] Bluetooth: hci0: BCM: chip id 86
[    4.866388] Bluetooth: hci0: BCM: features 0x2f
[    4.889317] Bluetooth: hci0: BCM4335C0
[    4.889332] Bluetooth: hci0: BCM4335C0 (003.001.009) build 0000
[    9.778383] Bluetooth: hci0: BCM4335C0 (003.001.009) build 0268
```

Output from hciconfig:

```
hci0: Type: Primary  Bus: UART
BD Address: 43:39:00:00:1F:AC  ACL MTU: 1021:8  SCO MTU: 64:1
UP RUNNING
RX bytes:7567 acl:234 sco:0 events:386 errors:0
TX bytes:53844 acl:77 sco:0 commands:304 errors:0
Features: 0xbf 0xfe 0xcf 0xfe 0xdb 0xff 0x7b 0x87
Packet type: DM1 DM3 DM5 DH1 DH3 DH5 HV1 HV2 HV3
Link policy: RSWITCH SNIFF
Link mode: SLAVE ACCEPT
Name: 'HUB'
Class: 0x0c0000
Service Classes: Rendering, Capturing
Device Class: Miscellaneous,
HCI Version: 4.0 (0x6)  Revision: 0x10c
LMP Version: 4.0 (0x6)  Subversion: 0x6109
Manufacturer: Broadcom Corporation (15)
```

Signed-off-by: Christian Hewitt <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: btrtl: Add support for RTL8822C with USB interface

This device is included in the RTL8822CU combination wifi and BT part,
as well as the BT part of the RTL8822CE.
The necessary firmware has been submitted to the linux-firmware
project.

Signed-off-by: Alex Lu <[email protected]>
Signed-off-by: Larry Finger <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_serdev: Fixed error space required before open parenthesis

Fixed the "space required before the open parenthesis" error
in drivers/bluetooth/hci_serdev.c

Signed-off-by: Jagdish Tirumala <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: Add definitions and track LE resolve list modification

Add the definitions for adding entries to the LE resolve list and
removing entries from the LE resolve list. When the LE resolve list
gets changed via HCI commands make sure that the internal storage of
the resolve list entries gets updated.

Signed-off-by: Ankit Navik <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_qca: Add poweroff support during hci down for wcn3990

This patch enables power off support for hci down and power on support
for hci up. As the wcn3990 power sources are supplied by regulators, we
turn them off during hci down, i.e. a complete power off of wcn3990.
During hci up, we call the vendor setup, which turns on the regulators,
requests the BT chip version and downloads the firmware.

Signed-off-by: Balakrishna Godavarthi <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: btusb: Add quirk for BTUSB_INTEL_NEW

Intel "new" controllers can do both LE scan and BR/EDR inquiry at once.

Signed-off-by: Justin TerAvest <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: btrtl: Make array extension_sig static, shrinks object size

Don't populate the array extension_sig on the stack but instead make it
static. Makes the object code smaller by 75 bytes:

Before:
   text    data     bss     dec     hex filename
  14325    4920       0   19245    4b2d drivers/bluetooth/btrtl.o

After:
   text    data     bss     dec     hex filename
  14186    4984       0   19170    4ae2 drivers/bluetooth/btrtl.o

(gcc version 8.2.0 x86_64)

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_serdev: Add protocol check in hci_uart_dequeue().

This will help to check the status of the protocol while dequeuing an
skb packet. In some instances we end up with a kernel crash
when proto close is called while we are trying to dequeue a packet.

[  500.142902] [<ffffff80080f9ce4>] do_raw_spin_lock+0x1c/0xe0
[  500.148643] [<ffffff80088f1c7c>] _raw_spin_lock_irqsave+0x38/0x48
[  500.154917] [<ffffff8008780ce8>] skb_dequeue+0x28/0x84
[  500.160209] [<ffffff8000ad6f48>] 0xffffff8000ad6f48
[  500.165230] [<ffffff8000ad6610>] 0xffffff8000ad6610
[  500.170257] [<ffffff80080c7ce8>] process_one_work+0x238/0x3e4
[  500.176174] [<ffffff80080c8330>] worker_thread+0x2bc/0x3d4
[  500.181821] [<ffffff80080cdabc>] kthread+0x138/0x140
[  500.186945] [<ffffff80080844e0>] ret_from_fork+0x10/0x18

Signed-off-by: Balakrishna Godavarthi <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_serdev: clear HCI_UART_PROTO_READY to avoid closing proto races

Clear HCI_UART_PROTO_READY before running the proto close function
pointer, so that no other proto function pointers are used from that point
on. Otherwise there is a chance of a kernel crash due to proto function
pointers other than close being used after proto close.
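
The ordering can be illustrated with a single-threaded Python model. All
class and method names are hypothetical, and real concurrency is not
modelled; the sketch only shows the "clear the ready flag first, check it
before calling into the protocol" pattern.

```python
class UartProto:
    """Illustrative protocol object; all names here are hypothetical."""

    def __init__(self) -> None:
        self.queue = ["pkt0", "pkt1"]
        self.closed = False

    def dequeue(self) -> str:
        if self.closed:
            raise RuntimeError("proto callback used after close")
        return self.queue.pop(0)

    def close(self) -> None:
        self.queue.clear()
        self.closed = True


class Uart:
    def __init__(self) -> None:
        self.proto = UartProto()
        self.proto_ready = True  # stands in for HCI_UART_PROTO_READY

    def dequeue(self):
        # Guard: only call into the protocol while it is marked ready.
        if not self.proto_ready:
            return None
        return self.proto.dequeue()

    def unregister(self) -> None:
        self.proto_ready = False  # clear the flag first ...
        self.proto.close()        # ... then run the close callback
```

Once `unregister()` has run, `dequeue()` returns `None` instead of reaching
the closed protocol.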

Signed-off-by: Balakrishna Godavarthi <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_qca: Remove hdev dereference in qca_close().

When KASAN is enabled, we see the following crash while removing the
hci_uart module.

[   50.589909] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b73
[   50.597902] Mem abort info:
[   50.600846]   Exception class = DABT (current EL), IL = 32 bits
[   50.606959]   SET = 0, FnV = 0
[   50.610142]   EA = 0, S1PTW = 0
[   50.613396] Data abort info:
[   50.616401]   ISV = 0, ISS = 0x00000004
[   50.620373]   CM = 0, WnR = 0
[   50.623466] [006b6b6b6b6b6b73] address between user and kernel address ranges
[   50.630818] Internal error: Oops: 96000004 [#1] PREEMPT SMP

[   50.671670] PC is at qca_power_shutdown+0x28/0x100 [hci_uart]
[   50.677593] LR is at qca_close+0x74/0xb0 [hci_uart]
[   50.775689] Process rmmod (pid: 2144, stack limit = 0xffffff801ba90000)
[   50.782493] Call trace:

[   50.872150] [<ffffff8000c3c81c>] qca_power_shutdown+0x28/0x100 [hci_uart]
[   50.879138] [<ffffff8000c3c968>] qca_close+0x74/0xb0 [hci_uart]
[   50.885238] [<ffffff8000c3a71c>] hci_uart_unregister_device+0x44/0x50 [hci_uart]
[   50.892846] [<ffffff8000c3c9f4>] qca_serdev_remove+0x50/0x5c [hci_uart]
[   50.899654] [<ffffff800844f630>] serdev_drv_remove+0x28/0x38
[   50.905489] [<ffffff800850fc44>] device_release_driver_internal+0x140/0x1e4
[   50.912653] [<ffffff800850fd94>] driver_detach+0x78/0x84
[   50.918121] [<ffffff800850edac>] bus_remove_driver+0x80/0xa8
[   50.923942] [<ffffff80085107dc>] driver_unregister+0x4c/0x58
[   50.929768] [<ffffff8000c3ca8c>] qca_deinit+0x24/0x598 [hci_uart]
[   50.936045] [<ffffff8000c3ca10>] hci_uart_exit+0x10/0x48 [hci_uart]
[   50.942495] [<ffffff8008136630>] SyS_delete_module+0x17c/0x224

This crash is due to a dereference of hdev after it has been freed.

Signed-off-by: Balakrishna Godavarthi <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: hci_qca: Remove serdev_device_open/close function calls

Removed serdev_device_open/close functions from qca_open/close as
they are called in hci_uart_register_device() and
hci_uart_unregister_device() functions.

Signed-off-by: Balakrishna Godavarthi <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Bluetooth: Remove unnecessary smp_mb__{before,after}_atomic

The barriers are unneeded; wait_woken() and woken_wake_function()
already provide us with the required synchronization: remove them
and document that we're relying on the (implicit) synchronization
provided by wait_woken() and woken_wake_function().

Signed-off-by: Andrea Parri <[email protected]>
Reviewed-by: Brian Norris <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

net-ipv4: remove 2 always zero parameters from ipv4_redirect()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <[email protected]>
Signed-off-by: Maciej Żenczykowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <[email protected]>
Signed-off-by: Maciej Żenczykowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvneta: Add support for 2500Mbps SGMII

The mvneta controller can handle speeds up to 2500Mbps on the SGMII
interface. This relies on serdes configuration, the lane must be
configured at 3.125Gbps and we can't use in-band autoneg at that speed.
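
The 3.125Gbps figure follows from the line coding. As a worked example in
Python, assuming the 8b/10b encoding used on SGMII links (every 8 data bits
are sent as 10 line bits); the function name is hypothetical.

```python
def serdes_line_rate_mbaud(data_rate_mbps: float) -> float:
    """Line rate needed for a given data rate under 8b/10b encoding."""
    return data_rate_mbps * 10 / 8


rate_2500 = serdes_line_rate_mbaud(2500)  # 2500Mbps SGMII
rate_1000 = serdes_line_rate_mbaud(1000)  # regular 1000Mbps SGMII
```

The same relation gives 1250 for regular 1000Mbps SGMII and 3125 for the
2500Mbps mode.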

The main issue when supporting that speed on this particular controller
is that the link partner can send ethernet frames with a shortened
preamble, which if not explicitly enabled in the controller will cause
unexpected behaviours.

This was tested on Armada 385, with the comphy configuration done in
bootloader.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-vhost-improve-performance-when-enable-busyloop'

Tonghao Zhang says:

====================
net: vhost: improve performance when enable busyloop

These patches improve the guest receive performance.
On the handle_tx side, we poll the sock receive queue
at the same time. handle_rx does the same.

For the full performance report, see patch 4
====================

Signed-off-by: David S. Miller <[email protected]>

net: vhost: add rx busy polling in tx path

This patch improves the guest receive performance.
On the handle_tx side, we poll the sock receive queue at the
same time. handle_rx does the same.

We set poll-us=100us and use netperf to test throughput
and mean latency. When running the tests, the vhost-net kthread
of that VM is always at 100% CPU. The command is shown below.

Rx performance is greatly improved by this patch. There is no
notable performance change on tx with this series though. This
patch is useful for bi-directional traffic.

netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"

Topology:
[Host] ->linux bridge -> tap vhost-net ->[Guest]

TCP_STREAM:
* Without the patch: 19842.95 Mbps, 6.50 us mean latency
* With the patch: 37598.20 Mbps, 3.43 us mean latency

Signed-off-by: Tonghao Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vhost: factor out busy polling logic to vhost_net_busy_poll()

Factor out the generic busy polling logic, which will be
used in the tx path in the next patch. With this patch,
qemu can also set a different busyloop_timeout for the rx queue.

To avoid duplicate code, introduce the helper functions:
* sock_has_rx_data (changed from sk_has_rx_data)
* vhost_net_busy_poll_try_queue

Signed-off-by: Tonghao Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vhost: replace magic number of lock annotation

Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested.

Signed-off-by: Tonghao Zhang <[email protected]>
Acked-by: Jason Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vhost: lock the vqs one by one

This patch changes the locking of the vqs from locking them
all at the same time to locking them one by one. It will
be used by the next patch to avoid a deadlock.

Signed-off-by: Tonghao Zhang <[email protected]>
Acked-by: Jason Wang <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: expose sk_state in tcp_retransmit_skb tracepoint

With sk_state exposed, we can tell in which state a retransmission
occurs, which gives us more detail for diagnostics.
For example, if the retransmission occurs in SYN_SENT state, it may
indicate that the SYN packet was dropped on the remote peer because its
SYN backlog queue was full, and we can then check the remote peer.

BTW, SYNACK retransmission is traced in the tcp_retransmit_synack tracepoint.

Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: faraday: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver returns a 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: smsc: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver returns a 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: liquidio: list usage cleanup

Trivial cleanup: list_move_tail() does the same thing as list_del()
followed by list_add_tail(), so just replace them.
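
The equivalence can be seen in a small Python model of a circular doubly
linked list. This is an illustration patterned after struct list_head, not
the kernel implementation; the Node class and the names() helper are
hypothetical.

```python
class Node:
    """Circular doubly linked list node, modelled on struct list_head."""

    def __init__(self, name: str = "head") -> None:
        self.name = name
        self.prev = self
        self.next = self


def list_del(entry: Node) -> None:
    """Unlink entry from whatever list it is on."""
    entry.prev.next = entry.next
    entry.next.prev = entry.prev


def list_add_tail(new: Node, head: Node) -> None:
    """Insert new just before head, i.e. at the tail of the list."""
    new.prev, new.next = head.prev, head
    head.prev.next = new
    head.prev = new


def list_move_tail(entry: Node, head: Node) -> None:
    """Model of the combined operation: unlink, then append to head's list."""
    list_del(entry)
    list_add_tail(entry, head)


def names(head: Node) -> list:
    """Collect entry names in list order, for inspection."""
    out, node = [], head.next
    while node is not head:
        out.append(node.name)
        node = node.next
    return out
```

Moving an entry with `list_move_tail(entry, dst)` leaves both lists in the
same state as calling `list_del(entry)` followed by `list_add_tail(entry, dst)`.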

Signed-off-by: zhong jiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: qed: list usage cleanup

Trivial cleanup: list_move_tail() does the same thing as list_del()
followed by list_add_tail(), so just replace them.

Signed-off-by: zhong jiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-bridge-convert-bool-options-to-bits'

Nikolay Aleksandrov says:

====================
net: bridge: convert bool options to bits

A lot of boolean bridge options have been added around the net_bridge
structure resulting in holes and more importantly different cache lines
that need to be fetched in the fast path. This set moves all of those
to bits in a bitfield which resides in a hot cache line thus reducing
the size of net_bridge, the number of holes and the number of cache
lines needed for the fast path.
The set is also sent in preparation for new boolean options to avoid
spreading them in the structure and making new holes.
One nice side-effect is that we avoid potential race conditions by using
the bitops since some of the options were bits being directly set in
parallel risking hard to debug issues (has_ipv6_addr).

Before:
size: 1184, holes: 8, sum holes: 30
After:
size: 1160, holes: 3, sum holes: 7

Patch 01 is a trivial style fix
Patch 02 adds the new options bitfield and converts the vlan boolean
options to bits
Patches 03-08 convert the rest of the boolean options to bits
Patch 09 re-arranges a few fields in net_bridge to further reduce size

v2: patch 09: remove the comment about offload_fwd_mark in net_bridge and
leave it where it is now, thanks to Ido for spotting it
====================

Signed-off-by: David S. Miller <[email protected]>

net: bridge: pack net_bridge better

Further reduce the size of net_bridge by 8 bytes and reduce the number of
holes in it:
Before: holes: 5, sum holes: 15
After: holes: 3, sum holes: 7

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert mtu_set_by_user to a bit

Convert the last remaining bool option to a bit thus reducing the overall
net_bridge size further by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert neigh_suppress_enabled option to a bit

Convert the neigh_suppress_enabled option to a bit.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert mcast options to bits

This patch converts the rest of the mcast options to bits. It also packs
the mcast options a little better by moving multicast_mld_version to an
existing hole, reducing the net_bridge size by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert and rename mcast disabled

Convert mcast disabled to an option bit and while doing so convert the
logic to check if multicast is enabled instead. That is make the logic
follow the option value - if it's set then mcast is enabled and vice versa.
This avoids a few confusing places where we inverted the value that's being
set to follow the mcast_disabled logic.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert group_addr_set option to a bit

Convert group_addr_set internal bridge opt to a bit.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: convert nf call options to bits

No functional change, conversion of nf_call_[ip|ip6|arp]tables to bits.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: add bitfield for options and convert vlan opts

Bridge options have usually been added as separate fields all over the
net_bridge struct taking up space and ending up in different cache lines.
Let's move them to a single bitfield to save up space and speedup lookups.
This patch adds a simple API for option modifying and retrieving using
bitops and converts the first user of the API - the bridge vlan options
(vlan_enabled and vlan_stats_enabled).
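
As a rough illustration of the idea described above, a single options word
with get/toggle helpers could look like the userspace sketch below. The
names bridge_sketch, opt_bit, opt_get() and opt_toggle() are hypothetical,
and plain shifts stand in for the kernel bitops the patch refers to.

```c
#include <stdbool.h>

/* Hypothetical option indices, named after the options this patch converts. */
enum opt_bit {
	OPT_VLAN_ENABLED,
	OPT_VLAN_STATS_ENABLED,
};

/* Stand-in for the bridge struct: all boolean options share one word. */
struct bridge_sketch {
	unsigned long options;
};

/* Return true if the given option bit is set. */
static inline bool opt_get(const struct bridge_sketch *br, enum opt_bit opt)
{
	return (br->options >> opt) & 1UL;
}

/* Set the option bit when 'on' is true, clear it otherwise. */
static inline void opt_toggle(struct bridge_sketch *br, enum opt_bit opt,
			      bool on)
{
	if (on)
		br->options |= 1UL << opt;
	else
		br->options &= ~(1UL << opt);
}
```

With this layout every option lives in the same word, so checking several
options touches one cache line instead of several scattered fields.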

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: make struct opening bracket consistent

Currently we have a mix of opening brackets on new lines and on the same
line, let's move them all on the same line.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 's390-net-next'

Julian Wiedmann says:

====================
s390/net: updates 2018-09-26

please apply one more series of cleanups and small improvements for qeth
to net-next. Note that one patch needs to touch both af_iucv and qeth, in
order to untangle their receive paths.
====================

Signed-off-by: David S. Miller <[email protected]>

s390/qeth: remove duplicated carrier state tracking

The netdevice is always available, apply any carrier state changes to it
without caching them.
On a STARTLAN event (ie. carrier-up), defer updating the state to
qeth_core_hardsetup_card() in the subsequent recovery action.

Also remove the carrier-state checks from the xmit routines. Stopping
transmission on carrier-down is the responsibility of upper-level code
(eg see dev_direct_xmit()).

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: clean up drop conditions for received cmds

If qeth_check_ipa_data() consumed an event, there's no point in
processing it further. So drop it early, and make the surrounding code
a tiny bit more readable.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: re-indent qeth_check_ipa_data()

Pull one level of checking up into qeth_send_control_data_cb(), and
clean up an else-after-return. No functional change.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: consume local address events

We have no code that is waiting for these events, so just drop them when
they arrive.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: remove various redundant code

1. tracing iob->rc makes no sense when it hasn't been modified by the
   callback,
2. the qeth_dbf_list is declared with LIST_HEAD, which also initializes
   the list,
3. the ccwgroup core only calls the thaw/restore callbacks if the gdev
   is online, so we don't have to check for it again.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: remove CARD_FROM_CDEV helper

The cdev-to-card translation walks through two layers of drvdata,
with no locking or refcounting (where eg. the ccwgroup core only
accesses a cdev's drvdata while holding the ccwlock).

This might be safe for now, but any careless usage of the helper has the
potential for subtle races and use-after-free's. Luckily there's only
one occurrence where we _really_ need it (in qeth_irq()), for any other
user we can just pass through an appropriate card pointer.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: pass card pointer in iob callback

This allows us to remove the CARD_FROM_CDEV calls in the iob callbacks.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: re-use qeth_notify_skbs()

When not using the CQ, this allows us to avoid the second skb queue walk
in qeth_release_skbs().

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: remove additional skb refcount

This was presumably left over from back when qeth recursed into
dev_queue_xmit().

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: replace open-coded skb_queue_walk()

To match the use of __skb_queue_purge(), also make the skb's enqueue in
qeth_fill_buffer() lockless.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/af_iucv: locate IUCV header via skb_network_header()

This patch attempts to untangle the TX and RX code in qeth from
af_iucv's respective HiperTransport path:
On the TX side, pointing skb_network_header() at the IUCV header
means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset
to access the header.
On the RX side, qeth pulls the (fake) L2 header off the skb like any
normal ethernet driver would. This makes working with the IUCV header
in af_iucv easier, since we no longer have to assume a fixed skb layout.

While at it, replace the open-coded length checks in af_iucv's RX path
with pskb_may_pull().

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: on gdev release, reset drvdata

qeth_core_probe_device() sets the gdev's drvdata, but doesn't reset it
on a subsequent error. Move the (re-)setting around a bit, so that it
happens symmetrically on allocating/freeing the qeth_card struct.

This is no actual problem, as the ccwgroup core will discard the gdev
on a probe error. But from qeth's perspective the gdev is an external
resource, so it's best to manage it cleanly.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: fix discipline unload after setup error

Device initialization code usually first loads a subdriver
(via qeth_core_load_discipline()), and then runs its setup() callback.
If this fails, it rolls back the load via qeth_core_free_discipline().

qeth_core_free_discipline() expects the options.layer attribute to be
initialized, but on error in setup() that's currently not the case,
resulting in misbalanced symbol_put() calls.

Fix this by setting options.layer when loading the subdriver.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: use DEFINE_MUTEX for qeth_mod_mutex

Consolidate declaration and initialization of a static variable.
While at it reduce its scope in qeth_core_load_discipline(), and simplify
the return logic accordingly.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: convert layer attribute to enum

While the raw values are fixed due to their use in a sysfs attribute,
we can still use the proper QETH_DISCIPLINE_* enum within the driver.

Also move the initialization into qeth_set_initial_options(), along with
all other user-configurable fields.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: marvell: Fix build.

Local variable 'autoneg' doesn't even exist:

drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in this function); did you mean 'put_net'?
if (phydev->autoneg != autoneg || changed) {
^~~~~~~

Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset")
Reported-by: Vakul Garg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER

Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Roopa Prabhu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-09-25

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow for RX stack hardening by implementing the kernel's flow
   dissector in BPF. Idea was originally presented at netconf 2017 [0].
   Quote from merge commit:

     [...] Because of the rigorous checks of the BPF verifier, this
     provides significant security guarantees. In particular, the BPF
     flow dissector cannot get inside of an infinite loop, as with
     CVE-2013-4348, because BPF programs are guaranteed to terminate.
     It cannot read outside of packet bounds, because all memory accesses
     are checked. Also, with BPF the administrator can decide which
     protocols to support, reducing potential attack surface. Rarely
     encountered protocols can be excluded from dissection and the
     program can be updated without kernel recompile or reboot if a
     bug is discovered. [...]

   Also, a sample flow dissector has been implemented in BPF as part
   of this work, from Petar and Willem.

   [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

2) Add support for bpftool to list currently active attachment
   points of BPF networking programs providing a quick overview
   similar to bpftool's perf subcommand, from Yonghong.

3) Fix a verifier pruning instability bug where a union member
   from the register state was not cleared properly leading to
   branches not being pruned despite them being valid candidates,
   from Alexei.

4) Various smaller fast-path optimizations in XDP's map redirect
   code, from Jesper.

5) Enable bpftool to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
   maps, from Roman.

6) Remove a duplicate check in libbpf that probes for function
   storage, from Taeung.

7) Fix an issue in test_progs by not checking errno, since
   on success its value should not be checked, from Mauricio.

8) Fix unused variable warning in bpf_getsockopt() helper when
   CONFIG_INET is not configured, from Anders.

9) Fix a compilation failure in the BPF sample code's use of
   bpf_flow_keys, from Prashant.

10) Minor cleanups in BPF code, from Yue and Zhong.
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: lantiq_gswip: Depend on HAS_IOMEM

The driver uses devm_ioremap_resource() which is only available when
CONFIG_HAS_IOMEM is set, make the driver depend on this config option.
User mode Linux does not have CONFIG_HAS_IOMEM set and the driver was
failing on this architecture.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Hauke Mehrtens <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-phy-Eliminate-unnecessary-soft'

Florian Fainelli says:

====================
net: phy: Eliminate unnecessary soft

This patch series eliminates unnecessary software resets of the PHY.
This should hopefully not break anybody's hardware; but I would
appreciate testing to make sure this is the case.

Sorry for this long email list, I wanted to make sure I reached out to
all people who made changes to the Marvell PHY driver.

Thank you!

Changes since RFT:

- added Tested-by tags from Wang, Dongsheng, Andrew, Chris and Clemens
====================

Signed-off-by: David S. Miller <[email protected]>

net: phy: marvell: Avoid unnecessary soft reset

The BMCR.RESET bit on the Marvell PHYs has a special meaning in that
it commits the register writes into the HW for it to latch and be
configured appropriately. Doing software resets causes link drops, and
this is unnecessary disruption if nothing changed.

Determine from marvell_set_polarity()'s return code whether the register value
was changed and if it was, propagate that to the logic that hits the software
reset bit.

This avoids doing an unnecessary soft reset if the PHY is configured in
the same state it was previously. This also eliminates the need for a
m88e1111_config_aneg() function since it now is the same as
marvell_config_aneg().
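
The "reset only if something changed" flow described above can be sketched
in plain C as follows. This is a userspace model, not the driver code:
phy_sketch, set_polarity() and config_aneg() are hypothetical names, and a
counter stands in for writing the BMCR.RESET bit.

```c
#include <stdint.h>

/* Hypothetical model of a PHY: one config register and a reset counter. */
struct phy_sketch {
	uint16_t polarity_reg;
	unsigned int soft_resets;
};

/*
 * Write the polarity setting. Returns 1 if the register value changed,
 * 0 if it already held 'val'. A real accessor could also return a
 * negative error code, which the caller below passes through.
 */
static int set_polarity(struct phy_sketch *phy, uint16_t val)
{
	if (phy->polarity_reg == val)
		return 0;
	phy->polarity_reg = val;
	return 1;
}

/* Configure the PHY and soft-reset only when the register was modified. */
static int config_aneg(struct phy_sketch *phy, uint16_t polarity)
{
	int changed = set_polarity(phy, polarity);

	if (changed < 0)
		return changed;
	if (changed)
		phy->soft_resets++;	/* stands in for setting BMCR.RESET */
	return 0;
}
```

Calling config_aneg() twice with the same polarity value triggers the
modelled reset only on the first call.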

Tested-by: Wang, Dongsheng <[email protected]>
Tested-by: Chris Healy <[email protected]>
Tested-by: Andrew Lunn <[email protected]>
Tested-by: Clemens Gruber <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Stop with excessive soft reset

While consolidating the PHY reset in phy_init_hw() into an unconditional
BMCR soft-reset, I became quite trigger happy with those. This was later
on deactivated for the Generic PHY driver on the premise that a prior
software entity (e.g: bootloader) might have applied workarounds in
commit 0878fff1f42c ("net: phy: Do not perform software reset for
Generic PHY").

Since we have a hook to wire-up a soft_reset callback, just use that and
get rid of the call to genphy_soft_reset() entirely. This speeds up
initialization and link establishment for most PHYs out there that do
not require a reset.

Fixes: 87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()")
Tested-by: Wang, Dongsheng <[email protected]>
Tested-by: Chris Healy <[email protected]>
Tested-by: Andrew Lunn <[email protected]>
Tested-by: Clemens Gruber <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-09-25

This series contains updates to i40e and xsk.

Mariusz fixes an issue where the VF link state was not being updated
properly when the PF is down or up.  Also cleaned up the promiscuous
configuration during a VF reset.

Patryk simplifies the code a bit to use the variables for PF and HW that
are declared, rather than using the VSI pointers.  Cleaned up the
message length parameter to several virtchnl functions, since it was not
being used (or needed).

Harshitha fixes two potential race conditions when trying to change VF
settings by creating a helper function to validate that the VF is
enabled and that the VSI is set up.

Sergey corrects a double "link down" message by putting in a check for
whether or not the link is up or going down.

Björn addresses an AF_XDP zero-copy issue where buffers passed
from userspace to the kernel were leaked when the hardware descriptor
ring was torn down.  A zero-copy capable driver picks buffers off the
fill ring and places them on the hardware receive ring to be completed at
a later point when DMA is complete. Similarly on the transmit side, the
driver picks buffers off the transmit ring and places them on the
transmit hardware ring.

In the typical flow, the receive buffer will be placed onto a receive
ring (completed to the user), and the transmit buffer will be placed on
the completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers cannot be leaked. They have to be reused or completed back to
userspace.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Refactor-classifier-API-to-work-with-Qdisc-blocks-without-rtnl-lock'

Vlad Buslov says:

====================
Refactor classifier API to work with Qdisc/blocks without rtnl lock

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
classifiers API that prevent its control path from being executed
concurrently. Additional changes are required to refactor classifiers
API and individual classifiers for parallel execution. This patch set
lays groundwork to eventually register rule update handlers as
rtnl-unlocked by modifying code in cls API that works with Qdiscs and
blocks. Following patch set does the same for chains and classifiers.

The goal of this change is to refactor tcf_block_find() and its
dependencies to allow concurrent execution:
- Extend Qdisc API with rcu to lookup and take reference to Qdisc
  without relying on rtnl lock.
- Extend tcf_block with atomic reference counting and rcu.
- Always take reference to tcf_block while working with it.
- Implement tcf_block_release() to release resources obtained by
  tcf_block_find()
- Create infrastructure to allow registering Qdiscs with class ops that
  do not require the caller to hold rtnl lock.

All three netlink rule update handlers use tcf_block_find() to lookup
Qdisc and block, and this patch set introduces additional means of
synchronization to substitute rtnl lock in cls API.

Some functions in cls and sch APIs have historic names that no longer
clearly describe their intent. In order not to make this code even more
confusing when introducing their concurrency-friendly versions, rename
these functions to describe actual implementation.

Changes from V2 to V3:
- Patch 1:
  - Explicitly include refcount.h in rtnetlink.h.
- Patch 3:
  - Move rcu_head field to the end of struct Qdisc.
  - Rearrange local variable declarations in qdisc_lookup_rcu().
- Patch 5:
  - Remove tcf_qdisc_put() and inline its content to callers.

Changes from V1 to V2:
- Rebase on latest net-next.
- Patch 8 - remove.
- Patch 9 - fold into patch 11.
- Patch 11:
  - Rename tcf_block_{get|put}() to tcf_block_refcnt_{get|put}().
- Patch 13 - remove.
====================

Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: use reference counting for tcf blocks on rules update

In order to remove dependency on rtnl lock on rules update path, always
take reference to block while using it on rules update path. Change
tcf_block_get() error handling to properly release block with reference
counting, instead of just destroying it, in order to accommodate potential
concurrent users.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: implement tcf_block_refcnt_{get|put}()

Implement get/put function for blocks that only take/release the reference
and perform deallocation. These functions are intended to be used by
unlocked rules update path to always hold reference to block while working
with it. They rely on new fine-grained locking mechanisms introduced in
previous patches in this set, instead of relying on global protection
provided by rtnl lock.

Extract code that is common with tcf_block_detach_ext() into common
function __tcf_block_put().

Extend tcf_block with rcu to allow safe deallocation when it is accessed
concurrently.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: protect block idr with spinlock

Protect block idr access with spinlock, instead of relying on rtnl lock.
Take tn->idr_lock spinlock during block insertion and removal.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: implement functions to put and flush all chains

Extract code that flushes and puts all chains on tcf block into two
standalone functions to be shared with functions that locklessly get/put
reference to block.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: change tcf block reference counter type to refcount_t

As a preparation for removing rtnl lock dependency from rules update path,
change tcf block reference counter type to refcount_t to allow modification
by concurrent users.

In block put function perform decrement and check reference counter once to
accommodate concurrent modification by unlocked users. After this change
tcf_chain_put at the end of block put function is called with
block->refcnt==0 and will deallocate block after the last chain is
released, so there is no need to manually deallocate block in this case.
However, if block reference counter reached 0 and there are no chains to
release, block must still be deallocated manually.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: use Qdisc rcu API instead of relying on rtnl lock

As a preparation for removing rtnl lock dependency from rules update path,
use Qdisc rcu and reference counting capabilities instead of relying on
rtnl lock while working with Qdiscs. Create new tcf_block_release()
function, and use it to free resources taken by tcf_block_find().
Currently, this function only releases Qdisc and it is extended in next
patches in this series.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: add helper function to take reference to Qdisc

Implement function to take reference to Qdisc that relies on rcu read lock
instead of rtnl mutex. Function only takes reference to Qdisc if reference
counter isn't zero. Intended to be used by unlocked cls API.
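
The "take a reference only if the counter is not zero" rule can be sketched
with C11 atomics as below. This is a userspace model: obj_sketch and
ref_get_unless_zero() are hypothetical names, and the rcu read-side
protection that keeps the object's memory valid during the attempt is
assumed to be provided by the caller and is not modelled here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct obj_sketch {
	atomic_uint refcnt;
};

/*
 * Try to take a reference. Succeeds only while the counter is non-zero,
 * so an object whose last reference is already gone is never revived.
 */
static bool ref_get_unless_zero(struct obj_sketch *o)
{
	unsigned int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller that gets false back must treat the object as already being
destroyed and must not touch it further.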

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: extend Qdisc with rcu

Currently, Qdisc API functions assume that users have rtnl lock taken. To
implement rtnl unlocked classifiers update interface, Qdisc API must be
extended with functions that do not require rtnl lock.

Extend Qdisc structure with rcu. Implement special version of put function
qdisc_put_unlocked() that is called without rtnl lock taken. This function
only takes rtnl lock if Qdisc reference counter reached zero and is
intended to be used as optimization.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: rename qdisc_destroy() to qdisc_put()

Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroys Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function is currently implemented and used.

Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: core: netlink: add helper refcount dec and lock function

Rtnl lock is encapsulated in netlink and cannot be accessed by other
modules directly. This means that reference counted objects that rely on
rtnl lock cannot use it with refcounter helper function that atomically
decrements reference and obtains mutex.

This patch implements simple wrapper function around refcount_dec_and_lock
that obtains rtnl lock if reference counter value reached 0.
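
The semantics of such a "decrement and lock" helper can be sketched in
userspace with C11 atomics and a pthread mutex. The names big_lock and
ref_dec_and_lock() are hypothetical; big_lock stands in for the rtnl lock,
and the sketch assumes the caller holds a reference (the counter is at
least 1 on entry).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the global lock that protects object teardown. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drop one reference. Returns true with big_lock held if this was the
 * last reference; returns false without the lock otherwise.
 */
static bool ref_dec_and_lock(atomic_uint *refcnt)
{
	unsigned int old = atomic_load(refcnt);

	/* Fast path: not the last reference, no lock needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(refcnt, &old, old - 1))
			return false;
	}

	/* Slow path: take the lock so the drop to zero happens under it. */
	pthread_mutex_lock(&big_lock);
	if (atomic_fetch_sub(refcnt, 1) != 1) {
		/* Someone else took a reference in the meantime. */
		pthread_mutex_unlock(&big_lock);
		return false;
	}
	return true;
}
```

When the helper returns true, the caller tears the object down and then
releases the lock itself.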

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

i40e: disallow changing the number of descriptors when AF_XDP is on

When an AF_XDP UMEM is attached to any of the Rx rings, we disallow a
user to change the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.

Signed-off-by: Björn Töpel <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

i40e: clean zero-copy XDP Rx ring on shutdown/reset

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings come up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast, and a slow
allocation. The "fast allocation" is used from the fast-path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).
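
The split between the two allocators can be modelled roughly as below.
This is a userspace sketch, not the i40e code: pool, rx_ring_sketch,
alloc_fast() and alloc_slow() are hypothetical names, buffers are plain
numbers, and the order in which each allocator tries its two sources is
an assumption that the text above does not specify.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_CAP 8

/* Tiny LIFO pool of buffer handles, used for all three sources. */
struct pool {
	unsigned long bufs[POOL_CAP];
	size_t n;
};

static bool pool_push(struct pool *p, unsigned long buf)
{
	if (p->n == POOL_CAP)
		return false;
	p->bufs[p->n++] = buf;
	return true;
}

static bool pool_pop(struct pool *p, unsigned long *buf)
{
	if (p->n == 0)
		return false;
	*buf = p->bufs[--p->n];
	return true;
}

/* Hypothetical Rx ring state with the three buffer sources named above. */
struct rx_ring_sketch {
	struct pool fill;	/* buffers handed over by userspace */
	struct pool recycle;	/* internal fast-path recycle mechanism */
	struct pool stash;	/* buffers saved across a ring teardown */
};

/* Fast path: recycled buffers first (assumed order), then the fill ring. */
static bool alloc_fast(struct rx_ring_sketch *r, unsigned long *buf)
{
	return pool_pop(&r->recycle, buf) || pool_pop(&r->fill, buf);
}

/* Ring setup only: stashed buffers first (assumed order), then fill ring. */
static bool alloc_slow(struct rx_ring_sketch *r, unsigned long *buf)
{
	return pool_pop(&r->stash, buf) || pool_pop(&r->fill, buf);
}
```

Draining the stash during ring setup is what returns previously
outstanding buffers to circulation instead of leaking them.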

Signed-off-by: Björn Töpel <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>