Git Repo - linux.git/log

Alex Vesker [Thu, 12 Jul 2018 12:13:17 +0000 (15:13 +0300)]

devlink: Add generic parameters region_snapshot

region_snapshot - When set enables capturing region snapshots

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:16 +0000 (15:13 +0300)]

net/mlx4_core: Add Crdump FW snapshot support

Crdump allows the driver to create a snapshot of the FW PCI
crspace and health buffer during a critical FW issue.
In case of a FW command timeout, FW getting stuck or a non zero
value on the catastrophic buffer, a snapshot will be taken.

The snapshot is exposed using devlink. The cr-space and fw-health
address regions are registered on init, and snapshots are attached
once a new snapshot is collected by the driver.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:15 +0000 (15:13 +0300)]

net/mlx4_core: Add health buffer address capability

Health buffer address is a 32 bit PCI address offset provided by
the FW. This offset is used for reading FW health debug data
located on the shared CR space. CR space is accessible to both the
driver and the FW and allows for different queries and configurations.
Health buffer size is always 64B of readable data followed by a
lock which is used to block volatile CR space access.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:14 +0000 (15:13 +0300)]

devlink: Add support for region snapshot read command

Add support for DEVLINK_CMD_REGION_READ_GET used for both reading
and dumping region data. Read allows reading from a region specific
address for given length. Dump allows reading the full region.
If only snapshot ID is provided a snapshot dump will be done.
If snapshot ID, Address and Length are provided a snapshot read
will be done.

This is used for snapshot access and will be used in the same
way to access current data on the region.
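
The read/dump distinction described above can be sketched in userspace C.
This is only an illustration under stated assumptions: struct demo_snapshot,
demo_region_read() and the -1 error return are hypothetical and are not the
devlink implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot: an ID plus a copy of the region contents. */
struct demo_snapshot {
	uint32_t id;
	const uint8_t *data;
	size_t size;		/* region size in bytes */
};

/*
 * Copy snapshot bytes into out (capacity out_cap).
 * has_range == 0: dump, i.e. read the full region.
 * has_range != 0: read len bytes starting at addr.
 * Returns the number of bytes copied, or -1 if the request does not fit.
 */
static long demo_region_read(const struct demo_snapshot *snap, int has_range,
			     size_t addr, size_t len,
			     uint8_t *out, size_t out_cap)
{
	assert(snap != NULL && out != NULL);
	if (!has_range) {
		addr = 0;
		len = snap->size;
	}
	/* Bounds check written so that addr + len cannot overflow. */
	if (addr > snap->size || len > snap->size - addr || len > out_cap)
		return -1;
	memcpy(out, snap->data + addr, len);
	return (long)len;
}
```

Treating a dump as a read of the whole region keeps a single bounds check
for both cases.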

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:13 +0000 (15:13 +0300)]

devlink: Add support for region snapshot delete command

Add support for DEVLINK_CMD_REGION_DEL used
for deleting a snapshot from a region. The snapshot ID is required.
Also added notification support for NEW and DEL of snapshots.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:12 +0000 (15:13 +0300)]

devlink: Extend the support querying for region snapshot IDs

Extend the support for DEVLINK_CMD_REGION_GET command to also
return the IDs of the snapshots currently present on the region.
Each reply will include a nested snapshots attribute that
can contain multiple snapshot attributes each with an ID.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:11 +0000 (15:13 +0300)]

devlink: Add support for region get command

Add support for DEVLINK_CMD_REGION_GET command which is used for
querying for the supported DEV/REGION values of devlink devices.
The support is both for doit and dumpit.

Reply includes:
BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:10 +0000 (15:13 +0300)]

devlink: Add support for creating region snapshots

Each device address region can store multiple snapshots,
each snapshot is identified using a different numerical ID.
This ID is used when deleting a snapshot or showing an address
region specific snapshot. This patch exposes a callback to add
a new snapshot to an address region.
The snapshot will be deleted using the destructor function
when destroying a region or when a snapshot delete command
is received from the devlink user tool.
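
The snapshot lifetime described above (create with an ID, delete by ID,
delete everything when the region goes away) can be sketched in userspace C.
All names here (demo_region, demo_snapshot_create, and so on) are
hypothetical; the singly linked list is an assumption made for brevity and
is not how devlink stores snapshots.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*demo_destructor_t)(void *data);

struct demo_snap {
	uint32_t id;
	void *data;
	demo_destructor_t destructor;	/* frees data, supplied by the driver */
	struct demo_snap *next;
};

struct demo_region {
	struct demo_snap *snaps;	/* snapshots attached to this region */
};

/* Attach a new snapshot; fails if the ID is already used in this region. */
static int demo_snapshot_create(struct demo_region *r, uint32_t id,
				void *data, demo_destructor_t destructor)
{
	struct demo_snap *s;

	assert(r != NULL);
	for (s = r->snaps; s; s = s->next)
		if (s->id == id)
			return -1;
	s = malloc(sizeof(*s));
	if (!s)
		return -1;
	s->id = id;
	s->data = data;
	s->destructor = destructor;
	s->next = r->snaps;
	r->snaps = s;
	return 0;
}

/* Delete one snapshot by ID, running its destructor on the data. */
static int demo_snapshot_del(struct demo_region *r, uint32_t id)
{
	struct demo_snap **pp;

	assert(r != NULL);
	for (pp = &r->snaps; *pp; pp = &(*pp)->next) {
		struct demo_snap *s = *pp;

		if (s->id != id)
			continue;
		*pp = s->next;
		if (s->destructor)
			s->destructor(s->data);
		free(s);
		return 0;
	}
	return -1;
}

/* Destroying the region deletes every remaining snapshot. */
static void demo_region_destroy(struct demo_region *r)
{
	assert(r != NULL);
	while (r->snaps)
		demo_snapshot_del(r, r->snaps->id);
}
```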

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:09 +0000 (15:13 +0300)]

devlink: Add callback to query for snapshot id before snapshot create

To restrict the driver's snapshot ID selection, a new callback
is introduced for the driver to get the snapshot ID before creating
a new snapshot. This will also allow giving the same ID for multiple
snapshots taken of different regions at the same time.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Alex Vesker [Thu, 12 Jul 2018 12:13:08 +0000 (15:13 +0300)]

devlink: Add support for creating and destroying regions

This allows a device to register its supported address regions.
Each address region can be accessed directly, for example by reading
the snapshots taken of this address space.
Drivers are not limited in the name selection for different regions.
An example of a region-name can be: pci cr-space, register-space.

Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

David S. Miller [Fri, 13 Jul 2018 00:30:49 +0000 (17:30 -0700)]

Merge branch 'mvpp2-add-RSS-support'

Maxime Chevallier says:

====================
net: mvpp2: add RSS support

This series adds support for RSS on PPv2. There already was some code to
handle the RSS tables, but the driver was missing all the classification
steps required to actually use these tables.

RSS is used through the classifier, using at least 2 lookups :
- One using the C2 engine, a TCAM engine that matches the packet based on
   some header extracted fields, assigns the default rx queue for that
   packet and tags it for RSS
- One using the C3Hx engine, which computes the hash that's used to perform
   the lookup in the RSS table.

Since RSS spreads the load across CPUs, we need to make sure that packets
from the same flow are always assigned the same rx queue, to prevent
re-ordering.
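
The ordering requirement can be illustrated with a small userspace sketch:
if the queue is chosen purely from a hash of the flow's header fields, two
packets of the same flow always land on the same rx queue. The FNV-1a hash
below is a stand-in chosen for this example and is not the hash computed by
the hardware; struct demo_flow and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RSS_ENTRIES 32	/* table size assumed for this sketch */

struct demo_flow {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
};

/* FNV-1a over the four bytes of v; a stand-in for the hardware hash. */
static uint32_t demo_mix(uint32_t h, uint32_t v)
{
	int i;

	for (i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

static uint32_t demo_flow_hash(const struct demo_flow *f)
{
	uint32_t h = 2166136261u;

	assert(f != NULL);
	h = demo_mix(h, f->src_ip);
	h = demo_mix(h, f->dst_ip);
	h = demo_mix(h, ((uint32_t)f->src_port << 16) | f->dst_port);
	return h;
}

/* The hash indexes the RSS table, whose entries are rx queue numbers. */
static unsigned int demo_rss_select_rxq(const uint8_t *table,
					const struct demo_flow *f)
{
	assert(table != NULL);
	return table[demo_flow_hash(f) % DEMO_RSS_ENTRIES];
}
```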

This series therefore adds a classification step based on the Header Parser,
that separates ingress traffic into 52 flows, based on some L2, L3 and L4
parameters.

Patches 1 and 2 fix some header issues, from the driver splitting

Patches 3 to 7 make sure the correct receive queue setup is used for RSS

Patches 8 to 14 deal with the way we handle the RSS tables

Patch 15 implements basic classifier configuration, by using it to assign the
default receive queue

Patch 16 implements the ingress traffic splitting into multiple flows

Patch 17 adds RSS support, by using the needed classification steps

Patch 18 adds the required ethtool ops to configure the flow hash parameters

This was tested on MacchiatoBin, giving some nice performance improvements
using ip forwarding (going from 5Gbps to 9.6Gbps total throughput).

RSS is disabled by default.
====================

Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:27 +0000 (13:54 +0200)]

net: mvpp2: allow setting RSS flow hash parameters with ethtool

This commit allows setting the RSS hash generation parameters from
ethtool. When setting parameters for a given flow type from ethtool
(e.g. tcp4), all the corresponding flows in the flow table are updated,
according to the supported hash parameters.

For example, when configuring TCP over IPv4 hash parameters to be
src/dst IP + src/dst port ("ethtool -N eth0 rx-flow-hash tcp4 sdfn"),
we only set the "src/dst port" hash parameters on the non-fragmented TCP
over IPv4 flows.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:26 +0000 (13:54 +0200)]

net: mvpp2: add an RSS classification step for each flow

One of the classification actions that can be performed is to compute a
hash of the packet header based on some header fields, and look up an RSS
table based on this hash to determine the final RxQ.

This is done by adding one lookup entry per flow per port, so that we
can configure the hash generation parameters for each flow and each
port.

There are 2 possible engines that can be used for RSS hash generation :

- C3HA, that generates a hash based on up to 4 header-extracted fields
- C3HB, that does the same as C3HA, but also includes L4 info in the hash

There are a lot of fields that can be extracted from the header. For now,
we only use the ones that we can configure using ethtool :
- DST MAC address
- L3 info
- Source IP
- Destination IP
- Source port
- Destination port

The C3HB engine is selected when we use L4 fields (src/dst port).
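
The engine selection rule stated above can be written as a small function.
This is a sketch only: the DEMO_FIELD_* bits and demo_select_engine() are
hypothetical names, with one bit for each of the six fields listed above.

```c
#include <assert.h>
#include <stdint.h>

enum demo_engine { DEMO_ENGINE_C3HA, DEMO_ENGINE_C3HB };

/* One bit per configurable header field (names are sketch-local). */
#define DEMO_FIELD_MAC_DST  (1u << 0)
#define DEMO_FIELD_L3_INFO  (1u << 1)
#define DEMO_FIELD_IP_SRC   (1u << 2)
#define DEMO_FIELD_IP_DST   (1u << 3)
#define DEMO_FIELD_L4_SPORT (1u << 4)
#define DEMO_FIELD_L4_DPORT (1u << 5)
#define DEMO_FIELD_ALL      0x3fu

/* C3HB is needed as soon as any L4 field takes part in the hash. */
static enum demo_engine demo_select_engine(uint32_t fields)
{
	assert((fields & ~DEMO_FIELD_ALL) == 0);
	if (fields & (DEMO_FIELD_L4_SPORT | DEMO_FIELD_L4_DPORT))
		return DEMO_ENGINE_C3HB;
	return DEMO_ENGINE_C3HA;
}
```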

               Header parser          Dec table
 Ingress pkt  +-------------+ flow id +----------------------------+
------------->| TCAM + SRAM |-------->|TCP IPv4 w/ VLAN, not frag  |
              +-------------+         |TCP IPv4 w/o VLAN, not frag |
                                      |TCP IPv4 w/ VLAN, frag      |--+
                                      |etc.                        |  |
                                      +----------------------------+  |
                                                                      |
                                            Flow table                |
  +---------+   +------------+         +--------------------------+   |
  | RSS tbl |<--| Classifier |<--------| flow 0: C2 lookup        |   |
  +---------+   +------------+         |         C3 lookup port 0 |   |
                 |         |           |         C3 lookup port 1 |   |
         +-----------+ +-------------+ |         ...              |   |
         | C2 engine | | C3H engines | | flow 1: C2 lookup        |<--+
         +-----------+ +-------------+ |         C3 lookup port 0 |
                                       |         ...              |
                                       | ...                      |
                                       | flow 51 : C2 lookup      |
                                       |           ...            |
                                       +--------------------------+

The C2 engine also gains the role of enabling and disabling the RSS
table lookup for this packet.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:25 +0000 (13:54 +0200)]

net: mvpp2: split ingress traffic into multiple flows

The PPv2 classifier allows performing classification operations on each
ingress packet, based on the flow the packet is assigned to.

The current code uses only 1 flow per port, and the only classification
action consists of assigning the rx queue to the packet, depending on the
port.

In preparation for adding RSS support, we have to split all incoming
traffic into different flows. Since RSS assigns a rx queue depending on
the hash of some header fields, we have to make sure that the hash is
generated in a consistent way for all packets in the same flow.

What we call a "flow" is actually a set of attributes attached to a
packet that depends on various L2/L3/L4 info.

This patch introduces 52 flows, which are a combination of various L2, L3
and L4 attributes :
- Whether or not the packet has a VLAN tag
- Whether the packet is IPv4, IPv6 or something else
- Whether the packet is TCP, UDP or something else
- Whether or not the packet is fragmented at L3 level.

The flow is associated to a packet by the Header Parser. Each flow
corresponds to an entry in the decoding table. This entry then points to
the sequence of classification lookups to be performed by the
classifier, represented in the flow table.

For now, the only lookup we perform is a C2 lookup to set the default
rx queue.

               Header parser          Dec table
 Ingress pkt  +-------------+ flow id +----------------------------+
------------->| TCAM + SRAM |-------->|TCP IPv4 w/ VLAN, not frag  |
              +-------------+         |TCP IPv4 w/o VLAN, not frag |
                                      |TCP IPv4 w/ VLAN, frag      |--+
                                      |etc.                        |  |
                                      +----------------------------+  |
                                                                      |
                                           Flow table                 |
                +------------+        +---------------------+         |
     To RxQ <---| Classifier |<-------| flow 0: C2 lookup   |<--------+
                +------------+        | flow 1: C2 lookup   |
                       |              | ...                 |
                +------------+        | flow 51 : C2 lookup |
                | C2 engine  |        +---------------------+
                +------------+

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:24 +0000 (13:54 +0200)]

net: mvpp2: use classifier to assign default rx queue

The PPv2 Controller has a classifier, that can perform multiple lookup
operations for each packet, using different engines.

One of these engines is the C2 engine, which performs TCAM based lookups
on data extracted from the packet header. When a packet matches an
entry, the engine sets various attributes, used to perform
classification operations.

One of these attributes is the rx queue in which the packet should be sent.
The current code uses the lookup_id table (also called decoding table)
to assign the rx queue. However, this only works if we use one entry per
port in the decoding table, which won't be the case once we add RSS
lookups.

This patch uses the C2 engine to assign the rx queue to each packet.

The C2 engine is used through the flow table, which dictates what
classification operations are done for a given flow.

Right now, we have one flow per port, which contains every ingress
packet for this port.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:23 +0000 (13:54 +0200)]

net: mvpp2: rename per-port RSS init function

mvpp22_init_rss function configures the RSS parameters for each port, so
rename it accordingly. Since this function relies on classifier
configuration, move its call right after the classifier config.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:22 +0000 (13:54 +0200)]

net: mvpp2: make sure we don't spread load on disabled CPUs

When filling the RSS table, we have to make sure that the rx queue is
attached to an online CPU.

This patch is not full support for cpu_hotplug, but rather a way to
make sure that we don't break networking on systems booted with the
maxcpus parameter.
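
A userspace sketch of the idea, assuming one rx queue per CPU with the
queue number equal to the CPU number: the table is filled round-robin and
offline CPUs are skipped. demo_fill_rss_table() and the cpu_online array are
hypothetical stand-ins for the kernel's CPU mask handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RSS_ENTRIES 32	/* table size assumed for this sketch */

/*
 * Fill table[] round-robin with rx queue numbers, skipping CPUs marked
 * offline. Returns -1 if no CPU is online.
 */
static int demo_fill_rss_table(uint8_t *table, const int *cpu_online,
			       size_t ncpus)
{
	size_t i, cpu = 0, online = 0;

	assert(table != NULL && cpu_online != NULL);
	for (i = 0; i < ncpus; i++)
		online += cpu_online[i] != 0;
	if (!online)
		return -1;
	for (i = 0; i < DEMO_RSS_ENTRIES; i++) {
		while (!cpu_online[cpu])
			cpu = (cpu + 1) % ncpus;
		table[i] = (uint8_t)cpu;	/* queue number == CPU number */
		cpu = (cpu + 1) % ncpus;
	}
	return 0;
}
```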

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Antoine Tenart [Thu, 12 Jul 2018 11:54:21 +0000 (13:54 +0200)]

net: mvpp2: improve the distribution of packets on CPUs when using RSS

This patch adds an extra indirection when setting the indirection table
into the RSS hardware table to improve the packet distribution across
CPUs. For example, if 2 queues are used on a multi-core system this new
indirection will choose two queues on two different CPUs instead of the
first two queues, which are on the same first CPU.
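
One possible form of such an indirection, as a sketch: assume the physical
queues are laid out CPU by CPU, so CPU c owns queues c * per_cpu up to
(c + 1) * per_cpu - 1. Consecutive logical indices are then spread over the
CPUs first. demo_rxq_indir() and this layout are assumptions made for
illustration, not the driver's exact formula.

```c
#include <assert.h>

/*
 * Map a logical queue index to a physical rx queue so that consecutive
 * indices land on different CPUs.
 */
static unsigned int demo_rxq_indir(unsigned int idx, unsigned int ncpus,
				   unsigned int per_cpu)
{
	assert(ncpus > 0 && per_cpu > 0 && idx < ncpus * per_cpu);
	return (idx % ncpus) * per_cpu + idx / ncpus;
}
```

With 2 CPUs and 4 queues per CPU, logical queues 0 and 1 map to physical
queues 0 and 4, which sit on different CPUs.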

Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Antoine Tenart [Thu, 12 Jul 2018 11:54:20 +0000 (13:54 +0200)]

net: mvpp2: RSS indirection table support

This patch adds the RSS indirection table support, allowing the use of
the ethtool -x and -X options to dump and set this table.

Signed-off-by: Antoine Tenart <[email protected]>
[Maxime: Small warning fixes, use one table per port]
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:19 +0000 (13:54 +0200)]

net: mvpp2: use one RSS table per port

PPv2 Controller has 8 RSS Tables, of 32 entries each. A lookup in the
RXQ2RSS_TABLE is performed for each incoming packet, and the RSS Table
to be used is chosen according to the default rx queue that would be
used for the packet.

This default rx queue is set in the Lookup_id Table (also called
Decoding Table), and is equal to the port->first_rxq.

Since the Classifier itself isn't active at any time for the moment,
this doesn't have a direct effect: the default rx queue at the moment is
the one where all packets end up.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:18 +0000 (13:54 +0200)]

net: mvpp2: fix RSS register definitions

There is no RSS_TABLE register in PPv2 Controller. The register 0x1510
which was specified is actually named "RSS_HASH_SEL", but isn't used by
this driver at all.

Based on how this register was used, it should have been the
RXQ2RSS_TABLE register, which allows selecting the RSS table that will
be used for the incoming packet.

The RSS_TABLE_POINTER is actually a field of this RXQ2RSS_TABLE
register.

Since RSS tables are actually not used by the driver for now, this
commit does not fix a runtime bug.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Antoine Tenart [Thu, 12 Jul 2018 11:54:17 +0000 (13:54 +0200)]

net: mvpp2: fix a typo in the RSS code

Cosmetic patch fixing a typo in one of the RSS comments.

Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:16 +0000 (13:54 +0200)]

net: mvpp2: use only one rx queue per port per CPU

The number of receive queues per port is :
- MVPP2_DEFAULT_RXQ if in single queue mode
- MVPP2_DEFAULT_RXQ * num_possible_cpus if in multi queue mode

with MVPP2_DEFAULT_RXQ = 4.
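
The queue count rule quoted above, before this change, works out as in the
following sketch. The enum and function names are hypothetical;
possible_cpus stands in for the kernel's num_possible_cpus().

```c
#include <assert.h>

#define DEMO_DEFAULT_RXQ 4	/* value of MVPP2_DEFAULT_RXQ quoted above */

enum demo_rxq_mode { DEMO_RXQ_SINGLE, DEMO_RXQ_MULTI };

/* Number of rx queues per port under the rule quoted above. */
static unsigned int demo_port_nrxqs(enum demo_rxq_mode mode,
				    unsigned int possible_cpus)
{
	assert(possible_cpus > 0);
	if (mode == DEMO_RXQ_SINGLE)
		return DEMO_DEFAULT_RXQ;
	return DEMO_DEFAULT_RXQ * possible_cpus;
}
```

On a 4-CPU system in multi queue mode this gives 16 queues per port, of
which only one per CPU is actually needed.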

However, we don't use the extra rx queues at the moment, we really only
need one per port per CPU, until some more advanced classification rules
are implemented.

Suggested-by: Stefan Chulski <[email protected]>
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:15 +0000 (13:54 +0200)]

net: mvpp2: fix hardcoded number of rx queues

There's a dedicated #define that indicates the number of rx queues per
port per cpu, this commit removes a hardcoded use of that value.

This doesn't fix any runtime bugs since the hardcoded value matches the
expected value.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Yan Markman [Thu, 12 Jul 2018 11:54:14 +0000 (13:54 +0200)]

net: mvpp2: use RSS only when using multi-queue mode

Since RSS only applies when we have per-cpu rx queues, it should only
be enabled when the driver is configured to make use of multi-queue
mode.

Signed-off-by: Yan Markman <[email protected]>
[Maxime: Commit message]
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:13 +0000 (13:54 +0200)]

net: mvpp2: make multi queue mode the default mode

The multi queue mode is needed to have RSS available, and offers some
nice advantages, being able to have one rx queue vector per CPU.

This mode has been usable through the use of a module parameter, this
commit makes it the default value.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:12 +0000 (13:54 +0200)]

net: mvpp2: make sure we use single queue mode on PPv2.1

The PPv2 driver defines 2 "queue_modes" :
- QDIST_SINGLE_MODE, where each port shares one rx queue vector
between all CPUs
- QDIST_MULTI_MODE, where each port has one rx queue vector per CPU.

Multi queue mode isn't available on PPv2.1, make sure we fall back to
single mode when running on this revision.
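
The fallback amounts to overriding the requested mode on the older
revision. A sketch, with hypothetical enum and function names that only
mirror the mode names used above:

```c
#include <assert.h>

enum demo_hw_version { DEMO_PPV21, DEMO_PPV22 };
enum demo_queue_mode { DEMO_QDIST_SINGLE_MODE, DEMO_QDIST_MULTI_MODE };

/* Return the mode actually used: PPv2.1 always gets single queue mode. */
static enum demo_queue_mode demo_effective_mode(enum demo_hw_version hw,
						enum demo_queue_mode requested)
{
	assert(hw == DEMO_PPV21 || hw == DEMO_PPV22);
	if (hw == DEMO_PPV21)
		return DEMO_QDIST_SINGLE_MODE;
	return requested;
}
```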

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:11 +0000 (13:54 +0200)]

net: mvpp2: define the number of RSS entries per table in mvpp2.h

The size of the RSS indirection tables should be defined in mvpp2.h,
so that we can use it in all files of the PPv2 driver.

This commit moves the define in mvpp2.h, and adds the missing #include
in mvpp2_cls.h.

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Maxime Chevallier [Thu, 12 Jul 2018 11:54:10 +0000 (13:54 +0200)]

net: mvpp2: fix include guards in mvpp2_prs.h

Include guards should be put before #includes. This doesn't fix any bug,
but prevents future compilation issues when adding new files in the mvpp2
driver.

The Header Parser init function needs the platform_device definition,
and with the fixed include guards we need to add the missing include.
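
For illustration, a header laid out the way the message describes: the
guard comes first, and the header then includes what its own declarations
need. The file name, struct demo_parser and demo_parser_entries() are
hypothetical and unrelated to the real mvpp2_prs.h contents.

```c
/* demo_prs.h: the include guard comes before any #include. */
#ifndef DEMO_PRS_H
#define DEMO_PRS_H

#include <assert.h>	/* assert(), used by the inline helper below */
#include <stddef.h>	/* size_t, used by the declarations below */

struct demo_parser {
	size_t n_entries;
};

/* Small inline helper so the header is usable on its own. */
static inline size_t demo_parser_entries(const struct demo_parser *p)
{
	assert(p != NULL);
	return p->n_entries;
}

#endif /* DEMO_PRS_H */
```

With the guard on the outside, a second inclusion of the header skips its
#include lines as well as its declarations.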

Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Prashant Bhole [Thu, 12 Jul 2018 07:24:59 +0000 (16:24 +0900)]

net: gro: properly remove skb from list

The following crash occurs in validate_xmit_skb_list() when the same skb
is iterated multiple times in the loop and consume_skb() is called.

The root cause is calling list_del_init(&skb->list) and not clearing
skb->next in d4546c2509b1. list_del_init(&skb->list) sets skb->next
to point to skb itself. skb->next needs to be cleared because other
parts of the network stack use another kind of SKB list.
validate_xmit_skb_list() uses such a list.

A similar type of bugfix was reported by Jesper Dangaard Brouer.
https://patchwork.ozlabs.org/patch/942541/

This patch clears skb->next and changes list_del_init() to list_del()
so that list->prev will maintain the list poison.
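
The failure mode can be reproduced in a small userspace model. In this
sketch one pair of next/prev pointers plays both roles: ring link for the
GRO-style list and NULL-terminated chain link for the xmit-style walker.
struct demo_skb and all function names are hypothetical, and the list
poisoning done by the kernel's list_del() is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an skb: only the link pointers. */
struct demo_skb {
	struct demo_skb *next, *prev;
};

static void demo_ring_init(struct demo_skb *head)
{
	head->next = head->prev = head;
}

static void demo_ring_add_tail(struct demo_skb *n, struct demo_skb *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void demo_ring_unlink(struct demo_skb *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

/* Behaves like list_del_init(): unlink, then point the entry at itself. */
static void demo_del_init(struct demo_skb *e)
{
	demo_ring_unlink(e);
	e->next = e->prev = e;
}

/* The fix: unlink, then clear next so chain walkers stop after this skb. */
static void demo_del_and_clear(struct demo_skb *e)
{
	demo_ring_unlink(e);
	e->next = NULL;
}

/* Walk a NULL-terminated chain the way an xmit loop would; cap at limit. */
static int demo_chain_len(const struct demo_skb *skb, int limit)
{
	int n = 0;

	assert(limit > 0);
	while (skb && n < limit) {
		n++;
		skb = skb->next;
	}
	return n;
}
```

After demo_del_init() the walker keeps revisiting the same skb until it
hits the cap, which is the userspace analogue of iterating a freed skb;
after demo_del_and_clear() it sees exactly one entry.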

[  148.185511] ==================================================================
[  148.187865] BUG: KASAN: use-after-free in validate_xmit_skb_list+0x4b/0xa0
[  148.190158] Read of size 8 at addr ffff8801e52eefc0 by task swapper/1/0
[  148.192940]
[  148.193642] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #25
[  148.195423] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[  148.199129] Call Trace:
[  148.200565]  <IRQ>
[  148.201911]  dump_stack+0xc6/0x14c
[  148.203572]  ? dump_stack_print_info.cold.1+0x2f/0x2f
[  148.205083]  ? kmsg_dump_rewind_nolock+0x59/0x59
[  148.206307]  ? validate_xmit_skb+0x2c6/0x560
[  148.207432]  ? debug_show_held_locks+0x30/0x30
[  148.208571]  ? validate_xmit_skb_list+0x4b/0xa0
[  148.211144]  print_address_description+0x6c/0x23c
[  148.212601]  ? validate_xmit_skb_list+0x4b/0xa0
[  148.213782]  kasan_report.cold.6+0x241/0x2fd
[  148.214958]  validate_xmit_skb_list+0x4b/0xa0
[  148.216494]  sch_direct_xmit+0x1b0/0x680
[  148.217601]  ? dev_watchdog+0x4e0/0x4e0
[  148.218675]  ? do_raw_spin_trylock+0x10/0x120
[  148.219818]  ? do_raw_spin_lock+0xe0/0xe0
[  148.221032]  __dev_queue_xmit+0x1167/0x1810
[  148.222155]  ? sched_clock+0x5/0x10
[...]

[  148.474257] Allocated by task 0:
[  148.475363]  kasan_kmalloc+0xbf/0xe0
[  148.476503]  kmem_cache_alloc+0xb4/0x1b0
[  148.477654]  __build_skb+0x91/0x250
[  148.478677]  build_skb+0x67/0x180
[  148.479657]  e1000_clean_rx_irq+0x542/0x8a0
[  148.480757]  e1000_clean+0x652/0xd10
[  148.481772]  net_rx_action+0x4ea/0xc20
[  148.482808]  __do_softirq+0x1f9/0x574
[  148.483831]
[  148.484575] Freed by task 0:
[  148.485504]  __kasan_slab_free+0x12e/0x180
[  148.486589]  kmem_cache_free+0xb4/0x240
[  148.487634]  kfree_skbmem+0xed/0x150
[  148.488648]  consume_skb+0x146/0x250
[  148.489665]  validate_xmit_skb+0x2b7/0x560
[  148.490754]  validate_xmit_skb_list+0x70/0xa0
[  148.491897]  sch_direct_xmit+0x1b0/0x680
[  148.493949]  __dev_queue_xmit+0x1167/0x1810
[  148.495103]  br_dev_queue_push_xmit+0xce/0x250
[  148.496196]  br_forward_finish+0x276/0x280
[  148.497234]  __br_forward+0x44f/0x520
[  148.498260]  br_forward+0x19f/0x1b0
[  148.499264]  br_handle_frame_finish+0x65e/0x980
[  148.500398]  NF_HOOK.constprop.10+0x290/0x2a0
[  148.501522]  br_handle_frame+0x417/0x640
[  148.502582]  __netif_receive_skb_core+0xaac/0x18f0
[  148.503753]  __netif_receive_skb_one_core+0x98/0x120
[  148.504958]  netif_receive_skb_internal+0xe3/0x330
[  148.506154]  napi_gro_complete+0x190/0x2a0
[  148.507243]  dev_gro_receive+0x9f7/0x1100
[  148.508316]  napi_gro_receive+0xcb/0x260
[  148.509387]  e1000_clean_rx_irq+0x2fc/0x8a0
[  148.510501]  e1000_clean+0x652/0xd10
[  148.511523]  net_rx_action+0x4ea/0xc20
[  148.512566]  __do_softirq+0x1f9/0x574
[  148.513598]
[  148.514346] The buggy address belongs to the object at ffff8801e52eefc0
[  148.514346]  which belongs to the cache skbuff_head_cache of size 232
[  148.517047] The buggy address is located 0 bytes inside of
[  148.517047]  232-byte region [ffff8801e52eefc0, ffff8801e52ef0a8)
[  148.519549] The buggy address belongs to the page:
[  148.520726] page:ffffea000794bb00 count:1 mapcount:0 mapping:ffff880106f4dfc0 index:0xffff8801e52ee840 compound_mapcount: 0
[  148.524325] flags: 0x17ffffc0008100(slab|head)
[  148.525481] raw: 0017ffffc0008100 ffff880106b938d0 ffff880106b938d0 ffff880106f4dfc0
[  148.527503] raw: ffff8801e52ee840 0000000000190011 00000001ffffffff 0000000000000000
[  148.529547] page dumped because: kasan: bad access detected

Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: Prashant Bhole <[email protected]>
Reported-by: Tyler Hicks <[email protected]>
Tested-by: Tyler Hicks <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

David S. Miller [Thu, 12 Jul 2018 23:42:40 +0000 (16:42 -0700)]

Merge branch 's390-qeth-updates'

Julian Wiedmann says:

====================
s390/qeth: updates 2018-07-11

please apply this first batch of qeth patches for net-next. It brings the
usual cleanups, and some performance improvements to the transmit paths.
====================

Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:47 +0000 (17:42 +0200)]

s390/qeth: speed-up IPv4 OSA xmit

Move the xmit of offload-eligible (i.e. IPv4) traffic on OSA over to the
new, copy-free path.
As with L2, we'll need to preserve the skb_orphan() behaviour of the
old code path until TX completion is sufficiently fast.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:46 +0000 (17:42 +0200)]

s390/qeth: speed-up L3 IQD xmit

This implements a new xmit path for L3 HiperSockets, which carves the
HW header from skb headroom instead of allocating it from the hdr cache.
It also adds NETIF_F_SG support.

The delta in qeth_l3_xmit() is all just removal of IQD-specific code and
some minor consolidation.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:45 +0000 (17:42 +0200)]

s390/qeth: add a L3 xmit wrapper

In preparation for future work, move the high-level xmit work into a
separate wrapper. This matches the L2 xmit code.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:44 +0000 (17:42 +0200)]

s390/qeth: increase GSO max size for eligible L3 devices

When a L3 device doesn't offer TSO, allow the stack to build full-size
GSO skbs.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:43 +0000 (17:42 +0200)]

s390/qeth: clean up exported symbols

Remove some redundant EXPORTs. While at it, also move some L2-only
prototypes into the proper header file.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:42 +0000 (17:42 +0200)]

s390/qeth: consolidate ccwgroup driver definition

Reshuffle the code a bit so that everything is in one place.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:41 +0000 (17:42 +0200)]

s390/qeth: clean up Output Queue selection

Consolidate duplicated code, fix the misuse of RTN_UNSPEC and simplify
the handling of non-unicast traffic on IQD devices.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:40 +0000 (17:42 +0200)]

s390/qeth: fine-tune RX modesetting

Changing a device's address lists (or its promisc mode) already triggers
an RX modeset, there's no need to do it manually from the L2 driver's
ndo_vlan_rx_kill_vid() hook.

Also when setting a device online, dev_open() already calls
dev_set_rx_mode(). So a manual modeset is only necessary from the
recovery path.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:39 +0000 (17:42 +0200)]

s390/qeth: remove unused buffer->aob pointer

Except for tracing, the pointer is not used.

At the same time, accessing it from qeth_qdio_output_handler() is racy:
whenever qeth_qdio_cq_handler() gets control, its call to
qeth_qdio_handle_aob() frees the AOB.

So the AOB pointer that qeth_qdio_output_handler() stores into 'buffer'
can go stale at any time, and trigger a use-after-free.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Julian Wiedmann [Wed, 11 Jul 2018 15:42:38 +0000 (17:42 +0200)]

s390/qeth: various buffer management cleanups

Use the new qeth_scrub_qdio_buffer() helper, remove an extra parameter
from qeth_clear_output_buffer(), init the bufstates.user field just once
(in qeth_flush_buffers()) and remove some noisy trace messages.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Jesper Dangaard Brouer [Wed, 11 Jul 2018 15:01:20 +0000 (17:01 +0200)]

net: ipv4: fix listify ip_rcv_finish in case of forwarding

In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") the call to
dst_input(skb) was split out. The ip_sublist_rcv_finish() function just
calls dst_input(skb) in a loop.

The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input(). Furthermore, we need to
clear skb->next, as other parts of the network stack use another kind
of SKB list for xmit_more (see dev_hard_start_xmit).

A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
dst_output()/ip_output(), which eventually calls __dev_queue_xmit() +
sch_direct_xmit(); the crash then happens in validate_xmit_skb_list().

This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB-list through to ip_forward.
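
As a rough illustration of the fix described above (not the actual patch),
the loop below detaches each packet from the list and clears its next
pointer before handing it on. The struct pkt type, sublist_deliver() and
record_delivery() are hypothetical stand-ins for the SKB list and
dst_input(); a plain singly linked list is assumed.

```c
#include <stddef.h>

/* Hypothetical stand-in for an SKB that is chained through ->next. */
struct pkt {
	struct pkt *next;
	int id;
};

typedef void (*deliver_fn)(struct pkt *p);

int n_delivered;
int n_still_linked;

/* Example consumer: counts deliveries and any packet that is still chained. */
void record_delivery(struct pkt *p)
{
	n_delivered++;
	if (p->next)
		n_still_linked++;
}

/* Walk the list, unlinking every packet before it is delivered. */
void sublist_deliver(struct pkt *head, deliver_fn deliver)
{
	while (head) {
		struct pkt *p = head;

		head = p->next;	/* remember the rest of the list first */
		p->next = NULL;	/* detach: later layers may reuse ->next */
		deliver(p);
	}
}
```

The point of the sketch is the ordering: the remainder of the list is saved
first, so the consumer is free to reuse the packet's next pointer.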

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Arnd Bergmann [Wed, 11 Jul 2018 12:29:53 +0000 (14:29 +0200)]

nfp: avoid using getnstimeofday64()

getnstimeofday64 is deprecated in favor of the ktime_get() family of
functions. The direct replacement would be ktime_get_real_ts64(),
but I'm picking the basic ktime_get() instead:

- using a ktime_t simplifies the code compared to timespec64
- using monotonic time instead of real time avoids issues caused
by a concurrent settimeofday() or during a leap second adjustment.

Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Arnd Bergmann [Wed, 11 Jul 2018 12:29:52 +0000 (14:29 +0200)]

liquidio: use ktime_get_real_ts64() instead of getnstimeofday64()

The two do the same thing, but we want to have a consistent
naming in the kernel.

Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Felix Manlunas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 21:54:12 +0000 (14:54 -0700)]

Merge branch 'net-sched-act_skbedit-lockless-data-path'

Davide Caratti says:

====================
net/sched: act_skbedit: lockless data path

the data path of act_skbedit can be faster if we avoid using spinlocks:
- patch 1 converts act_skbedit statistics to use per-cpu counters
- patch 2 lets act_skbedit use RCU to read/update its configuration

test procedure (using pktgen from https://github.com/netoptimizer):

# ip link add name eth1 type dummy
# ip link set dev eth1 up
# tc qdisc add dev eth1 clsact
# tc filter add dev eth1 egress matchall action skbedit priority c1a0:c1a0
# for c in 1 2 4 ; do
> ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1
> done

test results (avg. pps/thread)

  $c | before patch |  after patch | improvement
-----+--------------+--------------+------------
   1 | 3917464 ± 3% | 4000458 ± 3% |  irrelevant
   2 | 3455367 ± 4% | 3953076 ± 1% |        +14%
   4 | 2496594 ± 2% | 3801123 ± 3% |        +52%

v2: rebased on latest net-next
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Davide Caratti [Wed, 11 Jul 2018 14:04:50 +0000 (16:04 +0200)]

net/sched: act_skbedit: don't use spinlock in the data path

use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on
act_skbedit configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.

Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Davide Caratti [Wed, 11 Jul 2018 14:04:49 +0000 (16:04 +0200)]

net/sched: skbedit: use per-cpu counters

use per-CPU counters, instead of sharing a single set of stats with all
cores: this removes the need of spinlocks when stats are read/updated.
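
A minimal userspace model of the idea, assuming a fixed CPU count and a
plain array in place of the kernel's real per-CPU allocations; all names
here (SKETCH_NR_CPUS, struct action_stats, stats_update(), stats_read())
are hypothetical. Each CPU only ever writes its own slot, and a reader
sums the slots.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* assumed CPU count for this model */

struct cpu_counters {
	uint64_t packets;
	uint64_t bytes;
};

/* One private set of counters per CPU instead of one shared, locked set. */
struct action_stats {
	struct cpu_counters cpu[SKETCH_NR_CPUS];
};

/* Data path: touch only the slot of the CPU we are running on. */
void stats_update(struct action_stats *s, unsigned int cpu, uint64_t len)
{
	s->cpu[cpu].packets++;
	s->cpu[cpu].bytes += len;
}

/* Slow path: fold all per-CPU slots into one total. */
struct cpu_counters stats_read(const struct action_stats *s)
{
	struct cpu_counters sum = { 0, 0 };
	unsigned int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++) {
		sum.packets += s->cpu[i].packets;
		sum.bytes += s->cpu[i].bytes;
	}
	return sum;
}
```

The model ignores how the kernel keeps 64-bit counter reads consistent on
32-bit machines; it only shows why writers no longer contend with each other.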

Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Arnd Bergmann [Wed, 11 Jul 2018 10:16:12 +0000 (12:16 +0200)]

tcp: use monotonic timestamps for PAWS

Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.

A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.

While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().

To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.

Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.
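
For reference, the "overflow-safe method of comparing the timestamps"
mentioned above can be sketched as wraparound-aware subtraction on 32-bit
values. The helpers stamp_after() and stamp_elapsed() are hypothetical
names for this illustration, not functions from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if stamp a is later than stamp b, treating both as points on a
 * 32-bit circle. Relies on the usual two's-complement conversion when
 * casting the unsigned difference to a signed type.
 */
bool stamp_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Seconds elapsed from 'then' to 'now', computed modulo 2^32. */
uint32_t stamp_elapsed(uint32_t now, uint32_t then)
{
	return now - then;
}
```

Such a comparison stays correct across a counter wrap as long as the two
stamps are less than 2^31 seconds apart. It does not help with clock jumps
from settimeofday(), which is the part a monotonic clock addresses.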

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Vakul Garg [Wed, 11 Jul 2018 09:02:20 +0000 (14:32 +0530)]

net/tls: Use aead_request_alloc/free for request alloc/free

Instead of kzalloc/free for aead_request allocation and free, use
functions aead_request_alloc(), aead_request_free(). It ensures that
any sensitive crypto material held in crypto transforms is securely
erased from memory.

Signed-off-by: Vakul Garg <[email protected]>
Acked-by: Dave Watson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Pieter Jansen van Vuuren [Wed, 11 Jul 2018 01:22:31 +0000 (18:22 -0700)]

tc-testing: add geneve options in tunnel_key unit tests

Extend tc tunnel_key action unit tests with geneve options. Tests
include testing single and multiple geneve options, as well as
testing geneve options that are expected to fail.

Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Acked-by: Lucas Bates <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Daniel Borkmann [Thu, 12 Jul 2018 18:45:23 +0000 (20:45 +0200)]

Merge branch 'bpf-arm-jit-improvements'

Russell King says:

====================
This series improves the ARM BPF JIT compiler by:

- enumerating the stack layout rather than using constants that happen
  to be multiples of four
- rejig the BPF "register" accesses to use negative numbers instead of
  positive, which could be confused with register numbers in the bpf2a32
  array.
- since we maintain the ARM FP register as a pointer to the top of our
  scratch space (or, with frame pointers enabled, a valid ARM frame
  pointer register), we can access our scratch space using FP, which is
  constant across all BPF programs, including tail-called programs.
- use immediate forms of ARM instructions where possible, rather than
  first loading the immediate into an ARM register.
- use load-with-shift instruction rather than separate shift instruction
  followed by load
- avoid reloading index and array in the tail-call code
- use double-word load/store instructions where available

Version 2:

- Fix ARMv5 test pointed out by Olof
- Fix build error found by 0-day (adding an additional patch)
====================

Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:38 +0000 (10:32 +0100)]

ARM: net: bpf: use double-word load/stores where available

Use double-word loads and stores where the CPU architecture supports
these instructions.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:33 +0000 (10:32 +0100)]

ARM: net: bpf: always use odd/even register pair

Always use an odd/even register pair for our 64-bit registers, so that
we're able to use the double-word load/store instructions in the future.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:28 +0000 (10:32 +0100)]

ARM: net: bpf: avoid reloading 'array'

Rearranging the order of the initial tail call code a little allows us
to avoid reloading the 'array' pointer.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:22 +0000 (10:32 +0100)]

ARM: net: bpf: avoid reloading 'index'

Avoid reloading 'index' after we have validated it - it remains in
tmp2[1] up to the point that we begin the code to index the pointer
array, so with a little rearrangement of the registers, we can use
the already loaded value.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:17 +0000 (10:32 +0100)]

ARM: net: bpf: use ldr instructions with shifted rm register

Rather than pre-shifting the rm register for the ldr in the tail call,
shift it in the load instruction. This eliminates one unnecessary
instruction.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:12 +0000 (10:32 +0100)]

ARM: net: bpf: use immediate forms of instructions where possible

Rather than moving constants to a register and then using them in a
subsequent instruction, use them directly in the desired instruction
cutting out the "middle" register. This removes two instructions from
the tail call code path.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:07 +0000 (10:32 +0100)]

ARM: net: bpf: imm12 constant conversion

Provide a version of the imm8m() function that the compiler can optimise
when used with a constant expression.
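
As background, an ARM data-processing immediate is an 8-bit value rotated
right by an even amount, packed into 12 bits as (rot << 8) | imm8. The
run-time search below is a sketch of that encoding under this assumption;
arm_imm8m(), rotl32() and rotr32() are illustrative names, and this is not
the constant-folding variant the patch adds.

```c
#include <stdint.h>

uint32_t rotl32(uint32_t w, unsigned int s)
{
	return (w << (s & 31)) | (w >> ((32 - s) & 31));
}

uint32_t rotr32(uint32_t w, unsigned int s)
{
	return (w >> (s & 31)) | (w << ((32 - s) & 31));
}

/*
 * Return the 12-bit encoding of x, or -1 if x cannot be expressed as an
 * 8-bit value rotated right by an even number of bits.
 */
int arm_imm8m(uint32_t x)
{
	unsigned int rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~rotr32(0xff, 2 * rot)) == 0)
			return (int)(rotl32(x, 2 * rot) | (rot << 8));
	return -1;
}
```

For example, 0xff000000 is 0xff rotated right by 8 bits, so the rotation
field is 4 and the encoding is 0x4ff, while 0x101 spans nine bits and has
no encoding.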

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:32:02 +0000 (10:32 +0100)]

ARM: net: bpf: access eBPF scratch space using ARM FP register

Access the eBPF scratch space using the frame pointer rather than our
stack pointer, as the offsets from the ARM frame pointer are constant
across all eBPF programs.

Since we no longer reference the scratch space registers from the stack
pointer, this simplifies emit_push_r64() as it no longer needs to know
how many words are pushed onto the stack.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:57 +0000 (10:31 +0100)]

ARM: net: bpf: 64-bit accessor functions for BPF registers

Provide a couple of 64-bit register accessors, and use them where
appropriate.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:52 +0000 (10:31 +0100)]

ARM: net: bpf: provide accessor functions for BPF registers

Many of the code paths need to have knowledge about whether a register
is stacked or in a CPU register. Move this decision making to a pair
of helper functions instead of having it scattered throughout the
code.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:47 +0000 (10:31 +0100)]

ARM: net: bpf: remove is_on_stack() and sstk/dstk

The decision about whether a BPF register is on the stack or in a CPU
register is detected at the top BPF insn processing level, and then
percolated throughout the remainder of the code. Since we now use
negative register values to represent stacked registers, we can detect
where a BPF register is stored without resorting to carrying this
additional metadata through all code paths.
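
A small sketch of the idea, assuming that non-negative values name an ARM
core register and negative values are a byte offset from the frame
pointer. The sketch_get_reg32() accessor and struct emit_log are
hypothetical and only record what a real JIT would emit.

```c
#include <stdbool.h>
#include <stdint.h>

/* With this encoding the sign alone says where a BPF register lives. */
bool is_stacked(int8_t reg)
{
	return reg < 0;
}

/* Records what a real JIT would have emitted. */
struct emit_log {
	int loads;
	int last_offset;
};

/*
 * Return the CPU register that holds the value: the register itself, or
 * the temporary after a (logged) load from the scratch stack.
 */
int8_t sketch_get_reg32(int8_t reg, int8_t tmp, struct emit_log *log)
{
	if (is_stacked(reg)) {
		log->loads++;
		log->last_offset = reg;	/* would emit: ldr tmp, [fp, #reg] */
		return tmp;
	}
	return reg;
}
```

Callers no longer need a separate "is on stack" flag passed alongside the
register number, which is the metadata the commit removes.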

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:41 +0000 (10:31 +0100)]

ARM: net: bpf: use negative numbers for stacked registers

Use negative numbers for eBPF registers that live on the stack.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:36 +0000 (10:31 +0100)]

ARM: net: bpf: provide load/store ops with negative immediates

Provide a set of load/store opcode generators that work with negative
immediates as well as positive ones.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Russell King [Wed, 11 Jul 2018 09:31:31 +0000 (10:31 +0100)]

ARM: net: bpf: enumerate the JIT scratch stack layout

Enumerate the contents of the JIT scratch stack layout used for storing
some of the JIT's 64-bit registers, tail call counter and AX register.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Daniel Borkmann [Thu, 12 Jul 2018 16:55:54 +0000 (18:55 +0200)]

Merge branch 'bpf-helper-man-install'

Quentin Monnet says:

====================
The three patches in this series are related to the documentation for eBPF
helpers. The first patch brings minor formatting edits to the documentation
in include/uapi/linux/bpf.h, and the second one updates the related header
file under tools/.

The third patch adds a Makefile under tools/bpf for generating the
documentation (man pages) about eBPF helpers. The targets defined in this
file can also be called from the bpftool directory (please refer to
relevant commit logs for details).
====================

Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Quentin Monnet [Thu, 12 Jul 2018 11:52:24 +0000 (12:52 +0100)]

tools: bpf: build and install man page for eBPF helpers from bpftool/

Provide a new Makefile.helpers in tools/bpf, in order to build and
install the man page for eBPF helpers. This Makefile is also included in
the one used to build bpftool documentation, so that it can be called
either on its own (cd tools/bpf && make -f Makefile.helpers) or from
bpftool directory (cd tools/bpf/bpftool && make doc, or
cd tools/bpf/bpftool/Documentation && make helpers).

Makefile.helpers is not added directly to bpftool to avoid changing its
Makefile too much (helpers are not 100% directly related with bpftool).
But the possibility to build the page from bpftool directory makes us
able to package the helpers man page with bpftool, and to install it
along with bpftool documentation, so that the doc for helpers becomes
easily available to developers through the "man" program.

Cc: [email protected]
Suggested-by: Daniel Borkmann <[email protected]>
Signed-off-by: Quentin Monnet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Quentin Monnet [Thu, 12 Jul 2018 11:52:23 +0000 (12:52 +0100)]

tools: bpf: synchronise BPF UAPI header with tools

Update with latest changes from include/uapi/linux/bpf.h header.

Signed-off-by: Quentin Monnet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

Quentin Monnet [Thu, 12 Jul 2018 11:52:22 +0000 (12:52 +0100)]

bpf: fix documentation for eBPF helpers

Minor formatting edits for eBPF helpers documentation, including blank
lines removal, fix of item list for return values in bpf_fib_lookup(),
and missing prefix on bpf_skb_load_bytes_relative().

Signed-off-by: Quentin Monnet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 07:03:31 +0000 (00:03 -0700)]

Merge branch 'be2net-small-structures-clean-up'

Ivan Vecera says:

====================
be2net: small structures clean-up

The series:
- removes unused / unnecessary fields in several be2net structures
- re-orders fields in some structures to eliminate holes and cache-line
crossings
- as a result, reduces the size of the main struct be_adapter by 4kB
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:48 +0000 (22:59 +0200)]

be2net: move rss_flags field in rss_info to ensure proper alignment

The current position of the .rss_flags field in struct rss_info causes
the .rsstable and .rss_queue fields (both 128 bytes long) to be
misaligned with cache-line boundaries. Moving it to the end properly
aligns all fields.

Before patch:
struct rss_info {
        u64                        rss_flags;            /*     0     8 */
        u8                         rsstable[128];        /*     8   128 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        u8                         rss_queue[128];       /*   136   128 */
        /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
        u8                         rss_hkey[40];         /*   264    40 */
};

After patch:
struct rss_info {
        u8                         rsstable[128];        /*     0   128 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        u8                         rss_queue[128];       /*   128   128 */
        /* --- cacheline 4 boundary (256 bytes) --- */
        u8                         rss_hkey[40];         /*   256    40 */
        u64                        rss_flags;            /*   296     8 */
};
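
The layouts above can be checked with offsetof(). The sketch below mirrors
the two pahole dumps using fixed-width types and assumes 64-byte cache
lines, as in the dumps; the _before/_after struct names and
cachelines_touched() are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64u	/* assumed cache-line size, as in the pahole output */

struct rss_info_before {
	uint64_t rss_flags;
	uint8_t  rsstable[128];
	uint8_t  rss_queue[128];
	uint8_t  rss_hkey[40];
};

struct rss_info_after {
	uint8_t  rsstable[128];
	uint8_t  rss_queue[128];
	uint8_t  rss_hkey[40];
	uint64_t rss_flags;
};

/* Number of cache lines covered by a field at 'offset' of 'size' bytes. */
size_t cachelines_touched(size_t offset, size_t size)
{
	return (offset + size - 1) / CACHELINE - offset / CACHELINE + 1;
}
```

With rss_flags first, each 128-byte table starts 8 bytes past a boundary
and so covers three cache lines; with rss_flags last, each covers exactly
two.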

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:47 +0000 (22:59 +0200)]

be2net: re-order fields in be_error_recovert to avoid hole

- Unionize two u8 fields where only one of them is used depending on NIC
chipset.
- Move recovery_supported field after that union

These changes eliminate a 7-byte hole in the struct and make it smaller
by 8 bytes.

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:46 +0000 (22:59 +0200)]

be2net: remove unused tx_jiffies field from be_tx_stats

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:45 +0000 (22:59 +0200)]

be2net: move txcp field in be_tx_obj to eliminate holes in the struct

Before patch:
struct be_tx_obj {
        u32                        db_offset;            /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        struct be_queue_info       q;                    /*     8    56 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct be_queue_info       cq;                   /*    64    56 */
        struct be_tx_compl_info    txcp;                 /*   120     4 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        struct sk_buff *           sent_skb_list[2048];  /*   128 16384 */
        ...
};

After patch:
struct be_tx_obj {
        u32                        db_offset;            /*     0     4 */
        struct be_tx_compl_info    txcp;                 /*     4     4 */
        struct be_queue_info       q;                    /*     8    56 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct be_queue_info       cq;                   /*    64    56 */
        struct sk_buff *           sent_skb_list[2048];  /*   120 16384 */
        ...
};

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:44 +0000 (22:59 +0200)]

be2net: reorder fields in be_eq_obj structure

Re-order fields in struct be_eq_obj to ensure that the .napi field begins
at the start of a cache line. The .adapter field is also moved to the
first cache line, next to the .q field, and three fields (idx, msi_idx,
spurious_intr) together with the 4-byte hole move to the third cache line.

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:43 +0000 (22:59 +0200)]

be2net: remove desc field from be_eq_obj

The event queue description (be_eq_obj.desc) field is used only to format
the IRQ name string, so there is no real need to store this value.
Remove it and use a local variable to format the IRQ name string.

Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:42 +0000 (22:59 +0200)]

be2net: remove unused old custom busy-poll fields

The commit fb6113e688e0 ("be2net: get rid of custom busy poll code")
replaced custom busy-poll code by the generic one but left several
macros and fields in struct be_eq_obj that are currently unused.
Remove this stuff.

Fixes: fb6113e688e0 ("be2net: get rid of custom busy poll code")
Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Vecera [Tue, 10 Jul 2018 20:59:41 +0000 (22:59 +0200)]

be2net: remove unused old AIC info

The commit 2632bafd74ae ("be2net: fix adaptive interrupt coalescing")
introduced a separate struct be_aic_obj to hold AIC information but
unfortunately left the old stuff in be_eq_obj. So remove it.

Fixes: 2632bafd74ae ("be2net: fix adaptive interrupt coalescing")
Signed-off-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Ivan Khoronzhuk [Tue, 10 Jul 2018 13:04:04 +0000 (16:04 +0300)]

net: ethernet: ti: cpts: break cycle once late ts is matched

The late ts queue can contain a bunch of skbs during high-rate testing;
there is no need to check all of them once the timestamp is matched.

Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Petr Machata [Tue, 10 Jul 2018 12:44:26 +0000 (14:44 +0200)]

selftests: forwarding: mirror_gre_nh: Unset rp_filter on host VRF

The mirrored packets arrive at $h3 encapsulated in GRE/IPv4, with IP
address from 192.0.2.128/28 network. However the interface is configured
as a member of 192.0.2.160/28 and there's no route directing traffic
from the former network through that interface. Correspondingly, the RP
filter on the VRF rejects it.

Therefore turn off the VRF's RP filter.

Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 06:10:20 +0000 (23:10 -0700)]

Merge branch 'mlxsw-ERSPAN-Take-LACP-state-into-consideration'

Ido Schimmel says:

====================
mlxsw: ERSPAN: Take LACP state into consideration

Petr says:

When offloading mirror-to-gretap, mlxsw needs to preroute the path that
the encapsulated packet will take. That path may include a LAG device
above a front panel port. So far, mlxsw resolved the path to the first
up front panel slave of the LAG interface, but that only reflects
administrative state of the port. It neglects to consider whether the
port actually has a carrier, and what the LACP state is. This patch set
aims to address these problems.

Patch #1 publishes team_port_get_rcu().

Then in patch #2, a new function is introduced,
mlxsw_sp_port_dev_check(). That returns, for a given netdevice that is a
slave of a LAG device, whether that device is "txable", i.e. whether the
LAG master would send traffic through it. Since there's no good place to
put LAG-wide helpers, introduce a new header include/net/lag.h.

Finally in patch #3, fix the slave selection logic to take into
consideration whether a given slave has a carrier and whether it is
txable.
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Petr Machata [Tue, 10 Jul 2018 07:02:59 +0000 (10:02 +0300)]

mlxsw: spectrum_span: Change LAG lower selection

When offloading mirror-to-gretap, mlxsw needs to preroute the path that
the encapsulated packet will take. That path may include a LAG device
above a front panel port. So far, mlxsw resolved the path to the first
up front panel slave of the LAG interface, but that only reflects
administrative state of the port. It neglects to consider whether the
port actually has a carrier, and what the LACP state is.

So instead of checking upness of the device, check carrier state and
txability.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Petr Machata [Tue, 10 Jul 2018 07:02:58 +0000 (10:02 +0300)]

net: Add lag.h, net_lag_port_dev_txable()

LAG devices (team or bond) recognize for each one of their slave devices
whether LAG traffic is going to be sent through that device. Bond calls
such devices "active", team calls them "txable". When this state
changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together
with a netdev_notifier_changelowerstate_info structure that for LAG
devices includes a tx_enabled flag that refers to the new state. The
notification thus makes it possible to react to the changes in txability
in drivers.

However there's no way to query txability from the outside on demand.
That is problematic namely for mlxsw, which when resolving ERSPAN packet
path, may encounter a LAG device, and needs to determine which of the
slaves it should choose.

To that end, introduce a new function, net_lag_port_dev_txable(), which
determines whether a given slave device is "active" or
"txable" (depending on the flavor of the LAG device). That function then
dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp.
team_port_dev_txable().

Because there currently is no good place where net_lag_port_dev_txable()
should be added, introduce a new header file, lag.h, which should from
now on hold any logic common to both team and bond. (But keep
netif_is_lag_master() together with the rest of netif_is_*_master()
functions).
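
The dispatch described above, plus the "carrier and txable" selection used
by the spectrum_span change, might look roughly like this in a userspace
model. Everything here (struct port_sketch, enum lag_flavor,
lag_port_txable(), lag_pick_port()) is a hypothetical stand-in rather
than the kernel API.

```c
#include <stdbool.h>

enum lag_flavor { LAG_BOND, LAG_TEAM };	/* hypothetical flavor tag */

/* Hypothetical model of a LAG slave port. */
struct port_sketch {
	enum lag_flavor flavor;
	bool carrier;
	bool bond_active;	/* what bond calls "active" */
	bool team_txable;	/* what team calls "txable" */
};

bool bond_port_active(const struct port_sketch *p)
{
	return p->bond_active;
}

bool team_port_txable(const struct port_sketch *p)
{
	return p->team_txable;
}

/* Flavor-independent query that dispatches to the per-flavor helper. */
bool lag_port_txable(const struct port_sketch *p)
{
	if (p->flavor == LAG_TEAM)
		return team_port_txable(p);
	return bond_port_active(p);
}

/* Index of the first port that has carrier and is txable, or -1. */
int lag_pick_port(const struct port_sketch *ports, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (ports[i].carrier && lag_port_txable(&ports[i]))
			return i;
	return -1;
}
```

Keeping the flavor check in one helper means a driver resolving a path
through a LAG does not need to know whether it sits on top of bond or team.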

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Petr Machata [Tue, 10 Jul 2018 07:02:57 +0000 (10:02 +0300)]

team: Publish team_port_get_rcu()

A follow-up patch adds a new entry point, team_port_dev_txable(). Making
it an ordinary exported function would mean that any module that may
need the service in one of the supported configurations also
unconditionally needs to pull in the team module, whether or not the
user actually intends to create team interfaces.

To prevent that, team_port_dev_txable() is defined in if_team.h, and
therefore all dependencies of that function also need to be
publicly-visible.

Therefore move team_port_get_rcu() from team.c to if_team.h.

Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Travis Brown [Tue, 10 Jul 2018 00:35:01 +0000 (00:35 +0000)]

macvlan: Change status when lower device goes down

Today macvlan ignores the notification when a lower device goes
administratively down, preventing the lack of connectivity from
bubbling up.

Processing NETDEV_DOWN results in a macvlan state of LOWERLAYERDOWN
with NO-CARRIER which should be easy to interpret in userspace.

2: lower: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
3: macvlan@lower: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000

Signed-off-by: Suresh Krishnan <[email protected]>
Signed-off-by: Travis Brown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 06:06:14 +0000 (23:06 -0700)]

Merge branch 'tipc-make-link-protocol-more-resilient'

Jon Maloy says:

====================
tipc: make link protocol more resilient

These two commits make the link protocol more resilient to
infrastructures with frequent packet duplication and long delays.
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Jon Maloy [Mon, 9 Jul 2018 23:07:36 +0000 (01:07 +0200)]

tipc: check session number before accepting link protocol messages

In some virtual environments we observe significantly more packet
reordering and longer delays than we have traditionally been used to.

This makes stricter checks necessary on the session number of incoming
link protocol messages, which until now has only been validated for
RESET messages.

Since the other two message types, ACTIVATE and STATE messages also
carry this number, it is easy to extend the validation check to those
messages.

We also introduce a flag indicating whether a link has a valid peer
session number. This eliminates the mixing of 32- and 16-bit arithmetic
we currently use to achieve this.

Acked-by: Ying Xue <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Jon Maloy [Mon, 9 Jul 2018 23:07:35 +0000 (01:07 +0200)]

tipc: add sequence number check for link STATE messages

Some switch infrastructures produce huge amounts of packet duplicates.
This becomes a problem if those messages are STATE/NACK protocol
messages, causing unnecessary retransmissions of already accepted
packets.

We now introduce a unique sequence number per STATE protocol message
so that duplicates can be identified and ignored. This will also be
useful when tracing such cases, and to avert replay attacks when TIPC
is encrypted.

For compatibility reasons we have to introduce a new capability flag
TIPC_LINK_PROTO_SEQNO to handle this new feature.
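
A hedged sketch of how such a duplicate filter could work, assuming a
16-bit sequence number that wraps and a receiver that tracks the next
expected value. struct state_rx, seq_less() and state_msg_accept() are
hypothetical names, and the real acceptance rule in the patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver state: next expected STATE sequence number. */
struct state_rx {
	uint16_t rcv_nxt_state;
};

/* Wraparound-aware "a is older than b" on 16-bit sequence numbers. */
bool seq_less(uint16_t a, uint16_t b)
{
	return (int16_t)(uint16_t)(a - b) < 0;
}

/* Accept a STATE message only if it is not older than expected. */
bool state_msg_accept(struct state_rx *rx, uint16_t seqno)
{
	if (seq_less(seqno, rx->rcv_nxt_state))
		return false;	/* duplicate or stale: ignore */
	rx->rcv_nxt_state = (uint16_t)(seqno + 1);
	return true;
}
```

A second copy of an already processed message then compares as older than
the expected number and is dropped, including across the 0xffff to 0 wrap.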

Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 06:03:32 +0000 (23:03 -0700)]

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09

This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.

The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue; however, it doesn't
really make sense to configure a macvlan interface as multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.

The big changes in this set are:
  Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
  Disable XPS for single queue devices
  Replace accel_priv with sb_dev in ndo_select_queue
  Add sb_dev parameter to fallback function for ndo_select_queue
  Consolidated ndo_select_queue functions that appeared to be duplicates
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Deepti Raghavan [Mon, 9 Jul 2018 17:53:39 +0000 (17:53 +0000)]

tcp: expose both send and receive intervals for rate sample

Congestion control algorithms, which access the rate sample
through the tcp_cong_control function, only have access to the maximum
of the send and receive interval, for cases where the acknowledgment
rate may be inaccurate due to ACK compression or decimation. Algorithms
may want to use send rates and receive rates as separate signals.
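
To make the distinction concrete, here is a small user-space sketch of a
rate sample that keeps both intervals alongside their maximum. The
struct and helper names (rate_sample_sketch, rate_sample_fill,
rate_pkts_per_sec) are hypothetical and only loosely modelled on the
description above; they are not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for a rate sample; field names are illustrative. */
struct rate_sample_sketch {
	int64_t snd_interval_us; /* time over which the packets were sent */
	int64_t rcv_interval_us; /* time over which their ACKs arrived */
	int64_t interval_us;     /* max of the two, the only value seen before */
	int32_t delivered;       /* packets delivered over the interval */
};

static void rate_sample_fill(struct rate_sample_sketch *rs,
			     int64_t snd_us, int64_t rcv_us,
			     int32_t delivered)
{
	rs->snd_interval_us = snd_us;
	rs->rcv_interval_us = rcv_us;
	rs->interval_us = snd_us > rcv_us ? snd_us : rcv_us;
	rs->delivered = delivered;
}

/* Packets per second over the given interval; 0 if the interval is invalid. */
static int64_t rate_pkts_per_sec(int32_t delivered, int64_t interval_us)
{
	if (interval_us <= 0)
		return 0;
	return (int64_t)delivered * 1000000 / interval_us;
}
```

An algorithm that only sees interval_us gets the lower of the two rates;
with both intervals exposed it can compute a send rate and a receive
rate separately and treat them as distinct signals.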

Signed-off-by: Deepti Raghavan <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Vlad Buslov [Mon, 9 Jul 2018 17:26:47 +0000 (20:26 +0300)]

net: sched: fix unprotected access to rcu cookie pointer

Fix action attribute size calculation function to take rcu read lock and
access act_cookie pointer with rcu dereference.

Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update")
Reported-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 05:59:39 +0000 (22:59 -0700)]

Merge branch 'cxgb4-move-stats-fetched-from-firmware-to-debugfs'

Rahul Lakkireddy says:

====================
cxgb4: move stats fetched from firmware to debugfs

Some stats are fetched via slow firmware mailbox, which can cause
packet drops under heavy load. So, this series removes these stats
from ethtool -S and exposes them via debugfs.

Patch 1 removes stats fetched via firmware from ethtool -S.
Patch 2 exposes stats removed in Patch 1 via debugfs.
====================

Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Rahul Lakkireddy [Mon, 9 Jul 2018 16:12:47 +0000 (21:42 +0530)]

cxgb4: expose stats fetched from firmware via debugfs

Expose stats obtained from firmware via debugfs. These stats can't
be part of ethtool -S because the slow firmware mailbox can cause
packet drops under heavy load.

Signed-off-by: Rahul Lakkireddy <[email protected]>
Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Rahul Lakkireddy [Mon, 9 Jul 2018 16:12:46 +0000 (21:42 +0530)]

cxgb4: remove stats fetched from firmware

When running ethtool -S, some stats are requested from firmware.
Since getting these stats via firmware mailbox is slow, some packets
get dropped under heavy load while running ethtool -S.

So, remove these stats from ethtool -S.

Signed-off-by: Rahul Lakkireddy <[email protected]>
Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Antoine Tenart [Mon, 9 Jul 2018 15:00:43 +0000 (17:00 +0200)]

net: mvpp2: explicitly include linux/interrupt.h

The Marvell PPv2 driver uses interrupts and tasklets but does not
explicitly include linux/interrupt.h, relying on implicit includes. This
header in particular is only included by chance, through a long and
illogical chain of inclusions. Fix this so we do not get future build
breaks.

Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Jan Dakinevich [Mon, 9 Jul 2018 13:51:19 +0000 (16:51 +0300)]

cnic: use kvzalloc to allocate memory for csk_tbl

Size of csk_tbl is about 58K, which means 3rd order page allocation.
kvzalloc provides a fallback if no high order memory is available.

Signed-off-by: Jan Dakinevich <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Colin Ian King [Mon, 9 Jul 2018 12:23:13 +0000 (13:23 +0100)]

wimax/i2400m: remove redundant variables ack_status, bcf and protocol

Variables ack_status, bcf and protocol are being assigned but are
never used; hence they are redundant and can be removed.

Also declare ack_type as unsigned int rather than unsigned to clean
up a checkpatch warning.

Cleans up clang warnings:
warning: variable 'ack_status' set but not used [-Wunused-but-set-variable]
warning: variable 'bcf' set but not used [-Wunused-but-set-variable]
warning: variable 'protocol' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Vlad Buslov [Mon, 9 Jul 2018 11:33:26 +0000 (14:33 +0300)]

net: sched: act_ife: fix memory leak in ife init

Free params if tcf_idr_check_alloc() returned error.

Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

Arjun Vynipadath [Mon, 9 Jul 2018 11:22:03 +0000 (16:52 +0530)]

cxgb4: specify IQTYPE in fw_iq_cmd

The congestion argument passed to t4_sge_alloc_rxq() is used
to differentiate between nic/ofld queues.

Signed-off-by: Arjun Vynipadath <[email protected]>
Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit | commitdiff | tree

David S. Miller [Thu, 12 Jul 2018 05:50:46 +0000 (22:50 -0700)]

Merge branch 'net-ipv6-addr_gen_mode-fixes'

Sabrina Dubroca says:

====================
net/ipv6: addr_gen_mode fixes

This series fixes bugs in handling of the addr_gen_mode option, mainly
related to the sysctl. A minor netlink issue was also present in the
initial commit introducing the option on a per-netdevice basis.

v2: add patch 4, requested by David Ahern during review of v1
    add patch 5, missing documentation for the sysctl
    patches 1, 2, 3 are unchanged
====================

Signed-off-by: David S. Miller <[email protected]>
