Git Repo - linux.git/log

sched: push TC filter protocol creation into a separate function

Make the long function tc_ctl_tfilter a little bit shorter and easier to
read. Also make the creation of filter proto symmetric to destruction.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_api

Creation is done in this file, move destruction to be at the same place.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sched: rename tcf_destroy to tcf_destroy_proto

This function destroys TC filter protocol, not TC filter. So name it
accordingly.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mlxsw-identical-routes-handling'

Jiri Pirko says:

====================
mlxsw: Identical routes handling

Ido says:

The kernel can store several FIB aliases that share the same prefix and
length. These aliases can differ in other parameters such as TOS and
metric, which are taken into account during lookup.

Offloading devices might not have the same flexibility, allowing only a
single route with the same prefix and length to be reflected. mlxsw is
one such device.

This patchset aims to correctly handle this situation in the mlxsw
driver. The first four patches introduce small changes in the IPv4 FIB
code, so that listeners of the FIB notification chain will be able to
correctly handle identical routes.

The last three patches build on top of previous work and introduce the
necessary changes in the mlxsw driver. The biggest change is the
introduction of a FIB node, where identical routes are chained, instead
of a primitive reference counting. This is explained in detail in the
fifth patch.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Add support for route replace

Upon the reception of an ENTRY_REPLACE notification, resolve the FIB
node corresponding to the prefix and length and insert the new route
before the first matching entry.

Since the notification also signals the deletion of the replaced route,
delete it from the driver's cache.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Add support for route append

When a new route is appended, it's placed after existing routes sharing
the same parameters (prefix, length, table ID, TOS and priority).

While the device supports only one route with the same prefix and length
in a single table, it's important to correctly place the appended route
in the driver's cache, as when a route is deleted the next one is
programmed into the device.

Following the reception of an ENTRY_APPEND notification, resolve the
FIB node corresponding to the prefix and length and correctly place the
new entry in its entry list.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Correctly handle identical routes

In the device, routes are indexed in a routing table based on the prefix
and its length. This is in contrast to the kernel's FIB where several
FIB aliases can exist with these parameters being identical. In such
cases, the routes will be sorted by table ID (LOCAL first, then MAIN),
TOS and finally priority (metric).

During lookup, these routes will be evaluated in order. In case the
packet's TOS field is non-zero and a FIB alias with a matching TOS is
found, then it's selected. Otherwise, the lookup defaults to the route
with TOS 0 (if it exists). However, if the requested scope is narrower
than the one found, then the lookup continues.

To best reflect the kernel's datapath we should take the above into
account. Given a prefix and its length, the reflected route will always
be the first one in the FIB alias list. However, if the route has a
non-zero TOS then its action will be converted to trap instead of
forward, since we currently don't support TOS-based routing. If this
turns out to be a real issue, we can add support for that using
policy-based switching.

The route's scope can be effectively ignored as any packet being routed
by the device would've been looked-up using the widest scope (UNIVERSE).

To achieve that we need to do two changes. Firstly, we need to create
another struct (FIB node) that will hold the list of FIB entries sharing
the same prefix and length. This struct will be hashed using these two
parameters.

Secondly, we need to change the route reflection to match the above
logic, so that the first FIB entry in the list will be programmed into
the device while the rest will remain in the driver's cache in case of
subsequent changes.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: fib: Add events for FIB replace and append

The FIB notification chain currently uses the NLM_F_{REPLACE,APPEND}
flags to signal routes being replaced or appended.

Instead of using netlink flags for in-kernel notifications we can simply
introduce two new events in the FIB notification chain. This has the
added advantage of making the API cleaner, thereby making it clear that
these events should be supported by listeners of the notification chain.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
CC: Patrick McHardy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: fib: Send notification before deleting FIB alias

When a FIB alias is replaced following NLM_F_REPLACE, the ENTRY_ADD
notification is sent after the reference on the previous FIB info was
dropped. This is problematic as potential listeners might need to access
it in their notification blocks.

Solve this by sending the notification prior to the deletion of the
replaced FIB alias. This is consistent with ENTRY_DEL notifications.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
CC: Patrick McHardy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: fib: Send deletion notification with actual FIB alias type

When a FIB alias is removed, a notification is sent using the type
passed from user space - can be RTN_UNSPEC - instead of the actual type
of the removed alias. This is problematic for listeners of the FIB
notification chain, as several FIB aliases can exist with matching
parameters, but the type.

Solve this by passing the actual type of the removed FIB alias.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
CC: Patrick McHardy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: fib: Only flush FIB aliases belonging to currently flushed table

In case the MAIN table is flushed and its trie is shared with the LOCAL
table, then we might be flushing FIB aliases belonging to the latter.
This can lead to FIB_ENTRY_DEL notifications sent with the wrong table
ID.

The above doesn't affect current listeners, as the table ID is ignored
during entry deletion, but this will change later in the patchset.

When flushing a particular table, skip any aliases belonging to a
different one.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Patrick McHardy <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'openvswitch-Conntrack-integration-improvements'

openvswitch: Pack struct sw_flow_key.

struct sw_flow_key has two 16-bit holes. Move the most matched
conntrack match fields there. In some typical cases this reduces the
size of the key that needs to be hashed into half and into one cache
line.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Add force commit.

Stateful network admission policy may allow connections to one
direction and reject connections initiated in the other direction.
After policy change it is possible that for a new connection an
overlapping conntrack entry already exists, where the original
direction of the existing connection is opposed to the new
connection's initial packet.

Most importantly, conntrack state relating to the current packet gets
the "reply" designation based on whether the original direction tuple
or the reply direction tuple matched. If this "directionality" is
wrong w.r.t. to the stateful network admission policy it may happen
that packets in neither direction are correctly admitted.

This patch adds a new "force commit" option to the OVS conntrack
action that checks the original direction of an existing conntrack
entry. If that direction is opposed to the current packet, the
existing conntrack entry is deleted and a new one is subsequently
created in the correct direction.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Add original direction conntrack tuple to sw_flow_key.

Add the fields of the conntrack original direction 5-tuple to struct
sw_flow_key.  The new fields are initially marked as non-existent, and
are populated whenever a conntrack action is executed and either finds
or generates a conntrack entry.  This means that these fields exist
for all packets that were not rejected by conntrack as untrackable.

The original tuple fields in the sw_flow_key are filled from the
original direction tuple of the conntrack entry relating to the
current packet, or from the original direction tuple of the master
conntrack entry, if the current conntrack entry has a master.
Generally, expected connections of connections having an assigned
helper (e.g., FTP), have a master conntrack entry.

The main purpose of the new conntrack original tuple fields is to
allow matching on them for policy decision purposes, with the premise
that the admissibility of tracked connections reply packets (as well
as original direction packets), and both direction packets of any
related connections may be based on ACL rules applying to the master
connection's original direction 5-tuple.  This also makes it easier to
make policy decisions when the actual packet headers might have been
transformed by NAT, as the original direction 5-tuple represents the
packet headers before any such transformation.

When using the original direction 5-tuple the admissibility of return
and/or related packets need not be based on the mere existence of a
conntrack entry, allowing separation of admission policy from the
established conntrack state.  While existence of a conntrack entry is
required for admission of the return or related packets, policy
changes can render connections that were initially admitted to be
rejected or dropped afterwards.  If the admission of the return and
related packets was based on mere conntrack state (e.g., connection
being in an established state), a policy change that would make the
connection rejected or dropped would need to find and delete all
conntrack entries affected by such a change.  When using the original
direction 5-tuple matching the affected conntrack entries can be
allowed to time out instead, as the established state of the
connection would not need to be the basis for packet admission any
more.

It should be noted that the directionality of related connections may
be the same or different than that of the master connection, and
neither the original direction 5-tuple nor the conntrack state bits
carry this information.  If needed, the directionality of the master
connection can be stored in master's conntrack mark or labels, which
are automatically inherited by the expected related connections.

The fact that neither ARP nor ND packets are trackable by conntrack
allows mutual exclusion between ARP/ND and the new conntrack original
tuple fields.  Hence, the IP addresses are overlaid in union with ARP
and ND fields.  This allows the sw_flow_key to not grow much due to
this patch, but it also means that we must be careful to never use the
new key fields with ARP or ND packets.  ARP is easy to distinguish and
keep mutually exclusive based on the ethernet type, but ND being an
ICMPv6 protocol requires a bit more attention.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Inherit master's labels.

We avoid calling into nf_conntrack_in() for expected connections, as
that would remove the expectation that we want to stick around until
we are ready to commit the connection.  Instead, we do a lookup in the
expectation table directly.  However, after a successful expectation
lookup we have set the flow key label field from the master
connection, whereas nf_conntrack_in() does not do this.  This leads to
master's labels being inherited after an expectation lookup, but those
labels not being inherited after the corresponding conntrack action
with a commit flag.

This patch resolves the problem by changing the commit code path to
also inherit the master's labels to the expected connection.
Resolving this conflict in favor of inheriting the labels allows more
information be passed from the master connection to related
connections, which would otherwise be much harder if the 32 bits in
the connmark are not enough.  Labels can still be set explicitly, so
this change only affects the default values of the labels in presense
of a master connection.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Refactor labels initialization.

Refactoring conntrack labels initialization makes changes in later
patches easier to review.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Simplify labels length logic.

Since 23014011ba42 ("netfilter: conntrack: support a fixed size of 128
distinct labels"), the size of conntrack labels extension has fixed to
128 bits, so we do not need to check for labels sizes shorter than 128
at run-time. This patch simplifies labels length logic accordingly,
but allows the conntrack labels size to be increased in the future
without breaking the build. In the event of conntrack labels
increasing in size OVS would still be able to deal with the 128 first
label bits.

Suggested-by: Joe Stringer <[email protected]>
Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Unionize ovs_key_ct_label with a u32 array.

Make the array of labels in struct ovs_key_ct_label an union, adding a
u32 array of the same byte size as the existing u8 array. It is
faster to loop through the labels 32 bits at the time, which is also
the alignment of netlink attributes.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Do not trigger events for unconfirmed connections.

Receiving change events before the 'new' event for the connection has
been received can be confusing. Avoid triggering change events for
setting conntrack mark or labels before the conntrack entry has been
confirmed.

Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark")
Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label")
Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.

The conntrack lookup for existing connections fails to invert the
packet 5-tuple for NATted packets, and therefore fails to find the
existing conntrack entry. Conntrack only stores 5-tuples for incoming
packets, and there are various situations where a lookup on a packet
that has already been transformed by NAT needs to be made. Looking up
an existing conntrack entry upon executing packet received from the
userspace is one of them.

This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple
for the conntrack lookup whenever the packet has already been
transformed by conntrack from its input form as evidenced by one of
the NAT flags being set in the conntrack state metadata.

Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Fix comments for skb->_nfct

Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that
they are combined into '_nfct'.

Signed-off-by: Jarno Rajahalme <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ena-bug-fixes'

Netanel Belgazal says:

====================
Bug Fixes in ENA driver

Changes from V3:
* Rebase patchset to master and solve merge conflicts.
* Remove redundant bug fix (fix error handling when probe fails)
====================

Signed-off-by: David S. Miller <[email protected]>

net/ena: update driver version to 1.1.2

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: change condition for host attribute configuration

Move the host info config to be the first admin command that is executed.
This change require the driver to remove the 'feature check'
from host info configuration flow.
The check is removed since the supported features bitmask field
is retrieved only after calling ENA_ADMIN_DEVICE_ATTRIBUTES admin command.

If set host info is not supported an error will be returned by the device.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: change driver's default timeouts

The timeouts were too agressive and sometimes cause false alarms.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: reduce the severity of ena printouts

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: use READ_ONCE to access completion descriptors

Completion descriptors are accessed from the driver and from the device.
To avoid reading the old value, use READ_ONCE macro.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: use napi_complete_done() return value

Do not unamsk interrupts if we are in busy poll mode.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: fix potential access to freed memory during device reset

If the ena driver detects that the device is not behave as expected,
it tries to reset the device.
The reset flow calls ena_down, which will frees all the resources
the driver allocates and then it will reset the device.

This flow can cause memory corruption if the device is still writes
to the driver's memory space.
To overcome this potential race, move the reset before the device
resources are freed.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: refactor ena_get_stats64 to be atomic context safe

ndo_get_stat64() can be called from atomic context, but the current
implementation sends an admin command to retrieve the statistics from
the device. This admin command can sleep.

This patch re-factors the implementation of ena_get_stats64() to use
the {rx,tx}bytes/count from the driver's inner counters, and to obtain
the rx drop counter from the asynchronous keep alive (heart bit)
event.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: fix NULL dereference when removing the driver after device reset failed

If for some reason the device stops responding, and the device reset
failes to recover the device, the mmio register read data structure
will not be reinitialized.

On driver removal, the driver will also try to reset the device, but
this time the mmio data structure will be NULL.

To solve this issue, perform the device reset in the remove function
only if the device is runnig.

Crash log
   54.240382] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   54.244186] IP: [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[   54.244186] PGD 0
[   54.244186] Oops: 0002 [#1] SMP
[   54.244186] Modules linked in: ena_drv(OE-) snd_hda_codec_generic kvm_intel kvm crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_intel aes_x86_64 snd_hda_controller lrw gf128mul cirrus glue_helper ablk_helper ttm snd_hda_codec drm_kms_helper cryptd snd_hwdep drm snd_pcm pvpanic snd_timer syscopyarea sysfillrect snd parport_pc sysimgblt serio_raw soundcore i2c_piix4 mac_hid lp parport psmouse floppy
[   54.244186] CPU: 5 PID: 1841 Comm: rmmod Tainted: G           OE 3.16.0-031600-generic #201408031935
[   54.244186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   54.244186] task: ffff880135852880 ti: ffff8800bb640000 task.ti: ffff8800bb640000
[   54.244186] RIP: 0010:[<ffffffffc067de5a>]  [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[   54.244186] RSP: 0018:ffff8800bb643d50  EFLAGS: 00010083
[   54.244186] RAX: 000000000000deb0 RBX: 0000000000030d40 RCX: 0000000000000003
[   54.244186] RDX: 0000000000000202 RSI: 0000000000000058 RDI: ffffc90000775104
[   54.244186] RBP: ffff8800bb643d88 R08: 0000000000000000 R09: cf00000000000000
[   54.244186] R10: 0000000fffffffe0 R11: 0000000000000001 R12: 0000000000000000
[   54.244186] R13: ffffc90000765000 R14: ffffc90000775104 R15: 00007fca1fa98090
[   54.244186] FS:  00007fca1f1bd740(0000) GS:ffff88013fd40000(0000) knlGS:0000000000000000
[   54.244186] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.244186] CR2: 0000000000000000 CR3: 00000000b9cf6000 CR4: 00000000001406e0
[   54.244186] Stack:
[   54.244186]  0000000000000202 0000005800000286 ffffc90000765000 ffffc90000765000
[   54.244186]  ffff880135f6b000 ffff8800b9360000 00007fca1fa98090 ffff8800bb643db8
[   54.244186]  ffffffffc0680b3d ffff8800b93608c0 ffffc90000765000 ffff880135f6b000
[   54.244186] Call Trace:
[   54.244186]  [<ffffffffc0680b3d>] ena_com_dev_reset+0x1d/0x1b0 [ena_drv]
[   54.244186]  [<ffffffffc0678497>] ena_remove+0xa7/0x130 [ena_drv]
[   54.244186]  [<ffffffff813d4df6>] pci_device_remove+0x46/0xc0
[   54.244186]  [<ffffffff814c3b7f>] __device_release_driver+0x7f/0xf0
[   54.244186]  [<ffffffff814c4738>] driver_detach+0xc8/0xd0
[   54.244186]  [<ffffffff814c3969>] bus_remove_driver+0x59/0xd0
[   54.244186]  [<ffffffff814c4fde>] driver_unregister+0x2e/0x60
[   54.244186]  [<ffffffff810f0a80>] ? show_refcnt+0x40/0x40
[   54.244186]  [<ffffffff813d4ec3>] pci_unregister_driver+0x23/0xa0
[   54.244186]  [<ffffffffc068413f>] ena_cleanup+0x10/0xed1 [ena_drv]
[   54.244186]  [<ffffffff810f3a47>] SyS_delete_module+0x157/0x1e0
[   54.244186]  [<ffffffff81014fb7>] ? do_notify_resume+0xc7/0xd0
[   54.244186]  [<ffffffff81793fad>] system_call_fastpath+0x1a/0x1f
[   54.244186] Code: c3 4d 8d b5 04 01 01 00 4c 89 f7 e8 e1 5a 11 c1 48 89 45 c8 41 0f b7 85 00 01 01 00 8d 48 01 66 2d 52 21 66 41 89 8d 00 01 01 00 <66> 41 89 04 24 0f b7 45 d4 89 45 d0 89 c1 41 0f b7 85 00 01 01
[   54.244186] RIP  [<ffffffffc067de5a>] ena_com_reg_bar_read32+0x8a/0x180 [ena_drv]
[   54.244186]  RSP <ffff8800bb643d50>
[   54.244186] CR2: 0000000000000000
[   54.244186] ---[ end trace 18dd9889b6497810 ]---

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: fix RSS default hash configuration

ENA default hash configures IPv4_frag hash twice instead of
configure non-IP packets.

The bug caused IPv4 fragmented packets to be calculated based on
L2 source and destination address instead of L3 source and destination.
IPv4 packets can reach to the wrong Rx queue.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: fix ethtool RSS flow configuration

ena_flow_data_to_flow_hash and ena_flow_hash_to_flow_type
treat the ena_flow_hash_to_flow_type enum as power of two values.

Change the values of ena_admin_flow_hash_fields to be power of two values.

This bug effect the ethtool set/get rxnfc.
ethtool will report wrong values hash fields for get and will
configure wrong hash fields in set.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: fix queues number calculation

The ENA driver tries to open a queue per vCPU.
To determine how many vCPUs the instance have it uses num_possible_cpus()
while it should have use num_online_cpus() instead.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ena: remove ntuple filter support from device feature list

Remove NETIF_F_NTUPLE from netdev->features.
The ENA device driver does not support ntuple filtering.

Signed-off-by: Netanel Belgazal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfsd: Revert "nfsd: special case truncates some more"

This patch incorrectly attempted nested mnt_want_write, and incorrectly
disabled nfsd's owner override for truncate. We'll fix those problems
and make another attempt soon, for the moment I think the safest is to
revert.

Signed-off-by: J. Bruce Fields <[email protected]>

Merge tag 'drm-fixes-for-v4.10-rc8' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"This should be the final set of drm fixes for 4.10: one vmwgfx boot
  fix, one vc4 fix, and a few i915 fixes:

* tag 'drm-fixes-for-v4.10-rc8' of git://people.freedesktop.org/~airlied/linux:
  drm: vc4: adapt to new behaviour of drm_crtc.c
  drm/i915: Always convert incoming exec offsets to non-canonical
  drm/i915: Remove overzealous fence warn on runtime suspend
  drm/i915/bxt: Add MST support when do DPLL calculation
  drm/i915: don't warn about Skylake CPU - KabyPoint PCH combo
  drm/i915: fix i915 running as dom0 under Xen
  drm/i915: Flush untouched framebuffers before display on !llc
  drm/i915: fix use-after-free in page_flip_completed()
  drm/vmwgfx: Fix depth input into drm_mode_legacy_fb_format

Merge tag 'drm-intel-fixes-2017-02-09' of git://anongit.freedesktop.org/git/drm-intel into drm-fixes

Hopefully final fixes for v4.10, about half of them stable material.

* tag 'drm-intel-fixes-2017-02-09' of git://anongit.freedesktop.org/git/drm-intel:
  drm/i915: Always convert incoming exec offsets to non-canonical
  drm/i915: Remove overzealous fence warn on runtime suspend
  drm/i915/bxt: Add MST support when do DPLL calculation
  drm/i915: don't warn about Skylake CPU - KabyPoint PCH combo
  drm/i915: fix i915 running as dom0 under Xen
  drm/i915: Flush untouched framebuffers before display on !llc
  drm/i915: fix use-after-free in page_flip_completed()

Merge tag 'drm-misc-fixes-2017-02-09' of git://anongit.freedesktop.org/git/drm-misc into drm-fixes

Last-minute vc4 fix for 4.10.

* tag 'drm-misc-fixes-2017-02-09' of git://anongit.freedesktop.org/git/drm-misc:
drm: vc4: adapt to new behaviour of drm_crtc.c

Merge branch 'enic-vxlan-offload'

Govindarajulu Varadarajan says:

====================
enic: add vxlan offload support

This series adds vxlan offload support for enic driver. The first
patch adds vxlan devcmd for configuring vxland offload parameters.
Second patch adds ndo_udp_tunnel_add/del and offload on rx path.
There are to modes in which fw supports vxlan offload.

mode 0: fcoe bit is set for encapsulated packet. fcoe_fc_crc_ok is set
if checksum of csum is ok. This bit is or of ip_csum_ok and
tcp_udp_csum_ok

mode 2: BIT(0) in rss_hash is set if it is encapsulated packet.
        BIT(1) is set if outer_ip_csum_ok/
        BIT(2) is set if outer_tcp_csum_ok

Some hw supports only mode 0, some support mode 0 and 2. Driver gets
the supported modes bitmap using get_supported_feature_ver devcmd
and selects the highest mode both driver and fw supports.

Third patch adds offload support on tx path by adding
enic_features_check().

v2: Order local variable declarations from longest to shortest line,
    on all three patches.
====================

Signed-off-by: David S. Miller <[email protected]>

enic: add vxlan offload on tx path

Define ndo_features_check. Hw supports offload only for ipv4 inner and
ipv4 outer pkt.

Code refactor for setting inner tcp pseudo csum.

Signed-off-by: Govindarajulu Varadarajan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

enic: add udp_tunnel ndo for vxlan offload

Defines enic_udp_tunnel_add/del for configuring vxlan tunnel offload.
enic supports offload of only one ipv4/udp port.

There are two modes that fw supports for vxlan offload.

mode 0: fcoe bit is set for encapsulated packet. fcoe_fc_crc_ok is set
if checksum of csum is ok. This bit is or of ip_csum_ok and
tcp_udp_csum_ok

mode 2: BIT(0) in rss_hash is set if it is encapsulated packet.
BIT(1) is set if outer_ip_csum_ok/
BIT(2) is set if outer_tcp_csum_ok

tcp_udp_csum_ok/ipv4_csum_ok is set if inner csum is OK.

Signed-off-by: Govindarajulu Varadarajan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

enic: add devcmds for vxlan offload

This patch adds devcmds needed for vxlan offload. Implement 3 new devcmd

overlay_offload_ctrl: enable/disable offload
overlay_offload_cfg: update offload udp port number
get_supported_feature_ver: get hw supported offload version. Each
version has different bitmap for csum_ok/encap

Signed-off-by: Govindarajulu Varadarajan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: Move forward declaration to where it is needed

Move it out from the middle for the #defines to just before it is
needed.

Signed-off-by: Andrew Lunn <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: Fix duplicate object rule

While adding switch.o to the list of DSA object files, we essentially
duplicated the previous obj-y line and just added switch.o, remove the
duplicate.

Fixes: f515f192ab4f ("net: dsa: add switch notifier")
Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Initialize mdio clock at probe function

USB PHYs need the MDIO clock divisor enabled earlier to work.
Initialize mdio clock divisor in probe function. The ext bus
bit available in the same register will be used by mdio mux
to enable external mdio.

Signed-off-by: Yendapally Reddy Dhananjaya Reddy <[email protected]>
Fixes: ddc24ae1 ("net: phy: Broadcom iProc MDIO bus driver")
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Jon Mason <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qcom-emac-more-ethtool'

Timur Tabi says:

====================
net: qcom/emac: add the last ethtool functions

These two patches implement the remaining two ethtool functions that
are of interest to the Qualcomm EMAC driver. These are the last
patches that will be submitted for the 4.11 merge window.
====================

Signed-off-by: David S. Miller <[email protected]>

net: qcom/emac: add ethtool support for setting ring parameters

Implement the set_ringparam method, which allows the user to specify
the size of the TX and RX descriptor rings. The values are constrained
to the limits of the hardware.

Since the driver does not use separate queues for mini or jumbo frames,
attempts to set those values are rejected.

If the interface is already running when the setting is changed, then
the interface is reset.

Signed-off-by: Timur Tabi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: qcom/emac: add ethtool support for reading hardware registers

Implement the get_regs_len and get_regs ethtool methods. The driver
returns the values of selected hardware registers.

The make the register offsets known to emac_ethtool, the the register
offset macros are all combined into one header file. They were
inexplicably and arbitrarily split between two files.

Signed-off-by: Timur Tabi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ARM: orion: remove unused wnr854t_switch_plat_data

The other instances of this structure got removed along with the MDIO
device change, but this one was left behind and needs to be removed
as well:

arch/arm/mach-orion5x/wnr854t-setup.c:109:44: error: 'wnr854t_switch_plat_data' defined but not used [-Werror=unused-variable]
static struct dsa_platform_data __initdata wnr854t_switch_plat_data = {

Fixes: 575e93f7b5e6 ("ARM: orion: Register DSA switch as a MDIO device")
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'sctp-sender-stream-reconf-reset-add-streams'

Xin Long says:

====================
sctp: add sender-side procedures for stream reconf asoc reset and add streams

Patch 4/6 is to implement sender-side procedures for the SSN/TSN Reset
Request Parameter described in rfc6525 section 5.1.4, patch 3/6 is
ahead of it to define a function to make the request chunk for it.

Patch 6/6 is to implement sender-side procedures for the Add Incoming
and Outgoing Streams Request Parameter Request Parameter described in
rfc6525 section 5.1.5 and 5.1.6, patch 5/6 is ahead of it to define a
function to make the request chunk for it.

Patch 2/6 is a fix to recover streams states when it fails to send
request and Patch 1/6 is to drop some unncessary __packed from some
old structures.

v1->v2:
  - put these into a smaller group.
  - rename some temporary variables in the codes.
  - rename the titles of the commits and improve some changelogs.
v2->v3:
  - re-split the patchset and make sure it has no dead codes for review.
  - move some codes into stream.c from socket.c.
v3->v4:
  - add one more patch to fix a send reset stream request issue.
  - doing actual work only when request is sent successfully.
  - reduce some indents in sctp_send_add_streams.
v4->v5:
  - close streams before sending request and recover them when sending
    fails in patch 1/5 and patch 3/5
v5->v6:
  - add patch 1/6 to drop some unncessary __packed from some old structures.
  - remove __packed from some new structures in patch 3/6 and 5/6.
  - define unsigned int outcnt and incnt to make codes smaller in patch 6/6.
  - use krealloc instead of kcalloc and remove ksize check in patch 6/6, as
    ksize check is acutally used in krealloc already.
====================

Signed-off-by: David S. Miller <[email protected]>

sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter

This patch is to implement Sender-Side Procedures for the Add
Outgoing and Incoming Streams Request Parameter described in
rfc6525 section 5.1.5-5.1.6.

It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section
6.3.4 for users.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: add support for generating stream reconf add incoming/outgoing streams request chunk

This patch is to define Add Incoming/Outgoing Streams Request
Parameter described in rfc6525 section 4.5 and 4.6. They can
be in one same chunk trunk as rfc6525 section 3.1-7 describes,
so make them in one function.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: implement sender-side procedures for SSN/TSN Reset Request Parameter

This patch is to implement Sender-Side Procedures for the SSN/TSN
Reset Request Parameter descibed in rfc6525 section 5.1.4.

It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3
for users.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: add support for generating stream reconf ssn/tsn reset request chunk

This patch is to define SSN/TSN Reset Request Parameter described
in rfc6525 section 4.3.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: streams should be recovered when it fails to send request.

Now when sending stream reset request, it closes the streams to
block further xmit of data until this request is completed, then
calls sctp_send_reconf to send the chunk.

But if sctp_send_reconf returns err, and it doesn't recover the
streams' states back, which means the request chunk would not be
queued and sent, so the asoc will get stuck, streams are closed
and no packet is even queued.

This patch is to fix it by recovering the streams' states when
it fails to send the request, it is also to fix a return value.

Fixes: 7f9d68ac944e ("sctp: implement sender-side procedures for SSN Reset Request Parameter")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: drop unnecessary __packed from some stream reconf structures

commit 85c727b59483 ("sctp: drop __packed from almost all SCTP structures")
has removed __packed from almost all SCTP structures. But there still are
three structures where it should be dropped.

This patch is to remove it from some stream reconf structures.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'sfc-more-encap-offloads'

Edward Cree says:

====================
sfc: more encap offloads

This patch series adds support for RX checksum offload of encapsulated packets.
It also adds support for configuring the hardware's lists of UDP ports used for
VXLAN and GENEVE encapsulation offloads. Since changing these lists causes the
MC to reboot, the driver has been hardened against reboots, which used to be
considered an exceptional occurrence but are now normal.
====================

Signed-off-by: David S. Miller <[email protected]>

sfc: configure UDP tunnel offload ports

Implement ndo_udp_tunnel_{add,del} to update the NIC's list of VXLAN and
GENEVE UDP ports. Also reset the port list to empty on driver load and
on driver unload, with appropriate flag set on the unload case.
These port numbers are used for RX inner checksum offload, and in future
will also be used for TX inner checksum offload and encapsulated TSO.

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: update mcdi_pcol definitions for MC_CMD_SET_TUNNEL_ENCAP_UDP_PORTS

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: call mcdi_reboot_detected() when MC reboots during an MCDI command

This function wasn't being called in this particular case when the MC
reboots. This caused resource reallocations to not be handled properly
and often ended up disabling the interface.

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: harden driver against MC resets during initial probe

This is mainly to prepare for a future overlay networking patch that
could cause an MC reset at probe time if the UDP tunnel port list is
set immediately upon driver load.

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: set csum_level for encapsulated packets

Set the csum_level for encapsulated packets where the encapsulation
type, l3 class and l4 class are sets that need it.

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: process RX event inner checksum flags

Add support for RX checksum offload of encapsulated packets. This
essentially just means paying attention to the inner checksum flags
in the RX event, and if *either* checksum flag indicates a fail then
don't tell the kernel that checksum offload was successful.
Also, count these checksum errors and export the counts to ethtool -S.

Test the most common "good" case of RX events with a single bitmask
instead of a series of ifs. Move the more specific error checking
in to a separate function for clarity, and don't use unlikely() there
since we know at least one of the bits is bad.

Signed-off-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()

In function igmpv3/mld_add_delrec() we allocate pmc and put it in
idev->mc_tomb, so we should free it when we don't need it in del_delrec().
But I removed kfree(pmc) incorrectly in latest two patches. Now fix it.

Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...")
Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...")
Reported-by: Daniel Borkmann <[email protected]>
Signed-off-by: Hangbin Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

xen-netfront: Improve error handling during initialization

This fixes a crash when running out of grant refs when creating many
queues across many netdevs.

* If creating queues fails (i.e. there are no grant refs available),
call xenbus_dev_fatal() to ensure that the xenbus device is set to the
closed state.
* If no queues are created, don't call xennet_disconnect_backend as
netdev->real_num_tx_queues will not have been set correctly.
* If setup_netfront() fails, ensure that all the queues created are
cleaned up, not just those that have been set up.
* If any queues were set up and an error occurs, call
xennet_destroy_queues() to clean up the napi context.
* If any fatal error occurs, unregister and destroy the netdev to avoid
leaving around a half setup network device.

Signed-off-by: Ross Lagerwall <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'sierra_net-fixes'

Stefan Brüns says:

====================
Fixes for sierra_net driver

When trying to initiate a dual-stack (ipv4v6) connection, a MC7710, FW
version SWI9200X_03.05.24.00ap answers with an unsupported LSI. Add support
for this LSI.
Also the link_type should be ignored when going idle, otherwise the modem
is stuck in a bad link state.
Tested on MC7710, T-Mobile DE, APN internet.telekom, IPv4v6 PDP type. Both
IPv4 and IPv6 connections work.

v2: Do not overwrite protocol field in rx_fixup
v3: Remove leftover struct ethhdr *eth declaration
====================

Signed-off-by: David S. Miller <[email protected]>

sierra_net: Skip validating irrelevant fields for IDLE LSIs

When the context is deactivated, the link_type is set to 0xff, which
triggers a warning message, and results in a wrong link status, as
the LSI is ignored.

Signed-off-by: Stefan Brüns <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications

If a context is configured as dualstack ("IPv4v6"), the modem indicates
the context activation with a slightly different indication message.
The dual-stack indication omits the link_type (IPv4/v6) and adds
additional address fields.
IPv6 LSIs are identical to IPv4 LSIs, but have a different link type.

Signed-off-by: Stefan Brüns <[email protected]>
Reviewed-by: Bjørn Mork <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

kcm: fix 0-length case for kcm_sendmsg()

Dmitry reported a kernel warning:

WARNING: CPU: 3 PID: 2936 at net/kcm/kcmsock.c:627
kcm_write_msgs+0x12e3/0x1b90 net/kcm/kcmsock.c:627
CPU: 3 PID: 2936 Comm: a.out Not tainted 4.10.0-rc6+ #209
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:15 [inline]
  dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
  panic+0x1fb/0x412 kernel/panic.c:179
  __warn+0x1c4/0x1e0 kernel/panic.c:539
  warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
  kcm_write_msgs+0x12e3/0x1b90 net/kcm/kcmsock.c:627
  kcm_sendmsg+0x163a/0x2200 net/kcm/kcmsock.c:1029
  sock_sendmsg_nosec net/socket.c:635 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:645
  sock_write_iter+0x326/0x600 net/socket.c:848
  new_sync_write fs/read_write.c:499 [inline]
  __vfs_write+0x483/0x740 fs/read_write.c:512
  vfs_write+0x187/0x530 fs/read_write.c:560
  SYSC_write fs/read_write.c:607 [inline]
  SyS_write+0xfb/0x230 fs/read_write.c:599
  entry_SYSCALL_64_fastpath+0x1f/0xc2

when calling syscall(__NR_write, sock2, 0x208aaf27ul, 0x0ul) on a KCM
seqpacket socket. It appears that kcm_sendmsg() does not handle len==0
case correctly, which causes an empty skb is allocated and queued.
Fix this by skipping the skb allocation for len==0 case.

Reported-by: Dmitry Vyukov <[email protected]>
Cc: Tom Herbert <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

xen-netfront: Rework the fix for Rx stall during OOM and network stress

The commit 90c311b0eeea ("xen-netfront: Fix Rx stall during network
stress and OOM") caused the refill timer to be triggerred almost on
all invocations of xennet_alloc_rx_buffers for certain workloads.
This reworks the fix by reverting to the old behaviour and taking into
consideration the skb allocation failure. Refill timer is now triggered
on insufficient requests or skb allocation failure.

Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Fixes: 90c311b0eeea (xen-netfront: Fix Rx stall during network stress and OOM)
Reported-by: Boris Ostrovsky <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending

Pull SCSI target fixes from Nicholas Bellinger:
"This target series for v4.10 contains fixes which address a few
  long-standing bugs that DATERA's QA + automation teams have uncovered
  while putting v4.1.y target code into production usage.

  We've been running the top three in our nightly automated regression
  runs for the last two months, and the COMPARE_AND_WRITE fix Mr. Gary
  Guo has been manually verifying against a four node ESX cluster this
  past week.

  Note all of them have CC' stable tags.

  Summary:

   - Fix a bug with ESX EXTENDED_COPY + SAM_STAT_RESERVATION_CONFLICT
     status, where target_core_xcopy.c logic was incorrectly returning
     SAM_STAT_CHECK_CONDITION for all non SAM_STAT_GOOD cases (Nixon
     Vincent)

   - Fix a TMR LUN_RESET hung task bug while other in-flight TMRs are
     being aborted, before the new one had been dispatched into tmr_wq
     (Rob Millner)

   - Fix a long standing double free OOPs, where a dynamically generated
     'demo-mode' NodeACL has multiple sessions associated with it, and
     the /sys/kernel/config/target/$FABRIC/$WWN/ subsequently disables
     demo-mode, but never converts the dynamic ACL into a explicit ACL
     (Rob Millner)

   - Fix a long standing reference leak with ESX VAAI COMPARE_AND_WRITE
     when the second phase WRITE COMMIT command fails, resulting in
     CHECK_CONDITION response never being sent and se_cmd->cmd_kref
     never reaching zero (Gary Guo)

  Beyond these items on v4.1.y we've reproduced, fixed, and run through
  our regression test suite using iscsi-target exports, there are two
  additional outstanding list items:

   - Remove a >= v4.2 RCU conversion BUG_ON that would trigger when
     dynamic node NodeACLs where being converted to explicit NodeACLs.
     The patch drops the BUG_ON to follow how pre RCU conversion worked
     for this special case (Benjamin Estrabaud)

   - Add ibmvscsis target_core_fabric_ops->max_data_sg_nent assignment
     to match what IBM's Virtual SCSI hypervisor is already enforcing at
     transport layer. (Bryant Ly + Steven Royer)"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  ibmvscsis: Add SGL limit
  target: Fix COMPARE_AND_WRITE ref leak for non GOOD status
  target: Fix multi-session dynamic se_node_acl double free OOPs
  target: Fix early transport_generic_handle_tmr abort scenario
  target: Use correct SCSI status during EXTENDED_COPY exception
  target: Don't BUG_ON during NodeACL dynamic -> explicit conversion

net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()

The Generic PHY drivers gets assigned after we checked that the current
PHY driver is NULL, so we need to check a few things before we can
safely dereference d->driver. This would be causing a NULL deference to
occur when a system binds to the Generic PHY driver. Update
phy_attach_direct() to do the following:

- grab the driver module reference after we have assigned the Generic
  PHY drivers accordingly, and remember we came from the generic PHY
  path

- update the error path to clean up the module reference in case the
  Generic PHY probe function fails

- split the error path involving phy_detacht() to avoid double free/put
  since phy_detach() does all the clean up

- finally, have phy_detach() drop the module reference count before we
  call device_release_driver() for the Generic PHY driver case

Fixes: cafe8df8b9bc ("net: phy: Fix lack of reference count on PHY driver")
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'pstore-v4.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
"Fix pstore regression (boot Oops) when ftrace disabled, from Brian
Norris"

* tag 'pstore-v4.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: don't OOPS when there are no ftrace zones

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"A fix for a crash in uinput, and a fix for build errors when HID-RMI
  is built-in but SERIO is a module"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: synaptics-rmi4 - select 'SERIO' when needed
  Input: uinput - fix crash when mixing old and new init style

pstore: don't OOPS when there are no ftrace zones

We'll OOPS in ramoops_get_next_prz() if the platform didn't ask for any
ftrace zones (i.e., cxt->fprzs will be NULL). Let's just skip this
entire FTRACE section if there's no 'fprzs'.

Regression seen on a coreboot/depthcharge-based Chromebook.

Fixes: 2fbea82bbb89 ("pstore: Merge per-CPU ftrace records into one")
Cc: Joel Fernandes <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Brian Norris <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

Merge tag 'vfio-v4.10-final' of git://github.com/awilliam/linux-vfio

Pull VFIO fix from Alex Williamson:
"Fix regression in attaching groups to existing container for SPAPR
IOMMU backend (Alexey Kardashevskiy)"

* tag 'vfio-v4.10-final' of git://github.com/awilliam/linux-vfio:
vfio/spapr_tce: Set window when adding additional groups to container

Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:
"A couple more fixes for 4.10:

   - fix addressing the short regset write issue (Dave Martin)

   - fix for LPAE systems which leave a pending imprecise data abort
     before entering the kernel (Alexander Sverdlin)"

* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8643/3: arm/ptrace: Preserve previous registers for short regset write
  ARM: 8642/1: LPAE: catch pending imprecise abort on unmask

i2c: piix4: Request the SMBUS semaphore inside the mutex

SMBSLVCNT must be protected with the piix4_mutex_sb800 in order to avoid
multiple buses accessing to the semaphore at the same time.

Fixes: 701dc207bf55 ("i2c: piix4: Avoid race conditions with IMC")
Reported-by: Jean Delvare <[email protected]>
Signed-off-by: Ricardo Ribalda Delgado <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: piix4: Fix request_region size

Since '701dc207bf55 ("i2c: piix4: Avoid race conditions with IMC")' we
are using the SMBSLVCNT register at offset 0x8. We need to request it.

Fixes: 701dc207bf55 ("i2c: piix4: Avoid race conditions with IMC")
Signed-off-by: Ricardo Ribalda Delgado <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

mac80211: fix CSA in IBSS mode

Add the missing IBSS capability flag during capability init as it needs
to be inserted into the generated beacon in order for CSA to work.

Fixes: cd7760e62c2ac ("mac80211: add support for CSA in IBSS mode")
Signed-off-by: Piotr Gawlowicz <[email protected]>
Signed-off-by: Mikołaj Chwalisz <[email protected]>
Tested-by: Koen Vandeputte <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>

cfg80211: fix NAN bands definition

The nl80211_nan_dual_band_conf enumeration doesn't make much sense.
The default value is assigned to a bit, which makes it weird if the
default bit and other bits are set at the same time.

To improve this, get rid of NL80211_NAN_BAND_DEFAULT and add a wiphy
configuration to let the drivers define which bands are supported.
This is exposed to the userspace, which then can make a decision on
which band(s) to use. Additionally, rename all "dual_band" elements
to "bands", to make things clearer.

Signed-off-by: Luca Coelho <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>

ALSA: hda - adding a new NV HDMI/DP codec ID in the driver

Without this change, the HDMI/DP codec will be recognised as a
generic codec, and there is no sound when playing through this codec.

As suggested by NVidia side, after adding the new ID in the driver,
the sound playing works well.

Cc: <[email protected]>
Signed-off-by: Hui Wang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

powerpc/powernv: Properly set "host-ipi" on IPIs

Otherwise KVM will fail to pass them through to the host

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/powernv: Fix CPU hotplug to handle waking on HVI

The IPIs come in as HVI not EE, so we need to test the appropriate
SRR1 bits. The encoding is such that it won't have false positives
on P7 and P8 so we can just test it like that. We also need to handle
the icp-opal variant of the flush.

Fixes: d74361881f0d ("powerpc/xics: Add ICP OPAL backend")
Cc: [email protected] # v4.8+
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/mm/radix: Update ERAT flushes when invalidating TLB

Three tiny changes to the ERAT flushing logic: First don't make
it depend on DD1. It hasn't been decided yet but we might run
DD2 in a mode that also requires explicit flushes for performance
reasons so make it unconditional. We also add a missing isync, and
finally remove the flush from _tlbiel_va as it is only necessary
for congruence-class invalidations (PID, LPID and full TLB), not
targetted invalidations.

Fixes: 96ed1fe511a8 ("powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1")
Cc: [email protected] # v4.9+
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

Revert "x86/ioapic: Restore IO-APIC irq_chip retrigger callback"

This reverts commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de.

Gabriel C reports that it causes his machine to not boot, and we haven't
tracked down the reason for it yet.  Since the bug it fixes has been
around for a longish time, we're better off reverting the fix for now.

Gabriel says:
"It hangs early and freezes with a lot RCU warnings.

  I bisected it down to :

  > Ruslan Ruslichenko (1):
  >       x86/ioapic: Restore IO-APIC irq_chip retrigger callback

  Reverting this one fixes the problem for me..

  The box is a PRIMERGY TX200 S5 , 2 socket , 2 x E5520 CPU(s) installed"

and Ruslan and Thomas are currently stumped.

Reported-and-bisected-by: Gabriel C <[email protected]>
Cc: Ruslan Ruslichenko <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected] # for the backport of the original commit
Signed-off-by: Linus Torvalds <[email protected]>

Revert "hwrng: core - zeroize buffers with random data"

This reverts commit 2cc751545854d7bd7eedf4d7e377bb52e176cd07.

With this commit in place I get on a Cavium ThunderX (arm64) system:

$ if=/dev/hwrng bs=256 count=1 | od -t x1 -A x -v > rng-bad.txt
1+0 records in
1+0 records out
256 bytes (256 B) copied, 9.1171e-05 s, 2.8 MB/s
$ dd if=/dev/hwrng bs=256 count=1 | od -t x1 -A x -v >> rng-bad.txt
1+0 records in
1+0 records out
256 bytes (256 B) copied, 9.6141e-05 s, 2.7 MB/s
$ cat rng-bad.txt
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000050 00 00 00 00 37 20 46 ae d0 fc 1c 55 25 6e b0 b8
000060 7c 7e d7 d4 00 0f 6f b2 91 1e 30 a8 fa 3e 52 0e
000070 06 2d 53 30 be a1 20 0f aa 56 6e 0e 44 6e f4 35
000080 b7 6a fe d2 52 70 7e 58 56 02 41 ea d1 9c 6a 6a
000090 d1 bd d8 4c da 35 45 ef 89 55 fc 59 d5 cd 57 ba
0000a0 4e 3e 02 1c 12 76 43 37 23 e1 9f 7a 9f 9e 99 24
0000b0 47 b2 de e3 79 85 f6 55 7e ad 76 13 4f a0 b5 41
0000c0 c6 92 42 01 d9 12 de 8f b4 7b 6e ae d7 24 fc 65
0000d0 4d af 0a aa 36 d9 17 8d 0e 8b 7a 3b b6 5f 96 47
0000e0 46 f7 d8 ce 0b e8 3e c6 13 a6 2c b6 d6 cc 17 26
0000f0 e3 c3 17 8e 9e 45 56 1e 41 ef 29 1a a8 65 c8 3a
000100
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000050 00 00 00 00 f4 90 65 aa 8b f2 5e 31 01 53 b4 d4
000060 06 c0 23 a2 99 3d 01 e4 b0 c1 b1 55 0f 80 63 cf
000070 33 24 d8 3a 1d 5e cd 2c ba c0 d0 18 6f bc 97 46
000080 1e 19 51 b1 90 15 af 80 5e d1 08 0d eb b0 6c ab
000090 6a b4 fe 62 37 c5 e1 ee 93 c3 58 78 91 2a d5 23
0000a0 63 50 eb 1f 3b 84 35 18 cf b2 a4 b8 46 69 9e cf
0000b0 0c 95 af 03 51 45 a8 42 f1 64 c9 55 fc 69 76 63
0000c0 98 9d 82 fa 76 85 24 da 80 07 29 fe 4e 76 0c 61
0000d0 ff 23 94 4f c8 5c ce 0b 50 e8 31 bc 9d ce f4 ca
0000e0 be ca 28 da e6 fa cc 64 1c ec a8 41 db fe 42 bd
0000f0 a0 e2 4b 32 b4 52 ba 03 70 8e c1 8e d0 50 3a c6
000100

To my untrained mental entropy detector, the first several bytes of
each read from /dev/hwrng seem to not be very random (i.e. all zero).

When I revert the patch (apply this patch), I get back to what we have
in v4.9, which looks like (much more random appearing):

$ dd if=/dev/hwrng bs=256 count=1 | od -t x1 -A x -v > rng-good.txt
1+0 records in
1+0 records out
256 bytes (256 B) copied, 0.000252233 s, 1.0 MB/s
$ dd if=/dev/hwrng bs=256 count=1 | od -t x1 -A x -v >> rng-good.txt
1+0 records in
1+0 records out
256 bytes (256 B) copied, 0.000113571 s, 2.3 MB/s
$ cat rng-good.txt
000000 75 d1 2d 19 68 1f d2 26 a1 49 22 61 66 e8 09 e5
000010 e0 4e 10 d0 1a 2c 45 5d 59 04 79 8e e2 b7 2c 2e
000020 e8 ad da 34 d5 56 51 3d 58 29 c7 7a 8e ed 22 67
000030 f9 25 b9 fb c6 b7 9c 35 1f 84 21 35 c1 1d 48 34
000040 45 7c f6 f1 57 63 1a 88 38 e8 81 f0 a9 63 ad 0e
000050 be 5d 3e 74 2e 4e cb 36 c2 01 a8 14 e1 38 e1 bb
000060 23 79 09 56 77 19 ff 98 e8 44 f3 27 eb 6e 0a cb
000070 c9 36 e3 2a 96 13 07 a0 90 3f 3b bd 1d 04 1d 67
000080 be 33 14 f8 02 c2 a4 02 ab 8b 5b 74 86 17 f0 5e
000090 a1 d7 aa ef a6 21 7b 93 d1 85 86 eb 4e 8c d0 4c
0000a0 56 ac e4 45 27 44 84 9f 71 db 36 b9 f7 47 d7 b3
0000b0 f2 9c 62 41 a3 46 2b 5b e3 80 63 a4 35 b5 3c f4
0000c0 bc 1e 3a ad e4 59 4a 98 6c e8 8d ff 1b 16 f8 52
0000d0 05 5c 2f 52 2a 0f 45 5b 51 fb 93 97 a4 49 4f 06
0000e0 f3 a0 d1 1e ba 3d ed a7 60 8f bb 84 2c 21 94 2d
0000f0 b3 66 a6 61 1e 58 30 24 85 f8 c8 18 c3 77 00 22
000100
000000 73 ca cc a1 d9 bb 21 8d c3 5c f3 ab 43 6d a7 a4
000010 4a fd c5 f4 9c ba 4a 0f b1 2e 19 15 4e 84 26 e0
000020 67 c9 f2 52 4d 65 1f 81 b7 8b 6d 2b 56 7b 99 75
000030 2e cd d0 db 08 0c 4b df f3 83 c6 83 00 2e 2b b8
000040 0f af 61 1d f2 02 35 74 b5 a4 6f 28 f3 a1 09 12
000050 f2 53 b5 d2 da 45 01 e5 12 d6 46 f8 0b db ed 51
000060 7b f4 0d 54 e0 63 ea 22 e2 1d d0 d6 d0 e7 7e e0
000070 93 91 fb 87 95 43 41 28 de 3d 8b a3 a8 8f c4 9e
000080 30 95 12 7a b2 27 28 ff 37 04 2e 09 7c dd 7c 12
000090 e1 50 60 fb 6d 5f a8 65 14 40 89 e3 4c d2 87 8f
0000a0 34 76 7e 66 7a 8e 6b a3 fc cf 38 52 2e f9 26 f0
0000b0 98 63 15 06 34 99 b2 88 4f aa d8 14 88 71 f1 81
0000c0 be 51 11 2b f4 7e a0 1e 12 b2 44 2e f6 8d 84 ea
0000d0 63 82 2b 66 b3 9a fd 08 73 5a c2 cc ab 5a af b1
0000e0 88 e3 a6 80 4b fc db ed 71 e0 ae c0 0a a4 8c 35
0000f0 eb 89 f9 8a 4b 52 59 6f 09 7c 01 3f 56 e7 c7 bf
000100

Signed-off-by: David Daney <[email protected]>
Acked-by: Herbert Xu <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
"4 fixes"

* emailed patches from Andrew Morton <[email protected]>:
  mm/slub.c: fix random_seq offset destruction
  cpumask: use nr_cpumask_bits for parsing functions
  mm: avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers
  kernel/ucount.c: mark user_header with kmemleak_ignore()

mm/slub.c: fix random_seq offset destruction

Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub
initialisation as described in

https://bugzilla.kernel.org/show_bug.cgi?id=177551.

Bail out early from init_cache_random_seq if s->random_seq is already
initialised. This prevents destroying the previously computed
random_seq offsets later in the function.

If the offsets are destroyed, then shuffle_freelist will truncate
page->freelist to just the first object (orphaning the rest).

Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sean Rees <[email protected]>
Reported-by: <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpumask: use nr_cpumask_bits for parsing functions

Commit 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and
parsing functions") converted both cpumask printing and parsing
functions to use nr_cpu_ids instead of nr_cpumask_bits.  While this was
okay for the printing functions as it just picked one of the two output
formats that we were alternating between depending on a kernel config,
doing the same for parsing wasn't okay.

nr_cpumask_bits can be either nr_cpu_ids or NR_CPUS.  We can always use
nr_cpu_ids but that is a variable while NR_CPUS is a constant, so it can
be more efficient to use NR_CPUS when we can get away with it.
Converting the printing functions to nr_cpu_ids makes sense because it
affects how the masks get presented to userspace and doesn't break
anything; however, using nr_cpu_ids for parsing functions can
incorrectly leave the higher bits uninitialized while reading in these
masks from userland.  As all testing and comparison functions use
nr_cpumask_bits which can be larger than nr_cpu_ids, the parsed cpumasks
can erroneously yield false negative results.

This made the taskstats interface incorrectly return -EINVAL even when
the inputs were correct.

Fix it by restoring the parse functions to use nr_cpumask_bits instead
of nr_cpu_ids.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 513e3d2d11c9 ("cpumask: always use nr_cpu_ids in formatting and parsing functions")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Martin Steigerwald <[email protected]>
Debugged-by: Ben Hutchings <[email protected]>
Cc: <[email protected]> [4.0+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers

Some ->page_mkwrite handlers may return VM_FAULT_RETRY as its return
code (GFS2 or Lustre can definitely do this). However VM_FAULT_RETRY
from ->page_mkwrite is completely unhandled by the mm code and results
in locking and writeably mapping the page which definitely is not what
the caller wanted.

Fix Lustre and block_page_mkwrite_ret() used by other filesystems
(notably GFS2) to return VM_FAULT_NOPAGE instead which results in
bailing out from the fault code, the CPU then retries the access, and we
fault again effectively doing what the handler wanted.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reported-by: Al Viro <[email protected]>
Reviewed-by: Jinshan Xiong <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/ucount.c: mark user_header with kmemleak_ignore()

The user_header gets caught by kmemleak with the following splat as
missing a free:

  unreferenced object 0xffff99667a733d80 (size 96):
  comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
  hex dump (first 32 bytes):
    a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00  ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmemleak_alloc+0x4a/0xa0
     __kmalloc+0x144/0x260
     __register_sysctl_table+0x54/0x5e0
     register_sysctl+0x1b/0x20
     user_namespace_sysctl_init+0x17/0x34
     do_one_initcall+0x52/0x1a0
     kernel_init_freeable+0x173/0x200
     kernel_init+0xe/0x100
     ret_from_fork+0x2c/0x40

The BUG_ON()s are intended to crash so no need to clean up after
ourselves on error there.  This is also a kernel/ subsys_init() we don't
need a respective exit call here as this is never modular, so just white
list it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis R. Rodriguez <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drm: vc4: adapt to new behaviour of drm_crtc.c

When drm_crtc_init_with_planes() was orignally added
(in drm_crtc.c, e13161af80c185ecd8dc4641d0f5df58f9e3e0af
drm: Add drm_crtc_init_with_planes() (v2)), it only checked for "primary"
being non-null. If that was the case, it modified primary->possible_crtcs.

Then, when support for cursor planes was added
(fc1d3e44ef7c1db93384150fdbf8948dcf949f15 drm: Allow drivers to register
cursor planes with crtc), the same behaviour was implemented for cursor
planes.

vc4_plane_init() since its inception has passed 0xff as "possible_crtcs"
parameter to drm_universal_plane_init(). With a change in drm_crtc.c
(7abc7d47510c75dd984380ebf819616e574c9604 drm: don't override
possible_crtcs for primary/cursor planes) passing 0xff results in primary's
possible_crtcs set to 0xff (cursor was updated manually by vc4_crtc.c).
Consequently, it would be allowed to use the primary plane from CRTC 1 (for
example) on CRTC 0, which would result in the overlay and cursors being
buried.

Signed-off-by: Andrzej Pietrasiewicz <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
Fixes: 7abc7d47510c ("drm: don't override possible_crtcs for primary/cursor planes")

net: thunderx: Fix PHY autoneg for SGMII QLM mode

This patch fixes the case where there is no phydev attached
to a LMAC in DT due to non-existance of a PHY driver or due
to usage of non-stanadard PHY which doesn't support autoneg.
Changes dependeds on firmware to send correct info w.r.t
PHY and autoneg capability.

This patch also covers a case where a 10G/40G interface is used
as a 1G with convertors with Cortina PHY in between.

Signed-off-by: Thanneeru Srinivasulu <[email protected]>
Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Don't reflect LINKDOWN nexthops

The kernel resolves the nexthops for a given route using
FIB_LOOKUP_IGNORE_LINKSTATE which means a notification can be sent for a
route with one of its nexthops being LINKDOWN.

In case IGNORE_ROUTES_WITH_LINKDOWN is set for the nexthop netdev, then
we shouldn't reflect the nexthop to the device's table.

Once the nexthop netdev's carrier goes up we'll be notified using NH_ADD
and reflect it to the device.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mlxsw-Reflect-nexthop-status-changes'

Jiri Pirko says:

====================
mlxsw: Reflect nexthop status changes

Ido says:

When the kernel forwards IPv4 packets via multipath routes it doesn't
consider nexthops that are dead or linkdown. For example, if the nexthop
netdev is administratively down or doesn't have a carrier.

Devices capable of offloading such multipath routes need to be made
aware of changes in the reflected nexthops' status. Otherwise, the
device might forward packets via non-functional nexthops, resulting in
packet loss. This patchset aims to fix that.

The first 11 patches deal with the necessary restructuring in the
mlxsw driver, so that it's able to correctly add and remove nexthops
from the device's adjacency table.

The 12th patch adds the NH_{ADD,DEL} events to the FIB notification
chain. These notifications are sent whenever the kernel decides to add
or remove a nexthop from the forwarding plane.

Finally, the last three patches add support for these events in the
mlxsw driver, which is currently the only driver capable of offloading
multipath routes.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Flush resources when RIF is deleted

When the last IP address is removed from a netdev, its RIF is deleted.
However, if user didn't first remove neighbours and nexthops using this
interface, then they would still be present in the device's tables.

Therefore, whenever a RIF is deleted, make sure all the neighbours and
nexthops (adjacency entries) using it are removed from the relevant
tables as well.

The action associated with any route using this RIF would be refreshed,
most likely to trap. If the kernel decides to remove the route (f.e.,
because all the nexthops are now DEAD), then an event would be sent,
causing the route to be removed from the device.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Reflect nexthop status changes

When a packet hits a multipath route in the device's routing table, a
hash is computed over its headers, which is then used to select the
appropriate nexthop from the device's adjacency table.

There are situations in which the kernel removes a nexthop from a
multipath route (e.g., no carrier) and the device should do the same.

Upon the reception of NH_{ADD,DEL} events, add or remove a nexthop from
the device's adjacency table and refresh all the routes using the
nexthop group. If all the nexthops of a multipath route are invalid,
then any packet hitting the route would be trapped to the CPU for
forwarding.

If all the nexthops are DEAD, then the kernel would remove the route
entirely. On the other hand, if all the nexthops are merely LINKDOWN,
then the kernel would keep the route and forward any incoming packet
using a different route.

While the last case might sound like a problem, it's expected that a
routing daemon running in user space would remove such a route from the
FIB as it's dumped with the DEAD flag set.

Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>