Git Repo - linux.git/log

net: sched: fix tc_should_offload for specific clsact classes

When offloading classifiers such as u32 or flower to hardware, and the
qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes,
since not all of them handle ingress, therefore we must leave those in
software path. Add a .tcf_cl_offload() callback, so we can generically
handle them, tested on ixgbe.

Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support")
Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

act_police: fix a crash during removal

The police action is using its own code to initialize tcf hash
info, which makes us to forgot to initialize a->hinfo correctly.
Fix this by calling the helper function tcf_hash_create() directly.

This patch fixed the following crash:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91
PGD d3c34067 PUD d3e18067 PMD 0
Oops: 0000 [#1] SMP
CPU: 2 PID: 853 Comm: tc Not tainted 4.6.0+ #87
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d3e28040 ti: ffff8800d3f6c000 task.ti: ffff8800d3f6c000
RIP: 0010:[<ffffffff810c099f>]  [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91
RSP: 0000:ffff88011b203c80  EFLAGS: 00010002
RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
RBP: ffff88011b203d40 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88011b203d58 R11: ffff88011b208000 R12: 0000000000000001
R13: ffff8800d3e28040 R14: 0000000000000028 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000000d4be1000 CR4: 00000000000006e0
Stack:
  ffff8800d3e289c0 0000000000000046 000000001b203d60 ffffffff00000000
  0000000000000000 ffff880000000000 0000000000000000 ffffffff00000000
  ffffffff8187142c ffff88011b203ce8 ffff88011b203ce8 ffffffff8101dbfc
Call Trace:
  <IRQ>
  [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1
  [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35
  [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35
  [<ffffffff810a9604>] ? sched_clock_local+0x11/0x78
  [<ffffffff810bf6a1>] ? mark_lock+0x24/0x201
  [<ffffffff810c1dbd>] lock_acquire+0x120/0x1b4
  [<ffffffff810c1dbd>] ? lock_acquire+0x120/0x1b4
  [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1
  [<ffffffff81aad89f>] _raw_spin_lock_bh+0x3c/0x72
  [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1
  [<ffffffff8187142c>] __tcf_hash_release+0x77/0xd1
  [<ffffffff81871a27>] tcf_action_destroy+0x49/0x7c
  [<ffffffff81870b1c>] tcf_exts_destroy+0x20/0x2d
  [<ffffffff8189273b>] u32_destroy_key+0x1b/0x4d
  [<ffffffff81892788>] u32_delete_key_freepf_rcu+0x1b/0x1d
  [<ffffffff810de3b8>] rcu_process_callbacks+0x610/0x82e
  [<ffffffff8189276d>] ? u32_destroy_key+0x4d/0x4d
  [<ffffffff81ab0bc1>] __do_softirq+0x191/0x3f4

Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions")
Cc: Jamal Hadi Salim <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-sched-fast-stats'

Eric Dumazet says:

====================
net: sched: faster stats gathering

A while back, I sent one RFC patch using lockless stats gathering
on 64bit arches.

This patch series does it more cleanly, using a seqcount.

Since qdisc/class stats are written at dequeue() time,
we can ask the dequeue to change the seqcount, so that
stats readers can avoid taking the root qdisc lock,
and instead the typical read_seqcount_{begin|retry} guarded
loop.

This does not change fast path costs, as the seqcount
increments are not more expensive than the bit manipulation,
and allows readers to not freeze the fast path anymore.
====================

Signed-off-by: David S. Miller <[email protected]>

net: sched: do not acquire qdisc spinlock in qdisc/class stats dump

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Kevin Athey <[email protected]>
Cc: Xiaotian Pei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: transform qdisc running bit into a seqcount

Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.

This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fq_codel: return non zero qlen in class dumps

We properly scan the flow list to count number of packets,
but John passed 0 to gnet_stats_copy_queue() so we report
a zero value to user space instead of the result.

Fixes: 640158536632 ("net: sched: restrict use of qstats qlen")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: John Fastabend <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'u32-hwoffload-fixes'

Jakub Kicinski says:

====================
cls_u32 hardware offload fixes

This set fixes two small issues with error codes I noticed
in cls_u32. Second patch could be viewed as user space API
change but that portion of API is not part of any release,
yet.

Compile tested only.
====================

Signed-off-by: David S. Miller <[email protected]>

net: cls_u32: be more strict about skip-sw flag

Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: cls_u32: fix error code for invalid flags

'err' variable is not set in this test, we would return whatever
previous test set 'err' to.

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Sridhar Samudrala <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

gtp: #define _UAPI_LINUX_GTP_H_ and not _UAPI_LINUX_GTP_H__

Fix clang build warning:

./include/uapi/linux/gtp.h:1:9: warning: '_UAPI_LINUX_GTP_H_' is
used as a header guard here, followed by #define of a different
macro [-Wheader-guard]

fix by defining _UAPI_LINUX_GTP_H_ and not _UAPI_LINUX_GTP_H__

Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"This finally removes the CLK_IS_ROOT flag by picking up the last few
  stragglers that didn't get merged by anyone this time around.

  Better to do it now than wait for another one to pop up.  There's also
  a minor maintainers update and a Kconfig fix"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: nxp: Select MFD_SYSCON for creg driver
  MAINTAINERS: Add file patterns for clock device tree bindings
  clk: Remove CLK_IS_ROOT flag
  clk: microchip: Remove CLK_IS_ROOT
  powerpc/512x: clk: Remove CLK_IS_ROOT
  vexpress/spc: Remove CLK_IS_ROOT

Merge branch 'be2net-noncrit-fixes'

Sathya Perla says:

====================
be2net: patch set

Hi David, the following patch set contains three non-critical fixes that
can go into the net-next tree.

Patch 1 fixes the logic for provisioning queue pairs on VFs to take into
account the limit on number of TXQs too as in some profiles the number
of TXQs is less than that of RXQs.

Patch 2 enables WoL support from shutdown on Skyhawk.

Patch 3 enhances the logic for provisioning queue pairs on VFs on
SR-IOV over multi-partition configs. Each PF (partition) on a port has to
compute the number of RSS tables it's VFs can use.
====================

Signed-off-by: David S. Miller <[email protected]>

be2net: Fix provisioning of RSS for VFs in multi-partition configurations

Currently, we do not distribute queue resources to enable RSS for VFs
in multi-channel/partition configurations.
Fix this by having each PF(SRIOV capable) calculate it's share of the
15 RSS Policy Tables available per port before provisioning resources for
all the VFs.
This proportional share calculation is done based on division of the
PF's MAX VFs with the Total MAX VFs on that port. It also needs to
learn about the no: of NIC PFs on the port and subtract that from
the 15 RSS Policy Tables on the port.

Signed-off-by: Somnath Kotur <[email protected]>
Signed-off-by: Sathya Perla <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: Enable Wake-On-LAN from shutdown for Skyhawk

Skyhawk does support wake-up from ACPI shutdown state - S5, provided the
platform supports it (like Auxiliary power source etc). The changes listed
below are done to fix this.

1) There's no need to defer the HW configuration of WOL to be_suspend().
Remove this in be_suspend() and move it to be_set_wol() ethtool function
so it is configured directly in the context of ethtool. This automatically
takes care of the shutdown case.

2) The driver incorrectly uses WOL_CAP field in the FW response to
get_acpi_wol_cap() command, to determine if WOL is enabled. Instead the
driver must rely on the macaddr field in the response to infer WOL state.

3) In be_get_config() during init, if we find that WOL is enabled in FW,
call pci_enable_wake() to enable pmcsr.pme_en bit. This is needed to
support persistent WOL configuration provided by the FW in some platforms.

4) Remove code in be_set_wol() that writes to PCICFG_PM_CONTROL_OFFSET
to set pme_en bit; pci_enable_wake() sets that.

Fixes: 028991e49 ("Enabling Wake-on-LAN is not supported in S5 state")
Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: Sathya Perla <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: use max-TXQs limit too while provisioning VF queue pairs

When the PF driver provisions resources for VFs, it currently only looks
at max RSS queues available to calculate the number of VF queue pairs.
This logic breaks when there are less number of TX-queues than RSS-queues.
This patch fixes this problem by using the max-TXQs available in the
PF-pool in the calculations. As a part of this change the
be_calculate_vf_qs() routine is renamed as be_calculate_vf_res() and the
code that calculates limits on other related resources is moved here to
contain all resource calculation code inside one routine.

Signed-off-by: Suresh Reddy <[email protected]>
Signed-off-by: Sathya Perla <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: fec: fix spelling mistakes and add missing newline

trivial fix to spelling mistakes and add missing newline in pr_err
messages

Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Fugang Duan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bnxt_en-fixes'

Michael Chan says:

====================
bnxt_en: Bug fixes.

Fix a race condition and VLAN rx acceleration logic.
====================

Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Simplify VLAN receive logic.

Since both CTAG and STAG rx acceleration must be enabled together, we
only need to check one feature flag (NETIF_F_HW_VLAN_CTAG_RX) before
calling __vlan_hwaccel_put_tag().

Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Enable and disable RX CTAG and RX STAG VLAN acceleration together.

The hardware can only be set to strip or not strip both the VLAN CTAG and
STAG. It cannot strip one and not strip the other. Add logic to
bnxt_fix_features() to toggle both feature flags when the user is toggling
one of them.

Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix tx push race condition.

Set the is_push flag in the software BD before the tx data is pushed to
the chip. It is possible to get the tx interrupt as soon as the tx data
is pushed. The tx handler will not handle the event properly if the
is_push flag is not set and it will crash.

Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers/net: support hdlc function for QE-UCC

The driver add hdlc support for Freescale QUICC Engine.
It support NMSI and TSA mode.

Signed-off-by: Zhao Qiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fsl/qe: Add QE TDM lib

QE has module to support TDM, some other protocols
supported by QE are based on TDM.
add a qe-tdm lib, this lib provides functions to the protocols
using TDM to configurate QE-TDM.

Signed-off-by: Zhao Qiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fsl/qe: Make regs resouce_size_t

Signed-off-by: Zhao Qiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fsl/qe: setup clock source for TDM mode

Add tdm clock configuration in both qe clock system and ucc
fast controller.

Signed-off-by: Zhao Qiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fsl/qe: add rx_sync and tx_sync for TDM mode

Rx_sync and tx_sync are used by QE-TDM mode,
add them to struct ucc_fast_info.

Signed-off-by: Zhao Qiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net sched: indentation and other OCD stylistic fixes

Signed-off-by: Jamal Hadi Salim <[email protected]>
Acked-by: Cong Wang <[email protected]>

Merge branch 'sch-action-tstamp'

Jamal Hadi Salim says:

====================
net sched action timestamp improvements

Various aggregations of duplicated code, fixes and introduction of firstused
timestamp

v2: add const for source time info per suggestion from Cong
====================

Signed-off-by: David S. Miller <[email protected]>

net sched actions: aggregate dumping of actions timeinfo

Signed-off-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net sched actions: introduce timestamp for firsttime use

Useful to know when the action was first used for accounting
(and debugging)

Signed-off-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net sched: actions use tcf_lastuse_update for consistency

Signed-off-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: cls_flower: Introduce support in SKIP SW flag

In order to make a filter processed only by hardware, skip_sw flag
should be supplied. This is an addition to the already existing skip_hw
flag (filter will be processed by software only). If no flag is
specified, filter will be processed by both software and hardware.

If only hardware offloaded filters exist, fl_classify() will return
without doing anything.

A following userspace patch will be sent once kernel patch is accepted.

Example:

tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \
flower \
ip_proto 6 \
indev enp0s9 \
skip_sw \
action skbedit mark 0x1234

Signed-off-by: Amir Vadai <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qed-iov-fw-reqs'

Yuval Mintz says:

====================
qed: IOV series - relax firmware requirements

In order for VFs to work, current implementation demands that the VF's
requried storm firmware would be exactly the version that was loaded by
the PF, which is a very harsh requirement.
This patch series is intended to relax this -
the recently submitted firmware is intended to be forward/backward
compatible in its fastpath [slowpath is configured by PF on behalf of VF],
and so VFs would only be required of having the same major faspath HSI in
order to work.

Most of the other patches in this series extend current forward
compatibilty of driver to reduce chance of breaking PF/VF compatibility
in the future. A few are unrelated IOV changes.
====================

Signed-off-by: David S. Miller <[email protected]>

qed: PF to reply to unknown messages

If a future VF would send the PF an unknown message, the PF today would
not send a reply. This would have 2 bad effects:
  a. VF would have to timeout on the request.
  b. If VF were to send an additional message to PF, firmware would mark
     it as malicious.

Instead, if there's some valid reply-address on the message - let the PF
answer and tell the VF it doesn't know the message.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: PF enforce MAC limitation of VFs

The only limitation relating to MACs the PF enforce today on its VFs
is in case it has a forced-unicast MAC address for them, in which case
they can't configure other unicast addresses.
Specifically, the PF isn't enforcing the number of MAC addresse a VF can
configure regardless of the nubmer of such filters agreed upon by PF and
VF during the acquisition process.

PF's shadow-config is now extended to also contain information about its
VFs' unicast addresses configuration, allowing such enforcement.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: Move doorbell calculation from VF to PF

Today, the VF is aware of its queues context-ids, and calculates the
doorbell address when opening its queues on its own.
The configuration of doorbells in HW can sometime in the future be changed
by the PF [hw has several configurable features that might affect doorbell
addresses, e.g., dpm support], this would break compatibility with older
VFs as their calculated doorbell addresses would be incorrect for such a
configuration.

In order to avoid such a backward compatibility failure, let the PF make
the calculation of the doorbell offset based on the context-id, and pass
that to the VF.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: Make PF more robust against malicious VF

There are several requests the VF can make toward the PF which the driver
would pass to firmware without checking the validity first - specifically,
opening queues and updating vports. Such configurations might cause the
firmware to assert.

This adds validation of the legality of said configurations on the PF side
before passing it onward via ramrod to firmware.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: PF-VF resource negotiation

One of the goals of the vf's first message to the PF [acquire]
is to learn about the number of resources available to it [macs, vlans,
etc.]. This is done via negotiation - the VF requires a set of resources,
which the PF either approves or disaproves and sends a smaller set of
resources as alternative. In this later case, the VF is then expected to
either abort the probe or re-send the acquire message with less
required resources.

While this infrastructure exists since the initial submision of qed
SRIOV support, it's in fact completely inoperational - PF isn't really
looking into the resources the VF has asked for and is never going to
reply to the VF that it lacks resources.

This patch addresses this flow, fixing it and allowing the PF and VF
to actually agree on a set of resources.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: Relax VF firmware requirements

Current driver require an exact match between VF and PF storm firmware;
Any difference would fail the VF acquire message, causing the VF probe
to be aborted.

While there's still dependencies between the two, the recent FW submission
has relaxed the match requirement - instead of an exact match, there's now
a 'fastpath' HSI major/minor scheme, where VFs and PFs that match in their
major number can co-exist even if their minor is different.

In order to accomadate this change some changes in the vf-start init flow
had to be made, as the VF start ramrod now has to be sent only after PF
learns which fastpath HSI its VF is requiring.

Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: get rid of spin_trylock() in net_tx_action()

Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.

The reasons we want to get rid of this spin_trylock() are :

1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.

2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.

Since we want to adhere more strictly to ksoftirqd being waked up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.

3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.

4) RT kernels can not use the spin_trylock() here.

With help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only performance penalty.

Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)

("mpstat -I SCPU 1" is much happier now)

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Tom Herbert <[email protected]>
Acked-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

rxrpc: fix ptr_ret.cocci warnings

net/rxrpc/rxkad.c:1165:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

CC: David Howells <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'rds-packet-assembly-fixes'

Sowmini Varadhan says:

====================
RDS: TCP: socket locking RDS packet assembly fixes

This three part patchset fixes bugs in synchronization between
rds_tcp_accept_one() and the rds-tcp send/recv path.

Patch 1 ensures that the lock_sock() is taken appropriately
and the RDS datagram reassembly state is reset to synchronize
with the receive path.

Patch 2 ensures that partially sent RDS datagrams will get
retransmitted after rds_tcp_accept_one() switches sockets.

Patch 3 fixes a race window which would prematurely re-enable
rds_send_xmit() before the rds_tcp_connection setup has been
completed in rds_tcp_accept_one().
====================

Signed-off-by: David S. Miller <[email protected]>

RDS: TCP: fix race windows in send-path quiescence by rds_tcp_accept_one()

The send path needs to be quiesced before resetting callbacks from
rds_tcp_accept_one(), and commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
this using the c_state and RDS_IN_XMIT bit following the pattern
used by rds_conn_shutdown(). However this leaves the possibility
of a race window as shown in the sequence below
    take t_conn_lock in rds_tcp_conn_connect
    send outgoing syn to peer
    drop t_conn_lock in rds_tcp_conn_connect
    incoming from peer triggers rds_tcp_accept_one, conn is
marked CONNECTING
    wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
    call rds_tcp_reset_callbacks
    [.. race-window where incoming syn-ack can cause the conn
to be marked UP from rds_tcp_state_change ..]
    lock_sock called from rds_tcp_reset_callbacks, and we set
t_sock to null
As soon as the conn is marked UP in the race-window above, rds_send_xmit()
threads will proceed to rds_tcp_xmit and may encounter a null-pointer
deref on the t_sock.

Given that rds_tcp_state_change() is invoked in softirq context, whereas
rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().

Signed-off-by: Sowmini Varadhan <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

RDS: TCP: Retransmit half-sent datagrams when switching sockets in rds_tcp_reset_callbacks

When we switch a connection's sockets in rds_tcp_rest_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassmble the RDS
datagram. Use rds_send_reset() which is designed for this purpose.

Signed-off-by: Sowmini Varadhan <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

RDS: TCP: Add/use rds_tcp_reset_callbacks to reset tcp socket safely

When rds_tcp_accept_one() has to replace the existing tcp socket
with a newer tcp socket (duelling-syn resolution), it must lock_sock()
to suppress the rds_tcp_data_recv() path while callbacks are being
changed. Also, existing RDS datagram reassembly state must be reset,
so that the next datagram on the new socket does not have corrupted
state. Similarly when resetting the newly accepted socket, appropriate
locks and synchronization is needed.

This commit ensures correct synchronization by invoking
kernel_sock_shutdown to reset a newly accepted sock, and by taking
appropriate lock_sock()s (for old and new sockets) when resetting
existing callbacks.

Signed-off-by: Sowmini Varadhan <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

fq_codel: fix NET_XMIT_CN behavior

My prior attempt to fix the backlogs of parents failed.

If we return NET_XMIT_CN, our parents wont increase their backlog,
so our qdisc_tree_reduce_backlog() should take this into account.

v2: Florian Westphal pointed out that we could drop the packet,
so we need to save qdisc_pkt_len(skb) in a temp variable before
calling fq_codel_drop()

Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()")
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Reported-by: Stas Nichiporovich <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: WANG Cong <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf, trace: use READ_ONCE for retrieving file ptr

In bpf_perf_event_read() and bpf_perf_event_output(), we must use
READ_ONCE() for fetching the struct file pointer, which could get
updated concurrently, so we must prevent the compiler from potential
refetching.

We already do this with tail calls for fetching the related bpf_prog,
but not so on stored perf events. Semantics for both are the same
with regards to updates.

Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vhost_net: stop polling socket during rx processing

We don't stop rx polling socket during rx processing, this will lead
unnecessary wakeups from under layer net devices (E.g
sock_def_readable() form tun). Rx will be slowed down in this
way. This patch avoids this by stop polling socket during rx
processing. A small drawback is that this introduces some overheads in
light load case because of the extra start/stop polling, but single
netperf TCP_RR does not notice any change. In a super heavy load case,
e.g using pktgen to inject packet to guest, we get about ~8.8%
improvement on pps:

before: ~1240000 pkt/s
after: ~1350000 pkt/s

Signed-off-by: Jason Wang <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull userns fixes from Eric Biederman:
"This contains two small but significant fixes to fs/namespace.c.

  The first adds a filesystem refcount drop on error.  The second
  corrects a test in fs_fully_visible which could be abused to allow
  mounting of proc or sysfs, when that should not be allowed.

  To keep myself honest I have tested to ensure the incorrect test in
  fs_fully_visible actually allows improper mounting of proc before the
  fix and that when fixed the improper mounting is not allowed"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  mnt: fs_fully_visible test the proper mount for MNT_LOCKED
  mnt: If fs_fully_visible fails call put_filesystem.

IB/IPoIB: Don't update neigh validity for unresolved entries

ipoib_neigh_get unconditionally updates the "alive" variable member on
any packet send.  This prevents the neighbor garbage collection from
cleaning out a dead neighbor entry if we are still queueing packets
for it.  If the queue for this neighbor is full, then don't update the
alive timestamp.  That way the neighbor can time out even if packets
are still being queued as long as none of them are being sent.

Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix alternate path code

Userspace flag IBV_QP_ALT_PATH is supposed to set the alternate path
including fields alt_pkey_index and alt_timeout.
Added IB_QP_PKEY_INDEX and IB_QP_TIMEOUT to the attribute mask when
calling mlx5_set_path for the alternate path to force setting the
alt_pkey_index and alt_timeout values.

Fixes: bf24481a3a7c4 ('IB/mlx5: Consider alternate path in pkey ...')
Signed-off-by: Achiad Shochat <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Reviewed-by: Jack Morgenstein <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix pkey_index length in the QP path record

Pkey index fields in the QP context path record are extended to 16
bits, as required by IB spec (version 1.3).
This change affects all QP commands which include path records.

To enable this change, moved the free adaptive routing flag bit
(free_ar) to the most significant byte of the QP path record.

Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB ...')
Signed-off-by: Noa Osherovich <[email protected]>
Reviewed-by: Jack Morgenstein <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix entries check in mlx5_ib_resize_cq

Verify that number of entries is less than device capability.
Add an appropriate warning message for error flow.

Fixes: bde51583f49b ('IB/mlx5: Add support for resize CQ')
Signed-off-by: Majd Dibbiny <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix entries checks in mlx5_ib_create_cq

Number of entries shouldn't be greater than the device's max
capability. This should be checked before rounding the entries number
to power of two.

Fixes: 51ee86a4af639 ('IB/mlx5: Fix check of number of entries...')
Signed-off-by: Majd Dibbiny <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Check BlueFlame HCA support

BlueFlame support is reported only for PFs when the HCA capability is
on.

Fixes: 938fe83c8dcbb ('net/mlx5_core: New device capabilities...')
Signed-off-by: Majd Dibbiny <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix returned values of query QP

Some variables were not initialized properly: max_recv_wr,
max_recv_sge, max_send_wr, qp_context and max_inline_data.

Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB...')
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Limit query HCA clock

When PAGE_SIZE is larger than 4K, the user shouldn't be able to query
the HCA core clock. This counter is within 4KB boundary and the
user-space shall not read information that's after this boundary.

Fixes: b368d7cb8ceb7 ('IB/mlx5: Add hca_core_clock_offset to...')
Signed-off-by: Majd Dibbiny <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix FW version diaplay in sysfs

Add a 4-digit padding to show FW version in proper format.

Fixes: 9603b61de1eee ('mlx5: Move pci device handling from...')
Signed-off-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Return PORT_ERR in Active to Initializing tranisition

FW port-change events are fired on Active <-> non Active port state
transitions only.
When the port state changes from Active to Initializing (Active ->
Down -> Initializing), a single event is fired.
The HCA transitions from Down to Initializing unless prevented from
doing so, hence the driver should also propagate events when the port
state is Initializing to consumers so they'll be aware that the port
is no longer Active and act accordingly.

Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB...')
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Set flow steering capability bit

Flow steering is supported by mlx5 device when the following
features are supported by firmware:

1. NIC RX flow table.
2. Device has enough flow steering levels.
3. Atomic modification of flow table entry.
4. Flow tables chaining.

To check if flow steering is supported it's enough to check
if the driver opened the mlx5 bypass namespace.

Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Make all casts in ib_device_cap_flags enum consistent

Replace the few u64 casts with ULL to match the rest of the casts.

Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix bit curruption in ib_device_cap_flags structure

ib_device_cap_flags 64-bit expansion caused caps overlapping
and made consumers read wrong device capabilities. For example
IB_DEVICE_SG_GAPS_REG was falsely read by the iser driver causing
it to use a non-existing capability. This happened because signed
int becomes sign extended when converted it to u64. Fix this by
casting IB_DEVICE_ON_DEMAND_PAGING enumeration to ULL.

Fixes: f5aa9159a418 ('IB/core: Add arbitrary sg_list support')
Reported-by: Robert LeBlanc <[email protected]>
Cc: Stable <[email protected]> #[v4.6+]
Acked-by: Sagi Grimberg <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Matan Barak <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Initialize sysfs attributes before sysfs create group

For dynamically allocated sysfs attributes there is a need to call
sysfs_attr_init in order to comply with lockdep, not calling it
will result in error complaining key is not in .data section.

Fixes: b40f4757daa1 ("IB/core: Make device counter infrastructure dynamic")
Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/IPoIB: Disable bottom half when dealing with device address

Align locking usage when touching device address with rest
of the kernel. Lock the bottom half when doing so using
netif_addr_lock_bh.

This also solves the following case as reported by lockdep:
CPU0 CPU1
---- ----
lock(_xmit_INFINIBAND);
local_irq_disable();
lock(&(&mc->mca_lock)->rlock);
lock(_xmit_INFINIBAND);
<Interrupt>
lock(&(&mc->mca_lock)->rlock);

*** DEADLOCK ***

Fixes: 492a7e67ff83 ("IB/IPoIB: Allow setting the device address")
Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix removal of default GID cache entry

When deleting a default GID from the cache, its gid_type field is set
to 0.

This could set the gid_type to RoCE v1 for a RoCE v2 default GID,
essentially making it inaccessible to future modifications, since it
is no longer found by find_gid().

This fix preserves the gid_type value for default gids during cache
operations.

Fixes: b39ffa1df505 ('IB/core: Add gid_type to gid attribute')
Signed-off-by: Aviv Heller <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions

In ipoib_remove_one the driver holds the rtnl_lock and tries to do some
operation like dev_change_flags or unregister_netdev, while sysfs
callback like ipoib_vlan_delete holds sysfs mutex and tries to hold the
rtnl_lock via rtnl_trylock() and restart_syscall() if the lock is not
free, meanwhile ipoib_remove_one tries to get the sysfs lock in order to
free its sysfs directory, and we will get  a->b, b->a deadlock.

    Trace like the following:

        schedule+0x37/0x80
        schedule_preempt_disabled+0xe/0x10
        __mutex_lock_slowpath+0xb5/0x120
        mutex_lock+0x23/0x40
        rtnl_lock+0x15/0x20
        netdev_run_todo+0x17c/0x320
        rtnl_unlock+0xe/0x10
        ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib]
        delete_child+0x54/0x80 [ib_ipoib]
        dev_attr_store+0x18/0x30
        sysfs_kf_write+0x37/0x40
        mutex_lock+0x16/0x40
        SyS_write+0x55/0xc0
        entry_SYSCALL_64_fastpath+0x16/0x75
    And
        schedule+0x37/0x80
        __kernfs_remove+0x1a8/0x260
        ? wake_atomic_t_function+0x60/0x60
        kernfs_remove+0x25/0x40
        sysfs_remove_dir+0x50/0x80
        kobject_del+0x18/0x50
        device_del+0x19f/0x260
        netdev_unregister_kobject+0x6a/0x80
        rollback_registered_many+0x1fd/0x340
        rollback_registered+0x3c/0x70
        unregister_netdevice_queue+0x55/0xc0
        unregister_netdev+0x20/0x30
        ipoib_remove_one+0x114/0x1b0 [ib_ipoib]
        ib_unregister_client+0x4a/0x170 [ib_core]
        ? find_module_all+0x71/0xa0
        ipoib_cleanup_module+0x10/0x94 [ib_ipoib]
        SyS_delete_module+0x1b5/0x210
        entry_SYSCALL_64_fastpath+0x16/0x75

The fix is by checking the flag IPOIB_FLAG_INTF_ON_DESTROY in order to
get out from the sysfs function.

Fixes: 862096a8bbf8 ("IB/ipoib: Add more rtnl_link_ops callbacks")
Fixes: 9baa0b036410 ("IB/ipoib: Add rtnl_link_ops support")
Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix query port failure in RoCE

Currently ib_query_port always attempts to to read the subnet prefix by
calling ib_query_gid(). For RoCE/iWARP there is no subnet manager and no
subnet prefix. Fix this by querying GID[0] only for IB networks.

Fixes: fad61ad4e755 ('IB/core: Add subnet prefix to port info')
Signed-off-by: Eli Cohen <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Steve Wise <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: fix error unwind in sysfs hw counters code

Between the initial and final versions of the function setup_hw_stats,
the order of variable initialization was changed. However, the unwind
flow on error did not properly keep up with the flow changes. Make
the unwind flow match a proper unwind of the allocation flow, then
remove no longer needed variable initializations.

Fixes: b40f4757daa1 (IB/core: Make device counter infrastructure
dynamic)
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix array length allocation

The new sysfs hw_counters code had an off by one in its array allocation
length. Fix that and the comment along with it.

Reported-by: Mark Bloch <[email protected]>
Fixes: b40f4757daa1 (IB/core: Make device counter infrastructure
dynamic)
Signed-off-by: Doug Ledford <[email protected]>

ALSA: hda/realtek: Add T560 docking unit fixup

Tested with Lenovo Ultradock. Fixes the non-working headphone jack on
the docking unit.

Signed-off-by: Torsten Hilbrich <[email protected]>
Tested-by: Torsten Hilbrich <[email protected]>
Cc: <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

mnt: fs_fully_visible test the proper mount for MNT_LOCKED

MNT_LOCKED implies on a child mount implies the child is locked to the
parent. So while looping through the children the children should be
tested (not their parent).

Typically an unshare of a mount namespace locks all mounts together
making both the parent and the slave as locked but there are a few
corner cases where other things work.

Cc: [email protected]
Fixes: ceeb0e5d39fc ("vfs: Ignore unlocked mounts in fs_fully_visible")
Reported-by: Seth Forshee <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

mnt: If fs_fully_visible fails call put_filesystem.

Add this trivial missing error handling.

Cc: [email protected]
Fixes: 1b852bceb0d1 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
Signed-off-by: "Eric W. Biederman" <[email protected]>

net: ethernet: cavium: liquidio: request_manager: Remove create_workqueue

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem viz
(&db_wq->wk.work which maps to check_db_timeout) is involved
in normal device operation. WQ_MEM_RECLAIM has been set to guarantee
forward progress under memory pressure, which is a requirement here.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty.

Signed-off-by: Bhaktipriya Shridhar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: cavium: liquidio: response_manager: Remove create_workqueue

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem viz
(&cwq->wk.work which maps to oct_poll_req_completion) is involved
in normal device operation. WQ_MEM_RECLAIM has been set to guarantee
forward progress under memory pressure, which is a requirement here.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty. Hence the call to flush_workqueue() has been dropped.

Signed-off-by: Bhaktipriya Shridhar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: keep backlog updated with qlen

For gso_skb we only update qlen, backlog should be updated too.

Note, it is correct to just update these stats at one layer,
because the gso_skb is cached there.

Reported-by: Stas Nichiporovich <[email protected]>
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

virtio-net: Add initial MTU advice feature

This commit adds the feature bit and associated mtu device entry for the
virtio network device. When a virtio device comes up, it checks the
feature bit for the VIRTIO_NET_F_MTU feature. If such feature bit is
enabled, the driver will read the advised MTU and use it as the initial
value.

Signed-off-by: Aaron Conole <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ACPI / EC: Fix a boot EC regresion by restoring boot EC support for the DSDT EC

According to the Windows probing result, during the table loading, the EC
device described in the ECDT should be used. And the ECDT EC is also
effective during the period the namespace objects are initialized (we can
see a separate process executing _STA/_INI on Windows before executing
other device specific control methods, for example, EC._REG). During the
device enumration, the EC device described in the DSDT should be used. But
there are differences between Linux and Windows around the device probing
order. Thus in Linux, we should enable the DSDT EC as early as possible
before enumerating devices in order not to trigger issues related to the
device enumeration order differences.

This patch thus converts acpi_boot_ec_enable() into acpi_ec_dsdt_probe() to
fix the gap. This also fixes a user reported regression triggered after we
switched the "table loading"/"ECDT support" to be ACPI spec 2.0 compliant.

Fixes: 59f0aa9480cf (ACPI 2.0 / ECDT: Remove early namespace reference from EC)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=119261
Reported-and-tested-by: Gabriele Mazzotta <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

IB/hfi1: Suppress sparse warnings

Avoid that sparse reports the following warnings for the hfi1 driver:

trace.c:217:13: warning: no previous prototype for ‘print_u64_array’ [-Wmissing-prototypes]
user_sdma.c:1361:17: warning: dubious: !x & y

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Use bit 0 instead of bit 1

The first argument of test_bit() and clear_bit() is a bit number and
not a bitmask. Hence change that first argument from (1 << 0) into 0.
This patch avoids that smatch reports the following warnings:

user_sdma.c:1059: sdma_cache_evict() warn: test_bit() takes a bit number
user_sdma.c:1590: sdma_rb_remove() warn: test_bit() takes a bit number

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Fix indentation

Make the indentation of the source code consistent. Detected by
smatch.

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rdmavt: Annotate rvt_reset_qp()

This patch avoids that sparse reports the following warning:

rdmavt/qp.c:507:17: warning: context imbalance in 'rvt_reset_qp' - unexpected unlock

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mad: Fix indentation

Make indentation consistent. Detected by smatch.

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Hal Rosenstock <[email protected]>
Cc: Ira Weiny <[email protected]>
Reviewed-By: Ira Weiny <[email protected]>
Reviewed-by: Hal Rosenstock <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

RDMA/core: Fix indentation

Make indentation consistent. Detected by smatch.

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Tatyana Nikolova <[email protected]>
Cc: Steve Wise <[email protected]>
Reviewed-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/srp: Fix srp_map_sg_dma()

Because patch "IB/srp: Move common code into the caller" was applied
partially srp_map_sg_dma() doesn't work properly. Fix this by
applying the remainder of that patch. See also
http://thread.gmane.org/gmane.linux.drivers.rdma/35803/focus=35811.

Fixes: 3849e44d1c4b ("IB/srp: Move common code into the caller")
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Laurence Oberman <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/srp: Always initialize use_fast_reg and use_fmr

Avoid that mapping fails due to use_fast_reg != 0 or use_fmr != 0
if both member variables should be zero (if never_register == 1 or
if neither FMR nor FR is supported). Remove an initialization that
became superfluous due to changing a kmalloc() into a kzalloc()
call.

Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
Cc: Sagi Grimberg <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Laurence Oberman <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx4: Fix device managed flow steering support test

Perform the test for device managed flow steering support even if
memory windows are not supported. I noticed this because smatch
reported inconsistent indentation for the device managed flow
steering support test.

Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Cc: Yishai Hadas <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/usnic: Remove unused DMA attributes

The DMA attributes are set but never used.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: fix null pointer deref and mem leak in error handling

The current error handling in setup_hw_stats has a couple of issues.
It is possible to generate a null pointer deference on the
kfree of hsag->attrs[i] because two of the early error exit paths
jump to the kfree when hsags NULL and not allocated. Fix this by
moving the kfree on stats and jumping to that, avoiding the hsag
freeing.

Secondly, there is a memory leak of stats if the hsag allocation
fails; instead of returning, jump to the kfree on stats.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: fix an error code in ib_core_init()

We should return the error code if ib_add_ibnl_clients() fails. The
current code returns success.

Fixes: 735c631ae99d ('IB/core: Register SA ibnl client during ib_core initialization')
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Avoid large frame size warning

When CONFIG_FRAME_WARN is set to 1024 bytes, which is useful to find
stack consumers, we get a warning in hfi1 driver.

drivers/infiniband/hw/hfi1/affinity.c: In function
‘hfi1_get_proc_affinity’:
drivers/infiniband/hw/hfi1/affinity.c:415:1: warning: the frame size of
1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This change removes unneeded buf[1024] declaration and usage.

Fixes: f48ad614c100 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: fix some indenting

That extra tabs are misleading.

Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Acked-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

net: Revert vrf-local changes.

This reverts commit 2fb7ea455d57e22110c54fc2de0656b6f744263c.

It results in build errors because ip6_input is not a
symbol exported to modules.

Signed-off-by: David S. Miller <[email protected]>

IB/cm: Fix a recently introduced locking bug

ib_cm_notify() can be called from interrupt context. Hence do not
reenable interrupts unconditionally in cm_establish().

This patch avoids that lockdep reports the following warning:

WARNING: CPU: 0 PID: 23317 at kernel/locking/lockdep.c:2624 trace _hardirqs_on_caller+0x112/0x1b0
DEBUG_LOCKS_WARN_ON(current->hardirq_context)
Call Trace:
<IRQ> [<ffffffff812bd0e5>] dump_stack+0x67/0x92
[<ffffffff81056f21>] __warn+0xc1/0xe0
[<ffffffff81056f8a>] warn_slowpath_fmt+0x4a/0x50
[<ffffffff810a5932>] trace_hardirqs_on_caller+0x112/0x1b0
[<ffffffff810a59dd>] trace_hardirqs_on+0xd/0x10
[<ffffffff815992c7>] _raw_spin_unlock_irq+0x27/0x40
[<ffffffffa0382e9c>] ib_cm_notify+0x25c/0x290 [ib_cm]
[<ffffffffa068fbc1>] srpt_qp_event+0xa1/0xf0 [ib_srpt]
[<ffffffffa04efb97>] mlx4_ib_qp_event+0x67/0xd0 [mlx4_ib]
[<ffffffffa034ec0a>] mlx4_qp_event+0x5a/0xc0 [mlx4_core]
[<ffffffffa03365f8>] mlx4_eq_int+0x3d8/0xcf0 [mlx4_core]
[<ffffffffa0336f9c>] mlx4_msi_x_interrupt+0xc/0x20 [mlx4_core]
[<ffffffff810b0914>] handle_irq_event_percpu+0x64/0x100
[<ffffffff810b09e4>] handle_irq_event+0x34/0x60
[<ffffffff810b3a6a>] handle_edge_irq+0x6a/0x150
[<ffffffff8101ad05>] handle_irq+0x15/0x20
[<ffffffff8101a66c>] do_IRQ+0x5c/0x110
[<ffffffff8159a2c9>] common_interrupt+0x89/0x89
[<ffffffff81297a17>] blk_run_queue_async+0x37/0x40
[<ffffffffa0163e53>] rq_completed+0x43/0x70 [dm_mod]
[<ffffffffa0164896>] dm_softirq_done+0x176/0x280 [dm_mod]
[<ffffffff812a26c2>] blk_done_softirq+0x52/0x90
[<ffffffff8105bc1f>] __do_softirq+0x10f/0x230
[<ffffffff8105bec8>] irq_exit+0xa8/0xb0
[<ffffffff8103653e>] smp_trace_call_function_single_interrupt+0x2e/0x30
[<ffffffff81036549>] smp_call_function_single_interrupt+0x9/0x10
[<ffffffff8159a959>] call_function_single_interrupt+0x89/0x90
<EOI>

Fixes: commit be4b499323bf (IB/cm: Do not queue work to a device that's going away)
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Erez Shitrit <[email protected]>
Cc: Sean Hefty <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: stable <[email protected]> # v4.2+
Acked-by: Erez Shitrit <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF

Commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
missed to add the compat case for the SO_ATTACH_REUSEPORT_CBPF option.

Signed-off-by: Helge Deller <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

soreuseport: Fix reuseport_bpf testcase on 32bit architectures

This fixes the following compiler warnings when compiling the
reuseport_bpf testcase on a 32 bit platform:

reuseport_bpf.c: In function ‘attach_ebpf’:
reuseport_bpf.c:114:15: warning: cast from pointer to integer of ifferent size [-Wpointer-to-int-cast]

Signed-off-by: Helge Deller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'vrf-local'

David Ahern says:

====================
net: vrf: Add support for local traffic to local addresses

Add support for locally originated traffic to VRF-local addresses,
be it addresses on enslaved devices or addresses on the VRF device:

$ ip addr show dev red
33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000
    link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff
    inet 1.1.1.1/32 scope global red
       valid_lft forever preferred_lft forever
    inet6 1111:1::1/128 scope global
       valid_lft forever preferred_lft forever

$ ip addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
       valid_lft forever preferred_lft forever

$ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

$ ping -c1 -I red 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms

$ ping6 -c1 -I red  2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms

--- 2100:1::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms

$ ping6 -c1 -I red 1111::1
PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes
64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms

--- 1111::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms

This change also enables use of loopback address on the VRF device:
$ ip addr add dev red 127.0.0.1/8

$ ping -c1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
====================

Signed-off-by: David S. Miller <[email protected]>

net: vrf: ipv6 support for local traffic to local addresses

Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:

    $ ip addr show dev eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
        link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
        inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 2100:1::1/120 scope global
           valid_lft forever preferred_lft forever
        inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
           valid_lft forever preferred_lft forever

    $ ping6 -c1 -I red 2100:1::1
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vrf: ipv4 support for local traffic to local addresses

Add support for locally originated traffic to VRF-local addresses. If
destination device for an skb is the loopback or VRF device then set
its dst to a local version of the VRF cached dst_entry and call netif_rx
to insert the packet onto the rx queue - similar to what is done for
loopback. This patch handles IPv4 support; follow on patch handles IPv6.

With this patch, ping, tcp and udp packets to a local IPv4 address are
successfully routed:

    $ ip addr show dev eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
        link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
        inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 2100:1::1/120 scope global
           valid_lft forever preferred_lft forever
        inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
           valid_lft forever preferred_lft forever

    $ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

This patch also enables use of IPv4 loopback address on the VRF device:
    $ ip addr add dev red 127.0.0.1/8

    $ ping -c1 -I red 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vrf: Minor refactoring for local address patches

Move the stripping of the ethernet header from is_ip_tx_frame into the
ipv4 and ipv6 outbound functions. If the packet is destined to a local
address the header is retained since the packet is sent back to netif_rx.

Collapse vrf_send_v4_prep into vrf_process_v4_outbound.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drm/nouveau/disp/sor/gm107: training pattern registers are like gm200

Signed-off-by: Ben Skeggs <[email protected]>
Cc: [email protected]

drm/nouveau/disp/sor/gf119: both links use the same training register

It appears that, for whatever reason, both link A and B use the same
register to control the training pattern. It's a little odd, as the
GPUs before this (Tesla/Fermi1) have per-link registers, as do newer
GPUs (Maxwell).

Fixes the third DP output on NVS 510 (GK107).

Signed-off-by: Ben Skeggs <[email protected]>
Cc: [email protected]