Git Repo - linux.git/log

bpf: lru: Cleanup test_lru_map.c

This patch does the following cleanup on test_lru_map.c
1) Fix indentation (Replace spaces by tabs)
2) Remove redundant BPF_F_NO_COMMON_LRU test
3) Simplify some comments

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: lru: Add test_lru_sanity6 for BPF_F_NO_COMMON_LRU

test_lru_sanity3 is not applicable to BPF_F_NO_COMMON_LRU.
It just happens to work when PERCPU_FREE_TARGET == 16.

This patch:
1) Disable test_lru_sanity3 for BPF_F_NO_COMMON_LRU
2) Add test_lru_sanity6 to test list rotation for
the BPF_F_NO_COMMON_LRU map.

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvneta: fix failed to suspend if WOL is enabled

Recently, suspend/resume and WOL support are added into mvneta driver.
If we enable WOL, then we get some error as below on Marvell BG4CT
platforms during suspend:

[ 184.149723] dpm_run_callback(): mdio_bus_suspend+0x0/0x50 returns -16
[ 184.149727] PM: Device f7b62004.mdio-mi:00 failed to suspend: error -16

-16 means -EBUSY, phy_suspend() will return -EBUSY if it finds the
device has WOL enabled.

We fix this issue by properly setting the netdev's power.can_wakeup
and power.wakeup, i.e

1. in mvneta_mdio_probe(), call device_set_wakeup_capable() to set
power.can_wakeup if the phy support WOL.

2. in mvneta_ethtool_set_wol(), call device_set_wakeup_enable() to
set power.wakeup if WOL has been successfully enabled in phy.

Signed-off-by: Jisheng Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bridge: notify on hw fdb takeover

Recently we added support for SW fdbs to take over HW ones, but that
results in changing a user-visible fdb flag thus we need to send a
notification, also it's consistent with how HW takes over SW entries.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mediatek-tx-bugs'

Sean Wang says:

====================
mediatek: Fix crash caused by reporting inconsistent skb->len to BQL

Changes since v1:
- fix inconsistent enumeration which easily causes the potential bug

The series fixes kernel BUG caused by inconsistent SKB length reported
into BQL. The reason for inconsistent length comes from hardware BUG which
results in different port number carried on the TXD within the lifecycle of
SKB. So patch 2) is proposed for use a software way to track which port
the SKB involving instead of hardware way. And patch 1) is given for another
issue I found which causes TXD and SKB inconsistency that is not expected
in the initial logic, so it is also being corrected it in the series.

The log for the kernel BUG caused by the issue is posted as below.

[  120.825955] kernel BUG at ... lib/dynamic_queue_limits.c:26!
[  120.837684] Internal error: Oops - BUG: 0 [#1] SMP ARM
[  120.842778] Modules linked in:
[  120.845811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1-191576-gdbcef47 #35
[  120.853488] Hardware name: Mediatek Cortex-A7 (Device Tree)
[  120.859012] task: c1007480 task.stack: c1000000
[  120.863510] PC is at dql_completed+0x108/0x17c
[  120.867915] LR is at 0x46
[  120.870512] pc : [<c03c19c8>]    lr : [<00000046>]    psr: 80000113
[  120.870512] sp : c1001d58  ip : c1001d80  fp : c1001d7c
[  120.881895] r10: 0000003e  r9 : df6b3400  r8 : 0ed86506
[  120.887075] r7 : 00000001  r6 : 00000001  r5 : 0ed8654c  r4 : df0135d8
[  120.893546] r3 : 00000001  r2 : df016800  r1 : 0000fece  r0 : df6b3480
[  120.900018] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  120.907093] Control: 10c5387d  Table: 9e27806a  DAC: 00000051
[  120.912789] Process swapper/0 (pid: 0, stack limit = 0xc1000218)
[  120.918744] Stack: (0xc1001d58 to 0xc1002000)

....

121.085331] 1fc0: 00000000 c0a52a28 00000000 c10855d4 c1003c58 c0a52a24 c100885c 8000406a
[  121.093444] 1fe0: 410fc073 00000000 00000000 c1001ff8 8000807c c0a009cc 00000000 00000000
[  121.101575] [<c03c19c8>] (dql_completed) from [<c04cb010>] (mtk_napi_tx+0x1d0/0x37c)
[  121.109263] [<c04cb010>] (mtk_napi_tx) from [<c05e28cc>] (net_rx_action+0x24c/0x3b8)
[  121.116951] [<c05e28cc>] (net_rx_action) from [<c010152c>] (__do_softirq+0xe4/0x35c)
[  121.124638] [<c010152c>] (__do_softirq) from [<c012a624>] (irq_exit+0xe8/0x150)
[  121.131895] [<c012a624>] (irq_exit) from [<c017750c>] (__handle_domain_irq+0x70/0xc4)
[  121.139666] [<c017750c>] (__handle_domain_irq) from [<c0101404>] (gic_handle_irq+0x58/0x9c)
[  121.147953] [<c0101404>] (gic_handle_irq) from [<c010e18c>] (__irq_svc+0x6c/0x90)
[  121.155373] Exception stack(0xc1001ef8 to 0xc1001f40)
====================

Signed-off-by: David S. Miller <[email protected]>

net: ethernet: mediatek: fix inconsistency of port number carried in TXD

Fix port inconsistency on TXD due to hardware BUG that would cause
different port number is carried on the same TXD between tx_map()
and tx_unmap() with the iperf test. It would cause confusing BQL
logic which leads to kernel panic when dual GMAC runs concurrently.

Signed-off-by: Sean Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: mediatek: fix inconsistency between TXD and the used buffer

Fix inconsistency between the TXD descriptor and the used buffer that
would cause unexpected logic at mtk_tx_unmap() during skb housekeeping.

Signed-off-by: Sean Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: micrel: fix crash when statistic requested for KSZ9031 phy

Now the command:
ethtool --phy-statistics eth0
will cause system crash with meassage "Unable to handle kernel NULL pointer
dereference at virtual address 00000010" from:

(kszphy_get_stats) from [<c069f1d8>] (ethtool_get_phy_stats+0xd8/0x210)
(ethtool_get_phy_stats) from [<c06a0738>] (dev_ethtool+0x5b8/0x228c)
(dev_ethtool) from [<c06b5484>] (dev_ioctl+0x3fc/0x964)
(dev_ioctl) from [<c0679f7c>] (sock_ioctl+0x170/0x2c0)
(sock_ioctl) from [<c02419d4>] (do_vfs_ioctl+0xa8/0x95c)
(do_vfs_ioctl) from [<c02422c4>] (SyS_ioctl+0x3c/0x64)
(SyS_ioctl) from [<c0107d60>] (ret_fast_syscall+0x0/0x44)

The reason: phy_driver structure for KSZ9031 phy has no .probe() callback
defined. As result, struct phy_device *phydev->priv pointer will not be
initializes (null).
This issue will affect also following phys:
KSZ8795, KSZ886X, KSZ8873MLL, KSZ9031, KSZ9021, KSZ8061, KS8737

Fix it by:
- adding .probe() = kszphy_probe() callback to KSZ9031, KSZ9021
phys. The kszphy_probe() can be re-used as it doesn't do any phy specific
settings.
- removing statistic callbacks from other phys (KSZ8795, KSZ886X,
KSZ8873MLL, KSZ8061, KS8737) as they doesn't have corresponding
statistic counters.

Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
Signed-off-by: Grygorii Strashko <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

kcm: remove a useless copy_from_user()

struct kcm_clone only contains fd, and kcm_clone() only
writes this struct, so there is no need to copy it from user.

Cc: Tom Herbert <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule

Only need 1 l3mdev FIB rule. Fix setting NLM_F_EXCL in the nlmsghdr.

Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create")
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

MAINTAINERS: rename TC entry and add couple of header files

The section is not specific only to "TC classifiers", but applies to the
whole TC subsystem. Also, add couple of forgotten headers.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: simplify phy_supported_speeds()

Simplify the loop in phy_supported_speeds().

Signed-off-by: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: improve phylib correctness for non-autoneg settings

phylib has some undesirable behaviour when forcing a link mode through
ethtool.  phylib uses this code:

idx = phy_find_valid(phy_find_setting(phydev->speed, phydev->duplex),
features);

to find an index in the settings table.  phy_find_setting() starts at
index 0, and scans upwards looking for an exact speed and duplex match.
When it doesn't find it, it returns MAX_NUM_SETTINGS - 1, which is
10baseT-Half duplex.

phy_find_valid() then scans from the point (and effectively only checks
one entry) before bailing out, returning MAX_NUM_SETTINGS - 1.

phy_sanitize_settings() then sets ->speed to SPEED_10 and ->duplex to
DUPLEX_HALF whether or not 10baseT-Half is supported or not.  This goes
against all the comments against these functions, and 10baseT-Half may
not even be supported by the hardware.

Rework these functions, introducing a new method of scanning the table.
There are two modes of lookup that phylib wants: exact, and inexact.

- in exact mode, we return either an exact match or failure
- in inexact mode, we return an exact match if it exists, a match at
  the highest speed that is not greater than the requested speed
  (ignoring duplex), or failing that, the lowest supported speed, or
  failure.

The biggest difference is that we always check whether the entry is
supported before further consideration, so all unsupported entries are
not considered as candidates.

This results in arguably saner behaviour, better matches the comments,
and is probably what users would expect.

This becomes important as ethernet speeds increase, PHYs exist which do
not support the 10Mbit speeds, and half-duplex is likely to become
obsolete - it's already not even an option on 10Gbit and faster links.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Subject: net: allow configuring default qdisc

Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.

The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.

Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qed-arfs'

Manish Chopra says:

====================
qed/qede: aRFS support

This series adds support for Accelerated Flow Steering
in qede driver for TCP/UDP over IPv4/IPv6 protocols.

Please consider applying this series to "net-next"
====================

Signed-off-by: David S. Miller <[email protected]>

qede: Add aRFS support

This patch adds support for aRFS for TCP and UDP
protocols with IPv4/IPv6.

Signed-off-by: Manish Chopra <[email protected]>
Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: aRFS infrastructure support

This patch adds necessary APIs to interface with
qede aRFS support in successive patch.

It also reserves separate PTT entry for aRFS,
[as being in fastpath flow] for hardware access instead of
trying to acquire it at run time from the ptt pool.

Signed-off-by: Manish Chopra <[email protected]>
Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

smsc95xx: Add comments to the registers definition

This chip is used by a lot of embedded devices and also by the Raspberry
Pi 1, 2 & 3 which were created to promote the study of computer
sciences. Students wanting to learn kernel / network device driver
programming through those devices can only rely on the Linux kernel
driver source to make their own.

This commit adds a lot of comments to the registers definition to expand
the register names.

Cc: Steve Glendinning <[email protected]>
Cc: Microchip Linux Driver Support <[email protected]>
CC: David Miller <[email protected]>
Signed-off-by: Martin Wetterwald <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Acked-by: Steve Glendinning <[email protected]>
Acked-by: Woojung Huh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Fix set_max_bgx_per_node for 81xx rgx

Add the PCI_SUBSYS_DEVID_81XX_RGX and use the same to set
the max bgx per node count.

This fixes the issue intoduced by following commit
78aacb6f6 net: thunderx: Fix invalid mac addresses for node1 interfaces
With this commit the max_bgx_per_node for 81xx is set as 2 instead of 3
because of which num_vfs is always calculated as zero.

Signed-off-by: George Cherian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: device MTU setup, tunnel socket needs a lock

The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.

Reported-by: Guillaume Nault <[email protected]>
Tested-by: Guillaume Nault <[email protected]>
Signed-off-by: R. Parameswaran <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net-timestamp: avoid use-after-free in ip_recv_error

Syzkaller reported a use-after-free in ip_recv_error at line

info->ipi_ifindex = skb->dev->ifindex;

This function is called on dequeue from the error queue, at which
point the device pointer may no longer be valid.

Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
pointer is valid or NULL. Store it in temporary storage skb->cb.

It is safe to reference skb->dev here, as called from device drivers
or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
in that case it is NULL and ifindex is set to 0 (invalid).

Do not return a pktinfo cmsg if ifindex is 0. This maintains the
current behavior of not returning a cmsg if skb->dev was NULL.

On dequeue, the ipv4 path will cast from sock_exterr_skb to
in_pktinfo. Both have ifindex as their first element, so no explicit
conversion is needed. This is by design, introduced in commit
0b922b7a829c ("net: original ingress device index in PKTINFO"). For
ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.

Fixes: 829ae9d61165 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
Reported-by: Andrey Konovalov <[email protected]>
Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: fix a deadlock in ip_ra_control

Similar to commit 87e9f0315952
("ipv4: fix a potential deadlock in mcast getsockopt() path"),
there is a deadlock scenario for IP_ROUTER_ALERT too:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(sk_lock-AF_INET);
                               lock(rtnl_mutex);
  lock(sk_lock-AF_INET);

Fix this by always locking RTNL first on all setsockopt() paths.

Note, after this patch ip_ra_lock is no longer needed either.

Reported-by: Dmitry Vyukov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv6: send unsolicited NA on admin up

ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
set to 1, gratuitous arp requests are sent when the device is brought up.
The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
event; add it.

Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer address change")
Signed-off-by: David Ahern <[email protected]>
Acked-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mlx5-RDMA-netdevice'

Saeed Mahameed says:

====================
Mellanox, mlx5 RDMA net device support

This series provides the lower level mlx5 support of RDMA netdevice
creation API [1] suggested and introduced by Intel's HFI OPA VNIC
netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.

mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
IPoIB ULP generic netdevice, providing:
- mlx5 RSS support.
- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
- Full mlx5 HW features transparent to the ULP itself.

The idea here is to reuse and benefit from the already implemented mlx5e netdevice
management and channels API for both etherent and RDMA netdevices, since both IPoIB
and Ethernet netdevices share same common mlx5 HW resources (with some small
exceptions) and share most of the control/data path logic, it is more natural to
have them share the same code.

The differences between IPoIB and Ethernet netdevices can be summarized to:

Steering:
In mlx5, IPoIB traffic is sent and received from an underlay special QP, and in Ethernet
the traffic is handled by vports and vport steering is managed by e-switch or FW.

For IPoIB traffic to get steered correctly the only thing we need to do is to create RSS
HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with the underlay QP attached to
them (underlay QP will be 0 in case of Ethernet).

RX,TX:
Since IPoIB traffic is different, slightly modified RX and TX handlers are required,
still we do some code reuse in data path via common helper functions.

All of the other generic netdevice and mlx5 aspects will be shared between mlx5 Ethernet
and IPoIB netdevices, e.g.
- Channels creation and handling (RQs,SQs,CQs, NAPI, interrupt moderation, etc..)
- Offloads, checksum, GRO, LRO, TSO, and more.
- netdevice logic and non Ethernet specific ndos (open/close, etc..)

In order to achieve what we want:

In patchet 1 to 3, Erez added the supported for underlay QP in mlx5_ifc and refactored
the mlx5 steering code to accept the underlay QP as a parameter for creating steering
objects and enabled flow steering for IB link.

Then we are going to use the mlx5e netdevice profile, which is already used to separate between
NIC and VF representors netdevices, to create new type of IPoIB netdevice profile.

For that, one small refactoring is required to make mlx5e netdevice profile management
more genetic and agnostic to link type which is done in patch #4.

In patch #5, we introduce ipoib.c to host all of mlx5 IPoIB (mlx5i) specific logic and a
skeleton for the IPoIB mlx5 netdevice profile, and we will start filling it in next patches,
using mlx5e already existing APIs.

Patch #6 and #7, Implement init/cleanup RX mlx5i netdev profile handlers to create mlx5 RSS
resources, same as mlx5e but without vlan and L2 steering tables.

Patch #8, Implement init/cleanup TX mlx5i netdev profile handlers, to create TX resources
same as mlx5e but with one TC (tc = 0) support.

Patch #9, Implement mlx5i open/close ndos, where we reuese the mlx5e channels API, to start/stop TX/RX channels.

Patch #10, Create the underlay QP and attach it to mlx5i RSS and TX HW contexts.

Patch #11 and #12, Break down the mlx5e xmit flow into smaller helper function and implement the
mlx5i IPoIB xmit routine.

Patch #13 and #14, Have an RX handler per netdevice profile. We already do this before this series
in a non clean way to separate between NIC netdev and VF representor RX handlers, in patch 13 we make
the RX handler generic and bound to a profile and in patch 14 we implement the IPoIB RX handlers.

Patch #15, Small cleanup to avoid e-switch with IPoIB netdev.

In order to enable mlx5 IPoIB, a merge between the IPoIB RDMA netdev offolad support [3]
- which was alread submitted to the rdma mailing list - and this series is required
plus an extra small patch [4] which will connect between both sides and actually enables the offload.

Once both patch-sets are merged into linux we will have to submit the extra small patch [4], to enable
the feature.

Thanks,
Saeed.

[1] https://patchwork.kernel.org/patch/9676637/

[2] https://lwn.net/Articles/715453/
https://patchwork.kernel.org/patch/9587815/

[3] https://patchwork.kernel.org/patch/9672069/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?id=0141db6a686e32294dee015b7d07706162ba48d8
====================

Signed-off-by: David S. Miller <[email protected]>

hw/mlx5: Add New bit to check over QP creation

Add check for bit IB_QP_CREATE_NETIF_QP while creating QP.

Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: E-switch vport manager is valid for ethernet only

Currently the driver support only ethernet eswitch, and we want to
protect downstream IPoIB netdev from trying to access it in IB link.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, RX handler

Implement IPoIB RX SKB handler.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: RX handlers per netdev profile

In order to have different RX handler per profile, fix and refactor the
current code to take the rx handler directly from the netdevice profile
rather than computing it on runtime as it was done with the switchdev
mode representor rx handler.

This will also remove the current wrong assumption in mlx5e_alloc_rq
code that mlx5e_priv->ppriv is of the type vport_rep.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, Xmit flow

Implement mlx5e's IPoIB SKB transmit using the helper functions provided
by mlx5e ethernet tx flow, the only difference in the code between
mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
(UD datagram segment) in the TX descriptor (WQE) and it doesn't need to
have any vlan handling.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: Xmit flow break down

Break current mlx5e xmit flow into smaller blocks (helper functions)
in order to reuse them for IPoIB SKB transmission.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, Underlay QP

Create IPoIB underlay QP needed by the IPoIB netdevice profile for RSS
and TX HW context to perform on IPoIB traffic.

Reset the underlay QP on dev_uninit ndo to stop IPoIB traffic going
through this QP when the ULP IPoIB decides to cleanup.

Implement attach/detach mcast RDMA netdev callbacks for later RDMA
netdev use.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, Basic netdev ndos open/close

Implement open/close of IPoIB netdevice ndos using mlx5e's
channels API to manage data path resources (RQs/SQs/CQs).

Set IPoIB netdev address on dev_init ndo.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, TX TIS creation

Modify mlx5e tis creation function to accept underlay qp number, which
will be needed by IPoIB.

Implement mlx5i (IPoIB) tx init/cleanup netdevice profile flows to
create one TIS with the IPoIB underlay qp, for IPoIB TX SQs.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, RSS flow steering tables

Like the mlx5e ethernet mode, on IPoIB mode we need to create RX steering
tables, but IPoIB do not require MAC and VLAN steering tables so the
only tables we create in here are:
1. TTC Table (Traffic Type Classifier table for RSS steering)
2. ARFS Table (for accelerated RFS support)

Creation of those tables is identical to mlx5e ethernet mode, hence the
use of mlx5e_create_ttc_table and mlx5e_arfs_create_tables.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs

Implement IPoIB RX RSS (RQTs and TIRs) HW objects creation,
All we do here is simply reuse the mlx5e implementation to create
direct and indirect (RSS) steering HW objects.

For that we just expose
mlx5e_{create,destroy}_{direct,indirect}_{rqt,tir} functions into en.h
and call them from ipoib.c in init/cleanup_rx IPoIB netdevice profile
callbacks.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: IPoIB, Add netdevice profile skeleton

Create mlx5e IPoIB netdevice profile skeleton in the new ipoib.c
file with empty implementation.

Downstream patches will provide the full mlx5 rdma netdevice acceleration
support for IPoIB into this new file, by using the mlx5e netdevice
profile and new mlx5_channels APIs and infrastructures.
Same as already done in mlx5e NIC netdevice and switchdev mode VF
representors.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: More generic netdev management API

In preparation for mlx5e RDMA net_device support, here we generalize
mlx5e_attach/detach in a way that those functions will be agnostic
to link type. For that we move ethernet specific NIC net device logic out
of those functions into {nic,rep}_{enable/disable} mlx5e NIC and
representor profiles callbacks.

Also some of the logic was moved only to NIC profile since it is not right
to have this logic for representor net device (e.g. set port MTU).

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: Enable flow-steering for IB link

Get the relevant capabilities if supports ipoib_enhanced_offloads and
init the flow steering table accordingly.

Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: Refactor create flow table method to accept underlay QP

IB flow tables need the underlay qp to perform flow steering.
Here we change the API of the flow tables creation to accept the
underlay QP number as a parameter in order to support IB (IPoIB) flow
steering.

Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: Add IPoIB enhanced offloads bits to mlx5_ifc

New capability bit: ipoib_enhanced_offloads, indicates new ability for UD
QP to do RSS and enhanced IPoIB offloads and acceleration.

Add underlay_qpn to the TIS and flow_table objects In order to support
SET_ROOT command, to connect between IPoIB QPs and flow steering tables.

Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hv_netvsc: Exclude non-TCP port numbers from vRSS hashing

Azure hosts are not supporting non-TCP port numbers in vRSS hashing for
now. For example, UDP packet loss rate will be high if port numbers are
also included in vRSS hash.

So, we created this patch to use only IP numbers for hashing in non-TCP
traffic.

Signed-off-by: Haiyang Zhang <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hv_netvsc: Fix the queue index computation in forwarding case

If the outgoing skb has a RX queue mapping available, we use the queue
number directly, other than put it through Send Indirection Table.

Signed-off-by: Haiyang Zhang <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: isolate legacy code

This patch moves as is the legacy DSA code from dsa.c to legacy.c,
except the few shared symbols which remain in dsa.c.

Signed-off-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: limit the number of receive queues

The number of rx queues is determined by the rss_cpus parameter
or the cpu topology. If that is higher than EFX_MAX_RX_QUEUES the
driver can corrupt state.

Fixes: 8ceee660aacb ("New driver "sfc" for Solarstorm SFC4000 controller.")
Signed-off-by: Bert Kenward <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Linux 4.11-rc7

Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:
"Again, a batch that's been sitting a couple of weeks, mostly because
  I anticipated a bit more material but it didn't show up -- which is
  good.

  These are all your garden variety fixes for ARM platforms.

  The most visible issue fixed here is probably the SMP reset issue on
  OMAP, the rest are minor stuff"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: allwinner: a64: add pmu0 regs for USB PHY
  ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
  reset: add exported __reset_control_get, return NULL if optional
  ARM: orion5x: only call into phylib when available
  ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
  ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
  ARM: dts: ti: fix PCI bus dtc warnings
  ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
  ARM: dts: OMAP3: Fix MFG ID EEPROM
  ARM: sun8i: a33: add operating-points-v2 property to all nodes
  ARM: sun8i: a33: remove highest OPP to fix CPU crashes

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Four small fixes.

  Three of them fix the same error in NVMe, in loop, fc, and rdma
  respectively.  The last fix from Ming fixes a regression in this
  series, where our bvec gap logic was wrong and causes an oops on
  NVMe for certain conditions"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix bio_will_gap() for first bvec with offset
  nvme-fc: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-rdma: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-loop: Fix sqsize wrong assignment based on ctrl MQES capability

Merge tag 'omap-for-v4.11/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes

Regression fix for omap interconnect code for deferred probe.
Without this fix we can get PM related warnings for devices that
use deferred probe. If necessary, this fix can wait for the
v4.12 merge window no problem.

* tag 'omap-for-v4.11/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
  ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
  ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
  ARM: dts: ti: fix PCI bus dtc warnings
  ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
  ARM: dts: OMAP3: Fix MFG ID EEPROM

Signed-off-by: Olof Johansson <[email protected]>

Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:
"Unfortunately, the commit to fix the cgroup mount race in the previous
  pull request can lead to hangs.

  The original bug has been around for a while and isn't too likely to
  be triggered in usual use cases. Revert the commit for now"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: avoid attaching a cgroup root to two different superblocks"

Merge tag 'tty-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty fix from Greg KH:
"Here is a single tty core revert for a patch that was reported to
  cause problems.

  The original issue is one that we have lived with for decades, so
  trying to scramble to fix the fix in time for 4.11-final does not make
  sense due to the fragility of the tty ldisc layer. Just reverting it
  makes sense for now"

* tag 'tty-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "tty: don't panic on OOM in tty_set_ldisc()"

Merge tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
"While rewriting the function probe code, I stumbled over a long
  standing bug. This bug has been there sinc function tracing was added
  way back when. But my new development depends on this bug being fixed,
  and it should be fixed regardless as it causes ftrace to disable
  itself when triggered, and a reboot is required to enable it again.

  The bug is that the function probe does not disable itself properly if
  there's another probe of its type still enabled. For example:

     # cd /sys/kernel/debug/tracing
     # echo schedule:traceoff > set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter
     # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
     # echo do_IRQ:traceoff > set_ftrace_filter

  The above registers two traceoff probes (one for schedule and one for
  do_IRQ, and then removes do_IRQ.

  But since there still exists one for schedule, it is not done
  properly. When adding do_IRQ back, the breakage in the accounting is
  noticed by the ftrace self tests, and it causes a warning and disables
  ftrace"

* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix removing of second function probe

Revert "cgroup: avoid attaching a cgroup root to two different superblocks"

This reverts commit bfb0b80db5f9dca5ac0a5fd0edb765ee555e5a8e.

Andrei reports CRIU test hangs with the patch applied. The bug fixed
by the patch isn't too likely to trigger in actual uses. Revert the
patch for now.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Andrei Vagin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]

parisc: Fix get_user() for 64-bit value on 32-bit kernel

This fixes a bug in which the upper 32-bits of a 64-bit value which is
read by get_user() was lost on a 32-bit kernel.
While touching this code, split out pre-loading of %sr2 space register
and clean up code indent.

Cc: <[email protected]> # v4.9+
Signed-off-by: Helge Deller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <[email protected]>

Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull nvdimm fixes from Dan Williams:
"A small crop of lockdep, sleeping while atomic, and other fixes /
  band-aids in advance of the full-blown reworks targeting the next
  merge window. The largest change here is "libnvdimm: fix blk free
  space accounting" which deletes a pile of buggy code that better
  testing would have caught before merging. The next change that is
  borderline too big for a late rc is switching the device-dax locking
  from rcu to srcu, I couldn't think of a smaller way to make that fix.

  The __copy_user_nocache fix will have a full replacement in 4.12 to
  move those pmem special case considerations into the pmem driver. The
  "libnvdimm: band aid btt vs clear poison locking" commit admits that
  our error clearing support for btt went in broken, so we just disable
  it in 4.11 and -stable. A replacement / full fix is in the pipeline
  for 4.12

  Some of these would have been caught earlier had DEBUG_ATOMIC_SLEEP
  been enabled on my development station. I wonder if we should have:

      config DEBUG_ATOMIC_SLEEP
        default PROVE_LOCKING

  ...since I mistakenly thought I got both with PROVE_LOCKING=y.

  These have received a build success notification from the 0day robot,
  and some have appeared in a -next release with no reported issues"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
  device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation
  libnvdimm: band aid btt vs clear poison locking
  libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
  libnvdimm: fix blk free space accounting
  acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"This is seven small fixes which are all for user visible issues that
  fortunately only occur in rare circumstances.

  The most serious is the sr one in which QEMU can cause us to read
  beyond the end of a buffer (I don't think it's exploitable, but just
  in case).

  The next is the sd capacity fix which means all non 512 byte sector
  drives greater than 2TB fail to be correctly sized.

  The rest are either in new drivers (qedf) or on error legs"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ipr: do not set DID_PASSTHROUGH on CHECK CONDITION
  scsi: aacraid: fix PCI error recovery path
  scsi: sd: Fix capacity calculation with 32-bit sector_t
  scsi: qla2xxx: Add fix to read correct register value for ISP82xx.
  scsi: qedf: Fix crash due to unsolicited FIP VLAN response.
  scsi: sr: Sanity check returned mode data
  scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable

Merge branch 'parisc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fix from Helge Deller:
"Mikulas Patocka fixed a few bugs in our new pa_memcpy() assembler
  function, e.g. one bug made the kernel unbootable if source and
  destination address are the same"

* 'parisc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: fix bugs in pa_memcpy

orangefs: free superblock when mount fails

Otherwise lockdep says:

[ 1337.483798] ================================================
[ 1337.483999] [ BUG: lock held when returning to user space! ]
[ 1337.484252] 4.11.0-rc6 #19 Not tainted
[ 1337.484423] ------------------------------------------------
[ 1337.484626] mount/14766 is leaving the kernel with locks still held!
[ 1337.484841] 1 lock held by mount/14766:
[ 1337.485017] #0: (&type->s_umount_key#33/1){+.+.+.}, at: [<ffffffff8124171f>] sget_userns+0x2af/0x520

Caught by xfstests generic/413 which tried to mount with the unsupported
mount option dax. Then xfstests generic/422 ran sync which deadlocks.

Signed-off-by: Martin Brandenburg <[email protected]>
Acked-by: Mike Marshall <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

vfs: don't do RCU lookup of empty pathnames

Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Vegard Nossum <[email protected]>
Acked-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

parisc: fix bugs in pa_memcpy

The patch 554bfeceb8a22d448cd986fc9efce25e833278a1 ("parisc: Fix access
fault handling in pa_memcpy()") reimplements the pa_memcpy function.
Unfortunatelly, it makes the kernel unbootable. The crash happens in the
function ide_complete_cmd where memcpy is called with the same source
and destination address.

This patch fixes a few bugs in pa_memcpy:

* When jumping to .Lcopy_loop_16 for the first time, don't skip the
  instruction "ldi 31,t0" (this bug made the kernel unbootable)
* Use the COND macro when comparing length, so that the comparison is
  64-bit (a theoretical issue, in case the length is greater than
  0xffffffff)
* Don't use the COND macro after the "extru" instruction (the PA-RISC
  specification says that the upper 32-bits of extru result are undefined,
  although they are set to zero in practice)
* Fix exception addresses in .Lcopy16_fault and .Lcopy8_fault
* Rename .Lcopy_loop_4 to .Lcopy_loop_8 (so that it is consistent with
  .Lcopy8_fault)

Cc: <[email protected]> # v4.9+
Fixes: 554bfeceb8a2 ("parisc: Fix access fault handling in pa_memcpy()")
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"Just a small update to xpad driver to recognize yet another gamepad,
  and another change making sure userio.h is exported"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xpad - add support for Razer Wildcat gamepad
  uapi: add missing install of userio.h

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Things seem to be settling down as far as networking is concerned,
  let's hope this trend continues...

   1) Add iov_iter_revert() and use it to fix the behavior of
      skb_copy_datagram_msg() et al., from Al Viro.

   2) Fix the protocol used in the synthetic SKB we cons up for the
      purposes of doing a simulated route lookup for RTM_GETROUTE
      requests. From Florian Larysch.

   3) Don't add noop_qdisc to the per-device qdisc hashes, from Cong
      Wang.

   4) Don't call netdev_change_features with the team lock held, from
      Xin Long.

   5) Revert TCP F-RTO extension to catch more spurious timeouts because
      it interacts very badly with some middle-boxes. From Yuchung
      Cheng.

   6) Fix the loss of error values in l2tp {s,g}etsockopt calls, from
      Guillaume Nault.

   7) ctnetlink uses bit positions where it should be using bit masks,
      fix from Liping Zhang.

   8) Missing RCU locking in netfilter helper code, from Gao Feng.

   9) Avoid double frees and use-after-frees in tcp_disconnect(), from
      Eric Dumazet.

  10) Don't do a changelink before we register the netdevice in
      bridging, from Ido Schimmel.

  11) Lock the ipv6 device address list properly, from Rabin Vincent"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
  netfilter: ipt_CLUSTERIP: Fix wrong conntrack netns refcnt usage
  netfilter: nft_hash: do not dump the auto generated seed
  drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201
  ipv6: Fix idev->addr_list corruption
  net: xdp: don't export dev_change_xdp_fd()
  bridge: netlink: register netdevice before executing changelink
  bridge: implement missing ndo_uninit()
  bpf: reference may_access_skb() from __bpf_prog_run()
  tcp: clear saved_syn in tcp_disconnect()
  netfilter: nf_ct_expect: use proper RCU list traversal/update APIs
  netfilter: ctnetlink: skip dumping expect when nfct_help(ct) is NULL
  netfilter: make it safer during the inet6_dev->addr_list traversal
  netfilter: ctnetlink: make it safer when checking the ct helper name
  netfilter: helper: Add the rcu lock when call __nf_conntrack_helper_find
  netfilter: ctnetlink: using bit to represent the ct event
  netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
  net: tcp: Increase TCP_MIB_OUTRSTS even though fail to alloc skb
  l2tp: don't mask errors in pppol2tp_getsockopt()
  l2tp: don't mask errors in pppol2tp_setsockopt()
  tcp: restrict F-RTO to work-around broken middle-boxes
  ...

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"A set of small fixes for x86:

   - fix locking in RDT to prevent memory leaks and freeing in use
     memory

   - prevent setting invalid values for vdso32_enabled which cause
     inconsistencies for user space resulting in application crashes.

   - plug a race in the vdso32 code between fork and sysctl which causes
     inconsistencies for user space resulting in application crashes.

   - make MPX signal delivery work in compat mode

   - make the dmesg output of traps and faults readable again"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/intel_rdt: Fix locking in rdtgroup_schemata_write()
  x86/debug: Fix the printk() debug output of signal_fault(), do_trap() and do_general_protection()
  x86/vdso: Plug race between mapping and ELF header setup
  x86/vdso: Ensure vdso32_enabled gets set to valid values only
  x86/signals: Fix lower/upper bound reporting in compat siginfo

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
"Two small fixes for perf:

   - the move to support cross arch annotation introduced per arch
     initialization requirements, fullfill them for s/390 (Christian
     Borntraeger)

   - add the missing initialization to the LBR entries to avoid exposing
     random or stale data"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
  perf annotate s390: Fix perf annotate error -95 (4.10 regression)

Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
"The irq department provides:

   - two fixes for the CPU affinity spread infrastructure to prevent
     unbalanced spreading in corner cases which leads to horrible
     performance, because interrupts are rather aggregated than spread

   - add a missing spinlock initializer in the imx-gpcv2 init code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-imx-gpcv2: Fix spinlock initialization
  irq/affinity: Fix extra vecs calculation
  irq/affinity: Fix CPU spread for unbalanced nodes

Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Thomas Gleixner:
"Three fixes from EFI land:

   - prevent accessing a Graphic Output Device (GOP) which the kernel
     does not know to handle

   - prevent PCI reconfiguration to modify a BAR which covers the
     framebuffer because that's already in use through the EFI GOP
     interface

   - avoid reserving EFI runtime regions as this results in bogus memory
     mappings"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Don't try to reserve runtime regions
  efi/fb: Avoid reconfiguration of BAR that covers the framebuffer
  efi/libstub: Skip GOP with PIXEL_BLT_ONLY format

Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs fixes from Chris Mason:
"Dave Sterba collected a few more fixes for the last rc.

  These aren't marked for stable, but I'm putting them in with a batch
  were testing/sending by hand for this release"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix potential use-after-free for cloned bio
  Btrfs: fix segmentation fault when doing dio read
  Btrfs: fix invalid dereference in btrfs_retry_endio
  btrfs: drop the nossd flag when remounting with -o ssd

Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6

Pull more CIFS fixes from Steve French:
"As promised, here is the remaining set of cifs/smb3 fixes for stable
  (and a fix for one regression) now that they have had additional
  review and testing"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix SMB3 mount without specifying a security mechanism
  CIFS: store results of cifs_reopen_file to avoid infinite wait
  CIFS: remove bad_network_name flag
  CIFS: reconnect thread reschedule itself
  CIFS: handle guest access errors to Windows shares
  CIFS: Fix null pointer deref during read resp processing

ftrace: Fix removing of second function probe

When two function probes are added to set_ftrace_filter, and then one of
them is removed, the update to the function locations is not performed, and
the record keeping of the function states are corrupted, and causes an
ftrace_bug() to occur.

This is easily reproducable by adding two probes, removing one, and then
adding it back again.

# cd /sys/kernel/debug/tracing
# echo schedule:traceoff > set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
# echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter

Causes:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220
Modules linked in: [...]
CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
Call Trace:
  dump_stack+0x68/0x9f
  __warn+0x111/0x130
  ? trace_irq_work_interrupt+0xa0/0xa0
  warn_slowpath_null+0x1d/0x20
  ftrace_get_addr_curr+0x143/0x220
  ? __fentry__+0x10/0x10
  ftrace_replace_code+0xe3/0x4f0
  ? ftrace_int3_handler+0x90/0x90
  ? printk+0x99/0xb5
  ? 0xffffffff81000000
  ftrace_modify_all_code+0x97/0x110
  arch_ftrace_update_code+0x10/0x20
  ftrace_run_update_code+0x1c/0x60
  ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0
  register_ftrace_function_probe+0x4b6/0x590
  ? ftrace_startup+0x310/0x310
  ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30
  ? update_stack_state+0x88/0x110
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? preempt_count_sub+0x18/0xd0
  ? mutex_lock_nested+0x104/0x800
  ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
  ? __unwind_start+0x1c0/0x1c0
  ? _mutex_lock_nest_lock+0x800/0x800
  ftrace_trace_probe_callback.isra.3+0xc0/0x130
  ? func_set_flag+0xe0/0xe0
  ? __lock_acquire+0x642/0x1790
  ? __might_fault+0x1e/0x20
  ? trace_get_user+0x398/0x470
  ? strcmp+0x35/0x60
  ftrace_trace_onoff_callback+0x48/0x70
  ftrace_regex_write.isra.43.part.44+0x251/0x320
  ? match_records+0x420/0x420
  ftrace_filter_write+0x2b/0x30
  __vfs_write+0xd7/0x330
  ? do_loop_readv_writev+0x120/0x120
  ? locks_remove_posix+0x90/0x2f0
  ? do_lock_file_wait+0x160/0x160
  ? __lock_is_held+0x93/0x100
  ? rcu_read_lock_sched_held+0x5c/0xb0
  ? preempt_count_sub+0x18/0xd0
  ? __sb_start_write+0x10a/0x230
  ? vfs_write+0x222/0x240
  vfs_write+0xef/0x240
  SyS_write+0xab/0x130
  ? SyS_read+0x130/0x130
  ? trace_hardirqs_on_caller+0x182/0x280
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7fe61c157c30
RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30
RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001
RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700
R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400
R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c
  ? trace_hardirqs_off_caller+0xc0/0x110
---[ end trace 99fa09b3d9869c2c ]---
Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150)

Cc: [email protected]
Fixes: 59df055f1991 ("ftrace: trace different functions with a different tracer")
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

block: fix bio_will_gap() for first bvec with offset

Commit 729204ef49ec("block: relax check on sg gap") allows us to merge
bios, if both are physically contiguous. This change can merge a huge
number of small bios, through mkfs for example, mkfs.ntfs running time
can be decreased to ~1/10.

But if one rq starts with a non-aligned buffer (the 1st bvec's bv_offset
is non-zero) and if we allow the merge, it is quite difficult to respect
sg gap limit, especially the max segment size, or we risk having an
unaligned virtual boundary. This patch tries to avoid the issue by
disallowing a merge, if the req starts with an unaligned buffer.

Also add comments to explain why the merged segment can't end in
unaligned virt boundary.

Fixes: 729204ef49ec ("block: relax check on sg gap")
Tested-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Rewrote parts of the commit message and comments.

Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'fbdev-v4.11-rc6' of git://github.com/bzolnier/linux

Pull fbdev fixes from Bartlomiej Zolnierkiewicz:

- fix probing time checks in omapfb driver (regression fix)

- fix optional VBAT support in ssd1307fb driver (regression fix)

- fix connecting to backend in xen-fbfront driver

* tag 'fbdev-v4.11-rc6' of git://github.com/bzolnier/linux:
  fbdev: omapfb: delete check_required_callbacks()
  xen, fbfront: fix connecting to backend
  fbdev/ssd1307fb: fix optional VBAT support

Merge tag 'pm-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a cpufreq core regression related to CPU online/offline and
  several issues in the turbostat and cpupower utilities.

  Specifics:

   - Allow CPUs to be put back online even if the cpufreq driver is
     unable to work with them (eg. due to missing information from
     platform firmware), which was the previous behavior expected by
     users, but changed in the 4.9 time frame (Chen Yu).

   - Fix a few minor issues in the turbostat utility, introduced mostly
     during the recent update of it (Len Brown, Doug Smythies).

   - Fix a cpupower utility bug causing it to report incorrect values
     for turbo frequencies in some cases (Ben Hutchings)"

* tag 'pm-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpupower: Fix turbo frequency reporting for pre-Sandy Bridge cores
  cpufreq: Bring CPUs up even if cpufreq_online() failed
  tools/power turbostat: update version number
  tools/power turbostat: fix impossibly large CPU%c1 value
  tools/power turbostat: turbostat.8 add missing column definitions
  tools/power turbostat: update HWP dump to decimal from hex
  tools/power turbostat: enable package THERM_INTERRUPT dump
  tools/power turbostat: show missing Core and GFX power on SKL and KBL
  tools/power turbostat: bugfix: GFXMHz column not changing

Merge tag 'acpi-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:

"These revert a recent ACPICA commit that turned out to be problematic
  and fix a device enumeration breakage from the 4.8 cycle.

  Specifics:

   - Revert a recent ACPICA commit targeted at catching firmware bugs
     which promptly did that and caused functional problems to appear
     (Rafael Wysocki).

   - Fix a device enumeration problem introduced in the 4.8 time frame
     which caused the ACPI docking station driver to report incorrect
     status via sysfs among other things (Rafael Wysocki)"

* tag 'acpi-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPICA: Resources: Not a valid resource if buffer length too long"
  ACPI / scan: Set the visited flag for all enumerated devices

Merge tag 'devmem-v4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull CONFIG_STRICT_DEVMEM fix from Kees Cook:
"Fixes /dev/mem to read back zeros for System RAM areas in the 1MB
  exception area on x86 to avoid exposing RAM or tripping hardened
  usercopy"

* tag 'devmem-v4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mm: Tighten x86 /dev/mem with zeroing reads

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael S. Tsirkin:
"virtio oops fixes

  The virtio pci rework using shared interrupts caused a lot of issues.
  We tried to fix them but run out of time. Revert for now, and revisit
  the issue for the next kernel.

  Luckily we are able to do this without loosing automatic interrupt
  NUMA affinity which was the main motivator for the rework"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio-pci: Remove affinity hint before freeing the interrupt
  Revert "virtio_pci: remove struct virtio_pci_vq_info"
  Revert "virtio_pci: use shared interrupts for virtqueues"
  Revert "virtio_pci: don't duplicate the msix_enable flag in struct pci_dev"
  Revert "virtio_pci: simplify MSI-X setup"
  Revert "virtio_pci: fix out of bound access for msix_names"
  MAINTAINERS: fix virtio file pattern
  virtio_console: fix uninitialized variable use
  virtio_net: clear MTU when out of range
  virtio: allow drivers to validate features
  virtio_net: enable big packets for large MTU values

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Missing TCP header sanity check in TCPMSS target, from Eric Dumazet.

2) Incorrect event message type for related conntracks created via
   ctnetlink, from Liping Zhang.

3) Fix incorrect rcu locking when handling helpers from ctnetlink,
   from Gao feng.

4) Fix missing rcu locking when updating helper, from Liping Zhang.

5) Fix missing read_lock_bh when iterating over list of device addresses
   from TPROXY and redirect, also from Liping.

6) Fix crash when trying to dump expectations from conntrack with no
   helper via ctnetlink, from Liping.

7) Missing RCU protection to expecation list update given ctnetlink
   iterates over the list under rcu read lock side, from Liping too.

8) Don't dump autogenerated seed in nft_hash to userspace, this is
   very confusing to the user, again from Liping.

9) Fix wrong conntrack netns module refcount in ipt_CLUSTERIP,
   from Gao feng.
====================

Signed-off-by: David S. Miller <[email protected]>

fbdev: omapfb: delete check_required_callbacks()

Commit 561eb9d09a93 ("fbdev: omap/lcd: Make callbacks optional") made
panel callbacks optional but forgot to update check_required_callbacks().
As a result many (all?) OMAP systems using omapfb will crash at boot.
Fix by deleting the whole function.

Fixes: 561eb9d09a93 ("fbdev: omap/lcd: Make callbacks optional")
Signed-off-by: Aaro Koskinen <[email protected]>
Cc: Tomi Valkeinen <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Signed-off-by: Bartlomiej Zolnierkiewicz <[email protected]>

Merge branches 'acpi-scan-fixes' and 'acpica-fixes'

* acpi-scan-fixes:
ACPI / scan: Set the visited flag for all enumerated devices

* acpica-fixes:
Revert "ACPICA: Resources: Not a valid resource if buffer length too long"

Merge branches 'pm-cpufreq-fixes' and 'pm-tools-fixes'

* pm-cpufreq-fixes:
  cpufreq: Bring CPUs up even if cpufreq_online() failed

* pm-tools-fixes:
  cpupower: Fix turbo frequency reporting for pre-Sandy Bridge cores
  tools/power turbostat: update version number
  tools/power turbostat: fix impossibly large CPU%c1 value
  tools/power turbostat: turbostat.8 add missing column definitions
  tools/power turbostat: update HWP dump to decimal from hex
  tools/power turbostat: enable package THERM_INTERRUPT dump
  tools/power turbostat: show missing Core and GFX power on SKL and KBL
  tools/power turbostat: bugfix: GFXMHz column not changing

Revert "tty: don't panic on OOM in tty_set_ldisc()"

This reverts commit 5362544bebe85071188dd9e479b5a5040841c895 as it is
reported to cause a reproducable crash.

Fixes: 5362544bebe8 ("tty: don't panic on OOM in tty_set_ldisc()")
Reported-by: Vegard Nossum <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: One Thousand Gnomes <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman [email protected]

irqchip/irq-imx-gpcv2: Fix spinlock initialization

The raw_spinlock in the IMX GPCV2 interupt chip is not initialized before
usage. That results in a lockdep splat:

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.

Add the missing raw_spin_lock_init() to the setup code.

Fixes: e324c4dc4a59 ("irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources")
Signed-off-by: Tyler Baker <[email protected]>
Reviewed-by: Fabio Estevam <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()

When the perf_branch_entry::{in_tx,abort,cycles} fields were added,
intel_pmu_lbr_read_32() wasn't updated to initialize them.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Fixes: 135c5612c460 ("perf/x86/intel: Support Haswell/v4 LBR format")
Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
"11 fixes.

  The presence of 'thp: reduce indentation level in change_huge_pmd()'
  is unfortunate. But the patchset had been decently reviewed and tested
  before we decided it was needed in -stable and I felt it best not to
  churn things at the last minute"

* emailed patches from Andrew Morton <[email protected]>:
  mailmap: add Martin Kepplinger's email
  zsmalloc: expand class bit
  zram: do not use copy_page with non-page aligned address
  zram: fix operator precedence to get offset
  hugetlbfs: fix offset overflow in hugetlbfs mmap
  thp: fix MADV_DONTNEED vs clear soft dirty race
  thp: fix MADV_DONTNEED vs. MADV_FREE race
  mm: drop unused pmdp_huge_get_and_clear_notify()
  thp: fix MADV_DONTNEED vs. numa balancing race
  thp: reduce indentation level in change_huge_pmd()
  z3fold: fix page locking in z3fold_alloc()

mailmap: add Martin Kepplinger's email

Set the partly deprecated companies' email addresses as alias for the
personal one.

Link: http://lkml.kernel.org/r/1491984622-17321-1-git-send-email-martin.kepplinger@ginzinger.com
Signed-off-by: Martin Kepplinger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zsmalloc: expand class bit

Now 64K page system, zsamlloc has 257 classes so 8 class bit is not
enough.  With that, it corrupts the system when zsmalloc stores
65536byte data(ie, index number 256) so that this patch increases class
bit for simple fix for stable backport.  We should clean up this mess
soon.

  index size
  0 32
  1 288
  ..
  ..
  204 52256
  256 65536

Fixes: 3783689a1 ("zsmalloc: introduce zspage structure")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: do not use copy_page with non-page aligned address

The copy_page is optimized memcpy for page-alinged address.  If it is
used with non-page aligned address, it can corrupt memory which means
system corruption.  With zram, it can happen with

1. 64K architecture
2. partial IO
3. slub debug

Partial IO need to allocate a page and zram allocates it via kmalloc.
With slub debug, kmalloc(PAGE_SIZE) doesn't return page-size aligned
address.  And finally, copy_page(mem, cmem) corrupts memory.

So, this patch changes it to memcpy.

Actuaully, we don't need to change zram_bvec_write part because zsmalloc
returns page-aligned address in case of PAGE_SIZE class but it's not
good to rely on the internal of zsmalloc.

Note:
When this patch is merged to stable, clear_page should be fixed, too.
Unfortunately, recent zram removes it by "same page merge" feature so
it's hard to backport this patch to -stable tree.

I will handle it when I receive the mail from stable tree maintainer to
merge this patch to backport.

Fixes: 42e99bd ("zram: optimize memory operations with clear_page()/copy_page()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: fix operator precedence to get offset

In zram_rw_page, the logic to get offset is wrong by operator precedence
(i.e., "<<" is higher than "&"). With wrong offset, zram can corrupt
the user's data. This patch fixes it.

Fixes: 8c7f01025 ("zram: implement rw_page operation of zram")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlbfs: fix offset overflow in hugetlbfs mmap

If mmap() maps a file, it can be passed an offset into the file at which
the mapping is to start.  Offset could be a negative value when
represented as a loff_t.  The offset plus length will be used to update
the file size (i_size) which is also a loff_t.

Validate the value of offset and offset + length to make sure they do
not overflow and appear as negative.

Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
region_abort if region_chg fails") applied.  Prior to this commit, the
overflow would still occur but we would luckily return ENOMEM.

To reproduce:

   mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);

Resulted in,

  kernel BUG at mm/hugetlb.c:742!
  Call Trace:
   hugetlbfs_evict_inode+0x80/0xa0
   evict+0x24a/0x620
   iput+0x48f/0x8c0
   dentry_unlink_inode+0x31f/0x4d0
   __dentry_kill+0x292/0x5e0
   dput+0x730/0x830
   __fput+0x438/0x720
   ____fput+0x1a/0x20
   task_work_run+0xfe/0x180
   exit_to_usermode_loop+0x133/0x150
   syscall_return_slowpath+0x184/0x1c0
   entry_SYSCALL_64_fastpath+0xab/0xad

Fixes: ff8c0c53c475 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Vegard Nossum <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: fix MADV_DONTNEED vs clear soft dirty race

Yet another instance of the same race.

Fix is identical to change_huge_pmd().

See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: fix MADV_DONTNEED vs. MADV_FREE race

Both MADV_DONTNEED and MADV_FREE handled with down_read(mmap_sem).

It's critical to not clear pmd intermittently while handling MADV_FREE
to avoid race with MADV_DONTNEED:

CPU0: CPU1:
madvise_free_huge_pmd()
pmdp_huge_get_and_clear_full()
madvise_dontneed()
zap_pmd_range()
pmd_trans_huge(*pmd) == 0 (without ptl)
// skip the pmd
set_pmd_at();
// pmd is re-established

It results in MADV_DONTNEED skipping the pmd, leaving it not cleared.
It violates MADV_DONTNEED interface and can result is userspace
misbehaviour.

Basically it's the same race as with numa balancing in
change_huge_pmd(), but a bit simpler to mitigate: we don't need to
preserve dirty/young flags here due to MADV_FREE functionality.

[[email protected]: Urgh... Power is special again]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: drop unused pmdp_huge_get_and_clear_notify()

Dave noticed that after fixing MADV_DONTNEED vs numa balancing race the
last pmdp_huge_get_and_clear_notify() user is gone.

Let's drop the helper.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: fix MADV_DONTNEED vs. numa balancing race

In case prot_numa, we are under down_read(mmap_sem).  It's critical to
not clear pmd intermittently to avoid race with MADV_DONTNEED which is
also under down_read(mmap_sem):

CPU0: CPU1:
change_huge_pmd(prot_numa=1)
pmdp_huge_get_and_clear_notify()
madvise_dontneed()
zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
set_pmd_at();
// pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd and don't clear it
which may break userspace.

Found by code analysis, never saw triggered.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: reduce indentation level in change_huge_pmd()

Patch series "thp: fix few MADV_DONTNEED races"

For MADV_DONTNEED to work properly with huge pages, it's critical to not
clear pmd intermittently unless you hold down_write(mmap_sem).

Otherwise MADV_DONTNEED can miss the THP which can lead to userspace
breakage.

See example of such race in commit message of patch 2/4.

All these races are found by code inspection. I haven't seen them
triggered. I don't think it's worth to apply them to stable@.

This patch (of 4):

Restructure code in preparation for a fix.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

z3fold: fix page locking in z3fold_alloc()

Stress testing of the current z3fold implementation on a 8-core system
revealed it was possible that a z3fold page deleted from its unbuddied
list in z3fold_alloc() would be put on another unbuddied list by
z3fold_free() while z3fold_alloc() is still processing it. This has
been introduced with commit 5a27aa822 ("z3fold: add kref refcounting")
due to the removal of special handling of a z3fold page not on any list
in z3fold_free().

To fix this, the z3fold page lock should be taken in z3fold_alloc()
before the pool lock is released. To avoid deadlocking, we just try to
lock the page as soon as we get a hold of it, and if trylock fails, we
drop this page and take the next one.

Signed-off-by: Vitaly Wool <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ia64: restore symbol versions for symbols defined in assembly

The ia64 build generates many warnings like this:

WARNING: EXPORT symbol "empty_zero_page" [vmlinux] version generation failed, symbol will not be versioned.

Besides adding the necessary header this also requires fiddling with
some explicit .S -> .o rules.

Cc: IA64-ML <[email protected]>
Signed-off-by: Jan Beulich <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

irq/affinity: Fix extra vecs calculation

This fixes a math error calculating the extra_vecs. The error assumed
only 1 cpu per vector, but the value needs to account for the actual
number of cpus per vector in order to get the correct remainder for
extra CPU assignment.

Fixes: 7bf8222b9bd0 ("irq/affinity: Fix CPU spread for unbalanced nodes")
Reported-by: Xiaolong Ye <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

netfilter: ipt_CLUSTERIP: Fix wrong conntrack netns refcnt usage

Current codes invoke wrongly nf_ct_netns_get in the destroy routine,
it should use nf_ct_netns_put, not nf_ct_netns_get.
It could cause some modules could not be unloaded.

Fixes: ecb2421b5ddf ("netfilter: add and use nf_ct_netns_get/put")
Signed-off-by: Gao Feng <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nft_hash: do not dump the auto generated seed

This can prevent the nft utility from printing out the auto generated
seed to the user, which is unnecessary and confusing.

Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression")
Signed-off-by: Liping Zhang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

Merge branch 'netlink_ext_ACK'

Johannes Berg says:

====================
netlink extended ACK reporting

Changes since v4:
* use __NLMSGERR_ATTR_MAX instead of NUM_NLMSGERR_ATTRS

Changes since v3:
* Add NLM_F_CAPPED and NLM_F_ACK_TLVS flags, to allow entirely
   stateless parsing of the ACK messages by looking at the new
   flags. Need to check NLM_F_ACK_TLVS first, since capping can
   be done in kernels before this patchset without setting the
   flag.
* Remove "missing_attr" functionality - this can obviously be
   added back rather easily, but I'd rather have more discussion
   about the nesting problem there.
* Improve documentation of NLMSGERR_ATTR_OFFS
* Improve message structure documentation, documenting that the
   request message is always capped for success cases
* fix nlmsg_len of the outer message by calling nlmsg_end()
* fix memcpy() of the request in success cases, going back to
   the original code that I'd changed before due to the payload
   adjustments that I reverted when introducing tlvlen

Changes since v2:
* add NUM_NLMSGERR_ATTRS, NLMSGERR_ATTR_MAX
* fix cookie length to 20 (sha-1 length)
* move struct members for cookie to patch 3 where they should be
* another cleanup suggested by David Ahern

Changes since v1:
* credit Pablo and Jamal
* incorporate suggestion from David Ahern
* fix compilation in decnet
====================

Signed-off-by: David S. Miller <[email protected]>

netlink: pass extended ACK struct where available

This is an add-on to the previous patch that passes the extended ACK
structure where it's already available by existing genl_info or extack
function arguments.

This was done with this spatch (with some manual adjustment of
indentation):

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, info->extack)
...
}

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, info->extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse(A, B, C, D, E, NULL)
+nla_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_validate(A, B, C, D, NULL)
+nlmsg_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate(A, B, C, D, NULL)
+nla_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate_nested(A, B, C, NULL)
+nla_validate_nested(A, B, C, extack)
...>
}

Signed-off-by: Johannes Berg <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>