Git Repo - linux.git/log

arm64: KVM: pmu: Fix AArch32 cycle counter access

We're missing the handling code for the cycle counter accessed
from a 32bit guest, leading to unexpected results.

Cc: [email protected] # 4.6+
Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>

x86/dumpstack: Prevent KASAN false positive warnings

The oops stack dump code scans the entire stack, which can cause KASAN
"stack-out-of-bounds" false positive warnings. Tell KASAN to ignore it.

Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/5f6e80c4b0c7f7f0b6211900847a247cdaad753c.1479398226.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>

x86/unwind: Prevent KASAN false positive warnings in guess unwinder

The guess unwinder scans the entire stack, which can cause KASAN
"stack-out-of-bounds" false positive warnings. Tell KASAN to ignore it.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/61939c0b2b2d63ce97ba59cba3b00fd47c2962cf.1479398226.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>

cfg80211: limit scan results cache size

It's possible to make scanning consume almost arbitrary amounts
of memory, e.g. by sending beacon frames with random BSSIDs at
high rates while somebody is scanning.

Limit the number of BSS table entries we're willing to cache to
1000, limiting maximum memory usage to maybe 4-5MB, but lower
in practice - that would be the case for having both full-sized
beacon and probe response frames for each entry; this seems not
possible in practice, so a limit of 1000 entries will likely be
closer to 0.5 MB.
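
As a rough illustration of the idea (not the actual cfg80211 code), a capped cache can overwrite its oldest entry once the limit is reached. The names bss_cache and bss_cache_add, the eviction policy, and the tiny cap of 4 are all assumptions made for this sketch; the commit itself uses a limit of 1000 entries.

```c
#include <stddef.h>

#define BSS_CACHE_MAX 4   /* hypothetical small cap; the commit uses 1000 */

struct bss_entry {
    unsigned long last_seen;   /* time of the last beacon or probe response */
};

struct bss_cache {
    struct bss_entry entries[BSS_CACHE_MAX];
    size_t count;
};

/*
 * Add an entry. While there is room, use the next free slot; once the
 * cache is full, overwrite the entry with the oldest timestamp.
 * Returns the index of the slot that was used.
 */
static size_t bss_cache_add(struct bss_cache *c, unsigned long now)
{
    size_t i, slot = 0;

    if (c->count < BSS_CACHE_MAX) {
        slot = c->count++;
    } else {
        for (i = 1; i < BSS_CACHE_MAX; i++)
            if (c->entries[i].last_seen < c->entries[slot].last_seen)
                slot = i;
    }
    c->entries[slot].last_seen = now;
    return slot;
}
```

With a cap in place, memory use is bounded by the cap times the per-entry size, no matter how many distinct BSSIDs are seen.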

Cc: [email protected]
Signed-off-by: Johannes Berg <[email protected]>

powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1

On POWER9 DD1, when we do a local TLB invalidate we also need to explicitly
invalidate the ERAT.

Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

i2c: digicolor: use clk_disable_unprepare instead of clk_unprepare

Since clk_prepare_enable() is used to enable i2c->clk, we should
use clk_disable_unprepare() to release it on the error path.
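
The pairing can be sketched in user space as follows. The struct clk and the two clk functions below are simple counting stand-ins, not the kernel implementations, and example_probe with its fail_later flag is a hypothetical probe routine.

```c
/* User-space stand-in for a clock: counts prepare and enable state. */
struct clk {
    int prepare_count;
    int enable_count;
};

/* Stand-in: prepares and enables the clock in one call. */
static int clk_prepare_enable(struct clk *clk)
{
    clk->prepare_count++;
    clk->enable_count++;
    return 0;
}

/* Stand-in: undoes both steps of clk_prepare_enable(). */
static void clk_disable_unprepare(struct clk *clk)
{
    clk->enable_count--;
    clk->prepare_count--;
}

/* Hypothetical probe; fail_later simulates an error after the clock is on. */
static int example_probe(struct clk *clk, int fail_later)
{
    int ret = clk_prepare_enable(clk);

    if (ret)
        return ret;

    if (fail_later) {
        /* Undo both enable and prepare, not just prepare. */
        clk_disable_unprepare(clk);
        return -1;
    }
    return 0;
}
```

Calling only an unprepare step on the error path would leave the enable count unbalanced in this model.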

Signed-off-by: Wei Yongjun <[email protected]>
Acked-by: Baruch Siach <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

Merge tag 'sunxi-fixes-for-4.9' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux into fixes

Allwinner fixes for 4.9

A fix to reintroduce missing pinmux options that turned out not to be
optional.

* tag 'sunxi-fixes-for-4.9' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux:
  ARM: dts: sun8i: fix the pinmux for UART1

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'sti-dt-for-v4.9-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/pchotard/sti into fixes

STi DT fix:

Fix typo cs-gpio to cs-gpios

* tag 'sti-dt-for-v4.9-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/pchotard/sti:
  ARM: dts: STiH410-b2260: Fix typo in spi0 chipselect definition

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'imx-fixes-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixes

i.MX fixes for 4.9, 2nd round:

It fixes a boot failure on imx53-qsb board with a DA9053 PMIC, which is
caused by the regulator core change, commit fa93fd4ecc9c ("regulator:
core: Ensure we are at least in bounds for our constraints").

* tag 'imx-fixes-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  ARM: dts: imx53-qsb: Fix regulator constraints

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'omap-for-v4.9/fixes-for-rc-cycle' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes

Fixes for omaps for v4.9-rc cycle. Except for the omap3 fix for the SoC
features printed, all these are quite trivial and tiny. The omap5 jack
detection and gpadc patches are not strictly fixes, but I wanted to get
binding document typo fixed before it pops up on other boards. The
gpadc one liner was in the same series and I applied and pushed it out
already before noticing it could have waited. The list of changes is:

- Fix omap3 SoC features printed
- Make sure OMAP_INTERCONNECT is selected for am43xx only configurations
- Add missing memory node for torpedo
- Initialize uart4_mask properly to avoid writing garbage to PRM registers
- Fix NULL pointer dereference for omap4 volt_data
- Add alias for omap5 gpadc needed by iio drivers
- Enable omap5 headset jack detection and fix its binding typo
- Add missing memory node for logicpd-som-lv
- Fix wrong SMPS6 voltage for VDD-DDR3 for omap5

* tag 'omap-for-v4.9/fixes-for-rc-cycle' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: dts: omap5: board-common: fix wrong SMPS6 (VDD-DDR3) voltage
  ARM: omap3: Add missing memory node in SOM-LV
  ASoC: omap-abe-twl6040: fix typo in bindings documentation
  dts: omap5: board-common: enable twl6040 headset jack detection
  dts: omap5: board-common: add phandle to reference Palmas gpadc
  ARM: OMAP2+: avoid NULL pointer dereference
  ARM: OMAP2+: PRM: initialize en_uart4_mask and grpsel_uart4_mask
  ARM: dts: omap3: Fix memory node in Torpedo board
  ARM: AM43XX: Select OMAP_INTERCONNECT in Kconfig
  ARM: OMAP3: Fix formatting of features printed

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'mvebu-fixes-4.9-1' of git://git.infradead.org/linux-mvebu into fixes

mvebu fixes for 4.9 (part 1)

All of them are fixes for arm64 device tree

- 2 for the SPI node on the Armada 7K/8K
- 1 for the clock node on the Armada 37xx

* tag 'mvebu-fixes-4.9-1' of git://git.infradead.org/linux-mvebu:
  arm64: dts: marvell: add unique identifiers for Armada A8k SPI controllers
  arm64: dts: marvell: fix clocksource for CP110 slave SPI0
  arm64: dts: marvell: Fix typo in label name on Armada 37xx

Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'drm-intel-fixes-2016-11-17' of ssh://git.freedesktop.org/git/drm-intel into drm-fixes

i915 misc fixes.

* tag 'drm-intel-fixes-2016-11-17' of ssh://git.freedesktop.org/git/drm-intel:
  drm/i915: Assume non-DP++ port if dvo_port is HDMI and there's no AUX ch specified in the VBT
  drm/i915: Refresh that status of MST capable connectors in ->detect()
  drm/i915: Grab the rotation from the passed plane state for VLV sprites
  drm/i915: Mark CPU cache as dirty when used for rendering

ipmi/bt-bmc: change compatible node to 'aspeed,ast2400-ibt-bmc'

The Aspeed SoCs have two BT interfaces : one is IPMI compliant and the
other is H8S/2168 compliant.

The current ipmi/bt-bmc driver implements the IPMI version and we
should reflect its nature in the compatible node name using
'aspeed,ast2400-ibt-bmc' instead of 'aspeed,ast2400-bt-bmc'. The
latter should be used for a H8S interface driver if it is implemented
one day.

Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Olof Johansson <[email protected]>

Revert "drm/mediatek: set vblank_disable_allowed to true"

This reverts commit f752fff611b99f5679224f3990a1f531ea64b1ec.

Signed-off-by: Dave Airlie <[email protected]>

Revert "drm/mediatek: fix a typo of OD_CFG to OD_RELAYMODE"

This reverts commit 83ba62bc700bab710b22be3a1bf6cf973f754273.

Signed-off-by: Dave Airlie <[email protected]>

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A set of fixes, one for NVMe from Keith, and a set for nvme-{rdma,t,f}
  from the usual suspects, fixing actual problems that would be a shame
  to release 4.9 with"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvme/pci: Don't free queues on error
  nvmet-rdma: drain the queue-pair just before freeing it
  nvme-rdma: stop and free io queues on connect failure
  nvmet-rdma: don't forget to delete a queue from the list of connection failed
  nvmet: Don't queue fatal error work if csts.cfs is set
  nvme-rdma: reject non-connect commands before the queue is live
  nvmet-rdma: Fix possible NULL deref when handling rdma cm events

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
"First round of -rc fixes.

  Due to various issues, I've been away and couldn't send a pull request
  for about three weeks. There were a number of -rc patches that built
  up in the meantime (some were there already from the early -rc
  stages). Obviously, there were way too many to send now, so I tried to
  pare the list down to the more important patches for the -rc cycle.

  Most of the code has had plenty of soak time at the various vendors'
  testing setups, so I doubt there will be another -rc pull request this
  cycle. I also tried to limit the patches to those with smaller
  footprints, so even though a shortlog is longer than I would like, the
  actual diffstat is mostly very small with the exception of just three
  files that had more changes, and a couple files with pure removals.

  Summary:
   - Misc Intel hfi1 fixes
   - Misc Mellanox mlx4, mlx5, and rxe fixes
   - A couple cxgb4 fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (34 commits)
  iw_cxgb4: invalidate the mr when posting a read_w_inv wr
  iw_cxgb4: set *bad_wr for post_send/post_recv errors
  IB/rxe: Update qp state for user query
  IB/rxe: Clear queue buffer when modifying QP to reset
  IB/rxe: Fix handling of erroneous WR
  IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
  IB/mlx4: Fix create CQ error flow
  IB/mlx4: Check gid_index return value
  IB/mlx5: Fix NULL pointer dereference on debug print
  IB/mlx5: Fix fatal error dispatching
  IB/mlx5: Resolve soft lock on massive reg MRs
  IB/mlx5: Use cache line size to select CQE stride
  IB/mlx5: Validate requested RQT size
  IB/mlx5: Fix memory leak in query device
  IB/core: Avoid unsigned int overflow in sg_alloc_table
  IB/core: Add missing check for addr_resolve callback return value
  IB/core: Set routable RoCE gid type for ipv4/ipv6 networks
  IB/cm: Mark stale CM id's whenever the mad agent was unregistered
  IB/uverbs: Fix leak of XRC target QPs
  IB/hfi1: Remove incorrect IS_ERR check
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"A couple of regression fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix iov_iter_advance() for ITER_PIPE
  xattr: Fix setting security xattrs on sockfs

Merge tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
"orangefs: add .owner to debugfs file_operations

  Without ".owner = THIS_MODULE" it is possible to crash the kernel by
  unloading the Orangefs module while someone is reading debugfs files"

* tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: add .owner to debugfs file_operations

net sched filters: pass netlink message flags in event notification

A userland client should be able to read an event and reflect it back to
the kernel; therefore it needs to extract the complete set of netlink flags.

For example, this will allow "tc monitor" to distinguish Add and Replace
operations.

Signed-off-by: Roman Mashak <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mremap: fix race between mremap() and page cleaning

Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost.  Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().

We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.

The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  And
this patch is to fix the race between page_mkclean() and mremap().

Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it.  Thread 2 starts mremapping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.

        thread 1                           thread 2
     (bound to CPU1)                    (bound to CPU2)

  1: write 1 to addr X to get a
     writeable TLB on this CPU

                                        2: mremap starts

                                        3: move_ptes emptied PTE for addr X
                                           and setup new PTE for addr Y and
                                           then dropped PTL for X and Y

  4: page laundering for N by doing
     fadvise FADV_DONTNEED. When done,
     pageframe N is deemed clean.

  5: *write 2 to addr X

                                        6: tlb flush for addr X

  7: munmap (Y, pagesize) to make the
     page unmapped

  8: fadvise with FADV_DONTNEED again
     to kick the page off the pagecache

  9: pread the page from file to verify
     the value. If 1 is there, it means
     we have lost the written 2.

  *the write may or may not cause segmentation fault, it depends on
  if the TLB is still on the CPU.

Please note that this is only one specific way in which the race could
occur; it does not mean that the race can only occur in exactly the above
configuration, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.

For anonymous pages, mremap() could race with page reclaim.
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/split/paged out before the TLB flush happens for
the old huge PMD in move_page_tables(), so we could still write data to
it.  Normal anonymous pages are in a similar situation.

To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
if there is one, do the flush before dropping the PTL.  If we did the flush
for every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_page_tables() for the whole range.  But if we didn't, we
still need to do the whole range flush.
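
The decision logic described above can be modelled in user space roughly as follows. This is a sketch, not the kernel code: struct pte, flush_tlb, flush_count and the *_model functions are hypothetical stand-ins, and locking is only indicated in comments.

```c
#include <stdbool.h>
#include <stddef.h>

struct pte {
    bool present;
    bool dirty;
};

static int flush_count;   /* number of simulated TLB flushes */

static void flush_tlb(void)
{
    flush_count++;
}

/*
 * Move one batch of PTEs. If any of them was dirty, flush right away,
 * i.e. before the page table lock would be dropped. Otherwise record
 * that the caller still owes a flush.
 */
static void move_ptes_model(const struct pte *ptes, size_t n, bool *need_flush)
{
    bool force_flush = false;
    size_t i;

    for (i = 0; i < n; i++)
        if (ptes[i].present && ptes[i].dirty)
            force_flush = true;

    if (force_flush)
        flush_tlb();
    else
        *need_flush = true;
}

/* Move the whole range in batches; batch must be greater than zero. */
static void move_page_tables_model(const struct pte *ptes, size_t n, size_t batch)
{
    bool need_flush = false;
    size_t off;

    for (off = 0; off < n; off += batch) {
        size_t len = (n - off < batch) ? n - off : batch;

        move_ptes_model(ptes + off, len, &need_flush);
    }

    /* A single whole-range flush covers every batch that skipped its own. */
    if (need_flush)
        flush_tlb();
}
```

In this model a fully clean range costs one flush, as before the fix, while batches containing dirty PTEs pay for an early flush.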

Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which isn't, to avoid flushing the whole
range in move_page_tables().  But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.

KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
  real    5m14.048s
  user    32m19.800s
  sys     4m50.320s

With this commit:
  real    5m13.888s
  user    32m19.330s
  sys     4m51.200s

Reported-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ip6_tunnel: disable caching when the traffic class is inherited

If an ip6 tunnel is configured to inherit the traffic class from
the inner header, the dst_cache must be disabled or it will foul
the policy routing.

The issue is apparently there since at least Linux-2.6.12-rc2.

Reported-by: Liam McBirnie <[email protected]>
Cc: Liam McBirnie <[email protected]>
Acked-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'phy-dev-leaks'

Johan Hovold says:

====================
net: phy: fix of_node and device leaks

These patches fix a couple of of_node leaks in the fixed-link code and a
device reference leak in a phy helper.
====================

Signed-off-by: David S. Miller <[email protected]>

net: phy: fixed_phy: fix of_node leak in fixed_phy_unregister

Make sure to drop the of_node reference taken in fixed_phy_register()
when deregistering a PHY.

Fixes: a75951217472 ("net: phy: extend fixed driver with fixed_phy_register()")

Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

of_mdio: fix device reference leak in of_phy_find_device

Make sure to drop the reference taken by bus_find_device() before
returning NULL from of_phy_find_device() when the found device is not a
PHY.
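
A user-space sketch of the pattern, assuming a lookup helper that returns its result with a reference held. The struct device here and the functions find_device_get, put_device_model and find_phy_model are hypothetical stand-ins, not the driver core API.

```c
#include <stdbool.h>
#include <stddef.h>

struct device {
    int refcount;
    bool is_phy;
};

/* Stand-in for a lookup that returns the device with a reference held. */
static struct device *find_device_get(struct device *dev)
{
    if (dev)
        dev->refcount++;
    return dev;
}

/* Stand-in for dropping a device reference. */
static void put_device_model(struct device *dev)
{
    dev->refcount--;
}

/* Return the device only if it is a PHY; never leak the lookup reference. */
static struct device *find_phy_model(struct device *candidate)
{
    struct device *dev = find_device_get(candidate);

    if (!dev)
        return NULL;

    if (!dev->is_phy) {
        put_device_model(dev);   /* drop the reference before bailing out */
        return NULL;
    }
    return dev;   /* the caller now owns the reference */
}
```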

Fixes: 6ed742363b9c ("of: of_mdio: Ensure mdio device is a PHY")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

of_mdio: fix node leak in of_phy_register_fixed_link error path

Make sure to drop the of_node reference also on failure to parse the
speed property in of_phy_register_fixed_link().

Fixes: 3be2a49e5c08 ("of: provide a binding for fixed link PHYs")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: check dead netns for peernet2id_alloc()

Andrei reports we still allocate a netns ID from the idr after we
destroy it in cleanup_net().

cleanup_net():
  ...
  idr_destroy(&net->netns_ids);
  ...
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list);
      -> rollback_registered_many()
        -> rtmsg_ifinfo_build_skb()
         -> rtnl_fill_ifinfo()
           -> peernet2id_alloc()

After that point we should not even access net->netns_ids, so we
should check whether the current netns is dead as early as we can in
peernet2id_alloc().
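
A minimal sketch of such an early check, assuming the namespace carries a reference count that reaches zero once it is being torn down. struct net_model, NSID_NOT_ASSIGNED and peernet2id_alloc_model are hypothetical names for this illustration; the real function does considerably more.

```c
#define NSID_NOT_ASSIGNED (-1)   /* hypothetical "no id" sentinel */

struct net_model {
    int count;     /* reference count; 0 means the netns is dead */
    int next_id;   /* stand-in for the idr-based id allocator */
};

/* Refuse to touch the id allocator once the namespace is dead. */
static int peernet2id_alloc_model(struct net_model *net)
{
    if (net->count == 0)
        return NSID_NOT_ASSIGNED;

    return net->next_id++;
}
```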

For net-next we can consider avoiding sending the rtmsg entirely;
it would be a good optimization for the netns teardown path.

Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Reported-by: Andrei Vagin <[email protected]>
Cc: Nicolas Dichtel <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Acked-by: Andrei Vagin <[email protected]>
Signed-off-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

crypto: caam - fix type mismatch warning

Building the caam driver on arm64 produces a harmless warning:

drivers/crypto/caam/caamalg.c:140:139: warning: comparison of distinct pointer types lacks a cast

We can use min_t to tell the compiler which type we want it to use
here.
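
For illustration, a simplified stand-in for min_t is shown below. Unlike the kernel macro it evaluates its arguments more than once, and clamp_dump_len with its parameters is a hypothetical caller, not the code in caamalg.c.

```c
#include <stddef.h>

/* Simplified min_t: cast both operands to one type, then compare. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/*
 * Hypothetical helper mixing an unsigned int and a size_t. Casting both
 * sides to size_t makes the comparison type explicit, which is what
 * silences the "distinct pointer types" warning of the type-checking
 * min() in the kernel.
 */
static size_t clamp_dump_len(unsigned int sg_len, size_t remaining)
{
    return min_t(size_t, sg_len, remaining);
}
```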

Fixes: 5ecf8ef9103c ("crypto: caam - fix sg dump")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Horia Geantă <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

dmaengine: cppi41: More PM runtime fixes

Fix use of u32 instead of int for checking for negative error values,
as pointed out by Dan Carpenter <[email protected]>.
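
Why the type matters can be shown with a small stand-alone example. The helper names store_in_u32 and is_error are hypothetical, and -22 is used as a stand-in for a negative errno value.

```c
#include <stdbool.h>
#include <stdint.h>

/* What a negative error code turns into once it is stored in a u32. */
static uint32_t store_in_u32(int err)
{
    return (uint32_t)err;
}

/* With a signed int the usual "ret < 0" error check works as intended. */
static bool is_error(int ret)
{
    return ret < 0;
}
```

Stored in a u32, -22 becomes a large positive number, so a "less than zero" test on that variable can never be true.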

And while testing the PM runtime error path by randomly returning
failed values in runtime resume, I noticed two more places that need
fixing:

- If pm_runtime_get_sync() fails in probe, we still need to do
  pm_runtime_put_sync() to keep the use count happy. We could call
  pm_runtime_put_noidle() on the error path, but we're just going
  to call pm_runtime_disable() after that so pm_runtime_put_sync()
  will do what we want

- We should print an error if pm_runtime_get_sync() fails in
  cppi41_dma_alloc_chan_resources() so we know where it happens

Reported-by: Dan Carpenter <[email protected]>
Fixes: 740b4be3f742 ("dmaengine: cpp41: Fix handling of error path")
Signed-off-by: Tony Lindgren <[email protected]>
Signed-off-by: Vinod Koul <[email protected]>

x86/boot: Avoid warning for zero-filling .bss

The latest binutils are warning about a .fill directive with an explicit
value in a .bss section:

  arch/x86/kernel/head_32.S: Assembler messages:
  arch/x86/kernel/head_32.S:677: Warning: ignoring fill value in section `.bss..page_aligned'
  arch/x86/kernel/head_32.S:679: Warning: ignoring fill value in section `.bss..page_aligned'

This comes from the 'ENTRY()' macro padding the space between the symbols
with 'nop' via:

  .align 4,0x90

Open-coding the .globl directive without the padding avoids that warning,
as all the symbols are already page aligned.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

fix iov_iter_advance() for ITER_PIPE

iov_iter_advance() needs to decrement iter->count by the number of
bytes we'd moved beyond. Normal flavours do that, but ITER_PIPE
doesn't, and ITER_PIPE generic_file_read_iter() for O_DIRECT files
ends up with a bogus fallback to page cache read, resulting in incorrect
values for file offset and bytes read.
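
The invariant can be sketched with a toy iterator. struct iter_model and iter_advance_model are hypothetical and far simpler than struct iov_iter; they only show that advancing must shrink the remaining count.

```c
#include <stddef.h>

struct iter_model {
    size_t pos;     /* bytes already consumed */
    size_t count;   /* bytes still remaining */
};

/* Advance the iterator, keeping pos + count constant. */
static void iter_advance_model(struct iter_model *it, size_t bytes)
{
    if (bytes > it->count)
        bytes = it->count;

    it->pos += bytes;
    it->count -= bytes;   /* the step the commit says ITER_PIPE was missing */
}
```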

Signed-off-by: Abhi Das <[email protected]>
Signed-off-by: Al Viro <[email protected]>

xattr: Fix setting security xattrs on sockfs

The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr.  Since commit 6c6ef9f2, this flag is checked for
setxattr support as well.  This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity.  The
smack security module relies on socket labels (xattrs).

Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.
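
The control flow can be modelled roughly like this. It is a sketch under stated assumptions: handler_set_model plays the role of the sockfs xattr handler, lsm_setsecurity_model the role of the security module hook, and setxattr_model the generic setxattr path; none of these are the kernel functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical handler: defers security xattrs by returning -EAGAIN. */
static int handler_set_model(bool is_security_xattr)
{
    return is_security_xattr ? -EAGAIN : 0;
}

/* Hypothetical security module hook that accepts the label. */
static int lsm_setsecurity_model(void)
{
    return 0;
}

/* Generic path: on -EAGAIN, fall back to the security module. */
static int setxattr_model(bool is_security_xattr)
{
    int error = handler_set_model(is_security_xattr);

    if (error == -EAGAIN) {
        error = -EOPNOTSUPP;
        if (is_security_xattr)
            error = lsm_setsecurity_model();
    }
    return error;
}
```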

We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity.  A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Al Viro <[email protected]>

bnxt: add a missing rcu synchronization

Add a missing synchronize_net() call to avoid potential use after free,
since we explicitly call napi_hash_del() to factorize the RCU grace
period.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Michael Chan <[email protected]>
Acked-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: b53: Fix VLAN usage and how we treat CPU port

We currently have a fundamental problem in how we treat the CPU port and
its VLAN membership. As soon as a second VLAN is configured to be
untagged, the CPU automatically becomes untagged for that VLAN as well,
and yet, we don't gracefully make sure that the CPU becomes tagged in
the other VLANs it could be a member of. This results in only one VLAN
being effectively usable from the CPU's perspective.

Instead of having some pretty complex logic which tries to maintain the
CPU port's default VLAN and its untagged properties, just do something
very simple which consists in neither altering the CPU port's PVID
settings, nor its untagged settings:

- whenever a VLAN is added, the CPU is automatically a member of this
VLAN group, as a tagged member
- PVID settings for downstream ports do not alter the CPU port's PVID
since it now is part of all VLANs in the system

This means that a typical example where e.g. LAN ports are in VLAN1 and
the WAN port is in VLAN2 now requires having two VLAN interfaces for the
host to properly terminate and send traffic from/to.

Fixes: a2482d2ce349 ("net: dsa: b53: Plug in VLAN support")
Reported-by: Hartmut Knaack <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'drm-fixes-for-v4.9-rc6' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"Fixes for amdgpu, and a bunch of arm drivers.

  There seems to be an uptick in the ARM drivers sending things for
  fixes which is good, so I've decided to dequeue a bit early, more
  stuff may arrive before the weekend.

  This contains mediatek, arcpgu, sunxi, fsl-dcu display controller
  fixes along with 3 amdgpu fixes, one for a fencing issue with
  secondary GPUs"

* tag 'drm-fixes-for-v4.9-rc6' of git://people.freedesktop.org/~airlied/linux:
  drm/amdgpu:fix vpost_needed routine
  drm/amdgpu/powerplay: drop a redundant NULL check
  drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
  drm/arcpgu: Accommodate adv7511 switch to DRM bridge
  drm/fsl-dcu: disable planes before disabling CRTC
  drm/fsl-dcu: update all registers on flush
  drm/fsl-dcu: do not update when modifying irq registers
  drm/sun4i: Propagate error to the caller
  drm/sun4i: Fix error handling
  drm/mediatek: modify the factor to make the pll_rate set in the 1G-2G range
  drm/mediatek: enhance the HDMI driving current
  drm/mediatek: do mtk_hdmi_send_infoframe after HDMI clock enable
  drm/mediatek: clear IRQ status before enable OVL interrupt
  drm/mediatek: set vblank_disable_allowed to true
  drm/mediatek: fix a typo of OD_CFG to OD_RELAYMODE
  drm/sun4i: rgb: Remove the bridge enable/disable functions
  drm/sun4i: rgb: Enable panel after controller

iw_cxgb4: invalidate the mr when posting a read_w_inv wr

Also, rearrange things a bit to have a common c4iw_invalidate_mr()
function used everywhere that we need to invalidate.

Fixes: 49b53a93a64a ("iw_cxgb4: add fast-path for small REG_MR operations")
Signed-off-by: Steve Wise <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

iw_cxgb4: set *bad_wr for post_send/post_recv errors

There are a few cases in c4iw_post_send() and c4iw_post_receive()
where *bad_wr is not set when an error is returned. This can
cause a crash if the application tries to use bad_wr.
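
The contract can be sketched as below, assuming a linked list of work requests. struct wr_model and post_send_model are hypothetical, and -22 stands in for a negative errno value.

```c
#include <stddef.h>

struct wr_model {
    struct wr_model *next;
    int valid;
};

/*
 * Post a chain of work requests. On any error, point *bad_wr at the
 * request that failed before returning, so the caller can inspect it.
 */
static int post_send_model(struct wr_model *wr, struct wr_model **bad_wr)
{
    for (; wr; wr = wr->next) {
        if (!wr->valid) {
            *bad_wr = wr;
            return -22;
        }
    }
    return 0;
}
```

If an error path returned without assigning *bad_wr, the caller would read whatever value the pointer held beforehand.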

Signed-off-by: Steve Wise <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

Merge branches 'hfi1' and 'mlx' into k.o/for-4.9-rc

IB/rxe: Update qp state for user query

The method rxe_qp_error() transitions the QP to the error state
and makes sure the QP is drained. It did not, though, update
the QP state for the user's query.

This patch fixes this.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <[email protected]>
Reviewed-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rxe: Clear queue buffer when modifying QP to reset

RXE resets the send-q only once in rxe_qp_init_req() when
QP is created, but when the QP is reused after QP reset, the send-q
holds previous garbage data.

This garbage data wrongly fails CQEs that otherwise
should have completed successfully.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <[email protected]>
Reviewed-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rxe: Fix handling of erroneous WR

To correctly handle an erroneous WR, this fix does the following:
1. Make sure the bad WQE causes a user completion event.
2. Call rxe_completer to handle the erred WQE.

Before the fix, when rxe_requester found a bad WQE, it changed its
status to IB_WC_LOC_PROT_ERR and exited with 0 for non RC QPs.

If this was the 1st WQE then there would be no ACK to invoke the
completer and this bad WQE would be stuck in the QP's send-q.

On top of that the requester exiting with 0 caused rxe_do_task to
endlessly invoke rxe_requester, resulting in a soft-lockup attached
below.

In case the WQE was not the 1st and rxe_completer did get a chance to
handle the bad WQE, it did not cause a complete event since the WQE's
IB_SEND_SIGNALED flag was not set.

Setting WQE status to IB_SEND_SIGNALED is subject to IBA spec
version 1.2.1, section 10.7.3.1 Signaled Completions.

NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
[<ffffffffa0590145>] ? rxe_pool_get_index+0x35/0xb0 [rdma_rxe]
[<ffffffffa05952ec>] lookup_mem+0x3c/0xc0 [rdma_rxe]
[<ffffffffa0595534>] copy_data+0x1c4/0x230 [rdma_rxe]
[<ffffffffa058c180>] rxe_requester+0x9d0/0x1100 [rdma_rxe]
[<ffffffff8158e98a>] ? kfree_skbmem+0x5a/0x60
[<ffffffffa05962c9>] rxe_do_task+0x89/0xf0 [rdma_rxe]
[<ffffffffa05963e2>] rxe_run_task+0x12/0x30 [rdma_rxe]
[<ffffffffa059110a>] rxe_post_send+0x41a/0x550 [rdma_rxe]
[<ffffffff811ef922>] ? __kmalloc+0x182/0x200
[<ffffffff816ba512>] ? down_read+0x12/0x40
[<ffffffffa054bd32>] ib_uverbs_post_send+0x532/0x540 [ib_uverbs]
[<ffffffff815f8722>] ? tcp_sendmsg+0x402/0xb80
[<ffffffffa05453dc>] ib_uverbs_write+0x18c/0x3f0 [ib_uverbs]
[<ffffffff81623c2e>] ? inet_recvmsg+0x7e/0xb0
[<ffffffff8158764d>] ? sock_recvmsg+0x3d/0x50
[<ffffffff81215b87>] __vfs_write+0x37/0x140
[<ffffffff81216892>] vfs_write+0xb2/0x1b0
[<ffffffff81217ce5>] SyS_write+0x55/0xc0
[<ffffffff816bc672>] entry_SYSCALL_64_fastpath+0x1a/0xa

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <[email protected]>
Reviewed-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum

Missing initialization of udp_tunnel_sock_cfg causes the following
kernel panic when the kernel tries to execute gro_receive().

While at it, we converted udp_port_cfg to use the same
initialization scheme as udp_tunnel_sock_cfg.

------------[ cut here ]------------
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffffffffa0588c50
IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163
Oops: 0011 [#1] SMP
Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000
RIP: 0010:[<ffffffffa0588c50>]
[<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP: 0018:ffff880237343c80  EFLAGS: 00010282
RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712
RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00
RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262
R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578
FS:  0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0
Stack:
ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014
000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb
0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000
Call Trace:
<IRQ>
[<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130
[<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0
[<ffffffff81611373>] inet_gro_receive+0x1d3/0x270
[<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0
[<ffffffff81595188>] napi_gro_receive+0x38/0x120
[<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core]
[<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core]
[<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core]
[<ffffffff815949a0>] net_rx_action+0x160/0x380
[<ffffffff816a9197>] __do_softirq+0xd7/0x2c5
[<ffffffff81085c35>] irq_exit+0xf5/0x100
[<ffffffff816a8f16>] do_IRQ+0x56/0xd0
[<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c
<EOI>
[<ffffffff81061f96>] ? native_safe_halt+0x6/0x10
[<ffffffff81037ade>] default_idle+0x1e/0xd0
[<ffffffff8103828f>] arch_cpu_idle+0xf/0x20
[<ffffffff810c37dc>] default_idle_call+0x3c/0x50
[<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0
[<ffffffff81050d8c>] start_secondary+0x15c/0x1a0
RIP  [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP <ffff880237343c80>
CR2: ffffffffa0588c50
---[ end trace 489ee31fa7614ac5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <[email protected]>
Reviewed-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx4: Fix create CQ error flow

Currently, if ib_copy_to_udata fails, the CQ
won't be deleted from the radix tree and the HW (HW2SW).
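
The shape of such an error flow can be sketched as follows. This is an
illustration only: demo_state and demo_create_cq are hypothetical names,
and the two flags merely stand in for the HW and radix tree state.

```c
/* Hypothetical stand-ins for the state touched during CQ creation. */
struct demo_state {
	int in_hw;            /* CQ handed to hardware */
	int in_tree;          /* CQ inserted in the lookup tree */
	int copy_should_fail; /* test knob: make the copy to user space fail */
};

static int demo_create_cq(struct demo_state *s)
{
	int err;

	s->in_hw = 1;
	s->in_tree = 1;

	/* -14 plays the role of -EFAULT from a failed copy to user space. */
	err = s->copy_should_fail ? -14 : 0;
	if (err)
		goto err_undo;
	return 0;

err_undo:
	/* Undo in reverse order so nothing is left behind on failure. */
	s->in_tree = 0;
	s->in_hw = 0;
	return err;
}
```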

Fixes: 225c7b1feef1 ('IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters')
Signed-off-by: Matan Barak <[email protected]>
Signed-off-by: Daniel Jurgens <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx4: Check gid_index return value

Check the returned GID index value and return an error if it is invalid.

Fixes: 5070cd2239bd ('IB/mlx4: Replace mechanism for RoCE GID management')
Signed-off-by: Daniel Jurgens <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: Yuval Shaia <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix NULL pointer dereference on debug print

For XRC QPs, CQs may not exist. Check before attempting to dereference them.

Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Eli Cohen <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Yishai Hadas <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix fatal error dispatching

When an internal error condition is detected, make sure to set the
device inactive after dispatching the event so ULPs can get a
notification of this event.

Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Eli Cohen <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Mohamad Haj Yahia <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Resolve soft lock on massive reg MRs

When reg_mr is called for large MRs (e.g. 4GB) from multiple processes
and the MR caches can't supply the required number of MRs, the slow path
of MR allocation may be used. In this case we need to serialize the
slow path between the processes to avoid a soft lockup.

Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Moshe Lazer <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Eli Cohen <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Use cache line size to select CQE stride

When creating kernel CQs use 128B CQE stride if the
cache line size is 128B, 64B otherwise. This prevents
multiple CQEs from residing in a 128B cache line,
which can cause retries when there are concurrent
reads and writes in one cache line.

Tested with IPoIB on PPC64, saw ~5% throughput
improvement.
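
The selection rule itself is small. A sketch, assuming the cache line size
is passed in as a plain number (demo_cqe_stride is a hypothetical helper,
not the driver's function):

```c
#include <stddef.h>

/* Pick a CQE stride: 128 bytes when the cache line is 128 bytes, else 64. */
static size_t demo_cqe_stride(size_t cache_line_size)
{
	return cache_line_size == 128 ? 128 : 64;
}
```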

Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Daniel Jurgens <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Validate requested RQT size

Validate that the requested size of RQT is supported by firmware.

Fixes: c5f9092936fe ('IB/mlx5: Add Receive Work Queue Indirection table operations')
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Yishai Hadas <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Fix memory leak in query device

We need to free dev->port when we fail to enable RoCE or
initialize node data.

Fixes: 0837e86a7a34 ('IB/mlx5: Add per port counters')
Signed-off-by: Majd Dibbiny <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Avoid unsigned int overflow in sg_alloc_table

sg_alloc_table gets unsigned int as parameter while the driver
returns it as size_t. Check npages isn't greater than maximum
unsigned int.
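
A sketch of the kind of check described, using a hypothetical helper
(demo_npages_to_uint is not a kernel function) that refuses to narrow a
size_t page count that does not fit in an unsigned int:

```c
#include <limits.h>
#include <stddef.h>

/*
 * Narrow a size_t page count to unsigned int.
 * Returns 0 on success, -1 if the value would not fit.
 */
static int demo_npages_to_uint(size_t npages, unsigned int *out)
{
	if (npages > UINT_MAX)
		return -1;
	*out = (unsigned int)npages;
	return 0;
}
```

Without the check, a value just above UINT_MAX would silently wrap to a
small number on platforms where size_t is wider than unsigned int.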

Fixes: eeb8461e36c9 ("IB: Refactor umem to use linear SG table")
Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Add missing check for addr_resolve callback return value

When calling rdma_resolve_ip inside rdma_addr_find_l2_eth_by_grh,
the return status of the request was ignored in the callback function
causing a successful return and an empty dmac.

Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Set routable RoCE gid type for ipv4/ipv6 networks

>From 09f96ba3e9b4442cfb44dca04c6726e55525c9c3 Mon Sep 17 00:00:00 2001
From: Mark Bloch <[email protected]>
Date: Sun, 11 Sep 2016 06:25:10 +0000
Subject: [PATCH rdma-rc v1 3/6] IB/core: Set routable RoCE gid type for ipv4/ipv6
networks

If the underlying network type is ipv4 or ipv6 and the device supports
routable RoCE, prefer it so the traffic could cross subnets.

Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/cm: Mark stale CM id's whenever the mad agent was unregistered

When a CM id object has a port assigned to it, the cm-id has asked to
send through that specific port. If the port was later removed (hot-unplug
event), the cm-id was not updated. To fix that, the port now keeps a list
of all the cm-id's that plan to use it, and whenever the port is removed
it marks all of them as invalid.
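
The bookkeeping can be sketched roughly as below. All names (demo_port,
demo_cm_id, demo_send) are hypothetical, and locking is left out entirely;
this only illustrates the "port tracks its users and marks them stale on
removal" idea, not the ib_cm implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a CM id and a port. */
struct demo_cm_id {
	bool stale;              /* set once the port has gone away */
	struct demo_cm_id *next; /* link in the owning port's list */
};

struct demo_port {
	struct demo_cm_id *ids;  /* every cm-id that uses this port */
};

static void demo_port_track(struct demo_port *port, struct demo_cm_id *id)
{
	id->stale = false;
	id->next = port->ids;
	port->ids = id;
}

/* On port removal, mark every tracked cm-id stale and drop the list. */
static void demo_port_remove(struct demo_port *port)
{
	struct demo_cm_id *id = port->ids;

	while (id) {
		struct demo_cm_id *next = id->next;

		id->stale = true;
		id->next = NULL;
		id = next;
	}
	port->ids = NULL;
}

/* Senders check the flag first instead of touching the removed port. */
static int demo_send(const struct demo_cm_id *id)
{
	return id->stale ? -1 : 0;
}
```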

This commit fixes a kernel panic that is triggered when traffic is running
between guests and one guest is force-rebooted mid-traffic:

Call Trace:
  [<ffffffff815271fa>] ? panic+0xa7/0x16f
  [<ffffffff8152b534>] ? oops_end+0xe4/0x100
  [<ffffffff8104a00b>] ? no_context+0xfb/0x260
  [<ffffffff81084db2>] ? del_timer_sync+0x22/0x30
  [<ffffffff8104a295>] ? __bad_area_nosemaphore+0x125/0x1e0
  [<ffffffff81084240>] ? process_timeout+0x0/0x10
  [<ffffffff8104a363>] ? bad_area_nosemaphore+0x13/0x20
  [<ffffffff8104aabf>] ? __do_page_fault+0x31f/0x480
  [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
  [<ffffffffa0752675>] ? free_msg+0x55/0x70 [mlx5_core]
  [<ffffffffa0753434>] ? cmd_exec+0x124/0x840 [mlx5_core]
  [<ffffffff8105a924>] ? find_busiest_group+0x244/0x9f0
  [<ffffffff8152d45e>] ? do_page_fault+0x3e/0xa0
  [<ffffffff8152a815>] ? page_fault+0x25/0x30
  [<ffffffffa024da25>] ? cm_alloc_msg+0x35/0xc0 [ib_cm]
  [<ffffffffa024e821>] ? ib_send_cm_dreq+0xb1/0x1e0 [ib_cm]
  [<ffffffffa024f836>] ? cm_destroy_id+0x176/0x320 [ib_cm]
  [<ffffffffa024fb00>] ? ib_destroy_cm_id+0x10/0x20 [ib_cm]
  [<ffffffffa034f527>] ? ipoib_cm_free_rx_reap_list+0xa7/0x110 [ib_ipoib]
  [<ffffffffa034f590>] ? ipoib_cm_rx_reap+0x0/0x20 [ib_ipoib]
  [<ffffffffa034f5a5>] ? ipoib_cm_rx_reap+0x15/0x20 [ib_ipoib]
  [<ffffffff81094d20>] ? worker_thread+0x170/0x2a0
  [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff81094bb0>] ? worker_thread+0x0/0x2a0
  [<ffffffff8109aef6>] ? kthread+0x96/0xa0
  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
  [<ffffffff8109ae60>] ? kthread+0x0/0xa0
  [<ffffffff8100c200>] ? child_rip+0x0/0x20

Fixes: a977049dacde ("[PATCH] IB: Add the kernel CM implementation")
Signed-off-by: Mark Bloch <[email protected]>
Signed-off-by: Erez Shitrit <[email protected]>
Reviewed-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/uverbs: Fix leak of XRC target QPs

The real QP is destroyed when its reference count reaches zero, but
for XRC target QPs this call was missing, which caused QP leaks.

Call destroy for all flows.

Fixes: 0e0ec7e0638e ('RDMA/core: Export ib_open_qp() to share XRC...')
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Noa Osherovich <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

Merge tag 'xtensa-20161116' of git://github.com/jcmvbkbc/linux-xtensa

Pull Xtensa fixes from Max Filippov:

- fix register dumps, stack dumps and stack traces that got torn due to
   recent printk changes

- wire up pkey_{mprotect,alloc,free} syscalls

* tag 'xtensa-20161116' of git://github.com/jcmvbkbc/linux-xtensa:
  xtensa: wire up new pkey_{mprotect,alloc,free} syscalls
  xtensa: clean up printk usage for boot/crash logging

ARM: Fix XIP kernels

Commit 7619751f8c90 ("ARM: 8595/2: apply more __ro_after_init") caused
a regression with XIP kernels by moving the __ro_after_init data into
the read-only section. With XIP kernels, the read-only section is
located in read-only memory from the very beginning.

Work around this by moving the __ro_after_init data back into the .data
section, which will be in RAM, and hence will be writable.

It should be noted that in doing so, this remains writable after init.

Fixes: 7619751f8c90 ("ARM: 8595/2: apply more __ro_after_init")
Reported-by: Andrea Merello <[email protected]>
Tested-by: Andrea Merello <[email protected]> [ XIP stm32 ]
Tested-by: Alexandre Torgue <[email protected]>
Signed-off-by: Russell King <[email protected]>

Merge branch 'drm-fixes-4.9' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

Just a few bug fixes for 4.9.  The big one is Mario's prime fencing fix.

* 'drm-fixes-4.9' of git://people.freedesktop.org/~agd5f/linux:
  drm/amdgpu:fix vpost_needed routine
  drm/amdgpu/powerplay: drop a redundant NULL check
  drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)

Merge branch 'mediatek-drm-fixes-2016-11-11' of https://github.com/ckhu-mediatek/linux.git-tags into drm-fixes

This branch includes one patch to fix a typo, two patches to disable
vblank interrupt, and three patches to support HDMI 4K resolution.

* 'mediatek-drm-fixes-2016-11-11' of https://github.com/ckhu-mediatek/linux.git-tags:
  drm/mediatek: modify the factor to make the pll_rate set in the 1G-2G range
  drm/mediatek: enhance the HDMI driving current
  drm/mediatek: do mtk_hdmi_send_infoframe after HDMI clock enable
  drm/mediatek: clear IRQ status before enable OVL interrupt
  drm/mediatek: set vblank_disable_allowed to true
  drm/mediatek: fix a typo of OD_CFG to OD_RELAYMODE

net/phy/vitesse: Configure RGMII skew on VSC8601, if needed

With RGMII, we need a 1.5 to 2ns skew between clock and data lines. The
VSC8601 can handle this internally. While the VSC8601 can set more
fine-grained delays, the standard skew settings work out of the box.
The same heuristic is used to determine when this skew should be enabled
as in vsc824x_config_init().

Tested on custom board with AM3352 SOC and VSC8601 PHY.

Signed-off-by: Alexandru Gagniuc <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: do not call napi_hash_del()

Calling napi_hash_del() before netif_napi_del() is dangerous
if a synchronize_rcu() is not enforced before NAPI struct freeing.

Let's leave this detail to core networking stack and feel
more comfortable.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Hariprasad S <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: do not call napi_hash_del()

Calling napi_hash_del() before netif_napi_del() is dangerous
if a synchronize_rcu() is not enforced before NAPI struct freeing.

Let's leave this detail to core networking stack and feel
more comfortable.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Sathya Perla <[email protected]>
Cc: Ajit Khaparde <[email protected]>
Cc: Sriharsha Basavapatna <[email protected]>
Cc: Somnath Kotur <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tools/power/acpi: Remove direct kernel source include reference

Avoid breaking cross-compiled ACPI tools builds by rearranging the
handling of kernel header files.

This patch also contains OUTPUT/srctree cleanups to make the above fix
work in various build environments.

Fixes: e323c02dee59 (ACPICA: MSVC9: Fix <sys/stat.h> inclusion order issue)
Reported-and-tested-by: Yisheng Xie <[email protected]>
Reported-by: Andy Shevchenko <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <[email protected]>

virtio-net: add a missing synchronize_net()

It seems many drivers do not respect the napi_hash_del() contract.

When napi_hash_del() is used before netif_napi_del(), an RCU grace
period is needed before freeing the NAPI object.

Fixes: 91815639d880 ("virtio-net: rx busy polling support")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

gpio: Remove GPIO_DEVRES option

This option was added in 6a89a314ab107a12af08c71420c19a37a30fc2d3 to
allow use of the devm_gpio_* functions without CONFIG_GPIOLIB.

However, only a few months later in
b69ac52449c658b7ac40034dc3c5f5f4a71a723d, CONFIG_GPIOLIB was added
as a dependency, defeating the original purpose of this option.
Instead of that patch, the original commit could have just been
reverted (and in fact was partially so in
403c1d0be5ccbd750d25c59d8358843a81e52e3b). Further, since this
option has a dependency on HAS_IOMEM, even though it does not
require it, it causes build failures when !HAS_IOMEM (e.g. in a
uml build).

Fix that by completely removing the option, in essence completing
the reversion of the original commit.

Signed-off-by: Keno Fischer <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>

nvme/pci: Don't free queues on error

The nvme_remove function tears down all allocated resources in the correct
order, so no need to free queues on error during initialization. This
fixes possible use-after-free errors when queues are still associated
with a blk-mq hctx.

Reported-by: Scott Bauer <[email protected]>
Tested-by: Scott Bauer <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'sunxi-clk-fixes-for-4.9' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux into clk-fixes

Pull Allwinner clock fixes from Maxime Ripard:

Two fixes, one for the old clock code, one for the new implementation.

* tag 'sunxi-clk-fixes-for-4.9' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux:
clk: sunxi: Fix M factor computation for APB1
clk: sunxi-ng: sun6i-a31: Force AHB1 clock to use PLL6 as parent

clk: efm32gg: Pass correct type to hw provider registration

Dan Carpenter reports that we're passing a pointer to a pointer
here when we should just be passing a pointer. Pass the right
pointer so that the of_clk_hw_onecell_get() sees the appropriate
data pointer on its end.
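
The class of bug is easy to miss because both a pointer and a pointer to a
pointer convert silently to "void *". A sketch with hypothetical stand-ins
(demo_onecell, demo_register and demo_lookup are illustrative names, not
the clk framework API):

```c
#include <stddef.h>

/* Hypothetical stand-in for a provider's lookup data. */
struct demo_onecell {
	int num;
	int *hws;
};

/* The callback receives an opaque pointer and treats it as the data. */
static int demo_get(void *data, int idx)
{
	struct demo_onecell *cells = data;

	if (idx < 0 || idx >= cells->num)
		return -1;
	return cells->hws[idx];
}

static void *demo_registered_data;

/* Registration stores whatever pointer it is given, unchecked. */
static void demo_register(void *data)
{
	demo_registered_data = data;
}

static int demo_lookup(int idx)
{
	return demo_get(demo_registered_data, idx);
}
```

Given "struct demo_onecell *p", the correct call is demo_register(p).
Writing demo_register(&p) compiles just as cleanly, but demo_get() would
then interpret the address of the pointer variable as the struct itself.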

Reported-by: Dan Carpenter <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Uwe Kleine-König <[email protected]>
Fixes: 9337631f52a8 ("clk: efm32gg: Migrate to clk_hw based OF and registration APIs")
Signed-off-by: Stephen Boyd <[email protected]>

clk: berlin: Pass correct type to hw provider registration

Dan Carpenter reports that we're passing a pointer to a pointer
here when we should just be passing a pointer. Pass the right
pointer so that the of_clk_hw_onecell_get() sees the appropriate
data pointer on its end.

Reported-by: Dan Carpenter <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: Alexandre Belloni <[email protected]>
Cc: Sebastian Hesselbarth <[email protected]>
Cc: Stephen Boyd <[email protected]>
Fixes: f6475e298297 ("clk: berlin: Migrate to clk_hw based registration and OF APIs")
Signed-off-by: Stephen Boyd <[email protected]>

Merge branch 'thunderx-fixes'

Sunil Goutham says:

====================
net: thunderx: Miscellaneous fixes

This patchset includes fixes for incorrect LMAC credits,
unreliable driver statistics, memory leak upon interface
down, etc.

Changes from v1:
- As suggested replaced bit shifting with BIT() macro
in the patch 'Fix configuration of L3/L4 length checking'.
====================

Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Fix memory leak and other issues upon interface toggle

This patch fixes the following:
1. When the interface is being torn down and queues are being cleaned up,
   free pending SKBs in the SQ that were either not transmitted or
   not freed, as NAPI is disabled by that time.
2. During interface initialization, delay the CFG_DONE notification till
   the end to avoid corner cases where TXQs are enabled but CQ
   interrupts are not, which blocks transmission and kicks off the
   watchdog.
3. Check for IFF_UP while re-enabling RBDR interrupts from tasklet.

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Fix VF driver's interface statistics

This patch fixes multiple issues:
1. Convert all driver statistics to percpu counters for accuracy.
2. To avoid multiple CQEs being posted by a TSO packet appended to HW,
   the TSO pkt's SQE does not have 'post_cqe' set; instead a dummy SQE is
   added to get the HW transmit completion notification. This dummy
   SQE has 'dont_send' set, so HW drops the pkt it points to and the
   Tx drop counter increases. Fix this by subtracting the SW Tx TSO
   counter from the HW Tx drop counter to get the actual packet drop count.
3. Reset all individual queues' and VNIC HW stats when the interface is going down.
4. Get rid of unnecessary counters in the hot path.
5. Bring out all CQE error stats, i.e. both Rx and Tx.
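
The correction in item 2 amounts to a subtraction. A sketch with a
hypothetical helper (demo_tx_drops is not the driver's function; the
clamp to zero is an added assumption to keep the result sane if the two
counters are sampled at slightly different times):

```c
#include <stdint.h>

/*
 * Each TSO packet adds one dummy descriptor that the hardware counts as a
 * drop, so the software TSO count is subtracted from the hardware figure.
 */
static uint64_t demo_tx_drops(uint64_t hw_tx_drops, uint64_t sw_tx_tso)
{
	/* Assumption: clamp rather than wrap if the counters are skewed. */
	if (sw_tx_tso > hw_tx_drops)
		return 0;
	return hw_tx_drops - sw_tx_tso;
}
```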

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Fix configuration of L3/L4 length checking

This patch fixes enabling of HW verification of L3/L4 length and
TCP/UDP checksum which is currently being cleared. Also fixed VLAN
stripping config which is being cleared when multiqset is enabled.

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Program LMAC credits based on MTU

Programming LMAC credits assuming a 9K frame size by default is incorrect:
for an interface that is one of many on the same BGX/QLM, the number of
credits available is lower, as the Tx FIFO is divided across all
interfaces. For example, with one BGX carrying a 40G interface and another
BGX carrying multiple 10G interfaces, the bandwidth of the 10G interfaces
is affected when traffic is running on both the 40G and 10G interfaces
simultaneously.

This patch fixes this issue by programming credits based on netdev's MTU.
Also fixed configuring MTU to HW and added CQE counter for pkts which
exceed this value.

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id

This patch fixes the 'bgx_id' determination on 83xx where there are
4 BGX blocks instead of 2 on other platforms.

Signed-off-by: Radha Mohan Chintakuntla <[email protected]>
Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'fib-tables-fixes'

Alexander Duyck says:

====================
ipv4: Fix memory leaks and reference issues in fib

This series fixes one major issue and one minor issue in the fib tables.

The major issue is that we had lost the functionality that was flushing the
local table entries from main after we had unmerged the two tries. In
order to regain the functionality I have performed a partial revert and
then moved the functionality for flushing the external entries from main
into fib_unmerge.

The minor issue was a memory leak that could occur in the event that we
weren't able to add an alias to the local trie resulting in the fib alias
being leaked.
====================

Signed-off-by: David S. Miller <[email protected]>

ipv4: Fix memory leak in exception case for splitting tries

Fix a small memory leak that can occur where we leak a fib_alias in the
event of us not being able to insert it into the local table.

Fixes: 0ddcf43d5d4a0 ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: Restore fib_trie_flush_external function and fix call ordering

The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up after splitting the
table when additional rules are added. Specifically, the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.

I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main. This way we don't
call it for every rule change which is what was happening previously.

Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet <[email protected]>
Cc: Jiri Pirko <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: fix range arithmetic for bpf map access

I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries. Fix this up by doing a few things:

1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
life and just adds extra complexity.

2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.

3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.

This fixes the testcase provided by Jann as well as a few other theoretical
problems.
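
Points 2 and 3 can be illustrated with a small range tracker. This is a
sketch under stated assumptions, not the verifier's code: demo_range,
demo_and_imm and the two "unknown" sentinels are hypothetical names and
values chosen for the example.

```c
#include <stdint.h>

/* Illustrative sentinels meaning "bound unknown"; values are assumptions. */
#define DEMO_MIN_UNKNOWN INT64_MIN
#define DEMO_MAX_UNKNOWN INT64_MAX

struct demo_range {
	int64_t min;
	int64_t max;
};

/* Update the tracked range of a register after "reg &= imm". */
static struct demo_range demo_and_imm(struct demo_range r, int64_t imm)
{
	struct demo_range out = { DEMO_MIN_UNKNOWN, DEMO_MAX_UNKNOWN };

	/* AND with a negative mask: give up, both bounds stay unknown. */
	if (imm < 0)
		return out;

	/* A non-negative mask clears the sign bit, so 0 <= result <= imm. */
	out.min = 0;
	out.max = imm;

	/*
	 * If the input is known to be non-negative, result <= input also
	 * holds. An unknown minimum is negative, so it never gets here and
	 * no arithmetic is done on the sentinels.
	 */
	if (r.min >= 0 && r.max < imm)
		out.max = r.max;
	return out;
}
```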

Reported-by: Jann Horn <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
"A regression fix and bug fix bound for stable"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix fuse_write_end() if zero bytes were copied
fuse: fix root dentry initialization

Merge tag 'mfd-fixes-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD fixes from Lee Jones:
- Fix PCI properties in intel-lpss-pci
- Fix Resetting issue during suspend in intel-lpss-pci
- Separate IRQs for USBC device and CHRG in intel_soc_pmic_bxtwc
- Add timeout to fix Resetting issue in stmpe
- Ensure we 'put' reference to device when done in mfd-core

* tag 'mfd-fixes-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  mfd: core: Fix device reference leak in mfd_clone_cell
  mfd: stmpe: Fix RESET regression on STMPE2401
  mfd: intel_soc_pmic_bxtwc: Fix usbc interrupt
  mfd: intel-lpss: Do not put device in reset state on suspend
  mfd: lpss: Fix Intel Kaby Lake PCH-H properties

orangefs: add .owner to debugfs file_operations

Without ".owner = THIS_MODULE" it is possible to crash the kernel
by unloading the Orangefs module while someone is reading debugfs
files.

Signed-off-by: Mike Marshall <[email protected]>

mfd: core: Fix device reference leak in mfd_clone_cell

Make sure to drop the reference taken by bus_find_device_by_name()
before returning from mfd_clone_cell().
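
The general pattern, sketched with hypothetical stand-ins (demo_dev,
demo_find and demo_put are illustrative, not the driver core API): a
lookup that returns a counted reference must be balanced by a put on
every path that leaves the function.

```c
#include <stddef.h>

/* Hypothetical refcounted device. */
struct demo_dev {
	int refs;
};

/* Lookup helper: returns the device with an extra reference held. */
static struct demo_dev *demo_find(struct demo_dev *candidate)
{
	if (candidate)
		candidate->refs++;
	return candidate;
}

static void demo_put(struct demo_dev *dev)
{
	dev->refs--;
}

static int demo_clone(struct demo_dev *candidate)
{
	struct demo_dev *dev = demo_find(candidate);

	if (!dev)
		return -19; /* plays the role of -ENODEV */

	/* ... use dev here ... */

	demo_put(dev); /* drop the reference taken by demo_find() */
	return 0;
}
```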

Fixes: a9bbba996302 ("mfd: add platform_device sharing support for mfd")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

mfd: stmpe: Fix RESET regression on STMPE2401

Since commit c4dd1ba355aae2bc3d1213da6c66c53e3c31e028
("mfd: stmpe: Add reset support for all STMPE variant")
we're resetting the STMPE expanders before use.

This caused a regression on the STMPE2401 on the Nomadik
NHK8815:

stmpe-i2c 0-0043: stmpe2401 detected, chip id: 0x101
nmk-i2c 101f8000.i2c0: write to slave 0x43 timed out
nmk-i2c 101f8000.i2c0: no ack received after address transmission
stmpe-i2c 0-0044: stmpe2401 detected, chip id: 0x101
nmk-i2c 101f8000.i2c0: write to slave 0x44 timed out
nmk-i2c 101f8000.i2c0: no ack received after address transmission

It turns out that we start to poll for the reset bit to
go low again too quickly: the STMPE2401 is not yet online and
ready to be asked for the status of the RESET bit.

By introducing a 10ms delay before starting to hammer
the register for information, we get back to normal:

stmpe-i2c 0-0043: stmpe2401 detected, chip id: 0x101
stmpe-i2c 0-0044: stmpe2401 detected, chip id: 0x101
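
The "settle first, then poll with a timeout" idea can be sketched against
a simulated chip. Everything here is hypothetical (demo_chip and its
helpers model time with a counter and do not touch real hardware); it is
only meant to show why an initial delay avoids the failed first access.

```c
/*
 * Hypothetical device model: the reset bit cannot be read until the chip
 * has had ready_after_ms to come back; earlier reads fail like a NAK.
 */
struct demo_chip {
	unsigned int now_ms;
	unsigned int ready_after_ms;
	unsigned int reset_clears_at_ms;
};

static void demo_sleep_ms(struct demo_chip *chip, unsigned int ms)
{
	chip->now_ms += ms; /* simulated time, no real sleeping */
}

/* Returns 1 if the reset bit is still set, 0 if clear, -1 on bus error. */
static int demo_read_reset_bit(const struct demo_chip *chip)
{
	if (chip->now_ms < chip->ready_after_ms)
		return -1;
	return chip->now_ms < chip->reset_clears_at_ms ? 1 : 0;
}

/* Wait settle_ms before the first read, then poll once per millisecond. */
static int demo_wait_reset(struct demo_chip *chip, unsigned int settle_ms,
			   unsigned int timeout_ms)
{
	unsigned int waited = 0;

	demo_sleep_ms(chip, settle_ms);
	for (;;) {
		int bit = demo_read_reset_bit(chip);

		if (bit <= 0)
			return bit; /* 0: reset done, -1: bus error */
		if (waited >= timeout_ms)
			return -2;  /* timed out */
		demo_sleep_ms(chip, 1);
		waited++;
	}
}
```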

Cc: [email protected]
Cc: Amelie Delaunay <[email protected]>
Fixes: c4dd1ba355aa ("mfd: stmpe: Add reset support for all STMPE variant")
Signed-off-by: Linus Walleij <[email protected]>
Acked-by: Patrice Chotard <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

mfd: intel_soc_pmic_bxtwc: Fix usbc interrupt

The wcove USB Type-C driver is currently being flooded with
interrupts that are not targeted to it, because all CHRG
first level interrupts are mapped to it.
Fix the issue by introducing a separate irq for the
usbc device, and mapping only USB Type-C PHY interrupts to
it.

Fixes: 9c6235c86332 ("mfd: intel_soc_pmic_bxtwc: Add bxt_wcove_usbc device")
Signed-off-by: Heikki Krogerus <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

mfd: intel-lpss: Do not put device in reset state on suspend

Commit 41a3da2b8e163 ("mfd: intel-lpss: Save register context on
suspend") saved the register context while going to suspend and
also put the device in reset state.

Because the device is reset, the system cannot enter S3/S0ix
states when the no_console_suspend flag is enabled. The system
and serial console both hang. Resetting the device is not
needed while going to suspend, so remove this code.

Cc: [email protected]
Fixes: 41a3da2b8e163 ("mfd: intel-lpss: Save register context on suspend")
Signed-off-by: Azhar Shaikh <[email protected]>
Acked-by: Mika Westerberg <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

mfd: lpss: Fix Intel Kaby Lake PCH-H properties

There are a few issues on Intel Kaby Lake PCH-H properties added by
commit a6a576b78e09 ("mfd: lpss: Add Intel Kaby Lake PCH-H PCI IDs"):

- Input clock of I2C controller on Intel Kaby Lake PCH-H is 120 MHz not
  133 MHz. This was probably a copy-paste error from the Intel Broxton I2C
  properties.
- There is no default I2C SDA hold time specified, which is used when
  ACPI doesn't provide it. I got information from the Windows driver team
  that Kaby Lake PCH-H can use the same configuration as Intel
  Sunrisepoint PCH.
- Common HS-UART properties are not used.

Fix these by reusing the Sunrisepoint properties on Kaby Lake PCH-H.

Fixes: a6a576b78e09 ("mfd: lpss: Add Intel Kaby Lake PCH-H PCI IDs")
Reported-by: Xiang A Wang <[email protected]>
Signed-off-by: Jarkko Nikula <[email protected]>
Acked-by: Mika Westerberg <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

sched/fair: Fix task group initialization

The moves of tasks are now propagated down to root and the utilization
of cfs_rq reflects reality so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Propagate asynchronous detach

A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
the propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Propagate load during synchronous attach/detach

When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load as a task of equal weight.
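
The correction factor described above can be sketched in plain C as
follows. This is only an illustration of the scaling idea, not the kernel
code: the function name group_se_load() and its parameters are
hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: load that the sched_entity of a group should carry.
 *   cfs_rq_load: load of the group's cfs_rq on this CPU
 *   tg_load:     load of the whole task group, summed over all CPUs
 *   shares:      weight (share) configured for the task group
 */
long group_se_load(long cfs_rq_load, long tg_load, long shares)
{
	assert(cfs_rq_load >= 0 && tg_load >= cfs_rq_load && shares > 0);

	/* Scale down only when the group as a whole exceeds its share. */
	if (tg_load > shares)
		return cfs_rq_load * shares / tg_load;

	return cfs_rq_load;
}
```

With shares = 1024, a group whose total load is 2048 has a local cfs_rq
load of 512 scaled to 256, while a group below its share is left unscaled.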

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Factorize PELT update

Every time we modify the load/utilization of a sched_entity, we start by
syncing it with its cfs_rq. This update is done in different ways:

- when attaching/detaching a sched_entity, we update cfs_rq and then
we sync the entity with the cfs_rq.

- when enqueueing/dequeuing the sched_entity, we update both
sched_entity and cfs_rq metrics to now.

Use update_load_avg() every time we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_entity.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list

Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can still be put after its
parent.

Let's take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parent is not yet on the list so e is put at
the tail: head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called last
whereas it should be called at the beginning.

Because it follows the bottom-up enqueue sequence, we are sure that we
will finish by adding either a cfs_rq without a parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished adding a new branch. For the others, whose
parents are not yet added, we have to ensure that they will be
added after their children that have just been inserted in the previous
steps, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it until the branch is fully added.
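
The insertion rule described above can be sketched as a small user-space
model. The types and names (struct node, struct leaf_list, the branch
cursor, leaf_list_add()) are hypothetical and only mirror the description
in this message; the real code works on cfs_rq and rq with RCU list
primitives.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	const char *name;
	struct node *parent;		/* NULL for the root */
	struct node *prev, *next;	/* circular list links */
	int on_list;
};

struct leaf_list {
	struct node head;	/* sentinel: head.next is the first entry */
	struct node *branch;	/* last inserted node of a branch that is
				 * not connected yet, else &head */
};

void leaf_list_init(struct leaf_list *l)
{
	l->head.name = "head";
	l->head.parent = NULL;
	l->head.on_list = 0;
	l->head.prev = l->head.next = &l->head;
	l->branch = &l->head;
}

/* Link n immediately after pos. */
static void insert_after(struct node *n, struct node *pos)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

void leaf_list_add(struct leaf_list *l, struct node *n)
{
	assert(l != NULL && n != NULL);
	if (n->on_list)
		return;

	if (n->parent && n->parent->on_list) {
		/* Parent already listed: put the child just before it.
		 * The branch is now connected, so reset the cursor. */
		insert_after(n, n->parent->prev);
		l->branch = &l->head;
	} else if (!n->parent) {
		/* No parent: the root goes to the tail. */
		insert_after(n, l->head.prev);
		l->branch = &l->head;
	} else {
		/* Parent not listed yet: insert just after the last
		 * inserted node of this branch and remember it. */
		insert_after(n, l->branch);
		l->branch = n;
	}
	n->on_list = 1;
}
```

Replaying the example in this model (c, b, root enqueued first, then e,
then d) gives head -> e -> c -> d -> b -> root -> tail, so every child
precedes its parent.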

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Factorize attach/detach entity

Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function attach_entity_cfs_rq().

Create symmetric detach_entity_cfs_rq() function.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Fix incorrect comment for capacity_margin

The comment for capacity_margin introduced in:

3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups

For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or fewer tasks per
CPU as the potential busiest group, which could otherwise lead to a series
of failing load-balancing attempts leading to a force-migration.

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Add per-CPU min capacity to sched_group_capacity

struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate the min
per-CPU capacity in the group, a suitable group for a given task
utilization can more easily be found such that CPUs with reduced capacity
can be avoided for tasks with high utilization (not implemented by this
patch).
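
To illustrate why the sum (or the average) is not enough, here is a small
user-space sketch. struct group_capacity and summarize_group() are
hypothetical names and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-group capacity summary. */
struct group_capacity {
	unsigned long capacity;		/* sum of per-CPU capacities */
	unsigned long min_capacity;	/* smallest per-CPU capacity */
};

/* Build the summary from an array of per-CPU capacities. */
struct group_capacity summarize_group(const unsigned long *cpu_capacity,
				      size_t nr_cpus)
{
	struct group_capacity gc;
	size_t i;

	assert(cpu_capacity != NULL && nr_cpus > 0);

	gc.capacity = 0;
	gc.min_capacity = cpu_capacity[0];
	for (i = 0; i < nr_cpus; i++) {
		gc.capacity += cpu_capacity[i];
		if (cpu_capacity[i] < gc.min_capacity)
			gc.min_capacity = cpu_capacity[i];
	}
	return gc;
}
```

For a group with capacities {1024, 1024, 446, 446} the sum is 2940 and the
average is 735, which matches no CPU in the group; the minimum of 446
tells a caller that a task with utilization 600 may land on a CPU that
cannot fit it.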

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Consider spare capacity in find_idlest_group()

In low-utilization scenarios, comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice.
Systems with groups containing different numbers of CPUs and/or CPUs of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to the existing load-based search, an alternative
spare-capacity-based candidate sched_group is found and selected instead
if sufficient spare capacity exists. If not, existing behaviour is
preserved.
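
A simplified sketch of the two-candidate selection follows. The names
(struct group, pick_group()) are hypothetical, and "sufficient" is modeled
here as "the spare capacity covers the task's utilization", which is an
assumption of this sketch rather than the threshold used by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group summary used only for this sketch. */
struct group {
	unsigned long load;		/* relative load of the group */
	unsigned long max_spare;	/* largest spare capacity of any CPU */
};

/* Return the index of the group chosen for a task of utilization task_util. */
size_t pick_group(const struct group *groups, size_t nr,
		  unsigned long task_util)
{
	size_t i, least_loaded = 0, most_spare = 0;

	assert(groups != NULL && nr > 0);

	/* Track both candidates in a single pass. */
	for (i = 1; i < nr; i++) {
		if (groups[i].load < groups[least_loaded].load)
			least_loaded = i;
		if (groups[i].max_spare > groups[most_spare].max_spare)
			most_spare = i;
	}

	/* Prefer spare capacity when it is sufficient (sketch assumption). */
	if (groups[most_spare].max_spare >= task_util)
		return most_spare;

	/* Otherwise keep the load-based choice. */
	return least_loaded;
}
```

With groups {load 100, spare 200} and {load 150, spare 800}, a task of
utilization 300 goes to the second group, while a task of utilization 900
falls back to the least loaded first group.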

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Compute task/cpu utilization at wake-up correctly

At task wake-up, load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().
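
The mismatch can be shown with a toy model of two decaying signals. The
names (struct util_sig, sync_to(), rq_util_without()) are hypothetical,
and the decay is simplified to whole half-lives of 32 periods, following
the PELT property that a contribution halves after 32 periods.

```c
#include <assert.h>

#define HALF_LIFE 32UL	/* periods after which a contribution halves */

/* Hypothetical utilization signal with its own last update time. */
struct util_sig {
	unsigned long util;
	unsigned long last_update;	/* in periods */
};

/* Decay sig up to 'now'. Only whole half-lives are modeled here. */
void sync_to(struct util_sig *sig, unsigned long now)
{
	unsigned long periods;

	assert(now >= sig->last_update);
	periods = now - sig->last_update;
	assert(periods % HALF_LIFE == 0);
	assert(periods / HALF_LIFE < 8 * sizeof(sig->util));

	sig->util >>= periods / HALF_LIFE;
	sig->last_update = now;
}

/* Estimate the rq utilization once the sleeping task has left it. */
unsigned long rq_util_without(const struct util_sig *rq,
			      struct util_sig *task)
{
	/* Bring the task to the rq's time base before subtracting. */
	sync_to(task, rq->last_update);

	return rq->util > task->util ? rq->util - task->util : 0;
}
```

Take a task with utilization 400 last updated at period 0 and an rq whose
utilization decayed from 600 to 300 by period 32. Subtracting the stale
400 would leave nothing, whereas syncing the task first (400 -> 200)
leaves 100.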

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/x86: Do not clear PREEMPT_NEED_RESCHED on preempt count reset

The per-cpu preempt count of x86 contains two values, the actual preempt
count and the inverted PREEMPT_NEED_RESCHED bit. If a corrupted preempt
count is detected, the preempt_count_set() function is used to reset the
preempt count.

In case the inverted PREEMPT_NEED_RESCHED bit is zero at the time of the
reset, the preemption indication is lost. Use raw_cpu_cmpxchg_4() to reset
only the count part and leave the PREEMPT_NEED_RESCHED bit as it is.

This improves the kernel's behavior when it runs into preempt count leaks
and tries to fix them up.

Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/cpuacct: Avoid %lld seq_printf warning

For s390 kernel builds I keep getting this warning:

kernel/sched/cpuacct.c: In function 'cpuacct_stats_show':
kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'clock_t {aka long int}' [-Wformat=]
seq_printf(sf, "%s %lld\n",

Silence the warning by adding an explicit cast.
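
A user-space illustration of the same kind of fix, using snprintf() and
the C library's clock_t as stand-ins for seq_printf() and the kernel
type; format_stat() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format "<name> <value>\n" into buf; returns what snprintf() returns. */
int format_stat(char *buf, size_t len, const char *name, clock_t val)
{
	assert(buf != NULL && name != NULL);

	/*
	 * The width of clock_t differs between architectures, so cast
	 * explicitly to the type that %lld expects.
	 */
	return snprintf(buf, len, "%s %lld\n", name, (long long)val);
}
```

Without the cast, the argument type matches %lld only on targets where
clock_t happens to be long long, which is what -Wformat reports.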

Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>