Git Repo - linux.git/log

ixgbe: make __ixgbe_setup_tc static

This function is only used in ixgbe_main.c
Resolves a "missing prototype" warning when building the driver with W=1

Reported-by: Phil Schmitt <[email protected]>
Signed-off-by: Emil Tantilov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbevf: fix error code path when setting MAC address

Return error when a MAC address change is rejected by the PF.

This will prevent the user from modifying the MAC address when
that operation is not permitted.

Signed-off-by: Emil Tantilov <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbevf: call ndo_stop() instead of dev_close() when running offline selftest

Calling dev_close() causes IFF_UP to be cleared which will remove the
interface's routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides, this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.

Signed-off-by: Stefan Assmann <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbe: call ndo_stop() instead of dev_close() when running offline selftest

Calling dev_close() causes IFF_UP to be cleared which will remove the
interface's routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides, this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.

Signed-off-by: Stefan Assmann <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbe: Use udelay to avoid sleeping while atomic

Use udelay instead of usleep_range because this can be called while
a lock is held.

Signed-off-by: Mark Rustad <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbe: Fix ATR so that it correctly handles IPv6 extension headers

The ATR code was assuming that it would be able to use tcp_hdr for
every TCP frame that came through.  However this isn't the case as it
is possible for a frame to arrive that is TCP but sent through something
like a raw socket.  As a result the driver was setting up bad filters in
which tcp_hdr was really pointing to the network header so the data was
all invalid.

In order to correct this I have added a bit of parsing logic that will
determine the TCP header location based off of the network header and
either the offset in the case of the IPv4 header, or a walk through the
IPv6 extension headers until it encounters the header that indicates
IPPROTO_TCP.  In addition I have added checks to verify that the lowest
protocol provided is recognized as IPv4 or IPv6 to help mitigate raw
sockets using ETH_P_ALL from having ATR applied to them.
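
The IPv6 half of the parsing described above can be sketched in userspace C
roughly as follows. This is not the driver's code: find_tcp_offset_v6() is a
hypothetical helper, the constants are local copies of the IANA protocol
numbers, and only the hop-by-hop, routing and destination-options headers are
walked (anything else, including the fragment header, makes the sketch give up).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of IANA protocol numbers used by this sketch. */
enum {
	NEXTHDR_HOP = 0,
	NEXTHDR_TCP = 6,
	NEXTHDR_ROUTING = 43,
	NEXTHDR_DEST = 60,
};

#define IPV6_HDR_LEN 40

/*
 * Hypothetical helper: return the byte offset of the TCP header inside an
 * IPv6 packet that starts at pkt, or -1 if no TCP header can be located.
 */
static int find_tcp_offset_v6(const uint8_t *pkt, size_t len)
{
	size_t off = IPV6_HDR_LEN;
	uint8_t nexthdr;

	assert(pkt != NULL);
	if (len < IPV6_HDR_LEN || (pkt[0] >> 4) != 6)
		return -1;	/* too short, or not IPv6 */

	nexthdr = pkt[6];	/* Next Header field of the fixed header */
	while (nexthdr != NEXTHDR_TCP) {
		if (nexthdr != NEXTHDR_HOP && nexthdr != NEXTHDR_ROUTING &&
		    nexthdr != NEXTHDR_DEST)
			return -1;	/* not a header this sketch walks */
		if (off + 8 > len)
			return -1;	/* truncated extension header */
		nexthdr = pkt[off];
		/* Hdr Ext Len counts 8-octet units beyond the first 8 octets. */
		off += ((size_t)pkt[off + 1] + 1) * 8;
	}
	return off <= len ? (int)off : -1;
}
```

With one 8-byte hop-by-hop header in front of TCP the helper reports offset 48;
a packet whose next header is UDP is rejected, so no filter would be set up.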

Signed-off-by: Alexander Duyck <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbe: Store VXLAN port number in network order

The VXLAN port number should be stored in network order instead of in host
order as it is accessed from the hot-path in ATR.  This way we can avoid
having to do any byte swaps in order to validate the port number.
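
As a rough userspace illustration of that idea (struct adapter_sketch and both
helpers are made-up names, not the driver's), the conversion happens once when
the port is configured, and the per-packet check compares raw wire bytes:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the adapter field, kept in network order. */
struct adapter_sketch {
	uint16_t vxlan_port_be;
};

/* Slow path: convert once, when the port is configured. */
static void set_vxlan_port(struct adapter_sketch *a, uint16_t port_host)
{
	assert(a != NULL);
	a->vxlan_port_be = htons(port_host);
}

/* Hot path: compare against the UDP destination port bytes, no swap. */
static int is_vxlan_port(const struct adapter_sketch *a,
			 const uint8_t *udp_dport)
{
	uint16_t wire;

	memcpy(&wire, udp_dport, sizeof(wire));	/* still network order */
	return wire == a->vxlan_port_be;
}
```

The result is the same on little- and big-endian hosts because both sides of
the comparison hold the bytes exactly as they appear on the wire.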

I moved the vxlan_port value into a hole in the read-mostly region of the
adapter struct.  This way it should be in a warm cache-line instead of in
some isolated region in memory when it needs to be accessed.

In addition I went through and stripped a bunch of unneeded ifdef flags
since having an extra variable present doesn't really hurt anything and
makes the code easier to read.  I also went through and dropped the
NETIF_F_RXCSUM flag which was being set in hw_encap_features but provides
no value as the flag is not evaluated in the Rx path.

Signed-off-by: Alexander Duyck <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ixgbe: Fix for RAR0 not being set to default MAC addr

Commit c9f53e63c208 ("ixgbe: Refactor MAC address configuration code")
introduced code that sets HW register RAR0 to FF:FF:FF:FF:FF:FF instead of
the default MAC address. As a result, ixgbe HW discards all incoming packets
whose destination MAC address is not FF:FF:FF:FF:FF:FF.

This commit sets RAR0 correctly to default HW mac address.

Signed-off-by: Tushar Dave <[email protected]>
Tested-by: Sowmini Varadhan <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

clk: qcom: ipq4019: add some fixed clocks for ddrppl and fepll

Drivers for these don't exist yet so we will add them as fixed clocks
so we don't BUG() if we change clocks that reference these clocks.

Signed-off-by: Matthew McClintock <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: qcom: ipq4019: switch remaining defines to enums

When this was added not all the remaining defines were switched over to
use enums, so let's complete that process here

Reported-by: Stephen Boyd <[email protected]>
Signed-off-by: Matthew McClintock <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: qcom: Make reset_control_ops const

The qcom_reset_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: tegra: Make reset_control_ops const

The rst_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: sunxi: Make reset_control_ops const

The sunxi_ve_reset_ops, sun9i_mmc_reset_ops, and sunxi_usb_reset_ops
structures are never modified. Make them const.

Signed-off-by: Philipp Zabel <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: atlas7: Make reset_control_ops const

The atlas7_rst_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: rockchip: Make reset_control_ops const

The rockchip_softrst_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Reviewed-by: Heiko Stuebner <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: mmp: Make reset_control_ops const

The mmp_clk_reset_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

clk: mediatek: Make reset_control_ops const

The mtk_reset_ops structure is never modified. Make it const.

Signed-off-by: Philipp Zabel <[email protected]>
Reviewed-by: Matthias Brugger <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

arm64: defconfig: updates for 4.6

A few defconfig updates got dropped on the floor during the merge window,
so I've rounded up the remainder here:

  * Fix duplicate definition of MMC_BLOCK_MINORS and bump to 32 for
    msm8916

  * CPUFreq support for the Juno platform, using the MHU/SCPI interface

  * Removal of the default command line, which assumed a console called
    ttyAMA0

  * Bits and pieces for the Hi6220 (96Boards HiKey)

Signed-off-by: Will Deacon <[email protected]>

dlm: config: Fix ENOMEM failures in make_cluster()

Commit 1ae1602de0 "configfs: switch ->default groups to a linked list"
left the NULL gps pointer behind after removing the kcalloc() call which
made it non-NULL. It also left the !gps check in place so make_cluster()
now fails with ENOMEM. Remove the remaining uses of the gps variable to
fix that.

Reviewed-by: Bob Peterson <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Andrew Price <[email protected]>
Signed-off-by: David Teigland <[email protected]>

arm64: perf: Move PMU register related defines to asm/perf_event.h

To use the ARMv8 PMU related register defines from the KVM code, we move
the relevant definitions to asm/perf_event.h header file and rename them
with prefix ARMV8_PMU_. This allows us to get rid of kvm_perf_event.h.

Signed-off-by: Anup Patel <[email protected]>
Signed-off-by: Shannon Zhao <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

arm64: opcodes.h: Add arm big-endian config options before including arm header

arm and arm64 use different config options to specify big endian. This
needs taking into account when including code/headers between the two
architectures.

A case in point is PAN, which uses the __instr_arm() macro to output
instructions. The macro comes from opcodes.h, which lives under arch/arm.
On a big-endian build the mismatched config options mean the instruction
isn't byte swapped correctly, resulting in undefined instruction exceptions
during boot:

| alternatives: patching kernel code
| kdevtmpfs[87]: undefined instruction: pc=ffffffc0004505b4
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| kdevtmpfs[87]: undefined instruction: pc=ffffffc00076231c
| Internal error: Oops - undefined instruction: 0 [#1] SMP
| Modules linked in:
| CPU: 0 PID: 87 Comm: kdevtmpfs Not tainted 4.1.16+ #5
| Hardware name: Hisilicon PhosphorHi1382 EVB (DT)
| task: ffffffc336591700 ti: ffffffc3365a4000 task.ti: ffffffc3365a4000
| PC is at dump_instr+0x68/0x100
| LR is at do_undefinstr+0x1d4/0x2a4
| pc : [<ffffffc00076231c>] lr : [<ffffffc0000811d4>] pstate: 604001c5
| sp : ffffffc3365a6450

Cc: <[email protected]> #4.3.x-
Reported-by: Hanjun Guo <[email protected]>
Tested-by: Xuefeng Wang <[email protected]>
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

s390/crypto: provide correct file mode at device register.

When the prng device driver calls misc_register() there is the possibility
to also provide the recommended file permissions. This fix now gives
useful values (0644) where previously just the default was used (resulting
in 0600 for the device file).

Signed-off-by: Harald Freudenberger <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

powerpc: Correct used_vsr comment

The used_vsr flag is set if process has used VSX registers, not Altivec
registers. But the comment says otherwise, correct the comment.

Signed-off-by: Simon Guo <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/process: Fix altivec SPR not being saved

save_sprs() in process.c contains the following test:

	if (cpu_has_feature(cpu_has_feature(CPU_FTR_ALTIVEC)))
		t->vrsave = mfspr(SPRN_VRSAVE);

The CPU feature with the mask 0x1 is CPU_FTR_COHERENT_ICACHE, so the test
is equivalent to:

	if (cpu_has_feature(CPU_FTR_ALTIVEC) &&
	    cpu_has_feature(CPU_FTR_COHERENT_ICACHE))

On CPUs without support for both (i.e. G5) this results in vrsave not
being saved between context switches. The vector register save/restore
code doesn't use VRSAVE to determine which registers to save/restore,
but the value of VRSAVE is used to determine if altivec is being used
in several code paths.
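
A small userspace model shows why the nested call behaves this way. Only the
0x1 mask for the coherent-icache feature comes from the text above; the
ALTIVEC bit value and all function names are placeholders for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define FTR_COHERENT_ICACHE 0x1UL	/* mask stated in the commit message */
#define FTR_ALTIVEC         0x8UL	/* placeholder value for the sketch */

/* Model of a feature test: true if any bit of the mask is set. */
static bool has_feature(unsigned long cpu_features, unsigned long feature)
{
	return (cpu_features & feature) != 0;
}

/* The nested form: the inner result (0 or 1) is reused as a feature mask. */
static bool buggy_test(unsigned long cpu_features)
{
	/* A bool "true" is numerically the coherent-icache mask. */
	assert((unsigned long)true == FTR_COHERENT_ICACHE);
	return has_feature(cpu_features,
			   has_feature(cpu_features, FTR_ALTIVEC));
}

/* The intended form. */
static bool fixed_test(unsigned long cpu_features)
{
	return has_feature(cpu_features, FTR_ALTIVEC);
}
```

For a feature word with the Altivec bit but without the coherent-icache bit,
buggy_test() is false while fixed_test() is true, which is the case in which
vrsave would be skipped.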

Fixes: 152d523e6307 ("powerpc: Create context switch helpers save_sprs() and restore_sprs()")
Cc: [email protected]
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Anton Blanchard <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/mm: Fixup preempt underflow with huge pages

hugepd_free() used __get_cpu_var() once. Nothing ensured that the code
accessing the variable did not migrate from one CPU to another and soon
this was noticed by Tiejun Chen in 94b09d755462 ("powerpc/hugetlb:
Replace __get_cpu_var with get_cpu_var"). So we had it fixed.

Christoph Lameter was doing his __get_cpu_var() replaces and forgot
PowerPC. Then he noticed this and sent his fixed up batch again which
got applied as 69111bac42f5 ("powerpc: Replace __get_cpu_var uses").

The careful reader will notice one little detail: get_cpu_var() got
replaced with this_cpu_ptr(). So now we have a put_cpu_var() which does
a preempt_enable() and nothing that does preempt_disable() so we
underflow the preempt counter.
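
The imbalance can be modelled in plain C with a counter standing in for the
preempt count. Every name ending in _sketch is invented for this illustration;
the real per-CPU accessors are macros with different signatures.

```c
#include <assert.h>

static int preempt_count;	/* stand-in for the preempt counter */
static int per_cpu_slot;	/* stand-in for a per-CPU variable */

/* Disables preemption, then hands out the variable. */
static int *get_cpu_var_sketch(void)
{
	preempt_count++;
	return &per_cpu_slot;
}

/* Re-enables preemption. */
static void put_cpu_var_sketch(void)
{
	preempt_count--;
}

/* Hands out the variable without touching the counter. */
static int *this_cpu_ptr_sketch(void)
{
	return &per_cpu_slot;
}

/* get/put pairing: the counter returns to where it started. */
static int balanced_pair(void)
{
	int *p;

	preempt_count = 0;
	p = get_cpu_var_sketch();
	assert(preempt_count == 1);	/* "preemption disabled" here */
	(*p)++;
	put_cpu_var_sketch();
	return preempt_count;
}

/* this_cpu_ptr/put pairing: an enable with no matching disable. */
static int unbalanced_pair(void)
{
	int *p;

	preempt_count = 0;
	p = this_cpu_ptr_sketch();
	(*p)++;
	put_cpu_var_sketch();
	return preempt_count;
}
```

balanced_pair() ends at 0, while unbalanced_pair() ends at -1, which is the
underflow described above.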

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()

Update the definition of memcpy_from_pmem() to return 0 or a negative
error code. Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().

Cc: Borislav Petkov <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>

qmi_wwan: add "D-Link DWM-221 B1" device id

Thomas reports:
"Windows:

00 diagnostics
01 modem
02 at-port
03 nmea
04 nic

Linux:

T:  Bus=02 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  4 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=2001 ProdID=7e19 Rev=02.32
S:  Manufacturer=Mobile Connect
S:  Product=Mobile Connect
S:  SerialNumber=0123456789ABCDEF
C:  #Ifs= 6 Cfg#= 1 Atr=a0 MxPwr=500mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
I:  If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
I:  If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
I:  If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
I:  If#= 5 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage"

Reported-by: Thomas Schäfer <[email protected]>
Signed-off-by: Bjørn Mork <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide

Pull IDE fixes from David Miller:
"Just two small changes:

  1) Remove bogus init annotation in icside, from Arnd Bergmann.

  2) Don't use zero clock rates in palm_bk3710 driver, from Wolfram
     Sang"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
  ide: palm_bk3710: test clock rate to avoid division by 0
  ide: icside: remove incorrect initconst annotation

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc fixes from David Miller:
"Minor typing cleanup from Joe Perches, and some comment typo fixes
  from Adam Buchbinder"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: Convert naked unsigned uses to unsigned int
  sparc: Fix misspellings in comments.

Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile bugfixes from Chris Metcalf:
"These include updates to MAINTAINERS, some comment spelling fixes, and
  a bugfix to the tile kgdb.c support"

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  tile: Fix misspellings in comments.
  MAINTAINERS: update web link for tile architecture
  MAINTAINERS: update arch/tile maintainer email domain
  tile kgdb: fix bug in copy to gdb regs, and optimize memset

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) There was a race condition between parallel save/swap and delete,
   which resulted in a kernel crash due to the ref increase for save, swap,
   and wrong ref decrease operations. Reported and fixed by Vishwanath Pai.

2) OVS should call into CT NAT for packets of new expected connections only
   when the conntrack state is persisted with the 'commit' option to the
   OVS CT action. From Jarno Rajahalme.

3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

4) Early validation of entry->target_offset to make sure it doesn't take us
   out from the blob, from Florian Westphal.

5) Again, early validation of entry->next_offset to make sure it doesn't take
   us out from the blob, also from Florian.

6) Check that entry->target_offset is always sizeof(struct xt_entry)
   for unconditional entries, when checking both from check_underflow()
   and when checking for loops in mark_source_chains(), again from
   Florian.

7) Fix inconsistent behaviour in nfnetlink_queue when
   NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
   overrun, we have to reinject the packet as the user expects.

8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
   requests.

9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
   this fixes a recent update of the code to namespaceify
   ip_default_ttl, patch from Liping Zhang.

This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unprivileged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.
====================

Signed-off-by: David S. Miller <[email protected]>

netfilter: ipv4: fix NULL dereference

Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
uses sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk always exists, so when it is NULL, an oops will happen.

Signed-off-by: Liping Zhang <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES

Make sure the table name passed via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.

Reported-by: Baozeng Ding <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails

When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.

I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.

The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.

Reported-by: "Yigal Reiss (yreiss)" <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Tested-by: "Yigal Reiss (yreiss)" <[email protected]>

netfilter: x_tables: fix unconditional helper

Ben Hawkes says:

In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.

The problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so it's supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function unconditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeded to the next (non-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.
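
One way to express such a shared definition is sketched below: a rule counts as
unconditional only if its address spec is the all-zero wildcard and no match
sits between the rule header and its target. The structs and the helper name
are simplified, hypothetical stand-ins, not the real ip_tables layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified address spec: all zero bytes means "match anything". */
struct match_spec_sketch {
	unsigned char addr_bytes[16];
};

/* Simplified rule header; matches, if any, would follow it in memory. */
struct rule_sketch {
	struct match_spec_sketch ip;
	unsigned short target_offset;	/* from rule start to its target */
};

static bool rule_is_unconditional(const struct rule_sketch *r)
{
	static const struct match_spec_sketch wildcard;	/* zero-initialised */

	assert(r != NULL);
	/* No matches: the target starts right after the rule header. */
	return r->target_offset == sizeof(*r) &&
	       memcmp(&r->ip, &wildcard, sizeof(wildcard)) == 0;
}
```

If every caller uses a single helper of this kind, the underflow check and the
loop detection cannot disagree about which rules end a chain.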

Reported-by: Ben Hawkes <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: x_tables: make sure e->next_offset covers remaining blob size

Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: x_tables: validate e->target_offset early

We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.
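
The kind of early offset validation meant here, together with a bounds check
on next_offset, might look like the following sketch. The struct layout,
TARGET_HDR_LEN and the helper name are assumptions made for illustration; the
real x_tables entries carry more fields and the real checks differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified rule header. */
struct entry_sketch {
	unsigned short target_offset;	/* from entry start to its target */
	unsigned short next_offset;	/* from entry start to the next entry */
};

#define TARGET_HDR_LEN 32	/* assumed minimum size of a target record */

/* entry_pos is the entry's byte position inside a blob of blob_len bytes. */
static bool entry_offsets_sane(const struct entry_sketch *e,
			       size_t entry_pos, size_t blob_len)
{
	assert(e != NULL);
	/* The whole entry, up to next_offset, must lie inside the blob. */
	if (entry_pos > blob_len || e->next_offset > blob_len - entry_pos)
		return false;
	/* The target cannot start inside the entry header ... */
	if (e->target_offset < sizeof(*e))
		return false;
	/* ... and must fit completely before the next entry begins. */
	if ((size_t)e->target_offset + TARGET_HDR_LEN > e->next_offset)
		return false;
	return true;
}
```

Running a check like this before any code follows the offsets means later
passes only ever dereference positions that are known to be inside the blob.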

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

openvswitch: call only into reachable nf-nat code

The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases and also make the code more readable.

The same #ifdef exists in ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.

Signed-off-by: Arnd Bergmann <[email protected]>
Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

openvswitch: Fix checking for new expected connections.

OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection. The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.
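
The first half of the problem, ANDing with a bit number instead of a mask, is
easy to reproduce in isolation. The enum below is a local copy for this sketch
that mirrors the kernel's naming, with the expected bit at position 0; the two
helper functions are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy for the sketch: bit number versus the mask derived from it. */
enum {
	IPS_EXPECTED_BIT = 0,
	IPS_EXPECTED = 1 << IPS_EXPECTED_BIT,
};

/* ANDing with the bit number: with bit number 0 this is always false. */
static bool test_with_bit_number(unsigned long status)
{
	assert(IPS_EXPECTED_BIT == 0);
	return (status & IPS_EXPECTED_BIT) != 0;
}

/* ANDing with the mask: reports whether the expected bit is set. */
static bool test_with_mask(unsigned long status)
{
	return (status & IPS_EXPECTED) != 0;
}
```

Note that even the mask test would not be the right condition here, because
the expected bit stays set for the lifetime of the connection; that is why the
message points to the ctinfo value instead.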

Fixes: 05752523e565 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: ipset: fix race condition in ipset save, swap and delete

This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to the *_head functions are because previously we would
increment ref whenever we called these functions, we don't do that
anymore.

Reviewed-by: Joshua Hunt <[email protected]>
Signed-off-by: Vishwanath Pai <[email protected]>
Signed-off-by: Jozsef Kadlecsik <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

drm/amdgpu: Don't move pinned BOs

The purpose of pinning is to prevent a buffer from moving.

Reviewed-by: Christian König <[email protected]>
Tested-by: Rex Zhu <[email protected]>
Signed-off-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/radeon: Don't move pinned BOs

The purpose of pinning is to prevent a buffer from moving.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

net: macb: Only call GPIO functions if there is a valid GPIO

GPIOlib will print warning messages if we call GPIO functions without a
valid GPIO. Change the code to avoid doing so.

Signed-off-by: Charles Keepax <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns: set-coalesce-usecs returns errno by dsaf.ko

Setting coalesce usecs in HW may fail, and ethtool needs to know whether
the parameter was configured successfully. So dsaf.ko needs to return the
errno.

Signed-off-by: Lisheng <[email protected]>
Signed-off-by: Yisen Zhuang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns: fixed the setting and getting overtime bug

The overtime setting and getting REGs in HNS V2 are different from HNS V1.
The two versions need to be distinguished when getting or setting the REGs.

Signed-off-by: Lisheng <[email protected]>
Signed-off-by: Yisen Zhuang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: Use proper buffer size in nla_memcpy

For the input parameter count, it's better to use the size of the
destination buffer, as nla_memcpy takes into account the length of
the source netlink attribute when data is copied from an
attribute.
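
A userspace analogue makes the reasoning concrete. attr_copy() below is a
hypothetical helper, not the kernel function: it copies at most count bytes,
never more than the source payload holds, and (as an assumption of this
sketch) zero-fills whatever is left of the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy min(count, payload_len) bytes into dest and
 * zero-fill the rest of the count bytes. Returns the bytes copied.
 */
static size_t attr_copy(void *dest, size_t count,
			const void *payload, size_t payload_len)
{
	size_t n = count < payload_len ? count : payload_len;

	assert(dest != NULL && payload != NULL);
	memcpy(dest, payload, n);
	if (count > n)
		memset((char *)dest + n, 0, count - n);
	return n;
}
```

Because the copy is already bounded by the source length, passing the
destination size as count is what protects the destination buffer from a
longer-than-expected attribute.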

Signed-off-by: Haishuang Yan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drm/radeon: add a dpm quirk for all R7 370 parts

Higher mclk values are not stable due to a bug somewhere.
Limit them for now.

Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/radeon: add another R7 370 quirk

bug:
https://bugzilla.kernel.org/show_bug.cgi?id=115291

Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

ALSA: dice: fix memory leak when unplugging

When the sound card is about to be released, dice private data is
also released, so all of its data should be released with it. However,
stream data is not released. This causes a memory leak when
unplugging a dice unit.

This commit fixes the bug.

Fixes: 4bdc495c87b3 ('ALSA: dice: handle several PCM substreams when any isochronous streams are available')
Signed-off-by: Takashi Sakamoto <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

drm/rockchip: dw_hdmi: Don't call platform_set_drvdata()

The Rockchip dw_hdmi driver just called platform_set_drvdata() to get
your hopes up that maybe, somehow, you'd be able to retrieve the 'struct
rockchip_hdmi' from a pointer to the 'struct device'. You can't. When
we call dw_hdmi_bind() the main driver calls dev_set_drvdata(), which
clobbers our setting.

Let's just remove the platform_set_drvdata() to avoid dashing people's
hopes.

Signed-off-by: Douglas Anderson <[email protected]>

drm/rockchip: vop: Fix vop crtc cleanup

This fixes a few problems in the vop crtc cleanup (handling error
paths and cleanup upon exit):

* The vop_create_crtc() error path had an unsafe version of the
  iterator used for iterating over all planes (though it was
  destroying planes in the iterator so should have used the safe
  version)

* vop_destroy_crtc() - wasn't calling vop_plane_destroy(), which made
  slub_debug unhappy, at least if we ended up running this due to a
  deferred probe.

* In vop_create_crtc() if we were missing the "port" device tree node
  we would fail but not return an error (found by code inspection).

Fix these problems.

Signed-off-by: Douglas Anderson <[email protected]>

drm/rockchip: dw_hdmi: Call drm_encoder_cleanup() in error path

The drm_encoder_cleanup() call was missing from the error path of
dw_hdmi_rockchip_bind(). This caused a crash when slub_debug was
enabled and we ended up deferring probe of HDMI at boot.

This call isn't needed from unbind() because if dw_hdmi_bind() returns
no error then it takes over the job of freeing the encoder (in
dw_hdmi_unbind).

Signed-off-by: Douglas Anderson <[email protected]>

drm/rockchip: vop: Disable planes when disabling CRTC

When a VOP is re-enabled, it will immediately start scanning out the
framebuffers that were configured last time, even if those have
been destroyed already.

To prevent the VOP from trying to access freed memory, disable all its
windows when the CRTC is being disabled, then each window will get a
valid framebuffer address before it's enabled again.

Signed-off-by: Tomeu Vizoso <[email protected]>
Link: http://lkml.kernel.org/g/CAAObsKAv+05ih5U+=4kic_NsjGMhfxYheHR8xXXmacZs+p5SHw@mail.gmail.com

drm/rockchip: vop: Don't reject empty modesets

So that when DRM_IOCTL_MODE_SETCRTC is called with neither an FB nor a mode, the
CRTC gets disabled.

Signed-off-by: Tomeu Vizoso <[email protected]>
Link: http://lkml.kernel.org/g/CAAObsKAv+05ih5U+=4kic_NsjGMhfxYheHR8xXXmacZs+p5SHw@mail.gmail.com

drm/rockchip: cancel pending vblanks on close

When closing the DRM device while a vblank is pending, we access
file_priv after it has been free'd, which gives:

  Unable to handle kernel NULL pointer dereference at virtual address 00000000
  ...
  PC is at __list_add+0x5c/0xe8
  LR is at send_vblank_event+0x54/0x1f0
  ...
  [<c02952e8>] (__list_add) from [<c031a7b4>] (send_vblank_event+0x54/0x1f0)
  [<c031a760>] (send_vblank_event) from [<c031a9c0>] (drm_send_vblank_event+0x70/0x78)
  [<c031a950>] (drm_send_vblank_event) from [<c031a9f8>] (drm_crtc_send_vblank_event+0x30/0x34)
  [<c031a9c8>] (drm_crtc_send_vblank_event) from [<c0339ad8>] (vop_isr+0x224/0x28c)
  [<c03398b4>] (vop_isr) from [<c0081780>] (handle_irq_event_percpu+0x12c/0x3e4)

This can be triggered somewhat reliably with:

modetest -M rockchip -v -s ...

Add a preclose hook to the driver so that we can discard any pending
vblank events when the device is closed.

Signed-off-by: John Keeping <[email protected]>

drm/rockchip: vop: fix crtc size in plane check

If the geometry of a crtc is changing in an atomic update then we must
validate the plane size against the new state of the crtc and not the
current size, otherwise if the crtc size is increasing the plane will be
cropped at the previous size and will not fill the screen.

Signed-off-by: John Keeping <[email protected]>

ravb: fix software timestamping

In ravb_start_xmit, don't call skb_tx_timestamp only when hardware
timestamping is requested: in that case software timestamps are
suppressed and thus the call to skb_tx_timestamp does not have any effect.

Instead call skb_tx_timestamp unconditionally in ravb_start_xmit, since
the function checks itself if software timestamping is required or should
be skipped due to hardware timestamping.

Signed-off-by: Lino Sanfilippo <[email protected]>
Acked-by: Sergei Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sxgbe: fix error paths in sxgbe_platform_probe()

We need to use post-decrement to ensure that irq_dispose_mapping is
also called on priv->rxq[0]->irq_no; moreover, if one of the above for
loops failed already at i==0 (so we reach one of these labels with
that value of i), we'll enter an essentially infinite loop of
out-of-bounds accesses.
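
The unwind pattern described above can be sketched in plain C. This is a
minimal userspace illustration, not the driver code: NUM_QUEUES,
dispose_irq() and unwind_irqs() are hypothetical names, and dispose_irq()
merely stands in for irq_dispose_mapping().

```c
#define NUM_QUEUES 4	/* hypothetical queue count for illustration */

/* Records which IRQ slots were released. */
static int disposed[NUM_QUEUES];

/* Stand-in for irq_dispose_mapping(); only marks the slot. */
static void dispose_irq(int idx)
{
	disposed[idx] = 1;
}

/*
 * Unwind after the setup loop failed at index 'failed_at': entries
 * 0 .. failed_at - 1 were set up and must be released.
 */
static int unwind_irqs(int failed_at)
{
	int i = failed_at;
	int released = 0;

	/*
	 * Post-decrement tests i before decrementing it, so index 0 is
	 * still released and failed_at == 0 skips the loop entirely.
	 * With "while (--i)" index 0 would be skipped, and starting at
	 * i == 0 would run into negative indices.
	 */
	while (i--) {
		dispose_irq(i);
		released++;
	}
	return released;
}
```

Calling unwind_irqs(0) releases nothing, while unwind_irqs(3) releases
slots 2, 1 and 0 in that order.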

Signed-off-by: Rasmus Villemoes <[email protected]>
Reviewed-by: Francois Romieu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Drivers: isdn: hisax: isac.c: Fix assignment and check into one expression.

Fix variable assignment inside if statement. It is error-prone and hard to read.

Signed-off-by: Cosmin-Gabriel Samoila <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Fix returned tc and hoplimit values for route with IPv6 encapsulation

For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link

Signed-off-by: Quentin Armitage <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers/net/usb/plusb.c: Fix typo

Signed-off-by: Diego Viola <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated

arm:pxa_defconfig can result in the following crash if the max1111 driver
is not instantiated.

Unhandled fault: page domain fault (0x01b) at 0x00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: : 1b [#1] PREEMPT ARM
Modules linked in:
CPU: 0 PID: 300 Comm: kworker/0:1 Not tainted 4.5.0-01301-g1701f680407c #10
Hardware name: SHARP Akita
Workqueue: events sharpsl_charge_toggle
task: c390a000 ti: c391e000 task.ti: c391e000
PC is at max1111_read_channel+0x20/0x30
LR is at sharpsl_pm_pxa_read_max1111+0x2c/0x3c
pc : [<c03aaab0>] lr : [<c0024b50>] psr: 20000013
...
[<c03aaab0>] (max1111_read_channel) from [<c0024b50>]
(sharpsl_pm_pxa_read_max1111+0x2c/0x3c)
[<c0024b50>] (sharpsl_pm_pxa_read_max1111) from [<c00262e0>]
(spitzpm_read_devdata+0x5c/0xc4)
[<c00262e0>] (spitzpm_read_devdata) from [<c0024094>]
(sharpsl_check_battery_temp+0x78/0x110)
[<c0024094>] (sharpsl_check_battery_temp) from [<c0024f9c>]
(sharpsl_charge_toggle+0x48/0x110)
[<c0024f9c>] (sharpsl_charge_toggle) from [<c004429c>]
(process_one_work+0x14c/0x48c)
[<c004429c>] (process_one_work) from [<c0044618>] (worker_thread+0x3c/0x5d4)
[<c0044618>] (worker_thread) from [<c004a238>] (kthread+0xd0/0xec)
[<c004a238>] (kthread) from [<c000a670>] (ret_from_fork+0x14/0x24)

This can occur because the SPI controller driver (SPI_PXA2XX) is built as
module and thus not necessarily loaded. While building SPI_PXA2XX into the
kernel would make the problem disappear, it appears prudent to ensure that
the driver is instantiated before accessing its data structures.

Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Signed-off-by: Guenter Roeck <[email protected]>

Linux 4.6-rc1

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
  cleanup on the mon_client and osd_client code from Ilya, scattered
  writeback support for CephFS and a pile of bug fixes from Zheng, and a
  few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
  recently, but ended up changing my mind after all.  Next time I'll
  really hold people to it.  Oh well.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
  libceph: use KMEM_CACHE macro
  ceph: use kmem_cache_zalloc
  rbd: use KMEM_CACHE macro
  ceph: use lookup request to revalidate dentry
  ceph: kill ceph_get_dentry_parent_inode()
  ceph: fix security xattr deadlock
  ceph: don't request vxattrs from MDS
  ceph: fix mounting same fs multiple times
  ceph: remove unnecessary NULL check
  ceph: avoid updating directory inode's i_size accidentally
  ceph: fix race during filling readdir cache
  libceph: use sizeof_footer() more
  ceph: kill ceph_empty_snapc
  ceph: fix a wrong comparison
  ceph: replace CURRENT_TIME by current_fs_time()
  ceph: scattered page writeback
  libceph: add helper that duplicates last extent operation
  libceph: enable large, variable-sized OSD requests
  libceph: osdc->req_mempool should be backed by a slab pool
  libceph: make r_request msg_size calculation clearer
  ...

Merge tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs filesystem from Mike Marshall.

This finally merges the long-pending orangefs filesystem, which has been
much cleaned up with input from Al Viro over the last six months.  From
the documentation file:

"OrangeFS is an LGPL userspace scale-out parallel storage system.  It
  is ideal for large storage problems faced by HPC, BigData, Streaming
  Video, Genomics, Bioinformatics.

  Orangefs, originally called PVFS, was first developed in 1993 by Walt
  Ligon and Eric Blumer as a parallel file system for Parallel Virtual
  Machine (PVM) as part of a NASA grant to study the I/O patterns of
  parallel programs.

  Orangefs features include:

    - Distributes file data among multiple file servers
    - Supports simultaneous access by multiple clients
    - Stores file data and metadata on servers using local file system
      and access methods
    - Userspace implementation is easy to install and maintain
    - Direct MPI support
    - Stateless"

see Documentation/filesystems/orangefs.txt for more in-depth details.

* tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
  orangefs: fix orangefs_superblock locking
  orangefs: fix do_readv_writev() handling of error halfway through
  orangefs: have ->kill_sb() evict the VFS side of things first
  orangefs: sanitize ->llseek()
  orangefs-bufmap.h: trim unused junk
  orangefs: saner calling conventions for getting a slot
  orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
  orangefs: get rid of readdir_handle_s
  ornagefs: ensure that truncate has an up to date inode size
  orangefs: move code which sets i_link to orangefs_inode_getattr
  orangefs: remove needless wrapper around GFP_KERNEL
  orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
  orangefs: refactor inode type or link_target change detection
  orangefs: use new getattr for revalidate and remove old getattr
  orangefs: use new getattr in inode getattr and permission
  orangefs: use new orangefs_inode_getattr to get size in write and llseek
  orangefs: use new orangefs_inode_getattr to create new inodes
  orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
  orangefs: remove inode->i_lock wrapper
  orangefs: put register_chrdev immediately before register_filesystem
  ...

Merge tag 'ntb-4.6' of git://github.com/jonmason/ntb

Pull NTB bug fixes from Jon Mason:
"NTB bug fixes for tasklet from spinning forever, link errors,
  translation window setup, NULL ptr dereference, and ntb-perf errors.

  Also, a modification to the driver API that makes _addr functions
  optional"

* tag 'ntb-4.6' of git://github.com/jonmason/ntb:
  NTB: Remove _addr functions from ntb_hw_amd
  NTB: Make _addr functions optional in the API
  NTB: Fix incorrect clean up routine in ntb_perf
  NTB: Fix incorrect return check in ntb_perf
  ntb: fix possible NULL dereference
  ntb: add missing setup of translation window
  ntb: stop link work when we do not have memory
  ntb: stop tasklet from spinning forever during shutdown.
  ntb: perf test: fix address space confusion

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull more SCSI updates from James Bottomley:
"The only new stuff which missed the first pull request is an update to
  the UFS driver.

  The rest is an assortment of bug fixes and minor tweaks which appeared
  recently (some are fixes for recent code and some are stuff spotted
  recently by the checkers or the new gcc-6 compiler [most of Arnd's
  stuff])"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
  scsi_common: do not clobber fixed sense information
  scsi: ufs: select CONFIG_NLS
  scsi: fc: use get/put_unaligned64 for wwn access
  fnic: move printk()s outside of the critical code section.
  qla2xxx: avoid maybe_uninitialized warning
  megaraid_sas: add missing curly braces in ioctl handler
  lpfc: fix misleading indentation
  scsi_transport_sas: add 'scsi_target_id' sysfs attribute
  scsi_dh_alua: uninitialized variable in alua_check_vpd()
  scsi: ufs-qcom: add printouts of testbus debug registers
  scsi: ufs-qcom: enable/disable the device ref clock
  scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
  scsi: ufs: add device quirk delay before putting UFS rails in LPM
  scsi: ufs: fix leakage during link off state
  scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
  scsi: ufs: handle non spec compliant bkops behaviour by device
  scsi: ufs: add retry for query descriptors
  scsi: ufs: add error recovery after DL NAC error
  scsi: ufs: make error handling bit faster
  scsi: ufs: disable vccq if it's not needed by UFS device
  ...

f2fs/crypto: fix xts_tweak initialization

Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
to just "FS" for preprocessor symbols).

Because of the symbol renaming, it's a bit hard to see it as a file
move: use

    git show -M30 0b81d07790726

to lower the rename detection to just 30% similarity and make git show
the files as renamed (the header file won't be shown as a rename even
then - since all it contains is symbol definitions, it looks almost
completely different).

Even with the renames showing as renames, the diffs are not all that
easy to read, since so much is just the renames.  But Eric Biggers
noticed that it's not just all renames: the initialization of the
xts_tweak had been broken too, using the inode number rather than the
page offset.

That's not right - it makes the xts_tweak the same for all pages of each
inode.  It _might_ make sense to make the xts_tweak contain both the
offset _and_ the inode number, but not just the inode number.
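
As a rough userspace sketch of the idea (not the fs/crypto code), the
tweak can be derived from the page offset so that each page of a file
gets a different value. XTS_TWEAK_SIZE and init_xts_tweak() are assumed
names for this illustration, and native byte order is used for brevity.

```c
#include <stdint.h>
#include <string.h>

#define XTS_TWEAK_SIZE 16	/* assumed tweak width for this sketch */

/*
 * Build a per-page tweak from the page's offset (index) within the
 * file. Seeding it from the inode number instead would give every
 * page of the file the same tweak.
 */
static void init_xts_tweak(uint8_t tweak[XTS_TWEAK_SIZE], uint64_t page_index)
{
	/* Low bytes carry the page index, the remainder is zero padding. */
	memcpy(tweak, &page_index, sizeof(page_index));
	memset(tweak + sizeof(page_index), 0,
	       XTS_TWEAK_SIZE - sizeof(page_index));
}
```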

Reported-by: Eric Biggers <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

NTB: Remove _addr functions from ntb_hw_amd

Kernel zero day testing warned about address space confusion.  A virtual
iomem address was used where a physical address is expected.  The
offending functions implement an optional part of the api, so they are
removed.  They can be added later, after testing.

Fixes: a1b3695820aa490e58915d720a1438069813008b
Signed-off-by: Allen Hubbe <[email protected]>
Acked-by: Xiangliang Yu <[email protected]>
Signed-off-by: Jon Mason <[email protected]>

orangefs: fix orangefs_superblock locking

* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
* remove from the list _before_ orangefs_unmount() - request_mutex
in the latter will make sure that nothing observed in the loop in
ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
of loop
* on removal, keep the forward pointer and zero the back one. That
way we can drop and regain the spinlock in the loop body (again,
ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
rest of the list.

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs: fix do_readv_writev() handling of error halfway through

Error should only be returned if nothing had been read/written.
Otherwise we need to report a short read/write instead.
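
The rule can be illustrated with a small userspace model. This is only a
sketch under assumptions: transfer_all() and the chunk_results array are
hypothetical and do not mirror the actual do_readv_writev() code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * chunk_results[i] is the byte count moved by chunk i, or a negative
 * errno if that chunk failed.
 */
static ssize_t transfer_all(const ssize_t *chunk_results, size_t nchunks)
{
	ssize_t total = 0;
	size_t i;

	for (i = 0; i < nchunks; i++) {
		ssize_t ret = chunk_results[i];

		if (ret < 0) {
			/* Report the error only if nothing was moved yet;
			 * otherwise return the short count. */
			return total ? total : ret;
		}
		total += ret;
	}
	return total;
}
```

With results {4096, 4096, -EIO} the caller sees a short transfer of 8192
bytes; with {-EIO, 4096} it sees -EIO.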

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs: have ->kill_sb() evict the VFS side of things first

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs: sanitize ->llseek()

a) open files can't have NULL inodes
b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
c) make_bad_inode() on lseek()?

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs-bufmap.h: trim unused junk

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs: saner calling conventions for getting a slot

just have it return the slot number or -E... - the caller checks
the sign anyway
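
A minimal sketch of that calling convention, assuming a tiny fixed slot
table: NSLOTS, get_slot() and put_slot() are hypothetical names, the
choice of -EBUSY is an assumption, and unlike the real code this version
never waits for a slot to become free.

```c
#include <errno.h>

#define NSLOTS 4	/* hypothetical table size */

static unsigned char slot_in_use[NSLOTS];

/* Returns the slot index (>= 0) on success or a negative errno. */
static int get_slot(void)
{
	int i;

	for (i = 0; i < NSLOTS; i++) {
		if (!slot_in_use[i]) {
			slot_in_use[i] = 1;
			return i;
		}
	}
	return -EBUSY;	/* assumed error code for "no free slot" */
}

/* Releases a slot previously returned by get_slot(). */
static void put_slot(int slot)
{
	slot_in_use[slot] = 0;
}
```

The caller needs a single check, "if (slot < 0) return slot;", instead of
a separate status value plus an output parameter.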

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer

it's always __orangefs_bufmap

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

orangefs: get rid of readdir_handle_s

no point, really - we couldn't keep those across the calls of
getdents(); it would be too easy to DoS, having all slots exhausted.

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mike Marshall <[email protected]>

ACPI / processor: Request native thermal interrupt handling via _OSC

There are several reports of freezes when the Intel P-states driver
enables the HWP (Hardware P-States) feature on Skylake-based systems. The root
cause is identified as the HWP interrupts causing BIOS code to freeze.

HWP interrupts use the thermal LVT which can be handled by Linux
natively, but on the affected Skylake-based systems SMM will respond
to it by default.  This is a problem for several reasons:
- On the affected systems the SMM thermal LVT handler is broken (it
   will crash when invoked) and a BIOS update is necessary to fix it.
- With thermal interrupt handled in SMM we lose all of the reporting
   features of the arch/x86/kernel/cpu/mcheck/therm_throt driver.
- Some thermal drivers like x86-package-temp depend on the thermal
   threshold interrupts signaled via the thermal LVT.
- The HWP interrupts are useful for debugging and tuning
   performance (if the kernel can handle them).
The native handling of thermal interrupts needs to be enabled
because of that.

This requires some way to tell SMM that the OS can handle thermal
interrupts.  That can be done by using _OSC/_PDC in processor
scope very early during ACPI initialization.

The meaning of _OSC/_PDC bit 12 in processor scope is whether or
not the OS supports native handling of interrupts for Collaborative
Processor Performance Control (CPPC) notifications.  Since on
HWP-capable systems CPPC is a firmware interface to HWP, setting
this bit effectively tells the firmware that the OS will handle
thermal interrupts natively going forward.

For details on _OSC/_PDC refer to:
http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html

To implement the _OSC/_PDC handshake as described, introduce a new
function, acpi_early_processor_osc(), that walks the ACPI
namespace looking for ACPI processor objects and invokes _OSC for
them with bit 12 in the capabilities buffer set and terminates the
namespace walk on the first success.

Also modify intel_thermal_interrupt() to clear HWP status bits in
the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents
them from firing continuously).

Signed-off-by: Srinivas Pandruvada <[email protected]>
[ rjw: Subject & changelog, function rename ]
Signed-off-by: Rafael J. Wysocki <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge fourth patch-bomb from Andrew Morton:
"A lot more stuff than expected, sorry.  A bunch of ocfs2 reviewing was
  finished off.

   - mhocko's oom-reaper out-of-memory-handler changes

   - ocfs2 fixes and features

   - KASAN feature work

   - various fixes"

* emailed patches from Andrew Morton <[email protected]>: (42 commits)
  thp: fix typo in khugepaged_scan_pmd()
  MAINTAINERS: fill entries for KASAN
  mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
  kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2
  mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
  arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
  mm, kasan: add GFP flags to KASAN API
  mm, kasan: SLAB support
  kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()
  include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()
  mm/page_alloc: prevent merging between isolated and other pageblocks
  drivers/memstick/host/r592.c: avoid gcc-6 warning
  ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
  ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
  ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
  ocfs2: solve a problem of crossing the boundary in updating backups
  ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
  ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
  ocfs2/dlm: fix race between convert and recovery
  ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
  ...

Merge tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixlet from Rafael Wysocki:
"One of the commits in my previous pull request changed the permissions of
  drivers/power/avs/rockchip-io-domain.c to executable by mistake"

* tag 'pm+acpi-4.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Fix permissions of drivers/power/avs/rockchip-io-domain.c

Merge tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull ia64 update from Tony Luck:
"Wire up new system calls p{read,write}v2 for ia64"

* tag 'please-pull-preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  [IA64] Enable preadv2 and pwritev2 syscalls for ia64

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull more input updates from Dmitry Torokhov:
"Second round of updates for the input subsystem.

  The BYD PS/2 protocol driver now uses absolute reporting mode and
  should behave more like other touchpads; Synaptics driver needed to
  extend one of its quirks to a newer firmware version, and a few USB
  drivers got tightened up checks for the contents of their descriptors"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: sur40 - fix DMA on stack
  Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
  Input: synaptics - handle spurious release of trackstick buttons, again
  Input: synaptics-rmi4 - remove check of Non-NULL array
  Input: byd - enable absolute mode
  Input: ims-pcu - sanity check against missing interfaces
  Input: melfas_mip4 - add hw_version sysfs attribute

thp: fix typo in khugepaged_scan_pmd()

!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ebru Akagunduz <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: fill entries for KASAN

Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: generic_file_read_iter(): check for zero reads unconditionally

If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set,
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.

Since roundup_pow_of_two(0) is undefined, UBSAN reports

  UBSAN: Undefined behaviour in include/linux/log2.h:63:13
  shift exponent 64 is too large for 64-bit type 'long unsigned int'
  CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
  [...]
  Call Trace:
   [...]
   [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
   [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
   [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
   [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
   [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
   [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
   [...]
   [<ffffffff81510b06>] __vfs_read+0x256/0x3d0
   [...]

when get_init_ra_size() gets called from ondemand_readahead().

The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since

  1UL << (sizeof(unsigned long) * 8)

evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.

What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().

According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.

Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
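
The early return can be sketched in userspace as follows. This is a
simplified model under assumptions, not the mm/filemap.c code:
read_sketch() and its ra_hint parameter are hypothetical, and
roundup_pow2() only imitates the shape of the kernel helper, which is
undefined for an argument of zero.

```c
#include <stddef.h>
#include <sys/types.h>

/* Position of the highest set bit, counting from 1; 0 for x == 0. */
static unsigned int fls_long(unsigned long x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/*
 * Round n up to a power of two. The caller must pass n >= 1: for
 * n == 0 the shift count would equal the width of unsigned long,
 * which is undefined behaviour in C.
 */
static unsigned long roundup_pow2(unsigned long n)
{
	return 1UL << fls_long(n - 1);
}

/* Returns the number of bytes "read"; *ra_hint models the readahead size hint. */
static ssize_t read_sketch(size_t count, unsigned long *ra_hint)
{
	if (!count)
		return 0;	/* zero-length read: no readahead, no atime update */

	*ra_hint = roundup_pow2(count);
	return (ssize_t)count;
}
```

Because the check sits at the entry, the rounding helper is never
reached with a zero argument, regardless of the architecture.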

Signed-off-by: Nicolai Stange <[email protected]>
Cc: Al Viro <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2

Signed-off-by: Alexander Potapenko <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, kasan: stackdepot implementation. Enable stackdepot for SLAB

Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
will allow KASAN to store allocation/deallocation stack traces for memory
chunks.  The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.
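
The handle-based deduplication can be illustrated with a toy userspace
table. This is only a sketch under assumptions: the fixed-size
open-addressing table, the FNV-1a-style hash and the names depot_save(),
DEPOT_SLOTS and MAX_FRAMES are hypothetical and do not reflect the
kernel's data layout.

```c
#include <stdint.h>
#include <string.h>

#define DEPOT_SLOTS 64	/* hypothetical table size */
#define MAX_FRAMES  8	/* hypothetical maximum trace depth */

struct depot_entry {
	int used;
	unsigned int nr;
	uintptr_t frames[MAX_FRAMES];
};

static struct depot_entry depot[DEPOT_SLOTS];

/* FNV-1a-style hash over the return addresses of a trace. */
static uint32_t hash_trace(const uintptr_t *frames, unsigned int nr)
{
	uint32_t h = 2166136261u;
	unsigned int i;

	for (i = 0; i < nr; i++) {
		h ^= (uint32_t)frames[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Store a trace once and return a small handle for it (slot + 1).
 * Saving an identical trace again returns the same handle. Returns 0
 * if the trace is too deep or the table is full.
 */
static uint32_t depot_save(const uintptr_t *frames, unsigned int nr)
{
	uint32_t start, i;

	if (nr > MAX_FRAMES)
		return 0;

	start = hash_trace(frames, nr) % DEPOT_SLOTS;
	for (i = 0; i < DEPOT_SLOTS; i++) {
		uint32_t slot = (start + i) % DEPOT_SLOTS;
		struct depot_entry *e = &depot[slot];

		if (!e->used) {
			/* First time this trace is seen: store it. */
			e->used = 1;
			e->nr = nr;
			memcpy(e->frames, frames, nr * sizeof(*frames));
			return slot + 1;
		}
		if (e->nr == nr &&
		    !memcmp(e->frames, frames, nr * sizeof(*frames)))
			return slot + 1;	/* already stored */
	}
	return 0;
}
```

An allocation's metadata then only needs to hold the 32-bit handle rather
than a full copy of the trace.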

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator.  Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.

[[email protected]: s/depot_stack_handle/depot_stack_handle_t]
[[email protected]: comment style fixes]
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections

KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.

Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.

Signed-off-by: Alexander Potapenko <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, kasan: add GFP flags to KASAN API

Add GFP flags to KASAN hooks for future patches to use.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, kasan: SLAB support

Add KASAN hooks to SLAB allocator.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right()

This patchset implements SLAB support for KASAN.

Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap
objects, therefore we reimplement this feature in mm/kasan/stackdepot.c.
The intention is to ultimately switch SLUB to use this implementation as
well, which will save a lot of memory (right now SLUB bloats each object
by 256 bytes to store the allocation/deallocation stacks).

Also, neither SLUB nor SLAB delays the reuse of freed memory chunks, which
is necessary for better detection of use-after-free errors. We
introduce memory quarantine (mm/kasan/quarantine.c), which allows
delayed reuse of deallocated memory.
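
The delayed-reuse idea can be sketched as a small FIFO. This is a toy
userspace model under assumptions, not mm/kasan/quarantine.c:
QUARANTINE_SLOTS and quarantine_put() are hypothetical names, and the
capacity is counted in objects rather than bytes.

```c
#include <stddef.h>

#define QUARANTINE_SLOTS 4	/* hypothetical capacity */

static void *quarantine[QUARANTINE_SLOTS];
static size_t q_head;		/* slot holding the oldest object once full */
static size_t q_count;

/*
 * Park a freed object instead of releasing it immediately. Once the
 * quarantine is full, the oldest parked object is evicted and
 * returned so the caller can hand it back to the allocator; until
 * then NULL is returned.
 */
static void *quarantine_put(void *obj)
{
	void *evicted = NULL;

	if (q_count == QUARANTINE_SLOTS)
		evicted = quarantine[q_head];
	else
		q_count++;

	quarantine[q_head] = obj;
	q_head = (q_head + 1) % QUARANTINE_SLOTS;
	return evicted;
}
```

While an object sits in the quarantine its memory cannot be handed out
again, so a stale access still hits memory marked as freed.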

This patch (of 7):

Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as
the test only checks the page allocator functionality. Also reimplement
kmalloc_large_oob_right() so that the test allocates a large enough
chunk of memory that still does not trigger the page allocator fallback.

Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill()

A leftover from commit c32b3cbe0d06 ("oom, PM: make OOM detection in the
freezer path raceless").

Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc: prevent merging between isolated and other pageblocks

Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:

> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal:         204800 kB
> CmaFree:          195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal:         204800 kB
> CmaFree:         6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal:       16342016 kB
> MemFree:        22367268 kB
> MemAvailable:   22370528 kB

Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim.  Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:

> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partially isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C' and 'I' mean MIGRATE_CMA and
> MIGRATE_ISOLATE, respectively.)
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.

This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
but since it doesn't hold the zone->lock between pageblocks, a race
window does exist.

It's also likely that unexpected merging can occur between
MIGRATE_ISOLATE and non-CMA pageblocks.  This should be prevented in
__free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
max order of merging on isolated pageblock").  However, we only check
the migratetype of the pageblock where buddy merging has been initiated,
not the migratetype of the buddy pageblock (or group of pageblocks)
which can be MIGRATE_ISOLATE.

Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath
and bloat-o-meter has shown significant code bloat (the function is
inline).

This patch reduces the bloat at some expense of more complicated code.
The buddy-merging while-loop in __free_one_page() is initially bounded
to pageblock_order and without any migratetype checks.  The checks are
placed outside, bumping the max_order if merging is allowed, and
returning to the while-loop with a statement which can't be possibly
considered harmful.
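
The loop structure described above can be sketched as a small userspace
model.  This is only an illustration under stated assumptions, not the
kernel code: the toy_* names, the tiny orders and the free_order[] array
are all hypothetical, and the real function additionally skips the
migratetype check when the zone has no isolated pageblocks.

```c
#include <assert.h>

/* Toy parameters, far smaller than real kernel values (assumed). */
#define TOY_PAGEBLOCK_ORDER 2                     /* 4 pages per pageblock */
#define TOY_MAX_ORDER       4                     /* largest merged order is 3 */
#define TOY_NR_PAGES        (1 << (TOY_MAX_ORDER - 1))
#define TOY_NR_BLOCKS       (TOY_NR_PAGES >> TOY_PAGEBLOCK_ORDER)

enum toy_mt { TOY_MOVABLE, TOY_CMA, TOY_ISOLATE };

struct toy_zone {
	int free_order[TOY_NR_PAGES];        /* order of the free block headed here, -1 if none */
	enum toy_mt block_mt[TOY_NR_BLOCKS]; /* migratetype of each pageblock */
};

static enum toy_mt toy_pageblock_mt(const struct toy_zone *z, unsigned int pfn)
{
	return z->block_mt[pfn >> TOY_PAGEBLOCK_ORDER];
}

/* Free the block at pfn; return the order it has after buddy merging. */
unsigned int toy_free_one_page(struct toy_zone *z, unsigned int pfn,
			       unsigned int order)
{
	enum toy_mt mt, buddy_mt;
	unsigned int max_order = TOY_PAGEBLOCK_ORDER + 1;
	unsigned int buddy;

	assert(pfn < TOY_NR_PAGES && order < TOY_MAX_ORDER);
	assert((pfn & ((1u << order) - 1)) == 0);
	mt = toy_pageblock_mt(z, pfn);
	if (max_order > TOY_MAX_ORDER)
		max_order = TOY_MAX_ORDER;

continue_merging:
	/* Cheap loop: stays inside one pageblock, so no migratetype checks. */
	while (order < max_order - 1) {
		buddy = pfn ^ (1u << order);
		if (z->free_order[buddy] != (int)order)
			goto done_merging;
		z->free_order[buddy] = -1;
		pfn &= buddy;
		order++;
	}
	/* Toy guard: a block of the largest order has no buddy in this zone. */
	if (max_order < TOY_MAX_ORDER && order < TOY_MAX_ORDER - 1) {
		/* The next merge would cross a pageblock boundary. */
		buddy = pfn ^ (1u << order);
		buddy_mt = toy_pageblock_mt(z, buddy);
		if (mt != buddy_mt &&
		    (mt == TOY_ISOLATE || buddy_mt == TOY_ISOLATE))
			goto done_merging;
		max_order++;
		goto continue_merging;
	}
done_merging:
	z->free_order[pfn] = (int)order;
	return order;
}
```

In this model, freeing the last page of an isolated pageblock next to a
free CMA pageblock stops at the pageblock order, while the same layout
with two CMA pageblocks merges one order further.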

This fixes the accounting bug and also removes the arguably weird state
in the original commit 3c605096d315 where buddies could be left
unmerged.

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Link: https://lkml.org/lkml/2016/3/2/280
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Hanjun Guo <[email protected]>
Tested-by: Hanjun Guo <[email protected]>
Acked-by: Joonsoo Kim <[email protected]>
Debugged-by: Laura Abbott <[email protected]>
Debugged-by: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: <[email protected]> [3.18+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/memstick/host/r592.c: avoid gcc-6 warning

The r592 driver relies on behavior of the DMA mapping API that is
normally observed but not guaranteed by the API.  Instead it uses a
runtime check to fail transfers if the API ever behaves differently.

When CONFIG_NEED_SG_DMA_LENGTH is not set, one of the checks turns into a
comparison of a variable with itself, which gcc-6.0 now warns about:

drivers/memstick/host/r592.c: In function 'r592_transfer_fifo_dma':
drivers/memstick/host/r592.c:302:31: error: self-comparison always evaluates to false [-Werror=tautological-compare]
    (sg_dma_len(&dev->req->sg) < dev->req->sg.length)) {
                               ^

The check itself is not a problem, so this patch just rephrases the
condition in a way that gcc does not consider an indication of a mistake.
We already know that dev->req->sg.length was initially R592_LFIFO_SIZE, so
we can compare it to that constant again.
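
As a rough illustration of why the old condition degenerates, the
following standalone sketch mimics how sg_dma_len() collapses to the
plain length field.  The toy_* names and TOY_NEED_SG_DMA_LENGTH are
hypothetical stand-ins, and the value 512 for the FIFO size is an
assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIFO_SIZE 512u	/* stand-in for R592_LFIFO_SIZE; value assumed */

struct toy_sg {
	unsigned int length;
#ifdef TOY_NEED_SG_DMA_LENGTH
	unsigned int dma_length;
#endif
};

/* Without a separate dma_length field the macro is just the length. */
#ifdef TOY_NEED_SG_DMA_LENGTH
#define toy_sg_dma_len(sg) ((sg)->dma_length)
#else
#define toy_sg_dma_len(sg) ((sg)->length)
#endif

/*
 * The old form of the check was
 *     toy_sg_dma_len(sg) < sg->length
 * which expands to a variable compared with itself when the macro is the
 * plain length field.  Comparing against the known initial size keeps the
 * intent without the self-comparison.
 */
int toy_mapping_too_short(const struct toy_sg *sg)
{
	assert(sg != NULL);
	return toy_sg_dma_len(sg) < TOY_FIFO_SIZE;
}
```

A mapping that kept the full length passes the check, and a shortened
one is rejected, in both configurations.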

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Maxim Levitsky <[email protected]>
Cc: Quentin Lambert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: extend enough credits for freeing one truncate record while replaying truncate records

Currently ocfs2_replay_truncate_records() first modifies tl_used and
then calls ocfs2_extend_trans() to extend the transaction for the gd and
alloc inode used for freeing clusters.  jbd2_journal_restart() may be
called, and it may happen that tl_used in the truncate log is decreased
but the clusters are not freed, which means these clusters are lost.  So
we should avoid extending transactions in these two operations.

Signed-off-by: joyce.xue <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et

I found that jbd2_journal_restart() is called in some places without
first making things consistent.  However, jbd2_journal_restart() may
commit the handle's transaction and start another one.  If the first
transaction is committed successfully while the second is not, it may
leave the filesystem inconsistent or read-only.  This is an effort to fix
this kind of problem.

This patch (of 3):

The following functions will be called while truncating an extent:
ocfs2_remove_btree_range
  -> ocfs2_start_trans
  -> ocfs2_remove_extent
     -> ocfs2_truncate_rec
       -> ocfs2_extend_rotate_transaction
         -> jbd2_journal_restart if jbd2_journal_extend fails
       -> ocfs2_rotate_tree_left
         -> ocfs2_remove_rightmost_path
             -> ocfs2_extend_rotate_transaction
               -> ocfs2_unlink_subtree
                -> ocfs2_update_edge_lengths
                  -> ocfs2_extend_trans
                    -> jbd2_journal_restart if jbd2_journal_extend fails
  -> ocfs2_et_update_clusters
  -> ocfs2_commit_trans

jbd2_journal_restart() may be called, and it may happen that the buffers
dirtied in ocfs2_truncate_rec() are committed while the buffers dirtied in
ocfs2_et_update_clusters() are not, so the total clusters in the extent
tree and i_clusters in ocfs2_dinode become inconsistent.  The cluster
count read from ocfs2_dinode is then incorrect, and it also causes a
read-only problem when calling ocfs2_commit_truncate(), with the error
message: "Inode %llu has empty extent block at %llu".

We should extend enough credits for function ocfs2_remove_rightmost_path
and ocfs2_update_edge_lengths to avoid this inconsistency.
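
The hazard can be modelled in a few lines of userspace C.  This is a toy
under stated assumptions, not the jbd2 or ocfs2 API: the toy_* names, the
credit numbers and the two counters are hypothetical, and a "restart" is
reduced to committing whatever is pending.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	int tree_clusters;	/* total clusters recorded in the extent tree */
	int inode_clusters;	/* i_clusters recorded in the inode */
};

struct toy_handle {
	int credits;			/* remaining journal credits */
	struct toy_state pending;	/* updates in the running transaction */
	struct toy_state *disk;		/* what survives a crash */
};

static void toy_commit(struct toy_handle *h)
{
	*h->disk = h->pending;
}

/* Models a failed extend followed by a restart: commit, then start afresh. */
static void toy_use_credits(struct toy_handle *h, int need)
{
	if (h->credits < need) {
		toy_commit(h);	/* partial updates become durable here */
		h->credits = need;
	}
	h->credits -= need;
}

/*
 * Remove n clusters and crash just before the final commit.  Returns true
 * if the state left on disk is still consistent.
 */
bool toy_truncate_survives_crash(int initial_credits, int n)
{
	struct toy_state disk = { 100, 100 };
	struct toy_handle h = { initial_credits, disk, &disk };

	toy_use_credits(&h, 1);
	h.pending.tree_clusters -= n;	/* step 1: update the extent tree */
	toy_use_credits(&h, 2);		/* rotation work; may restart here */
	h.pending.inode_clusters -= n;	/* step 2: update the inode */
	/* crash here, before the final commit */
	return disk.tree_clusters == disk.inode_clusters;
}
```

With too few credits up front the restart lands between the two updates
and the crash leaves the counters disagreeing; with enough credits
reserved before step 1 no restart happens in the middle.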

Signed-off-by: joyce.xue <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert

We have found a bug when two nodes unmount one after another.

1) Node 1 migrates a lockres that has 3 locks in the grant queue, such as
   N2(PR)<->N3(NL)<->N4(PR), to N2.  After migration, the lvbs of the locks
   N3(NL) and N4(PR) are empty on node 2 because the migration target does
   not copy the lvb to these two locks.

2) Node 3 wants to convert to PR.  This can be granted in
   __dlmconvert_master(), and the order of these locks is unchanged.  The
   lvb of the lock N3(PR) on node 2 is copied from the lockres in
   dlm_update_lvb() while the lvb of lock N4(PR) is still empty.

3) Node 2 wants to leave the domain, so it will migrate this lockres to
   node 3.  Node 2 will then trigger the BUG in
   dlm_prepare_lvb_for_migration() when adding the lock N4(PR) to mres,
   with the following message, because the lvb of mres is already copied
   from lock N3(PR) but the lvb of lock N4(PR) is empty.

"Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"

[[email protected]: tweak comment]
Signed-off-by: xuejiufei <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: solve a problem of crossing the boundary in updating backups

In update_backups() there exists a problem of crossing the boundary as
follows:

Suppose a LUN is resized to 1TB with a cluster size of 32KB.  It then
contains clusters 0~33554431, but update_backups() backs up the super
block at the 1TB offset, which is cluster 33554432, so the write crosses
the end of the volume.
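
The arithmetic can be checked with a short standalone sketch.  The
toy_* helpers are hypothetical and only illustrate the bound; they are
not the functions used in update_backups().

```c
#include <assert.h>
#include <stdint.h>

/* Number of clusters on a volume of volume_bytes with the given cluster size. */
uint64_t toy_cluster_count(uint64_t volume_bytes, uint32_t cluster_bytes)
{
	assert(cluster_bytes != 0);
	return volume_bytes / cluster_bytes;
}

/*
 * A backup at the given byte offset fits only if its cluster number is
 * strictly below the cluster count, since clusters are numbered from 0.
 */
int toy_backup_in_bounds(uint64_t offset, uint64_t volume_bytes,
			 uint32_t cluster_bytes)
{
	uint64_t cluster = offset / cluster_bytes;

	return cluster < toy_cluster_count(volume_bytes, cluster_bytes);
}
```

For a 1TB volume with 32KB clusters the count is 33554432, so the last
valid cluster is 33554431 and a backup at the 1TB offset is out of
bounds.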

Signed-off-by: Yiwen Jiang <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Xue jiufei <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local

This patch fixes a deadlock, as follows:

  Node 1                Node 2                  Node 3
1)volume a and b are    only mount vol a        only mount vol b
  mounted

2)                      start to mount b        start to mount a

3)                      check hb of Node 3      check hb of Node 2
                        in vol a, qs_holds++    in vol b, qs_holds++

4) -------------------- all nodes' network down --------------------

5)                      mount of b fails, and   the same situation as
                        then it calls           Node 2
                        ocfs2_dismount_volume,
                        but the process hangs
                        since there is a work
                        in ocfs2_wq that cannot be
                        completed. This work is
                        about vol a, because
                        ocfs2_wq is global wq.
                        BTW, this work which is
                        scheduled in ocfs2_wq is
                        ocfs2_orphan_scan_work,
                        and this work needs to
                        take the inode lock of
                        orphan_dir. Because the
                        lockres owner is Node 1 and
                        all nodes' networks went down
                        at the same time, it can't
                        get the inode lock.

6)                      Why can't this node be fenced
                        when the network is disconnected?
                        Because the mount process is
                        hung, which leaves qs_holds
                        not equal to 0.

All works in ocfs2_wq are relative to a super block, so the solution is
to change ocfs2_wq from global to local.  In other words, move it into
struct ocfs2_super.

Signed-off-by: Yiwen Jiang <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Xue jiufei <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list

When the master handles a convert request, it queues the ast first and
then returns the status.  It may happen that the ast is sent before the
request status because these two messages are sent by two threads.  If
the master goes down right after the ast is sent, it may trigger the BUG
in dlm_move_lockres_to_recovery_list on the requesting node because the
ast handler moves the lock to the grant list without clearing
lock->convert_pending.  So remove the BUG_ON statement and check whether
the ast has been processed in dlmconvert_remote.

Signed-off-by: Joseph Qi <[email protected]>
Reported-by: Yiwen Jiang <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Tariq Saeed <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>