Git Repo - linux.git/log

cls_bpf: don't decrement net's refcount when offload fails

When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point. This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().

__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out. The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.

Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.

Fixes: aae2c35ec892 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mmc: sdhci: Avoid swiotlb buffer being full

The commit de3ee99b097d ("mmc: Delete bounce buffer handling") deletes the
bounce buffer handling, but also causes the max_req_size for sdhci to be
increased, in case when max_segs == 1. This causes errors for sdhci-pci
Ricoh variant, about the swiotlb buffer to become full.

Fix the issue, by taking IO_TLB_SEGSIZE and IO_TLB_SHIFT into account when
deciding the max_req_size for sdhci.

Reported-by: Jiri Slaby <[email protected]>
Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling")
Cc: <[email protected]> # v4.14+
Signed-off-by: Ulf Hansson <[email protected]>
Tested-by: Jiri Slaby <[email protected]>
Acked-by: Adrian Hunter <[email protected]>

arm64: mm: cleanup stale AIVIVT references

Since commit:

155433cb365ee466 ("arm64: cache: Remove support for ASID-tagged VIVT I-caches")

... the kernel no longer cares about AIVIVT I-caches, as these were
removed from the architecture.

This patch removes the stale references to such I-caches.

The comment in flush_context() is also updated to clarify when and where
the TLB invalidation occurs.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Merge tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:

- TTM regression fix for some virt gpus (bochs vga)

- a few i915 stable fixes

- one vc4 fix

- one uapi fix

* tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/ttm: don't attempt to use hugepages if dma32 requested (v2)
  drm/vblank: Pass crtc_id to page_flip_ioctl.
  drm/i915: Fix init_clock_gating for resume
  drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
  drm/i915: Clear breadcrumb node when cancelling signaling
  drm/i915/gvt: ensure -ve return value is handled correctly
  drm/i915: Re-register PMIC bus access notifier on runtime resume
  drm/i915: Fix false-positive assert_rpm_wakelock_held in i915_pmic_bus_access_notifier v2
  drm/edid: Don't send non-zero YQ in AVI infoframe for HDMI 1.x sinks
  drm/vc4: Account for interrupts in flight

Revert "ALSA: usb-audio: Fix potential zero-division at parsing FU"

The commit 8428a8ebde2d ("ALSA: usb-audio: Fix potential zero-division
at parsing FU") is utterly bogus and breaks the case with csize=1
instead of fixing anything. Just take it back again.

Reported-by: Jörg Otte <[email protected]>
Fixes: 8428a8ebde2d ("ALSA: usb-audio: Fix potential zero-division at parsing FU"
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

xfs: calculate correct offset in xfs_scrub_quota_item

It's only used for tracepoints so it's relatively harmless,
but the offset is calculated incorrectly in xfs_scrub_quota_item.

qi_dqperchunk is the nr. of dquots per "chunk" which we have
conveniently *cough* defined to always be 1 FSB. Therefore
block_offset * qi_dqperchunk == first id in that chunk,
and so offset = id / qi_dqperchunk

id * dqperchunk is ... meaningless.

Fixes-coverity-id: 1423965
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: fix uninitialized variable in xfs_scrub_quota

On the first pass through the while(1) loop, we get to
xfs_scrub_should_terminate() which can test the uninitialized
error variable.

Fixes-coverity-id: 1423737
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: fix leaks on corruption errors in xfs_bmap.c

Use _GOTO instead of _RETURN so we can free the allocated
cursor on error.

Fixes: bf80628 ("xfs: remove xfs_bmse_shift_one")
Fixes-coverity-id: 1423813, 1423676
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: fortify xfs_alloc_buftarg error handling

percpu_counter_init failure path doesn't clean up &btp->bt_lru list.
Call list_lru_destroy in that error path. Similarly register_shrinker
error path is not handled.

While it is unlikely to trigger these error path, it is not impossible
especially the later might fail with large NUMAs. Let's handle the
failure to make the code more robust.

Noticed-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()

Following condition which will cause NULL pointer dereference will
occur in nvme_free_host_mem() when it tries to remove pci device via
nvme_remove() especially after a failure of host memory allocation for HMB.

"(host_mem_descs == NULL) && (nr_host_mem_descs != 0)"

It's because __nr_host_mem_descs__ is not cleared to 0 unlike
__host_mem_descs__ is so.

Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

nvme-rdma: fix memory leak during queue allocation

In case nvme_rdma_wait_for_cm timeout expires before we get
an established or rejected event (rdma_connect succeeded) from
rdma_cm, we end up with leaking the ib transport resources for
dedicated queue. This scenario can easily reproduced using traffic
test during port toggling.
Also, in order to protect from parallel ib queue destruction, that
may be invoked from different context's, introduce new flag that
stands for transport readiness. While we're here, protect also against
a situation that we can receive rdma_cm events during ib queue destruction.

Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

s390/gs: add compat regset for the guarded storage broadcast control block

git commit e525f8a6e696210d15f8b8277d4da12fc4add299
"s390/gs: add regset for the guarded storage broadcast control block"
added the missing regset to the s390_regsets array but failed to add it
to the s390_compat_regsets array.

Fixes: e525f8a6e696 ("add compat regset for the guarded storage broadcast control block")
Signed-off-by: Martin Schwidefsky <[email protected]>

Btrfs: incremental send, fix wrong unlink path after renaming file

Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:

Parent snapshot

.                                                      (ino 256)
|---- a/                                               (ino 257)
|     |---- b/                                         (ino 259)
|     |     |---- c/                                   (ino 260)
|     |     |---- f2                                   (ino 261)
|     |
|     |---- f2l1                                       (ino 261)
|
|---- d/                                               (ino 262)
       |---- f1l1_2                                     (ino 258)
       |---- f2l2                                       (ino 261)
       |---- f1_2                                       (ino 258)

Send snapshot

.                                                      (ino 256)
|---- a/                                               (ino 257)
|     |---- f2l1/                                      (ino 263)
|             |---- b2/                                (ino 259)
|                   |---- c/                           (ino 260)
|                   |     |---- d3                     (ino 262)
|                   |           |---- f1l1_2           (ino 258)
|                   |           |---- f2l2_2           (ino 261)
|                   |           |---- f1_2             (ino 258)
|                   |
|                   |---- f2                           (ino 261)
|                   |---- f1l2                         (ino 258)
|
|---- d                                                (ino 261)

When computing the incremental send stream the following steps happen:

1) When processing inode 261, a rename operation is issued that renames
   inode 262, which currently as a path of "d", to an orphan name of
   "o262-7-0". This is done because in the send snapshot, inode 261 has
   of its hard links with a path of "d" as well.

2) Two link operations are issued that create the new hard links for
   inode 261, whose names are "d" and "f2l2_2", at paths "/" and
   "o262-7-0/" respectively.

3) Still while processing inode 261, unlink operations are issued to
   remove the old hard links of inode 261, with names "f2l1" and "f2l2",
   at paths "a/" and "d/". However path "d/" does not correspond anymore
   to the directory inode 262 but corresponds instead to a hard link of
   inode 261 (link command issued in the previous step). This makes the
   receiver fail with a ENOTDIR error when attempting the unlink
   operation.

The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

net/packet: fix a race in packet_bind() and packet_notifier()

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.

Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945!  ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS:  0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
tun_detach drivers/net/tun.c:670 [inline]
tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
__fput+0x333/0x7f0 fs/file_table.c:210
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:865
do_group_exit+0x149/0x400 kernel/exit.c:968
SYSC_exit_group kernel/exit.c:979 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:977
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Francesco Ruggeri <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

packet: fix crash in fanout_demux_rollover()

syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.

po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release(). It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.

To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors. This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).

Crashing stack trace:
fanout_demux_rollover+0xb6/0x4d0 net/packet/af_packet.c:1392
packet_rcv_fanout+0x649/0x7c8 net/packet/af_packet.c:1487
dev_queue_xmit_nit+0x835/0xc10 net/core/dev.c:1953
xmit_one net/core/dev.c:2975 [inline]
dev_hard_start_xmit+0x16b/0xac0 net/core/dev.c:2995
__dev_queue_xmit+0x17a4/0x2050 net/core/dev.c:3476
dev_queue_xmit+0x17/0x20 net/core/dev.c:3509
neigh_connected_output+0x489/0x720 net/core/neighbour.c:1379
neigh_output include/net/neighbour.h:482 [inline]
ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
NF_HOOK_COND include/linux/netfilter.h:239 [inline]
ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
dst_output include/net/dst.h:459 [inline]
NF_HOOK.constprop.35+0xff/0x630 include/linux/netfilter.h:250
mld_sendpack+0x6a8/0xcc0 net/ipv6/mcast.c:1660
mld_send_initial_cr.part.24+0x103/0x150 net/ipv6/mcast.c:2072
mld_send_initial_cr net/ipv6/mcast.c:2056 [inline]
ipv6_mc_dad_complete+0x99/0x130 net/ipv6/mcast.c:2079
addrconf_dad_completed+0x595/0x970 net/ipv6/addrconf.c:4039
addrconf_dad_work+0xac9/0x1160 net/ipv6/addrconf.c:3971
process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2113
worker_thread+0x223/0x1990 kernel/workqueue.c:2247
kthread+0x35e/0x430 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:432

Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")
Fixes: 509c7a1ecc860 ("packet: avoid panic in packet_getsockopt()")
Reported-by: syzbot <[email protected]>
Signed-off-by: Mike Maloney <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'sctp-fix-sparse-errors'

Xin Long says:

====================
sctp: fix some other sparse errors

After the last fixes for sparse errors, there are still three sparse
errors in sctp codes, two of them are type cast, and the other one
is using extern.
====================

Signed-off-by: David S. Miller <[email protected]>

sctp: remove extern from stream sched

Now each stream sched ops is defined in different .c file and
added into the global ops in another .c file, it uses extern
to make this work.

However extern is not good coding style to get them in and
even make C=2 reports errors for this.

This patch adds sctp_sched_ops_xxx_init for each stream sched
ops in their .c file, then get them into the global ops by
calling them when initializing sctp module.

Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler")
Fixes: ac1ed8b82cd6 ("sctp: introduce round robin stream scheduler")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: force the params with right types for sctp csum apis

Now sctp_csum_xxx doesn't really match the param types of these common
csum apis. As sctp_csum_xxx is defined in sctp/checksum.h, many sparse
errors occur when make C=2 not only with M=net/sctp but also with other
modules that include this header file.

This patch is to force them fit in csum apis with the right types.

Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail

This patch is to force SCTP_ERROR_INV_STRM with right type to
fit in sctp_chunk_fail to avoid the sparse error.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

lmc: Use memdup_user() as a cleanup

Fix coccicheck warning which recommends to use memdup_user():
drivers/net/wan/lmc/lmc_main.c:497:27-34: WARNING opportunity for memdup_user
Generated by: scripts/coccinelle/memdup_user/memdup_user.cocci

Signed-off-by: Vasyl Gomonovych <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix an error handling path in 'bnxt_get_module_eeprom()'

Error code returned by 'bnxt_read_sfp_module_eeprom_info()' is handled a
few lines above when reading the A0 portion of the EEPROM.
The same should be done when reading the A2 portion of the EEPROM.

In order to correctly propagate an error, update 'rc' in this 2nd call as
well, otherwise 0 (success) is returned.

Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: marvell10g: fix the PHY id mask

The Marvell 10G PHY driver supports different hardware revisions, which
have their bits 3..0 differing. To get the correct revision number these
bits should be ignored. This patch fixes this by using the already
defined MARVELL_PHY_ID_MASK (0xfffffff0) instead of the custom
0xffffffff mask.

Fixes: 20b2af32ff3f ("net: phy: add Marvell Alaska X 88X3310 10Gigabit PHY support")
Suggested-by: Yan Markman <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mvpp2-fixes'

Antoine Tenart says:

====================
net: mvpp2: set of fixes

This series fixes various issues with the Marvell PPv2 driver. The
patches are sent together to avoid any possible conflict. The series is
based on today's net tree.
====================

Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: check ethtool sets the Tx ring size is to a valid min value

This patch fixes the Tx ring size checks when using ethtool, by adding
an extra check in the PPv2 check_ringparam_valid helper. The Tx ring
size cannot be set to a value smaller than the minimum number of
descriptors needed for TSO.

Fixes: 1d17db08c056 ("net: mvpp2: limit TSO segments and use stop/wake thresholds")
Suggested-by: Yan Markman <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: do not disable GMAC padding

Short fragmented packets may never be sent by the hardware when padding
is disabled. This patch stop modifying the GMAC padding bits, to leave
them to their reset value (disabled).

Fixes: 3919357fb0bb ("net: mvpp2: initialize the GMAC when using a port")
Signed-off-by: Yan Markman <[email protected]>
[Antoine: commit message]
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: cleanup probed ports in the probe error path

This patches fixes the probe error path by cleaning up probed ports, to
avoid leaving registered net devices when the driver failed to probe.

Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: fix the txq_init error path

When an allocation in the txq_init path fails, the allocated buffers
end-up being freed twice: in the txq_init error path, and in txq_deinit.
This lead to issues as txq_deinit would work on already freed memory
regions:

kernel BUG at mm/slub.c:3915!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

This patch fixes this by removing the txq_init own error path, as the
txq_deinit function is always called on errors. This was introduced by
TSO as way more buffers are allocated.

Fixes: 186cd4d4e414 ("net: mvpp2: software tso support")
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

quota: propagate error from __dquot_initialize

In commit 6184fc0b8dd7 ("quota: Propagate error from ->acquire_dquot()"),
we have propagated error from __dquot_initialize to caller, but we forgot
to handle such error in add_dquot_ref(), so, currently, during quota
accounting information initialization flow, if we failed for some of
inodes, we just ignore such error, and do account for others, which is
not a good implementation.

In this patch, we choose to let user be aware of such error, so after
turning on quota successfully, we can make sure all inodes disk usage
can be accounted, which will be more reasonable.

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

Merge branch 'mlxsw-GRE-offloading-fixes'

Jiri Pirko says:

====================
mlxsw: GRE offloading fixes

Petr says:

This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw
driver.

Patch #1 fixes a problem that local routes pointing at a GRE tunnel
device are offloaded even if that netdevice is down.

Patch #2 detects that as a result of moving a GRE netdevice to a
different VRF, two tunnels now have a conflict of local addresses,
something that the mlxsw driver can't offload.

Patch #3 fixes a FIB abort caused by forming a route pointing at a
GRE tunnel that is eligible for offloading but already onloaded.

Patch #4 fixes a problem that next hops migrated to a new RIF kept the
old RIF reference, which went dangling shortly afterwards.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Update nexthop RIF on update

The function mlxsw_sp_nexthop_rif_update() walks the list of nexthops
associated with a RIF, and updates the corresponding entries in the
switch. It is used in particular when a tunnel underlay netdevice moves
to a different VRF, and all the nexthops are migrated over to a new RIF.
The problem is that each nexthop holds a reference to its RIF, and that
is not updated. So after the old RIF is gone, further activity on these
nexthops (such as downing the underlay netdevice) dereferences a
dangling pointer.

Fix the issue by updating rif of impacted nexthops before calling
mlxsw_sp_nexthop_rif_update().

Fixes: 0c5f1cd5ba8c ("mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Handle encap to demoted tunnels

Some tunnels that are offloadable on their own can nonetheless be
demoted to slow path if their local address is in conflict with that of
another tunnel. When a route is formed for such a tunnel,
mlxsw_sp_nexthop_ipip_init() fails to find the corresponding IPIP entry,
and that triggers a FIB abort.

Resolve the problem by not assuming that a tunnel for which
mlxsw_sp_ipip_ops.can_offload() holds also automatically has an IPIP
entry.

Fixes: af641713e97d ("mlxsw: spectrum_router: Onload conflicting tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Demote tunnels on VRF migration

The mlxsw driver currently doesn't offload GRE tunnels if they have the
same local address and use the same underlay VRF. When such a situation
arises, the tunnels in conflict are demoted to slow path.

However, the current code only verifies this condition on tunnel
creation and tunnel change, not when a tunnel is moved to a different
VRF. When the tunnel has no bound device, underlay and overlay are the
same. Thus moving a tunnel moves the underlay as well, and that can
cause local address conflict.

So modify mlxsw_sp_netdevice_ipip_ol_vrf_event() to check if there are
any conflicting tunnels, and demote them if yes.

Fixes: af641713e97d ("mlxsw: spectrum_router: Onload conflicting tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Offload decap only for up tunnels

When a new local route is added, an IPIP entry is looked up to determine
whether the route should be offloaded as a tunnel decap or as a trap.
That decision should take into account whether the tunnel netdevice in
question is actually IFF_UP, and only install a decap offload if it is.

Fixes: 0063587d3587 ("mlxsw: spectrum: Support decap-only IP-in-IP tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2017-11-27

This series contains updates to e1000, e1000e and i40e.

Gustavo A. R. Silva fixes a sizeof() issue where we were taking the size of
the pointer (which is always the size of the pointer).

Sasha does a follow up fix to a previous fix for buffer overrun, to resolve
community feedback from David Laight and the use of magic numbers.

Amritha fixes the reporting of error codes for when adding a cloud filter
fails.

Ahmad Fatoum brushes the dust off the e1000 driver to fix a code comment
and debug message which was incorrect about what the code was really doing.
====================

Signed-off-by: David S. Miller <[email protected]>

btrfs: tree-checker: Fix false panic for sanity test

[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:

------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
setup_items_for_insert+0x385/0x650 [btrfs]
__btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----

[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.

However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.

This makes tree-checker catch uninitialized data as error, causing
such panic.

*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()

[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.

So we can still get early warning if we screw up item pointers, and
avoid false panic.

Cc: Filipe Manana <[email protected]>
Reported-by: Lakshmipathi.G <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Merge tag 'gvt-fixes-2017-11-28' of https://github.com/intel/gvt-linux into drm-intel-fixes

gvt-fixes-2017-11-28

- regression fix for sane request alloc (Fred)
- locking fix (Changbin)
- fix invalid addr mask (Xiong)
- compression regression fix (Weinan)
- fix default pipe enable for virtual display (Xiaolin)

Signed-off-by: Joonas Lahtinen <[email protected]>

drm/i915/gvt: Correct ADDR_4K/2M/1G_MASK definition

For ADDR_4K_MASK, bit[45..12] should be 1, all other bits
should be 0. The current definition wrongly set bit[46] as 1
also. This path fixes this.

v2: Add commit message, fixes and cc stable.(Zhenyu)

Fixes: 2707e4446688("drm/i915/gvt: vGPU graphics memory virtualization")
Signed-off-by: Xiong Zhang <[email protected]>
Cc: [email protected]
Signed-off-by: Zhenyu Wang <[email protected]>

drm/i915/gvt: enabled pipe A default on creating vgpu

when i915 driver unloading, it will shutdown all CRTCs and
it will introudce kernel panic when conducting igt drv_module_reload
test case under guest environment (bug reported by XENGT-468) as below:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
IP: intel_edp_backlight_off+0xe/0x7c [i915]
RIP: 0010:intel_edp_backlight_off+0xe/0x7c [i915]
Call Trace:
intel_disable_ddi+0xb3/0xbc [i915]
intel_modeset_setup_hw_state+0x654/0xb4c [i915]
intel_modeset_init+0x9f1/0xe69 [i915]
? intel_i2c_reset+0x3d/0x40 [i915]
? intel_setup_gmbus+0xba/0x249 [i915]
i915_driver_load+0xae5/0xcc0 [i915]
i915_pci_probe+0x3a/0x3c [i915]
local_pci_probe+0x38/0x7b
pci_device_probe+0xec/0x12b
driver_probe_device+0x134/0x294
__driver_attach+0x6a/0x8c
? driver_probe_device+0x294/0x294
bus_for_each_dev+0x68/0x80
driver_attach+0x19/0x1b
bus_add_driver+0xea/0x1d3
? 0xffffffffa03cd000
driver_register+0x85/0xc1
? 0xffffffffa03cd000
__pci_register_driver+0x55/0x57
i915_init+0x57/0x5a [i915]
do_one_initcall+0x8a/0x12e
? __vunmap+0x8d/0x93
? kmem_cache_alloc_trace+0x96/0x11c
do_init_module+0x5a/0x1e1

in this case, active connector detected but no active pipe
available, so it will hang to disable connector.

to fix, on vgpu creating, to report active pipe available for
guest.

Signed-off-by: Xiaolin Zhang <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>

drm/i915/gvt: Move request alloc to dispatch_workload path only

Previously the performance is improved through the workload auditing
and shadowing ahead of vGPU scheduling, however, there is the case that
more requests are allocated in submit_context before the previous request
is added, the timeline will hold its seqno which is later.

This patch is to move the request alloc to dispatch_workload function,
where is the same place as request is added.

It will fix the issue of kernel BUG for (timeline->seqno != request->fence.seqno)
check when add_request.

Fixes: 89ea20b930cb ("drm/i915/gvt: Factor out scan and shadow from workload dispatch")
Signed-off-by: Chuanxiao Dong <[email protected]>
Signed-off-by: fred gao <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>
(cherry picked from commit f2880e04f3a5419366926182fc97a3c2e4fd8f2a)

drm/i915/gvt: remove skl_misc_ctl_write handler

With different settings of compressed data hash mode between VMs and host
may cause gpu issues.

Commit: 1999f108c ("drm/i915/gvt: Disable compression workaround for Gen9")
disable compression workaround of guest in gvt host to align with host.

Commit: 93564044f ("drm/i915: Switch over to the LLC/eLLC hotspot avoidance
hash mode for CCS") add compression workaround, then we can remove the
skl_misc_ctl_write hanlder.

Better solution should be always keeping same settings as host, and bypass
the write request from VMs, but it need to fetch data from host's
"Context".

Cc: Zhi Wang <[email protected]>
Signed-off-by: Weinan Li <[email protected]>
Signed-off-by: Xiong Zhang <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>

drm/i915/gvt: Fix unsafe locking caused by spin_unlock_bh

The caller of shadow_context_status_change may disable irqs. So it is not
safe to use spin_unlock_bh in such context. Let's switch to irqsave version
for safety.

------------[ cut here ]------------
WARNING: CPU: 2 PID: 4504 at kernel/softirq.c:161 __local_bh_enable_ip+0x46/0x60
[  168.797710] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[  168.797712] task: ffff8c693d22db80 task.stack: ffffb51b482bc000
[  168.797718] RIP: 0010:__local_bh_enable_ip+0x46/0x60
[  168.797721] RSP: 0018:ffffb51b482bfa10 EFLAGS: 00010046
[  168.797724] RAX: 0000000000000046 RBX: ffff8c6900278000 RCX: 00000000ffffffff
[  168.797726] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffffffffc06a0330
[  168.797728] RBP: ffffb51b482bfa10 R08: 0000000000000000 R09: ffff8c690027cb90
[  168.797730] R10: ffffb51b482bfa40 R11: 00000004072f0001 R12: 0000000000000000
[  168.797732] R13: 0000000000000000 R14: ffff8c690027ca9c R15: 0000000000000000
[  168.797735] FS:  00007ff187c56700(0000) GS:ffff8c6959d00000(0000) knlGS:0000000000000000
[  168.797738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.797740] CR2: 0000562bc0c3991f CR3: 0000000430614006 CR4: 00000000003606e0
[  168.797742] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.797744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.797745] Call Trace:
[  168.797755]  _raw_spin_unlock_bh+0x1e/0x20
[  168.797826]  shadow_context_status_change+0x120/0x1e0 [i915]
[  168.797831]  notifier_call_chain+0x4a/0x70
[  168.797834]  atomic_notifier_call_chain+0x1a/0x20
[  168.797896]  execlists_cancel_port_requests+0x4f/0x80 [i915]
[  168.797956]  reset_common_ring+0x30/0x100 [i915]
[  168.798007]  i915_gem_reset_engine+0x114/0x330 [i915]
[  168.798060]  ? i915_gem_retire_requests+0x75/0x180 [i915]
[  168.798111]  i915_gem_reset+0x3e/0xb0 [i915]
[  168.798149]  i915_reset+0x10b/0x1c0 [i915]
[  168.798187]  i915_reset_device+0x209/0x220 [i915]
[  168.798225]  ? gen8_gt_irq_ack+0x170/0x170 [i915]
[  168.798229]  ? __queue_work+0x430/0x430
[  168.798270]  i915_handle_error+0x285/0x420 [i915]
[  168.798275]  ? mntput+0x24/0x40
[  168.798281]  ? terminate_walk+0x8e/0xf0
[  168.798328]  i915_wedged_set+0x84/0xc0 [i915]
[  168.798333]  simple_attr_write+0xab/0xc0
[  168.798337]  full_proxy_write+0x54/0x90
[  168.798343]  __vfs_write+0x37/0x170
[  168.798349]  ? common_file_perm+0x4c/0x100
[  168.798355]  ? apparmor_file_permission+0x1a/0x20
[  168.798361]  ? security_file_permission+0x3b/0xc0
[  168.798365]  vfs_write+0xb8/0x1b0
[  168.798370]  SyS_write+0x55/0xc0
[  168.798376]  entry_SYSCALL_64_fastpath+0x1e/0xa9

Fixes: 0e86cc9 ("drm/i915/gvt: implement per-vm mmio switching optimization")
Signed-off-by: Changbin Du <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>

drm/i915: fix intel_backlight_device_register declaration

The alternative intel_backlight_device_register() definition apparently
never got used, but I have now run into a case of i915 being compiled
without CONFIG_BACKLIGHT_CLASS_DEVICE, resulting in a number of
identical warnings:

drivers/gpu/drm/i915/intel_drv.h:1739:12: error: 'intel_backlight_device_register' defined but not used [-Werror=unused-function]

This marks the function as 'inline', which was surely the original
intention here.

Fixes: 1ebaa0b9c2d4 ("drm/i915: Move backlight registration to connector registration")
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 2de2d0b063b08becb2c67a2c338c44e37bdcffee)
Signed-off-by: Joonas Lahtinen <[email protected]>

drm/i915/fbdev: Serialise early hotplug events with async fbdev config

As both the hotplug event and fbdev configuration run asynchronously, it
is possible for them to run concurrently. If configuration fails, we were
freeing the fbdev causing a use-after-free in the hotplug event.

<7>[ 3069.935211] [drm:intel_fb_initial_config [i915]] Not using firmware configuration
<7>[ 3069.935225] [drm:drm_setup_crtcs] looking for cmdline mode on connector 77
<7>[ 3069.935229] [drm:drm_setup_crtcs] looking for preferred mode on connector 77 0
<7>[ 3069.935233] [drm:drm_setup_crtcs] found mode 3200x1800
<7>[ 3069.935236] [drm:drm_setup_crtcs] picking CRTCs for 8192x8192 config
<7>[ 3069.935253] [drm:drm_setup_crtcs] desired mode 3200x1800 set on crtc 43 (0,0)
<7>[ 3069.935323] [drm:intelfb_create [i915]] no BIOS fb, allocating a new one
<4>[ 3069.967737] general protection fault: 0000 [#1] PREEMPT SMP
<0>[ 3069.977453] ---------------------------------
<4>[ 3069.977457] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mei_me mii prime_numbers mei i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915]
<4>[ 3069.977492] CPU: 1 PID: 15414 Comm: kworker/1:0 Tainted: G     U          4.14.0-CI-CI_DRM_3388+ #1
<4>[ 3069.977497] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
<4>[ 3069.977508] Workqueue: events output_poll_execute
<4>[ 3069.977512] task: ffff880177734e40 task.stack: ffffc90001fe4000
<4>[ 3069.977519] RIP: 0010:__lock_acquire+0x109/0x1b60
<4>[ 3069.977523] RSP: 0018:ffffc90001fe7bb0 EFLAGS: 00010002
<4>[ 3069.977526] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000282 RCX: 0000000000000000
<4>[ 3069.977530] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880170d4efd0
<4>[ 3069.977534] RBP: ffffc90001fe7c70 R08: 0000000000000001 R09: 0000000000000000
<4>[ 3069.977538] R10: 0000000000000000 R11: ffffffff81899609 R12: ffff880170d4efd0
<4>[ 3069.977542] R13: ffff880177734e40 R14: 0000000000000001 R15: 0000000000000000
<4>[ 3069.977547] FS:  0000000000000000(0000) GS:ffff88017fc80000(0000) knlGS:0000000000000000
<4>[ 3069.977551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 3069.977555] CR2: 00007f7e8b7bcf04 CR3: 0000000003e0f000 CR4: 00000000003406e0
<4>[ 3069.977559] Call Trace:
<4>[ 3069.977565]  ? mark_held_locks+0x64/0x90
<4>[ 3069.977571]  ? _raw_spin_unlock_irq+0x24/0x50
<4>[ 3069.977575]  ? _raw_spin_unlock_irq+0x24/0x50
<4>[ 3069.977579]  ? trace_hardirqs_on_caller+0xde/0x1c0
<4>[ 3069.977583]  ? _raw_spin_unlock_irq+0x2f/0x50
<4>[ 3069.977588]  ? finish_task_switch+0xa5/0x210
<4>[ 3069.977592]  ? lock_acquire+0xaf/0x200
<4>[ 3069.977596]  lock_acquire+0xaf/0x200
<4>[ 3069.977600]  ? __mutex_lock+0x5e9/0x9b0
<4>[ 3069.977604]  _raw_spin_lock+0x2a/0x40
<4>[ 3069.977608]  ? __mutex_lock+0x5e9/0x9b0
<4>[ 3069.977612]  __mutex_lock+0x5e9/0x9b0
<4>[ 3069.977616]  ? drm_fb_helper_hotplug_event.part.19+0x16/0xa0
<4>[ 3069.977621]  ? drm_fb_helper_hotplug_event.part.19+0x16/0xa0
<4>[ 3069.977625]  drm_fb_helper_hotplug_event.part.19+0x16/0xa0
<4>[ 3069.977630]  output_poll_execute+0x8d/0x180
<4>[ 3069.977635]  process_one_work+0x22e/0x660
<4>[ 3069.977640]  worker_thread+0x48/0x3a0
<4>[ 3069.977644]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4>[ 3069.977649]  kthread+0x102/0x140
<4>[ 3069.977653]  ? process_one_work+0x660/0x660
<4>[ 3069.977657]  ? kthread_create_on_node+0x40/0x40
<4>[ 3069.977662]  ret_from_fork+0x27/0x40
<4>[ 3069.977666] Code: 8d 62 f8 c3 49 81 3c 24 e0 fa 3c 82 41 be 00 00 00 00 45 0f 45 f0 83 fe 01 77 86 89 f0 49 8b 44 c4 08 48 85 c0 0f 84 76 ff ff ff <f0> ff 80 38 01 00 00 8b 1d 62 f9 e8 01 45 8b 85 b8 08 00 00 85
<1>[ 3069.977707] RIP: __lock_acquire+0x109/0x1b60 RSP: ffffc90001fe7bb0
<4>[ 3069.977712] ---[ end trace 4ad012eb3af62df7 ]---

In order to keep the dev_priv->ifbdev alive after failure, we have to
avoid the free and leave it empty until we unload the module (which is
less than ideal, but a necessary evil for simplicity). Then we can use
intel_fbdev_sync() to serialise the hotplug event with the configuration.
The serialisation between the two was removed in commit 934458c2c95d
("Revert "drm/i915: Fix races on fbdev""), but the use after free is much
older, commit 366e39b4d2c5 ("drm/i915: Tear down fbdev if initialization
fails")

Fixes: 366e39b4d2c5 ("drm/i915: Tear down fbdev if initialization fails")
Fixes: 934458c2c95d ("Revert "drm/i915: Fix races on fbdev"")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Lukas Wunner <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Reviewed-by: Lukas Wunner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ad88d7fc6c032ddfb32b8d496a070ab71de3a64f)
Signed-off-by: Joonas Lahtinen <[email protected]>

drm/i915: Prevent zero length "index" write

The hardware always writes one or two bytes in the index portion of
an indexed transfer. Make sure the message we send as the index
doesn't have a zero length.

Cc: [email protected]
Cc: Daniel Kurtz <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sean Paul <[email protected]>
Fixes: 56f9eac05489 ("drm/i915/intel_i2c: use INDEX cycles for i2c read transactions")
Signed-off-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Chris Wilson <[email protected]>
(cherry picked from commit bb9e0d4bca50f429152e74a459160b41f3d60fb2)
Signed-off-by: Joonas Lahtinen <[email protected]>

drm/i915: Don't try indexed reads to alternate slave addresses

We can only specify the one slave address to indexed reads/writes.
Make sure the messages we check are destined to the same slave
address before deciding to do an indexed transfer.

Cc: [email protected]
Cc: Daniel Kurtz <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sean Paul <[email protected]>
Fixes: 56f9eac05489 ("drm/i915/intel_i2c: use INDEX cycles for i2c read transactions")
Signed-off-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Chris Wilson <[email protected]>
(cherry picked from commit c4deb62d7821672265b87952bcd1c808f3bf3e8f)
Signed-off-by: Joonas Lahtinen <[email protected]>

hwmon: (pmbus) Use 64bit math for DIRECT format values

Power values in the 100s of watt range can easily blow past
32bit math limits when processing everything in microwatts.

Use 64bit math instead to avoid these issues on common 32bit ARM
BMC platforms.

Fixes: 442aba78728e ("hwmon: PMBus device driver")
Signed-off-by: Robert Lippert <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>

proc: don't report kernel addresses in /proc/<pid>/stack

This just changes the file to report them as zero, although maybe even
that could be removed. I checked, and at least procps doesn't actually
seem to parse the 'stack' file at all.

And since the file doesn't necessarily even exist (it requires
CONFIG_STACKTRACE), possibly other tools don't really use it either.

That said, in case somebody parses it with tools, just having that zero
there should keep such tools happy.

Signed-off-by: Linus Torvalds <[email protected]>

apparmor: fix oops in audit_signal_cb hook

The apparmor_audit_data struct ordering got messed up during a merge
conflict, resulting in the signal integer and peer pointer being in
a union instead of a struct.

For most of the 4.13 and 4.14 life cycle, this was hidden by
commit 651e28c5537a ("apparmor: add base infastructure for socket
mediation") which fixed the apparmor_audit_data struct when its data
was added. When that commit was reverted in -rc7 the signal audit bug
was exposed, and unfortunately it never showed up in any of the
testing until after 4.14 was released. Shaun Khan, Zephaniah
E. Loss-Cutler-Hull filed nearly simultaneous bug reports (with
different oopes, the smaller of which is included below).

Full credit goes to Tetsuo Handa for jumping on this as well and
noticing the audit data struct problem and reporting it.

[   76.178568] BUG: unable to handle kernel paging request at
ffffffff0eee3bc0
[   76.178579] IP: audit_signal_cb+0x6c/0xe0
[   76.178581] PGD 1a640a067 P4D 1a640a067 PUD 0
[   76.178586] Oops: 0000 [#1] PREEMPT SMP
[   76.178589] Modules linked in: fuse rfcomm bnep usblp uvcvideo btusb
btrtl btbcm btintel bluetooth ecdh_generic ip6table_filter ip6_tables
xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
iptable_filter ip_tables x_tables intel_rapl joydev wmi_bmof serio_raw
iwldvm iwlwifi shpchp kvm_intel kvm irqbypass autofs4 algif_skcipher
nls_iso8859_1 nls_cp437 crc32_pclmul ghash_clmulni_intel
[   76.178620] CPU: 0 PID: 10675 Comm: pidgin Not tainted
4.14.0-f1-dirty #135
[   76.178623] Hardware name: Hewlett-Packard HP EliteBook Folio
9470m/18DF, BIOS 68IBD Ver. F.62 10/22/2015
[   76.178625] task: ffff9c7a94c31dc0 task.stack: ffffa09b02a4c000
[   76.178628] RIP: 0010:audit_signal_cb+0x6c/0xe0
[   76.178631] RSP: 0018:ffffa09b02a4fc08 EFLAGS: 00010292
[   76.178634] RAX: ffffa09b02a4fd60 RBX: ffff9c7aee0741f8 RCX:
0000000000000000
[   76.178636] RDX: ffffffffee012290 RSI: 0000000000000006 RDI:
ffff9c7a9493d800
[   76.178638] RBP: ffffa09b02a4fd40 R08: 000000000000004d R09:
ffffa09b02a4fc46
[   76.178641] R10: ffffa09b02a4fcb8 R11: ffff9c7ab44f5072 R12:
ffffa09b02a4fd40
[   76.178643] R13: ffffffff9e447be0 R14: ffff9c7a94c31dc0 R15:
0000000000000001
[   76.178646] FS:  00007f8b11ba2a80(0000) GS:ffff9c7afea00000(0000)
knlGS:0000000000000000
[   76.178648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.178650] CR2: ffffffff0eee3bc0 CR3: 00000003d5209002 CR4:
00000000001606f0
[   76.178652] Call Trace:
[   76.178660]  common_lsm_audit+0x1da/0x780
[   76.178665]  ? d_absolute_path+0x60/0x90
[   76.178669]  ? aa_check_perms+0xcd/0xe0
[   76.178672]  aa_check_perms+0xcd/0xe0
[   76.178675]  profile_signal_perm.part.0+0x90/0xa0
[   76.178679]  aa_may_signal+0x16e/0x1b0
[   76.178686]  apparmor_task_kill+0x51/0x120
[   76.178690]  security_task_kill+0x44/0x60
[   76.178695]  group_send_sig_info+0x25/0x60
[   76.178699]  kill_pid_info+0x36/0x60
[   76.178703]  SYSC_kill+0xdb/0x180
[   76.178707]  ? preempt_count_sub+0x92/0xd0
[   76.178712]  ? _raw_write_unlock_irq+0x13/0x30
[   76.178716]  ? task_work_run+0x6a/0x90
[   76.178720]  ? exit_to_usermode_loop+0x80/0xa0
[   76.178723]  entry_SYSCALL_64_fastpath+0x13/0x94
[   76.178727] RIP: 0033:0x7f8b0e58b767
[   76.178729] RSP: 002b:00007fff19efd4d8 EFLAGS: 00000206 ORIG_RAX:
000000000000003e
[   76.178732] RAX: ffffffffffffffda RBX: 0000557f3e3c2050 RCX:
00007f8b0e58b767
[   76.178735] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
000000000000263b
[   76.178737] RBP: 0000000000000000 R08: 0000557f3e3c2270 R09:
0000000000000001
[   76.178739] R10: 000000000000022d R11: 0000000000000206 R12:
0000000000000000
[   76.178741] R13: 0000000000000001 R14: 0000557f3e3c13c0 R15:
0000000000000000
[   76.178745] Code: 48 8b 55 18 48 89 df 41 b8 20 00 08 01 5b 5d 48 8b
42 10 48 8b 52 30 48 63 48 4c 48 8b 44 c8 48 31 c9 48 8b 70 38 e9 f4 fd
00 00 <48> 8b 14 d5 40 27 e5 9e 48 c7 c6 7d 07 19 9f 48 89 df e8 fd 35
[   76.178794] RIP: audit_signal_cb+0x6c/0xe0 RSP: ffffa09b02a4fc08
[   76.178796] CR2: ffffffff0eee3bc0
[   76.178799] ---[ end trace 514af9529297f1a3 ]---

Fixes: cd1dbf76b23d ("apparmor: add the ability to mediate signals")
Reported-by: Zephaniah E. Loss-Cutler-Hull <[email protected]>
Reported-by: Shuah Khan <[email protected]>
Suggested-by: Tetsuo Handa <[email protected]>
Tested-by: Ivan Kozik <[email protected]>
Tested-by: Zephaniah E. Loss-Cutler-Hull <[email protected]>
Tested-by: Christian Boltz <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Cc: [email protected]
Signed-off-by: John Johansen <[email protected]>

e1000: Fix off-by-one in debug message

Signed-off-by: Ahmad Fatoum <[email protected]>
Tested-by: Aaron Brown <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

i40e: Fix reporting incorrect error codes

Adding cloud filters could fail for a number of reasons,
unsupported filter fields for example, which fails during
validation of fields itself. This will not result in admin
command errors and converting the admin queue status to posix
error code using i40e_aq_rc_to_posix would result in incorrect
error values. If the failure was due to AQ error itself,
reporting that correctly is handled in the inner function.

Signed-off-by: Amritha Nambiar <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

e1000e: fix the use of magic numbers for buffer overrun issue

This is a follow on to commit b10effb92e27 ("fix buffer overrun while the
I219 is processing DMA transactions") to address David Laights concerns
about the use of "magic" numbers. So define masks as well as add
additional code comments to give a better understanding of what needs to
be done to avoid a buffer overrun.

Signed-off-by: Sasha Neftin <[email protected]>
Reviewed-by: Alexander H Duyck <[email protected]>
Reviewed-by: Dima Ruinskiy <[email protected]>
Reviewed-by: Raanan Avargil <[email protected]>
Tested-by: Aaron Brown <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

i40e/virtchnl: fix application of sizeof to pointer

sizeof when applied to a pointer typed expression gives the size of
the pointer.

The proper fix in this particular case is to code sizeof(*vfres)
instead of sizeof(vfres).

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A R Silva <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

lockd: fix "list_add double add" caused by legacy signal interface

restart_grace() uses hardcoded init_net.
It can cause to "list_add double add" in following scenario:

1) nfsd and lockd was started in several net namespaces
2) nfsd in init_net was stopped (lockd was not stopped because
it have users from another net namespaces)
3) lockd got signal, called restart_grace() -> set_grace_period()
and enabled lock_manager in hardcoded init_net.
4) nfsd in init_net is started again,
its lockd_up() calls set_grace_period() and tries to add
lock_manager into init_net 2nd time.

Jeff Layton suggest:
"Make it safe to call locks_start_grace multiple times on the same
lock_manager. If it's already on the global grace_list, then don't try
to add it again. (But we don't intentionally add twice, so for now we
WARN about that case.)

With this change, we also need to ensure that the nfsd4 lock manager
initializes the list before we call locks_start_grace. While we're at
it, move the rest of the nfsd_net initialization into
nfs4_state_create_net. I see no reason to have it spread over two
functions like it is today."

Suggested patch was updated to generate warning in described situation.

Suggested-by: Jeff Layton <[email protected]>
Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nlm_shutdown_hosts_net() cleanup

nlm_complain_hosts() walks through nlm_server_hosts hlist, which should
be protected by nlm_host_mutex.

Signed-off-by: Vasily Averin <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

race of nfsd inetaddr notifiers vs nn->nfsd_serv change

nfsd_inet[6]addr_event uses nn->nfsd_serv without taking nfsd_mutex,
which can be changed during execution of notifiers and crash the host.

Moreover if notifiers were enabled in one net namespace they are enabled
in all other net namespaces, from creation until destruction.

This patch allows notifiers to access nn->nfsd_serv only after the
pointer is correctly initialized and delays cleanup until notifiers are
no longer in use.

Signed-off-by: Vasily Averin <[email protected]>
Tested-by: Scott Mayhew <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

race of lockd inetaddr notifiers vs nlmsvc_rqst change

lockd_inet[6]addr_event use nlmsvc_rqst without taken nlmsvc_mutex,
nlmsvc_rqst can be changed during execution of notifiers and crash the host.

Patch enables access to nlmsvc_rqst only when it was correctly initialized
and delays its cleanup until notifiers are no longer in use.

Note that nlmsvc_rqst can be temporally set to ERR_PTR, so the "if
(nlmsvc_rqst)" check in notifiers is insufficient on its own.

Signed-off-by: Vasily Averin <[email protected]>
Tested-by: Scott Mayhew <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

SUNRPC: make cache_detail structures const

Make these const as they are only getting passed to the function
cache_create_net having the argument as const.

Signed-off-by: Bhumika Goyal <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

NFSD: make cache_detail structures const

Make these const as they are only getting passed to the function
cache_create_net having the argument as const.

Signed-off-by: Bhumika Goyal <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

sunrpc: make the function arg as const

Make the struct cache_detail *tmpl argument of the function
cache_create_net as const as it is only getting passed to kmemup having
the argument as const void *.
Add const to the prototype too.

Signed-off-by: Bhumika Goyal <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: check for use of the closed special stateid

Prevent the use of the closed (invalid) special stateid by clients.

Signed-off-by: Andrew Elble <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat

From kernel 4.9, my two nfsv4 servers sometimes suffer from
"panic: unable to handle kernel page request"
in posix_unblock_lock() called from nfs4_laundromat().

These panics diseappear if we revert the commit "nfsd: add a LRU list
for blocked locks".

The cause appears to be a typo in nfs4_laundromat(), which is also
present in nfs4_state_shutdown_net().

Cc: [email protected]
Fixes: 7919d0a27f1e "nfsd: add a LRU list for blocked locks"
Cc: [email protected]
Reveiwed-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

lockd: lost rollback of set_grace_period() in lockd_down_net()

Commit efda760fe95ea ("lockd: fix lockd shutdown race") is incorrect,
it removes lockd_manager and disarm grace_period_end for init_net only.

If nfsd was started from another net namespace lockd_up_net() calls
set_grace_period() that adds lockd_manager into per-netns list
and queues grace_period_end delayed work.

These action should be reverted in lockd_down_net().
Otherwise it can lead to double list_add on after restart nfsd in netns,
and to use-after-free if non-disarmed delayed work will be executed after netns destroy.

Fixes: efda760fe95e ("lockd: fix lockd shutdown race")
Cc: [email protected]
Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

lockd: added cleanup checks in exit_net hook

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

grace: replace BUG_ON by WARN_ONCE in exit_net hook

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class

The use of the st_mutex has been confusing the validator. Use the
proper nested notation so as to not produce warnings.

Signed-off-by: Andrew Elble <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

lockd: remove net pointer from messages

Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages

[  171.757678] lockd_up_net: per-net data created; net=f00001e7
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)
[  300.653313] lockd: nuking all hosts in net f00001e7...
[  300.653641] lockd: host garbage collection for net f00001e7
[  300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[  300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[  300.711847] lockd: nuking all hosts in net 0...
[  300.711847] lockd: host garbage collection for net 0
[  300.711848] lockd: nlmsvc_mark_resources for net 0

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: remove net pointer from debug messages

Publishing of net pointer is not safe,
replace it in debug meesages by net->ns.inum

[  119.989161] nfsd: initializing export module (net: f00001e7).
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)
[  322.185240] nfsd: shutting down export module (net: f00001e7).
[  322.186062] nfsd: export shutdown complete (net: f00001e7).

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Fix races with check_stateid_generation()

The various functions that call check_stateid_generation() in order
to compare a client-supplied stateid with the nfs4_stid state, usually
need to atomically check for closed state. Those that perform the
check after locking the st_mutex using nfsd4_lock_ol_stateid()
should now be OK, but we do want to fix up the others.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Ensure we check stateid validity in the seqid operation checks

After taking the stateid st_mutex, we want to know that the stateid
still represents valid state before performing any non-idempotent
actions.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Fix race in lock stateid creation

If we're looking up a new lock state, and the creation fails, then
we want to unhash it, just like we do for OPEN. However in order
to do so, we need to that no other LOCK requests can grab the
mutex until we have unhashed it (and marked it as closed).

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd4: move find_lock_stateid

Trivial cleanup to simplify following patch.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Ensure we don't recognise lock stateids after freeing them

In order to deal with lookup races, nfsd4_free_lock_stateid() needs
to be able to signal to other stateful functions that the lock stateid
is no longer valid. Right now, nfsd_lock() will check whether or not an
existing stateid is still hashed, but only in the "new lock" path.

To ensure the stateid invalidation is also recognised by the "existing lock"
path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
change the type to NFS4_CLOSED_STID under the stp->st_mutex.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: CLOSE SHOULD return the invalid special stateid for NFSv4.x (x>0)

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Fix another OPEN stateid race

If nfsd4_process_open2() is initialising a new stateid, and yet the
call to nfs4_get_vfs_file() fails for some reason, then we must
declare the stateid closed, and unhash it before dropping the mutex.

Right now, we unhash the stateid after dropping the mutex, and without
changing the stateid type, meaning that another OPEN could theoretically
look it up and attempt to use it.

Reported-by: Andrew W Elble <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected]
Signed-off-by: J. Bruce Fields <[email protected]>

nfsd: Fix stateid races between OPEN and CLOSE

Open file stateids can linger on the nfs4_file list of stateids even
after they have been closed. In order to avoid reusing such a
stateid, and confusing the client, we need to recheck the
nfs4_stid's type after taking the mutex.
Otherwise, we risk reusing an old stateid that was already closed,
which will confuse clients that expect new stateids to conform to
RFC7530 Sections 9.1.4.2 and 16.2.5 or RFC5661 Sections 8.2.2 and 18.2.4.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected]
Signed-off-by: J. Bruce Fields <[email protected]>

bpf: offload: add a license header

I forgot to add a license on kernel/bpf/offload.c. Luckily I'm
still the only author so make it explicitly GPLv2.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

auxdisplay: img-ascii-lcd: Only build on archs that have IOMEM

This avoids the MODPOST error:

ERROR: "devm_ioremap_resource" [drivers/auxdisplay/img-ascii-lcd.ko] undefined!

Signed-off-by: Thomas Meyer <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: Do not make pmd/pud dirty without a reason

Currently we make page table entries dirty all the time regardless of
access type and don't even consider if the mapping is write-protected.
The reasoning is that we don't really need dirty tracking on THP and
making the entry dirty upfront may save some time on first write to the
page.

Unfortunately, such approach may result in false-positive
can_follow_write_pmd() for huge zero page or read-only shmem file.

Let's only make page dirty only if we about to write to the page anyway
(as we do for small pages).

I've restructured the code to make entry dirty inside
maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
write-protected.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

Currently, we unconditionally make page table dirty in touch_pmd().
It may result in false-positive can_follow_write_pmd().

We may avoid the situation, if we would only make the page table entry
dirty if caller asks for write access -- FOLL_WRITE.

The patch also changes touch_pud() in the same way.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tipc: eliminate access after delete in group_filter_msg()

KASAN revealed another access after delete in group.c. This time
it found that we read the header of a received message after the
buffer has been released.

Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

xen-netfront: remove warning when unloading module

v2:
* Replace busy wait with wait_event()/wake_up_all()
* Cannot garantee that at the time xennet_remove is called, the
   xen_netback state will not be XenbusStateClosed, so added a
   condition for that
* There's a small chance for the xen_netback state is
   XenbusStateUnknown by the time the xen_netfront switches to Closed,
   so added a condition for that.

When unloading module xen_netfront from guest, dmesg would output
warning messages like below:

  [  105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
  [  105.236839] deferring g.e. 0x903 (pfn 0x35805)

This problem relies on netfront and netback being out of sync. By the time
netfront revokes the g.e.'s netback didn't have enough time to free all of
them, hence displaying the warnings on dmesg.

The trick here is to make netfront to wait until netback frees all the g.e.'s
and only then continue to cleanup for the module removal, and this is done by
manipulating both device states.

Signed-off-by: Eduardo Otubo <[email protected]>
Acked-by: Juergen Gross <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

blktrace: fix trace mutex deadlock

A previous commit changed the locking around registration/cleanup,
but direct callers of blk_trace_remove() were missed. This means
that if we hit the error path in setup, we will deadlock on
attempting to re-acquire the queue trace mutex.

Fixes: 1f2cac107c59 ("blktrace: fix unlocked access to init/start-stop/teardown")
Signed-off-by: Jens Axboe <[email protected]>

i2c: i2c-boardinfo: fix memory leaks on devinfo

Currently when an error occurs devinfo is still allocated but is
unused when the error exit paths break out of the for-loop. Fix
this by kfree'ing devinfo to avoid the leak.

Detected by CoverityScan, CID#1416590 ("Resource Leak")

Fixes: 4124c4eba402 ("i2c: allow attaching IRQ resources to i2c_board_info")
Fixes: 0daaf99d8424 ("i2c: copy device properties when using i2c_register_board_info()")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: i801: Fix Failed to allocate irq -2147483648 error

On Apollo Lake devices the BIOS does not set up IRQ routing for the i801
SMBUS controller IRQ, so we end up with dev->irq set to IRQ_NOTCONNECTED.

Detect this and do not try to use the irq in this case silencing:
i801_smbus 0000:00:1f.1: Failed to allocate irq -2147483648: -107

Cc: [email protected]
BugLink: https://communities.intel.com/thread/114759
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

xfs: log recovery should replay deferred ops in order

As part of testing log recovery with dm_log_writes, Amir Goldstein
discovered an error in the deferred ops recovery that lead to corruption
of the filesystem metadata if a reflink+rmap filesystem happened to shut
down midway through a CoW remap:

"This is what happens [after failed log recovery]:

"Phase 1 - find and verify superblock...
"Phase 2 - using internal log
"        - zero log...
"        - scan filesystem freespace and inode maps...
"        - found root inode chunk
"Phase 3 - for each AG...
"        - scan (but don't clear) agi unlinked lists...
"        - process known inodes and perform inode discovery...
"        - agno = 0
"data fork in regular inode 134 claims CoW block 376
"correcting nextents for inode 134
"bad data fork in inode 134
"would have cleared inode 134"

Hou Tao dissected the log contents of exactly such a crash:

"According to the implementation of xfs_defer_finish(), these ops should
be completed in the following sequence:

"Have been done:
"(1) CUI: Oper (160)
"(2) BUI: Oper (161)
"(3) CUD: Oper (194), for CUI Oper (160)
"(4) RUI A: Oper (197), free rmap [0x155, 2, -9]

"Should be done:
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI A
"(8) RUD: for RUI B

"Actually be done by xlog_recover_process_intents()
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI B
"(8) RUD: for RUI A

"So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
from the log record in post_mount.log (generated after umount) and the trace
print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap
entry [0x155, 2, -9] are freed."

When reconstructing the internal log state from the log items found on
disk, it's required that deferred ops replay in exactly the same order
that they would have had the filesystem not gone down.  However,
replaying unfinished deferred ops can create /more/ deferred ops.  These
new deferred ops are finished in the wrong order.  This causes fs
corruption and replay crashes, so let's create a single defer_ops to
handle the subsequent ops created during replay, then use one single
transaction at the end of log recovery to ensure that everything is
replayed in the same order as they're supposed to be.

Reported-by: Amir Goldstein <[email protected]>
Analyzed-by: Hou Tao <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Amir Goldstein <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: always free inline data before resetting inode fork during ifree

In xfs_ifree, we reset the data/attr forks to extents format without
bothering to free any inline data buffer that might still be around
after all the blocks have been truncated off the file. Prior to commit
43518812d2 ("xfs: remove support for inlining data/extents into the
inode fork") nobody noticed because the leftover inline data after
truncation was small enough to fit inside the inline buffer inside the
fork itself.

However, now that we've removed the inline buffer, we /always/ have to
free the inline data buffer or else we leak them like crazy. This test
was found by turning on kmemleak for generic/001 or generic/388.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>

Merge tag 'kvm-ppc-fixes-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fixes for 4.15

One commit here, that fixes a couple of bugs relating to the patch
series that enables HPT guests to run on a radix host on POWER9
systems. This patch series went upstream in the 4.15 merge window,
so no stable backport is required.

KVM: Let KVM_SET_SIGNAL_MASK work as advertised

KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".

This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.

Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.

Signed-off-by: Jan H. SchÃ¶nherr <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Btrfs: fix list_add corruption and soft lockups in fsync

Xfstests btrfs/146 revealed this corruption,

[   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[   58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[   58.153518] ------------[ cut here ]------------
[   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[   58.161956] Call Trace:
[   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[   58.164393]  vfs_fsync_range+0x5f/0xd0
[   58.164898]  do_fsync+0x5a/0x90
[   58.165170]  SyS_fsync+0x10/0x20
[   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
...

It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e.  btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'.  Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.

This returns the correct error in the above case.

Jeff also reported this while testing against his fsync error
patch set[1].

[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"

Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

KVM: VMX: Fix vmx->nested freeing when no SMI handler

Reported by syzkaller:

   ------------[ cut here ]------------
   WARNING: CPU: 5 PID: 2939 at arch/x86/kvm/vmx.c:3844 free_loaded_vmcs+0x77/0x80 [kvm_intel]
   CPU: 5 PID: 2939 Comm: repro Not tainted 4.14.0+ #26
   RIP: 0010:free_loaded_vmcs+0x77/0x80 [kvm_intel]
   Call Trace:
    vmx_free_vcpu+0xda/0x130 [kvm_intel]
    kvm_arch_destroy_vm+0x192/0x290 [kvm]
    kvm_put_kvm+0x262/0x560 [kvm]
    kvm_vm_release+0x2c/0x30 [kvm]
    __fput+0x190/0x370
    task_work_run+0xa1/0xd0
    do_exit+0x4d2/0x13e0
    do_group_exit+0x89/0x140
    get_signal+0x318/0xb80
    do_signal+0x8c/0xb40
    exit_to_usermode_loop+0xe4/0x140
    syscall_return_slowpath+0x206/0x230
    entry_SYSCALL_64_fastpath+0x98/0x9a

The syzkaller testcase will execute VMXON/VMLAUCH instructions, so the
vmx->nested stuff is populated, it will also issue KVM_SMI ioctl. However,
the testcase is just a simple c program and not be lauched by something
like seabios which implements smi_handler. Commit 05cade71cf (KVM: nSVM:
fix SMI injection in guest mode) gets out of guest mode and set nested.vmxon
to false for the duration of SMM according to SDM 34.14.1 "leave VMX
operation" upon entering SMM. We can't alloc/free the vmx->nested stuff
each time when entering/exiting SMM since it will induce more overhead. So
the function vmx_pre_enter_smm() marks nested.vmxon false even if vmx->nested
stuff is still populated. What it expected is em_rsm() can mark nested.vmxon
to be true again. However, the smi_handler/rsm will not execute since there
is no something like seabios in this scenario. The function free_nested()
fails to free the vmx->nested stuff since the vmx->nested.vmxon is false
which results in the above warning.

This patch fixes it by also considering the no SMI handler case, luckily
vmx->nested.smm.vmxon is marked according to the value of vmx->nested.vmxon
in vmx_pre_enter_smm(), we can take advantage of it and free vmx->nested
stuff when L1 goes down.

Reported-by: Dmitry Vyukov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Fixes: 05cade71cf (KVM: nSVM: fix SMI injection in guest mode)
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Fix rflags cache during vCPU reset

Reported by syzkaller:

   *** Guest State ***
   CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
   CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1
   CR3 = 0x000000002081e000
   RSP = 0x000000000000fffa  RIP = 0x0000000000000000
   RFLAGS=0x00023000         DR7 = 0x00000000000000
          ^^^^^^^^^^
   ------------[ cut here ]------------
   WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
   CPU: 6 PID: 24431 Comm: reprotest Tainted: G        W  OE   4.14.0+ #26
   RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
   RSP: 0018:ffff880291d179e0 EFLAGS: 00010202
   Call Trace:
    kvm_vcpu_ioctl+0x479/0x880 [kvm]
    do_vfs_ioctl+0x142/0x9a0
    SyS_ioctl+0x74/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

The failed vmentry is triggered by the following beautified testcase:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>

    long r[5];
    int main()
    {
        struct kvm_debugregs dr = { 0 };

        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
        struct kvm_guest_debug debug = {
                .control = 0xf0403,
                .arch = {
                        .debugreg[6] = 0x2,
                        .debugreg[7] = 0x2
                }
        };
        ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug);
        ioctl(r[4], KVM_RUN, 0);
    }

which testcase tries to setup the processor specific debug
registers and configure vCPU for handling guest debug events through
KVM_SET_GUEST_DEBUG.  The KVM_SET_GUEST_DEBUG ioctl will get and set
rflags in order to set TF bit if single step is needed. All regs' caches
are reset to avail and GUEST_RFLAGS vmcs field is reset to 0x2 during vCPU
reset. However, the cache of rflags is not reset during vCPU reset. The
function vmx_get_rflags() returns an unreset rflags cache value since
the cache is marked avail, it is 0 after boot. Vmentry fails if the
rflags reserved bit 1 is 0.

This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and
its cache to 0x2 during vCPU reset.

Reported-by: Dmitry Vyukov <[email protected]>
Tested-by: Dmitry Vyukov <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: X86: Fix softlockup when get the current kvmclock

watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185]
CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G           OE   4.14.0-rc4+ #4
RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm]
Call Trace:
  get_time_ref_counter+0x5a/0x80 [kvm]
  kvm_hv_process_stimers+0x120/0x5f0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm]
  kvm_vcpu_ioctl+0x33a/0x620 [kvm]
  do_vfs_ioctl+0xa1/0x5d0
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x1e/0xa9

This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and
cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0
(set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results
in kvm_get_time_scale() gets into an infinite loop.

This patch fixes it by treating the unhotplug pCPU as not using master clock.

Reviewed-by: Radim Krčmář <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: lapic: Fixup LDR on load in x2apic

In x2apic mode the LDR is fixed based on the ID rather
than separately loadable like it was before x2.
When kvm_apic_set_state is called, the base is set, and if
it has the X2APIC_ENABLE flag set then the LDR is calculated;
however that value gets overwritten by the memcpy a few lines
below overwriting it with the value that came from userland.

The symptom is a lack of EOI after loading the state
(e.g. after a QEMU migration) and is due to the EOI bitmap
being wrong due to the incorrect LDR. This was seen with
a Win2016 guest under Qemu with irqchip=split whose USB mouse
didn't work after a VM migration.

This corresponds to RH bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1502591

Reported-by: Yiqian Wei <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Cc: [email protected]
[Applied fixup from Liran Alon. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: lapic: Split out x2apic ldr calculation

Split out the ldr calculation from kvm_apic_set_x2apic_id
since we're about to reuse it in the following patch.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

reiserfs: remove unneeded i_version bump

The i_version field in reiserfs is not initialized and is only ever
updated here. Nothing ever views it, so just remove it.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

Merge tag 'mac80211-for-davem-2017-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Four fixes:
* CRYPTO_SHA256 is needed for regdb validation
* mac80211: mesh path metric was wrong in some frames
* mac80211: use QoS null-data packets on QoS connections
* mac80211: tear down RX aggregation sessions first to
drop fewer packets in HW restart scenarios
====================

Signed-off-by: David S. Miller <[email protected]>

btrfs: Fix wild memory access in compression level parser

[BUG]
Kernel panic when mounting with "-o compress" mount option.
KASAN will report like:
------
==================================================================
BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
Read of size 1 at addr d86735fce994f800 by task mount/662
...
Call Trace:
dump_stack+0xe3/0x175
kasan_report+0x163/0x370
__asan_load1+0x47/0x50
strncmp+0x31/0xc0
btrfs_compress_str2level+0x20/0x70 [btrfs]
btrfs_parse_options+0xff4/0x1870 [btrfs]
open_ctree+0x2679/0x49f0 [btrfs]
btrfs_mount+0x1b7f/0x1d30 [btrfs]
mount_fs+0x49/0x190
vfs_kern_mount.part.29+0xba/0x280
vfs_kern_mount+0x13/0x20
btrfs_mount+0x31e/0x1d30 [btrfs]
mount_fs+0x49/0x190
vfs_kern_mount.part.29+0xba/0x280
do_mount+0xaad/0x1a00
SyS_mount+0x98/0xe0
entry_SYSCALL_64_fastpath+0x1f/0xbe
------

[Cause]
For 'compress' and 'compress_force' options, its token doesn't expect
any parameter so its args[0] contains uninitialized data.
Accessing args[0] will cause above wild memory access.

[Fix]
For Opt_compress and Opt_compress_force, set compression level to
the default.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ set the default in advance ]
Signed-off-by: David Sterba <[email protected]>

RISC-V: Add VDSO entries for clock_get/gettimeofday/getcpu

For now these are just placeholders that execute the syscall. We will
later optimize them to avoid kernel crossings, but we'd like to have the
VDSO entries from the first released kernel version to make the ABI
simpler.

Signed-off-by: Andrew Waterman <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Remove __vdso_cmpxchg{32,64} symbol versions

These were left over from an earlier version of the port.

Signed-off-by: Palmer Dabbelt <[email protected]>