Xin Long [Sun, 26 Nov 2017 13:19:05 +0000 (21:19 +0800)]
vxlan: use __be32 type for the param vni in __vxlan_fdb_delete
All callers of __vxlan_fdb_delete pass vni with __be32 type, and
this param should be declared as __be32 type.
Fixes: 3ad7a4b141eb ("vxlan: support fdb and learning in COLLECT_METADATA mode") Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Xin Long [Sun, 26 Nov 2017 13:12:09 +0000 (21:12 +0800)]
bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM
bond_opt_initval expects a u64 type param, it's better to use
nla_get_u64 to extract the value here, to eliminate a sparse
endianness mismatch warning.
Fixes: 171a42c38c6e ("bonding: add netlink support for sys prio, actor sys mac, and port key") Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Xin Long [Sun, 26 Nov 2017 12:56:07 +0000 (20:56 +0800)]
sctp: use right member as the param of list_for_each_entry
Commit d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues
when migrating a sock") made a mistake that using 'list' as the param of
list_for_each_entry to traverse the retransmit, sacked and abandoned
queues, while chunks are using 'transmitted_list' to link into these
queues.
It could cause NULL dereference panic if there are chunks in any of these
queues when peeling off one asoc.
So use the chunk member 'transmitted_list' instead in this patch.
Fixes: d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues when migrating a sock") Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Acked-by: Neil Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Paolo Abeni [Tue, 28 Nov 2017 13:28:39 +0000 (14:28 +0100)]
sch_sfq: fix null pointer dereference at timer expiration
While converting sch_sfq to use timer_setup(), the commit cdeabbb88134
("net: sched: Convert timers to use timer_setup()") forgot to
initialize the 'sch' field. As a result, the timer callback tries to
dereference a NULL pointer, and the kernel does oops.
Fix it initializing such field at qdisc creation time.
Jakub Kicinski [Mon, 27 Nov 2017 19:11:41 +0000 (11:11 -0800)]
cls_bpf: don't decrement net's refcount when offload fails
When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point. This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().
__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out. The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.
Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.
Ulf Hansson [Mon, 27 Nov 2017 10:28:50 +0000 (11:28 +0100)]
mmc: sdhci: Avoid swiotlb buffer being full
The commit de3ee99b097d ("mmc: Delete bounce buffer handling") deletes the
bounce buffer handling, but also causes the max_req_size for sdhci to be
increased, in case when max_segs == 1. This causes errors for sdhci-pci
Ricoh variant, about the swiotlb buffer to become full.
Fix the issue, by taking IO_TLB_SEGSIZE and IO_TLB_SHIFT into account when
deciding the max_req_size for sdhci.
Linus Torvalds [Tue, 28 Nov 2017 18:01:15 +0000 (10:01 -0800)]
Merge tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
- TTM regression fix for some virt gpus (bochs vga)
- a few i915 stable fixes
- one vc4 fix
- one uapi fix
* tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/ttm: don't attempt to use hugepages if dma32 requested (v2)
drm/vblank: Pass crtc_id to page_flip_ioctl.
drm/i915: Fix init_clock_gating for resume
drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
drm/i915: Clear breadcrumb node when cancelling signaling
drm/i915/gvt: ensure -ve return value is handled correctly
drm/i915: Re-register PMIC bus access notifier on runtime resume
drm/i915: Fix false-positive assert_rpm_wakelock_held in i915_pmic_bus_access_notifier v2
drm/edid: Don't send non-zero YQ in AVI infoframe for HDMI 1.x sinks
drm/vc4: Account for interrupts in flight
Takashi Iwai [Mon, 27 Nov 2017 09:59:40 +0000 (10:59 +0100)]
Revert "ALSA: usb-audio: Fix potential zero-division at parsing FU"
The commit 8428a8ebde2d ("ALSA: usb-audio: Fix potential zero-division
at parsing FU") is utterly bogus and breaks the case with csize=1
instead of fixing anything. Just take it back again.
Eric Sandeen [Tue, 28 Nov 2017 02:23:33 +0000 (18:23 -0800)]
xfs: calculate correct offset in xfs_scrub_quota_item
It's only used for tracepoints so it's relatively harmless,
but the offset is calculated incorrectly in xfs_scrub_quota_item.
qi_dqperchunk is the nr. of dquots per "chunk" which we have
conveniently *cough* defined to always be 1 FSB. Therefore
block_offset * qi_dqperchunk == first id in that chunk,
and so offset = id / qi_dqperchunk
Michal Hocko [Thu, 23 Nov 2017 16:13:40 +0000 (17:13 +0100)]
xfs: fortify xfs_alloc_buftarg error handling
percpu_counter_init failure path doesn't clean up &btp->bt_lru list.
Call list_lru_destroy in that error path. Similarly register_shrinker
error path is not handled.
While it is unlikely to trigger these error path, it is not impossible
especially the later might fail with large NUMAs. Let's handle the
failure to make the code more robust.
Minwoo Im [Fri, 24 Nov 2017 18:03:00 +0000 (03:03 +0900)]
nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
Following condition which will cause NULL pointer dereference will
occur in nvme_free_host_mem() when it tries to remove pci device via
nvme_remove() especially after a failure of host memory allocation for HMB.
Max Gurtovoy [Tue, 28 Nov 2017 16:28:44 +0000 (18:28 +0200)]
nvme-rdma: fix memory leak during queue allocation
In case nvme_rdma_wait_for_cm timeout expires before we get
an established or rejected event (rdma_connect succeeded) from
rdma_cm, we end up with leaking the ib transport resources for
dedicated queue. This scenario can easily reproduced using traffic
test during port toggling.
Also, in order to protect from parallel ib queue destruction, that
may be invoked from different context's, introduce new flag that
stands for transport readiness. While we're here, protect also against
a situation that we can receive rdma_cm events during ib queue destruction.
s390/gs: add compat regset for the guarded storage broadcast control block
git commit e525f8a6e696210d15f8b8277d4da12fc4add299
"s390/gs: add regset for the guarded storage broadcast control block"
added the missing regset to the s390_regsets array but failed to add it
to the s390_compat_regsets array.
Fixes: e525f8a6e696 ("add compat regset for the guarded storage broadcast control block") Signed-off-by: Martin Schwidefsky <[email protected]>
Filipe Manana [Fri, 17 Nov 2017 01:54:00 +0000 (01:54 +0000)]
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
Eric Dumazet [Tue, 28 Nov 2017 16:03:30 +0000 (08:03 -0800)]
net/packet: fix a race in packet_bind() and packet_notifier()
syzbot reported crashes [1] and provided a C repro easing bug hunting.
When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.
This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.
Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.
Mike Maloney [Tue, 28 Nov 2017 15:44:29 +0000 (10:44 -0500)]
packet: fix crash in fanout_demux_rollover()
syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.
po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release(). It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.
To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors. This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).
David S. Miller [Tue, 28 Nov 2017 16:00:14 +0000 (11:00 -0500)]
Merge branch 'sctp-fix-sparse-errors'
Xin Long says:
====================
sctp: fix some other sparse errors
After the last fixes for sparse errors, there are still three sparse
errors in sctp codes, two of them are type cast, and the other one
is using extern.
====================
Xin Long [Sun, 26 Nov 2017 12:16:08 +0000 (20:16 +0800)]
sctp: remove extern from stream sched
Now each stream sched ops is defined in different .c file and
added into the global ops in another .c file, it uses extern
to make this work.
However extern is not good coding style to get them in and
even make C=2 reports errors for this.
This patch adds sctp_sched_ops_xxx_init for each stream sched
ops in their .c file, then get them into the global ops by
calling them when initializing sctp module.
Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler") Fixes: ac1ed8b82cd6 ("sctp: introduce round robin stream scheduler") Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Xin Long [Sun, 26 Nov 2017 12:16:07 +0000 (20:16 +0800)]
sctp: force the params with right types for sctp csum apis
Now sctp_csum_xxx doesn't really match the param types of these common
csum apis. As sctp_csum_xxx is defined in sctp/checksum.h, many sparse
errors occur when make C=2 not only with M=net/sctp but also with other
modules that include this header file.
This patch is to force them fit in csum apis with the right types.
Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code") Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vasyl Gomonovych [Wed, 22 Nov 2017 15:29:57 +0000 (16:29 +0100)]
lmc: Use memdup_user() as a cleanup
Fix coccicheck warning which recommends to use memdup_user():
drivers/net/wan/lmc/lmc_main.c:497:27-34: WARNING opportunity for memdup_user
Generated by: scripts/coccinelle/memdup_user/memdup_user.cocci
bnxt_en: Fix an error handling path in 'bnxt_get_module_eeprom()'
Error code returned by 'bnxt_read_sfp_module_eeprom_info()' is handled a
few lines above when reading the A0 portion of the EEPROM.
The same should be done when reading the A2 portion of the EEPROM.
In order to correctly propagate an error, update 'rc' in this 2nd call as
well, otherwise 0 (success) is returned.
Antoine Tenart [Tue, 28 Nov 2017 13:26:30 +0000 (14:26 +0100)]
net: phy: marvell10g: fix the PHY id mask
The Marvell 10G PHY driver supports different hardware revisions, which
have their bits 3..0 differing. To get the correct revision number these
bits should be ignored. This patch fixes this by using the already
defined MARVELL_PHY_ID_MASK (0xfffffff0) instead of the custom
0xffffffff mask.
Fixes: 20b2af32ff3f ("net: phy: add Marvell Alaska X 88X3310 10Gigabit PHY support") Suggested-by: Yan Markman <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Tue, 28 Nov 2017 15:09:52 +0000 (10:09 -0500)]
Merge branch 'mvpp2-fixes'
Antoine Tenart says:
====================
net: mvpp2: set of fixes
This series fixes various issues with the Marvell PPv2 driver. The
patches are sent together to avoid any possible conflict. The series is
based on today's net tree.
====================
Antoine Tenart [Tue, 28 Nov 2017 13:19:51 +0000 (14:19 +0100)]
net: mvpp2: check ethtool sets the Tx ring size is to a valid min value
This patch fixes the Tx ring size checks when using ethtool, by adding
an extra check in the PPv2 check_ringparam_valid helper. The Tx ring
size cannot be set to a value smaller than the minimum number of
descriptors needed for TSO.
Fixes: 1d17db08c056 ("net: mvpp2: limit TSO segments and use stop/wake thresholds") Suggested-by: Yan Markman <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yan Markman [Tue, 28 Nov 2017 13:19:50 +0000 (14:19 +0100)]
net: mvpp2: do not disable GMAC padding
Short fragmented packets may never be sent by the hardware when padding
is disabled. This patch stop modifying the GMAC padding bits, to leave
them to their reset value (disabled).
Fixes: 3919357fb0bb ("net: mvpp2: initialize the GMAC when using a port") Signed-off-by: Yan Markman <[email protected]>
[Antoine: commit message] Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Antoine Tenart [Tue, 28 Nov 2017 13:19:49 +0000 (14:19 +0100)]
net: mvpp2: cleanup probed ports in the probe error path
This patches fixes the probe error path by cleaning up probed ports, to
avoid leaving registered net devices when the driver failed to probe.
Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit") Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Antoine Tenart [Tue, 28 Nov 2017 13:19:48 +0000 (14:19 +0100)]
net: mvpp2: fix the txq_init error path
When an allocation in the txq_init path fails, the allocated buffers
end-up being freed twice: in the txq_init error path, and in txq_deinit.
This lead to issues as txq_deinit would work on already freed memory
regions:
This patch fixes this by removing the txq_init own error path, as the
txq_deinit function is always called on errors. This was introduced by
TSO as way more buffers are allocated.
Fixes: 186cd4d4e414 ("net: mvpp2: software tso support") Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Chao Yu [Tue, 28 Nov 2017 15:01:44 +0000 (23:01 +0800)]
quota: propagate error from __dquot_initialize
In commit 6184fc0b8dd7 ("quota: Propagate error from ->acquire_dquot()"),
we have propagated error from __dquot_initialize to caller, but we forgot
to handle such error in add_dquot_ref(), so, currently, during quota
accounting information initialization flow, if we failed for some of
inodes, we just ignore such error, and do account for others, which is
not a good implementation.
In this patch, we choose to let user be aware of such error, so after
turning on quota successfully, we can make sure all inodes disk usage
can be accounted, which will be more reasonable.
David S. Miller [Tue, 28 Nov 2017 14:55:48 +0000 (09:55 -0500)]
Merge branch 'mlxsw-GRE-offloading-fixes'
Jiri Pirko says:
====================
mlxsw: GRE offloading fixes
Petr says:
This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw
driver.
Patch #1 fixes a problem that local routes pointing at a GRE tunnel
device are offloaded even if that netdevice is down.
Patch #2 detects that as a result of moving a GRE netdevice to a
different VRF, two tunnels now have a conflict of local addresses,
something that the mlxsw driver can't offload.
Patch #3 fixes a FIB abort caused by forming a route pointing at a
GRE tunnel that is eligible for offloading but already onloaded.
Patch #4 fixes a problem that next hops migrated to a new RIF kept the
old RIF reference, which went dangling shortly afterwards.
====================
Petr Machata [Tue, 28 Nov 2017 12:17:14 +0000 (13:17 +0100)]
mlxsw: spectrum_router: Update nexthop RIF on update
The function mlxsw_sp_nexthop_rif_update() walks the list of nexthops
associated with a RIF, and updates the corresponding entries in the
switch. It is used in particular when a tunnel underlay netdevice moves
to a different VRF, and all the nexthops are migrated over to a new RIF.
The problem is that each nexthop holds a reference to its RIF, and that
is not updated. So after the old RIF is gone, further activity on these
nexthops (such as downing the underlay netdevice) dereferences a
dangling pointer.
Fix the issue by updating rif of impacted nexthops before calling
mlxsw_sp_nexthop_rif_update().
Fixes: 0c5f1cd5ba8c ("mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()") Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Petr Machata [Tue, 28 Nov 2017 12:17:13 +0000 (13:17 +0100)]
mlxsw: spectrum_router: Handle encap to demoted tunnels
Some tunnels that are offloadable on their own can nonetheless be
demoted to slow path if their local address is in conflict with that of
another tunnel. When a route is formed for such a tunnel,
mlxsw_sp_nexthop_ipip_init() fails to find the corresponding IPIP entry,
and that triggers a FIB abort.
Resolve the problem by not assuming that a tunnel for which
mlxsw_sp_ipip_ops.can_offload() holds also automatically has an IPIP
entry.
Petr Machata [Tue, 28 Nov 2017 12:17:12 +0000 (13:17 +0100)]
mlxsw: spectrum_router: Demote tunnels on VRF migration
The mlxsw driver currently doesn't offload GRE tunnels if they have the
same local address and use the same underlay VRF. When such a situation
arises, the tunnels in conflict are demoted to slow path.
However, the current code only verifies this condition on tunnel
creation and tunnel change, not when a tunnel is moved to a different
VRF. When the tunnel has no bound device, underlay and overlay are the
same. Thus moving a tunnel moves the underlay as well, and that can
cause local address conflict.
So modify mlxsw_sp_netdevice_ipip_ol_vrf_event() to check if there are
any conflicting tunnels, and demote them if yes.
Petr Machata [Tue, 28 Nov 2017 12:17:11 +0000 (13:17 +0100)]
mlxsw: spectrum_router: Offload decap only for up tunnels
When a new local route is added, an IPIP entry is looked up to determine
whether the route should be offloaded as a tunnel decap or as a trap.
That decision should take into account whether the tunnel netdevice in
question is actually IFF_UP, and only install a decap offload if it is.
David S. Miller [Tue, 28 Nov 2017 14:52:04 +0000 (09:52 -0500)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2017-11-27
This series contains updates to e1000, e1000e and i40e.
Gustavo A. R. Silva fixes a sizeof() issue where we were taking the size of
the pointer (which is always the size of the pointer).
Sasha does a follow up fix to a previous fix for buffer overrun, to resolve
community feedback from David Laight and the use of magic numbers.
Amritha fixes the reporting of error codes for when adding a cloud filter
fails.
Ahmad Fatoum brushes the dust off the e1000 driver to fix a code comment
and debug message which was incorrect about what the code was really doing.
====================
[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.
This makes tree-checker catch uninitialized data as error, causing
such panic.
*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()
[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.
So we can still get early warning if we screw up item pointers, and
avoid false panic.
Xiaolin Zhang [Thu, 16 Nov 2017 08:54:23 +0000 (16:54 +0800)]
drm/i915/gvt: enabled pipe A default on creating vgpu
when i915 driver unloading, it will shutdown all CRTCs and
it will introudce kernel panic when conducting igt drv_module_reload
test case under guest environment (bug reported by XENGT-468) as below:
fred gao [Tue, 14 Nov 2017 09:09:35 +0000 (17:09 +0800)]
drm/i915/gvt: Move request alloc to dispatch_workload path only
Previously the performance is improved through the workload auditing
and shadowing ahead of vGPU scheduling, however, there is the case that
more requests are allocated in submit_context before the previous request
is added, the timeline will hold its seqno which is later.
This patch is to move the request alloc to dispatch_workload function,
where is the same place as request is added.
It will fix the issue of kernel BUG for (timeline->seqno != request->fence.seqno)
check when add_request.
Weinan Li [Tue, 21 Nov 2017 02:54:41 +0000 (10:54 +0800)]
drm/i915/gvt: remove skl_misc_ctl_write handler
With different settings of compressed data hash mode between VMs and host
may cause gpu issues.
Commit: 1999f108c ("drm/i915/gvt: Disable compression workaround for Gen9")
disable compression workaround of guest in gvt host to align with host.
Commit: 93564044f ("drm/i915: Switch over to the LLC/eLLC hotspot avoidance
hash mode for CCS") add compression workaround, then we can remove the
skl_misc_ctl_write hanlder.
Better solution should be always keeping same settings as host, and bypass
the write request from VMs, but it need to fetch data from host's
"Context".
Changbin Du [Mon, 13 Nov 2017 06:58:31 +0000 (14:58 +0800)]
drm/i915/gvt: Fix unsafe locking caused by spin_unlock_bh
The caller of shadow_context_status_change may disable irqs. So it is not
safe to use spin_unlock_bh in such context. Let's switch to irqsave version
for safety.
The alternative intel_backlight_device_register() definition apparently
never got used, but I have now run into a case of i915 being compiled
without CONFIG_BACKLIGHT_CLASS_DEVICE, resulting in a number of
identical warnings:
drivers/gpu/drm/i915/intel_drv.h:1739:12: error: 'intel_backlight_device_register' defined but not used [-Werror=unused-function]
This marks the function as 'inline', which was surely the original
intention here.
Chris Wilson [Sat, 25 Nov 2017 19:41:55 +0000 (19:41 +0000)]
drm/i915/fbdev: Serialise early hotplug events with async fbdev config
As both the hotplug event and fbdev configuration run asynchronously, it
is possible for them to run concurrently. If configuration fails, we were
freeing the fbdev causing a use-after-free in the hotplug event.
In order to keep the dev_priv->ifbdev alive after failure, we have to
avoid the free and leave it empty until we unload the module (which is
less than ideal, but a necessary evil for simplicity). Then we can use
intel_fbdev_sync() to serialise the hotplug event with the configuration.
The serialisation between the two was removed in commit 934458c2c95d
("Revert "drm/i915: Fix races on fbdev""), but the use after free is much
older, commit 366e39b4d2c5 ("drm/i915: Tear down fbdev if initialization
fails")
Ville Syrjälä [Thu, 23 Nov 2017 19:41:57 +0000 (21:41 +0200)]
drm/i915: Prevent zero length "index" write
The hardware always writes one or two bytes in the index portion of
an indexed transfer. Make sure the message we send as the index
doesn't have a zero length.
Ville Syrjälä [Thu, 23 Nov 2017 19:41:56 +0000 (21:41 +0200)]
drm/i915: Don't try indexed reads to alternate slave addresses
We can only specify the one slave address to indexed reads/writes.
Make sure the messages we check are destined to the same slave
address before deciding to do an indexed transfer.
Linus Torvalds [Tue, 28 Nov 2017 00:45:56 +0000 (16:45 -0800)]
proc: don't report kernel addresses in /proc/<pid>/stack
This just changes the file to report them as zero, although maybe even
that could be removed. I checked, and at least procps doesn't actually
seem to parse the 'stack' file at all.
And since the file doesn't necessarily even exist (it requires
CONFIG_STACKTRACE), possibly other tools don't really use it either.
That said, in case somebody parses it with tools, just having that zero
there should keep such tools happy.
John Johansen [Wed, 22 Nov 2017 15:33:38 +0000 (07:33 -0800)]
apparmor: fix oops in audit_signal_cb hook
The apparmor_audit_data struct ordering got messed up during a merge
conflict, resulting in the signal integer and peer pointer being in
a union instead of a struct.
For most of the 4.13 and 4.14 life cycle, this was hidden by
commit 651e28c5537a ("apparmor: add base infastructure for socket
mediation") which fixed the apparmor_audit_data struct when its data
was added. When that commit was reverted in -rc7 the signal audit bug
was exposed, and unfortunately it never showed up in any of the
testing until after 4.14 was released. Shaun Khan, Zephaniah
E. Loss-Cutler-Hull filed nearly simultaneous bug reports (with
different oopes, the smaller of which is included below).
Full credit goes to Tetsuo Handa for jumping on this as well and
noticing the audit data struct problem and reporting it.
Amritha Nambiar [Fri, 17 Nov 2017 23:35:57 +0000 (15:35 -0800)]
i40e: Fix reporting incorrect error codes
Adding cloud filters could fail for a number of reasons,
unsupported filter fields for example, which fails during
validation of fields itself. This will not result in admin
command errors and converting the admin queue status to posix
error code using i40e_aq_rc_to_posix would result in incorrect
error values. If the failure was due to AQ error itself,
reporting that correctly is handled in the inner function.
Sasha Neftin [Mon, 6 Nov 2017 06:31:59 +0000 (08:31 +0200)]
e1000e: fix the use of magic numbers for buffer overrun issue
This is a follow on to commit b10effb92e27 ("fix buffer overrun while the
I219 is processing DMA transactions") to address David Laights concerns
about the use of "magic" numbers. So define masks as well as add
additional code comments to give a better understanding of what needs to
be done to avoid a buffer overrun.
Vasily Averin [Mon, 13 Nov 2017 04:25:40 +0000 (07:25 +0300)]
lockd: fix "list_add double add" caused by legacy signal interface
restart_grace() uses hardcoded init_net.
It can cause to "list_add double add" in following scenario:
1) nfsd and lockd was started in several net namespaces
2) nfsd in init_net was stopped (lockd was not stopped because
it have users from another net namespaces)
3) lockd got signal, called restart_grace() -> set_grace_period()
and enabled lock_manager in hardcoded init_net.
4) nfsd in init_net is started again,
its lockd_up() calls set_grace_period() and tries to add
lock_manager into init_net 2nd time.
Jeff Layton suggest:
"Make it safe to call locks_start_grace multiple times on the same
lock_manager. If it's already on the global grace_list, then don't try
to add it again. (But we don't intentionally add twice, so for now we
WARN about that case.)
With this change, we also need to ensure that the nfsd4 lock manager
initializes the list before we call locks_start_grace. While we're at
it, move the rest of the nfsd_net initialization into
nfs4_state_create_net. I see no reason to have it spread over two
functions like it is today."
Suggested patch was updated to generate warning in described situation.
Vasily Averin [Fri, 10 Nov 2017 07:19:35 +0000 (10:19 +0300)]
race of nfsd inetaddr notifiers vs nn->nfsd_serv change
nfsd_inet[6]addr_event uses nn->nfsd_serv without taking nfsd_mutex,
which can be changed during execution of notifiers and crash the host.
Moreover if notifiers were enabled in one net namespace they are enabled
in all other net namespaces, from creation until destruction.
This patch allows notifiers to access nn->nfsd_serv only after the
pointer is correctly initialized and delays cleanup until notifiers are
no longer in use.
Bhumika Goyal [Tue, 17 Oct 2017 16:14:23 +0000 (18:14 +0200)]
sunrpc: make the function arg as const
Make the struct cache_detail *tmpl argument of the function
cache_create_net as const as it is only getting passed to kmemup having
the argument as const void *.
Add const to the prototype too.
Naofumi Honda [Thu, 9 Nov 2017 15:57:16 +0000 (10:57 -0500)]
nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
From kernel 4.9, my two nfsv4 servers sometimes suffer from
"panic: unable to handle kernel page request"
in posix_unblock_lock() called from nfs4_laundromat().
These panics diseappear if we revert the commit "nfsd: add a LRU list
for blocked locks".
The cause appears to be a typo in nfs4_laundromat(), which is also
present in nfs4_state_shutdown_net().
Vasily Averin [Thu, 2 Nov 2017 10:03:42 +0000 (13:03 +0300)]
lockd: lost rollback of set_grace_period() in lockd_down_net()
Commit efda760fe95ea ("lockd: fix lockd shutdown race") is incorrect,
it removes lockd_manager and disarm grace_period_end for init_net only.
If nfsd was started from another net namespace lockd_up_net() calls
set_grace_period() that adds lockd_manager into per-netns list
and queues grace_period_end delayed work.
These action should be reverted in lockd_down_net().
Otherwise it can lead to double list_add on after restart nfsd in netns,
and to use-after-free if non-disarmed delayed work will be executed after netns destroy.
Vasily Averin [Wed, 8 Nov 2017 05:55:55 +0000 (08:55 +0300)]
lockd: remove net pointer from messages
Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages
[ 171.757678] lockd_up_net: per-net data created; net=f00001e7
[ 171.767188] NFSD: starting 90-second grace period (net f00001e7)
[ 300.653313] lockd: nuking all hosts in net f00001e7...
[ 300.653641] lockd: host garbage collection for net f00001e7
[ 300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[ 300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[ 300.711847] lockd: nuking all hosts in net 0...
[ 300.711847] lockd: host garbage collection for net 0
[ 300.711848] lockd: nlmsvc_mark_resources for net 0
Trond Myklebust [Fri, 3 Nov 2017 12:00:16 +0000 (08:00 -0400)]
nfsd: Fix races with check_stateid_generation()
The various functions that call check_stateid_generation() in order
to compare a client-supplied stateid with the nfs4_stid state, usually
need to atomically check for closed state. Those that perform the
check after locking the st_mutex using nfsd4_lock_ol_stateid()
should now be OK, but we do want to fix up the others.
Trond Myklebust [Fri, 3 Nov 2017 12:00:14 +0000 (08:00 -0400)]
nfsd: Fix race in lock stateid creation
If we're looking up a new lock state, and the creation fails, then
we want to unhash it, just like we do for OPEN. However in order
to do so, we need to that no other LOCK requests can grab the
mutex until we have unhashed it (and marked it as closed).
Trond Myklebust [Fri, 3 Nov 2017 12:00:13 +0000 (08:00 -0400)]
nfsd: Ensure we don't recognise lock stateids after freeing them
In order to deal with lookup races, nfsd4_free_lock_stateid() needs
to be able to signal to other stateful functions that the lock stateid
is no longer valid. Right now, nfsd_lock() will check whether or not an
existing stateid is still hashed, but only in the "new lock" path.
To ensure the stateid invalidation is also recognised by the "existing lock"
path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
change the type to NFS4_CLOSED_STID under the stp->st_mutex.
Trond Myklebust [Fri, 3 Nov 2017 12:00:11 +0000 (08:00 -0400)]
nfsd: Fix another OPEN stateid race
If nfsd4_process_open2() is initialising a new stateid, and yet the
call to nfs4_get_vfs_file() fails for some reason, then we must
declare the stateid closed, and unhash it before dropping the mutex.
Right now, we unhash the stateid after dropping the mutex, and without
changing the stateid type, meaning that another OPEN could theoretically
look it up and attempt to use it.
Trond Myklebust [Fri, 3 Nov 2017 12:00:10 +0000 (08:00 -0400)]
nfsd: Fix stateid races between OPEN and CLOSE
Open file stateids can linger on the nfs4_file list of stateids even
after they have been closed. In order to avoid reusing such a
stateid, and confusing the client, we need to recheck the
nfs4_stid's type after taking the mutex.
Otherwise, we risk reusing an old stateid that was already closed,
which will confuse clients that expect new stateids to conform to
RFC7530 Sections 9.1.4.2 and 16.2.5 or RFC5661 Sections 8.2.2 and 18.2.4.
Linus Torvalds [Mon, 27 Nov 2017 21:05:09 +0000 (13:05 -0800)]
Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
mm, thp: Do not make pmd/pud dirty without a reason
Currently we make page table entries dirty all the time regardless of
access type and don't even consider if the mapping is write-protected.
The reasoning is that we don't really need dirty tracking on THP and
making the entry dirty upfront may save some time on first write to the
page.
Unfortunately, such approach may result in false-positive
can_follow_write_pmd() for huge zero page or read-only shmem file.
Let's only make page dirty only if we about to write to the page anyway
(as we do for small pages).
I've restructured the code to make entry dirty inside
maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
write-protected.
Jon Maloy [Mon, 27 Nov 2017 19:13:39 +0000 (20:13 +0100)]
tipc: eliminate access after delete in group_filter_msg()
KASAN revealed another access after delete in group.c. This time
it found that we read the header of a received message after the
buffer has been released.
Eduardo Otubo [Thu, 23 Nov 2017 14:18:35 +0000 (15:18 +0100)]
xen-netfront: remove warning when unloading module
v2:
* Replace busy wait with wait_event()/wake_up_all()
* Cannot garantee that at the time xennet_remove is called, the
xen_netback state will not be XenbusStateClosed, so added a
condition for that
* There's a small chance for the xen_netback state is
XenbusStateUnknown by the time the xen_netfront switches to Closed,
so added a condition for that.
When unloading module xen_netfront from guest, dmesg would output
warning messages like below:
[ 105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
[ 105.236839] deferring g.e. 0x903 (pfn 0x35805)
This problem relies on netfront and netback being out of sync. By the time
netfront revokes the g.e.'s netback didn't have enough time to free all of
them, hence displaying the warnings on dmesg.
The trick here is to make netfront to wait until netback frees all the g.e.'s
and only then continue to cleanup for the module removal, and this is done by
manipulating both device states.
Jens Axboe [Sun, 19 Nov 2017 18:52:55 +0000 (11:52 -0700)]
blktrace: fix trace mutex deadlock
A previous commit changed the locking around registration/cleanup,
but direct callers of blk_trace_remove() were missed. This means
that if we hit the error path in setup, we will deadlock on
attempting to re-acquire the queue trace mutex.
Colin Ian King [Wed, 22 Nov 2017 17:52:24 +0000 (17:52 +0000)]
i2c: i2c-boardinfo: fix memory leaks on devinfo
Currently when an error occurs devinfo is still allocated but is
unused when the error exit paths break out of the for-loop. Fix
this by kfree'ing devinfo to avoid the leak.
Detected by CoverityScan, CID#1416590 ("Resource Leak")
Fixes: 4124c4eba402 ("i2c: allow attaching IRQ resources to i2c_board_info") Fixes: 0daaf99d8424 ("i2c: copy device properties when using i2c_register_board_info()") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
Darrick J. Wong [Wed, 22 Nov 2017 04:53:02 +0000 (20:53 -0800)]
xfs: log recovery should replay deferred ops in order
As part of testing log recovery with dm_log_writes, Amir Goldstein
discovered an error in the deferred ops recovery that lead to corruption
of the filesystem metadata if a reflink+rmap filesystem happened to shut
down midway through a CoW remap:
"This is what happens [after failed log recovery]:
"Phase 1 - find and verify superblock...
"Phase 2 - using internal log
" - zero log...
" - scan filesystem freespace and inode maps...
" - found root inode chunk
"Phase 3 - for each AG...
" - scan (but don't clear) agi unlinked lists...
" - process known inodes and perform inode discovery...
" - agno = 0
"data fork in regular inode 134 claims CoW block 376
"correcting nextents for inode 134
"bad data fork in inode 134
"would have cleared inode 134"
Hou Tao dissected the log contents of exactly such a crash:
"According to the implementation of xfs_defer_finish(), these ops should
be completed in the following sequence:
"Have been done:
"(1) CUI: Oper (160)
"(2) BUI: Oper (161)
"(3) CUD: Oper (194), for CUI Oper (160)
"(4) RUI A: Oper (197), free rmap [0x155, 2, -9]
"Should be done:
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI A
"(8) RUD: for RUI B
"Actually be done by xlog_recover_process_intents()
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI B
"(8) RUD: for RUI A
"So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
from the log record in post_mount.log (generated after umount) and the trace
print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap
entry [0x155, 2, -9] are freed."
When reconstructing the internal log state from the log items found on
disk, it's required that deferred ops replay in exactly the same order
that they would have had the filesystem not gone down. However,
replaying unfinished deferred ops can create /more/ deferred ops. These
new deferred ops are finished in the wrong order. This causes fs
corruption and replay crashes, so let's create a single defer_ops to
handle the subsequent ops created during replay, then use one single
transaction at the end of log recovery to ensure that everything is
replayed in the same order as they're supposed to be.
Darrick J. Wong [Wed, 22 Nov 2017 20:21:07 +0000 (12:21 -0800)]
xfs: always free inline data before resetting inode fork during ifree
In xfs_ifree, we reset the data/attr forks to extents format without
bothering to free any inline data buffer that might still be around
after all the blocks have been truncated off the file. Prior to commit 43518812d2 ("xfs: remove support for inlining data/extents into the
inode fork") nobody noticed because the leftover inline data after
truncation was small enough to fit inside the inline buffer inside the
fork itself.
However, now that we've removed the inline buffer, we /always/ have to
free the inline data buffer or else we leak them like crazy. This test
was found by turning on kmemleak for generic/001 or generic/388.
Paolo Bonzini [Mon, 27 Nov 2017 16:54:13 +0000 (17:54 +0100)]
Merge tag 'kvm-ppc-fixes-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master
PPC KVM fixes for 4.15
One commit here, that fixes a couple of bugs relating to the patch
series that enables HPT guests to run on a radix host on POWER9
systems. This patch series went upstream in the 4.15 merge window,
so no stable backport is required.
Jan H. Schönherr [Fri, 24 Nov 2017 21:39:01 +0000 (22:39 +0100)]
KVM: Let KVM_SET_SIGNAL_MASK work as advertised
KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".
This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.
Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.
Liu Bo [Tue, 21 Nov 2017 21:35:40 +0000 (14:35 -0700)]
Btrfs: fix list_add corruption and soft lockups in fsync
Xfstests btrfs/146 revealed this corruption,
[ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[ 58.153518] ------------[ cut here ]------------
[ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[ 58.161956] Call Trace:
[ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[ 58.164393] vfs_fsync_range+0x5f/0xd0
[ 58.164898] do_fsync+0x5a/0x90
[ 58.165170] SyS_fsync+0x10/0x20
[ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe
...
It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.
This returns the correct error in the above case.
Jeff also reported this while testing against his fsync error
patch set[1].
[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"
Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and writeback error") Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
The syzkaller testcase will execute VMXON/VMLAUCH instructions, so the
vmx->nested stuff is populated, it will also issue KVM_SMI ioctl. However,
the testcase is just a simple c program and not be lauched by something
like seabios which implements smi_handler. Commit 05cade71cf (KVM: nSVM:
fix SMI injection in guest mode) gets out of guest mode and set nested.vmxon
to false for the duration of SMM according to SDM 34.14.1 "leave VMX
operation" upon entering SMM. We can't alloc/free the vmx->nested stuff
each time when entering/exiting SMM since it will induce more overhead. So
the function vmx_pre_enter_smm() marks nested.vmxon false even if vmx->nested
stuff is still populated. What it expected is em_rsm() can mark nested.vmxon
to be true again. However, the smi_handler/rsm will not execute since there
is no something like seabios in this scenario. The function free_nested()
fails to free the vmx->nested stuff since the vmx->nested.vmxon is false
which results in the above warning.
This patch fixes it by also considering the no SMI handler case, luckily
vmx->nested.smm.vmxon is marked according to the value of vmx->nested.vmxon
in vmx_pre_enter_smm(), we can take advantage of it and free vmx->nested
stuff when L1 goes down.
which testcase tries to setup the processor specific debug
registers and configure vCPU for handling guest debug events through
KVM_SET_GUEST_DEBUG. The KVM_SET_GUEST_DEBUG ioctl will get and set
rflags in order to set TF bit if single step is needed. All regs' caches
are reset to avail and GUEST_RFLAGS vmcs field is reset to 0x2 during vCPU
reset. However, the cache of rflags is not reset during vCPU reset. The
function vmx_get_rflags() returns an unreset rflags cache value since
the cache is marked avail, it is 0 after boot. Vmentry fails if the
rflags reserved bit 1 is 0.
This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and
its cache to 0x2 during vCPU reset.
This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and
cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0
(set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results
in kvm_get_time_scale() gets into an infinite loop.
This patch fixes it by treating the unhotplug pCPU as not using master clock.
In x2apic mode the LDR is fixed based on the ID rather
than separately loadable like it was before x2.
When kvm_apic_set_state is called, the base is set, and if
it has the X2APIC_ENABLE flag set then the LDR is calculated;
however that value gets overwritten by the memcpy a few lines
below overwriting it with the value that came from userland.
The symptom is a lack of EOI after loading the state
(e.g. after a QEMU migration) and is due to the EOI bitmap
being wrong due to the incorrect LDR. This was seen with
a Win2016 guest under Qemu with irqchip=split whose USB mouse
didn't work after a VM migration.
This corresponds to RH bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1502591