Xander Huff [Wed, 24 Aug 2016 21:47:53 +0000 (16:47 -0500)]
Revert "phy: IRQ cannot be shared"
This reverts:
commit 33c133cc7598 ("phy: IRQ cannot be shared")
On hardware with multiple PHY devices hooked up to the same IRQ line, allow
them to share it.
Sergei Shtylyov says:
"I'm not sure now what was the reason I concluded that the IRQ sharing
was impossible... most probably I thought that the kernel IRQ handling
code exited the loop over the IRQ actions once IRQ_HANDLED was returned
-- which is obviously not so in reality..."
Florian Fainelli [Wed, 24 Aug 2016 18:01:20 +0000 (11:01 -0700)]
net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
We kept shadow copies of which interrupt sources we have enabled and
disabled, but due to an order bug in how intrl2_mask_clear was defined,
we could run into the following scenario:
CPU0 CPU1
intrl2_1_mask_clear(..)
sets INTRL2_CPU_MASK_CLEAR
bcm_sf2_switch_1_isr
read INTRL2_CPU_STATUS and masks with stale
irq1_mask value
updates irq1_mask value
Which would make us loop again and again trying to process and interrupt
we are not clearing since our copy of whether it was enabled before
still indicates it was not. Fix this by updating the shadow copy first,
and then unasking at the HW level.
Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Dave Airlie [Thu, 25 Aug 2016 19:18:40 +0000 (05:18 +1000)]
Merge tag 'drm-intel-fixes-2016-08-25' of git://anongit.freedesktop.org/drm-intel into drm-fixes
i915 fixes queue.
* tag 'drm-intel-fixes-2016-08-25' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Fix botched merge that downgrades CSR versions.
drm/i915/skl: Ensure pipes with changed wms get added to the state
drm/i915/gen9: Only copy WM results for changed pipes to skl_hw
drm/i915/skl: Add support for the SAGV, fix underrun hangs
drm/i915/gen6+: Interpret mailbox error flags
drm/i915: Reattach comment, complete type specification
drm/i915: Unconditionally flush any chipset buffers before execbuf
drm/i915/gen9: Drop invalid WARN() during data rate calculation
drm/i915/gen9: Initialize intel_state->active_crtcs during WM sanitization (v2)
Daniel Vetter [Wed, 10 Aug 2016 16:52:38 +0000 (18:52 +0200)]
drm: Protect fb_defio in drivers with CONFIG_KMS_FBDEV_EMULATION
For reasons that entirely elude me fb.h exposes all the structures,
even when it is not enabled. Except for special stuff like fb_defio.
Which means all the drivers which haven't yet switched over to the
defio support in the helpers and still roll their own, will fail
to compile when fbdev emulation is disabled. Protect just those
bits, as a gnarly reminder that conversion to the core defio helpers
would be good.
Bluetooth: Fix bt_sock_recvmsg when MSG_TRUNC is not set
Commit b5f34f9420b50c9b5876b9a2b68e96be6d629054 attempt to introduce
proper handling for MSG_TRUNC but recv and variants should still work
as read if no flag is passed, but because the code may set MSG_TRUNC to
msg->msg_flags that shall not be used as it may cause it to be behave as
if MSG_TRUNC is always, so instead of using it this changes the code to
use the flags parameter which shall contain the original flags.
Takashi Iwai [Thu, 25 Aug 2016 15:56:09 +0000 (17:56 +0200)]
Merge tag 'asoc-fix-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v4.8
A clutch of fixes for v4.8. These are mainly driver specific, the most
notable ones being those for OMAP which fix a series of issues that
broke boot on some platforms there when deferred probe kicked in.
There's also one core fix for an issue when unbinding a card which for
some reason had managed to not manifest until recently.
Tatyana Nikolova [Wed, 24 Aug 2016 18:59:17 +0000 (13:59 -0500)]
i40iw: Send last streaming mode message for loopback connections
Send a zero length last streaming mode message for loopback
connections to synchronize between accepting QP and connecting QP.
This avoids data transfer to start on the accepting QP before
the connecting QP is in RTS. Also remove function i40iw_loopback_nop()
as it is no longer used.
Andrey Ryabinin [Tue, 23 Aug 2016 15:55:31 +0000 (18:55 +0300)]
fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().
Check get_super() result to fix that.
Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.
Sabrina Dubroca [Tue, 23 Aug 2016 08:20:31 +0000 (10:20 +0200)]
netfilter: ebtables: put module reference when an incorrect extension is found
commit bcf493428840 ("netfilter: ebtables: Fix extension lookup with
identical name") added a second lookup in case the extension that was
found during the first lookup matched another extension with the same
name, but didn't release the reference on the incorrect module.
Fixes: bcf493428840 ("netfilter: ebtables: Fix extension lookup with identical name") Signed-off-by: Sabrina Dubroca <[email protected]> Acked-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
Liping Zhang [Mon, 22 Aug 2016 14:57:56 +0000 (22:57 +0800)]
netfilter: nft_meta: improve the validity check of pkttype set expr
"meta pkttype set" is only supported on prerouting chain with bridge
family and ingress chain with netdev family.
But the validate check is incomplete, and the user can add the nft
rules on input chain with bridge family, for example:
# nft add table bridge filter
# nft add chain bridge filter input {type filter hook input \
priority 0 \;}
# nft add chain bridge filter test
# nft add rule bridge filter test meta pkttype set unicast
# nft add rule bridge filter input jump test
Liping Zhang [Mon, 22 Aug 2016 13:58:18 +0000 (21:58 +0800)]
netfilter: cttimeout: unlink timeout objs in the unconfirmed ct lists
KASAN reported this bug:
BUG: KASAN: use-after-free in icmp_packet+0x25/0x50 [nf_conntrack_ipv4] at
addr ffff880002db08c8
Read of size 4 by task lt-nf-queue/19041
Call Trace:
<IRQ> [<ffffffff815eeebb>] dump_stack+0x63/0x88
[<ffffffff813386f8>] kasan_report_error+0x528/0x560
[<ffffffff81338cc8>] kasan_report+0x58/0x60
[<ffffffffa07393f5>] ? icmp_packet+0x25/0x50 [nf_conntrack_ipv4]
[<ffffffff81337551>] __asan_load4+0x61/0x80
[<ffffffffa07393f5>] icmp_packet+0x25/0x50 [nf_conntrack_ipv4]
[<ffffffffa06ecaa0>] nf_conntrack_in+0x550/0x980 [nf_conntrack]
[<ffffffffa06ec550>] ? __nf_conntrack_confirm+0xb10/0xb10 [nf_conntrack]
[ ... ]
The main reason is that we missed to unlink the timeout objects in the
unconfirmed ct lists, so we will access the timeout objects that have
already been freed.
Liping Zhang [Mon, 22 Aug 2016 13:58:17 +0000 (21:58 +0800)]
netfilter: cttimeout: put back l4proto when replacing timeout policy
We forget to call nf_ct_l4proto_put when replacing the existing
timeout policy. Acctually, there's no need to get ct l4proto
before doing replace, so we can move it to a later position.
Filipe Manana [Tue, 23 Aug 2016 20:13:51 +0000 (21:13 +0100)]
Btrfs: fix lockdep warning on deadlock against an inode's log mutex
Commit 44f714dae50a ("Btrfs: improve performance on fsync against new
inode after rename/unlink"), which landed in 4.8-rc2, introduced a
possibility for a deadlock due to double locking of an inode's log mutex
by the same task, which lockdep reports with:
[23045.433975] =============================================
[23045.434748] [ INFO: possible recursive locking detected ]
[23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
[23045.436044] ---------------------------------------------
[23045.436044] xfs_io/3688 is trying to acquire lock:
[23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
but task is already holding lock:
[23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
other info that might help us debug this:
[23045.436044] Possible unsafe locking scenario:
This is because while logging the inode of file bar we end up logging its
parent directory (since its inode has an unlink_trans field matching the
current transaction id due to the rename operation), which in turn logs
the inodes for all its new dentries, so that the new inode for the new
file named foo gets logged which in turn triggered another logging attempt
for the inode we are fsync'ing, since that inode had an old name that
corresponds to the name of the new inode.
So fix this by ensuring that when logging the inode for a new dentry that
has a name matching an old name of some other inode, we don't log again
the original inode that we are fsync'ing.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
Liu Bo [Tue, 23 Aug 2016 22:22:58 +0000 (15:22 -0700)]
Btrfs: detect corruption when non-root leaf has zero item
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().
This makes us check the situation as a corruption if leaf is
not its own root.
Jeff Mahoney [Thu, 18 Aug 2016 01:58:33 +0000 (21:58 -0400)]
btrfs: don't create or leak aliased root while cleaning up orphans
commit 909c3a22da3 (Btrfs: fix loading of orphan roots leading to BUG_ON)
avoids the BUG_ON but can add an aliased root to the dead_roots list or
leak the root.
Since we've already been loading roots into the radix tree, we should
use it before looking the root up on disk.
At the end of unmount/dev-delete, if the device exclusive open is not
actually closed, then there might be a race with another program in
the userland who is trying to open the device in exclusive mode and
it may fail for eg:
unmount /btrfs; fsck /dev/x
btrfs dev del /dev/x /btrfs; fsck /dev/x
so here background blkdev_put() is not a choice
With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.
Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).
The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/
This add comments to clarify do_chunk_alloc()'s return value.
>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.
The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:
CPU1 | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb); |
| |-> cleaner_kthread()
| | |-> btrfs_delete_unused_bgs()
| | |-> btrfs_remove_chunk()
| | |-> btrfs_remove_block_group()
| | |-> btrfs_add_delayed_iput()
| |
|-> sb->s_writers.frozen = SB_FREEZE_FS; |
|-> sb_wait_write(sb, SB_FREEZE_FS); |
| acquire SB_FREEZE_FS lock. |
| |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs() |
| will handle delayed iputs, |
| that means start_transaction() |
| will be called, which will try |
| to get SB_FREEZE_FS lock. |
To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.
Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |
In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.
To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.
As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.
Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.
Wang Xiaoguang [Mon, 25 Jul 2016 07:51:39 +0000 (15:51 +0800)]
btrfs: divide btrfs_update_reserved_bytes() into two functions
This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and
next patch will extend btrfs_add_reserved_bytes()to fix some
false ENOSPC error, please see later patch for detailed info.
Wang Xiaoguang [Mon, 25 Jul 2016 07:51:38 +0000 (15:51 +0800)]
btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
which indeed are extent's bytenr. The correct value should be
cluster->[start|end] minus block group's start bytenr.
Qu Wenruo [Mon, 15 Aug 2016 02:36:51 +0000 (10:36 +0800)]
btrfs: relocation: Fix leaking qgroups numbers on data extents
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
rework.
When balancing data extents, qgroup will leak all its numbers for
relocated data extents.
The relocation is done in the following steps for data extents:
1) Create data reloc tree and inode
2) Copy all data extents to data reloc tree
And commit transaction
3) Create tree reloc tree(special snapshot) for any related subvolumes
4) Replace file extent in tree reloc tree with new extents in data reloc
tree
And commit transaction
5) Merge tree reloc tree with original fs, by swapping tree blocks
For 1)~4), since tree reloc tree and data reloc tree doesn't count to
qgroup, everything is OK.
But for 5), the swapping of tree blocks will only info qgroup to track
metadata extents.
If metadata extents contain file extents, qgroup number for file extents
will get lost, leading to corrupted qgroup accounting.
The fix is, before commit transaction of step 5), manually info qgroup to
track all file extents in data reloc tree.
Since at commit transaction time, the tree swapping is done, and qgroup
will account these data extents correctly.
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
Almost the same with original code.
For delayed_ref usage, which has delayed refs locked.
Change the return value type to int, since caller never needs the
pointer, but only needs to know if they need to free the allocated
memory.
2. btrfs_qgroup_insert_dirty_extent()
The more encapsulated version.
Will do the delayed_refs lock, memory allocation, quota enabled check
and other things.
The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.
Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.
Jeff Mahoney [Tue, 9 Aug 2016 02:08:06 +0000 (22:08 -0400)]
btrfs: waiting on qgroup rescan should not always be interruptible
We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl. If the
user sends a signal while we're waiting, we continue happily along. This
is expected behavior for the rescan wait ioctl. It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.
Jeff Mahoney [Mon, 15 Aug 2016 16:10:33 +0000 (12:10 -0400)]
btrfs: properly track when rescan worker is running
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused. If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed. When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.
This patch uses a separate flag to indicate when the worker is
running. The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.
Alex Lyakas [Sun, 6 Dec 2015 10:32:31 +0000 (12:32 +0200)]
btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:
int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}
btrfs: backref: Fix soft lockup in __merge_refs function
When over 1000 file extents refers to one extent, find_parent_nodes()
will be obviously slow, due to the O(n^2)~O(n^3) loops inside
__merge_refs().
The following ftrace shows the cubic growth of execution time:
256 refs
5) + 91.768 us | __add_keyed_refs.isra.12 [btrfs]();
5) 1.447 us | __add_missing_keys.isra.13 [btrfs]();
5) ! 114.544 us | __merge_refs [btrfs]();
5) ! 136.399 us | __merge_refs [btrfs]();
512 refs
6) ! 279.859 us | __add_keyed_refs.isra.12 [btrfs]();
6) 3.164 us | __add_missing_keys.isra.13 [btrfs]();
6) ! 442.498 us | __merge_refs [btrfs]();
6) # 2091.073 us | __merge_refs [btrfs]();
and 1024 refs
7) ! 368.683 us | __add_keyed_refs.isra.12 [btrfs]();
7) 4.810 us | __add_missing_keys.isra.13 [btrfs]();
7) # 2043.428 us | __merge_refs [btrfs]();
7) * 18964.23 us | __merge_refs [btrfs]();
And sort them into the following char:
(Unit: us)
------------------------------------------------------------------------
Trace function | 256 ref | 512 refs | 1024 refs |
------------------------------------------------------------------------
__add_keyed_refs | 91 | 249 | 368 |
__add_missing_keys | 1 | 3 | 4 |
__merge_refs 1st call | 114 | 442 | 2043 |
__merge_refs 2nd call | 136 | 2091 | 18964 |
------------------------------------------------------------------------
We can see the that __add_keyed_refs() grows almost in linear behavior.
And __add_missing_keys() in this case doesn't change much or takes much
time.
While for the 1st __merge_refs() it's square growth
for the 2nd __merge_refs() call it's cubic growth.
It's no doubt that merge_refs() will take a long long time to execute if
the number of refs continues its grows.
So add a cond_resced() into the loop of __merge_refs().
Although this will solve the problem of soft lockup, we need to use the
new rb_tree based structure introduced by Lu Fengqi to really solve the
long execution time.
Liu Bo [Tue, 19 Jul 2016 22:36:05 +0000 (15:36 -0700)]
Btrfs: fix memory leak of reloc_root
When some critical errors occur and FS would be flipped into RO,
if we have an on-going balance, we can end up with a memory leak
of root->reloc_root since btrfs_drop_snapshots() bails out
without freeing reloc_root at the very early start.
However, we're not able to free reloc_root in btrfs_drop_snapshots()
because its caller, merge_reloc_roots(), still needs to access it to
cleanup reloc_root's rbtree.
This makes us free reloc_root when we're going to free fs/file roots.
Mark Rutland [Wed, 24 Aug 2016 17:02:08 +0000 (18:02 +0100)]
arm64: avoid TLB conflict with CONFIG_RANDOMIZE_BASE
When CONFIG_RANDOMIZE_BASE is selected, we modify the page tables to remap the
kernel at a newly-chosen VA range. We do this with the MMU disabled, but do not
invalidate TLBs prior to re-enabling the MMU with the new tables. Thus the old
mappings entries may still live in TLBs, and we risk violating
Break-Before-Make requirements, leading to TLB conflicts and/or other issues.
We invalidate TLBs when we uninsall the idmap in early setup code, but prior to
this we are subject to issues relating to the Break-Before-Make violation.
Avoid these issues by invalidating the TLBs before the new mappings can be
used by the hardware.
Linus Torvalds [Thu, 25 Aug 2016 09:49:38 +0000 (05:49 -0400)]
Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal fixes from Zhang Rui:
- Fix cpu_cooling to have separate thermal_cooling_device_ops
structures for cpus with and without power model, to avoid NULL
dereference in cpufreq_state2power. From Brendan Jackman.
- Fix a possible NULL dereference in imx_thermal driver. From Corentin
LABBE.
- Another two trivial fixes, one typo fix and one deleting module
owner. From Caesar Wang and Markus Elfring.
* 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
thermal: imx: fix a possible NULL dereference
thermal: trivial: fix the typo
Thermal-INT3406: Delete owner assignment
thermal: cpu_cooling: Fix NULL dereference in cpufreq_state2power
Dave Airlie [Thu, 25 Aug 2016 02:50:30 +0000 (12:50 +1000)]
Merge branch 'drm-fixes-4.8' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
radeon and amdgpu fixes for 4.8. Nothing major:
- fix a performance regression due to the LRU changes in 4.7
- 32 bit fixes
- fix a PLL regression
- misc bug fixes
* 'drm-fixes-4.8' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: skip TV/CV in display parsing
drm/amdgpu: avoid a possible array overflow
drm/amdgpu: fix lru size grouping v2
drm/amdgpu: fix timeout value check in amd_sched_job_recovery
drm/amdgpu: fix sdma_v2_4_ring_test_ib
drm/amdgpu: fix amdgpu_move_blit on 32bit systems
drm/radeon: fix radeon_move_blit on 32bit systems
drm/radeon: only apply the SS fractional workaround to RS[78]80
Dave Airlie [Thu, 25 Aug 2016 02:49:22 +0000 (12:49 +1000)]
Merge tag 'drm/tegra/for-4.8-rc4' of git://anongit.freedesktop.org/tegra/linux into drm-fixes
drm/tegra: Fixes for v4.8-rc4
This contains one fix for DSI runtime power management support that was
introduced in v4.8-rc1. This is slightly more elaborate than I would've
wished, but there are a few corner cases that needed fixing.
* tag 'drm/tegra/for-4.8-rc4' of git://anongit.freedesktop.org/tegra/linux:
drm/tegra: dsi: Enhance runtime power management
Commit e6047149db ("dm: use bio op accessors") switched DM over to
using bio_set_op_attrs() but didn't take care to initialize
lc->io_req.bi_op_flags in dm-log.c:rw_header(). This caused
rw_header()'s call to dm_io() to make bio->bi_op_flags be uninitialized
in dm-io.c:do_region(), which ultimately resulted in a SCSI BUG() in
sd_init_command().
Also, adjust rw_header() and its callers to use REQ_OP_{READ|WRITE}.
Mike Snitzer [Thu, 25 Aug 2016 01:12:58 +0000 (21:12 -0400)]
dm flakey: fix reads to be issued if drop_writes configured
v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.
Jens Axboe [Wed, 24 Aug 2016 21:38:01 +0000 (15:38 -0600)]
blk-mq: improve warning for running a queue on the wrong CPU
__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online. If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.
Jens Axboe [Wed, 24 Aug 2016 21:34:35 +0000 (15:34 -0600)]
blk-mq: don't overwrite rq->mq_ctx
We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.
This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software <-> hardware queue mappings.
Doug Ledford [Wed, 24 Aug 2016 16:14:19 +0000 (12:14 -0400)]
IB/srpt: Update sport->port_guid with each port refresh
If port_guid is set with the default subnet_prefix, then we get a change
event and run a port refresh, we don't update the port_guid. As a
result, attempts to create a target device that uses the new
subnet_prefix in the wwn will fail to find a match and be rejected by
the ib_srpt driver. This makes it impossible to configure a port if it
was initialized with a default subnet_prefix and later changed to any
non-default subnet-prefix. Updating the port refresh task to always
update the wwn based upon the current subnext_prefix solves this
problem.
Linus Torvalds [Wed, 24 Aug 2016 18:04:30 +0000 (14:04 -0400)]
Merge tag 'for-linus-4.8b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen regression fix from David Vrabel:
"Fix a regression in the xenbus device preventing userspace tools from
working"
* tag 'for-linus-4.8b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: change the type of xen_vcpu_id to uint32_t
xenbus: don't look up transaction IDs for ordinary writes
We pass xen_vcpu_id mapping information to hypercalls which require
uint32_t type so it would be cleaner to have it as uint32_t. The
initializer to -1 can be dropped as we always do the mapping before using
it and we never check the 'not set' value anyway.
Yotam Gigi [Wed, 24 Aug 2016 09:18:52 +0000 (11:18 +0200)]
mlxsw: router: Enable neighbors to be created on stacked devices
Make the function mlxsw_router_neigh_construct search the rif according
to the neighbour dev other than the dev that was passed to the ndo, thus
allowing creating neigbhours upon stacked devices.
Ido Schimmel [Wed, 24 Aug 2016 09:18:51 +0000 (11:18 +0200)]
mlxsw: spectrum: Add missing flood to router port
In case we have a layer 3 interface on top of a bridge (VLAN / FID RIF),
then we should flood the following packet types to the router:
* Broadcast: If DIP is the broadcast address of the interface, then we
need to be able to get it to CPU by trapping it following route lookup.
* Reserved IP multicast (224.0.0.X): Some control packets (e.g. OSPF)
use this range and are trapped in the router block.
Fixes: 99f44bb3527b ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices") Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Selvin Xavier [Wed, 24 Aug 2016 05:17:41 +0000 (01:17 -0400)]
RDMA/ocrdma: Fix the max_sge reported from FW
Current driver is reporting wrong values for max_sge and
max_sge_rd in query_device. This breaks the nfs rdma and iser
in some device profiles. Fixing the driver to report
correct values from FW.
Mustafa Ismail [Tue, 23 Aug 2016 22:24:56 +0000 (17:24 -0500)]
i40iw: Avoid writing to freed memory
iwpbl->iwmr points to the structure that contains iwpbl,
which is iwmr. Setting this to NULL would result in
writing to freed memory. So just free iwmr, and return.
Mustafa Ismail [Tue, 23 Aug 2016 21:50:13 +0000 (16:50 -0500)]
i40iw: Fix double free of allocated_buffer
Memory allocated for iwqp; iwqp->allocated_buffer is freed twice in
the create_qp error path. Correct this by having it freed only once in
i40iw_free_qp_resources().
Chris Wilson [Tue, 23 Aug 2016 20:16:26 +0000 (21:16 +0100)]
IB/mlx5: Remove superfluous include of io-mapping.h
This file does not use any structs or functions defined by io-mapping.h
(nor does it directly use iomap, ioremap, iounamp or friends). Remove it
to simplify verification of changes to io-mapping.h
Shiraz Saleem [Mon, 22 Aug 2016 23:16:37 +0000 (18:16 -0500)]
i40iw: Add missing NULL check for MPA private data
Add NULL check for pdata and pdata->addr before the memcpy in
i40iw_form_cm_frame(). This fixes a NULL pointer de-reference
which occurs when the MPA private data pointer is NULL. Also
only copy pdata->size bytes in the memcpy to prevent reading
past the length of the private data buffer provided by upper layer.
Daniel Borkmann [Wed, 27 Jul 2016 18:40:14 +0000 (11:40 -0700)]
Bluetooth: split sk_filter in l2cap_sock_recv_cb
During an audit for sk_filter(), we found that rx_busy_skb handling
in l2cap_sock_recv_cb() and l2cap_sock_recvmsg() looks not quite as
intended.
The assumption from commit e328140fdacb ("Bluetooth: Use event-driven
approach for handling ERTM receive buffer") is that errors returned
from sock_queue_rcv_skb() are due to receive buffer shortage. However,
nothing should prevent doing a setsockopt() with SO_ATTACH_FILTER on
the socket, that could drop some of the incoming skbs when handled in
sock_queue_rcv_skb().
In that case sock_queue_rcv_skb() will return with -EPERM, propagated
from sk_filter() and if in L2CAP_MODE_ERTM mode, wrong assumption was
that we failed due to receive buffer being full. From that point onwards,
due to the to-be-dropped skb being held in rx_busy_skb, we cannot make
any forward progress as rx_busy_skb is never cleared from l2cap_sock_recvmsg(),
due to the filter drop verdict over and over coming from sk_filter().
Meanwhile, in l2cap_sock_recv_cb() all new incoming skbs are being
dropped due to rx_busy_skb being occupied.
Instead, just use __sock_queue_rcv_skb() where an error really tells that
there's a receive buffer issue. Split the sk_filter() and enable it for
non-segmented modes at queuing time since at this point in time the skb has
already been through the ERTM state machine and it has been acked, so dropping
is not allowed. Instead, for ERTM and streaming mode, call sk_filter() in
l2cap_data_rcv() so the packet can be dropped before the state machine sees it.
Fixes: e328140fdacb ("Bluetooth: Use event-driven approach for handling ERTM receive buffer") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Acked-by: Willem de Bruijn <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
Frederic Dalleau [Tue, 23 Aug 2016 05:59:19 +0000 (07:59 +0200)]
Bluetooth: Fix memory leak at end of hci requests
In hci_req_sync_complete the event skb is referenced in hdev->req_skb.
It is used (via hci_req_run_skb) from either __hci_cmd_sync_ev which will
pass the skb to the caller, or __hci_req_sync which leaks.
Ming Lei [Tue, 23 Aug 2016 13:49:45 +0000 (21:49 +0800)]
block: make sure a big bio is split into at most 256 bvecs
After arbitrary bio size was introduced, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reason, such
as bio_clone().
The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk
- build bcache device over the above raid1 and another cache device
and bucket size is set as 2Mbytes
- set cache mode as writeback
- run random write over ext4 on the bcache device
Andy Lutomirski [Wed, 24 Aug 2016 10:52:12 +0000 (03:52 -0700)]
nvme: Fix nvme_get/set_features() with a NULL result pointer
nvme_set_features() callers seem to expect that passing NULL as the
result pointer is acceptable. Teach nvme_set_features() not to try to
write to the NULL address.
For symmetry, make the same change to nvme_get_features(), despite the
fact that all current callers pass a valid result pointer.
I assume that this bug hasn't been reported in practice because
the callers that pass NULL are all in the SCSI translation layer
and no one uses the relevant operations.
Thierry Reding [Fri, 12 Aug 2016 14:00:53 +0000 (16:00 +0200)]
drm/tegra: dsi: Enhance runtime power management
The MIPI DSI output on Tegra SoCs requires some external logic to
calibrate the MIPI pads before a video signal can be transmitted. This
MIPI calibration logic requires to be powered on while the MIPI pads are
being used, which is currently done as part of the DSI driver's probe
implementation.
This is suboptimal because it will leave the MIPI calibration logic
powered up even if the DSI output is never used.
On Tegra114 and earlier this behaviour also causes the driver to hang
while trying to power up the MIPI calibration logic because the power
partition that contains the MIPI calibration logic will be powered on
by the display controller at output pipeline configuration time. Thus
the power up sequence for the MIPI calibration logic happens before
it's power partition is guaranteed to be enabled.
Fix this by splitting up the API into a request/free pair of functions
that manage the runtime dependency between the DSI and the calibration
modules (no registers are accessed) and a set of enable, calibrate and
disable functions that program the MIPI calibration logic at points in
time where the power partition is really enabled.
While at it, make sure that the runtime power management also works in
ganged mode, which is currently also broken.
Change vif_event_lock to spinlock from mutex, since this lock is
used in wait_event_timeout() via vif_event_equals(). This caused
a warning report as below.
As far as I can see, this lock protects regions where updating
structure members, not function calls. Also, since those
regions are not called from interrupt handlers (of course, it
was a mutex), spin_lock is used instead of spin_lock_irqsave.
brcmfmac: Check rtnl_lock is locked when removing interface
Check rtnl_lock is locked in brcmf_p2p_ifp_removed() by passing
rtnl_locked flag. Actually the caller brcmf_del_if() checks whether
the rtnl_lock is locked, but doesn't pass it to brcmf_p2p_ifp_removed().
Without this fix, wpa_supplicant goes softlockup with rtnl_lock
holding (this means all other process using netlink are locked up too)
Will Deacon [Wed, 24 Aug 2016 09:07:14 +0000 (10:07 +0100)]
perf/core: Use this_cpu_ptr() when stopping AUX events
When tearing down an AUX buf for an event via perf_mmap_close(),
__perf_event_output_stop() is called on the event's CPU to ensure that
trace generation is halted before the process of unmapping and
freeing the buffer pages begins.
The callback is performed via cpu_function_call(), which ensures that it
runs with interrupts disabled and is therefore not preemptible.
Unfortunately, the current code grabs the per-cpu context pointer using
get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
a BUG when freeing the AUX buffer later on:
Mark Rutland [Tue, 9 Aug 2016 10:03:56 +0000 (11:03 +0100)]
MAINTAINERS: Add ARM ARCHITECTED TIMER entry
The ARM architected timer driver falls under the drivers/clocksource/
catch-all in MAINTAINERS, and get_maintainers.pl doesn't suggest a
number of people who should be Cc'd.
The ARM architected timer is a core component of ARMv7+VE and ARMv8, and
is critical to the correct operation of both architecture ports (and
their respective KVM code), and patches to it should have review by
knowledgeable interested parties.
This patch adds a MAINTAINERS entry for the driver and its low-level
arch components, such that get_maintainer.pl will always include
relevant interested parties for modifications to the driver. For the
timebeing, this means myself and Marc Zyngier.
So IR table is setup even if "noapic" boot parameter is added. As a result we
crash later when the interrupt affinity is set due to a half initialized
remapping infrastructure.
Prevent remap initialization when IOAPIC is disabled.
John Stultz [Tue, 23 Aug 2016 23:08:22 +0000 (16:08 -0700)]
timekeeping: Cap array access in timekeeping_debug
It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.
However there is actually no other process would like to grab this lock on
that problematic platform.
Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.
Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.
Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.
As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.
This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.
[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
issue slightly differently, but borrowed his excelent explanation of the
issue here.]
John Stultz [Tue, 23 Aug 2016 23:08:21 +0000 (16:08 -0700)]
timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING
When I added some extra sanity checking in timekeeping_get_ns() under
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
method was using timekeeping_get_ns().
Thus the locking added to the debug checks broke the NMI-safety of
__ktime_get_fast_ns().
This patch open-codes the timekeeping_get_ns() logic for
__ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
David Ahern [Wed, 24 Aug 2016 04:05:27 +0000 (21:05 -0700)]
net: diag: Fix refcnt leak in error path destroying socket
inet_diag_find_one_icsk takes a reference to a socket that is not
released if sock_diag_destroy returns an error. Fix by changing
tcp_diag_destroy to manage the refcnt for all cases and remove
the sock_put calls from tcp_abort.
Fixes: c1e64e298b8ca ("net: diag: Support destroying TCP sockets") Reported-by: Lorenzo Colitti <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
software transmit timestamp of a packet.
sock_tx_timestamp resets and overrides the tx_flags of the skb.
The function is intended to be called from within the protocol
layer when creating the skb, not from a device driver. This is
inconsistent with other drivers and will cause issues for TCP.
In TCP, we intend to sample the timestamps for the last byte
for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
tcp_tx_timestamp only with the last skb that it generates.
For example, if a 128KB message is split into two 64KB packets
we want to sample the SND timestamp of the last packet. The current
code in the tun driver, however, will result in sampling the SND
timestamp for both packets.
Also, when the last packet is split into smaller packets for
retranmission (see tcp_fragment), the tun driver will record
timestamps for all of the retransmitted packets and not only the
last packet.
Linus Torvalds [Wed, 24 Aug 2016 00:24:27 +0000 (20:24 -0400)]
Merge tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fsmark regression
- i_size race condition
- wrong conditions in f2fs_move_file_range
* tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: avoid potential deadlock in f2fs_move_file_range
f2fs: allow copying file range only in between regular files
Revert "f2fs: move i_size_write in f2fs_write_end"
Revert "f2fs: use percpu_rw_semaphore"
Lance Richardson [Tue, 23 Aug 2016 15:40:52 +0000 (11:40 -0400)]
sctp: fix overrun in sctp_diag_dump_one()
The function sctp_diag_dump_one() currently performs a memcpy()
of 64 bytes from a 16 byte field into another 16 byte field. Fix
by using correct size, use sizeof to obtain correct size instead
of using a hard-coded constant.
Rabin Vincent [Tue, 23 Aug 2016 14:31:28 +0000 (16:31 +0200)]
dwc_eth_qos: fix interrupt enable race
We currently enable interrupts before we enable NAPI. If an RX interrupt
hits before we enabled NAPI then the NAPI callback is never called and
we leave the hardware with RX interrupts disabled, which of course leads
us to never handling received packets. Fix this by moving the interrupt
enable to after we've enable NAPI and the reclaim tasklet.
Fixes: cd5e41234729 ("dwc_eth_qos: do phy_start before resetting hardware") Signed-off-by: Rabin Vincent <[email protected]> Signed-off-by: Lars Persson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jamie Lentin [Mon, 22 Aug 2016 21:47:08 +0000 (22:47 +0100)]
net: mv88e6xxx: Fix ingress rate removal for mv6131 chips
The PORT_RATE_CONTROL register works differently on 88e6095/6095f/6131
in comparison to 6123/61/65, and 0x0 disables. The distinction was lost
Linux 4.1 --> 4.2
Xander Huff [Mon, 22 Aug 2016 20:57:16 +0000 (15:57 -0500)]
phy: micrel: Reenable interrupts during resume for ksz9031
Like the ksz8081, the ksz9031 has the behavior where it will clear the
interrupt enable bits when leaving power down. This takes advantage of the
solution provided by f5aba91.
Zefir Kurtisi [Mon, 22 Aug 2016 13:58:12 +0000 (15:58 +0200)]
gianfar: fix size of scatter-gathered frames
The current scatter-gather logic in gianfar is flawed, since
it does not consider the eTSEC's RxBD 'Data Length' field is
context depening: for the last fragment it contains the full
frame size, while fragments contain the fragment size, which
equals the value written to register MRBLR.
This causes data corruption as soon as the hardware starts
to fragment receiving frames. As a result, the size of
fragmented frames is increased by
(nr_frags - 1) * MRBLR
We first noticed this issue working with DSA, where an ICMP
request sized 1472 bytes causes the scatter-gather logic to
kick in. The full Ethernet frame (1518) gets increased by
DSA (4), GMAC_FCB_LEN (8), and FSL_GIANFAR_DEV_HAS_TIMER
(priv->padding=8) to a total of 1538 octets, which is
fragmented by the hardware and reconstructed by the driver
to a 3074 octet frame.
This patch fixes the problem by adjusting the size of
the last fragment.
It was tested by setting MRBLR to different multiples of
64, proving correct scatter-gather operation on frames
with up to 9000 octets in size.
Zefir Kurtisi [Mon, 22 Aug 2016 13:56:38 +0000 (15:56 +0200)]
gianfar: prevent fragmentation in DSA environments
The eTSEC register MRBLR defines the maximum space in
the RX buffers and is set to 1536 by gianfar. This
reasonably covers the common use case where the MTU
is kept at default 1500. In that case, the largest
Ethernet frame size of 1518 plus an optional
GMAC_FCB_LEN of 8, and an additional padding of 8
to handle FSL_GIANFAR_DEV_HAS_TIMER totals to 1534
and nicely fit within the chosen MRBLR.
Alas, if the eTSEC is attached to a DSA enabled switch,
the (E)DSA header extension (4 or 8 bytes) causes every
maximum sized frame to be fragmented by the hardware.
This patch increases the maximum RX buffer size by 8
and rounds up to the next multiple of 64, which the
hardware's defines as RX buffer granularity.