Andrey Ryabinin [Tue, 23 Aug 2016 15:55:31 +0000 (18:55 +0300)]
fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().
Check get_super() result to fix that.
Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.
Filipe Manana [Tue, 23 Aug 2016 20:13:51 +0000 (21:13 +0100)]
Btrfs: fix lockdep warning on deadlock against an inode's log mutex
Commit 44f714dae50a ("Btrfs: improve performance on fsync against new
inode after rename/unlink"), which landed in 4.8-rc2, introduced a
possibility for a deadlock due to double locking of an inode's log mutex
by the same task, which lockdep reports with:
[23045.433975] =============================================
[23045.434748] [ INFO: possible recursive locking detected ]
[23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
[23045.436044] ---------------------------------------------
[23045.436044] xfs_io/3688 is trying to acquire lock:
[23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
but task is already holding lock:
[23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
other info that might help us debug this:
[23045.436044] Possible unsafe locking scenario:
This is because while logging the inode of file bar we end up logging its
parent directory (since its inode has an unlink_trans field matching the
current transaction id due to the rename operation), which in turn logs
the inodes for all its new dentries, so that the new inode for the new
file named foo gets logged which in turn triggered another logging attempt
for the inode we are fsync'ing, since that inode had an old name that
corresponds to the name of the new inode.
So fix this by ensuring that when logging the inode for a new dentry that
has a name matching an old name of some other inode, we don't log again
the original inode that we are fsync'ing.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
Liu Bo [Tue, 23 Aug 2016 22:22:58 +0000 (15:22 -0700)]
Btrfs: detect corruption when non-root leaf has zero item
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().
This makes us check the situation as a corruption if leaf is
not its own root.
Jeff Mahoney [Thu, 18 Aug 2016 01:58:33 +0000 (21:58 -0400)]
btrfs: don't create or leak aliased root while cleaning up orphans
commit 909c3a22da3 (Btrfs: fix loading of orphan roots leading to BUG_ON)
avoids the BUG_ON but can add an aliased root to the dead_roots list or
leak the root.
Since we've already been loading roots into the radix tree, we should
use it before looking the root up on disk.
At the end of unmount/dev-delete, if the device exclusive open is not
actually closed, then there might be a race with another program in
the userland who is trying to open the device in exclusive mode and
it may fail for eg:
unmount /btrfs; fsck /dev/x
btrfs dev del /dev/x /btrfs; fsck /dev/x
so here background blkdev_put() is not a choice
With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.
Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).
The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/
This add comments to clarify do_chunk_alloc()'s return value.
>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.
The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:
CPU1 | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb); |
| |-> cleaner_kthread()
| | |-> btrfs_delete_unused_bgs()
| | |-> btrfs_remove_chunk()
| | |-> btrfs_remove_block_group()
| | |-> btrfs_add_delayed_iput()
| |
|-> sb->s_writers.frozen = SB_FREEZE_FS; |
|-> sb_wait_write(sb, SB_FREEZE_FS); |
| acquire SB_FREEZE_FS lock. |
| |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs() |
| will handle delayed iputs, |
| that means start_transaction() |
| will be called, which will try |
| to get SB_FREEZE_FS lock. |
To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.
Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |
In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.
To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.
As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.
Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.
Wang Xiaoguang [Mon, 25 Jul 2016 07:51:39 +0000 (15:51 +0800)]
btrfs: divide btrfs_update_reserved_bytes() into two functions
This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and
next patch will extend btrfs_add_reserved_bytes()to fix some
false ENOSPC error, please see later patch for detailed info.
Wang Xiaoguang [Mon, 25 Jul 2016 07:51:38 +0000 (15:51 +0800)]
btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
which indeed are extent's bytenr. The correct value should be
cluster->[start|end] minus block group's start bytenr.
Qu Wenruo [Mon, 15 Aug 2016 02:36:51 +0000 (10:36 +0800)]
btrfs: relocation: Fix leaking qgroups numbers on data extents
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
rework.
When balancing data extents, qgroup will leak all its numbers for
relocated data extents.
The relocation is done in the following steps for data extents:
1) Create data reloc tree and inode
2) Copy all data extents to data reloc tree
And commit transaction
3) Create tree reloc tree(special snapshot) for any related subvolumes
4) Replace file extent in tree reloc tree with new extents in data reloc
tree
And commit transaction
5) Merge tree reloc tree with original fs, by swapping tree blocks
For 1)~4), since tree reloc tree and data reloc tree doesn't count to
qgroup, everything is OK.
But for 5), the swapping of tree blocks will only info qgroup to track
metadata extents.
If metadata extents contain file extents, qgroup number for file extents
will get lost, leading to corrupted qgroup accounting.
The fix is, before commit transaction of step 5), manually info qgroup to
track all file extents in data reloc tree.
Since at commit transaction time, the tree swapping is done, and qgroup
will account these data extents correctly.
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
Almost the same with original code.
For delayed_ref usage, which has delayed refs locked.
Change the return value type to int, since caller never needs the
pointer, but only needs to know if they need to free the allocated
memory.
2. btrfs_qgroup_insert_dirty_extent()
The more encapsulated version.
Will do the delayed_refs lock, memory allocation, quota enabled check
and other things.
The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.
Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.
Jeff Mahoney [Tue, 9 Aug 2016 02:08:06 +0000 (22:08 -0400)]
btrfs: waiting on qgroup rescan should not always be interruptible
We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl. If the
user sends a signal while we're waiting, we continue happily along. This
is expected behavior for the rescan wait ioctl. It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.
Jeff Mahoney [Mon, 15 Aug 2016 16:10:33 +0000 (12:10 -0400)]
btrfs: properly track when rescan worker is running
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused. If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed. When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.
This patch uses a separate flag to indicate when the worker is
running. The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.
Alex Lyakas [Sun, 6 Dec 2015 10:32:31 +0000 (12:32 +0200)]
btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:
int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}
btrfs: backref: Fix soft lockup in __merge_refs function
When over 1000 file extents refers to one extent, find_parent_nodes()
will be obviously slow, due to the O(n^2)~O(n^3) loops inside
__merge_refs().
The following ftrace shows the cubic growth of execution time:
256 refs
5) + 91.768 us | __add_keyed_refs.isra.12 [btrfs]();
5) 1.447 us | __add_missing_keys.isra.13 [btrfs]();
5) ! 114.544 us | __merge_refs [btrfs]();
5) ! 136.399 us | __merge_refs [btrfs]();
512 refs
6) ! 279.859 us | __add_keyed_refs.isra.12 [btrfs]();
6) 3.164 us | __add_missing_keys.isra.13 [btrfs]();
6) ! 442.498 us | __merge_refs [btrfs]();
6) # 2091.073 us | __merge_refs [btrfs]();
and 1024 refs
7) ! 368.683 us | __add_keyed_refs.isra.12 [btrfs]();
7) 4.810 us | __add_missing_keys.isra.13 [btrfs]();
7) # 2043.428 us | __merge_refs [btrfs]();
7) * 18964.23 us | __merge_refs [btrfs]();
And sort them into the following char:
(Unit: us)
------------------------------------------------------------------------
Trace function | 256 ref | 512 refs | 1024 refs |
------------------------------------------------------------------------
__add_keyed_refs | 91 | 249 | 368 |
__add_missing_keys | 1 | 3 | 4 |
__merge_refs 1st call | 114 | 442 | 2043 |
__merge_refs 2nd call | 136 | 2091 | 18964 |
------------------------------------------------------------------------
We can see the that __add_keyed_refs() grows almost in linear behavior.
And __add_missing_keys() in this case doesn't change much or takes much
time.
While for the 1st __merge_refs() it's square growth
for the 2nd __merge_refs() call it's cubic growth.
It's no doubt that merge_refs() will take a long long time to execute if
the number of refs continues its grows.
So add a cond_resced() into the loop of __merge_refs().
Although this will solve the problem of soft lockup, we need to use the
new rb_tree based structure introduced by Lu Fengqi to really solve the
long execution time.
Liu Bo [Tue, 19 Jul 2016 22:36:05 +0000 (15:36 -0700)]
Btrfs: fix memory leak of reloc_root
When some critical errors occur and FS would be flipped into RO,
if we have an on-going balance, we can end up with a memory leak
of root->reloc_root since btrfs_drop_snapshots() bails out
without freeing reloc_root at the very early start.
However, we're not able to free reloc_root in btrfs_drop_snapshots()
because its caller, merge_reloc_roots(), still needs to access it to
cleanup reloc_root's rbtree.
This makes us free reloc_root when we're going to free fs/file roots.
Mark Rutland [Wed, 24 Aug 2016 17:02:08 +0000 (18:02 +0100)]
arm64: avoid TLB conflict with CONFIG_RANDOMIZE_BASE
When CONFIG_RANDOMIZE_BASE is selected, we modify the page tables to remap the
kernel at a newly-chosen VA range. We do this with the MMU disabled, but do not
invalidate TLBs prior to re-enabling the MMU with the new tables. Thus the old
mappings entries may still live in TLBs, and we risk violating
Break-Before-Make requirements, leading to TLB conflicts and/or other issues.
We invalidate TLBs when we uninsall the idmap in early setup code, but prior to
this we are subject to issues relating to the Break-Before-Make violation.
Avoid these issues by invalidating the TLBs before the new mappings can be
used by the hardware.
Linus Torvalds [Thu, 25 Aug 2016 09:49:38 +0000 (05:49 -0400)]
Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal fixes from Zhang Rui:
- Fix cpu_cooling to have separate thermal_cooling_device_ops
structures for cpus with and without power model, to avoid NULL
dereference in cpufreq_state2power. From Brendan Jackman.
- Fix a possible NULL dereference in imx_thermal driver. From Corentin
LABBE.
- Another two trivial fixes, one typo fix and one deleting module
owner. From Caesar Wang and Markus Elfring.
* 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
thermal: imx: fix a possible NULL dereference
thermal: trivial: fix the typo
Thermal-INT3406: Delete owner assignment
thermal: cpu_cooling: Fix NULL dereference in cpufreq_state2power
Dave Airlie [Thu, 25 Aug 2016 02:50:30 +0000 (12:50 +1000)]
Merge branch 'drm-fixes-4.8' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
radeon and amdgpu fixes for 4.8. Nothing major:
- fix a performance regression due to the LRU changes in 4.7
- 32 bit fixes
- fix a PLL regression
- misc bug fixes
* 'drm-fixes-4.8' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: skip TV/CV in display parsing
drm/amdgpu: avoid a possible array overflow
drm/amdgpu: fix lru size grouping v2
drm/amdgpu: fix timeout value check in amd_sched_job_recovery
drm/amdgpu: fix sdma_v2_4_ring_test_ib
drm/amdgpu: fix amdgpu_move_blit on 32bit systems
drm/radeon: fix radeon_move_blit on 32bit systems
drm/radeon: only apply the SS fractional workaround to RS[78]80
Dave Airlie [Thu, 25 Aug 2016 02:49:22 +0000 (12:49 +1000)]
Merge tag 'drm/tegra/for-4.8-rc4' of git://anongit.freedesktop.org/tegra/linux into drm-fixes
drm/tegra: Fixes for v4.8-rc4
This contains one fix for DSI runtime power management support that was
introduced in v4.8-rc1. This is slightly more elaborate than I would've
wished, but there are a few corner cases that needed fixing.
* tag 'drm/tegra/for-4.8-rc4' of git://anongit.freedesktop.org/tegra/linux:
drm/tegra: dsi: Enhance runtime power management
Commit e6047149db ("dm: use bio op accessors") switched DM over to
using bio_set_op_attrs() but didn't take care to initialize
lc->io_req.bi_op_flags in dm-log.c:rw_header(). This caused
rw_header()'s call to dm_io() to make bio->bi_op_flags be uninitialized
in dm-io.c:do_region(), which ultimately resulted in a SCSI BUG() in
sd_init_command().
Also, adjust rw_header() and its callers to use REQ_OP_{READ|WRITE}.
Mike Snitzer [Thu, 25 Aug 2016 01:12:58 +0000 (21:12 -0400)]
dm flakey: fix reads to be issued if drop_writes configured
v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.
Jens Axboe [Wed, 24 Aug 2016 21:38:01 +0000 (15:38 -0600)]
blk-mq: improve warning for running a queue on the wrong CPU
__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online. If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.
Jens Axboe [Wed, 24 Aug 2016 21:34:35 +0000 (15:34 -0600)]
blk-mq: don't overwrite rq->mq_ctx
We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.
This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software <-> hardware queue mappings.
Doug Ledford [Wed, 24 Aug 2016 16:14:19 +0000 (12:14 -0400)]
IB/srpt: Update sport->port_guid with each port refresh
If port_guid is set with the default subnet_prefix, then we get a change
event and run a port refresh, we don't update the port_guid. As a
result, attempts to create a target device that uses the new
subnet_prefix in the wwn will fail to find a match and be rejected by
the ib_srpt driver. This makes it impossible to configure a port if it
was initialized with a default subnet_prefix and later changed to any
non-default subnet-prefix. Updating the port refresh task to always
update the wwn based upon the current subnext_prefix solves this
problem.
Linus Torvalds [Wed, 24 Aug 2016 18:04:30 +0000 (14:04 -0400)]
Merge tag 'for-linus-4.8b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen regression fix from David Vrabel:
"Fix a regression in the xenbus device preventing userspace tools from
working"
* tag 'for-linus-4.8b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: change the type of xen_vcpu_id to uint32_t
xenbus: don't look up transaction IDs for ordinary writes
Shaohua Li [Tue, 23 Aug 2016 04:14:01 +0000 (21:14 -0700)]
raid5: fix memory leak of bio integrity data
Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Song Liu [Fri, 19 Aug 2016 22:34:01 +0000 (15:34 -0700)]
r5cache: set MD_JOURNAL_CLEAN correctly
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.
With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
We pass xen_vcpu_id mapping information to hypercalls which require
uint32_t type so it would be cleaner to have it as uint32_t. The
initializer to -1 can be dropped as we always do the mapping before using
it and we never check the 'not set' value anyway.
Yotam Gigi [Wed, 24 Aug 2016 09:18:52 +0000 (11:18 +0200)]
mlxsw: router: Enable neighbors to be created on stacked devices
Make the function mlxsw_router_neigh_construct search the rif according
to the neighbour dev other than the dev that was passed to the ndo, thus
allowing creating neigbhours upon stacked devices.
Ido Schimmel [Wed, 24 Aug 2016 09:18:51 +0000 (11:18 +0200)]
mlxsw: spectrum: Add missing flood to router port
In case we have a layer 3 interface on top of a bridge (VLAN / FID RIF),
then we should flood the following packet types to the router:
* Broadcast: If DIP is the broadcast address of the interface, then we
need to be able to get it to CPU by trapping it following route lookup.
* Reserved IP multicast (224.0.0.X): Some control packets (e.g. OSPF)
use this range and are trapped in the router block.
Fixes: 99f44bb3527b ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices") Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Selvin Xavier [Wed, 24 Aug 2016 05:17:41 +0000 (01:17 -0400)]
RDMA/ocrdma: Fix the max_sge reported from FW
Current driver is reporting wrong values for max_sge and
max_sge_rd in query_device. This breaks the nfs rdma and iser
in some device profiles. Fixing the driver to report
correct values from FW.
Mustafa Ismail [Tue, 23 Aug 2016 22:24:56 +0000 (17:24 -0500)]
i40iw: Avoid writing to freed memory
iwpbl->iwmr points to the structure that contains iwpbl,
which is iwmr. Setting this to NULL would result in
writing to freed memory. So just free iwmr, and return.
Mustafa Ismail [Tue, 23 Aug 2016 21:50:13 +0000 (16:50 -0500)]
i40iw: Fix double free of allocated_buffer
Memory allocated for iwqp; iwqp->allocated_buffer is freed twice in
the create_qp error path. Correct this by having it freed only once in
i40iw_free_qp_resources().
Chris Wilson [Tue, 23 Aug 2016 20:16:26 +0000 (21:16 +0100)]
IB/mlx5: Remove superfluous include of io-mapping.h
This file does not use any structs or functions defined by io-mapping.h
(nor does it directly use iomap, ioremap, iounamp or friends). Remove it
to simplify verification of changes to io-mapping.h
Shiraz Saleem [Mon, 22 Aug 2016 23:16:37 +0000 (18:16 -0500)]
i40iw: Add missing NULL check for MPA private data
Add NULL check for pdata and pdata->addr before the memcpy in
i40iw_form_cm_frame(). This fixes a NULL pointer de-reference
which occurs when the MPA private data pointer is NULL. Also
only copy pdata->size bytes in the memcpy to prevent reading
past the length of the private data buffer provided by upper layer.
Daniel Borkmann [Wed, 27 Jul 2016 18:40:14 +0000 (11:40 -0700)]
Bluetooth: split sk_filter in l2cap_sock_recv_cb
During an audit for sk_filter(), we found that rx_busy_skb handling
in l2cap_sock_recv_cb() and l2cap_sock_recvmsg() looks not quite as
intended.
The assumption from commit e328140fdacb ("Bluetooth: Use event-driven
approach for handling ERTM receive buffer") is that errors returned
from sock_queue_rcv_skb() are due to receive buffer shortage. However,
nothing should prevent doing a setsockopt() with SO_ATTACH_FILTER on
the socket, that could drop some of the incoming skbs when handled in
sock_queue_rcv_skb().
In that case sock_queue_rcv_skb() will return with -EPERM, propagated
from sk_filter() and if in L2CAP_MODE_ERTM mode, wrong assumption was
that we failed due to receive buffer being full. From that point onwards,
due to the to-be-dropped skb being held in rx_busy_skb, we cannot make
any forward progress as rx_busy_skb is never cleared from l2cap_sock_recvmsg(),
due to the filter drop verdict over and over coming from sk_filter().
Meanwhile, in l2cap_sock_recv_cb() all new incoming skbs are being
dropped due to rx_busy_skb being occupied.
Instead, just use __sock_queue_rcv_skb() where an error really tells that
there's a receive buffer issue. Split the sk_filter() and enable it for
non-segmented modes at queuing time since at this point in time the skb has
already been through the ERTM state machine and it has been acked, so dropping
is not allowed. Instead, for ERTM and streaming mode, call sk_filter() in
l2cap_data_rcv() so the packet can be dropped before the state machine sees it.
Fixes: e328140fdacb ("Bluetooth: Use event-driven approach for handling ERTM receive buffer") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Acked-by: Willem de Bruijn <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
Frederic Dalleau [Tue, 23 Aug 2016 05:59:19 +0000 (07:59 +0200)]
Bluetooth: Fix memory leak at end of hci requests
In hci_req_sync_complete the event skb is referenced in hdev->req_skb.
It is used (via hci_req_run_skb) from either __hci_cmd_sync_ev which will
pass the skb to the caller, or __hci_req_sync which leaks.
Ming Lei [Tue, 23 Aug 2016 13:49:45 +0000 (21:49 +0800)]
block: make sure a big bio is split into at most 256 bvecs
After arbitrary bio size was introduced, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reason, such
as bio_clone().
The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk
- build bcache device over the above raid1 and another cache device
and bucket size is set as 2Mbytes
- set cache mode as writeback
- run random write over ext4 on the bcache device
Andy Lutomirski [Wed, 24 Aug 2016 10:52:12 +0000 (03:52 -0700)]
nvme: Fix nvme_get/set_features() with a NULL result pointer
nvme_set_features() callers seem to expect that passing NULL as the
result pointer is acceptable. Teach nvme_set_features() not to try to
write to the NULL address.
For symmetry, make the same change to nvme_get_features(), despite the
fact that all current callers pass a valid result pointer.
I assume that this bug hasn't been reported in practice because
the callers that pass NULL are all in the SCSI translation layer
and no one uses the relevant operations.
Thierry Reding [Fri, 12 Aug 2016 14:00:53 +0000 (16:00 +0200)]
drm/tegra: dsi: Enhance runtime power management
The MIPI DSI output on Tegra SoCs requires some external logic to
calibrate the MIPI pads before a video signal can be transmitted. This
MIPI calibration logic requires to be powered on while the MIPI pads are
being used, which is currently done as part of the DSI driver's probe
implementation.
This is suboptimal because it will leave the MIPI calibration logic
powered up even if the DSI output is never used.
On Tegra114 and earlier this behaviour also causes the driver to hang
while trying to power up the MIPI calibration logic because the power
partition that contains the MIPI calibration logic will be powered on
by the display controller at output pipeline configuration time. Thus
the power up sequence for the MIPI calibration logic happens before
it's power partition is guaranteed to be enabled.
Fix this by splitting up the API into a request/free pair of functions
that manage the runtime dependency between the DSI and the calibration
modules (no registers are accessed) and a set of enable, calibrate and
disable functions that program the MIPI calibration logic at points in
time where the power partition is really enabled.
While at it, make sure that the runtime power management also works in
ganged mode, which is currently also broken.
Will Deacon [Wed, 24 Aug 2016 09:07:14 +0000 (10:07 +0100)]
perf/core: Use this_cpu_ptr() when stopping AUX events
When tearing down an AUX buf for an event via perf_mmap_close(),
__perf_event_output_stop() is called on the event's CPU to ensure that
trace generation is halted before the process of unmapping and
freeing the buffer pages begins.
The callback is performed via cpu_function_call(), which ensures that it
runs with interrupts disabled and is therefore not preemptible.
Unfortunately, the current code grabs the per-cpu context pointer using
get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
a BUG when freeing the AUX buffer later on:
Li Zhong [Wed, 24 Aug 2016 07:34:40 +0000 (15:34 +0800)]
crypto: vmx - fix null dereference in p8_aes_xts_crypt
walk.iv is not assigned a value in blkcipher_walk_init. It makes iv uninitialized.
It is possibly a null value(as shown below), which is then used by aes_p8_encrypt.
This patch moves iv = walk.iv after blkcipher_walk_virt, in which walk.iv is set.
Fabian Frederick [Tue, 16 Aug 2016 19:49:45 +0000 (21:49 +0200)]
hwrng: mxc-rnga - Fix Kconfig dependency
We can directly depend on SOC_IMX31 since commit c9ee94965dce
("ARM: imx: deconstruct mxc_rnga initialization")
Since that commit, CONFIG_HW_RANDOM_MXC_RNGA could not be switched on
with unknown symbol ARCH_HAS_RNGA and mxc-rnga.o can't be generated with
ARCH=arm make M=drivers/char/hw_random
Previously, HW_RANDOM_MXC_RNGA required ARCH_HAS_RNGA
which was based on IMX_HAVE_PLATFORM_MXC_RNGA && ARCH_MXC.
IMX_HAVE_PLATFORM_MXC_RNGA was based on SOC_IMX31.
Mark Rutland [Tue, 9 Aug 2016 10:03:56 +0000 (11:03 +0100)]
MAINTAINERS: Add ARM ARCHITECTED TIMER entry
The ARM architected timer driver falls under the drivers/clocksource/
catch-all in MAINTAINERS, and get_maintainers.pl doesn't suggest a
number of people who should be Cc'd.
The ARM architected timer is a core component of ARMv7+VE and ARMv8, and
is critical to the correct operation of both architecture ports (and
their respective KVM code), and patches to it should have review by
knowledgeable interested parties.
This patch adds a MAINTAINERS entry for the driver and its low-level
arch components, such that get_maintainer.pl will always include
relevant interested parties for modifications to the driver. For the
timebeing, this means myself and Marc Zyngier.
So IR table is setup even if "noapic" boot parameter is added. As a result we
crash later when the interrupt affinity is set due to a half initialized
remapping infrastructure.
Prevent remap initialization when IOAPIC is disabled.
John Stultz [Tue, 23 Aug 2016 23:08:22 +0000 (16:08 -0700)]
timekeeping: Cap array access in timekeeping_debug
It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.
However there is actually no other process would like to grab this lock on
that problematic platform.
Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.
Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.
Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.
As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.
This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.
[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
issue slightly differently, but borrowed his excelent explanation of the
issue here.]
John Stultz [Tue, 23 Aug 2016 23:08:21 +0000 (16:08 -0700)]
timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING
When I added some extra sanity checking in timekeeping_get_ns() under
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
method was using timekeeping_get_ns().
Thus the locking added to the debug checks broke the NMI-safety of
__ktime_get_fast_ns().
This patch open-codes the timekeeping_get_ns() logic for
__ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
David Ahern [Wed, 24 Aug 2016 04:05:27 +0000 (21:05 -0700)]
net: diag: Fix refcnt leak in error path destroying socket
inet_diag_find_one_icsk takes a reference to a socket that is not
released if sock_diag_destroy returns an error. Fix by changing
tcp_diag_destroy to manage the refcnt for all cases and remove
the sock_put calls from tcp_abort.
Fixes: c1e64e298b8ca ("net: diag: Support destroying TCP sockets") Reported-by: Lorenzo Colitti <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
software transmit timestamp of a packet.
sock_tx_timestamp resets and overrides the tx_flags of the skb.
The function is intended to be called from within the protocol
layer when creating the skb, not from a device driver. This is
inconsistent with other drivers and will cause issues for TCP.
In TCP, we intend to sample the timestamps for the last byte
for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
tcp_tx_timestamp only with the last skb that it generates.
For example, if a 128KB message is split into two 64KB packets
we want to sample the SND timestamp of the last packet. The current
code in the tun driver, however, will result in sampling the SND
timestamp for both packets.
Also, when the last packet is split into smaller packets for
retranmission (see tcp_fragment), the tun driver will record
timestamps for all of the retransmitted packets and not only the
last packet.
Linus Torvalds [Wed, 24 Aug 2016 00:24:27 +0000 (20:24 -0400)]
Merge tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fsmark regression
- i_size race condition
- wrong conditions in f2fs_move_file_range
* tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: avoid potential deadlock in f2fs_move_file_range
f2fs: allow copying file range only in between regular files
Revert "f2fs: move i_size_write in f2fs_write_end"
Revert "f2fs: use percpu_rw_semaphore"
Lance Richardson [Tue, 23 Aug 2016 15:40:52 +0000 (11:40 -0400)]
sctp: fix overrun in sctp_diag_dump_one()
The function sctp_diag_dump_one() currently performs a memcpy()
of 64 bytes from a 16 byte field into another 16 byte field. Fix
by using correct size, use sizeof to obtain correct size instead
of using a hard-coded constant.
Rabin Vincent [Tue, 23 Aug 2016 14:31:28 +0000 (16:31 +0200)]
dwc_eth_qos: fix interrupt enable race
We currently enable interrupts before we enable NAPI. If an RX interrupt
hits before we enabled NAPI then the NAPI callback is never called and
we leave the hardware with RX interrupts disabled, which of course leads
us to never handling received packets. Fix this by moving the interrupt
enable to after we've enable NAPI and the reclaim tasklet.
Fixes: cd5e41234729 ("dwc_eth_qos: do phy_start before resetting hardware") Signed-off-by: Rabin Vincent <[email protected]> Signed-off-by: Lars Persson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jamie Lentin [Mon, 22 Aug 2016 21:47:08 +0000 (22:47 +0100)]
net: mv88e6xxx: Fix ingress rate removal for mv6131 chips
The PORT_RATE_CONTROL register works differently on 88e6095/6095f/6131
in comparison to 6123/61/65, and 0x0 disables. The distinction was lost
Linux 4.1 --> 4.2
Xander Huff [Mon, 22 Aug 2016 20:57:16 +0000 (15:57 -0500)]
phy: micrel: Reenable interrupts during resume for ksz9031
Like the ksz8081, the ksz9031 has the behavior where it will clear the
interrupt enable bits when leaving power down. This takes advantage of the
solution provided by f5aba91.
Zefir Kurtisi [Mon, 22 Aug 2016 13:58:12 +0000 (15:58 +0200)]
gianfar: fix size of scatter-gathered frames
The current scatter-gather logic in gianfar is flawed, since
it does not consider the eTSEC's RxBD 'Data Length' field is
context depening: for the last fragment it contains the full
frame size, while fragments contain the fragment size, which
equals the value written to register MRBLR.
This causes data corruption as soon as the hardware starts
to fragment receiving frames. As a result, the size of
fragmented frames is increased by
(nr_frags - 1) * MRBLR
We first noticed this issue working with DSA, where an ICMP
request sized 1472 bytes causes the scatter-gather logic to
kick in. The full Ethernet frame (1518) gets increased by
DSA (4), GMAC_FCB_LEN (8), and FSL_GIANFAR_DEV_HAS_TIMER
(priv->padding=8) to a total of 1538 octets, which is
fragmented by the hardware and reconstructed by the driver
to a 3074 octet frame.
This patch fixes the problem by adjusting the size of
the last fragment.
It was tested by setting MRBLR to different multiples of
64, proving correct scatter-gather operation on frames
with up to 9000 octets in size.
Zefir Kurtisi [Mon, 22 Aug 2016 13:56:38 +0000 (15:56 +0200)]
gianfar: prevent fragmentation in DSA environments
The eTSEC register MRBLR defines the maximum space in
the RX buffers and is set to 1536 by gianfar. This
reasonably covers the common use case where the MTU
is kept at default 1500. In that case, the largest
Ethernet frame size of 1518 plus an optional
GMAC_FCB_LEN of 8, and an additional padding of 8
to handle FSL_GIANFAR_DEV_HAS_TIMER totals to 1534
and nicely fit within the chosen MRBLR.
Alas, if the eTSEC is attached to a DSA enabled switch,
the (E)DSA header extension (4 or 8 bytes) causes every
maximum sized frame to be fragmented by the hardware.
This patch increases the maximum RX buffer size by 8
and rounds up to the next multiple of 64, which the
hardware's defines as RX buffer granularity.
Keith Busch [Tue, 23 Aug 2016 21:36:42 +0000 (16:36 -0500)]
x86/PCI: VMD: Fix infinite loop executing irq's
We can't initialize the list head on deletion as this causes the node to
point to itself, which causes an infinite loop if vmd_irq() happens to be
servicing that node.
The list initialization was trying to fix a bug from multiple calls to
disable the same IRQ. Fix this instead by having the VMD driver track if
the interrupt is enabled.
[bhelgaas: changelog, add "Fixes"] Fixes: 97e923063575 ("x86/PCI: VMD: Initialize list item in IRQ disable") Reported-by: Grzegorz Koczot <[email protected]> Tested-by: Miroslaw Drost <[email protected]> Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
Acked-by Jon Derrick: <[email protected]>
Andrey Ryabinin [Wed, 17 Aug 2016 15:10:11 +0000 (18:10 +0300)]
um: Don't discard .text.exit section
Commit e41f501d3912 ("vmlinux.lds: account for destructor sections")
added '.text.exit' to EXIT_TEXT which is discarded at link time by default.
This breaks compilation of UML:
`.text.exit' referenced in section `.fini_array' of
/usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libc.a(sdlerror.o):
defined in discarded section `.text.exit' of
/usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libc.a(sdlerror.o)
Apparently UML doesn't want to discard exit text, so let's place all EXIT_TEXT
sections in .exit.text.
Vincent Stehlé [Fri, 12 Aug 2016 13:26:30 +0000 (15:26 +0200)]
ubifs: Fix assertion in layout_in_gaps()
An assertion in layout_in_gaps() verifies that the gap_lebs pointer is
below the maximum bound. When computing this maximum bound the idx_lebs
count is multiplied by sizeof(int), while C pointers arithmetic does take
into account the size of the pointed elements implicitly already. Remove
the multiplication to fix the assertion.
Linus Torvalds [Tue, 23 Aug 2016 18:32:38 +0000 (14:32 -0400)]
Merge tag 'usercopy-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy fixes from Kees Cook:
- avoid signed math problems on unexpected compilers
- avoid false positives at very end of kernel text range checks
* tag 'usercopy-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: fix overlap check for kernel text
usercopy: avoid potentially undefined behavior in pointer math
Bharat Potnuri [Tue, 23 Aug 2016 14:57:33 +0000 (20:27 +0530)]
iw_cxgb4: Fix cxgb4 arm CQ logic w/IB_CQ_REPORT_MISSED_EVENTS
Current cxgb4 arm CQ logic ignores IB_CQ_REPORT_MISSED_EVENTS for
request completion notification on a CQ. Due to this ib_poll_handler()
assumes all events polled and avoids further iopoll scheduling.
This patch adds logic to cxgb4 ib_req_notify_cq() handler to check if
CQ is not empty and return accordingly. Based on the return value of
ib_req_notify_cq() handler, ib_poll_handler() will schedule a run of
iopoll handler.
Shiraz Saleem [Mon, 22 Aug 2016 23:09:14 +0000 (18:09 -0500)]
i40iw: Change mem_resources pointer to a u8
iwdev->mem_resources is incorrectly defined as an unsigned
long instead of u8. As a result, the offset into the dynamic
allocated structures in i40iw_initialize_hw_resources() is
incorrectly calculated and would lead to writing of memory
regions outside of the allocated buffer.
pnfs/blocklayout: update last_write_offset atomically with extents
Block/SCSI layout write completion may add committable extents to the
extent tree before updating the layout's last-written byte under the inode
lock. If a sync happens before this value is updated, then
prepare_layoutcommit may find and encode these extents which would produce
a LAYOUTCOMMIT request whose encoded extents are larger than the request's
loca_length.
Fix this by using a last-written byte value that is updated atomically with
the extent tree so that commitable extents always match.