Git Repo - linux.git/log

IB/hfi1: On error, fix use after free during user context setup

During base context setup, if setup_base_ctxt() fails, the context is
deallocated. This is incorrect because the context is referenced on
return, to notify any waiting subcontext. If there are no subcontexts
the pointer will be invalid.

Reorganize the error path so that deallocate_ctxt() is called after all
the possible subcontexts have been notified.

Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Michael J. Ruhl <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

Revert "IB/ipoib: Update broadcast object if PKey value was changed in index 0"

commit 9a9b8112699d will cause core to fail UD QP from being destroyed
on ipoib unload, therefore cause resources leakage.
On pkey change event above patch modifies mgid before calling underlying
driver to detach it from QP. Drivers' detach_mcast() will fail to find
modified mgid it was never given to attach in a first place.
Core qp->usecnt will never go down, so ib_destroy_qp() will fail.

IPoIB driver actually does take care of new broadcast mgid based on new
pkey by destroying an old mcast object in ipoib_mcast_dev_flush())
....
if (priv->broadcast) {
rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree);
list_add_tail(&priv->broadcast->list, &remove_list);
priv->broadcast = NULL;
}
...

then in restarted ipoib_macst_join_task() creating a new broadcast mcast
object, sending join request and on completion tells the driver to attach
to reinitialized QP:
...
if (!priv->broadcast) {
...
broadcast = ipoib_mcast_alloc(dev, 0);
...
memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
sizeof (union ib_gid));
priv->broadcast = broadcast;
...

Fixes: 9a9b8112699d ("IB/ipoib: Update broadcast object if PKey value was changed in index 0")
Cc: [email protected]
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Alex Estrin <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Reviewed-by: Feras Daoud <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Return correct value in general interrupt handler

The general interrupt handler returns IRQ_HANDLED whether an IRQ
was handled or not.
Determine if an IRQ was handled and return the correct value.

Reviewed-by: Dennis Dalessandro <[email protected]>
Reviewed-by: Michael J. Ruhl <[email protected]>
Signed-off-by: Kamenee Arumugam <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Check eeprom config partition validity

Relying on a trailing magic value is incorrect. There are instances where
this is not present as trailing magic value has a specific purpose which is
not partition validation. Instead use the header magic value which is
present in all variants of the platform configuration and is intended for
validation. This is also used in other locations in the driver.

Fixes: bc5214ee2922 (IB/hfi1: Handle missing magic values in config file)
Reviewed-by: Jakub Byczkowski <[email protected]>
Signed-off-by: Jan Sokolowski <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Only reset QSFP after link up and turn off AOC TX

QSFP reset enables AOC transmitters by default. They should be off
before moving to high power mode to complete the setup. There is no
need to reset the QSFP during LNI failure as it was reset at link down.

Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Jakub Byczkowski <[email protected]>
Signed-off-by: Sebastian Sanchez <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Turn off AOC TX after offline substates

Offline.quietDuration was added in the 8051 firmware, and the driver
only turns off the AOC transmitters when offline.quiet is reached.
However, the AOC transmitters need to be turned off at the new state.
Therefore, turn off the AOC transmitters at any offline substates
including offline.quiet and offline.quietDuration, then recheck we
reached offline.quiet to support backwards compatibility.

Reviewed-by: Jakub Byczkowski <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Sebastian Sanchez <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

iommu: Fix comment for iommu_ops.map_sg

The definition of map_sg was split during a recent addition to iommu_ops.
Put it back together.

Fixes: add02cfdc9bc ("iommu: Introduce Interface for IOMMU TLB Flushing")
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>

iommu/amd: pr_err() strings should end with newlines

pr_err() messages should end with a new-line to avoid other messages
being concatenated. So replace '/n' with '\n'.

Signed-off-by: Arvind Yadav <[email protected]>
Fixes: 45a01c42933b ('iommu/amd: Add function copy_dev_tables()')
Signed-off-by: Joerg Roedel <[email protected]>

iommu/mediatek: Limit the physical address in 32bit for v7s

The ARM short descriptor has already limited the physical address
to 32bit after the commit <76557391433c> ("iommu/io-pgtable: Sanitise
map/unmap addresses"). But in MediaTek 4GB mode, the physical address
is from 0x1_0000_0000 to 0x1_ffff_ffff. this will cause:

WARNING: CPU: 4 PID: 3900 at
xxx/drivers/iommu/io-pgtable-arm-v7s.c:482 arm_v7s_map+0x40/0xf8
Modules linked in:

CPU: 4 PID: 3900 Comm: weston Tainted: G S W 4.9.44 #1
Hardware name: MediaTek MT2712m1v1 board (DT)
task: ffffffc0eaa5b280 task.stack: ffffffc0e9858000
PC is at arm_v7s_map+0x40/0xf8
LR is at mtk_iommu_map+0x64/0x90
pc : [<ffffff80085b09e8>] lr : [<ffffff80085b29fc>] pstate: 000001c5
sp : ffffffc0e985b920
x29: ffffffc0e985b920 x28: 0000000127d00000
x27: 0000000000100000 x26: ffffff8008f9e000
x25: 0000000000000003 x24: 0000000000100000
x23: 0000000127d00000 x22: 00000000ff800000
x21: ffffffc0f7ec8ce0 x20: 0000000000000003
x19: 0000000000000003 x18: 0000000000000002
x17: 0000007f7e5d72c0 x16: ffffff80082b0f08
x15: 0000000000000001 x14: 000000000000003f
x13: 0000000000000000 x12: 0000000000000028
x11: 0088000000000000 x10: 0000000000000000
x9 : ffffff80092fa000 x8 : ffffffc0e9858000
x7 : ffffff80085b29d8 x6 : 0000000000000000
x5 : ffffff80085b09a8 x4 : 0000000000000003
x3 : 0000000000100000 x2 : 0000000127d00000
x1 : 00000000ff800000 x0 : 0000000000000001
...
Call trace:
[<ffffff80085b09e8>] arm_v7s_map+0x40/0xf8
[<ffffff80085b29fc>] mtk_iommu_map+0x64/0x90
[<ffffff80085ab5f8>] iommu_map+0x100/0x3a0
[<ffffff80085ab99c>] default_iommu_map_sg+0x104/0x168
[<ffffff80085aead8>] iommu_dma_alloc+0x238/0x3f8
[<ffffff8008098b30>] __iommu_alloc_attrs+0xa8/0x260
[<ffffff80085f364c>] mtk_drm_gem_create+0xac/0x180
[<ffffff80085f3894>] mtk_drm_gem_dumb_create+0x54/0xc8
[<ffffff80085d576c>] drm_mode_create_dumb_ioctl+0xa4/0xd8
[<ffffff80085cb2a0>] drm_ioctl+0x1c0/0x490

In order to satify this, Limit the physical address to 32bit.

Signed-off-by: Yong Wu <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>

iommu/io-pgtable-arm-v7s: Need dma-sync while there is no QUIRK_NO_DMA

Fix the commit 81b3c2521844 ("iommu/io-pgtable: Introduce explicit
coherency"). If there is no IO_PGTABLE_QUIRK_NO_DMA, we should call
dma_sync_single_for_device for cache synchronization.

Signed-off-by: Yong Wu <[email protected]>
Fixes: 81b3c2521844 ('iommu/io-pgtable: Introduce explicit coherency')
Reviewed-by: Robin Murphy <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>

mtd: Fix partition alignment check on multi-erasesize devices

Commit 1eeef2d7483a ("mtd: handle partitioning on devices with 0
erasesize") introduced a regression on heterogeneous erase region
devices. Alignment of the partition was tested against the master
eraseblock size which can be bigger than the slave one, thus leading
to some partitions being marked as read-only.

Update wr_alignment to match this slave erasesize after this erasesize
has been determined by picking the biggest erasesize of all the regions
embedded in the MTD partition.

Reported-by: Mathias Thore <[email protected]>
Fixes: 1eeef2d7483a ("mtd: handle partitioning on devices with 0 erasesize")
Cc: <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Tested-by: Mathias Thore <[email protected]>
Reviewed-by: Mathias Thore <[email protected]>

KVM: VMX: simplify and fix vmx_vcpu_pi_load

The simplify part: do not touch pi_desc.nv, we can set it when the
VCPU is first created. Likewise, pi_desc.sn is only handled by
vmx_vcpu_pi_load, do not touch it in __pi_post_block.

The fix part: do not check kvm_arch_has_assigned_device, instead
check the SN bit to figure out whether vmx_vcpu_pi_put ran before.
This matches what the previous patch did in pi_post_block.

Cc: Huangweidong <[email protected]>
Cc: Gonglei <[email protected]>
Cc: wangxin <[email protected]>
Cc: Radim Krčmář <[email protected]>
Tested-by: Longpeng (Mike) <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: avoid double list add with VT-d posted interrupts

In some cases, for example involving hot-unplug of assigned
devices, pi_post_block can forget to remove the vCPU from the
blocked_vcpu_list.  When this happens, the next call to
pi_pre_block corrupts the list.

Fix this in two ways.  First, check vcpu->pre_pcpu in pi_pre_block
and WARN instead of adding the element twice in the list.  Second,
always do the list removal in pi_post_block if vcpu->pre_pcpu is
set (not -1).

The new code keeps interrupts disabled for the whole duration of
pi_pre_block/pi_post_block.  This is not strictly necessary, but
easier to follow.  For the same reason, PI.ON is checked only
after the cmpxchg, and to handle it we just call the post-block
code.  This removes duplication of the list removal code.

Cc: Huangweidong <[email protected]>
Cc: Gonglei <[email protected]>
Cc: wangxin <[email protected]>
Cc: Radim Krčmář <[email protected]>
Tested-by: Longpeng (Mike) <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: extract __pi_post_block

Simple code movement patch, preparing for the next one.

Cc: Huangweidong <[email protected]>
Cc: Gonglei <[email protected]>
Cc: wangxin <[email protected]>
Cc: Radim Krčmář <[email protected]>
Tested-by: Longpeng (Mike) <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

arm64: Make sure SPsel is always set

When the kernel is entered at EL2 on an ARMv8.0 system, we construct
the EL1 pstate and make sure this uses the the EL1 stack pointer
(we perform an exception return to EL1h).

But if the kernel is either entered at EL1 or stays at EL2 (because
we're on a VHE-capable system), we fail to set SPsel, and use whatever
stack selection the higher exception level has choosen for us.

Let's not take any chance, and make sure that SPsel is set to one
before we decide the mode we're going to run in.

Cc: <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>

quota: Fix quota corruption with generic/232 test

Eric has reported that since commit d2faa415166b "quota: Do not acquire
dqio_sem for dquot overwrites in v2 format" test generic/232
occasionally fails due to quota information being incorrect. Indeed that
commit was too eager to remove dqio_sem completely from the path that
just overwrites quota structure with updated information. Although that
is innocent on its own, another process that inserts new quota structure
to the same block can perform read-modify-write cycle of that block thus
effectively discarding quota information update if they race in a wrong
way.

Fix the problem by acquiring dqio_sem for reading for overwrites of
quota structure. Note that it *is* possible to completely avoid taking
dqio_sem in the overwrite path however that will require modifying path
inserting / deleting quota structures to avoid RMW cycles of the full
block and for now it is not clear whether it is worth the hassle.

Fixes: d2faa415166b2883428efa92f451774ef44373ac
Reported-and-tested-by: Eric Whitney <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

platform/x86: fujitsu-laptop: Don't oops when FUJ02E3 is not presnt

My Fujitsu-Siemens Lifebook S6120 doesn't have the FUJ02E3 device,
but it does have FUJ02B1. That means we do register the backlight
device (and it even seems to work), but the code will oops as soon
as we try to set the backlight brightness because it's trying to
call call_fext_func() with a NULL device. Let's just skip those
function calls when the FUJ02E3 device is not present.

Cc: Jonathan Woithe <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Signed-off-by: Ville Syrjälä <[email protected]>
Cc: <[email protected]> # 4.13.x
Signed-off-by: Darren Hart (VMware) <[email protected]>

Merge tag 'mmc-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:

  - sdhci-pci: Fix voltage switch for some Intel host controllers

  - tmio: remove broken and noisy debug macro

* tag 'mmc-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-pci: Fix voltage switch for some Intel host controllers
  mmc: tmio: remove broken and noisy debug macro

vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets

In generic_file_llseek_size, return -ENXIO for negative offsets as well
as offsets beyond EOF. This affects filesystems which don't implement
SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
holes.

Fixes xfstest generic/448.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

xfs: revert "xfs: factor rmap btree size into the indlen calculations"

In commit fd26a88093ba we added a worst case estimate for rmapbt blocks
needed to satisfy the block mapping request. Since then, we added the
ability to reserve enough space in each AG such that we should never run
out of blocks to grow the rmapbt, which makes this calculation
unnecessary. Revert the commit because it makes the extra delalloc
indlen accounting unnecessary and incorrect.

Reported-by: Eryu Guan <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: Capture state of the right inode in xfs_iflush_done

My previous patch: d3a304b6292168b83b45d624784f973fdc1ca674 check for
XFS_LI_FAILED flag xfs_iflush done, so the failed item can be properly
resubmitted.

In the loop scanning other inodes being completed, it should check the
current item for the XFS_LI_FAILED, and not the initial one.

The state of the initial inode is checked after the loop ends

Kudos to Eric for catching this.

Signed-off-by: Carlos Maiolino <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: perag initialization should only touch m_ag_max_usable for AG 0

We call __xfs_ag_resv_init to make a per-AG reservation for each AG.
This makes the reservation per-AG, not per-filesystem. Therefore, it
is incorrect to adjust m_ag_max_usable for each AG. Adjust it only
when we're reserving AG 0's blocks so that we only do it once per fs.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>

xfs: update i_size after unwritten conversion in dio completion

Since commit d531d91d6990 ("xfs: always use unwritten extents for
direct I/O writes"), we start allocating unwritten extents for all
direct writes to allow appending aio in XFS.

But for dio writes that could extend file size we update the in-core
inode size first, then convert the unwritten extents to real
allocations at dio completion time in xfs_dio_write_end_io(). Thus a
racing direct read could see the new i_size and find the unwritten
extents first and read zeros instead of actual data, if the direct
writer also takes a shared iolock.

Fix it by updating the in-core inode size after the unwritten extent
conversion. To do this, introduce a new boolean argument to
xfs_iomap_write_unwritten() to tell if we want to update in-core
i_size or not.

Suggested-by: Brian Foster <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

iomap_dio_rw: Allocate AIO completion queue before submitting dio

Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
machine can cause the following NULL pointer dereference,

.queue_work_on+0x4c/0x80
.iomap_dio_bio_end_io+0xbc/0x1f0
.bio_endio+0x118/0x1f0
.blk_update_request+0xd0/0x470
.blk_mq_end_request+0x24/0xc0
.lo_complete_rq+0x40/0xe0
.__blk_mq_complete_request_remote+0x28/0x40
.flush_smp_call_function_queue+0xc4/0x1e0
.smp_ipi_demux_relaxed+0x8c/0x100
.icp_hv_ipi_action+0x54/0xa0
.__handle_irq_event_percpu+0x84/0x2c0
.handle_irq_event_percpu+0x28/0x80
.handle_percpu_irq+0x78/0xc0
.generic_handle_irq+0x40/0x70
.__do_irq+0x88/0x200
.call_do_irq+0x14/0x24
.do_IRQ+0x84/0x130

This occurs due to the following sequence of events,

1. Allocate dio for Direct I/O write.
2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
   - Assume that we have submitted atleast one bio. Hence iomap_dio->ref value
     will be >= 2.
   - If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
     break out of the loop and since the 'ret' value is a negative number we
     end up not allocating memory for super_block->s_dio_done_wq.
3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
   submitted and here the code ends up dereferencing the NULL pointer stored
   at super_block->s_dio_done_wq.

This commit fixes the bug by allocating memory for
super_block->s_dio_done_wq before iomap_apply() is invoked.

Reported-by: Eryu Guan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Tested-by: Eryu Guan <[email protected]>
Signed-off-by: Chandan Rajendra <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: validate bdev support for DAX inode flag

Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.

This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().

This does not fix the race issues that caused the XFS DAX inode option to
be disabled, so that option will still be disabled. If/when we re-enable
it, though, I think we will want this issue to have been fixed. I also do
think that we want to fix this in stable kernels.

Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

btrfs: log csums for all modified extents

Amir reported a bug discovered by his cleaned up version of my
dm-log-writes xfstests where we were missing csums at certain replay
points.  This is because fsx was doing an msync(), which essentially
fsync()'s a specific range of a file.  We will log all modified extents,
but only search for the checksums in the range we are being asked to
sync.  We cannot simply log the extents in the range we're being asked
because we are logging the inode item as it is currently, which if it
has had a i_size update before the msync means we will miss extents when
replaying.  We could possibly get around this by marking the inode with
the transaction that extended the i_size to see if we have this case,
but this would be racy and we'd have to lock the whole range of the
inode to make sure we didn't have an ordered extent outside of our range
that was in the middle of completing.

Fix this simply by keeping track of the modified extents range and
logging the csums for the entire range of extents that we are logging.
This makes the xfstest pass.

Reported-by: Amir Goldstein <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: fix unexpected result when dio reading corrupted blocks

commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.

For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to relfect that.

When some blocks in a file have been corrupted on disk, reading such a
file ends up with

1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.

In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.

Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <[email protected]>
Reported-by: Goffredo Baroncelli <[email protected]>
Tested-by: Goffredo Baroncelli <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: Report error on removing qgroup if del_qgroup_item fails

Previously, we were calling del_qgroup_item, and ignoring the return code
resulting in a potential to have divergent in-memory state without an
error. Perhaps, it makes sense to handle this error code, and put the
filesystem into a read only, or similar state.

This patch only adds reporting of the error if the error is fatal,
(any error other than qgroup not found).

Signed-off-by: Sargun Dhillon <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: skip checksum when reading compressed data if some IO have failed

Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.

In fact, if some IO have failed during reading a compressed data
extent , there's no way the checksum could match, therefore, we can
skip that in order to return error quickly to the upper layer.

Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.

Signed-off-by: Liu Bo <[email protected]>
Tested-by: Paul Jones <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: fix kernel oops while reading compressed data

The kernel oops happens at

kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]

It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.

With this, btrfs's read-repair can work properly on reading compressed
data.

Signed-off-by: Liu Bo <[email protected]>
Reported-by: Paul Jones <[email protected]>
Tested-by: Paul Jones <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block

This seems to be a leftover of commit cf8cddd38bab ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").

It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.

Fixes: cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: Satoru Takeuchi <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: do not backup tree roots when fsync

It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: remove BTRFS_FS_QUOTA_DISABLING flag

Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.

When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.

BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.

So, let's remove BTRFS_FS_QUOTA_DISABLING flag.

Signed-off-by: Tomohiro Misono <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: propagate error to btrfs_cmp_data_prepare caller

btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
btrfs_extent_same() try to access the already freed pages causing faults
(or violates PageLocked assertion).

This patch just return the error as is so that the caller stop the process.

Signed-off-by: Naohiro Aota <[email protected]>
Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage")
Cc: <[email protected]> # 4.2
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: prevent to set invalid default subvolid

`btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
filesystem without specifying `subvol` or `subvolid` mount options.

Fixes: 6ef5ed0d386b ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <[email protected]>
Signed-off-by: Satoru Takeuchi <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: send: fix error number for unknown inode types

ENOTSUPP should not be returned to the user program.
(cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.

Signed-off-by: Tsutomu Itoh <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix NULL pointer dereference from free_reloc_roots()

__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.

Fixes: 6bdf131fac23 ("Btrfs: don't leak reloc root nodes on error")
Cc: <[email protected]> # 4.9
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: finish ordered extent cleaning if no progress is found

__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.

Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <[email protected]> # 4.12
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: clear ordered flag on cleaning up ordered extents

Commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag same as other places calling
btrfs_dec_test_ordered_pending() for every page in the specified range.

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <[email protected]> # 4.12
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO

fs_info->super_copy->{node,sector}size are little-endian, but the ioctl
should return the values in native endianness. Use the cached values in
btrfs_fs_info instead. Found with sparse.

Fixes: 80a773fbfc2d ("btrfs: retrieve more info from FS_INFO ioctl")
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: do not reset bio->bi_ops while writing bio

flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().

This remove this unnecessary bio->bi_opf setting.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: use the new helper wbc_to_write_flags

This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.

Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

drm/tegra: trace: Fix path to include

The TRACE_INCLUDE_FILE macro needs to specify the path relative to the
define_trace.h header rather than relative to the file defining it.

Reported-by: Dmitry Osipenko <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge branch 'WIP.x86/fpu' into x86/fpu, because it's ready

Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use using_compacted_format() instead of open coded X86_FEATURE_XSAVES

This is the canonical method to use.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_user_to_xstate()

Tighten the checks in copy_user_to_xstate().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Eliminate the 'xfeatures' local variable in copy_user_to_xstate()

We now have this field in hdr.xfeatures.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Copy the full header in copy_user_to_xstate()

This is in preparation to verify the full xstate header as supplied by user-space.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_kernel_to_xstate()

Tighten the checks in copy_kernel_to_xstate().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Eliminate the 'xfeatures' local variable in copy_kernel_to_xstate()

We have this information in the xstate_header.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Copy the full state_header in copy_kernel_to_xstate()

This is in preparation to verify the full xstate header as supplied by user-space.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in __fpu__restore_sig()

Tighten the checks in __fpu__restore_sig() and update comments.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in xstateregs_set()

Tighten the checks in xstateregs_set().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Introduce validate_xstate_header()

Move validation of user-supplied xstate_header into a helper function,
in preparation of calling it from both the ptrace and sigreturn syscall
paths.

The new function also considers it to be an error if *any* reserved bits
are set, whereas before we were just clearing most of them silently.

This should reduce the chance of bugs that fail to correctly validate
user-supplied XSAVE areas.  It also will expose any broken userspace
programs that set the other reserved bits; this is desirable because
such programs will lose compatibility with future CPUs and kernels if
those bits are ever used for anything.  (There shouldn't be any such
programs, and in fact in the case where the compacted format is in use
we were already validating xfeatures.  But you never know...)

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu__activate_fpstate_read/write() to fpu__prepare_[read|write]()

As per the new nomenclature we don't 'activate' the FPU state
anymore, we initialize it. So drop the _activate_fpstate name
from these functions, which were a bit of a mouthful anyway,
and name them:

fpu__prepare_read()
fpu__prepare_write()

Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu__activate_curr() to fpu__initialize()

Rename this function to better express that it's all about
initializing the FPU state of a task which goes hand in hand
with the fpu::initialized field.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Simplify and speed up fpu__copy()

fpu__copy() has a preempt_disable()/enable() pair, which it had to do to
be able to atomically unlazy the current task when doing an FNSAVE.

But we don't unlazy tasks anymore, we always do direct saves/restores of
FPU context.

So remove both the unnecessary critical section, and update the comments.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Fix stale comments about lazy FPU logic

We don't do any lazy restore anymore, what we have are two pieces of optimization:

- no-FPU tasks that don't save/restore the FPU context (kernel threads are such)

- cached FPU registers maintained via the fpu->last_cpu field. This means that
   if an FPU task context switches to a non-FPU task then we can maintain the
   FPU registers as an in-FPU copies (cache), and skip the restoration of them
   once we switch back to the original FPU-using task.

Update all the comments that still referred to old 'lazy' and 'unlazy' concepts.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu::fpstate_active to fpu::initialized

The x86 FPU code used to have a complex state machine where both the FPU
registers and the FPU state context could be 'active' (or inactive)
independently of each other - which enabled features like lazy FPU restore.

Much of this complexity is gone in the current code: now we basically can
have FPU-less tasks (kernel threads) that don't use (and save/restore) FPU
state at all, plus full FPU users that save/restore directly with no laziness
whatsoever.

But the fpu::fpstate_active still carries bits of the old complexity - meanwhile
this flag has become a simple flag that shows whether the FPU context saving
area in the thread struct is initialized and used, or not.

Rename it to fpu::initialized to express this simplicity in the name as well.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Remove fpu__current_fpstate_write_begin/end()

These functions are not used anymore, so remove them.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Fix fpu__activate_fpstate_read() and update comments

fpu__activate_fpstate_read() can be called for the current task
when coredumping - or for stopped tasks when ptrace-ing.

Implement this properly in the code and update the comments.

This also fixes an incorrect (but harmless) warning introduced by
one of the earlier patches.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

scsi: scsi_transport_fc: Also check for NOTPRESENT in fc_remote_port_add()

During failover there is a small race window between fc_remote_port_add()
and fc_timeout_deleted_rport(); the latter drops the lock after setting the
port to NOTPRESENT, so if fc_remote_port_add() is called right at that time
it will fail to detect the existing rport and happily adding a new
structure, causing rports to get registered twice.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull compat fix from Al Viro:
"I really wish gcc warned about conversions from pointer to function
into void *..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a typo in put_compat_shm_info()

xfs: remove redundant re-initialization of total_nr_pages

Variable total_nr_pages is being initialized and then updated with
the same value, this latter assignment is redundant and can be
removed. Cleans up clang build warning:

Value stored to 'total_nr_pages' during its initialization is never read

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: Output warning message when discard option was enabled even though the device does not support discard

In order to using discard function, it is necessary that not only xfs
is mounted with discard option, but also the discard function is
supported by the device. Current code doesn't output any message when
users mount with discard option on unsupported device, so it is
difficult to notice that it was not enabled actually.

This patch adds the warning message to notice that discard option is
not enabled due to unsupported device when the filesystem is mounted.

Changes in v2 (Suggested by Brian Foster):
- Move the unsupported device check into xfs_fs_fill_super().
- Clear the discard flag when device is unsupported.

Signed-off-by: Kenjiro Nakayama <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: report zeroed or not correctly in xfs_zero_range()

The 'did_zero' param of xfs_zero_range() was not passed to
iomap_zero_range() correctly. This was introduced by commit
7bb41db3ea16 ("xfs: handle 64-bit length in xfs_iozero"), and found
by code inspection.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: kill meaningless variable 'zero'

In xfs_file_aio_write_checks(), variable 'zero' is there only to
satisfy xfs_zero_eof(), the result of it is ignored. Now, with
iomap_zero_range() based xfs_zero_eof(), we can safely pass NULL as
the last param of it and kill 'zero'.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

fs/xfs: Use %pS printk format for direct addresses

Use the %pS instead of the %pF printk format specifier for printing symbols
from direct addresses. This is needed for the ia64, ppc64 and parisc64
architectures.

Signed-off-by: Helge Deller <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: evict CoW fork extents when performing finsert/fcollapse

When we perform an finsert/fcollapse operation, cancel all the CoW
extents for the affected file offset range so that they don't end up
pointing to the wrong blocks.

Reported-by: Amir Goldstein <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: don't unconditionally clear the reflink flag on zero-block files

If we have speculative cow preallocations hanging around in the cow
fork, don't let a truncate operation clear the reflink flag because if
we do then there's a chance we'll forget to free those extents when we
destroy the incore inode.

Reported-by: Amir Goldstein <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

fix a typo in put_compat_shm_info()

"uip" misspelled as "up"; unfortunately, the latter happens to be
a function and gcc is happy to convert it to void *...

Signed-off-by: Al Viro <[email protected]>

PCI: Fix race condition with driver_override

The driver_override implementation is susceptible to a race condition when
different threads are reading vs. storing a different driver override. Add
locking to avoid the race condition.

This is in close analogy to commit 6265539776a0 ("driver core: platform:
fix race condition with driver_override") from Adrian Salido.

Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override")
Signed-off-by: Nicolai Stange <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Cc: [email protected] # v3.16+

cpufreq: dt: Fix sysfs duplicate filename creation for platform-device

ti-cpufreq and cpufreq-dt-platdev drivers are registering platform-device
with same name "cpufreq-dt" using platform_device_register_*() routines.
This is leading to build warnings appended below.

Providing hardware information to OPP framework along with the platform-
device creation should be done by ti-cpufreq driver before cpufreq-dt
driver comes into place.

This patch add's TI am33xx, am43 and dra7 platforms (which use opp-v2
property) to the blacklist of devices in cpufreq-dt-platform driver to
avoid creating platform-device twice and remove build warnings.

[    2.370167] ------------[ cut here ]------------
[    2.375087] WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x58/0x78
[    2.383112] sysfs: cannot create duplicate filename '/devices/platform/cpufreq-dt'
[    2.391219] Modules linked in:
[    2.394506] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-next-20170912 #1
[    2.402006] Hardware name: Generic AM33XX (Flattened Device Tree)
[    2.408437] [<c0110a28>] (unwind_backtrace) from [<c010ca84>] (show_stack+0x10/0x14)
[    2.416568] [<c010ca84>] (show_stack) from [<c0827d64>] (dump_stack+0xac/0xe0)
[    2.424165] [<c0827d64>] (dump_stack) from [<c0137470>] (__warn+0xd8/0x104)
[    2.431488] [<c0137470>] (__warn) from [<c01374d0>] (warn_slowpath_fmt+0x34/0x44)
[    2.439351] [<c01374d0>] (warn_slowpath_fmt) from [<c03459d0>] (sysfs_warn_dup+0x58/0x78)
[    2.447938] [<c03459d0>] (sysfs_warn_dup) from [<c0345ab8>] (sysfs_create_dir_ns+0x80/0x98)
[    2.456719] [<c0345ab8>] (sysfs_create_dir_ns) from [<c082c554>] (kobject_add_internal+0x9c/0x2d4)
[    2.466124] [<c082c554>] (kobject_add_internal) from [<c082c7d8>] (kobject_add+0x4c/0x9c)
[    2.474712] [<c082c7d8>] (kobject_add) from [<c05803e4>] (device_add+0xcc/0x57c)
[    2.482489] [<c05803e4>] (device_add) from [<c0584b74>] (platform_device_add+0x100/0x220)
[    2.491085] [<c0584b74>] (platform_device_add) from [<c05855a8>] (platform_device_register_full+0xf4/0x118)
[    2.501305] [<c05855a8>] (platform_device_register_full) from [<c067023c>] (ti_cpufreq_init+0x150/0x22c)
[    2.511253] [<c067023c>] (ti_cpufreq_init) from [<c0101df4>] (do_one_initcall+0x3c/0x170)
[    2.519838] [<c0101df4>] (do_one_initcall) from [<c0c00eb4>] (kernel_init_freeable+0x1fc/0x2c4)
[    2.528974] [<c0c00eb4>] (kernel_init_freeable) from [<c083bcac>] (kernel_init+0x8/0x110)
[    2.537565] [<c083bcac>] (kernel_init) from [<c0107d18>] (ret_from_fork+0x14/0x3c)
[    2.545981] ---[ end trace 2fc00e213c13ab20 ]---
[    2.551051] ------------[ cut here ]------------
[    2.555931] WARNING: CPU: 0 PID: 1 at lib/kobject.c:240 kobject_add_internal+0x254/0x2d4
[    2.564578] kobject_add_internal failed for cpufreq-dt with -EEXIST, don't try to register
things with the same name in the same directory.
[    2.577977] Modules linked in:
[    2.581261] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W       4.13.0-next-20170912 #1
[    2.590013] Hardware name: Generic AM33XX (Flattened Device Tree)
[    2.596437] [<c0110a28>] (unwind_backtrace) from [<c010ca84>] (show_stack+0x10/0x14)
[    2.604573] [<c010ca84>] (show_stack) from [<c0827d64>] (dump_stack+0xac/0xe0)
[    2.612172] [<c0827d64>] (dump_stack) from [<c0137470>] (__warn+0xd8/0x104)
[    2.619494] [<c0137470>] (__warn) from [<c01374d0>] (warn_slowpath_fmt+0x34/0x44)
[    2.627362] [<c01374d0>] (warn_slowpath_fmt) from [<c082c70c>] (kobject_add_internal+0x254/0x2d4)
[    2.636666] [<c082c70c>] (kobject_add_internal) from [<c082c7d8>] (kobject_add+0x4c/0x9c)
[    2.645255] [<c082c7d8>] (kobject_add) from [<c05803e4>] (device_add+0xcc/0x57c)
[    2.653027] [<c05803e4>] (device_add) from [<c0584b74>] (platform_device_add+0x100/0x220)
[    2.661615] [<c0584b74>] (platform_device_add) from [<c05855a8>] (platform_device_register_full+0xf4/0x118)
[    2.671833] [<c05855a8>] (platform_device_register_full) from [<c067023c>] (ti_cpufreq_init+0x150/0x22c)
[    2.681779] [<c067023c>] (ti_cpufreq_init) from [<c0101df4>] (do_one_initcall+0x3c/0x170)
[    2.690377] [<c0101df4>] (do_one_initcall) from [<c0c00eb4>] (kernel_init_freeable+0x1fc/0x2c4)
[    2.699510] [<c0c00eb4>] (kernel_init_freeable) from [<c083bcac>] (kernel_init+0x8/0x110)
[    2.708106] [<c083bcac>] (kernel_init) from [<c0107d18>] (ret_from_fork+0x14/0x3c)
[    2.716217] ---[ end trace 2fc00e213c13ab21 ]---

Fixes: edeec420de24 (cpufreq: dt-cpufreq: platdev Automatically create device with OPP v2)
Signed-off-by: Suniel Mahesh <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

scsi: scsi_transport_fc: set scsi_target_id upon rescan

When an rport is found in the bindings array there is no guarantee that
it had been a target port, so we need to call fc_remote_port_rolechg()
here to ensure the scsi_target_id is set correctly. Otherwise the port
will never be scanned.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Chad Dupuis <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Two sets of NVMe pull requests from Christoph:
      - Fixes for the Fibre Channel host/target to fix spec compliance
      - Allow a zero keep alive timeout
      - Make the debug printk for broken SGLs work better
      - Fix queue zeroing during initialization
      - Set of RDMA and FC fixes
      - Target div-by-zero fix

- bsg double-free fix.

- ndb unknown ioctl fix from Josef.

- Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
   around for a long time, well reviewed. From Lukas.

- brd overflow fix from Mikulas.

- Fix for a loop regression in this merge window, where using a union
   for two members of the loop_cmd turned out to be a really bad idea.
   From Omar.

- Fix for an iostat regression fix in this series, using the wrong API
   to get at the block queue. From Shaohua.

- Fix for a potential blktrace delection deadlock. From Waiman.

* 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
  nvme-fcloop: fix port deletes and callbacks
  nvmet-fc: sync header templates with comments
  nvmet-fc: ensure target queue id within range.
  nvmet-fc: on port remove call put outside lock
  nvme-rdma: don't fully stop the controller in error recovery
  nvme-rdma: give up reconnect if state change fails
  nvme-core: Use nvme_wq to queue async events and fw activation
  nvme: fix sqhd reference when admin queue connect fails
  block: fix a crash caused by wrong API
  fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  nvmet: implement valid sqhd values in completions
  nvme-fabrics: Allow 0 as KATO value
  nvme: allow timed-out ios to retry
  nvme: stop aer posting if controller state not live
  nvme-pci: Print invalid SGL only once
  nvme-pci: initialize queue memory before interrupts
  nvmet-fc: fix failing max io queue connections
  nvme-fc: use transport-specific sgl format
  nvme: add transport SGL definitions
  nvme.h: remove FC transport-specific error values
  ...

PM / OPP: Call notifier without holding opp_table->lock

The notifier callbacks may want to call some OPP helper routines which
may try to take the same opp_table->lock again and cause a deadlock. One
such usecase was reported by Chanwoo Choi, where calling
dev_pm_opp_disable() leads us to the devfreq's OPP notifier handler,
which further calls dev_pm_opp_find_freq_floor() and it deadlocks.

We don't really need the opp_table->lock to be held across the notifier
call though, all we want to make sure is that the 'opp' doesn't get
freed while being used from within the notifier chain. We can do it with
help of dev_pm_opp_get/put() as well. Let's do it.

Cc: 4.11+ <[email protected]> # 4.11+
Fixes: 5b650b388844 "PM / OPP: Take kref from _find_opp_table()"
Reported-by: Chanwoo Choi <[email protected]>
Tested-by: Chanwoo Choi <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Chanwoo Choi <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

Merge tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Bob Peterson:
"GFS2: Fix an old regression in GFS2's debugfs interface

This fixes a regression introduced by commit 88ffbf3e037e ("GFS2: Use
resizable hash table for glocks"). The regression caused the glock dump
in debugfs to not report all the glocks, which makes debugging
extremely difficult"

* tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix debugfs glocks dump

Merge tag 'microblaze-4.14-rc3' of git://git.monstr.eu/linux-2.6-microblaze

Pull Microblaze fixes from Michal Simek:

- Kbuild fix

- use vma_pages

- setup default little endians

* tag 'microblaze-4.14-rc3' of git://git.monstr.eu/linux-2.6-microblaze:
  arch: change default endian for microblaze
  microblaze: Cocci spatch "vma_pages"
  microblaze: Add missing kvm_para.h to Kbuild

security/keys: rewrite all of big_key crypto

This started out as just replacing the use of crypto/rng with
get_random_bytes_wait, so that we wouldn't use bad randomness at boot
time. But, upon looking further, it appears that there were even deeper
underlying cryptographic problems, and that this seems to have been
committed with very little crypto review. So, I rewrote the whole thing,
trying to keep to the conventions introduced by the previous author, to
fix these cryptographic flaws.

It makes no sense to seed crypto/rng at boot time and then keep
using it like this, when in fact there's already get_random_bytes_wait,
which can ensure there's enough entropy and be a much more standard way
of generating keys. Since this sensitive material is being stored
untrusted, using ECB and no authentication is simply not okay at all. I
find it surprising and a bit horrifying that this code even made it past
basic crypto review, which perhaps points to some larger issues. This
patch moves from using AES-ECB to using AES-GCM. Since keys are uniquely
generated each time, we can set the nonce to zero. There was also a race
condition in which the same key would be reused at the same time in
different threads. A mutex fixes this issue now.

So, to summarize, this commit fixes the following vulnerabilities:

  * Low entropy key generation, allowing an attacker to potentially
    guess or predict keys.
  * Unauthenticated encryption, allowing an attacker to modify the
    cipher text in particular ways in order to manipulate the plaintext,
    which is is even more frightening considering the next point.
  * Use of ECB mode, allowing an attacker to trivially swap blocks or
    compare identical plaintext blocks.
  * Key re-use.
  * Faulty memory zeroing.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Kirill Marinushkin <[email protected]>
Cc: [email protected]
Cc: [email protected]

security/keys: properly zero out sensitive key material in big_key

Error paths forgot to zero out sensitive material, so this patch changes
some kfrees into a kzfrees.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Kirill Marinushkin <[email protected]>
Cc: [email protected]
Cc: [email protected]

Merge tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
"Stack tracing and RCU has been having issues with each other and
  lockdep has been pointing out constant problems.

  The changes have been going into the stack tracer, but it has been
  discovered that the problem isn't with the stack tracer itself, but it
  is with calling save_stack_trace() from within the internals of RCU.

  The stack tracer is the one that can trigger the issue the easiest,
  but examining the problem further, it could also happen from a WARN()
  in the wrong place, or even if an NMI happened in this area and it did
  an rcu_read_lock().

  The critical area is where RCU is not watching. Which can happen while
  going to and from idle, or bringing up or taking down a CPU.

  The final fix was to put the protection in kernel_text_address() as it
  is the one that requires RCU to be watching while doing the stack
  trace.

  To make this work properly, Paul had to allow rcu_irq_enter() happen
  after rcu_nmi_enter(). This should have been done anyway, since an NMI
  can page fault (reading vmalloc area), and a page fault triggers
  rcu_irq_enter().

  One patch is just a consolidation of code so that the fix only needed
  to be done in one location"

* tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Remove RCU work arounds from stack tracer
  extable: Enable RCU if it is not watching in kernel_text_address()
  extable: Consolidate *kernel_text_address() functions
  rcu: Allow for page faults in NMI handlers

smp/hotplug: Hotplug state fail injection

Add a sysfs file to one-time fail a specific state. This can be used
to test the state rollback code paths.

Something like this (hotplug-up.sh):

  #!/bin/bash

  echo 0 > /debug/sched_debug
  echo 1 > /debug/tracing/events/cpuhp/enable

  ALL_STATES=`cat /sys/devices/system/cpu/hotplug/states | cut -d':' -f1`
  STATES=${1:-$ALL_STATES}

  for state in $STATES
  do
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 0 > /debug/tracing/trace
  echo Fail state: $state
  echo $state > /sys/devices/system/cpu/cpu1/hotplug/fail
  cat /sys/devices/system/cpu/cpu1/hotplug/fail
  echo 1 > /sys/devices/system/cpu/cpu1/online

  cat /debug/tracing/trace > hotfail-${state}.trace

  sleep 1
  done

Can be used to test for all possible rollback (barring multi-instance)
scenarios on CPU-up, CPU-down is a trivial modification of the above.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Differentiate the AP completion between up and down

With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.

  takedown_cpu()
    irq_lock_sparse()
    wait_for_completion(&st->done)

                                cpuhp_thread_fun
                                  cpuhp_up_callback
                                    cpuhp_invoke_callback
                                      irq_affinity_online_cpu
                                        irq_local_spare()
                                        irq_unlock_sparse()
                                  complete(&st->done)

Now that we have consistent AP state, we can trivially separate the
AP completion between up and down using st->bringup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Differentiate the AP-work lockdep class between up and down

With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.

  CPU0                  CPU1                    CPU2
  cpuhp_up_callbacks:   takedown_cpu:           cpuhp_thread_fun:

  cpuhp_state
                        irq_lock_sparse()
    irq_lock_sparse()
                        wait_for_completion()
                                                cpuhp_state
                                                complete()

Now that we have consistent AP state, we can trivially separate the
AP-work class between up and down using st->bringup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Callback vs state-machine consistency

While the generic callback functions have an 'int' return and thus
appear to be allowed to return error, this is not true for all states.

Specifically, what used to be STARTING/DYING are ran with IRQs
disabled from critical parts of CPU bringup/teardown and are not
allowed to fail. Add WARNs to enforce this rule.

But since some callbacks are indeed allowed to fail, we have the
situation where a state-machine rollback encounters a failure, in this
case we're stuck, we can't go forward and we can't go back. Also add a
WARN for that case.

AFAICT this is a fundamental 'problem' with no real obvious solution.
We want the 'prepare' callbacks to allow failure on either up or down.
Typically on prepare-up this would be things like -ENOMEM from
resource allocations, and the typical usage in prepare-down would be
something like -EBUSY to avoid CPUs being taken away.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Rewrite AP state machine core

There is currently no explicit state change on rollback. That is,
st->bringup, st->rollback and st->target are not consistent when doing
the rollback.

Rework the AP state handling to be more coherent. This does mean we
have to do a second AP kick-and-wait for rollback, but since rollback
is the slow path of a slowpath, this really should not matter.

Take this opportunity to simplify the AP thread function to only run a
single callback per invocation. This unifies the three single/up/down
modes is supports. The looping it used to do for up/down are achieved
by retaining should_run and relying on the main smpboot_thread_fn()
loop.

(I have most of a patch that does the same for the BP state handling,
but that's not critical and gets a little complicated because
CPUHP_BRINGUP_CPU does the AP handoff from a callback, which gets
recursive @st usage, I still have de-fugly that.)

[ tglx: Move cpuhp_down_callbacks() et al. into the HOTPLUG_CPU section to
   avoid gcc complaining about unused functions. Make the HOTPLUG_CPU
   one piece instead of having two consecutive ifdef sections of the
   same type. ]

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Allow external multi-instance rollback

Currently the rollback of multi-instance states is handled inside
cpuhp_invoke_callback(). The problem is that when we want to allow an
explicit state change for rollback, we need to return from the
function without doing the rollback.

Change cpuhp_invoke_callback() to optionally return the multi-instance
state, such that rollback can be done from a subsequent call.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Add state diagram

Add a state diagram to clarify when which states are ran where.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

MAINTAINERS: Add entry for MediaTek PMIC LED driver

Add myself as a maintainer to support existing SoCs and push forward
following MediaTek PMICs with LEDs to reuse the driver.

Signed-off-by: Sean Wang <[email protected]>
Signed-off-by: Jacek Anaszewski <[email protected]>

scsi: scsi_transport_iscsi: fix the issue that iscsi_if_rx doesn't parse nlmsg properly

ChunYu found a kernel crash by syzkaller:

[  651.617875] kasan: CONFIG_KASAN_INLINE enabled
[  651.618217] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  651.618731] general protection fault: 0000 [#1] SMP KASAN
[  651.621543] CPU: 1 PID: 9539 Comm: scsi Not tainted 4.11.0.cov #32
[  651.621938] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  651.622309] task: ffff880117780000 task.stack: ffff8800a3188000
[  651.622762] RIP: 0010:skb_release_data+0x26c/0x590
[...]
[  651.627260] Call Trace:
[  651.629156]  skb_release_all+0x4f/0x60
[  651.629450]  consume_skb+0x1a5/0x600
[  651.630705]  netlink_unicast+0x505/0x720
[  651.632345]  netlink_sendmsg+0xab2/0xe70
[  651.633704]  sock_sendmsg+0xcf/0x110
[  651.633942]  ___sys_sendmsg+0x833/0x980
[  651.637117]  __sys_sendmsg+0xf3/0x240
[  651.638820]  SyS_sendmsg+0x32/0x50
[  651.639048]  entry_SYSCALL_64_fastpath+0x1f/0xc2

It's caused by skb_shared_info at the end of sk_buff was overwritten by
ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.

During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
new value to skb_shinfo(SKB)->nr_frags by ev->type.

This patch is to fix it by checking nlh->nlmsg_len properly there to
avoid over accessing sk_buff.

Reported-by: ChunYu Wang <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Acked-by: Chris Leech <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

irqdomain: Add __rcu annotations to radix tree accessors

Fix various address spaces warning of sparse.

kernel/irq/irqdomain.c:1463:14: warning: incorrect type in assignment (different address spaces)
kernel/irq/irqdomain.c:1463:14:    expected void **slot
kernel/irq/irqdomain.c:1463:14:    got void [noderef] <asn:4>**
kernel/irq/irqdomain.c:1465:66: warning: incorrect type in argument 2 (different address spaces)
kernel/irq/irqdomain.c:1465:66:    expected void [noderef] <asn:4>**slot
kernel/irq/irqdomain.c:1465:66:    got void **slot

Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

irqchip/mips-gic: Use effective affinity to unmask

Commit 7778c4b27cbe ("irqchip: mips-gic: Use pcpu_masks to avoid reading
GIC_SH_MASK*") adjusted the way we handle masking interrupts to set &
clear the interrupt's bit in each pcpu_mask. This allows us to avoid
needing to read the GIC mask registers and perform a bitwise and of
their values with the pending & pcpu_masks.

Unfortunately this didn't quite work for IPIs, which were mapped to a
particular CPU/VP during initialisation but never set the affinity or
effective_affinity fields of their struct irq_desc. This led to them
losing their affinity when gic_unmask_irq() was called for them, and
they'd all become affine to cpu0.

Fix this by:

1) Setting the effective affinity of interrupts in
    gic_shared_irq_domain_map(), which is where we actually map an
    interrupt to a CPU/VP. This ensures that the effective affinity mask
    is always valid, not just after explicitly setting affinity.

2) Using an interrupt's effective affinity when unmasking it, which
    prevents gic_unmask_irq() from unintentionally changing which
    pcpu_mask includes an interrupt.

Fixes: 7778c4b27cbe ("irqchip: mips-gic: Use pcpu_masks to avoid reading GIC_SH_MASK*")
Signed-off-by: Paul Burton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

irqchip/mips-gic: Fix shifts to extract register fields

The MIPS GIC driver is incorrectly using __fls to shift registers,
intending to shift to the least significant bit of a value based upon
its mask but instead shifting off all but the value's top bit. It should
actually be using __ffs to shift to the first, not last, bit of the
value.

Apparently the system I used when testing commit 3680746abd87
("irqchip: mips-gic: Convert remaining shared reg access to new
accessors") and commit b2b2e584ceab ("irqchip: mips-gic: Clean up mti,
reserved-cpu-vectors handling") managed to work correctly despite this
issue, but not all systems do...

Fixes: 3680746abd87 ("irqchip: mips-gic: Convert remaining shared reg access to new accessors")
Fixes: b2b2e584ceab ("irqchip: mips-gic: Clean up mti, reserved-cpu-vectors handling")
Signed-off-by: Paul Burton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

nvme-fcloop: fix port deletes and callbacks

Now that there are potentially long delays between when a remoteport or
targetport delete calls is made and when the callback occurs (dev_loss_tmo
timeout), no longer block in the delete routines and move the final nport
puts to the callbacks.

Moved the fcloop_nport_get/put/free routines to avoid forward declarations.

Ensure port_info structs used in registrations are nulled in case fields
are not set (ex: devloss_tmo values).

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: sync header templates with comments

Comments were incorrect:
- defer_rcv was in host port template. moved to target port template
- Added Mandatory statements for target port template items

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: ensure target queue id within range.

When searching for queue id's ensure they are within the expected range.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: on port remove call put outside lock

Avoid calling the put routine, as it may traverse to free routines while
holding the target lock.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-rdma: don't fully stop the controller in error recovery

By calling nvme_stop_ctrl on a already failed controller will wait for the
scan work to complete (only by identify timeout expiration which is 60
seconds). This is unnecessary when we already know that the controller has
failed.

Reported-by: Yi Zhang <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-rdma: give up reconnect if state change fails

If we failed to transition to state LIVE after a successful reconnect,
then controller deletion already started. In this case there is no
point moving forward with reconnect.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-core: Use nvme_wq to queue async events and fw activation

async_event_work might race as it is executed from two different
workqueues at the moment.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>