Git Repo - linux.git/log

Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block

This seems to be a leftover of commit cf8cddd38bab ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").

It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.

Fixes: cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: Satoru Takeuchi <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: do not backup tree roots when fsync

It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: remove BTRFS_FS_QUOTA_DISABLING flag

Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.

When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.

BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.

So, let's remove BTRFS_FS_QUOTA_DISABLING flag.

Signed-off-by: Tomohiro Misono <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: propagate error to btrfs_cmp_data_prepare caller

btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
btrfs_extent_same() try to access the already freed pages causing faults
(or violates PageLocked assertion).

This patch just return the error as is so that the caller stop the process.

Signed-off-by: Naohiro Aota <[email protected]>
Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage")
Cc: <[email protected]> # 4.2
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: prevent to set invalid default subvolid

`btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
filesystem without specifying `subvol` or `subvolid` mount options.

Fixes: 6ef5ed0d386b ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <[email protected]>
Signed-off-by: Satoru Takeuchi <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: send: fix error number for unknown inode types

ENOTSUPP should not be returned to the user program.
(cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.

Signed-off-by: Tsutomu Itoh <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix NULL pointer dereference from free_reloc_roots()

__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.

Fixes: 6bdf131fac23 ("Btrfs: don't leak reloc root nodes on error")
Cc: <[email protected]> # 4.9
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: finish ordered extent cleaning if no progress is found

__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.

Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <[email protected]> # 4.12
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: clear ordered flag on cleaning up ordered extents

Commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).

BUG: Bad page state in process btrfs pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)

This patch clears the flag same as other places calling
btrfs_dec_test_ordered_pending() for every page in the specified range.

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <[email protected]> # 4.12
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO

fs_info->super_copy->{node,sector}size are little-endian, but the ioctl
should return the values in native endianness. Use the cached values in
btrfs_fs_info instead. Found with sparse.

Fixes: 80a773fbfc2d ("btrfs: retrieve more info from FS_INFO ioctl")
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: do not reset bio->bi_ops while writing bio

flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().

This remove this unnecessary bio->bi_opf setting.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: use the new helper wbc_to_write_flags

This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.

Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.

Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

drm/tegra: trace: Fix path to include

The TRACE_INCLUDE_FILE macro needs to specify the path relative to the
define_trace.h header rather than relative to the file defining it.

Reported-by: Dmitry Osipenko <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge branch 'WIP.x86/fpu' into x86/fpu, because it's ready

Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use using_compacted_format() instead of open coded X86_FEATURE_XSAVES

This is the canonical method to use.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_user_to_xstate()

Tighten the checks in copy_user_to_xstate().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Eliminate the 'xfeatures' local variable in copy_user_to_xstate()

We now have this field in hdr.xfeatures.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Copy the full header in copy_user_to_xstate()

This is in preparation to verify the full xstate header as supplied by user-space.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_kernel_to_xstate()

Tighten the checks in copy_kernel_to_xstate().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Eliminate the 'xfeatures' local variable in copy_kernel_to_xstate()

We have this information in the xstate_header.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Copy the full state_header in copy_kernel_to_xstate()

This is in preparation to verify the full xstate header as supplied by user-space.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in __fpu__restore_sig()

Tighten the checks in __fpu__restore_sig() and update comments.

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Use validate_xstate_header() to validate the xstate_header in xstateregs_set()

Tighten the checks in xstateregs_set().

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Introduce validate_xstate_header()

Move validation of user-supplied xstate_header into a helper function,
in preparation of calling it from both the ptrace and sigreturn syscall
paths.

The new function also considers it to be an error if *any* reserved bits
are set, whereas before we were just clearing most of them silently.

This should reduce the chance of bugs that fail to correctly validate
user-supplied XSAVE areas.  It also will expose any broken userspace
programs that set the other reserved bits; this is desirable because
such programs will lose compatibility with future CPUs and kernels if
those bits are ever used for anything.  (There shouldn't be any such
programs, and in fact in the case where the compacted format is in use
we were already validating xfeatures.  But you never know...)

Signed-off-by: Eric Biggers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Hao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Halcrow <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu__activate_fpstate_read/write() to fpu__prepare_[read|write]()

As per the new nomenclature we don't 'activate' the FPU state
anymore, we initialize it. So drop the _activate_fpstate name
from these functions, which were a bit of a mouthful anyway,
and name them:

fpu__prepare_read()
fpu__prepare_write()

Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu__activate_curr() to fpu__initialize()

Rename this function to better express that it's all about
initializing the FPU state of a task which goes hand in hand
with the fpu::initialized field.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Simplify and speed up fpu__copy()

fpu__copy() has a preempt_disable()/enable() pair, which it had to do to
be able to atomically unlazy the current task when doing an FNSAVE.

But we don't unlazy tasks anymore, we always do direct saves/restores of
FPU context.

So remove both the unnecessary critical section, and update the comments.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Fix stale comments about lazy FPU logic

We don't do any lazy restore anymore, what we have are two pieces of optimization:

- no-FPU tasks that don't save/restore the FPU context (kernel threads are such)

- cached FPU registers maintained via the fpu->last_cpu field. This means that
   if an FPU task context switches to a non-FPU task then we can maintain the
   FPU registers as an in-FPU copies (cache), and skip the restoration of them
   once we switch back to the original FPU-using task.

Update all the comments that still referred to old 'lazy' and 'unlazy' concepts.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Rename fpu::fpstate_active to fpu::initialized

The x86 FPU code used to have a complex state machine where both the FPU
registers and the FPU state context could be 'active' (or inactive)
independently of each other - which enabled features like lazy FPU restore.

Much of this complexity is gone in the current code: now we basically can
have FPU-less tasks (kernel threads) that don't use (and save/restore) FPU
state at all, plus full FPU users that save/restore directly with no laziness
whatsoever.

But the fpu::fpstate_active still carries bits of the old complexity - meanwhile
this flag has become a simple flag that shows whether the FPU context saving
area in the thread struct is initialized and used, or not.

Rename it to fpu::initialized to express this simplicity in the name as well.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Remove fpu__current_fpstate_write_begin/end()

These functions are not used anymore, so remove them.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/fpu: Fix fpu__activate_fpstate_read() and update comments

fpu__activate_fpstate_read() can be called for the current task
when coredumping - or for stopped tasks when ptrace-ing.

Implement this properly in the code and update the comments.

This also fixes an incorrect (but harmless) warning introduced by
one of the earlier patches.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

scsi: scsi_transport_fc: Also check for NOTPRESENT in fc_remote_port_add()

During failover there is a small race window between fc_remote_port_add()
and fc_timeout_deleted_rport(); the latter drops the lock after setting the
port to NOTPRESENT, so if fc_remote_port_add() is called right at that time
it will fail to detect the existing rport and happily adding a new
structure, causing rports to get registered twice.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull compat fix from Al Viro:
"I really wish gcc warned about conversions from pointer to function
into void *..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a typo in put_compat_shm_info()

xfs: remove redundant re-initialization of total_nr_pages

Variable total_nr_pages is being initialized and then updated with
the same value, this latter assignment is redundant and can be
removed. Cleans up clang build warning:

Value stored to 'total_nr_pages' during its initialization is never read

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: Output warning message when discard option was enabled even though the device does not support discard

In order to using discard function, it is necessary that not only xfs
is mounted with discard option, but also the discard function is
supported by the device. Current code doesn't output any message when
users mount with discard option on unsupported device, so it is
difficult to notice that it was not enabled actually.

This patch adds the warning message to notice that discard option is
not enabled due to unsupported device when the filesystem is mounted.

Changes in v2 (Suggested by Brian Foster):
- Move the unsupported device check into xfs_fs_fill_super().
- Clear the discard flag when device is unsupported.

Signed-off-by: Kenjiro Nakayama <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: report zeroed or not correctly in xfs_zero_range()

The 'did_zero' param of xfs_zero_range() was not passed to
iomap_zero_range() correctly. This was introduced by commit
7bb41db3ea16 ("xfs: handle 64-bit length in xfs_iozero"), and found
by code inspection.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: kill meaningless variable 'zero'

In xfs_file_aio_write_checks(), variable 'zero' is there only to
satisfy xfs_zero_eof(), the result of it is ignored. Now, with
iomap_zero_range() based xfs_zero_eof(), we can safely pass NULL as
the last param of it and kill 'zero'.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

fs/xfs: Use %pS printk format for direct addresses

Use the %pS instead of the %pF printk format specifier for printing symbols
from direct addresses. This is needed for the ia64, ppc64 and parisc64
architectures.

Signed-off-by: Helge Deller <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: evict CoW fork extents when performing finsert/fcollapse

When we perform an finsert/fcollapse operation, cancel all the CoW
extents for the affected file offset range so that they don't end up
pointing to the wrong blocks.

Reported-by: Amir Goldstein <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: don't unconditionally clear the reflink flag on zero-block files

If we have speculative cow preallocations hanging around in the cow
fork, don't let a truncate operation clear the reflink flag because if
we do then there's a chance we'll forget to free those extents when we
destroy the incore inode.

Reported-by: Amir Goldstein <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

fix a typo in put_compat_shm_info()

"uip" misspelled as "up"; unfortunately, the latter happens to be
a function and gcc is happy to convert it to void *...

Signed-off-by: Al Viro <[email protected]>

PCI: Fix race condition with driver_override

The driver_override implementation is susceptible to a race condition when
different threads are reading vs. storing a different driver override. Add
locking to avoid the race condition.

This is in close analogy to commit 6265539776a0 ("driver core: platform:
fix race condition with driver_override") from Adrian Salido.

Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override")
Signed-off-by: Nicolai Stange <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Cc: [email protected] # v3.16+

cpufreq: dt: Fix sysfs duplicate filename creation for platform-device

ti-cpufreq and cpufreq-dt-platdev drivers are registering platform-device
with same name "cpufreq-dt" using platform_device_register_*() routines.
This is leading to build warnings appended below.

Providing hardware information to OPP framework along with the platform-
device creation should be done by ti-cpufreq driver before cpufreq-dt
driver comes into place.

This patch add's TI am33xx, am43 and dra7 platforms (which use opp-v2
property) to the blacklist of devices in cpufreq-dt-platform driver to
avoid creating platform-device twice and remove build warnings.

[    2.370167] ------------[ cut here ]------------
[    2.375087] WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x58/0x78
[    2.383112] sysfs: cannot create duplicate filename '/devices/platform/cpufreq-dt'
[    2.391219] Modules linked in:
[    2.394506] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-next-20170912 #1
[    2.402006] Hardware name: Generic AM33XX (Flattened Device Tree)
[    2.408437] [<c0110a28>] (unwind_backtrace) from [<c010ca84>] (show_stack+0x10/0x14)
[    2.416568] [<c010ca84>] (show_stack) from [<c0827d64>] (dump_stack+0xac/0xe0)
[    2.424165] [<c0827d64>] (dump_stack) from [<c0137470>] (__warn+0xd8/0x104)
[    2.431488] [<c0137470>] (__warn) from [<c01374d0>] (warn_slowpath_fmt+0x34/0x44)
[    2.439351] [<c01374d0>] (warn_slowpath_fmt) from [<c03459d0>] (sysfs_warn_dup+0x58/0x78)
[    2.447938] [<c03459d0>] (sysfs_warn_dup) from [<c0345ab8>] (sysfs_create_dir_ns+0x80/0x98)
[    2.456719] [<c0345ab8>] (sysfs_create_dir_ns) from [<c082c554>] (kobject_add_internal+0x9c/0x2d4)
[    2.466124] [<c082c554>] (kobject_add_internal) from [<c082c7d8>] (kobject_add+0x4c/0x9c)
[    2.474712] [<c082c7d8>] (kobject_add) from [<c05803e4>] (device_add+0xcc/0x57c)
[    2.482489] [<c05803e4>] (device_add) from [<c0584b74>] (platform_device_add+0x100/0x220)
[    2.491085] [<c0584b74>] (platform_device_add) from [<c05855a8>] (platform_device_register_full+0xf4/0x118)
[    2.501305] [<c05855a8>] (platform_device_register_full) from [<c067023c>] (ti_cpufreq_init+0x150/0x22c)
[    2.511253] [<c067023c>] (ti_cpufreq_init) from [<c0101df4>] (do_one_initcall+0x3c/0x170)
[    2.519838] [<c0101df4>] (do_one_initcall) from [<c0c00eb4>] (kernel_init_freeable+0x1fc/0x2c4)
[    2.528974] [<c0c00eb4>] (kernel_init_freeable) from [<c083bcac>] (kernel_init+0x8/0x110)
[    2.537565] [<c083bcac>] (kernel_init) from [<c0107d18>] (ret_from_fork+0x14/0x3c)
[    2.545981] ---[ end trace 2fc00e213c13ab20 ]---
[    2.551051] ------------[ cut here ]------------
[    2.555931] WARNING: CPU: 0 PID: 1 at lib/kobject.c:240 kobject_add_internal+0x254/0x2d4
[    2.564578] kobject_add_internal failed for cpufreq-dt with -EEXIST, don't try to register
things with the same name in the same directory.
[    2.577977] Modules linked in:
[    2.581261] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W       4.13.0-next-20170912 #1
[    2.590013] Hardware name: Generic AM33XX (Flattened Device Tree)
[    2.596437] [<c0110a28>] (unwind_backtrace) from [<c010ca84>] (show_stack+0x10/0x14)
[    2.604573] [<c010ca84>] (show_stack) from [<c0827d64>] (dump_stack+0xac/0xe0)
[    2.612172] [<c0827d64>] (dump_stack) from [<c0137470>] (__warn+0xd8/0x104)
[    2.619494] [<c0137470>] (__warn) from [<c01374d0>] (warn_slowpath_fmt+0x34/0x44)
[    2.627362] [<c01374d0>] (warn_slowpath_fmt) from [<c082c70c>] (kobject_add_internal+0x254/0x2d4)
[    2.636666] [<c082c70c>] (kobject_add_internal) from [<c082c7d8>] (kobject_add+0x4c/0x9c)
[    2.645255] [<c082c7d8>] (kobject_add) from [<c05803e4>] (device_add+0xcc/0x57c)
[    2.653027] [<c05803e4>] (device_add) from [<c0584b74>] (platform_device_add+0x100/0x220)
[    2.661615] [<c0584b74>] (platform_device_add) from [<c05855a8>] (platform_device_register_full+0xf4/0x118)
[    2.671833] [<c05855a8>] (platform_device_register_full) from [<c067023c>] (ti_cpufreq_init+0x150/0x22c)
[    2.681779] [<c067023c>] (ti_cpufreq_init) from [<c0101df4>] (do_one_initcall+0x3c/0x170)
[    2.690377] [<c0101df4>] (do_one_initcall) from [<c0c00eb4>] (kernel_init_freeable+0x1fc/0x2c4)
[    2.699510] [<c0c00eb4>] (kernel_init_freeable) from [<c083bcac>] (kernel_init+0x8/0x110)
[    2.708106] [<c083bcac>] (kernel_init) from [<c0107d18>] (ret_from_fork+0x14/0x3c)
[    2.716217] ---[ end trace 2fc00e213c13ab21 ]---

Fixes: edeec420de24 (cpufreq: dt-cpufreq: platdev Automatically create device with OPP v2)
Signed-off-by: Suniel Mahesh <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

scsi: scsi_transport_fc: set scsi_target_id upon rescan

When an rport is found in the bindings array there is no guarantee that
it had been a target port, so we need to call fc_remote_port_rolechg()
here to ensure the scsi_target_id is set correctly. Otherwise the port
will never be scanned.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Tested-by: Chad Dupuis <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Two sets of NVMe pull requests from Christoph:
      - Fixes for the Fibre Channel host/target to fix spec compliance
      - Allow a zero keep alive timeout
      - Make the debug printk for broken SGLs work better
      - Fix queue zeroing during initialization
      - Set of RDMA and FC fixes
      - Target div-by-zero fix

- bsg double-free fix.

- ndb unknown ioctl fix from Josef.

- Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
   around for a long time, well reviewed. From Lukas.

- brd overflow fix from Mikulas.

- Fix for a loop regression in this merge window, where using a union
   for two members of the loop_cmd turned out to be a really bad idea.
   From Omar.

- Fix for an iostat regression fix in this series, using the wrong API
   to get at the block queue. From Shaohua.

- Fix for a potential blktrace delection deadlock. From Waiman.

* 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
  nvme-fcloop: fix port deletes and callbacks
  nvmet-fc: sync header templates with comments
  nvmet-fc: ensure target queue id within range.
  nvmet-fc: on port remove call put outside lock
  nvme-rdma: don't fully stop the controller in error recovery
  nvme-rdma: give up reconnect if state change fails
  nvme-core: Use nvme_wq to queue async events and fw activation
  nvme: fix sqhd reference when admin queue connect fails
  block: fix a crash caused by wrong API
  fs: Fix page cache inconsistency when mixing buffered and AIO DIO
  nvmet: implement valid sqhd values in completions
  nvme-fabrics: Allow 0 as KATO value
  nvme: allow timed-out ios to retry
  nvme: stop aer posting if controller state not live
  nvme-pci: Print invalid SGL only once
  nvme-pci: initialize queue memory before interrupts
  nvmet-fc: fix failing max io queue connections
  nvme-fc: use transport-specific sgl format
  nvme: add transport SGL definitions
  nvme.h: remove FC transport-specific error values
  ...

PM / OPP: Call notifier without holding opp_table->lock

The notifier callbacks may want to call some OPP helper routines which
may try to take the same opp_table->lock again and cause a deadlock. One
such usecase was reported by Chanwoo Choi, where calling
dev_pm_opp_disable() leads us to the devfreq's OPP notifier handler,
which further calls dev_pm_opp_find_freq_floor() and it deadlocks.

We don't really need the opp_table->lock to be held across the notifier
call though, all we want to make sure is that the 'opp' doesn't get
freed while being used from within the notifier chain. We can do it with
help of dev_pm_opp_get/put() as well. Let's do it.

Cc: 4.11+ <[email protected]> # 4.11+
Fixes: 5b650b388844 "PM / OPP: Take kref from _find_opp_table()"
Reported-by: Chanwoo Choi <[email protected]>
Tested-by: Chanwoo Choi <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Chanwoo Choi <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

Merge tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Bob Peterson:
"GFS2: Fix an old regression in GFS2's debugfs interface

This fixes a regression introduced by commit 88ffbf3e037e ("GFS2: Use
resizable hash table for glocks"). The regression caused the glock dump
in debugfs to not report all the glocks, which makes debugging
extremely difficult"

* tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix debugfs glocks dump

Merge tag 'microblaze-4.14-rc3' of git://git.monstr.eu/linux-2.6-microblaze

Pull Microblaze fixes from Michal Simek:

- Kbuild fix

- use vma_pages

- setup default little endians

* tag 'microblaze-4.14-rc3' of git://git.monstr.eu/linux-2.6-microblaze:
  arch: change default endian for microblaze
  microblaze: Cocci spatch "vma_pages"
  microblaze: Add missing kvm_para.h to Kbuild

security/keys: rewrite all of big_key crypto

This started out as just replacing the use of crypto/rng with
get_random_bytes_wait, so that we wouldn't use bad randomness at boot
time. But, upon looking further, it appears that there were even deeper
underlying cryptographic problems, and that this seems to have been
committed with very little crypto review. So, I rewrote the whole thing,
trying to keep to the conventions introduced by the previous author, to
fix these cryptographic flaws.

It makes no sense to seed crypto/rng at boot time and then keep
using it like this, when in fact there's already get_random_bytes_wait,
which can ensure there's enough entropy and be a much more standard way
of generating keys. Since this sensitive material is being stored
untrusted, using ECB and no authentication is simply not okay at all. I
find it surprising and a bit horrifying that this code even made it past
basic crypto review, which perhaps points to some larger issues. This
patch moves from using AES-ECB to using AES-GCM. Since keys are uniquely
generated each time, we can set the nonce to zero. There was also a race
condition in which the same key would be reused at the same time in
different threads. A mutex fixes this issue now.

So, to summarize, this commit fixes the following vulnerabilities:

  * Low entropy key generation, allowing an attacker to potentially
    guess or predict keys.
  * Unauthenticated encryption, allowing an attacker to modify the
    cipher text in particular ways in order to manipulate the plaintext,
    which is is even more frightening considering the next point.
  * Use of ECB mode, allowing an attacker to trivially swap blocks or
    compare identical plaintext blocks.
  * Key re-use.
  * Faulty memory zeroing.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Kirill Marinushkin <[email protected]>
Cc: [email protected]
Cc: [email protected]

security/keys: properly zero out sensitive key material in big_key

Error paths forgot to zero out sensitive material, so this patch changes
some kfrees into a kzfrees.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Eric Biggers <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Kirill Marinushkin <[email protected]>
Cc: [email protected]
Cc: [email protected]

Merge tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
"Stack tracing and RCU has been having issues with each other and
  lockdep has been pointing out constant problems.

  The changes have been going into the stack tracer, but it has been
  discovered that the problem isn't with the stack tracer itself, but it
  is with calling save_stack_trace() from within the internals of RCU.

  The stack tracer is the one that can trigger the issue the easiest,
  but examining the problem further, it could also happen from a WARN()
  in the wrong place, or even if an NMI happened in this area and it did
  an rcu_read_lock().

  The critical area is where RCU is not watching. Which can happen while
  going to and from idle, or bringing up or taking down a CPU.

  The final fix was to put the protection in kernel_text_address() as it
  is the one that requires RCU to be watching while doing the stack
  trace.

  To make this work properly, Paul had to allow rcu_irq_enter() happen
  after rcu_nmi_enter(). This should have been done anyway, since an NMI
  can page fault (reading vmalloc area), and a page fault triggers
  rcu_irq_enter().

  One patch is just a consolidation of code so that the fix only needed
  to be done in one location"

* tag 'trace-v4.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Remove RCU work arounds from stack tracer
  extable: Enable RCU if it is not watching in kernel_text_address()
  extable: Consolidate *kernel_text_address() functions
  rcu: Allow for page faults in NMI handlers

smp/hotplug: Hotplug state fail injection

Add a sysfs file to one-time fail a specific state. This can be used
to test the state rollback code paths.

Something like this (hotplug-up.sh):

  #!/bin/bash

  echo 0 > /debug/sched_debug
  echo 1 > /debug/tracing/events/cpuhp/enable

  ALL_STATES=`cat /sys/devices/system/cpu/hotplug/states | cut -d':' -f1`
  STATES=${1:-$ALL_STATES}

  for state in $STATES
  do
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 0 > /debug/tracing/trace
  echo Fail state: $state
  echo $state > /sys/devices/system/cpu/cpu1/hotplug/fail
  cat /sys/devices/system/cpu/cpu1/hotplug/fail
  echo 1 > /sys/devices/system/cpu/cpu1/online

  cat /debug/tracing/trace > hotfail-${state}.trace

  sleep 1
  done

Can be used to test for all possible rollback (barring multi-instance)
scenarios on CPU-up, CPU-down is a trivial modification of the above.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Differentiate the AP completion between up and down

With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.

  takedown_cpu()
    irq_lock_sparse()
    wait_for_completion(&st->done)

                                cpuhp_thread_fun
                                  cpuhp_up_callback
                                    cpuhp_invoke_callback
                                      irq_affinity_online_cpu
                                        irq_local_spare()
                                        irq_unlock_sparse()
                                  complete(&st->done)

Now that we have consistent AP state, we can trivially separate the
AP completion between up and down using st->bringup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Differentiate the AP-work lockdep class between up and down

With lockdep-crossrelease we get deadlock reports that span cpu-up and
cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
and cpu-down are globally serialized.

  CPU0                  CPU1                    CPU2
  cpuhp_up_callbacks:   takedown_cpu:           cpuhp_thread_fun:

  cpuhp_state
                        irq_lock_sparse()
    irq_lock_sparse()
                        wait_for_completion()
                                                cpuhp_state
                                                complete()

Now that we have consistent AP state, we can trivially separate the
AP-work class between up and down using st->bringup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Callback vs state-machine consistency

While the generic callback functions have an 'int' return and thus
appear to be allowed to return error, this is not true for all states.

Specifically, what used to be STARTING/DYING are ran with IRQs
disabled from critical parts of CPU bringup/teardown and are not
allowed to fail. Add WARNs to enforce this rule.

But since some callbacks are indeed allowed to fail, we have the
situation where a state-machine rollback encounters a failure, in this
case we're stuck, we can't go forward and we can't go back. Also add a
WARN for that case.

AFAICT this is a fundamental 'problem' with no real obvious solution.
We want the 'prepare' callbacks to allow failure on either up or down.
Typically on prepare-up this would be things like -ENOMEM from
resource allocations, and the typical usage in prepare-down would be
something like -EBUSY to avoid CPUs being taken away.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Rewrite AP state machine core

There is currently no explicit state change on rollback. That is,
st->bringup, st->rollback and st->target are not consistent when doing
the rollback.

Rework the AP state handling to be more coherent. This does mean we
have to do a second AP kick-and-wait for rollback, but since rollback
is the slow path of a slowpath, this really should not matter.

Take this opportunity to simplify the AP thread function to only run a
single callback per invocation. This unifies the three single/up/down
modes is supports. The looping it used to do for up/down are achieved
by retaining should_run and relying on the main smpboot_thread_fn()
loop.

(I have most of a patch that does the same for the BP state handling,
but that's not critical and gets a little complicated because
CPUHP_BRINGUP_CPU does the AP handoff from a callback, which gets
recursive @st usage, I still have de-fugly that.)

[ tglx: Move cpuhp_down_callbacks() et al. into the HOTPLUG_CPU section to
   avoid gcc complaining about unused functions. Make the HOTPLUG_CPU
   one piece instead of having two consecutive ifdef sections of the
   same type. ]

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Allow external multi-instance rollback

Currently the rollback of multi-instance states is handled inside
cpuhp_invoke_callback(). The problem is that when we want to allow an
explicit state change for rollback, we need to return from the
function without doing the rollback.

Change cpuhp_invoke_callback() to optionally return the multi-instance
state, such that rollback can be done from a subsequent call.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

smp/hotplug: Add state diagram

Add a state diagram to clarify when which states are ran where.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

MAINTAINERS: Add entry for MediaTek PMIC LED driver

Add myself as a maintainer to support existing SoCs and push forward
following MediaTek PMICs with LEDs to reuse the driver.

Signed-off-by: Sean Wang <[email protected]>
Signed-off-by: Jacek Anaszewski <[email protected]>

scsi: scsi_transport_iscsi: fix the issue that iscsi_if_rx doesn't parse nlmsg properly

ChunYu found a kernel crash by syzkaller:

[  651.617875] kasan: CONFIG_KASAN_INLINE enabled
[  651.618217] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  651.618731] general protection fault: 0000 [#1] SMP KASAN
[  651.621543] CPU: 1 PID: 9539 Comm: scsi Not tainted 4.11.0.cov #32
[  651.621938] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  651.622309] task: ffff880117780000 task.stack: ffff8800a3188000
[  651.622762] RIP: 0010:skb_release_data+0x26c/0x590
[...]
[  651.627260] Call Trace:
[  651.629156]  skb_release_all+0x4f/0x60
[  651.629450]  consume_skb+0x1a5/0x600
[  651.630705]  netlink_unicast+0x505/0x720
[  651.632345]  netlink_sendmsg+0xab2/0xe70
[  651.633704]  sock_sendmsg+0xcf/0x110
[  651.633942]  ___sys_sendmsg+0x833/0x980
[  651.637117]  __sys_sendmsg+0xf3/0x240
[  651.638820]  SyS_sendmsg+0x32/0x50
[  651.639048]  entry_SYSCALL_64_fastpath+0x1f/0xc2

It's caused by skb_shared_info at the end of sk_buff was overwritten by
ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.

During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
new value to skb_shinfo(SKB)->nr_frags by ev->type.

This patch is to fix it by checking nlh->nlmsg_len properly there to
avoid over accessing sk_buff.

Reported-by: ChunYu Wang <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Acked-by: Chris Leech <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

irqdomain: Add __rcu annotations to radix tree accessors

Fix various address spaces warning of sparse.

kernel/irq/irqdomain.c:1463:14: warning: incorrect type in assignment (different address spaces)
kernel/irq/irqdomain.c:1463:14:    expected void **slot
kernel/irq/irqdomain.c:1463:14:    got void [noderef] <asn:4>**
kernel/irq/irqdomain.c:1465:66: warning: incorrect type in argument 2 (different address spaces)
kernel/irq/irqdomain.c:1465:66:    expected void [noderef] <asn:4>**slot
kernel/irq/irqdomain.c:1465:66:    got void **slot

Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

irqchip/mips-gic: Use effective affinity to unmask

Commit 7778c4b27cbe ("irqchip: mips-gic: Use pcpu_masks to avoid reading
GIC_SH_MASK*") adjusted the way we handle masking interrupts to set &
clear the interrupt's bit in each pcpu_mask. This allows us to avoid
needing to read the GIC mask registers and perform a bitwise and of
their values with the pending & pcpu_masks.

Unfortunately this didn't quite work for IPIs, which were mapped to a
particular CPU/VP during initialisation but never set the affinity or
effective_affinity fields of their struct irq_desc. This led to them
losing their affinity when gic_unmask_irq() was called for them, and
they'd all become affine to cpu0.

Fix this by:

1) Setting the effective affinity of interrupts in
    gic_shared_irq_domain_map(), which is where we actually map an
    interrupt to a CPU/VP. This ensures that the effective affinity mask
    is always valid, not just after explicitly setting affinity.

2) Using an interrupt's effective affinity when unmasking it, which
    prevents gic_unmask_irq() from unintentionally changing which
    pcpu_mask includes an interrupt.

Fixes: 7778c4b27cbe ("irqchip: mips-gic: Use pcpu_masks to avoid reading GIC_SH_MASK*")
Signed-off-by: Paul Burton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

irqchip/mips-gic: Fix shifts to extract register fields

The MIPS GIC driver is incorrectly using __fls to shift registers,
intending to shift to the least significant bit of a value based upon
its mask but instead shifting off all but the value's top bit. It should
actually be using __ffs to shift to the first, not last, bit of the
value.

Apparently the system I used when testing commit 3680746abd87
("irqchip: mips-gic: Convert remaining shared reg access to new
accessors") and commit b2b2e584ceab ("irqchip: mips-gic: Clean up mti,
reserved-cpu-vectors handling") managed to work correctly despite this
issue, but not all systems do...

Fixes: 3680746abd87 ("irqchip: mips-gic: Convert remaining shared reg access to new accessors")
Fixes: b2b2e584ceab ("irqchip: mips-gic: Clean up mti, reserved-cpu-vectors handling")
Signed-off-by: Paul Burton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jason Cooper <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

nvme-fcloop: fix port deletes and callbacks

Now that there are potentially long delays between when a remoteport or
targetport delete calls is made and when the callback occurs (dev_loss_tmo
timeout), no longer block in the delete routines and move the final nport
puts to the callbacks.

Moved the fcloop_nport_get/put/free routines to avoid forward declarations.

Ensure port_info structs used in registrations are nulled in case fields
are not set (ex: devloss_tmo values).

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: sync header templates with comments

Comments were incorrect:
- defer_rcv was in host port template. moved to target port template
- Added Mandatory statements for target port template items

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: ensure target queue id within range.

When searching for queue id's ensure they are within the expected range.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: on port remove call put outside lock

Avoid calling the put routine, as it may traverse to free routines while
holding the target lock.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-rdma: don't fully stop the controller in error recovery

By calling nvme_stop_ctrl on a already failed controller will wait for the
scan work to complete (only by identify timeout expiration which is 60
seconds). This is unnecessary when we already know that the controller has
failed.

Reported-by: Yi Zhang <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-rdma: give up reconnect if state change fails

If we failed to transition to state LIVE after a successful reconnect,
then controller deletion already started. In this case there is no
point moving forward with reconnect.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-core: Use nvme_wq to queue async events and fw activation

async_event_work might race as it is executed from two different
workqueues at the moment.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: fix sqhd reference when admin queue connect fails

Fix bug in sqhd patch.

It wasn't the sq that was at risk. In the case where the admin queue
connect command fails, the sq->size field is not set. Therefore, this
becomes a divide by zero error.

Add a quick check to bypass under this failure condition.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

gfs2: Fix debugfs glocks dump

The switch to rhashtables (commit 88ffbf3e03) broke the debugfs glock
dump (/sys/kernel/debug/gfs2/<device>/glocks) for dumps bigger than a
single buffer: the right function for restarting an rhashtable iteration
from the beginning of the hash table is rhashtable_walk_enter;
rhashtable_walk_stop + rhashtable_walk_start will just resume from the
current position.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>
Cc: [email protected] # v4.3+

selftests: timers: set-timer-lat: Fix hang when testing unsupported alarms

When timer_create() fails on a bootime or realtime clock, setup_timer()
returns 0 as if timer has been set. Callers wait forever for the timer
to expire.

This hang is seen on a system that doesn't have support for:

CLOCK_REALTIME_ALARM ABSTIME missing CAP_WAKE_ALARM? : [UNSUPPORTED]

Test hangs waiting for a timer that hasn't been set to expire. Fix
setup_timer() to return 1, add handling in callers to detect the
unsupported case and return 0 without waiting to not fail the test.

Signed-off-by: Shuah Khan <[email protected]>

selftests: timers: set-timer-lat: fix hang when std out/err are redirected

do_timer_oneshot() uses select() as a timer with FD_SETSIZE and readfs
is cleared with FD_ZERO without FD_SET.

When stdout and stderr are redirected, the test hangs in select forever.
Fix the problem calling select() with readfds empty and nfds zero. This
is sufficient for using select() for timer.

With this fix "./set-timer-lat > /dev/null 2>&1" no longer hangs.

Signed-off-by: Shuah Khan <[email protected]>
Acked-by: Greg Hackmann <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests/memfd: correct run_tests.sh permission

to fix the following issue:
------------------
TAP version 13
selftests: run_tests.sh
========================================
selftests: Warning: file run_tests.sh is not executable, correct this.
not ok 1..1 selftests: run_tests.sh [FAIL]
------------------

Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests/seccomp: Support glibc 2.26 siginfo_t.h

The 2.26 release of glibc changed how siginfo_t is defined, and the earlier
work-around to using the kernel definition are no longer needed. The old
way needs to stay around for a while, though.

Reported-by: Seth Forshee <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Tested-by: Seth Forshee <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests: futex: Makefile: fix for loops in targets to run silently

Fix for loops in targets to run silently to avoid cluttering the test
results.

Suppresses the following from targets:

for DIR in functional; do               \
        BUILD_TARGET=./tools/testing/selftests/futex/$DIR; \
        mkdir $BUILD_TARGET  -p;        \
        make OUTPUT=$BUILD_TARGET -C $DIR all;\
done

./tools/testing/selftests/futex/run.sh

Signed-off-by: Shuah Khan <[email protected]>
Reviewed-by: Darren Hart (VMware) <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests: Makefile: fix for loops in targets to run silently

Fix for loops in targets to run silently to avoid cluttering the test
results.

Suppresses the following from targets: e.g run from breakpoints

for TARGET in breakpoints; do \
BUILD_TARGET=$BUILD/$TARGET; \
mkdir $BUILD_TARGET -p; \
make OUTPUT=$BUILD_TARGET -C $TARGET;\
done;

Signed-off-by: Shuah Khan <[email protected]>

selftests: mqueue: Use full path to run tests from Makefile

Use full path including $(OUTPUT) to run tests from Makefile for
normal case when objects reside in the source tree as well as when
objects are relocated with make O=dir. In both cases $(OUTPUT) will
be set correctly by lib.mk.

Signed-off-by: Shuah Khan <[email protected]>

selftests: futex: copy sub-dir test scripts for make O=dir run

For make O=dir run_tests to work, test scripts from sub-directories
need to be copied over to the object directory. Running tests from the
object directory is necessary to avoid making the source tree dirty.

Signed-off-by: Shuah Khan <[email protected]>
Reviewed-by: Darren Hart (VMware) <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

PCI: Add dummy pci_acs_enabled() for CONFIG_PCI=n build

If CONFIG_PCI=n and gcc (e.g. 4.1.2) decides not to inline
get_pci_function_alias_group(), the build fails with:

drivers/iommu/iommu.o: In function `get_pci_function_alias_group':
iommu.c:(.text+0xfdc): undefined reference to `pci_acs_enabled'

Due to the various dummies for PCI calls in the CONFIG_PCI=n case,
pci_acs_enabled() never called, but not all versions of gcc are smart
enough to realize that.

While explicitly marking get_pci_function_alias_group() inline would fix
the build, this would inflate the code for the CONFIG_PCI=y case, as
get_pci_function_alias_group() is a not-so-small function called from two
places.

Hence fix the issue by introducing a dummy for pci_acs_enabled() instead.

Fixes: 0ae349a0f33f ("iommu/qcom: Add qcom_iommu")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Alex Williamson <[email protected]>

IB/mlx5: Fix NULL deference on mlx5_ib_update_xlt failure

mlx5_ib_reg_user_mr called mlx5_ib_dereg_mr in case of MR population
failure. This resulted in a NULL dereference as ibmr->device wasn't
initialized yet.

We address this by adding an internal dereg_mr function that can handle
partially initialized MRs, and fixing clean_mr to work on partially
initialized MRs.

Fixes: ff740aefecb9 ("IB/mlx5: Decouple MR allocation and population flows")
Signed-off-by: Ilya Lesokhin <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/mlx5: Simplify mlx5_ib_cont_pages

The patch simplifies mlx5_ib_cont_pages and fixes the following
issues in the original implementation:

First issues is related to alignment of the PFNs. After the check
base + p != PFN, the alignment of the PFN wasn't checked. So the PFN
sequence 0, 1, 1, 2 would result in a page_shift of 13 even though
the 3rd PFN is not 8KB aligned.

This wasn't actually a bug because it was supported by all the
existing mlx5 compatible device, but we don't want to require
this support in all future devices.

Another issue is because the inner loop didn't advance PFN so
the test "if (base + p != pfn)" always failed for SGE with
len > (1<<page_shift).

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Ilya Lesokhin <[email protected]>
Reviewed-by: Eli Cohen <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/ipoib: Fix inconsistency with free_netdev and free_rdma_netdev

Call free_rdma_netdev instead of free_netdev each time we want to
release a netdevice. This call is also relevant for future freeing
of offloaded child interfaces.

This patch also adds a missing call for free netdevice when releasing
a parent interface that has child interfaces using ipoib_remove_one.

Fixes: cd565b4b51e5 ('IB/IPoIB: Support acceleration options callbacks')
Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/ipoib: Fix sysfs Pkey create<->remove possible deadlock

A possible ABBA lock can happen with RTNL and vlan_rwsem.
For example:

Flow A: Device Flush
__ipoib_ib_dev_flush
down_read(vlan_rwsem) // Lock A
ipoib_flush_ah
flush_workqueue(priv->wq) // Wait for completion
A work on shared WQ (Mcast carrier)
ipoib_mcast_carrier_on_task
while (!rtnl_trylock()) // Wait for lock B

Flow B: Sysfs PKEY delete
ipoib_vlan_delete
lock(RTNL) // Lock B
down_write(vlan_rwsem) // Wait for lock A

This can happen with PKEY creates as well. The solution is to release
the RTNL lock in sysfs functions in case it is not possible to lock
VLAN RW semaphore and reset the SYS call.

Fixes: 69956d83267e ("IB/ipoib: Sync between remove_one to sysfs calls that use rtnl_lock")
Signed-off-by: Shalom Lagziel <[email protected]>
Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB: Correct MR length field to be 64-bit

The ib_mr->length represents the length of the MR in bytes as per
the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).

Currently ib_mr->length field is defined as only 32-bits field.
This might result into truncation and failed WRs of consumers who
registers more than 4GB bytes memory regions and whose WRs accessing
such MRs.

This patch makes the length 64-bit to avoid such truncation.

Cc: Sagi Grimberg <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Faisal Latif <[email protected]>
Fixes: 4c67e2bfc8b7 ("IB/core: Introduce new fast registration API")
Signed-off-by: Ilya Lesokhin <[email protected]>
Signed-off-by: Parav Pandit <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix qp_sec use after free access

When security_ib_alloc_security fails, qp->qp_sec memory is freed.
However ib_destroy_qp still tries to access this memory which result
in kernel crash. So its initialized to NULL to avoid such access.

Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Parav Pandit <[email protected]>
Reviewed-by: Daniel Jurgens <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/core: Fix typo in the name of the tag-matching cap struct

The tag matching functionality is implemented by mlx5 driver
by extending XRQ, however this internal kernel information was
exposed to user space applications with *xrq* name instead of *tm*.

This patch renames *xrq* to *tm* to handle that.

Fixes: 8d50505ada72 ("IB/uverbs: Expose XRQ capabilities")
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Yishai Hadas <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

perf tools: Fix syscalltbl build failure

The build of kernel v4.14-rc1 for i686 fails on RHEL 6 with the error
in tools/perf:

  util/syscalltbl.c:157: error: expected ';', ',' or ')' before '__maybe_unused'
  mv: cannot stat `util/.syscalltbl.o.tmp': No such file or directory

Fix it by placing/moving:

  #include <linux/compiler.h>

  outside of #ifdef HAVE_SYSCALL_TABLE block.

Signed-off-by: Akemi Yagi <[email protected]>
Cc: Alan Bartlett <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf report: Fix debug messages with --call-graph option

With --call-graph option, perf report can display call chains using
type, min percent threshold, optional print limit and order. And the
default call-graph parameter is 'graph,0.5,caller,function,percent'.

Before this patch, 'perf report --call-graph' shows incorrect debug
messages as below:

  # perf report --call-graph
  Invalid callchain mode: 0.5
  Invalid callchain order: 0.5
  Invalid callchain sort key: 0.5
  Invalid callchain config key: 0.5
  Invalid callchain mode: caller
  Invalid callchain mode: function
  Invalid callchain order: function
  Invalid callchain mode: percent
  Invalid callchain order: percent
  Invalid callchain sort key: percent

That is because in function __parse_callchain_report_opt(),each field of
the call-graph parameter is passed to parse_callchain_{mode,order,
sort_key,value} in turn until it meets the matching value.

For example, the order field "caller" is passed to
parse_callchain_mode() firstly and obviously it doesn't match any mode
field. Therefore parse_callchain_mode() will shows the debug message
"Invalid callchain mode: caller", which could confuse users.

The patch fixes this issue by moving the warning out of the function
parse_callchain_{mode,order,sort_key,value}.

Signed-off-by: Mengting Zhang <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Krister Johansen <[email protected]>
Cc: Li Bin <[email protected]>
Cc: Milian Wolff <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Wang Nan <[email protected]>
Cc: Yao Jin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

block: fix a crash caused by wrong API

part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2("block: pass in queue to inflight accounting")
Cc: Bart Van Assche <[email protected]>
Cc: Omar Sandoval <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

fs: Fix page cache inconsistency when mixing buffered and AIO DIO

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.

This was based on proposal from Jeff Moyer, thanks!

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Signed-off-by: Lukas Czerner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: implement valid sqhd values in completions

To support sqhd, for initiators that are following the spec and
paying attention to sqhd vs their sqtail values:

- add sqhd to struct nvmet_sq
- initialize sqhd to 0 in nvmet_sq_setup
- rather than propagate the 0's-based qsize value from the connect message
which requires a +1 in every sqhd update, and as nothing else references
it, convert to 1's-based value in nvmt_sq/cq_setup() calls.
- validate connect message sqsize being non-zero per spec.
- updated assign sqhd for every completion that goes back.

Also remove handling the NULL sq case in __nvmet_req_complete, as it can't
happen with the current code.

Signed-off-by: James Smart <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-fabrics: Allow 0 as KATO value

Currently, driver code allows user to set 0 as KATO
(Keep Alive TimeOut), but this is not being respected.
This patch enforces the expected behavior.

Signed-off-by: Guilherme G. Piccoli <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: allow timed-out ios to retry

Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. On of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.

Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.

On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.

Signed-off-by: James Smart <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: stop aer posting if controller state not live

If an nvme async_event command completes, in most cases, a new
async event is posted. However, if the controller enters a
resetting or reconnecting state, there is nothing to block the
scheduled work element from posting the async event again. Nor are
there calls from the transport to stop async events when an
association dies.

In the case of FC, where the association is torn down, the aer must
be aborted on the FC link and completes through the normal job
completion path. Thus the terminated async event ends up being
rescheduled even though the controller isn't in a valid state for
the aer, and the reposting gets the transport into a partially torn
down data structure.

It's possible to hit the scenario on rdma, although much less likely
due to an aer completing right as the association is terminated and
as the association teardown reclaims the blk requests via
nvme_cancel_request() so its immediate, not a link-related action
like on FC.

Fix by putting controller state checks in both the async event
completion routine where it schedules the async event and in the
async event work routine before it calls into the transport. It's
effectively a "stop_async_events()" behavior. The transport, when
it creates a new association with the subsystem will transition
the state back to live and is already restarting the async event
posting.

Signed-off-by: James Smart <[email protected]>
[hch: remove taking a lock over reading the controller state]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-pci: Print invalid SGL only once

The WARN_ONCE macro returns true if the condition is true, not if the
warn was raised, so we're printing the scatter list every time it's
invalid. This is excessive and makes debugging harder, so this patch
prints it just once.

Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-pci: initialize queue memory before interrupts

A spurious interrupt before the nvme driver has initialized the completion
queue may inadvertently cause the driver to believe it has a completion
to process. This may result in a NULL dereference since the nvmeq's tags
are not set at this point.

The patch initializes the host's CQ memory so that a spurious interrupt
isn't mistaken for a real completion.

Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-fc: fix failing max io queue connections

fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g.
admin queue plus NVMET_NR_QUEUES-1 io queues. But NVMET_NR_QUEUES is
the number of io queues, so maximum queue count is really
NVMET_NR_QUEUES+1.

Fix the handling in the target fc transport

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-fc: use transport-specific sgl format

Sync with NVM Express spec change and FC-NVME 1.18.

FC transport sets SGL type to Transport SGL Data Block Descriptor and
subtype to transport-specific value 0x0A.

Removed the warn-on's on the PRP fields. They are unneeded. They were
to check for values from the upper layer that weren't set right, and
for the most part were fine. But, with Async events, which reuse the
same structure and 2nd time issued the SGL overlay converted them to
the Transport SGL values - the warn-on's were errantly firing.

Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>