Git Repo - linux.git/log

bug: lift "cut here" out of __warn()

In preparation for cleaning up "cut here", move the "cut here" logic up
out of __warn() and into callers that pass non-NULL args. For anyone
looking closely, there are two callers that pass NULL args: one already
explicitly prints "cut here". The remaining case is covered by how a WARN
is built, which will be cleaned up in the next patch.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Drew Davenport <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: YueHaibing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: consolidate warn_slowpath_fmt() usage

Instead of having a separate helper for no printk output, just consolidate
the logic into warn_slowpath_fmt().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Drew Davenport <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: YueHaibing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: rename __WARN_printf_taint() to __WARN_printf()

This just renames the helper to improve readability.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Drew Davenport <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: YueHaibing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: refactor away warn_slowpath_fmt_taint()

Patch series "Clean up WARN() "cut here" handling", v2.

Christophe Leroy noticed that the fix for missing "cut here" in the WARN()
case was adding explicit printk() calls instead of teaching the exception
handler to add it.  This refactors the bug/warn infrastructure to pass
this information as a new BUGFLAG.

Longer details repeated from the last patch in the series:

bug: move WARN_ON() "cut here" into exception handler

The original cleanup of "cut here" missed the WARN_ON() case (that does
not have a printk message), which was fixed recently by adding an explicit
printk of "cut here".  This had the downside of adding a printk() to every
WARN_ON() caller, which reduces the utility of using an instruction
exception to streamline the resulting code.  By making this a new BUGFLAG,
all of these can be removed and "cut here" can be handled by the exception
handler.

This was very pronounced on PowerPC, but the effect can be seen on x86 as
well.  The resulting text size of a defconfig build shows some small
savings from this patch:

   text    data     bss     dec     hex filename
19691167        5134320 1646664 26472151        193eed7 vmlinux.before
19676362        5134260 1663048 26473670        193f4c6 vmlinux.after

This change also opens the door for creating something like BUG_MSG(),
where a custom printk() before issuing BUG(), without confusing the "cut
here" line.

This patch (of 7):

There's no reason to have specialized helpers for passing the warn taint
down to __warn().  Consolidate and refactor helper macros, removing
__WARN_printf() and warn_slowpath_fmt_taint().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Drew Davenport <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: YueHaibing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/gdb: handle split debug

Some systems (like Chrome OS) may use "split debug" for kernel modules.
That means that the debug symbols are in a different file than the main
elf file.  Let's handle that by also searching for debug symbols that end
in ".ko.debug".

This is a packaging topic.  You can take a normal elf file and split the
debug out of it using objcopy.  Try "man objcopy" and then take a look at
the "--only-keep-debug" option.  It'll give you a whole recipe for doing
splitdebug.  The suffix used for the debug symbols is arbitrary.  If
people have other another suffix besides ".ko.debug" then we could
presumably support that too...

For portage (which is the packaging system used by Chrome OS) split debug
is supported by default (and the suffix is .ko.debug).  ...and so in
Chrome OS we always get the installed elf files stripped and then the
symbols stashed away.

At the moment we don't actually use the normal portage magic to do this
for the kernel though since it affects our ability to get good stack dumps
in the kernel.  We instead pass a script as "strip" [1].

[1] https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/refs/heads/master/eclass/cros-kernel/strip_splitdebug

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Jan Kiszka <[email protected]>
Cc: Kieran Bingham <[email protected]>
Cc: Jason Wessel <[email protected]>
Cc: Daniel Thompson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kgdb: don't use a notifier to enter kgdb at panic; call directly

Right now kgdb/kdb hooks up to debug panics by registering for the panic
notifier.  This works OK except that it means that kgdb/kdb gets called
_after_ the CPUs in the system are taken offline.  That means that if
anything important was happening on those CPUs (like something that might
have contributed to the panic) you can't debug them.

Specifically I ran into a case where I got a panic because a task was
"blocked for more than 120 seconds" which was detected on CPU 2.  I nicely
got shown stack traces in the kernel log for all CPUs including CPU 0,
which was running 'PID: 111 Comm: kworker/0:1H' and was in the middle of
__mmc_switch().

I then ended up at the kdb prompt where switched over to kgdb to try to
look at local variables of the process on CPU 0.  I found that I couldn't.
Digging more, I found that I had no info on any tasks running on CPUs
other than CPU 2 and that asking kdb for help showed me "Error: no saved
data for this cpu".  This was because all the CPUs were offline.

Let's move the entry of kdb/kgdb to a direct call from panic() and stop
using the generic notifier.  Putting a direct call in allows us to order
things more properly and it also doesn't seem like we're breaking any
abstractions by calling into the debugger from the panic function.

Daniel said:

: This patch changes the way kdump and kgdb interact with each other.
: However it would seem rather odd to have both tools simultaneously armed
: and, even if they were, the user still has the option to use panic_timeout
: to force a kdump to happen.  Thus I think the change of order is
: acceptable.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Daniel Thompson <[email protected]>
Cc: Jason Wessel <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: YueHaibing <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler: enable CONFIG_OPTIMIZE_INLINING forcibly

Commit 9012d011660e ("compiler: allow all arches to enable
CONFIG_OPTIMIZE_INLINING") allowed all architectures to enable this
option. A couple of build errors were reported by randconfig, but all of
them have been ironed out.

Towards the goal of removing CONFIG_OPTIMIZE_INLINING entirely (and it
will simplify the 'inline' macro in compiler_types.h), this commit changes
it to always-on option. Going forward, the compiler will always be
allowed to not inline functions marked 'inline'.

This is not a problem for x86 since it has been long used by
arch/x86/configs/{x86_64,i386}_defconfig.

I am keeping the config option just in case any problem crops up for other
architectures.

The code clean-up will be done after confirming this is solid.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Nick Desaulniers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uaccess: add missing __must_check attributes

The usercopy implementation comments describe that callers of the
copy_*_user() family of functions must always have their return values
checked. This can be enforced at compile time with __must_check, so add
it where needed.

Link: http://lkml.kernel.org/r/201908251609.ADAD5CAAC1@keescook
Signed-off-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kexec: restore arch_kexec_kernel_image_probe declaration

arch_kexec_kernel_image_probe function declaration has been removed by
commit 9ec4ecef0af7 ("kexec_file,x86,powerpc: factor out kexec_file_ops
functions"). Still this function is overridden by couple of architectures
and proper prototype declaration is therefore important, so bring it back.
This fixes the following sparse warning on s390:
arch/s390/kernel/machine_kexec_file.c:333:5: warning: symbol
'arch_kexec_kernel_image_probe' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/patch.git-ff1c9045ebdc.your-ad-here.call-01564402297-ext-5690@work.hours
Signed-off-by: Vasily Gorbik <[email protected]>
Acked-by: Dave Young <[email protected]>
Reviewed-by: Bhupesh Sharma <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: AKASHI Takahiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kexec: bail out upon SIGKILL when allocating memory.

syzbot found that a thread can stall for minutes inside kexec_load() after
that thread was killed by SIGKILL [1]. It turned out that the reproducer
was trying to allocate 2408MB of memory using kimage_alloc_page() from
kimage_load_normal_segment(). Let's check for SIGKILL before doing memory
allocation.

[1] https://syzkaller.appspot.com/bug?id=a0e3436829698d5824231251fad9d8e998f94f5e

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Eric Biederman <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpumask: nicer for_each_cpumask_and() signature

Mask arguments can be swapped without changing anything. Make arguments
names reflect that:

#define for_each_cpu_and(cpu, mask1, mask2)

Link: http://lkml.kernel.org/r/20190724183350.GA15041@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fork: improve error message for corrupted page tables

When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared).  For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g:
Resident file mapping pages vs Resident shared memory pages).  The loop
index in check_mm() is used to index rss_stat[] which represents
individual memory type stats.  Hence, instead of printing index, print
memory type, thereby improving error message.

Without patch:
--------------
[  204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
[  204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
[  204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
[  204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
-----------
[   69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
[   69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
[   69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
[   69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.

Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
Signed-off-by: Sai Praneeth Prakhya <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fat: delete an unnecessary check before brelse()

brelse() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Markus Elfring <[email protected]>
Acked-by: OGAWA Hirofumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/do_balan.c: remove set but not used variable

Fix the following gcc warning:

fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:629:6: warning: variable ret set but not used
[-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Yan <[email protected]>
Cc: zhengbin <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/journal.c: remove set but not used variable

Fix the following gcc warning:

fs/reiserfs/journal.c: In function flush_used_journal_lists:
fs/reiserfs/journal.c:1791:6: warning: variable ret set but not used
[-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Yan <[email protected]>
Cc: zhengbin <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/do_balan.c: remove set but not used variables

fs/reiserfs/do_balan.c: In function balance_leaf_when_delete:
fs/reiserfs/do_balan.c:245:20: warning: variable ih set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_left:
fs/reiserfs/do_balan.c:301:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:649:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_new_nodes_insert:
fs/reiserfs/do_balan.c:953:7: warning: variable version set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/fix_node.c: remove set but not used variables

fs/reiserfs/fix_node.c: In function get_num_ver:
fs/reiserfs/fix_node.c:379:6: warning: variable cur_free set but not used [-Wunused-but-set-variable]
fs/reiserfs/fix_node.c: In function dc_check_balance_internal:
fs/reiserfs/fix_node.c:1737:6: warning: variable maxsize set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/prints.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/prints.c: In function check_internal_block_head:
fs/reiserfs/prints.c:749:21: warning: variable blkh set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/objectid.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/objectid.c: In function reiserfs_convert_objectid_map_v1:
fs/reiserfs/objectid.c:186:25: warning: variable new_objectid_map set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/lbalance.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/lbalance.c: In function leaf_paste_entries:
fs/reiserfs/lbalance.c:1325:9: warning: variable old_entry_num set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/stree.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/stree.c: In function search_by_key:
fs/reiserfs/stree.c:596:6: warning: variable right_neighbor_of_leaf_node set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/reiserfs/journal.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/journal.c: In function flush_older_commits:
fs/reiserfs/journal.c:894:15: warning: variable first_trans_id set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function flush_journal_list:
fs/reiserfs/journal.c:1354:38: warning: variable last set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function do_journal_release:
fs/reiserfs/journal.c:1916:6: warning: variable flushed set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function do_journal_end:
fs/reiserfs/journal.c:3993:6: warning: variable old_start set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs: reiserfs: remove unnecessary check of bh in remove_from_transaction()

On lines 3430-3434, bh has been assured to be non-null:
    cn = get_journal_hash_dev(sb, journal->j_hash_table, blocknr);
    if (!cn || !cn->bh) {
        return ret;
    }
    bh = cn->bh;

Thus, the check of bh on line 3447 is unnecessary and can be removed.
Thank Andrew Morton for good advice.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jia-Ju Bai <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Hariprasad Kelam <[email protected]>
Cc: Bharath Vedartham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: make git output use LANGUAGE=en_US.utf8

git output parsing depends on the language being en_US english.

Make the backtick execution of all `git <foo>` commands set the
LANGUAGE of the process to en_US.utf8 before executing the actual
command using `export LANGUAGE=en_US.utf8; git <foo>`.

Because the command is executed in a child process, the parent
LANGUAGE is unchanged.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Reported-by: Sean Christopherson <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Tested-by: Sean Christopherson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: remove obsolete period from "ambiguous SHA1" query

Git dropped the period from its "ambiguous SHA1" error message in commit
0c99171ad2 ("get_short_sha1: mark ambiguity error for translation"), circa
2016. Drop the period from checkpatch's associated query so as to match
both the old and new error messages.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: allow consecutive close braces

checkpatch allows consecutive open braces, so it should also allow
consecutive close braces.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Acked-by: Jeff Kirsher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: prefer __section over __attribute__((section(...)))

Add another test for __attribute__((section("foo"))) uses that should be
__section(foo)

Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Nick Desaulniers <[email protected]>
Signed-off-by: Joe Perches <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: exclude sizeof sub-expressions from MACRO_ARG_REUSE

The arguments of sizeof are not evaluated so arguments are safe to re-use
in that context. Excluding sizeof subexpressions means macros like
ARRAY_SIZE can pass checkpatch.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Brendan Jackman <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch.pl: warn on invalid commit id

It can happen that a commit message refers to an invalid commit id,
because the referenced hash changed following a rebase, or simply by
mistake.  Add a check in checkpatch.pl which checks that an hash
referenced by a Fixes tag, or just cited in the commit message, is a valid
commit id.

    $ scripts/checkpatch.pl <<'EOF'
    Subject: [PATCH] test commit

    Sample test commit to test checkpatch.pl
    Commit 1da177e4c3f4 ("Linux-2.6.12-rc2") really exists,
    commit 0bba044c4ce7 ("tree") is valid but not a commit,
    while commit b4cc0b1c0cca ("unknown") is invalid.

Fixes: f0cacc14cade ("unknown")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    EOF
    WARNING: Unknown commit id '0bba044c4ce7', maybe rebased or not pulled?
    #8:
    commit 0bba044c4ce7 ("tree") is valid but not a commit,

    WARNING: Unknown commit id 'b4cc0b1c0cca', maybe rebased or not pulled?
    #9:
    while commit b4cc0b1c0cca ("unknown") is invalid.

    WARNING: Unknown commit id 'f0cacc14cade', maybe rebased or not pulled?
    #11:
Fixes: f0cacc14cade ("unknown")
    total: 0 errors, 3 warnings, 4 lines checked

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matteo Croce <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve SPDX license checking

Use perl's m@<match>@ match and not /<match>/ comparisons to avoid
an error using c90's // comment style.

Miscellanea:

o Use normal tab indentation and alignment

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/f08eb62458407a145cfedf959d1091af151cd665.1563575364.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: don't interpret stack dumps as commit IDs

Add more types of lines that appear to be stack dumps that also include
hex lines that might otherwise be interpreted as commit IDs.

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/f7dc9727795db3802809a24162abe0b67e14123b.1563575364.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/hexdump: make print_hex_dump_bytes() a nop on !DEBUG builds

I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
have DEBUG defined in my build.  The problem is that
print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
consider here

  1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
  2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
  3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out

Right now, that last case isn't detected and we still call
print_hex_dump() from the stub wrapper.

Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
it works properly in all cases.

Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
is a nop, changing behavior to what we want, i.e.  print nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/extable.c: add missing prototypes

When building with W=1, a number of warnings are issued:

  CC      lib/extable.o
lib/extable.c:63:6: warning: no previous prototype for 'sort_extable' [-Wmissing-prototypes]
   63 | void sort_extable(struct exception_table_entry *start,
      |      ^~~~~~~~~~~~
lib/extable.c:75:6: warning: no previous prototype for 'trim_init_extable' [-Wmissing-prototypes]
   75 | void trim_init_extable(struct module *m)
      |      ^~~~~~~~~~~~~~~~~
lib/extable.c:115:1: warning: no previous prototype for 'search_extable' [-Wmissing-prototypes]
  115 | search_extable(const struct exception_table_entry *base,
      | ^~~~~~~~~~~~~~

Add the missing #include for the prototypes.

Link: http://lkml.kernel.org/r/45574.1565235784@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/generic-radix-tree.c: make 2 functions static inline

When building with W=1, we get some warnings:

l  CC      lib/generic-radix-tree.o
lib/generic-radix-tree.c:39:10: warning: no previous prototype for 'genradix_root_to_depth' [-Wmissing-prototypes]
   39 | unsigned genradix_root_to_depth(struct genradix_root *r)
      |          ^~~~~~~~~~~~~~~~~~~~~~
lib/generic-radix-tree.c:44:23: warning: no previous prototype for 'genradix_root_to_node' [-Wmissing-prototypes]
   44 | struct genradix_node *genradix_root_to_node(struct genradix_root *r)
      |                       ^~~~~~~~~~~~~~~~~~~~~

They're not used anywhere else, so make them static inline.

Link: http://lkml.kernel.org/r/46923.1565236485@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

strscpy: reject buffer sizes larger than INT_MAX

As already done for snprintf(), add a check in strscpy() for giant (i.e.
likely negative and/or miscalculated) copy sizes, WARN, and error out.

Link: http://lkml.kernel.org/r/201907260928.23DE35406@keescook
Signed-off-by: Kees Cook <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Yann Droneaud <[email protected]>
Cc: David Laight <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/trace/events/writeback.h: fix -Wstringop-truncation warnings

There are many of those warnings.

In file included from ./arch/powerpc/include/asm/paca.h:15,
                 from ./arch/powerpc/include/asm/current.h:13,
                 from ./include/linux/thread_info.h:21,
                 from ./include/asm-generic/preempt.h:5,
                 from ./arch/powerpc/include/generated/asm/preempt.h:1,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:51,
                 from fs/fs-writeback.c:19:
In function 'strncpy',
    inlined from 'perf_trace_writeback_page_template' at
./include/trace/events/writeback.h:56:1:
./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
  return __builtin_strncpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix it by using the new strscpy_pad() which was introduced in "lib/string:
Add strscpy_pad() function" and will always be NUL-terminated instead of
strncpy().  Also, change strlcpy() to use strscpy_pad() in this file for
consistency.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 455b2864686d ("writeback: Initial tracing support")
Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Nitin Gote <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Stephen Kitt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel-doc: core-api: include string.h into core-api

core-api should show all the various string functions including the newly
added stracpy and stracpy_pad.

Miscellanea:

o Update the Returns: value for strscpy
o fix a defect with %NUL)

[[email protected]: correct return of -E2BIG descriptions]
Link: http://lkml.kernel.org/r/29f998b4c1a9d69fbeae70500ba0daa4b340c546.1563889130.git.joe@perches.com
Link: http://lkml.kernel.org/r/224a6ebf39955f4107c0c376d66155d970e46733.1563841972.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Nitin Gote <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: rework the RB_DECLARE_CALLBACKS macro definition

Change the definition of the RBCOMPUTE function.  The propagate callback
repeatedly calls RBCOMPUTE as it moves from leaf to root.  it wants to
stop recomputing once the augmented subtree information doesn't change.
This was previously checked using the == operator, but that only works
when the augmented subtree information is a scalar field.  This commit
modifies the RBCOMPUTE function so that it now sets the augmented subtree
information instead of returning it, and returns a boolean value
indicating if the propagate callback should stop.

The motivation for this change is that I want to introduce augmented
rbtree uses where the augmented data for the subtree is a struct instead
of a scalar.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro

Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern. This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.

[[email protected]: fix mm/vmalloc.c]
Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[[email protected]: re-add check to check_augmented()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: add comments for RB_DECLARE_CALLBACKS macro

Patch series "make RB_DECLARE_CALLBACKS more generic", v3.

These changes are intended to make the RB_DECLARE_CALLBACKS macro more
generic (allowing the aubmented subtree information to be a struct instead
of a scalar).

I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
check that they didn't change.  This held as expected for interval_tree.o;
mmap.o did have some changes which could be reverted by marking
__vma_link_rb as noinline.  I did not add such a change to the patchset; I
felt it was reasonable enough to leave the inlining decision up to the
compiler.

This patch (of 3):

Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
The arguments are also now capitalized.  This copies the style of the
INTERVAL_TREE_DEFINE macro.

No functional changes in this commit, only comments and capitalization.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rbtree: avoid generating code twice for the cached versions (tools copy)

As was already noted in rbtree.h, the logic to cache rb_first (or
rb_last) can easily be implemented externally to the core rbtree api.

This commit takes the changes applied to the include/linux/ and lib/
rbtree files in 9f973cb38088 ("lib/rbtree: avoid generating code twice
for the cached versions"), and applies these to the
tools/include/linux/ and tools/lib/ files as well to keep them
synchronized.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/elfcore.c: include proper prototypes

When building with W=1, gcc properly complains that there's no prototypes:

  CC      kernel/elfcore.o
kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
      |                 ^~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
   12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
   17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
   22 | size_t __weak elf_core_extra_data_size(void)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~

Provide the include file so gcc is happy, and we don't have potential code drift

Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/coff.h: add include guard

Add a header include guard just in case.

My motivation is to allow Kbuild to detect missing include guard:

https://patchwork.kernel.org/patch/11063011/

Before I enable this checker I want to fix as many headers as possible.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, kmem: do not fail __GFP_NOFAIL charges

Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
create_page_buffers+0x4d/0x60
__block_write_begin_int+0x8e/0x5a0
? ext4_inode_attach_jinode.part.82+0xb0/0xb0
? jbd2__journal_start+0xd7/0x1f0
ext4_da_write_begin+0x112/0x3d0
generic_perform_write+0xf1/0x1b0
? file_update_time+0x70/0x140
__generic_file_write_iter+0x141/0x1a0
ext4_file_write_iter+0xef/0x3b0
__vfs_write+0x17e/0x1e0
vfs_write+0xa5/0x1a0
ksys_write+0x57/0xd0
do_syscall_64+0x55/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
behavior because nofail allocations are not allowed to fail.  Normal
charge path simply forces the charge even if that means to cross the
limit.  Kmem accounting should be doing the same.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Thomas Lindroth <[email protected]>
Debugged-by: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Thomas Lindroth <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'drm-misc-next-fixes-2019-09-23' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

- Multiple panfrost fixes for regulator support and page fault handling
- Some cleanups and fixes in the self-refresh helpers
- Some cleanups and fixes in the atomic helpers

Signed-off-by: Dave Airlie <[email protected]>
From: Maxime Ripard <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923160946.nvaqiw5j7fpcdhc7@gilmour

selftests/bpf: adjust strobemeta loop to satisfy latest clang

Some recent changes in latest Clang started causing the following
warning when unrolling strobemeta test case main loop:

  progs/strobemeta.h:416:2: warning: loop not unrolled: the optimizer was
  unable to perform the requested transformation; the transformation might
  be disabled or specified as part of an unsupported transformation
  ordering [-Wpass-failed=transform-warning]

This patch simplifies loop's exit condition to depend only on constant
max iteration number (STROBE_MAX_MAP_ENTRIES), while moving early
termination logic inside the loop body. The changes are equivalent from
program logic standpoint, but fixes the warning. It also appears to
improve generated BPF code, as it fixes previously failing non-unrolled
strobemeta test cases.

Cc: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

selftests/bpf: delete unused variables in test_sysctl

Remove no longer used variables and avoid compiler warnings.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

libbpf: fix false uninitialized variable warning

Some compilers emit warning for potential uninitialized next_id usage.
The code is correct, but control flow is too complicated for some
compilers to figure this out. Re-initialize next_id to satisfy
compiler.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf/xskmap: Return ERR_PTR for failure case instead of NULL.

When kzalloc() failed, NULL was returned to the caller, which
tested the pointer with IS_ERR(), which didn't match, so the
pointer was used later, resulting in a NULL dereference.

Return ERR_PTR(-ENOMEM) instead of NULL.

Reported-by: [email protected]
Fixes: 0402acd683c6 ("xsk: remove AF_XDP socket from map when the socket is released")
Signed-off-by: Jonathan Lemon <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

selftests/bpf: test_progs: fix client/server race in tcp_rtt

This is the same problem I found earlier in test_sockopt_inherit:
there is a race between server thread doing accept() and client
thread doing connect(). Let's explicitly synchronize them via
pthread conditional variable.

v2:
* don't exit from server_thread without signaling condvar,
fixes possible issue where main() would wait forever (Andrii Nakryiko)

Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

drm/amd/display: prevent memory leak

In dcn*_create_resource_pool the allocated memory should be released if
construct pool fails.

Reviewed-by: Harry Wentland <[email protected]>
Signed-off-by: Navid Emamdoost <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

perf parser: Remove needless include directives

They go on accumulating there like the debug.h one, that was introduced
here:

f23610245c1a ("perf list: Add debug support for outputing alias string")

But then, when that need is removed via:

2073ad3326b7 ("perf tools: Factor out PMU matching in parser")

The thing stays there, so continue the house cleaning spree...

list.h not needed, no macros from there are used, and 'struct
list_head' is in linux/types.h, ditto for util.h, no need for that as
well.

Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf build: Add detection of java-11-openjdk-devel package

With Java 11 there is no seperate JRE anymore.

Details:

  https://coderanch.com/t/701603/java/JRE-JDK

Therefore the detection of the JRE needs to be adapted.

This change works for s390 and x86.  I have not tested other platforms.

Committer testing:

Continues to work with the OpenJDK 8:

  $ rm -f ~acme/lib64/libperf-jvmti.so
  $ rpm -qa | grep jdk-devel
  java-1.8.0-openjdk-devel-1.8.0.222.b10-0.fc30.x86_64
  $ git log --oneline -1
  a51937170f33 (HEAD -> perf/core) perf build: Add detection of java-11-openjdk-devel package
  $ rm -rf /tmp/build/perf ; mkdir -p /tmp/build/perf ; make -C tools/perf O=/tmp/build/perf install > /dev/null 2>1
  $ ls -la ~acme/lib64/libperf-jvmti.so
  -rwxr-xr-x. 1 acme acme 230744 Sep 24 16:46 /home/acme/lib64/libperf-jvmti.so
  $

Suggested-by: Andreas Krebbel <[email protected]>
Signed-off-by: Thomas Richter <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hendrik Brueckner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf jvmti: Include JVMTI support for s390

Enable JVMTI support for s390 perf tool chain.

Signed-off-by: Thomas Richter <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hendrik Brueckner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf vendor events: Remove P8 HW events which are not supported

This patch is to remove following hardware events
from JSON file which are not supported on POWER8.

pm_l3_p0_grp_pump
pm_l3_p0_lco_data
pm_l3_p0_lco_no_data
pm_l3_p0_lco_rty

Note: Unfortunately power8 event list is not publicly available.

Fixes: c3b4d5c4afb0 ("perf vendor events: Remove P8 HW events which are not supported")
Signed-off-by: Mamatha Inamdar <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf evlist: Fix access of freed id arrays

I'm not fully sure if this is the correct fix, but without this I get
crashes on more complex perf stat metric usages. The problem is that
part of the state gets freed when a weak group fails, but then is later
still used. Just don't free the ids, we're going to reuse them anyways
on the weak group retry.

For example:

  % perf stat -M IpB,IpCall,IpTB,IPC,Retiring_SMT,Frontend_Bound_SMT,Kernel_Utilization,CPU_Utilization --metric-only -a -I 1000 sleep 2

  crashes and gives in valgrind:

  =21527== Invalid write of size 8
  ==21527==    at 0x4EE582: hlist_add_head (list.h:644)
  ==21527==    by 0x4EFD3C: perf_evlist__id_hash (evlist.c:477)
  ==21527==    by 0x4EFD99: perf_evlist__id_add (evlist.c:483)
  ==21527==    by 0x4EFF15: perf_evlist__id_add_fd (evlist.c:524)
  ==21527==    by 0x4FC693: store_evsel_ids (evsel.c:2969)
  ==21527==    by 0x4FC76C: perf_evsel__store_ids (evsel.c:2986)
  ==21527==    by 0x450DA7: __run_perf_stat (builtin-stat.c:519)
  ==21527==    by 0x451285: run_perf_stat (builtin-stat.c:636)
  ==21527==    by 0x454619: cmd_stat (builtin-stat.c:1966)
  ==21527==    by 0x4D557D: run_builtin (perf.c:310)
  ==21527==    by 0x4D57EA: handle_internal_command (perf.c:362)
  ==21527==    by 0x4D5931: run_argv (perf.c:406)
  ==21527==  Address 0x12e3f008 is 104 bytes inside a block of size 2,056 free'd
  ==21527==    at 0x4839A0C: free (vg_replace_malloc.c:540)
  ==21527==    by 0x627139: xyarray__delete (xyarray.c:32)
  ==21527==    by 0x4F6BE4: perf_evsel__free_id (evsel.c:1253)
  ==21527==    by 0x4FA11F: evsel__close (evsel.c:1994)
  ==21527==    by 0x4F30A3: perf_evlist__reset_weak_group (evlist.c:1783)
  ==21527==    by 0x450B47: __run_perf_stat (builtin-stat.c:466)
  ==21527==    by 0x451285: run_perf_stat (builtin-stat.c:636)
  ==21527==    by 0x454619: cmd_stat (builtin-stat.c:1966)
  ==21527==    by 0x4D557D: run_builtin (perf.c:310)
  ==21527==    by 0x4D57EA: handle_internal_command (perf.c:362)
  ==21527==    by 0x4D5931: run_argv (perf.c:406)
  ==21527==    by 0x4D5CAE: main (perf.c:531)
  ==21527==  Block was alloc'd at
  ==21527==    at 0x483AB1A: calloc (vg_replace_malloc.c:762)
  ==21527==    by 0x627024: zalloc (zalloc.c:8)
  ==21527==    by 0x627088: xyarray__new (xyarray.c:10)
  ==21527==    by 0x4F6B20: perf_evsel__alloc_id (evsel.c:1237)
  ==21527==    by 0x4FC74E: perf_evsel__store_ids (evsel.c:2983)
  ==21527==    by 0x450DA7: __run_perf_stat (builtin-stat.c:519)
  ==21527==    by 0x451285: run_perf_stat (builtin-stat.c:636)
  ==21527==    by 0x454619: cmd_stat (builtin-stat.c:1966)
  ==21527==    by 0x4D557D: run_builtin (perf.c:310)
  ==21527==    by 0x4D57EA: handle_internal_command (perf.c:362)
  ==21527==    by 0x4D5931: run_argv (perf.c:406)
  ==21527==    by 0x4D5CAE: main (perf.c:531)

Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf stat: Fix free memory access / memory leaks in metrics

Make sure to not free the name passed in by the caller, but free all the
allocated ids when parsing expressions.

The loop at the end knows that the first entry shouldn't be freed, so
make sure the caller name is the first entry.

Fixes

  % perf stat -M IpB,IpCall,IpTB,IPC,Retiring_SMT,Frontend_Bound_SMT,Kernel_Utilization,CPU_Utilization --metric-only -a -I 1000 sleep 2

  valgrind:
       1.009943231 ==21527== Invalid read of size 1
  ==21527==    at 0x483CB74: strcmp (vg_replace_strmem.c:849)
  ==21527==    by 0x582CF8: collect_all_aliases (stat-display.c:554)
  ==21527==    by 0x582EB3: collect_data (stat-display.c:577)
  ==21527==    by 0x583A32: print_counter_aggr (stat-display.c:806)
  ==21527==    by 0x584FAD: perf_evlist__print_counters (stat-display.c:1200)
  ==21527==    by 0x45133A: print_counters (builtin-stat.c:655)
  ==21527==    by 0x450629: process_interval (builtin-stat.c:353)
  ==21527==    by 0x450FBD: __run_perf_stat (builtin-stat.c:564)
  ==21527==    by 0x451285: run_perf_stat (builtin-stat.c:636)
  ==21527==    by 0x454619: cmd_stat (builtin-stat.c:1966)
  ==21527==    by 0x4D557D: run_builtin (perf.c:310)
  ==21527==    by 0x4D57EA: handle_internal_command (perf.c:362)
  ==21527==  Address 0x12826cd0 is 0 bytes inside a block of size 25 free'd
  ==21527==    at 0x4839A0C: free (vg_replace_malloc.c:540)
  ==21527==    by 0x627041: __zfree (zalloc.c:13)
  ==21527==    by 0x57F66A: generic_metric (stat-shadow.c:814)
  ==21527==    by 0x580B21: perf_stat__print_shadow_stats (stat-shadow.c:1057)
  ==21527==    by 0x58418E: print_metric_headers (stat-display.c:943)
  ==21527==    by 0x5844BC: print_interval (stat-display.c:1004)
  ==21527==    by 0x584DEB: perf_evlist__print_counters (stat-display.c:1172)
  ==21527==    by 0x45133A: print_counters (builtin-stat.c:655)
  ==21527==    by 0x450629: process_interval (builtin-stat.c:353)
  ==21527==    by 0x450FBD: __run_perf_stat (builtin-stat.c:564)
  ==21527==    by 0x451285: run_perf_stat (builtin-stat.c:636)
  ==21527==    by 0x454619: cmd_stat (builtin-stat.c:1966)
  ==21527==  Block was alloc'd at
  ==21527==    at 0x483880B: malloc (vg_replace_malloc.c:309)
  ==21527==    by 0x51677DE: strdup (in /usr/lib64/libc-2.29.so)
  ==21527==    by 0x506457: parse_events_name (parse-events.c:1754)
  ==21527==    by 0x5550BB: parse_events_parse (parse-events.y:214)
  ==21527==    by 0x50694D: parse_events__scanner (parse-events.c:1887)
  ==21527==    by 0x506AEF: parse_events (parse-events.c:1927)
  ==21527==    by 0x521D8B: metricgroup__parse_groups (metricgroup.c:527)
  ==21527==    by 0x45156F: parse_metric_groups (builtin-stat.c:721)
  ==21527==    by 0x6228A9: get_value (parse-options.c:243)
  ==21527==    by 0x62363F: parse_short_opt (parse-options.c:348)
  ==21527==    by 0x62363F: parse_options_step (parse-options.c:536)
  ==21527==    by 0x62363F: parse_options_subcommand (parse-options.c:651)
  ==21527==    by 0x453C1D: cmd_stat (builtin-stat.c:1718)
  ==21527==    by 0x4D557D: run_builtin (perf.c:310)

and also a leak report.

Committer testing:

Before:

  # perf stat -M IpB,IpCall,IpTB,IPC,Retiring_SMT,Frontend_Bound_SMT,Kernel_Utilization,CPU_Utilization --metric-only -a -I 1000 sleep 2
  #           time      CPU_Utilization
       1.000470810                      free(): double free detected in tcache 2
  Aborted (core dumped)
  #

After:

  # perf stat -M IpB,IpCall,IpTB,IPC,Retiring_SMT,Frontend_Bound_SMT,Kernel_Utilization,CPU_Utilization --metric-only -a -I 1000 sleep 2
  #           time      CPU_Utilization
       1.000494752                  0.1
       2.001105112                  0.1
  #

Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf tools: Replace needless mmap.h with what is needed, event.h

The perf_sample struct definition and the event_attr_init() are in
util/event.h, but some places were getting it thru an otherwise needless
util/mmap.h header, fix it by including util/event.h directly.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf evsel: Move config terms to a separate header

Further reducing the size of util/evsel.h.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf evlist: Remove unused perf_evlist__fprintf() method

Ditch it, noone is using it, one more stdio.h include in a hot header.

Fix the fallout in parse-events.y, where we end up using a FILE pointer,
I think due to YYDEBUG being set and in some places, like Amazon Linux 1
we don't get stdio.h included by luck, like in most other places, add a
explicit stdio.h include directive.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf evsel: Introduce evsel_fprintf.h

We already had evsel_fprintf.c, add its counterpart, so that we can
reduce evsel.h a bit more.

We needed a new perf_event_attr_fprintf.c file so as to have a separate
object to link with the python binding in tools/perf/util/python-ext-sources
and not drag symbol_conf, etc into the python binding.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf evsel: Remove need for symbol_conf in evsel_fprintf.c

So that we an later link it to the python binding without having to
drag the symbol object files.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

KVM: nVMX: cleanup and fix host 64-bit mode checks

KVM was incorrectly checking vmcs12->host_ia32_efer even if the "load
IA32_EFER" exit control was reset. Also, some checks were not using
the new CC macro for tracing.

Cleanup everything so that the vCPU's 64-bit mode is determined
directly from EFER_LMA and the VMCS checks are based on that, which
matches section 26.2.4 of the SDM.

Cc: Sean Christopherson <[email protected]>
Cc: Krish Sadhukhan <[email protected]>
Fixes: 5845038c111db27902bc220a4f70070fe945871c
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
"The highlights are:

   - automatic recovery of a blacklisted filesystem session (Zheng Yan).
     This is disabled by default and can be enabled by mounting with the
     new "recover_session=clean" option.

   - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
     taken to avoid serializing O_DIRECT reads and writes with each
     other, this is based on the exclusion scheme from NFS.

   - handle large osdmaps better in the face of fragmented memory
     (myself)

   - don't limit what security.* xattrs can be get or set (Jeff Layton).
     We were overly restrictive here, unnecessarily preventing things
     like file capability sets stored in security.capability from
     working.

   - allow copy_file_range() within the same inode and across different
     filesystems within the same cluster (Luis Henriques)"

* tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
  ceph: call ceph_mdsc_destroy from destroy_fs_client
  libceph: use ceph_kvmalloc() for osdmap arrays
  libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
  ceph: allow object copies across different filesystems in the same cluster
  ceph: include ceph_debug.h in cache.c
  ceph: move static keyword to the front of declarations
  rbd: pull rbd_img_request_create() dout out into the callers
  ceph: reconnect connection if session hang in opening state
  libceph: drop unused con parameter of calc_target()
  ceph: use release_pages() directly
  rbd: fix response length parameter for encoded strings
  ceph: allow arbitrary security.* xattrs
  ceph: only set CEPH_I_SEC_INITED if we got a MAC label
  ceph: turn ceph_security_invalidate_secctx into static inline
  ceph: add buffered/direct exclusionary locking for reads and writes
  libceph: handle OSD op ceph_pagelist_append() errors
  ceph: don't return a value from void function
  ceph: don't freeze during write page faults
  ceph: update the mtime when truncating up
  ceph: fix indentation in __get_snap_name()
  ...

Merge tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

- Continue separating the transport (user/kernel communication) and the
   filesystem layers of fuse. Getting rid of most layering violations
   will allow for easier cleanup and optimization later on.

- Prepare for the addition of the virtio-fs filesystem. The actual
   filesystem will be introduced by a separate pull request.

- Convert to new mount API.

- Various fixes, optimizations and cleanups.

* tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits)
  fuse: Make fuse_args_to_req static
  fuse: fix memleak in cuse_channel_open
  fuse: fix beyond-end-of-page access in fuse_parse_cache()
  fuse: unexport fuse_put_request
  fuse: kmemcg account fs data
  fuse: on 64-bit store time in d_fsdata directly
  fuse: fix missing unlock_page in fuse_writepage()
  fuse: reserve byteswapped init opcodes
  fuse: allow skipping control interface and forced unmount
  fuse: dissociate DESTROY from fuseblk
  fuse: delete dentry if timeout is zero
  fuse: separate fuse device allocation and installation in fuse_conn
  fuse: add fuse_iqueue_ops callbacks
  fuse: extract fuse_fill_super_common()
  fuse: export fuse_dequeue_forget() function
  fuse: export fuse_get_unique()
  fuse: export fuse_send_init_request()
  fuse: export fuse_len_args()
  fuse: export fuse_end_request()
  fuse: fix request limit
  ...

Merge tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd

Pull tpm fixes from Jarkko Sakkinen.

* tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd:
  tpm: Wrap the buffer from the caller to tpm_buf in tpm_send()
  MAINTAINERS: keys: Update path to trusted.h
  KEYS: trusted: correctly initialize digests and fix locking issue
  selftests/tpm2: Add log and *.pyc to .gitignore
  selftests/tpm2: Add the missing TEST_FILES assignment

Merge tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"After last week's failed pull request attempt, I scuttled everything
  in the branch except for the directio endio api changes, which were
  trivial. Everything else will simply have to wait for the next cycle.

  Summary:

   - Report both io errors and short io results to the directio endio
     handler.

   - Allow directio callers to pass an ops structure to iomap_dio_rw"

* tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: move the iomap_dio_rw ->end_io callback into a structure
  iomap: split size and error for iomap_dio_rw ->end_io

sched/fair: Avoid redundant EAS calculation

The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Fixes: eb92692b2544 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/core: Remove double update_max_interval() call on CPU startup

update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
and teardown callbacks, but it turns out it's also called at the end of
the startup callback of CPUHP_AP_ACTIVE (which is further down the
startup sequence).

There's no point in repeating this interval update in the startup sequence
since the CPU will remain online until it goes down the teardown path.

Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/core: Fix preempt_schedule() interrupt return comment

preempt_schedule_irq() is the one that should be called on return from
interrupt, clean up the comment to avoid any ambiguity.

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/fair: Fix -Wunused-but-set-variable warnings

Commit:

   de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ben Segall <[email protected]>
Reviewed-by: Dave Chiluk <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()

An oops can be triggered in the scheduler when running qemu on arm64:

Unable to handle kernel paging request at virtual address ffff000008effe40
Internal error: Oops: 96000007 [#1] SMP
Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
pstate: 20000085 (nzCv daIf -PAN -UAO)
pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
lr : move_queued_task.isra.21+0x124/0x298
...
Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The reproduce the crash:

  1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
  sched_setaffinity.

  2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
  and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

  3) Oops appears if the invalid CPU is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Return -ENOMEM to userspace on memory allocation failure

Remove the IPI fallback code from membarrier to deal with very
infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
than GFP_NOWAIT, and relax the blocking guarantees for the expedited
membarrier system call commands, allowing it to block if waiting for
memory to be made available.

In addition, now -ENOMEM can be returned to user-space if the cpumask
memory allocation fails.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Skip IPIs when mm->mm_users == 1

If there is only a single mm_user for the mm, the private expedited
membarrier command can skip the IPIs, because only a single thread
is using the mm.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

selftests, sched/membarrier: Add multi-threaded test

membarrier commands cover very different code paths if they are in
a single-threaded vs multi-threaded process. Therefore, exercise both
scenarios in the kernel selftests to increase coverage of this selftest.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Fix p->mm->membarrier_state racy load

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Call sync_core only before usermode for same mm

When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Remove redundant check

Checking that the number of threads is 1 is redundant with checking
mm_users == 1.

No change in functionality intended.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/membarrier: Fix private expedited registration check

Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.

If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

tasks, sched/core: RCUify the assignment of rq->curr

The current task on the runqueue is currently read with rcu_dereference().

To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.

Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler.  So I will attempt to
describe below why you can effectively have ordinary RCU semantics
without any additional barriers.

The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.

As I write this there are effectively two users of rcu_dereference() on
rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference().  Then there is
task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".

The code in __schedule() essentially does:
rq_lock(...);
smp_mb__after_spinlock();

next = pick_next_task(...);
rq->curr = next;

context_switch(prev, next);

At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier.  Further there is a full memory barrier
in context_switch().

This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing.  A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();

For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.

This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer().

Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl".  These fields are the sched_entity structures
of the varies schedulers.

Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.

Unless I have missed something this means that in practice that the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code

Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited. It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue

In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.

Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <[email protected]>
Inspired-by: Oleg Nesterov <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

tasks: Add a count of task RCU users

Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <[email protected]>
Inspired-by: Oleg Nesterov <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King - ARM Linux admin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sunrpc: clean up indentation issue

There are statements that are indented incorrectly, remove the
extraneous spacing.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

btrfs: Fix a regression which we can't convert to SINGLE profile

[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:

  # btrfs balance start -f -dconvert=single $mnt
  ERROR: error during balancing '/mnt/btrfs': Invalid argument
  # dmesg -t | tail
  validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
  BTRFS error (device dm-3): balance: invalid convert data profile single

[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.

Thus we fail at that (profile & ~allowed) check.

This regression is caused by commit 081db89b13cb ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.

[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.

Reported-by: Chris Murphy <[email protected]>
Fixes: 081db89b13cb ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: [email protected] # 5.3+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386

The following was reported on i386:

  arch/x86/kvm/vmx/vmx.c: In function 'hv_enable_direct_tlbflush':
  arch/x86/kvm/vmx/vmx.c:503:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

pr_debugs() in this function are  more or less useless, let's just
remove them. evmcs->hv_vm_id can use 'unsigned long' instead of 'u64'.

Also, simplify the code a little bit.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot()

Remove the kvm_rebooting check from VMX/SVM instruction exception fixup
now that kvm_spurious_fault() conditions its BUG() on !kvm_rebooting.
Because the 'cleanup_insn' functionally is also gone, deferring to
kvm_spurious_fault() means __kvm_handle_fault_on_reboot() can eliminate
its .fixup code entirely and have its exception table entry branch
directly to the call to kvm_spurious_fault().

Cc: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Drop ____kvm_handle_fault_on_reboot()

Remove the variation of __kvm_handle_fault_on_reboot() that accepts a
post-fault cleanup instruction now that its sole user (VMREAD) uses
a different method for handling faults.

Acked-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Add error handling to VMREAD helper

Now that VMREAD flows require a taken branch, courtesy of commit

3901336ed9887 ("x86/kvm: Don't call kvm_spurious_fault() from .fixup")

bite the bullet and add full error handling to VMREAD, i.e. replace the
JMP added by __ex()/____kvm_handle_fault_on_reboot() with a hinted Jcc.

To minimize the code footprint, add a helper function, vmread_error(),
to handle both faults and failures so that the inline flow has a single
CALL.

Acked-by: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Optimize VMX instruction error and fault handling

Rework the VMX instruction helpers using asm-goto to branch directly
to error/fault "handlers" in lieu of using __ex(), i.e. the generic
____kvm_handle_fault_on_reboot(). Branching directly to fault handling
code during fixup avoids the extra JMP that is inserted after every VMX
instruction when using the generic "fault on reboot" (see commit
3901336ed9887, "x86/kvm: Don't call kvm_spurious_fault() from .fixup").

Opportunistically clean up the helpers so that they all have consistent
error handling and messages.

Leave the usage of ____kvm_handle_fault_on_reboot() (via __ex()) in
kvm_cpu_vmxoff() and nested_vmx_check_vmentry_hw() as is. The VMXOFF
case is not a fast path, i.e. the cleanliness of __ex() is worth the
JMP, and the extra JMP in nested_vmx_check_vmentry_hw() is unavoidable.

Note, VMREAD cannot get the asm-goto treatment as output operands aren't
compatible with GCC's asm-goto due to internal compiler restrictions.

Acked-by: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

btrfs: relocation: fix use-after-free on dead relocation roots

[BUG]
One user reported a reproducible KASAN report about use-after-free:

  BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
  BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
  ==================================================================
  BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
  Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579

  CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
  Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
  Call Trace:
   dump_stack+0x7b/0xba
   print_address_description+0x6c/0x22e
   ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   __kasan_report.cold+0x1b/0x3b
   ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   kasan_report+0x12/0x17
   __asan_report_store8_noabort+0x17/0x20
   btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   record_root_in_trans+0x2a0/0x370 [btrfs]
   btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
   start_transaction+0x1ab/0xe90 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
   ? lock_repin_lock+0x400/0x400
   ? __kmem_cache_shutdown.cold+0x140/0x1ad
   ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
   finish_ordered_fn+0x15/0x20 [btrfs]
   normal_work_helper+0x1bd/0xca0 [btrfs]
   ? process_one_work+0x819/0x1720
   ? kasan_check_read+0x11/0x20
   btrfs_endio_write_helper+0x12/0x20 [btrfs]
   process_one_work+0x8c9/0x1720
   ? pwq_dec_nr_in_flight+0x2f0/0x2f0
   ? worker_thread+0x1d9/0x1030
   worker_thread+0x98/0x1030
   kthread+0x2bb/0x3b0
   ? process_one_work+0x1720/0x1720
   ? kthread_park+0x120/0x120
   ret_from_fork+0x35/0x40

  Allocated by task 369692:
   __kasan_kmalloc.part.0+0x44/0xc0
   __kasan_kmalloc.constprop.0+0xba/0xc0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x138/0x260
   btrfs_read_tree_root+0x92/0x360 [btrfs]
   btrfs_read_fs_root+0x10/0xb0 [btrfs]
   create_reloc_root+0x47d/0xa10 [btrfs]
   btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
   record_root_in_trans+0x2a0/0x370 [btrfs]
   btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
   start_transaction+0x1ab/0xe90 [btrfs]
   btrfs_start_transaction+0x1e/0x20 [btrfs]
   __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
   btrfs_prealloc_file_range+0x13/0x20 [btrfs]
   prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
   relocate_file_extent_cluster+0x193/0xc30 [btrfs]
   relocate_data_extent+0x1f8/0x490 [btrfs]
   relocate_block_group+0x600/0x1060 [btrfs]
   btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
   btrfs_relocate_chunk+0x9e/0x180 [btrfs]
   btrfs_balance+0x14e4/0x2fc0 [btrfs]
   btrfs_ioctl_balance+0x47f/0x640 [btrfs]
   btrfs_ioctl+0x119d/0x8380 [btrfs]
   do_vfs_ioctl+0x9f5/0x1060
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x73/0xb0
   do_syscall_64+0xa5/0x370
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 369692:
   __kasan_slab_free+0x14f/0x210
   kasan_slab_free+0xe/0x10
   kfree+0xd8/0x270
   btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
   clean_dirty_subvols+0x227/0x340 [btrfs]
   relocate_block_group+0x972/0x1060 [btrfs]
   btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
   btrfs_relocate_chunk+0x9e/0x180 [btrfs]
   btrfs_balance+0x14e4/0x2fc0 [btrfs]
   btrfs_ioctl_balance+0x47f/0x640 [btrfs]
   btrfs_ioctl+0x119d/0x8380 [btrfs]
   do_vfs_ioctl+0x9f5/0x1060
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x73/0xb0
   do_syscall_64+0xa5/0x370
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff88856f671100
   which belongs to the cache kmalloc-4k of size 4096
  The buggy address is located 1552 bytes inside of
   4096-byte region [ffff88856f671100, ffff88856f672100)
  The buggy address belongs to the page:
  page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
  flags: 0x2ffff0000010200(slab|head)
  raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
  raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
   ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================
  BTRFS info (device sdi1): 1 enospc errors during balance
  BTRFS info (device sdi1): balance: ended with status: -28

[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)

That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.

The following race could cause the use-after-free problem:

                CPU1              |                CPU2
--------------------------------------------------------------------------
                                  | relocate_block_group()
                                  | |- unset_reloc_control(rc)
                                  | |- btrfs_commit_transaction()
btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
|- btrfs_join_transaction()       |    |
   |- record_root_in_trans()      |    |
      |- btrfs_init_reloc_root()  |    |
         |- if (root->reloc_root) |    |
         |                        |    |- root->reloc_root = NULL
         |                        |    |- btrfs_drop_snapshot(reloc_root);
         |- reloc_root->last_trans|
                 = trans->transid |
    ^^^^^^^^^^^^^^^^^^^^^^
            Use after free

[FIX]
Fix it by the following modifications:

- Test if the root has dead reloc tree before accessing root->reloc_root
  If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
  create or update root->reloc_tree

- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
  reloc tree
  To co-operate with above modification, so as long as
  BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
  reloc tree at record_root_in_trans().

Reported-by: Cebtenzzre <[email protected]>
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: [email protected] # 5.1+
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

KVM: x86: Check kvm_rebooting in kvm_spurious_fault()

Explicitly check kvm_rebooting in kvm_spurious_fault() prior to invoking
BUG(), as opposed to assuming the caller has already done so. Letting
kvm_spurious_fault() be called "directly" will allow VMX to better
optimize its low level assembly flows.

As a happy side effect, kvm_spurious_fault() no longer needs to be
marked as a dead end since it doesn't unconditionally BUG().

Acked-by: Paolo Bonzini <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: fix ucall on x86

After commit e8bb4755eea2("KVM: selftests: Split ucall.c into architecture
specific files") selftests which use ucall on x86 started segfaulting and
apparently it's gcc to blame: it "optimizes" ucall() function throwing away
va_start/va_end part because it thinks the structure is not being used.
Previously, it couldn't do that because the there was also MMIO version and
the decision which particular implementation to use was done at runtime.

With older gccs it's possible to solve the problem by adding 'volatile'
to 'struct ucall' but at least with gcc-8.3 this trick doesn't work.

'memory' clobber seems to do the job.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

perf copyfile: Move copyfile routines to separate files

Further reducing the util.c hodgepodge files.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Add perf_evlist__poll() function

Move perf_evlist__poll() from tools/perf to libperf, it will be used in
the following patches.

And rename the existing perf's function to evlist__poll().

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Add perf_evlist__add_pollfd() function

Move perf_evlist__add_pollfd() from tools/perf to libperf, it will be
used in the following patches.

Also rename perf's perf_evlist__add_pollfd()/perf_evlist__filter_pollfd()
to evlist__add_pollfd()/evlist__filter_pollfd().

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Add perf_evlist__alloc_pollfd() function

Move perf_evlist__alloc_pollfd() from tools/perf to libperf, it will be
used in the following patches.

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
[ Added api/fd/array.h include to the lib/evlist.c file ]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Add libperf_init() call to the tests

Add libperf_init() call to the automated tests.

Committer notes:

Added missing stdarg.h and/or stdio.h to places using vfprintf.

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Merge libperf_set_print() into libperf_init()

The libperf_set_print() function needs to be called in any case so let's
merge it with libperf_init(), so we have just one init function.

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

libperf: Add libperf dependency for tests targets

Add libperf dependency for tests targets.

Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Michael Petlan <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>