Git Repo - linux.git/log

fs/binfmt_elf: use PT_LOAD p_align values for static PIE

Extend commit ce81bb256a22 ("fs/binfmt_elf: use PT_LOAD p_align values
for suitable start address") which fixed PIE binaries built with
-Wl,-z,max-page-size=0x200000, to cover static PIE binaries. This
fixes:

https://bugzilla.kernel.org/show_bug.cgi?id=215275

Tested by verifying static PIE binaries with -Wl,-z,max-page-size=0x200000 loading.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: H.J. Lu <[email protected]>
Cc: Chris Kennelly <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Song Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sandeep Patil <[email protected]>
Cc: Fangrui Song <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

const_structs.checkpatch: add frequently used ops structs

Add commonly used structs (>50 instances) which are always or almost
always const.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rikard Falkeborn <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve Kconfig help test

The Kconfig help test erroneously counts patch context lines as part of
the help text.

Fix that and improve the message block output.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: relax regexp for COMMIT_LOG_LONG_LINE

One exceptions to the COMMIT_LOG_LONG_LINE rule is a file path followed
by ':'.  That is typically some sort diagnostic message from a compiler
or a build tool, in which case we don't want to wrap the lines but keep
the message unmodified.

The regular expression used to match this pattern currently doesn't
accept absolute paths or + characters.  This can result in false
positives as in the following (out-of-tree) example:

  ...
  /home/jerome/work/optee_repo_qemu/build/../toolchains/aarch32/bin/arm-linux-gnueabihf-ld.bfd: /home/jerome/work/toolchains-gcc10.2/aarch32/bin/../lib/gcc/arm-none-linux-gnueabihf/10.2.1/../../../../arm-none-linux-gnueabihf/lib/libstdc++.a(eh_alloc.o): in function `__cxa_allocate_exception':
  /tmp/dgboter/bbs/build03--cen7x86_64/buildbot/cen7x86_64--arm-none-linux-gnueabihf/build/src/gcc/libstdc++-v3/libsupc++/eh_alloc.cc:284: undefined reference to `malloc'
  ...

Update the regular expression to match the above paths.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jerome Forissier <[email protected]>
Acked-by: Joe Perches <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_meminit: destroy cache in kmem_cache_alloc_bulk() test

Make do_kmem_cache_size_bulk() destroy the cache it creates.

Link: https://lkml.kernel.org/r/aced20a94bf04159a139f0846e41d38a1537debb.1640018297.git.andreyknvl@google.com
Fixes: 03a9349ac0e0 ("lib/test_meminit: add a kmem_cache_alloc_bulk() test")
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uuid: remove licence boilerplate text from the header

Remove licence boilerplate text from the UAPI header.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uuid: discourage people from using UAPI header in new code

Discourage people from using UAPI header in new code by adding a note.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kunit: replace kernel.h with the necessary inclusions

When kernel.h is used in the headers it adds a lot into dependency hell,
especially when there are circular dependencies are involved.

Replace kernel.h inclusion with the list of what is really being used.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Brendan Higgins <[email protected]>
Tested-by: Brendan Higgins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

test_hash.c: refactor into kunit

Use KUnit framework to make tests more easily integrable with CIs. Even
though these tests are not yet properly written as unit tests this
change should help in debugging.

Also remove kernel messages (i.e. through pr_info) as KUnit handles all
debugging output and let it handle module init and exit details.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: David Gow <[email protected]>
Reported-by: kernel test robot <[email protected]>
Tested-by: David Gow <[email protected]>
Co-developed-by: Augusto Durães Camargo <[email protected]>
Signed-off-by: Augusto Durães Camargo <[email protected]>
Co-developed-by: Enzo Ferreira <[email protected]>
Signed-off-by: Enzo Ferreira <[email protected]>
Signed-off-by: Isabella Basso <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/Kconfig.debug: properly split hash test kernel entries

Split TEST_HASH so that each entry only has one file.

Note that there's no stringhash test file, but actually
<linux/stringhash.h> tests are performed in lib/test_hash.c.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: David Gow <[email protected]>
Tested-by: David Gow <[email protected]>
Signed-off-by: Isabella Basso <[email protected]>
Cc: Augusto Durães Camargo <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: Enzo Ferreira <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

test_hash.c: split test_hash_init

Split up test_hash_init so that it calls each test more explicitly
insofar it is possible without rewriting the entire file. This aims at
improving readability.

Split tests performed on string_or as they don't interfere with those
performed in hash_or. Also separate pr_info calls about skipped tests
as they're not part of the tests themselves, but only warn about
(un)defined arch-specific hash functions.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: David Gow <[email protected]>
Tested-by: David Gow <[email protected]>
Signed-off-by: Isabella Basso <[email protected]>
Cc: Augusto Durães Camargo <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: Enzo Ferreira <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

test_hash.c: split test_int_hash into arch-specific functions

Split the test_int_hash function to keep its mainloop separate from
arch-specific chunks, which are only compiled as needed. This aims at
improving readability.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: David Gow <[email protected]>
Tested-by: David Gow <[email protected]>
Signed-off-by: Isabella Basso <[email protected]>
Cc: Augusto Durães Camargo <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: Enzo Ferreira <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hash.h: remove unused define directive

Patch series "test_hash.c: refactor into KUnit", v3.

We refactored the lib/test_hash.c file into KUnit as part of the student
group LKCAMP [1] introductory hackathon for kernel development.

This test was pointed to our group by Daniel Latypov [2], so its full
conversion into a pure KUnit test was our goal in this patch series, but
we ran into many problems relating to it not being split as unit tests,
which complicated matters a bit, as the reasoning behind the original
tests is quite cryptic for those unfamiliar with hash implementations.

Some interesting developments we'd like to highlight are:

- In patch 1/5 we noticed that there was an unused define directive
   that could be removed.

- In patch 4/5 we noticed how stringhash and hash tests are all under
   the lib/test_hash.c file, which might cause some confusion, and we
   also broke those kernel config entries up.

Overall KUnit developments have been made in the other patches in this
series:

In patches 2/5, 3/5 and 5/5 we refactored the lib/test_hash.c file so as
to make it more compatible with the KUnit style, whilst preserving the
original idea of the maintainer who designed it (i.e.  George Spelvin),
which might be undesirable for unit tests, but we assume it is enough
for a first patch.

This patch (of 5):

Currently, there exist hash_32() and __hash_32() functions, which were
introduced in a patch [1] targeting architecture specific optimizations.
These functions can be overridden on a per-architecture basis to achieve
such optimizations.  They must set their corresponding define directive
(HAVE_ARCH_HASH_32 and HAVE_ARCH__HASH_32, respectively) so that header
files can deal with these overrides properly.

As the supported 32-bit architectures that have their own hash function
implementation (i.e.  m68k, Microblaze, H8/300, pa-risc) have only been
making use of the (more general) __hash_32() function (which only lacks
a right shift operation when compared to the hash_32() function), remove
the define directive corresponding to the arch-specific hash_32()
implementation.

[1] https://lore.kernel.org/lkml/20160525073311 [email protected]/

[[email protected]: hash_32_generic() becomes hash_32()]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: David Gow <[email protected]>
Tested-by: David Gow <[email protected]>
Co-developed-by: Augusto Durães Camargo <[email protected]>
Signed-off-by: Augusto Durães Camargo <[email protected]>
Co-developed-by: Enzo Ferreira <[email protected]>
Signed-off-by: Enzo Ferreira <[email protected]>
Signed-off-by: Isabella Basso <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: kernel test robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/list_debug.c: print more list debugging context in __list_del_entry_valid()

Currently, the entry->prev and entry->next are considered to be valid as
long as they are not LIST_POISON{1|2}.  However, the memory may be
corrupted.  The prev->next is invalid probably because 'prev' is
invalid, not because prev->next's content is illegal.

Unfortunately, the printk and its subfunctions will modify the registers
that hold the 'prev' and 'next', and we don't see this valuable
information in the BUG context.

So print the contents of 'entry->prev' and 'entry->next'.

Here's an example:
  list_del corruption. prev->next should be c0ecbf74, but was c08410dc
  kernel BUG at lib/list_debug.c:53!
  ... ...
  PC is at __list_del_entry_valid+0x58/0x98
  LR is at __list_del_entry_valid+0x58/0x98
  psr: 60000093
  sp : c0ecbf30  ip : 00000000  fp : 00000001
  r10: c08410d0  r9 : 00000001  r8 : c0825e0c
  r7 : 20000013  r6 : c08410d0  r5 : c0ecbf74  r4 : c0ecbf74
  r3 : c0825d08  r2 : 00000000  r1 : df7ce6f4  r0 : 00000044
  ... ...
  Stack: (0xc0ecbf30 to 0xc0ecc000)
  bf20:                                     c0ecbf74 c0164fd0 c0ecbf70 c0165170
  bf40: c0eca000 c0840c00 c0840c00 c0824500 c0825e0c c0189bbc c088f404 60000013
  bf60: 60000013 c0e85100 000004ec 00000000 c0ebcdc0 c0ecbf74 c0ecbf74 c0825d08
  bf80: c0e807c0 c018965c 00000000 c013f2a0 c0e807c0 c013f154 00000000 00000000
  bfa0: 00000000 00000000 00000000 c01001b0 00000000 00000000 00000000 00000000
  bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
  (__list_del_entry_valid) from (__list_del_entry+0xc/0x20)
  (__list_del_entry) from (finish_swait+0x60/0x7c)
  (finish_swait) from (rcu_gp_kthread+0x560/0xa20)
  (rcu_gp_kthread) from (kthread+0x14c/0x15c)
  (kthread) from (ret_from_fork+0x14/0x24)

At first, I thought prev->next was overwritten.  Later, I carefully
analyzed the RCU code and the disassembly code.  The error occurred when
deleting a node from the list rcu_state.gp_wq.  The System.map shows
that the address of rcu_state is c0840c00.  Then I use gdb to obtain the
offset of rcu_state.gp_wq.task_list.

  (gdb) p &((struct rcu_state *)0)->gp_wq.task_list
  $1 = (struct list_head *) 0x4dc

Again:
  list_del corruption. prev->next should be c0ecbf74, but was c08410dc

  c08410dc = c0840c00 + 0x4dc = &rcu_state.gp_wq.task_list

Because rcu_state.gp_wq has at most one node, so I can guess that "prev
= &rcu_state.gp_wq.task_list".  But for other scenes, maybe I wasn't so
lucky, I cannot figure out the value of 'prev'.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhen Lei <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

list: introduce list_is_head() helper and re-use it in list.h

Introduce list_is_head() in the similar (*) way as it's done for
list_entry_is_head().  Make use of it in the list.h.

*) it's done as inliner and not a macro to be aligned with other
   list_is_*() APIs; while at it, make all three to have the same
   style.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Heikki Krogerus <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kstrtox: uninline everything

I've made a mistake of looking into lib/kstrtox.o code generation.

The only function remotely performance critical is _parse_integer()
(via /proc/*/map_files/*), everything else is not.

Uninline everything, shrink lib/kstrtox.o by ~20 % !

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 0/23 up/down: 0/-1269 (-1269 !!!)
Function                                     old     new   delta
kstrtoull                                     16      13      -3
kstrtouint                                    59      48     -11
kstrtou8                                      60      49     -11
kstrtou16                                     61      50     -11
_kstrtoul                                     46      35     -11
kstrtoull_from_user                           95      83     -12
kstrtoul_from_user                            95      83     -12
kstrtoll                                      93      80     -13
kstrtouint_from_user                         124      83     -41
kstrtou8_from_user                           125      83     -42
kstrtou16_from_user                          126      83     -43
kstrtos8                                     101      50     -51
kstrtos16                                    102      51     -51
kstrtoint                                    100      49     -51
_kstrtol                                      93      35     -58
kstrtobool_from_user                         156      75     -81
kstrtoll_from_user                           165      83     -82
kstrtol_from_user                            165      83     -82
kstrtoint_from_user                          172      83     -89
kstrtos8_from_user                           173      83     -90
kstrtos16_from_user                          174      83     -91
_parse_integer                               136      10    -126
_kstrtoull                                   308     101    -207
Total: Before=3421236, After=3419967, chg -0.04%

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

get_maintainer: don't remind about no git repo when --nogit is used

When --nogit is used with scripts/get_maintainer.pl, the script spews 4
lines of unnecessary information (noise).  Do not print those lines when
--nogit is specified.

This change removes the printing of these 4 lines:

  ./scripts/get_maintainer.pl: No supported VCS found.  Add --nogit to options?
  Using a git repository produces better results.
  Try Linus Torvalds' latest git repository using:
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/sys.c: only take tasklist_lock for get/setpriority(PRIO_PGRP)

PRIO_PGRP needs the tasklist_lock mainly to serialize vs setpgid(2), to
protect against any concurrent change_pid(PIDTYPE_PGID) that can move
the task from one hlist to another while iterating.

However, the remaining can only rely only on RCU:

PRIO_PROCESS only does the task lookup and never iterates over tasklist
and we already have an rcu-aware stable pointer.

PRIO_USER is already racy vs setuid(2) so with creds being rcu
protected, we can end up seeing stale data. When removing the
tasklist_lock there can be a race with (i) fork but this is benign as
the child's nice is inherited and the new task is not observable by the
user yet either, hence the return semantics do not differ. And (ii) a
race with exit, which is a small window and can cause us to miss a task
which was removed from the list and it had the highest nice.

Similarly change the buggy do_each_thread/while_each_thread combo in
PRIO_USER for the rcu-safe for_each_process_thread flavor, which doesn't
make use of next_thread/p->thread_group.

[[email protected]: coding style fixes]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kthread: dynamically allocate memory to store kthread's full name

When I was implementing a new per-cpu kthread cfs_migration, I found the
comm of it "cfs_migration/%u" is truncated due to the limitation of
TASK_COMM_LEN.  For example, the comm of the percpu thread on CPU10~19
all have the same name "cfs_migration/1", which will confuse the user.
This issue is not critical, because we can get the corresponding CPU
from the task's Cpus_allowed.  But for kthreads corresponding to other
hardware devices, it is not easy to get the detailed device info from
task comm, for example,

    jbd2/nvme0n1p2-
    xfs-reclaim/sdf

Currently there are so many truncated kthreads:

    rcu_tasks_kthre
    rcu_tasks_rude_
    rcu_tasks_trace
    poll_mpt3sas0_s
    ext4-rsv-conver
    xfs-reclaim/sd{a, b, c, ...}
    xfs-blockgc/sd{a, b, c, ...}
    xfs-inodegc/sd{a, b, c, ...}
    audit_send_repl
    ecryptfs-kthrea
    vfio-irqfd-clea
    jbd2/nvme0n1p2-
    ...

We can shorten these names to work around this problem, but it may be
not applied to all of the truncated kthreads.  Take 'jbd2/nvme0n1p2-'
for example, it is a nice name, and it is not a good idea to shorten it.

One possible way to fix this issue is extending the task comm size, but
as task->comm is used in lots of places, that may cause some potential
buffer overflows.  Another more conservative approach is introducing a
new pointer to store kthread's full name if it is truncated, which won't
introduce too much overhead as it is in the non-critical path.  Finally
we make a dicision to use the second approach.  See also the discussions
in this thread:
https://lore.kernel.org/lkml/20211101060419 [email protected]/

After this change, the full name of these truncated kthreads will be
displayed via /proc/[pid]/comm:

    rcu_tasks_kthread
    rcu_tasks_rude_kthread
    rcu_tasks_trace_kthread
    poll_mpt3sas0_statu
    ext4-rsv-conversion
    xfs-reclaim/sdf1
    xfs-blockgc/sdf1
    xfs-inodegc/sdf1
    audit_send_reply
    ecryptfs-kthread
    vfio-irqfd-cleanup
    jbd2/nvme0n1p2-8

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Suggested-by: Petr Mladek <[email protected]>
Suggested-by: Steven Rostedt <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN

As the sched:sched_switch tracepoint args are derived from the kernel,
we'd better make it same with the kernel. So the macro TASK_COMM_LEN is
converted to type enum, then all the BPF programs can get it through
BTF.

The BPF program which wants to use TASK_COMM_LEN should include the
header vmlinux.h. Regarding the test_stacktrace_map and
test_tracepoint, as the type defined in linux/bpf.h are also defined in
vmlinux.h, so we don't need to include linux/bpf.h again.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tools/bpf/bpftool/skeleton: replace bpf_probe_read_kernel with bpf_probe_read_kernel_str to get task comm

bpf_probe_read_kernel_str() will add a nul terminator to the dst, then
we don't care about if the dst size is big enough.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

samples/bpf/test_overhead_kprobe_kern: replace bpf_probe_read_kernel with bpf_probe_read_kernel_str to get task comm

bpf_probe_read_kernel_str() will add a nul terminator to the dst, then
we don't care about if the dst size is big enough. This patch also
replaces the hard-coded 16 with TASK_COMM_LEN to make it grepable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf: replace open-coded string copy with get_task_comm

It is better to use get_task_comm() instead of the open coded string
copy as we do in other places.

struct elf_prpsinfo is used to dump the task information in userspace
coredump or kernel vmcore.  Below is the verification of vmcore,

  crash> ps
     PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
        0      0   0  ffffffff9d21a940  RU   0.0       0      0  [swapper/0]
  >     0      0   1  ffffa09e40f85e80  RU   0.0       0      0  [swapper/1]
  >     0      0   2  ffffa09e40f81f80  RU   0.0       0      0  [swapper/2]
  >     0      0   3  ffffa09e40f83f00  RU   0.0       0      0  [swapper/3]
  >     0      0   4  ffffa09e40f80000  RU   0.0       0      0  [swapper/4]
  >     0      0   5  ffffa09e40f89f80  RU   0.0       0      0  [swapper/5]
        0      0   6  ffffa09e40f8bf00  RU   0.0       0      0  [swapper/6]
  >     0      0   7  ffffa09e40f88000  RU   0.0       0      0  [swapper/7]
  >     0      0   8  ffffa09e40f8de80  RU   0.0       0      0  [swapper/8]
  >     0      0   9  ffffa09e40f95e80  RU   0.0       0      0  [swapper/9]
  >     0      0  10  ffffa09e40f91f80  RU   0.0       0      0  [swapper/10]
  >     0      0  11  ffffa09e40f93f00  RU   0.0       0      0  [swapper/11]
  >     0      0  12  ffffa09e40f90000  RU   0.0       0      0  [swapper/12]
  >     0      0  13  ffffa09e40f9bf00  RU   0.0       0      0  [swapper/13]
  >     0      0  14  ffffa09e40f98000  RU   0.0       0      0  [swapper/14]
  >     0      0  15  ffffa09e40f9de80  RU   0.0       0      0  [swapper/15]

It works well as expected.

Some comments are added to explain why we use the hard-coded 16.

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/infiniband: replace open-coded string copy with get_task_comm

We'd better use the helper get_task_comm() rather than the open-coded
strlcpy() to get task comm. As the comment above the hard-coded 16, we
can replace it with TASK_COMM_LEN.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Dennis Dalessandro <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/exec: replace strncpy with strscpy_pad in __get_task_comm

If the dest buffer size is smaller than sizeof(tsk->comm), the buffer
will be without null ternimator, that may cause problem. Using
strscpy_pad() instead of strncpy() in __get_task_comm() can make the
string always nul ternimated and zero padded.

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Kees Cook <[email protected]>
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/exec: replace strlcpy with strscpy_pad in __set_task_comm

Patch series "task comm cleanups", v2.

This patchset is part of the patchset "extend task comm from 16 to
24"[1].  Now we have different opinion that dynamically allocates memory
to store kthread's long name into a separate pointer, so I decide to
take the useful cleanups apart from the original patchset and send it
separately[2].

These useful cleanups can make the usage around task comm less
error-prone.  Furthermore, it will be useful if we want to extend task
comm in the future.

[1]. https://lore.kernel.org/lkml/20211101060419 [email protected]/
[2]. https://lore.kernel.org/lkml/CALOAHbAx55AUo3bm8ZepZSZnw7A08cvKPdPyNTf=E_tPqmw5hw@mail.gmail.com/

This patch (of 7):

strlcpy() can trigger out-of-bound reads on the source string[1], we'd
better use strscpy() instead.  To make it be robust against full tsk->comm
copies that got noticed in other places, we should make sure it's zero
padded.

[1] https://github.com/KSPP/linux/issues/89

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Michal Miroslaw <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel.h: include a note to discourage people from including it in headers

Include a note at the top to discourage people from including it in
headers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/unaligned: replace kernel.h with the necessary inclusions

When kernel.h is used in the headers it adds a lot into dependency hell,
especially when there are circular dependencies are involved.

Replace kernel.h inclusion with the list of what is really being used.

The rest of the changes are induced by the above and may not be split.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Arend van Spriel <[email protected]> [brcmfmac]
Acked-by: Kalle Valo <[email protected]>
Cc: Arend van Spriel <[email protected]>
Cc: Franky Lin <[email protected]>
Cc: Hante Meuleman <[email protected]>
Cc: Chi-hsien Lin <[email protected]>
Cc: Wright Feng <[email protected]>
Cc: Chung-hsien Hsu <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Heikki Krogerus <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl: remove redundant ret assignment

Subsequent if judgments will assign new values to ret, so the statement
here should be deleted

The clang_analyzer complains as follows:

fs/proc/proc_sysctl.c:
Value stored to 'ret' is never read

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: luo penghao <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Acked-by: Luis Chamberlain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl: fix duplicate path separator in printed entries

sysctl_print_dir() always terminates the printed path name with a slash,
so printing a slash before the file part causes a duplicate like in

sysctl duplicate entry: /kernel//perf_user_access

Fix this by dropping the extra slash.

Link: https://lkml.kernel.org/r/e3054d605dc56f83971e4b6d2f5fa63a978720ad.1641551872.git.geert+renesas@glider.be
Signed-off-by: Geert Uytterhoeven <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Acked-by: Luis Chamberlain <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: convert the return type of proc_fd_access_allowed() to be boolean

Convert return type of proc_fd_access_allowed() and the 'allowed' in it
to be boolean since the return type of ptrace_may_access() is boolean.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qi Zheng <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: make the proc_create[_data]() stubs static inlines

Change the proc_create[_data]() stubs which are used when CONFIG_PROC_FS
is not set from #defines to a static inline stubs.

This should fix clang -Werror builds failing due to errors like this:

drivers/platform/x86/thinkpad_acpi.c:918:30: error: unused variable
'dispatch_proc_ops' [-Werror,-Wunused-const-variable]

Fixing this in include/linux/proc_fs.h should ensure that the same issue
is also fixed in any other drivers hitting the same -Werror issue.

[[email protected]: fix CONFIG_PROC_FS=n]
[[email protected]: fix arch/sparc/kernel/led.c]
[[email protected]: fix build]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reported-by: kernel test robot <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: David Howells <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc/vmcore: don't fake reading zeroes on surprise vmcore_cb unregistration

In commit cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback
to more generic vmcore callbacks"), we added detection of surprise
vmcore_cb unregistration after the vmcore was already opened.  Once
detected, we warn the user and simulate reading zeroes from that point
on when accessing the vmcore.

The basic reason was that unexpected unregistration, for example, by
manually unbinding a driver from a device after opening the vmcore, is
not supported and could result in reading oldmem the vmcore_cb would
have actually prohibited while registered.  However, something like that
can similarly be trigger by a user that's really looking for trouble
simply by unbinding the relevant driver before opening the vmcore -- or
by disallowing loading the driver in the first place.  So it's actually
of limited help.

Currently, unregistration can only be triggered via virtio-mem when
manually unbinding the driver from the device inside the VM; there is no
way to trigger it from the hypervisor, as hypervisors don't allow for
unplugging virtio-mem devices -- ripping out system RAM from a VM
without coordination with the guest is usually not a good idea.

The important part is that unbinding the driver and unregistering the
vmcore_cb while concurrently reading the vmcore won't crash the system,
and that is handled by the rwsem.

To make the mechanism more future proof, let's remove the "read zero"
part, but leave the warning in place.  For example, we could have a
future driver (like virtio-balloon) that will contact the hypervisor to
figure out if we already populated a page for a given PFN.
Hotunplugging such a device and consequently unregistering the vmcore_cb
could be triggered from the hypervisor without harming the system even
while kdump is running.  In that case, we don't want to silently end up
with a vmcore that contains wrong data, because the user inside the VM
might be unaware of the hypervisor action and might easily miss the
warning in the log.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Philipp Rudo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: percpu: add generic pcpu_populate_pte() function

With NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, we need a function to
populate pte, this patch adds a generic pcpu populate pte function,
pcpu_populate_pte(), which is marked __weak and used on most
architectures, but it is overridden on x86, which has its own
implementation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: percpu: add generic pcpu_fc_alloc/free funciton

With the previous patch, we could add a generic pcpu first chunk
allocate and free function to cleanup the duplicated definations on each
architecture.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedef

Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t, pcpu
first chunk allocation will call it to alloc memblock on the
corresponding node by it, this is prepare for the next patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: percpu: generalize percpu related config

Patch series "mm: percpu: Cleanup percpu first chunk function".

When supporting page mapping percpu first chunk allocator on arm64, we
found there are lots of duplicated codes in percpu embed/page first chunk
allocator.  This patchset is aimed to cleanup them and should no function
change.

The currently supported status about 'embed' and 'page' in Archs shows
below,

embed: NEED_PER_CPU_PAGE_FIRST_CHUNK
page:  NEED_PER_CPU_EMBED_FIRST_CHUNK

embed page
------------------------
arm64   Y Y
mips   Y N
powerpc   Y Y
riscv   Y N
sparc   Y Y
x86   Y Y
------------------------

There are two interfaces about percpu first chunk allocator,

extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                size_t atom_size,
                                pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
-                               pcpu_fc_alloc_fn_t alloc_fn,
-                               pcpu_fc_free_fn_t free_fn);
+                               pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);

extern int __init pcpu_page_first_chunk(size_t reserved_size,
-                               pcpu_fc_alloc_fn_t alloc_fn,
-                               pcpu_fc_free_fn_t free_fn,
-                               pcpu_fc_populate_pte_fn_t populate_pte_fn);
+                               pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);

The pcpu_fc_alloc_fn_t/pcpu_fc_free_fn_t is killed, we provide generic
pcpu_fc_alloc() and pcpu_fc_free() function, which are called in the
pcpu_embed/page_first_chunk().

1) For pcpu_embed_first_chunk(), pcpu_fc_cpu_to_node_fn_t is needed to be
   provided when archs supported NUMA.

2) For pcpu_page_first_chunk(), the pcpu_fc_populate_pte_fn_t is killed too,
   a generic pcpu_populate_pte() which marked '__weak' is provided, if you
   need a different function to populate pte on the arch(like x86), please
   provide its own implementation.

[1] https://github.com/kevin78/linux.git percpu-cleanup

This patch (of 4):

The HAVE_SETUP_PER_CPU_AREA/NEED_PER_CPU_EMBED_FIRST_CHUNK/
NEED_PER_CPU_PAGE_FIRST_CHUNK/USE_PERCPU_NUMA_NODE_ID configs, which have
duplicate definitions on platforms that subscribe it.

Move them into mm, drop these redundant definitions and instead just
select it on applicable platforms.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Cc: Will Deacon <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cifs: update internal module number

To 2.35

Signed-off-by: Steve French <[email protected]>

smb3: send NTLMSSP version information

For improved debugging it can be helpful to send version information
as other clients do during NTLMSSP negotiation. See protocol document
MS-NLMP section 2.2.1.1

Set the major and minor versions based on the kernel version, and the
BuildNumber based on the internal cifs.ko module version number,
and following the recommendation in the protocol documentation
(MS-NLMP section 2.2.10) we set the NTLMRevisionCurrent field to 15.

Reviewed-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

riscv: fix boolconv.cocci warnings

arch/riscv/mm/init.c:48:11-16: WARNING: conversion to bool not needed here

Remove unneeded conversion to bool

Semantic patch information:
Relational and logical operators evaluate to bool,
explicit conversion is overly verbose and unneeded.

Generated by: scripts/coccinelle/misc/boolconv.cocci

Reported-by: kernel test robot <[email protected]>
Signed-off-by: kernel test robot <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Introduce sv48 support without relocatable kernel

This patchset allows to have a single kernel for sv39 and sv48 without
being relocatable.

The idea comes from Arnd Bergmann who suggested to do the same as x86,
that is mapping the kernel to the end of the address space, which allows
the kernel to be linked at the same address for both sv39 and sv48 and
then does not require to be relocated at runtime.

This implements sv48 support at runtime. The kernel will try to boot
with 4-level page table and will fallback to 3-level if the HW does not
support it. Folding the 4th level into a 3-level page table has almost
no cost at runtime.

Note that kasan region had to be moved to the end of the address space
since its location must be known at compile-time and then be valid for
both sv39 and sv48 (and sv57 that is coming).

* riscv-sv48-v3:
  riscv: Explicit comment about user virtual address space size
  riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo
  riscv: Implement sv48 support
  asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
  riscv: Allow to dynamically define VA_BITS
  riscv: Introduce functions to switch pt_ops
  riscv: Split early kasan mapping to prepare sv48 introduction
  riscv: Move KASAN mapping next to the kernel mapping
  riscv: Get rid of MAXPHYSMEM configs

Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Explicit comment about user virtual address space size

Define precisely the size of the user accessible virtual space size
for sv32/39/48 mmu types and explain why the whole virtual address
space is split into 2 equal chunks between kernel and user space.

Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo

Now that the mmu type is determined at runtime using SATP
characteristic, use the global variable pgtable_l4_enabled to output
mmu type of the processor through /proc/cpuinfo instead of relying on
device tree infos.

Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Implement sv48 support

By adding a new 4th level of page table, give the possibility to 64bit
kernel to address 2^48 bytes of virtual address: in practice, that offers
128TB of virtual address space to userspace and allows up to 64TB of
physical memory.

If the underlying hardware does not support sv48, we will automatically
fallback to a standard 3-level page table by folding the new PUD level into
PGDIR level. In order to detect HW capabilities at runtime, we
use SATP feature that ignores writes with an unsupported mode.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

asm-generic: Prepare for riscv use of pud_alloc_one and pud_free

In the following commits, riscv will almost use the generic versions of
pud_alloc_one and pud_free but an additional check is required since those
functions are only relevant when using at least a 4-level page table, which
will be determined at runtime on riscv.

So move the content of those functions into other functions that riscv
can use without duplicating code.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Allow to dynamically define VA_BITS

With 4-level page table folding at runtime, we don't know at compile time
the size of the virtual address space so we must set VA_BITS dynamically
so that sparsemem reserves the right amount of memory for struct pages.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Introduce functions to switch pt_ops

This simply gathers the different pt_ops initialization in functions
where a comment was added to explain why the page table operations must
be changed along the boot process.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Split early kasan mapping to prepare sv48 introduction

Now that kasan shadow region is next to the kernel, for sv48, this
region won't be aligned on PGDIR_SIZE and then when populating this
region, we'll need to get down to lower levels of the page table. So
instead of reimplementing the page table walk for the early population,
take advantage of the existing functions used for the final population.

Note that kasan swapper initialization must also be split since memblock
is not initialized at this point and as the last PGD is shared with the
kernel, we'd need to allocate a PUD so postpone the kasan final
population after the kernel population is done.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Move KASAN mapping next to the kernel mapping

Now that KASAN_SHADOW_OFFSET is defined at compile time as a config,
this value must remain constant whatever the size of the virtual address
space, which is only possible by pushing this region at the end of the
address space next to the kernel mapping.

Signed-off-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Get rid of MAXPHYSMEM configs

CONFIG_MAXPHYSMEM_* are actually never used, even the nommu defconfigs
selecting the MAXPHYSMEM_2GB had no effects on PAGE_OFFSET since it was
preempted by !MMU case right before.

In addition, the move of the kernel mapping at the end of the address
space broke the use of MAXPHYSMEM_2G with MMU since it defines PAGE_OFFSET
at the same address as the kernel mapping.

Reported-by: Geert Uytterhoeven <[email protected]>
Fixes: 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping")
Signed-off-by: Alexandre Ghiti <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Tested-by: Conor Dooley <[email protected]>
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

xfs: flush inodegc workqueue tasks before cancel

The xfs_inodegc_stop() helper performs a high level flush of pending
work on the percpu queues and then runs a cancel_work_sync() on each
of the percpu work tasks to ensure all work has completed before
returning. While cancel_work_sync() waits for wq tasks to complete,
it does not guarantee work tasks have started. This means that the
_stop() helper can queue and instantly cancel a wq task without
having completed the associated work. This can be observed by
tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
test:

xfs_destroy_inode: ... ino 0x83 ...
xfs_inode_set_need_inactive: ... ino 0x83 ...
xfs_inodegc_stop: ...
...
xfs_inodegc_start: ...
xfs_inodegc_worker: ...
xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive
state set, but the inactivation work has not completed before the
inodegc mechanism stops. The inactivation doesn't actually occur
until the fs is unfrozen and the gc mechanism starts back up. Note
that this test requires fsfreeze to reproduce because xfs_freeze
indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first
tries to steal the pending bit, which does not succeed because the
bit has been set by queue_work_on(). Subsequently, it checks for
association of a pool workqueue from the work item under the pool
lock. This association is set at the point a work item is queued and
cleared when dequeued for processing. If the association exists, the
work item is removed from the queue and cancel_work_sync() returns
true. If the pwq association is cleared, the remove attempt assumes
the task is busy and retries (eventually returning false to the
caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before
cancel. However, since the _queue_all() already schedules each
underlying work item, the workqueue level helpers are sufficient to
achieve the same ordering effect. E.g., the inodegc enabled flag
prevents scheduling any further work in the _stop() case. Use the
drain_workqueue() helper in this particular case to make the intent
a bit more self explanatory.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

io-wq: delete dead lock shuffling code

We used to have more code around the work loop, but now the goto and
lock juggling just makes it less readable than it should. Get rid of it.

Signed-off-by: Jens Axboe <[email protected]>

clk: mediatek: relicense mt7986 clock driver to GPL-2.0

The previous mt7986 clock drivers were incorrectly marked as GPL-1.0.
This patch changes the driver to the standard GPL-2.0 license.

Signed-off-by: Sam Shih <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reported-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>

riscv: bpf: Fix eBPF's exception tables

eBPF's exception tables needs to be modified to relative synchronously.

Suggested-by: Tong Tiangen <[email protected]>
Signed-off-by: Jisheng Zhang <[email protected]>
Fixes: 1f77ed9422cb ("riscv: switch to relative extable and other improvements")
Signed-off-by: Palmer Dabbelt <[email protected]>

kvm: selftests: Do not indent with spaces

Some indentation with spaces crept in, likely due to terminal-based
cut and paste. Clean it up.

Signed-off-by: Paolo Bonzini <[email protected]>

kvm: selftests: sync uapi/linux/kvm.h with Linux header

KVM_CAP_XSAVE2 is out of sync due to a conflict. Copy the whole
file while at it.

Reported-by: Yang Zhong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

riscv: mm: init: try best to remove #ifdef CONFIG_XIP_KERNEL usage

Currently, the #ifdef CONFIG_XIP_KERNEL usage can be divided into the
following three types:

The first one is for functions/declarations only used in XIP case.

The second one is for XIP_FIXUP case. Something as below:
|foo_type foo;
|#ifdef CONFIG_XIP_KERNEL
|#define foo (*(foo_type *)XIP_FIXUP(&foo))
|#endif

Usually, it's better to let the foo macro sit with the foo var
together. But if various foos are defined adjacently, we can
save some #ifdef CONFIG_XIP_KERNEL usage by grouping them together.

The third one is for different implementations for XIP, usually, this
is a #ifdef...#else...#endif case.

This patch moves the pt_ops macro to adjacent #ifdef CONFIG_XIP_KERNEL
and group first type usage cases into one.

Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: mm: init: try IS_ENABLED(CONFIG_XIP_KERNEL) instead of #ifdef

Try our best to replace the conditional compilation using
"#ifdef CONFIG_XIP_KERNEL" with "IS_ENABLED(CONFIG_XIP_KERNEL)", to
simplify the code and to increase compile coverage.

Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: mm: init: remove _pt_ops and use pt_ops directly

Except "pt_ops", other global vars when CONFIG_XIP_KERNEL=y is defined
as below:

|foo_type foo;
|#ifdef CONFIG_XIP_KERNEL
|#define foo (*(foo_type *)XIP_FIXUP(&foo))
|#endif

Follow the same way for pt_ops to unify the style and to simplify code.

Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: mm: init: try best to use IS_ENABLED(CONFIG_64BIT) instead of #ifdef

Try our best to replace the conditional compilation using
"#ifdef CONFIG_64BIT" by a check for "IS_ENABLED(CONFIG_64BIT)", to
simplify the code and to increase compile coverage.

Now we can also remove the __maybe_unused used in max_mapped_addr
declaration.

We also remove the BUG_ON check of mapping the last 4K bytes of the
addressable memory since this is always true for every kernel actually.

Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: mm: init: remove unnecessary "#ifdef CONFIG_CRASH_DUMP"

The is_kdump_kernel() returns false for !CRASH_DUMP case, so we don't
need the #ifdef CONFIG_CRASH_DUMP for is_kdump_kernel() checking.

Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

tools headers UAPI: Sync x86 arch prctl headers with the kernel sources

To pick the changes in this cset:

  980fe2fddcff2193 ("x86/fpu: Extend fpu_xstate_prctl() with guest permissions")

This picks these new prctls:

  $ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/before
  $ cp arch/x86/include/uapi/asm/prctl.h tools/arch/x86/include/uapi/asm/prctl.h
  $ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/after
  $ diff -u /tmp/before /tmp/after
  --- /tmp/before 2022-01-19 14:40:05.049394977 -0300
  +++ /tmp/after 2022-01-19 14:40:35.628154565 -0300
  @@ -9,6 +9,8 @@
    [0x1021 - 0x1001]= "GET_XCOMP_SUPP",
    [0x1022 - 0x1001]= "GET_XCOMP_PERM",
    [0x1023 - 0x1001]= "REQ_XCOMP_PERM",
  + [0x1024 - 0x1001]= "GET_XCOMP_GUEST_PERM",
  + [0x1025 - 0x1001]= "REQ_XCOMP_GUEST_PERM",
   };

   #define x86_arch_prctl_codes_2_offset 0x2001
  $

With this 'perf trace' can translate those numbers into strings and use
the strings in filter expressions:

  # perf trace -e prctl
       0.000 ( 0.011 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9c014b7df5)     = 0
       0.032 ( 0.002 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9bb6b51580)     = 0
       5.452 ( 0.003 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfeb70) = 0
       5.468 ( 0.002 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfea70) = 0
      24.494 ( 0.009 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f562a32ae28) = 0
      24.540 ( 0.002 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f563c6d4b30) = 0
     670.281 ( 0.008 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30805c8) = 0
     670.293 ( 0.002 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30800f0) = 0
  ^C#

This addresses these perf build warnings:

  Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/prctl.h' differs from latest version at 'arch/x86/include/uapi/asm/prctl.h'
  diff -u tools/arch/x86/include/uapi/asm/prctl.h arch/x86/include/uapi/asm/prctl.h

Cc: Paolo Bonzini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

cifs: Support fscache indexing rewrite

Change the cifs filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in cifs.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

(2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For cifs, I've made it render the volume name string as:

"cifs,<ipaddress>,<sharename>"

     where the sharename has '/' characters replaced with ';'.

     This probably needs rethinking a bit as the total name could exceed
     the maximum filename component length.

     Further, the coherency data is currently just set to 0.  It needs
     something else doing with it - I wonder if it would suffice simply to
     sum the resource_id, vol_create_time and vol_serial_number or maybe
     hash them.

(3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

(4) The functions to set/reset cookies are removed and
     fscache_use_cookie() and fscache_unuse_cookie() are used instead.

     fscache_use_cookie() is passed a flag to indicate if the cookie is
     opened for writing.  fscache_unuse_cookie() is passed updates for the
     metadata if we changed it (ie. if the file was opened for writing).

     These are called when the file is opened or closed.

(5) cifs_setattr_*() are made to call fscache_resize() to change the size
     of the cache object.

(6) The functions to read and write data are stubbed out pending a
     conversion to use netfslib.

Changes
=======
ver #8:
- Abstract cache invalidation into a helper function.
- Fix some checkpatch warnings[3].

ver #7:
- Removed the accidentally added-back call to get the super cookie in
   cifs_root_iget().
- Fixed the right call to cifs_fscache_get_super_cookie() to take account
   of the "-o fsc" mount flag.

ver #6:
- Moved the change of gfpflags_allow_blocking() to current_is_kswapd() for
   cifs here.
- Fixed one of the error paths in cifs_atomic_open() to jump around the
   call to use the cookie.
- Fixed an additional successful return in the middle of cifs_open() to
   use the cookie on the way out.
- Only get a volume cookie (and thus inode cookies) when "-o fsc" is
   supplied to mount.

ver #5:
- Fixed a couple of bits of cookie handling[2]:
   - The cookie should be released in cifs_evict_inode(), not
     cifsFileInfo_put_final().  The cookie needs to persist beyond file
     closure so that writepages will be able to write to it.
   - fscache_use_cookie() needs to be called in cifs_atomic_open() as it is
     for cifs_open().

ver #4:
- Fixed the use of sizeof with memset.
- tcon->vol_create_time is __le64 so doesn't need cpu_to_le64().

ver #3:
- Canonicalise the cifs coherency data to make the cache portable.
- Set volume coherency data.

ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- Upgraded to -rc4 to allow for upstream changes[1].
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <[email protected]>
Acked-by: Jeff Layton <[email protected]>
cc: Steve French <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23b55d673d7527b093cd97b7c217c82e70cd1af0
Link: https://lore.kernel.org/r/[email protected]/
Link: https://lore.kernel.org/r/CAH2r5muTanw9pJqzAHd01d9A8keeChkzGsCEH6=0rHutVLAF-A@mail.gmail.com/
Link: https://lore.kernel.org/r/163819671009.215744.11230627184193298714.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163906982979.143852.10672081929614953210.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163967187187.1823006.247415138444991444.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/164021579335.640689.2681324337038770579.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/[email protected]/
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Steve French <[email protected]>

selftests: kvm: add amx_test to .gitignore

amx_test's binary should be present in the .gitignore file for the git
to ignore it.

Fixes: bf70636d9443 ("selftest: kvm: Add amx selftest")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Message-Id: <20220118122053.1941915 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled

Nullify svm_x86_ops.vcpu_(un)blocking if AVIC/APICv is disabled as the
hooks are necessary only to clear the vCPU's IsRunning entry in the
Physical APIC and to update IRTE entries if the VM has a pass-through
device attached.

Opportunistically rename the helpers to clarify their AVIC relationship.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops

Move svm_hardware_setup() below svm_x86_ops so that KVM can modify ops
during setup, e.g. the vcpu_(un)blocking hooks can be nullified if AVIC
is disabled or unsupported.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Drop AVIC's intermediate avic_set_running() helper

Drop avic_set_running() in favor of calling avic_vcpu_{load,put}()
directly, and modify the block+put path to use preempt_disable/enable()
instead of get/put_cpu(), as it doesn't actually care about the current
pCPU associated with the vCPU. Opportunistically add lockdep assertions
as being preempted in avic_vcpu_put() would lead to consuming stale data,
even though doing so _in the current code base_ would not be fatal.

Add a much needed comment explaining why svm_vcpu_blocking() needs to
unload the AVIC and update the IRTE _before_ the vCPU starts blocking.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Don't do full kick when handling posted interrupt wakeup

When waking vCPUs in the posted interrupt wakeup handling, do exactly
that and no more. There is no need to kick the vCPU as the wakeup
handler just needs to get the vCPU task running, and if it's in the guest
then it's definitely running.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Fold fallback path into triggering posted IRQ helper

Move the fallback "wake_up" path into the helper to trigger posted
interrupt helper now that the nested and non-nested paths are identical.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ

Refactor the posted interrupt helper to take the desired notification
vector instead of a bool so that the callers are self-documenting.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Don't do full kick when triggering posted interrupt "fails"

Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
guest is not IN_GUEST_MODE. If the guest transitions into guest mode
between the check and the kick, then it's guaranteed to see the pending
interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
IN_GUEST_MODE. Kicking the guest in this case is nothing more than an
unnecessary VM-Exit (and host IRQ).

Opportunistically update comments to explain the various ordering rules
and barriers at play.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU

Don't bother updating the Physical APIC table or IRTE when loading a vCPU
that is blocking, i.e. won't be marked IsRun{ning}=1, as the pCPU is
queried if and only if IsRunning is '1'. If the vCPU was migrated, the
new pCPU will be picked up when avic_vcpu_load() is called by
svm_vcpu_unblocking().

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption

Use kvm_vcpu_is_blocking() to determine whether or not the vCPU should be
marked running during avic_vcpu_load(). Drop avic_is_running, which
really should have been named "vcpu_is_not_blocking", as it tracked if
the vCPU was blocking, not if it was actually running, e.g. it was set
during svm_create_vcpu() when the vCPU was obviously not running.

This is technically a teeny tiny functional change, as the vCPU will be
marked IsRunning=1 on being reloaded if the vCPU is preempted between
svm_vcpu_blocking() and prepare_to_rcuwait(). But that's a benign change
as the vCPU will be marked IsRunning=0 when KVM voluntarily schedules out
the vCPU.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path

Remove handling of KVM_REQ_APICV_UPDATE from svm_vcpu_unblocking(), it's
no longer needed as it was made obsolete by commit df7e4827c549 ("KVM:
SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC").
Prior to that commit, the manual check was necessary to ensure the AVIC
stuff was updated by avic_set_running() when a request to enable APICv
became pending while the vCPU was blocking, as the request handling
itself would not do the update. But, as evidenced by the commit, that
logic was flawed and subject to various races.

Now that svm_refresh_apicv_exec_ctrl() does avic_vcpu_load/put() in
response to an APICv status change, drop the manual check in the
unblocking path.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs

Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
VM-Exit due to incomplete IPI delivery. The check isn't wrong per se, but
it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
was one of the vCPUs that didn't receive the IPI.

The check isn't required for correctness as blocking == !running in this
context.

From a performance perspective, waking a live task is not expensive as the
only moderately costly operation is a locked operation to temporarily
disable preemption. And if that is indeed a performance issue,
kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode

Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU
is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
next VMRUN, which unconditionally processes the vIRR.

Add comments to document the logic.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks

Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers

Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
transitions.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Move preemption timer <=> hrtimer dance to common x86

Handle the switch to/from the hypervisor/software timer when a vCPU is
blocking in common x86 instead of in VMX. Even though VMX is the only
user of a hypervisor timer, the logic and all functions involved are
generic x86 (unless future CPUs do something completely different and
implement a hypervisor timer that runs regardless of mode).

Handling the switch in common x86 will allow for the elimination of the
pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
timer if and only if it was in use (without additional params). Add a
comment explaining why the switch cannot be deferred to kvm_sched_out()
or kvm_vcpu_block().

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx

Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
rename the list and all associated variables to clarify that it tracks
the set of vCPU that need to be poked on a posted interrupt to the wakeup
vector. The list is not used to track _all_ vCPUs that are blocking, and
the term "blocked" can be misleading as it may refer to a blocking
condition in the host or the guest, where as the PI wakeup case is
specifically for the vCPUs that are actively blocking from within the
guest.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Drop unused kvm_vcpu.pre_pcpu field

Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional
change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Handle PI descriptor updates during vcpu_put/load

Move the posted interrupt pre/post_block logic into vcpu_put/load
respectively, using the kvm_vcpu_is_blocking() to determining whether or
not the wakeup handler needs to be set (and unset). This avoids updating
the PI descriptor if halt-polling is successful, reduces the number of
touchpoints for updating the descriptor, and eliminates the confusing
behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
is scheduled back in after preemption.

The downside is that KVM will do the PID update twice if the vCPU is
preempted after prepare_to_rcuwait() but before schedule(), but that's a
rare case (and non-existent on !PREEMPT kernels).

The notable wart is the need to send a self-IPI on the wakeup vector if
an outstanding notification is pending after configuring the wakeup
vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
but the scheduler doesn't support waking a task from its preemption
notifier callback, i.e. while the task is right in the middle of
being scheduled out.

Note, setting the wakeup vector before halt-polling is not necessary:
once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events()
will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
terminate the polling.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <20211208015236.1616697 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-pi-raw-spinlock' into HEAD

Bring in fix for VT-d posted interrupts before further changing the code in 5.17.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: avoid warning on s390 in mark_page_dirty

Avoid warnings on s390 like
[ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G            E     5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1
[ 1801.980938] Workqueue: events irqfd_inject [kvm]
[...]
[ 1801.981057] Call Trace:
[ 1801.981060]  [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm]
[ 1801.981083]  [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm]
[ 1801.981104]  [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm]
[ 1801.981124]  [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm]
[ 1801.981144]  [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm]
[ 1801.981164]  [<0000000175e56906>] process_one_work+0x1fe/0x470
[ 1801.981173]  [<0000000175e570a4>] worker_thread+0x64/0x498
[ 1801.981176]  [<0000000175e5ef2c>] kthread+0x10c/0x110
[ 1801.981180]  [<0000000175de73c8>] __ret_from_fork+0x40/0x58
[ 1801.981185]  [<000000017698440a>] ret_from_fork+0xa/0x40

when writing to a guest from an irqfd worker as long as we do not have
the dirty ring.

Signed-off-by: Christian Borntraeger <[email protected]>
Reluctantly-acked-by: David Woodhouse <[email protected]>
Message-Id: <20220113122924 [email protected]>
Fixes: 2efd61a608b0 ("KVM: Warn if mark_page_dirty() is called without an active vCPU")
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: Add a test to force emulation with a pending exception

Add a VMX specific test to verify that KVM doesn't explode if userspace
attempts KVM_RUN when emulation is required with a pending exception.
KVM VMX's emulation support for !unrestricted_guest punts exceptions to
userspace instead of attempting to synthesize the exception with all the
correct state (and stack switching, etc...).

Punting is acceptable as there's never been a request to support
injecting exceptions when emulating due to invalid state, but KVM has
historically assumed that userspace will do the right thing and either
clear the exception or kill the guest. Deliberately do the opposite and
attempt to re-enter the guest with a pending exception and emulation
required to verify KVM continues to punt the combination to userspace,
e.g. doesn't explode, WARN, etc...

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211228232437.1875318 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Reject KVM_RUN if emulation is required with pending exception

Reject KVM_RUN if emulation is required (because VMX is running without
unrestricted guest) and an exception is pending, as KVM doesn't support
emulating exceptions except when emulating real mode via vm86.  The vCPU
is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
the impossible condition.  Alternatively, the WARN could be removed, but
then userspace and/or KVM bugs would result in the vCPU silently running
in a bad state, which isn't very friendly to users.

Originally, the bug was hit by syzkaller with a nested guest as that
doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
is likely fixed by commit cd0e615c49e5 ("KVM: nVMX: Synthesize
TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
trigger the WARN with a non-nested guest, and userspace can likely force
bad state via ioctls() for a nested guest as well.

Checking for the impossible condition needs to be deferred until KVM_RUN
because KVM can't force specific ordering between ioctls.  E.g. clearing
exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
emulation_required would prevent userspace from queuing an exception and
then stuffing sregs.  Note, if KVM were to try and detect/prevent the
condition prior to KVM_RUN, handle_invalid_guest_state() and/or
handle_emulation_failure() would need to be modified to clear the pending
exception prior to exiting to userspace.

------------[ cut here ]------------
WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
Call Trace:
  kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
  kvm_vcpu_ioctl+0x279/0x690 [kvm]
  __x64_sys_ioctl+0x83/0xb0
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported-by: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20211228232437.1875318 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Add test for KVM_SET_PMU_EVENT_FILTER

Verify that the PMU event filter works as expected.

Note that the virtual PMU doesn't work as expected on AMD Zen CPUs (an
intercepted rdmsr is counted as a retired branch instruction), but the
PMU event filter does work.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Introduce x86_model()

Extract the x86 model number from CPUID.01H:EAX.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Export x86_family() for use outside of processor.c

Move this static inline function to processor.h, so that it can be
used in individual tests, as needed.

Opportunistically replace the bare 'unsigned' with 'unsigned int.'

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Introduce is_amd_cpu()

Replace the one ad hoc "AuthenticAMD" CPUID vendor string comparison
with a new function, is_amd_cpu().

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: kvm/x86: Parameterize the CPUID vendor string check

Refactor is_intel_cpu() to make it easier to reuse the bulk of the
code for other vendors in the future.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/pmu: Use binary search to check filtered events

The PMU event filter may contain up to 300 events. Replace the linear
search in reprogram_gp_counter() with a binary search.

Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <20220115052431 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

cifs: cifs_ses_mark_for_reconnect should also update reconnect bits

Recent restructuring of cifs_reconnect introduced a helper func
named cifs_ses_mark_for_reconnect, which updates the state of tcp
session for all the channels of a session for reconnect.

However, this does not update the session state and chans_need_reconnect
bitmask. This change fixes that.

Also, cifs_mark_tcp_sess_for_reconnect should mark set the bitmask
for all channels when the whole session is marked for reconnect.
Fixed that here too.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: update tcpStatus during negotiate and sess setup

Till the end of SMB session setup, update tcpStatus and
avoid updating session status field. There was a typo in
cifs_setup_session, which caused ses->status to be updated
instead. This was causing issues during reconnect.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: make status checks in version independent callers

The status of tcp session, smb session and tcon have the
same flow, irrespective of the SMB version used. Hence
these status checks and updates should happen in the
version independent callers of these commands.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: remove repeated state change in dfs tree connect

cifs_tree_connect checks and sets the tidStatus for the tcon.
cifs_tree_connect also calls a dfs specific tree connect
function, which also does similar checks. This should
not happen. Removing it with this change.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: fix the cifs_reconnect path for DFS

Recently, the cifs_reconnect code was refactored into
two branches for regular vs dfs codepath. Some of my
recent changes were missing in the dfs path, namely the
code to enable periodic DNS query, and a missing lock.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: remove unused variable ses_selected

ses_selected is being declared and set at several places. It is not
being used. Remove it.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: protect all accesses to chan_* with chan_lock

A spin lock called chan_lock was introduced recently.
But not all accesses were protected. Doing that with
this change.

To make sure that a channel is not freed when in use,
we need to introduce a ref count. But today, we don't
ever free channels.

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>

cifs: fix the connection state transitions with multichannel

Recent changes to multichannel required some adjustments in
the way connection states transitioned during/after reconnect.

Also some minor fixes:
1. A pending switch of GlobalMid_Lock to cifs_tcp_ses_lock
2. Relocations of the code that logs reconnect
3. Changed some code in allocate_mid to suit the new scheme

Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>