Git Repo - linux.git/log
linux.git
4 years ago  kasan: clean up setting free info in kasan_slab_free
Andrey Konovalov [Fri, 26 Feb 2021 01:20:07 +0000 (17:20 -0800)]
kasan: clean up setting free info in kasan_slab_free

Put kasan_stack_collection_enabled() check and kasan_set_free_info() calls
next to each other.

The way this was previously implemented was a minor optimization that
relied on the fact that kasan_stack_collection_enabled() is always
true for generic KASAN.  The confusion that this brings outweighs saving
a few instructions.
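
A rough sketch of the resulting structure in the free path (illustrative
only; the argument list is an assumption, not the exact hunk):

  if (kasan_stack_collection_enabled())
          kasan_set_free_info(cache, object, tag);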

Link: https://lkml.kernel.org/r/f838e249be5ab5810bf54a36ef5072cfd80e2da7.1612546384.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kasan: optimize large kmalloc poisoning
Andrey Konovalov [Fri, 26 Feb 2021 01:20:03 +0000 (17:20 -0800)]
kasan: optimize large kmalloc poisoning

Similarly to kasan_kmalloc(), kasan_kmalloc_large() doesn't need to
unpoison the object, as it was already unpoisoned by alloc_pages() (or by
ksize() for krealloc()).

This patch changes kasan_kmalloc_large() to only poison the redzone.

Link: https://lkml.kernel.org/r/33dee5aac0e550ad7f8e26f590c9b02c6129b4a3.1612546384.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kasan, mm: optimize kmalloc poisoning
Andrey Konovalov [Fri, 26 Feb 2021 01:19:59 +0000 (17:19 -0800)]
kasan, mm: optimize kmalloc poisoning

For allocations from kmalloc caches, kasan_kmalloc() always follows
kasan_slab_alloc().  Currently, both of them unpoison the whole object,
which is unnecessary.

This patch provides separate implementations for both annotations:
kasan_slab_alloc() unpoisons the whole object, and kasan_kmalloc() only
poisons the redzone.

For generic KASAN, the redzone start might not be aligned to
KASAN_GRANULE_SIZE.  Therefore, the poisoning is split in two parts:
kasan_poison_last_granule() poisons the unaligned part, and then
kasan_poison() poisons the rest.
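
A simplified sketch of the split for an unaligned redzone start (the helper
names follow the description above; the shadow value and exact arguments are
assumptions):

  /* Poison the unaligned part of the last granule covered by the object. */
  kasan_poison_last_granule(object, size);
  /* Poison the remaining, granule-aligned part of the redzone. */
  kasan_poison(redzone_start, redzone_end - redzone_start, KASAN_KMALLOC_REDZONE);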

This patch also clarifies alignment guarantees of each of the poisoning
functions and drops the unnecessary round_up() call for redzone_end.

With this change, the early SLUB cache annotation needs to be changed to
kasan_slab_alloc(), as kasan_kmalloc() doesn't unpoison objects now.  The
number of poisoned bytes for objects in this cache stays the same, as
kmem_cache_node->object_size is equal to sizeof(struct kmem_cache_node).

Link: https://lkml.kernel.org/r/7e3961cb52be380bc412860332063f5f7ce10d13.1612546384.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kasan, mm: don't save alloc stacks twice
Andrey Konovalov [Fri, 26 Feb 2021 01:19:55 +0000 (17:19 -0800)]
kasan, mm: don't save alloc stacks twice

Patch series "kasan: optimizations and fixes for HW_TAGS", v4.

This patchset makes the HW_TAGS mode more efficient, mostly by reworking
poisoning approaches and simplifying/inlining some internal helpers.

With this change, the overhead of HW_TAGS annotations excluding setting
and checking memory tags is ~3%.  The performance impact caused by tags
will be unknown until we have hardware that supports MTE.

As a side-effect, this patchset speeds up generic KASAN by ~15%.

This patch (of 13):

Currently KASAN saves allocation stacks in both kasan_slab_alloc() and
kasan_kmalloc() annotations.  This patch changes KASAN to save allocation
stacks for slab objects from kmalloc caches in kasan_kmalloc() only, and
stacks for other slab objects in kasan_slab_alloc() only.

This change requires ____kasan_kmalloc() knowing whether the object
belongs to a kmalloc cache.  This is implemented by adding a flag field to
the kasan_info structure.  That flag is only set for kmalloc caches via a
new kasan_cache_create_kmalloc() annotation.
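
A minimal sketch of the mechanism (the flag name and exact struct layout
here are illustrative assumptions):

  struct kasan_cache {
          int alloc_meta_offset;
          int free_meta_offset;
          bool is_kmalloc;        /* set only for kmalloc caches */
  };

  void __kasan_cache_create_kmalloc(struct kmem_cache *cache)
  {
          cache->kasan_info.is_kmalloc = true;
  }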

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/7c673ebca8d00f40a7ad6f04ab9a2bddeeae2097.1612546384.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kasan: use error_report_end tracepoint
Alexander Potapenko [Fri, 26 Feb 2021 01:19:51 +0000 (17:19 -0800)]
kasan: use error_report_end tracepoint

Make it possible to trace KASAN error reporting.  A good use case is
watching for trace events from userspace to detect and process memory
corruption reports from the kernel.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Suggested-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence: use error_report_end tracepoint
Alexander Potapenko [Fri, 26 Feb 2021 01:19:47 +0000 (17:19 -0800)]
kfence: use error_report_end tracepoint

Make it possible to trace KFENCE error reporting.  A good use case is
watching for trace events from userspace to detect and process memory
corruption reports from the kernel.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Suggested-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  tracing: add error_report_end trace point
Alexander Potapenko [Fri, 26 Feb 2021 01:19:44 +0000 (17:19 -0800)]
tracing: add error_report_end trace point

Patch series "Add error_report_end tracepoint to KFENCE and KASAN", v3.

This patchset adds a tracepoint, error_report_end, that is to be used by
KFENCE, KASAN, and potentially other bug detection tools, when they print
an error report.  One of the possible use cases is userspace collection of
kernel error reports: interested parties can subscribe to the tracing
event via tracefs, and get notified when an error report occurs.

This patch (of 3):

Introduce error_report_end tracepoint.  It can be used in debugging tools
like KASAN, KFENCE, etc.  to provide extensions to the error reporting
mechanisms (e.g.  allow tests to hook into error reporting, ease error report
collection from production kernels).  Another benefit would be making use
of ftrace for debugging or benchmarking the tools themselves.

Should we need it, the tracepoint name leaves us with the possibility to
introduce a complementary error_report_start tracepoint in the future.
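
In a detector's report path the hook is a single call, roughly as below (the
enum value spelling is an assumption for illustration):

  /* at the end of the error report */
  trace_error_report_end(ERROR_DETECTOR_KASAN, (unsigned long)access_addr);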

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Suggested-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence: report sensitive information based on no_hash_pointers
Marco Elver [Fri, 26 Feb 2021 01:19:40 +0000 (17:19 -0800)]
kfence: report sensitive information based on no_hash_pointers

We cannot rely on CONFIG_DEBUG_KERNEL to decide if we're running a "debug
kernel" where we can safely show potentially sensitive information in the
kernel log.

Instead, simply rely on the newly introduced "no_hash_pointers" to print
unhashed kernel pointers, as well as decide if our reports can include
other potentially sensitive information such as registers and corrupted
bytes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Cc: Timur Tabi <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  MAINTAINERS: add entry for KFENCE
Marco Elver [Fri, 26 Feb 2021 01:19:35 +0000 (17:19 -0800)]
MAINTAINERS: add entry for KFENCE

Add entry for KFENCE maintainers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Co-developed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence: add test suite
Marco Elver [Fri, 26 Feb 2021 01:19:31 +0000 (17:19 -0800)]
kfence: add test suite

Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.

[[email protected]: fix typo in test]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: show access type in report]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Co-developed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence, Documentation: add KFENCE documentation
Marco Elver [Fri, 26 Feb 2021 01:19:26 +0000 (17:19 -0800)]
kfence, Documentation: add KFENCE documentation

Add KFENCE documentation in dev-tools/kfence.rst, and add to index.

[[email protected]: add missing copyright header to documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Co-developed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence, kasan: make KFENCE compatible with KASAN
Alexander Potapenko [Fri, 26 Feb 2021 01:19:21 +0000 (17:19 -0800)]
kfence, kasan: make KFENCE compatible with KASAN

Make KFENCE compatible with KASAN. Currently this helps test KFENCE
itself, where KASAN can catch potential corruptions to KFENCE state, or
other corruptions that may be a result of freepointer corruptions in the
main allocators.

[[email protected]: merge fixup]
[[email protected]: untag addresses for KFENCE]
Link: https://lkml.kernel.org/r/9dc196006921b191d25d10f6e611316db7da2efc.1611946152.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm, kfence: insert KFENCE hooks for SLUB
Alexander Potapenko [Fri, 26 Feb 2021 01:19:16 +0000 (17:19 -0800)]
mm, kfence: insert KFENCE hooks for SLUB

Inserts KFENCE hooks into the SLUB allocator.

To pass the originally requested size to KFENCE, add an argument
'orig_size' to slab_alloc*(). The additional argument is required to
preserve the requested original size for kmalloc() allocations, which
use size classes (e.g. an allocation of 272 bytes will return an object
of size 512). Therefore, kmem_cache::size does not represent the
kmalloc-caller's requested size, and we must introduce the argument
'orig_size' to propagate the originally requested size to KFENCE.
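
The fast-path hook is effectively a guarded early return, for example (a
sketch; kfence_alloc() is the KFENCE entry point, the surrounding variable
names are assumptions):

  /* inside slab_alloc_node(), before taking the regular freelist path */
  void *obj = kfence_alloc(s, orig_size, gfpflags);

  if (unlikely(obj))
          return obj;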

Without the originally requested size, we would not be able to detect
out-of-bounds accesses for objects placed at the end of a KFENCE object
page if that object is not equal to the kmalloc-size class it was
bucketed into.

When KFENCE is disabled, there is no additional overhead, since
slab_alloc*() functions are __always_inline.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm, kfence: insert KFENCE hooks for SLAB
Alexander Potapenko [Fri, 26 Feb 2021 01:19:11 +0000 (17:19 -0800)]
mm, kfence: insert KFENCE hooks for SLAB

Inserts KFENCE hooks into the SLAB allocator.

To pass the originally requested size to KFENCE, add an argument
'orig_size' to slab_alloc*(). The additional argument is required to
preserve the requested original size for kmalloc() allocations, which
use size classes (e.g. an allocation of 272 bytes will return an object
of size 512). Therefore, kmem_cache::size does not represent the
kmalloc-caller's requested size, and we must introduce the argument
'orig_size' to propagate the originally requested size to KFENCE.

Without the originally requested size, we would not be able to detect
out-of-bounds accesses for objects placed at the end of a KFENCE object
page if that object is not equal to the kmalloc-size class it was
bucketed into.

When KFENCE is disabled, there is no additional overhead, since
slab_alloc*() functions are __always_inline.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  kfence: use pt_regs to generate stack trace on faults
Marco Elver [Fri, 26 Feb 2021 01:19:08 +0000 (17:19 -0800)]
kfence: use pt_regs to generate stack trace on faults

Instead of removing the fault handling portion of the stack trace based on
the fault handler's name, just use struct pt_regs directly.

Change kfence_handle_page_fault() to take a struct pt_regs, and plumb it
through to kfence_report_error() for out-of-bounds, use-after-free, or
invalid access errors, where pt_regs is used to generate the stack trace.

If the kernel is a DEBUG_KERNEL, also show registers for more information.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Suggested-by: Mark Rutland <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  arm64, kfence: enable KFENCE for ARM64
Marco Elver [Fri, 26 Feb 2021 01:19:03 +0000 (17:19 -0800)]
arm64, kfence: enable KFENCE for ARM64

Add architecture specific implementation details for KFENCE and enable
KFENCE for the arm64 architecture. In particular, this implements the
required interface in <asm/kfence.h>.

KFENCE requires that attributes for pages from its memory pool can
individually be set. Therefore, force the entire linear map to be mapped
at page granularity. Doing so may result in extra memory allocated for
page tables in case rodata=full is not set; however, currently
CONFIG_RODATA_FULL_DEFAULT_ENABLED=y is the default, and the common case
is therefore not affected by this change.

[[email protected]: add missing copyright and description header]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Co-developed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  x86, kfence: enable KFENCE for x86
Alexander Potapenko [Fri, 26 Feb 2021 01:18:57 +0000 (17:18 -0800)]
x86, kfence: enable KFENCE for x86

Add architecture specific implementation details for KFENCE and enable
KFENCE for the x86 architecture. In particular, this implements the
required interface in <asm/kfence.h> for setting up the pool and
providing helper functions for protecting and unprotecting pages.

For x86, we need to ensure that the pool uses 4K pages, which is done
using the set_memory_4k() helper function.
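
A sketch of the pool setup helper (simplified; treat this as illustrative
rather than the exact in-tree helper):

  static inline bool arch_kfence_init_pool(void)
  {
          unsigned long addr = (unsigned long)__kfence_pool;

          /* Split any large mappings so protection can be toggled per 4K page. */
          return !set_memory_4k(addr, KFENCE_POOL_SIZE / PAGE_SIZE);
  }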

[[email protected]: add missing copyright and description header]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm: add Kernel Electric-Fence infrastructure
Alexander Potapenko [Fri, 26 Feb 2021 01:18:53 +0000 (17:18 -0800)]
mm: add Kernel Electric-Fence infrastructure

Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.  This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
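
Conceptually, the gate in the allocator fast path is a single static branch
(a sketch; names follow the description above but are not guaranteed
verbatim):

  if (static_branch_unlikely(&kfence_allocation_key))
          return __kfence_alloc(s, size, flags);
  /* otherwise fall through to the regular SLAB/SLUB fast path */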

The KFENCE memory pool is of fixed size, and if the pool is exhausted no
further KFENCE allocations occur. The default config is conservative
with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
pages).

We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) and production server-workload benchmarks that a kernel with
KFENCE (using sample intervals 100-500ms) is performance-neutral
compared to a non-KFENCE baseline kernel.

KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].

For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:

https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst

[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence

This patch (of 9):

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error. To detect out-of-bounds
writes to memory within the object's page itself, KFENCE also uses
pattern-based redzones. The following figure illustrates the page
layout:

  ---+-----------+-----------+-----------+-----------+-----------+---
     | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
     | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
     | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
     | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
     | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
     | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
  ---+-----------+-----------+-----------+-----------+-----------+---

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.

For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).

[[email protected]: fix parameter description for kfence_object_start()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: avoid stalling work queue task without allocations]
Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix potential deadlock due to wake_up()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add option to use KFENCE without static keys]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add missing copyright and description headers]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/early_ioremap.c: use __func__ instead of function name
Stephen Zhang [Fri, 26 Feb 2021 01:18:48 +0000 (17:18 -0800)]
mm/early_ioremap.c: use __func__ instead of function name

It is better to use __func__ instead of hard-coding the function name.
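
For example (a generic illustration of the pattern, not the exact hunk from
this commit):

  pr_warn("%s: invalid mapping requested\n", __func__);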

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Zhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/backing-dev.c: use might_alloc()
Daniel Vetter [Fri, 26 Feb 2021 01:18:45 +0000 (17:18 -0800)]
mm/backing-dev.c: use might_alloc()

Now that my little helper has landed, use it more.  On top of the existing
check this also uses lockdep through the fs_reclaim annotations.
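
A minimal illustration of the annotation (a sketch; might_alloc() takes the
gfp mask of the upcoming allocation, the enclosing function is hypothetical):

  static void *my_alloc(gfp_t gfp)
  {
          might_alloc(gfp);       /* complains if a sleeping allocation is attempted from atomic context */
          return kmalloc(64, gfp);
  }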

[[email protected]: include linux/sched/mm.h]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/dmapool: use might_alloc()
Daniel Vetter [Fri, 26 Feb 2021 01:18:41 +0000 (17:18 -0800)]
mm/dmapool: use might_alloc()

Now that my little helper has landed, use it more.  On top of the existing
check this also uses lockdep through the fs_reclaim annotations.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm: page-flags.h: Typo fix (It -> If)
Guo Ren [Fri, 26 Feb 2021 01:18:38 +0000 (17:18 -0800)]
mm: page-flags.h: Typo fix (It -> If)

The "If" was wrongly spelled as "It".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guo Ren <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Steven Price <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/zsmalloc.c: use page_private() to access page->private
Miaohe Lin [Fri, 26 Feb 2021 01:18:34 +0000 (17:18 -0800)]
mm/zsmalloc.c: use page_private() to access page->private

It's recommended to use the helper macro page_private() to access the private
field of a page.  Use the helper to eliminate direct access.
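
The pattern being applied, illustratively:

  /* direct field access */
  zspage = (struct zspage *)page->private;
  /* preferred accessor */
  zspage = (struct zspage *)page_private(page);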

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  zsmalloc: account the number of compacted pages correctly
Rokudo Yan [Fri, 26 Feb 2021 01:18:31 +0000 (17:18 -0800)]
zsmalloc: account the number of compacted pages correctly

Multiple paths may do zram compaction concurrently:
1. auto-compaction triggered during memory reclaim
2. userspace utils writing the zram<id>/compaction node

So, multiple threads may call zs_shrinker_scan/zs_compact concurrently.
But pages_compacted is a per zsmalloc pool variable and modification
of the variable is not serialized (though it happens under class->lock).
There are two issues here:
1. pages_compacted may not equal the total number of pages
freed (due to concurrent additions).
2. zs_shrinker_scan may not return the correct number of pages
freed (issued by the current shrinker).

The fix is simple:
1. account the number of pages freed in zs_compact locally.
2. use an atomic variable pages_compacted to accumulate the total number,
as sketched below.
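
A sketch of the shape of the fix (field names are assumptions):

  unsigned long pages_freed = 0;

  /*
   * ... each compacted zspage adds class->pages_per_zspage to pages_freed
   * while class->lock is held ...
   */

  /* publish the local count once, with a single atomic update */
  atomic_long_add(pages_freed, &pool->stats.pages_compacted);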

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages")
Signed-off-by: Rokudo Yan <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()
Miaohe Lin [Fri, 26 Feb 2021 01:18:27 +0000 (17:18 -0800)]
mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()

We always memset the zspage allocated via cache_alloc_zspage.  So it's
more convenient to use kmem_cache_zalloc in cache_alloc_zspage than having
the caller do it manually.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm: set the sleep_mapped to true for zbud and z3fold
Tian Tao [Fri, 26 Feb 2021 01:18:22 +0000 (17:18 -0800)]
mm: set the sleep_mapped to true for zbud and z3fold

The zpool API gains a flag ("can_sleep_mapped") to indicate whether the zpool
backend can sleep after mapping.  This patch sets it true for z3fold and
zbud.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tian Tao <[email protected]>
Reviewed-by: Vitaly Wool <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reported-by: Mike Galbraith <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/zswap: add the flag can_sleep_mapped
Tian Tao [Fri, 26 Feb 2021 01:18:17 +0000 (17:18 -0800)]
mm/zswap: add the flag can_sleep_mapped

Patch series "Fix the compatibility of zsmalloc and zswap".

Patch #1 adds a flag to zpool, then zswap used to determine if zpool
drivers such as zbud/z3fold/zsmalloc will enter an atomic context after
mapping.

The difference between zbud/z3fold and zsmalloc is that zsmalloc requires
an atomic context after mapping, since its map function disables preemption,
but zbud/z3fold don't require an atomic context.  So patch #2 sets the flag
sleep_mapped to true, indicating that zbud/z3fold can sleep after mapping.
zsmalloc doesn't support sleeping after mapping, so the flag is not set to
true for it.

This patch (of 2):

Add a flag to zpool, named "can_sleep_mapped", and have it set true for
zbud/z3fold; do not set this flag for zsmalloc, so its default value is
false.  Then zswap can take the current path if the flag is true; and if
it's false, copy data from src to a temporary buffer, then unmap the
handle, take the mutex, and process the buffer instead of src, to avoid a
sleeping function being called from atomic context.
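
The zswap load path then looks roughly like this for a backend that cannot
sleep while mapped (a sketch; zpool_can_sleep_mapped() is the new helper,
the surrounding variable names are assumptions):

  if (!zpool_can_sleep_mapped(pool)) {
          /* allocated before mapping, so it may still sleep here */
          tmp = kmalloc(entry->length, GFP_KERNEL);
          if (!tmp)
                  return -ENOMEM;
  }

  src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
  if (!zpool_can_sleep_mapped(pool)) {
          memcpy(tmp, src, entry->length);
          zpool_unmap_handle(pool, entry->handle);
          src = tmp;      /* decompress from the buffer; taking the mutex may sleep now */
  }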

[[email protected]: add return value in zswap_frontswap_load]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix potential memory leak]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix potential uninitialized pointer read on tmp]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix variable 'entry' is uninitialized when used]
Link: https://lkml.kernel.org/r/[email protected]:
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tian Tao <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Vitaly Wool <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reported-by: Mike Galbraith <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm: zswap: clean up confusing comment
Randy Dunlap [Fri, 26 Feb 2021 01:18:13 +0000 (17:18 -0800)]
mm: zswap: clean up confusing comment

Correct wording and change one duplicated word (it) to "it is".

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0ab0abcf5115 ("mm/zswap: refactor the get/put routines")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Weijie Yang <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: fix potential pte_unmap on an not mapped pte
Miaohe Lin [Fri, 26 Feb 2021 01:18:09 +0000 (17:18 -0800)]
mm/rmap: fix potential pte_unmap on an not mapped pte

For a PMD-mapped page (usually THP), pvmw->pte is NULL.  For a PTE-mapped THP,
pvmw->pte is mapped.  But for HugeTLB pages, pvmw->pte is not mapped but is
set to the relevant page table entry.  So in page_vma_mapped_walk_done(),
we may do pte_unmap() for a HugeTLB pte which is not mapped.  Fix this by
checking pvmw->page against PageHuge before trying to do pte_unmap().
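
The resulting check, roughly (a sketch of page_vma_mapped_walk_done(); close
to the described fix but not guaranteed verbatim):

  static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
  {
          /* HugeTLB pte is set but never mapped, so never unmap it */
          if (pvmw->pte && !PageHuge(pvmw->page))
                  pte_unmap(pvmw->pte);
          if (pvmw->ptl)
                  spin_unlock(pvmw->ptl);
  }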

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hongxiang Lou <[email protected]>
Signed-off-by: Miaohe Lin <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Brian Geffon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: correct obsolete comment of page_get_anon_vma()
Miaohe Lin [Fri, 26 Feb 2021 01:18:06 +0000 (17:18 -0800)]
mm/rmap: correct obsolete comment of page_get_anon_vma()

Since commit 746b18d421da ("mm: use refcounts for page_lock_anon_vma()"),
page_lock_anon_vma() is renamed to page_get_anon_vma() and converted to
return a refcount-increased anon_vma.  But the commit forgot to update the
relevant comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: use page_not_mapped in try_to_unmap()
Miaohe Lin [Fri, 26 Feb 2021 01:18:03 +0000 (17:18 -0800)]
mm/rmap: use page_not_mapped in try_to_unmap()

page_mapcount_is_zero() calculates accurately how many mappings a hugepage
has in order to check against 0 only.  This is a waste of cpu time.  We
can do this via page_not_mapped() to save some possible atomic_read
cycles.  Remove the function page_mapcount_is_zero() as it's not used
anymore, and move page_not_mapped() above try_to_unmap() to avoid an
'identifier undeclared' compilation error.
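
The replacement predicate is trivial (as described above):

  static bool page_not_mapped(struct page *page)
  {
          return !page_mapped(page);
  }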

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: fix obsolete comment in __page_check_anon_rmap()
Miaohe Lin [Fri, 26 Feb 2021 01:17:59 +0000 (17:17 -0800)]
mm/rmap: fix obsolete comment in __page_check_anon_rmap()

Commit 21333b2b66b8 ("ksm: no debug in page_dup_rmap()") has reverted
page_dup_rmap() to an inline atomic_inc of mapcount.  So page_dup_rmap()
does not call __page_check_anon_rmap() anymore.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: remove unneeded semicolon in page_not_mapped()
Miaohe Lin [Fri, 26 Feb 2021 01:17:56 +0000 (17:17 -0800)]
mm/rmap: remove unneeded semicolon in page_not_mapped()

Remove extra semicolon without any functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/rmap: correct some obsolete comments of anon_vma
Miaohe Lin [Fri, 26 Feb 2021 01:17:53 +0000 (17:17 -0800)]
mm/rmap: correct some obsolete comments of anon_vma

Commit 2b575eb64f7a ("mm: convert anon_vma->lock to a mutex") changed the
spinlock used to serialize access to the vma list to a mutex.  Further,
commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an
rwsem") converted the mutex to an rwsem to solve a scalability problem.
So replace spinlock with rwsem in these comments to bring them up to date.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/mlock: stop counting mlocked pages when none vma is found
Miaohe Lin [Fri, 26 Feb 2021 01:17:49 +0000 (17:17 -0800)]
mm/mlock: stop counting mlocked pages when none vma is found

When find_vma() returns NULL, there is no vma satisfying addr < vm_end.
Thus it's meaningless to traverse the vma list below because we can't
find any vma to count mlocked pages.  Stop counting mlocked pages in this
case to save some vma list traversal cycles.
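
The early-out is simply (illustrative; the enclosing function is the mlocked
page counting walk mentioned above):

  vma = find_vma(mm, start);
  if (!vma)
          return 0;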

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  virtio-mem: check against mhp_get_pluggable_range() which memory we can hotplug
David Hildenbrand [Fri, 26 Feb 2021 01:17:45 +0000 (17:17 -0800)]
virtio-mem: check against mhp_get_pluggable_range() which memory we can hotplug

Right now, we only check against MAX_PHYSMEM_BITS - but it turns out there
are more restrictions on which memory we can actually hotplug, especially
on arm64 or s390x once we support them: we might receive something like
-E2BIG or -ERANGE from add_memory_driver_managed(), stopping device
operation.

So, check right when initializing the device which memory we can add,
warning the user.  Try only adding actually pluggable ranges: in the worst
case, no memory provided by our device is pluggable.

In the usual case, we expect all device memory to be pluggable, and in
corner cases only some memory at the end of the device-managed memory
region to not be pluggable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: teawater <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  s390/mm: define arch_get_mappable_range()
Anshuman Khandual [Fri, 26 Feb 2021 01:17:41 +0000 (17:17 -0800)]
s390/mm: define arch_get_mappable_range()

This overrides arch_get_mappable_range() on the s390 platform which will be
used with recently added generic framework.  It modifies the existing
range check in vmem_add_mapping() using arch_get_mappable_range().  It
also adds a VM_BUG_ON() check that would ensure that mhp_range_allowed()
has already been called on the hotplug path.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: teawater <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  arm64/mm: define arch_get_mappable_range()
Anshuman Khandual [Fri, 26 Feb 2021 01:17:37 +0000 (17:17 -0800)]
arm64/mm: define arch_get_mappable_range()

This overrides arch_get_mappable_range() on arm64 platform which will be
used with recently added generic framework.  It drops
inside_linear_region() and subsequent check in arch_add_memory() which are
no longer required.  It also adds a VM_BUG_ON() check that would ensure
that mhp_range_allowed() has already been called.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: teawater <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  mm/memory_hotplug: prevalidate the address range being added with platform
Anshuman Khandual [Fri, 26 Feb 2021 01:17:33 +0000 (17:17 -0800)]
mm/memory_hotplug: prevalidate the address range being added with platform

Patch series "mm/memory_hotplug: Pre-validate the address range with platform", v5.

This series adds a mechanism allowing platforms to weigh in and
prevalidate incoming address range before proceeding further with the
memory hotplug.  This helps prevent potential platform errors for the
given address range, down the hotplug call chain, which inevitably fails
the hotplug itself.

This mechanism was suggested by David Hildenbrand during another
discussion with respect to a memory hotplug fix on arm64 platform.

https://lore.kernel.org/linux-arm-kernel/1600332402[email protected]/

This mechanism focuses on the addressability aspect and not the [sub] section
alignment aspect.  Hence check_hotplug_memory_range() and check_pfn_span()
have been left unchanged.

This patch (of 4):

This introduces mhp_range_allowed() which can be called in various memory
hotplug paths to prevalidate the address range which is being added, with
the platform.  Then mhp_range_allowed() calls mhp_get_pluggable_range()
which provides applicable address range depending on whether linear
mapping is required or not.  For ranges that require linear mapping, it
calls a new arch callback arch_get_mappable_range() which the platform can
override.  So the new callback, in turn, provides the platform an
opportunity to configure acceptable memory hotplug address ranges in case
there are constraints.

This mechanism will help prevent platform specific errors deep down during
hotplug calls.  This drops the now-redundant
check_hotplug_memory_addressable() check in __add_pages() but instead adds
a VM_BUG_ON() check which would ensure that the range has been validated
with mhp_range_allowed() earlier in the call chain.  Besides,
mhp_get_pluggable_range() can also be used by potential memory hotplug
callers to obtain the allowed physical range which would go through on a
given platform.

This does not really add any new range check in generic memory hotplug but
instead compensates for lost checks in arch_add_memory() where applicable
and check_hotplug_memory_addressable(), with unified mhp_range_allowed().
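
A sketch of the new interfaces (signatures as described above; the body shown
here is an illustrative simplification, omitting diagnostics):

  /* platform override; a weak default allows the full physical range */
  struct range arch_get_mappable_range(void);

  bool mhp_range_allowed(u64 start, u64 size, bool need_mapping)
  {
          struct range allowed = mhp_get_pluggable_range(need_mapping);

          return start >= allowed.start && (start + size - 1) <= allowed.end;
  }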

[[email protected]: make pagemap_range() return -EINVAL when mhp_range_allowed() fails]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vasily Gorbik <[email protected]> # s390
Cc: Will Deacon <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: teawater <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago  Documentation: sysfs/memory: clarify some memory block device properties
David Hildenbrand [Fri, 26 Feb 2021 01:17:28 +0000 (17:17 -0800)]
Documentation: sysfs/memory: clarify some memory block device properties

In commit 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory blocks
as removable") we changed the output of the "removable" property of memory
devices to return "1" if and only if the kernel supports memory offlining.

Let's update the documentation, stating that the interface is legacy.  Also
update the documentation of the "state" and "valid_zones" properties.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agodrivers/base/memory: don't store phys_device in memory blocks
David Hildenbrand [Fri, 26 Feb 2021 01:17:24 +0000 (17:17 -0800)]
drivers/base/memory: don't store phys_device in memory blocks

No need to store the value for each and every memory block, as we can
easily query the value at runtime.  Reshuffle the members to optimize the
memory layout.  Also, let's clarify what the interface was once used for
and why it's legacy nowadays.

"phys_device" was used on s390x in older versions of lsmem[2]/chmem[3],
back when they were still part of s390x-tools.  They were later replaced
by the variants in linux-utils.  For example, RHEL6 and RHEL7 contain
lsmem/chmem from s390-utils.  RHEL8 switched to versions from util-linux
on s390x [4].

"phys_device" was added with sysfs support for memory hotplug in commit
3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove functions") in
2005.  It always returned 0.

s390x started returning something != 0 on some setups (if sclp.rzm is set
by HW) in 2010 via commit 57b552ba0b2f ("memory hotplug/s390: set
phys_device").

For s390x, it allowed for identifying which memory block devices belong to
the same storage increment (RZM).  Only if all memory block devices
comprising a single storage increment were offline could the memory
actually be removed in the hypervisor.

Since commit e5d709bb5fb7 ("s390/memory hotplug: provide
memory_block_size_bytes() function") in 2013 a memory block device spans
at least one storage increment - which is why the interface isn't really
helpful/used anymore (except by old lsmem/chmem tools).

There were once RFC patches to make use of "phys_device" in ACPI context;
however, the underlying problem could be solved using different interfaces
[1].

[1] https://patchwork.kernel.org/patch/2163871/
[2] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/lsmem
[3] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/chmem
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1504134

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Vaibhav Jain <[email protected]>
Cc: Tom Rix <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/memory_hotplug: use helper function zone_end_pfn() to get end_pfn
Miaohe Lin [Fri, 26 Feb 2021 01:17:21 +0000 (17:17 -0800)]
mm/memory_hotplug: use helper function zone_end_pfn() to get end_pfn

Commit 108bcc96ef70 ("mm: add & use zone_end_pfn() and zone_spans_pfn()")
introduced the helper zone_end_pfn() to calculate the zone end pfn.  But
update_pgdat_span() forgot to use it.

Use this helper and rename local variable zone_end_pfn to end_pfn to avoid
a naming conflict with the existing zone_end_pfn().
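
For reference, the helper is essentially the following (quoted from memory;
see include/linux/mmzone.h):

  static inline unsigned long zone_end_pfn(const struct zone *zone)
  {
          return zone->zone_start_pfn + zone->spanned_pages;
  }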

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/memory_hotplug: MEMHP_MERGE_RESOURCE -> MHP_MERGE_RESOURCE
David Hildenbrand [Fri, 26 Feb 2021 01:17:17 +0000 (17:17 -0800)]
mm/memory_hotplug: MEMHP_MERGE_RESOURCE -> MHP_MERGE_RESOURCE

Let's make "MEMHP_MERGE_RESOURCE" consistent with "MHP_NONE", "mhp_t" and
"mhp_flags".  As discussed recently [1], "mhp" is our internal acronym for
memory hotplug now.

[1] https://lore.kernel.org/linux-mm/c37de2d0-28a1-4f7d-f944-cfd7d81c334d@redhat.com/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Wei Liu <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/memory_hotplug: rename all existing 'memhp' into 'mhp'
Anshuman Khandual [Fri, 26 Feb 2021 01:17:13 +0000 (17:17 -0800)]
mm/memory_hotplug: rename all existing 'memhp' into 'mhp'

This renames all 'memhp' instances to 'mhp', except for memhp_default_state,
which is a kernel command line option.  This is just a cleanup and should
not cause a functional change.  Let's make the naming consistent rather
than mixing the two prefixes, in preparation for more users of the 'mhp'
terminology.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: fix memory_failure() handling of dax-namespace metadata
Dan Williams [Fri, 26 Feb 2021 01:17:08 +0000 (17:17 -0800)]
mm: fix memory_failure() handling of dax-namespace metadata

Given that 'struct dev_pagemap' spans both data pages and metadata pages, be
careful to consult the altmap, if present, to delineate metadata.  In fact
the pfn_first() helper already identifies the first valid data pfn, so
export that helper for other code paths via pgmap_pfn_valid().

Other usages of get_dev_pagemap() are not a concern because those are
operating on known data pfns having been looked up by get_user_pages().
I.e.  metadata pfns are never user mapped.
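
The idea behind pgmap_pfn_valid(), as a simplified sketch (not the exact
implementation): walk the pgmap ranges and treat a pfn as valid data only
if it lies at or past the first data pfn of its range.

  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
  {
          int i;

          for (i = 0; i < pgmap->nr_range; i++) {
                  struct range *range = &pgmap->ranges[i];

                  if (pfn >= PHYS_PFN(range->start) &&
                      pfn <= PHYS_PFN(range->end))
                          /* altmap/metadata pfns precede the first data pfn */
                          return pfn >= pfn_first(pgmap, i);
          }
          return false;
  }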

Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Dan Williams <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: teach pfn_to_online_page() about ZONE_DEVICE section collisions
Dan Williams [Fri, 26 Feb 2021 01:17:05 +0000 (17:17 -0800)]
mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions

While pfn_to_online_page() is able to determine pfn_valid() at subsection
granularity, it is not able to reliably determine whether a given pfn is
also online if the section mixes ZONE_{NORMAL,MOVABLE} with ZONE_DEVICE.
This means that pfn_to_online_page() may return invalid @page objects.
For example with a memory map like:

100000000-1fbffffff : System RAM
  142000000-143002e16 : Kernel code
  143200000-143713fff : Kernel rodata
  143800000-143b15b7f : Kernel data
  144227000-144ffffff : Kernel bss
1fc000000-2fbffffff : Persistent Memory (legacy)
  1fc000000-2fbffffff : namespace0.0

This command:

echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page

...succeeds when it should fail.  When it succeeds it touches an
uninitialized page and may crash or cause other damage (see
dissolve_free_huge_page()).

While the memory map above is contrived via the memmap=ss!nn kernel
command line option, the collision happens in practice on shipping
platforms.  The memory controller resources that decode spans of physical
address space are a limited resource.  One technique platform-firmware
uses to conserve those resources is to share a decoder across 2 devices to
keep the address range contiguous.  Unfortunately the unit of operation of
a decoder is 64MiB while the Linux section size is 128MiB.  This results
in situations where, without subsection hotplug, memory mappings with
different lifetimes collide into one object that can only express one
lifetime.

Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns.  With
SECTION_TAINT_ZONE_DEVICE set to delineate such sections,
pfn_to_online_page() can fall back to a slow-path check for ZONE_DEVICE
pfns in an online section.  In the fast path, online_section() for a fully
ZONE_DEVICE section returns false.

Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
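
Conceptually, the slow path looks roughly like the following inside
pfn_to_online_page(), after the existing section checks (a simplified
sketch, not the exact code):

  if (!online_device_section(ms))
          return pfn_to_page(pfn);        /* fast path: no ZONE_DEVICE here */

  /* Slow path: the section is tainted, check if this pfn is ZONE_DEVICE. */
  pgmap = get_dev_pagemap(pfn, NULL);
  put_dev_pagemap(pgmap);
  if (pgmap)
          return NULL;                    /* ZONE_DEVICE pfn, not online */

  return pfn_to_page(pfn);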

[[email protected]: fix CONFIG_ZONE_DEVICE=n build]
Link: https://lkml.kernel.org/r/CAPcyv4iX+7LAgAeSqx7Zw-Zd=ZV9gBv8Bo7oTbwCOOqJoZ3+Yg@mail.gmail.com
Link: https://lkml.kernel.org/r/161058500675.1840162.7887862152161279354.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Dan Williams <[email protected]>
Reported-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: teach pfn_to_online_page() to consider subsection validity
Dan Williams [Fri, 26 Feb 2021 01:17:01 +0000 (17:17 -0800)]
mm: teach pfn_to_online_page() to consider subsection validity

pfn_to_online_page is primarily used to filter out offline or fully
uninitialized pages.  pfn_valid resp.  online_section_nr have a coarse
per-memory-section granularity.  If a section is shared with partially
offline memory (e.g.  part of ZONE_DEVICE), then pfn_to_online_page
would produce a false positive on some pfns.  Fix this by adding a
pfn_section_valid() check, which is subsection aware.
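
For reference, pfn_section_valid() consults the per-section subsection map,
roughly (sketch):

  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
  {
          int idx = subsection_map_index(pfn);

          return test_bit(idx, ms->usage->subsection_map);
  }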

[[email protected]: changelog rewrite]

Link: https://lkml.kernel.org/r/161058500148.1840162.4365921007820501696.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
Signed-off-by: Dan Williams <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: move pfn_to_online_page() out of line
Dan Williams [Fri, 26 Feb 2021 01:16:57 +0000 (17:16 -0800)]
mm: move pfn_to_online_page() out of line

Patch series "mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE", v4.

A pfn-walker that uses pfn_to_online_page() may inadvertently translate a
pfn as online and in the page allocator when it is actually offline and
managed by a ZONE_DEVICE mapping (details in Patch 3: ("mm: Teach
pfn_to_online_page() about ZONE_DEVICE section collisions")).

The 2 proposals under consideration are to teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or to teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions.  In
order not to regress memory capacity by a few 10s to 100s of MiB, the
approach taken in this set is to add precision to pfn_to_online_page().

In the course of validating pfn_to_online_page() a couple other fixes
fell out:

1/ soft_offline_page() fails to drop the reference taken in the
   madvise(..., MADV_SOFT_OFFLINE) case.

2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
   however that mapping may contain data pages and metadata raw pfns.
   Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
   handling of raw metadata pfns.

This patch (of 4):

pfn_to_online_page() is already too large to be a macro or an inline
function.  In anticipation of further logic changes / growth, move it out
of line.

No functional change, just code movement.

Link: https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/161058499608.1840162.10165648147615238793.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Reported-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/vmstat.c: erase latency in vmstat_shepherd
Jiang Biao [Fri, 26 Feb 2021 01:16:54 +0000 (17:16 -0800)]
mm/vmstat.c: erase latency in vmstat_shepherd

Many 100us+ latencies have been detected in vmstat_shepherd() on a CPX
platform with 208 logical CPUs.  vmstat_shepherd is queued every
second, which could make the case worse.

Add a schedule point in vmstat_shepherd() to erase the latency.
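
A minimal sketch of the change, assuming the usual loop structure of
vmstat_shepherd() (simplified):

  for_each_online_cpu(cpu) {
          struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

          if (!delayed_work_pending(dw) && need_update(cpu))
                  queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

          cond_resched();         /* the added schedule point */
  }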

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiang Biao <[email protected]>
Reported-by: Bin Lai <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: vmstat: add some comments on internal storage of byte items
Johannes Weiner [Fri, 26 Feb 2021 01:16:51 +0000 (17:16 -0800)]
mm: vmstat: add some comments on internal storage of byte items

Byte-accounted items are used for slab object accounting at the cgroup
level, because the objects in a slab page can belong to different cgroups.
At the global level these items always change in multiples of whole slab
pages.  The vmstat code exploits this and stores these items as pages
internally, which allows for more compact per-cpu data.

This optimization isn't self-evident from the asserts and the division in
the stat update functions.  Provide the reader with some context.
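
The pattern being documented looks roughly like this in the stat update
functions (a simplified sketch):

  if (vmstat_item_in_bytes(item)) {
          /*
           * Only cgroups account slab objects in bytes; at the global
           * (node) level these counters change in whole pages, so store
           * them as pages to keep the per-cpu deltas small.
           */
          VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
          delta >>= PAGE_SHIFT;
  }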

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: vmstat: fix NOHZ wakeups for node stat changes
Johannes Weiner [Fri, 26 Feb 2021 01:16:47 +0000 (17:16 -0800)]
mm: vmstat: fix NOHZ wakeups for node stat changes

On NOHZ, the periodic vmstat flushers on each CPU can go to sleep and
won't wake up until stat changes are detected in the per-cpu deltas of the
zone vmstat counters.

In commit 75ef71840539 ("mm, vmstat: add infrastructure for per-node
vmstats") per-node counters were introduced, and subsequently most stats
were moved from the zone to the node level.  However, the node counters
weren't added to the NOHZ wakeup detection.

In theory this can cause per-cpu errors to remain in the user-reported
stats indefinitely.  In practice this only affects a handful of sub-counters
(e.g.  file_mapped, dirty and writeback) because other page state changes
at the node level likely involve a change at the zone level as well (alloc
and free, LRU ops).  Also, nobody has complained.

Fix it up for completeness: wake up vmstat refreshing on node changes.
Also remove the BUILD_BUG_ONs that assert counter size; we haven't relied
on them since we added sizeof() to the range calculation in commit
13c9aaf7fa01 ("mm/vmstat.c: fix NUMA statistics updates").
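
Conceptually (a simplified sketch, not the exact diff; 'p' is the cpu's
per-zone pageset and 'last_pgdat' tracks the last node checked),
need_update() now also inspects the per-node deltas:

  /* zone deltas, as before */
  if (memchr_inv(p->vm_stat_diff, 0, sizeof(p->vm_stat_diff)))
          return true;

  /* node deltas, newly considered for the NOHZ wakeup */
  if (last_pgdat != zone->zone_pgdat) {
          struct per_cpu_nodestat *n =
                  per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);

          last_pgdat = zone->zone_pgdat;
          if (memchr_inv(n->vm_node_stat_diff, 0,
                         sizeof(n->vm_node_stat_diff)))
                  return true;
  }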

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: cma: print region name on failure
Patrick Daly [Fri, 26 Feb 2021 01:16:44 +0000 (17:16 -0800)]
mm: cma: print region name on failure

Print the name of the CMA region for convenience.  This is useful
information to have when cma_alloc() fails.
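
For illustration, the failure message then carries the region name and the
requested count along these lines (a sketch, not the exact format string):

  pr_err("%s: %s: alloc failed, req-size: %zu pages, ret: %d\n",
         __func__, cma->name, count, ret);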

[[email protected]: print the "count" variable]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Patrick Daly <[email protected]>
Signed-off-by: Georgi Djakov <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo
David Hildenbrand [Fri, 26 Feb 2021 01:16:40 +0000 (17:16 -0800)]
mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo

Let's count the number of CMA pages per zone and print them in
/proc/zoneinfo.

Having access to the total number of CMA pages per zone is helpful for
debugging purposes to know where exactly the CMA pages ended up, and to
figure out how many pages of a zone might behave differently, even after
some of these pages might already have been allocated.

As one example, CMA pages part of a kernel zone cannot be used for
ordinary kernel allocations but instead behave more like ZONE_MOVABLE.

For now, we are only able to get the global nr+free cma pages from
/proc/meminfo and the free cma pages per zone from /proc/zoneinfo.

Example after this patch when booting a 6 GiB QEMU VM with
"hugetlb_cma=2G":
  # cat /proc/zoneinfo | grep cma
          cma      0
        nr_free_cma  0
          cma      0
        nr_free_cma  0
          cma      524288
        nr_free_cma  493016
          cma      0
          cma      0
  # cat /proc/meminfo | grep Cma
  CmaTotal:        2097152 kB
  CmaFree:         1972064 kB

Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
      one can be sure when spotting "cma 0" that there are definitely no
      CMA pages located in a zone.

[[email protected]: v2]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: v3]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/cma: expose all pages to the buddy if activation of an area fails
David Hildenbrand [Fri, 26 Feb 2021 01:16:37 +0000 (17:16 -0800)]
mm/cma: expose all pages to the buddy if activation of an area fails

Right now, if activation fails, we might already have exposed some pages
to the buddy for CMA use (although they will never get actually used by
CMA), and some pages won't be exposed to the buddy at all.

Let's check for "single zone" early and, on error, don't expose any pages
for CMA use - instead, expose them to the buddy, available for any use.
Simply call free_reserved_page() on every single page - easier than going
via free_reserved_area(), converting back and forth between pfns and virt
addresses.

In addition, make sure to fix up totalcma_pages properly.

Example: 6 GiB QEMU VM with "... hugetlb_cma=2G movablecore=20% ...":
  [    0.006891] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
  [    0.006893] cma: Reserved 2048 MiB at 0x0000000100000000
  [    0.006893] hugetlb_cma: reserved 2048 MiB on node 0
  ...
  [    0.175433] cma: CMA area hugetlb0 could not be activated

Before this patch:
  # cat /proc/meminfo
  MemTotal:        5867348 kB
  MemFree:         5692808 kB
  MemAvailable:    5542516 kB
  ...
  CmaTotal:        2097152 kB
  CmaFree:         1884160 kB

After this patch:
  # cat /proc/meminfo
  MemTotal:        6077308 kB
  MemFree:         5904208 kB
  MemAvailable:    5747968 kB
  ...
  CmaTotal:              0 kB
  CmaFree:               0 kB

Note: cma_init_reserved_mem() makes sure that we always cover full
pageblocks / MAX_ORDER - 1 pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: cma: allocate cma areas bottom-up
Roman Gushchin [Fri, 26 Feb 2021 01:16:33 +0000 (17:16 -0800)]
mm: cma: allocate cma areas bottom-up

Currently cma areas without a fixed base are allocated close to the end of
the node.  This placement is sub-optimal because of compaction: it brings
pages into the cma area.  In particular, it can bring in hot executable
pages, even if there is plenty of free memory on the machine.  This
results in cma allocation failures.

Instead let's place cma areas close to the beginning of a node.  In this
case the compaction will help to free cma areas, resulting in better cma
allocation success rates.

If there is enough memory, let's try to allocate bottom-up starting at 4GB
to exclude any possible interference with DMA32.  On smaller machines, or
in case of a failure, stick with the old behavior.
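
A simplified sketch of the placement logic (assumed shape, not the exact
patch; 'memblock_end' stands for the end of DRAM):

  /* Prefer a bottom-up placement above 4GB to stay clear of DMA/DMA32. */
  if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) {
          memblock_set_bottom_up(true);
          addr = memblock_alloc_range_nid(size, alignment, SZ_4G, limit,
                                          nid, true);
          memblock_set_bottom_up(false);
  }

  /* On smaller machines, or on failure, fall back to top-down placement. */
  if (!addr)
          addr = memblock_alloc_range_nid(size, alignment, base, limit,
                                          nid, true);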

16GB vm, 2GB cma area:
With this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
[    0.002931] hugetlb_cma: reserved 2048 MiB on node 0

Without this patch:
[    0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[    0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[    0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
[    0.002934] hugetlb_cma: reserved 2048 MiB on node 0

v2:
  - switched to memblock_set_bottom_up(true), by Mike
  - start with 4GB, by Mike

[[email protected]: whitespace fix, per Mike]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix 32-bit warnings]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix 32-bit systems]
[[email protected]: build fix]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Cc: Wonhyuk Yang <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm,shmem,thp: limit shmem THP allocations to requested zones
Rik van Riel [Fri, 26 Feb 2021 01:16:29 +0000 (17:16 -0800)]
mm,shmem,thp: limit shmem THP allocations to requested zones

Hugh pointed out that the gma500 driver uses shmem pages, but needs to
limit them to the DMA32 zone.  Ensure the allocations resulting from the
gfp_mask returned by limit_gfp_mask use the zone flags that were
originally passed to shmem_getpage_gfp.
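
In other words (an illustrative one-liner, not the literal diff), the zone
bits of the caller's mask are carried over into the resulting THP gfp_mask:

  /* Keep the caller's zone selection, e.g. the DMA32 limit for gma500. */
  result |= (limit_gfp & GFP_ZONEMASK);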

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rik van Riel <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xu Yu <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm,thp,shmem: make khugepaged obey tmpfs mount flags
Rik van Riel [Fri, 26 Feb 2021 01:16:25 +0000 (17:16 -0800)]
mm,thp,shmem: make khugepaged obey tmpfs mount flags

Currently if thp enabled=[madvise], mounting a tmpfs filesystem with
huge=always and mmapping files from that tmpfs does not result in
khugepaged collapsing those mappings, despite the mount flag indicating
that it should.

Fix that by breaking up the blocks of tests in hugepage_vma_check a little
bit, and testing things in the correct order.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c2231020ea7b ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xu Yu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm,thp,shm: limit gfp mask to no more than specified
Rik van Riel [Fri, 26 Feb 2021 01:16:22 +0000 (17:16 -0800)]
mm,thp,shm: limit gfp mask to no more than specified

Matthew Wilcox pointed out that the i915 driver opportunistically
allocates tmpfs memory, but will happily reclaim some of its pool if no
memory is available.

Make sure the gfp mask used to opportunistically allocate a THP is always
at least as restrictive as the original gfp mask.
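
A sketch of the idea (simplified; the flag selection below approximates the
actual helper rather than quoting it):

  static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
  {
          gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
          gfp_t denyflags  = __GFP_NOWARN | __GFP_NORETRY;
          gfp_t result     = huge_gfp & ~allowflags;

          /* Union of the deny flags, intersection of the allow flags. */
          result |= (limit_gfp & denyflags);
          result |= (huge_gfp & limit_gfp) & allowflags;

          return result;
  }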

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rik van Riel <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xu Yu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm,thp,shmem: limit shmem THP alloc gfp_mask
Rik van Riel [Fri, 26 Feb 2021 01:16:18 +0000 (17:16 -0800)]
mm,thp,shmem: limit shmem THP alloc gfp_mask

Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.

With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.

This patch (of 4):

The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

Controlling the gfp_mask of THP allocations through the knobs in sysfs
allows users to determine the balance between how aggressively the system
tries to allocate THPs at fault time, and how much the application may end
up stalling attempting those allocations.

This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.

With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Xu Yu <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: remove pagevec_lookup_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:16:14 +0000 (17:16 -0800)]
mm: remove pagevec_lookup_entries

pagevec_lookup_entries() is now just a wrapper around find_get_entries()
so remove it and convert all its callers.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: pass pvec directly to find_get_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:16:11 +0000 (17:16 -0800)]
mm: pass pvec directly to find_get_entries

All callers of find_get_entries() use a pvec, so pass it directly instead
of manipulating it in the caller.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: remove nr_entries parameter from pagevec_lookup_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:16:07 +0000 (17:16 -0800)]
mm: remove nr_entries parameter from pagevec_lookup_entries

All callers want to fetch the full size of the pvec.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: add an 'end' parameter to pagevec_lookup_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:16:03 +0000 (17:16 -0800)]
mm: add an 'end' parameter to pagevec_lookup_entries

This simplifies the callers and uses the existing functionality in
find_get_entries().  We can also drop the final argument of
truncate_exceptional_pvec_entries() and simplify the logic in that
function.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: add an 'end' parameter to find_get_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:16:00 +0000 (17:16 -0800)]
mm: add an 'end' parameter to find_get_entries

This simplifies the callers and leads to a more efficient implementation
since the XArray has this functionality already.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: add and use find_lock_entries
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:56 +0000 (17:15 -0800)]
mm: add and use find_lock_entries

We have three functions (shmem_undo_range(), truncate_inode_pages_range()
and invalidate_mapping_pages()) which want exactly this function, so add
it to filemap.c.  Before this patch, shmem_undo_range() would split any
compound page which overlaps either end of the range being punched in both
the first and second loops through the address space.  After this patch,
that functionality is left for the second loop, which is arguably more
appropriate since the first loop is supposed to run through all the pages
quickly, and splitting a page can sleep.

[[email protected]: add assertion]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agoiomap: use mapping_seek_hole_data
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:52 +0000 (17:15 -0800)]
iomap: use mapping_seek_hole_data

Enhance mapping_seek_hole_data() to handle partially uptodate pages and
convert the iomap seek code to call it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/filemap: add mapping_seek_hole_data
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:48 +0000 (17:15 -0800)]
mm/filemap: add mapping_seek_hole_data

Rewrite shmem_seek_hole_data() and move it to filemap.c.

[[email protected]: don't put an xa_is_value() page]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/filemap: add helper for finding pages
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:44 +0000 (17:15 -0800)]
mm/filemap: add helper for finding pages

There is a lot of common code in find_get_entries(),
find_get_pages_range() and find_get_pages_range_tag().  Factor out
find_get_entry() which simplifies all three functions.

[[email protected]: remove VM_BUG_ON_PAGE()]
Link: https://lkml.kernel.org/r/[email protected]:
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/filemap: rename find_get_entry to mapping_get_entry
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:40 +0000 (17:15 -0800)]
mm/filemap: rename find_get_entry to mapping_get_entry

find_get_entry doesn't "find" anything.  It returns the entry at a
particular index.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: add FGP_ENTRY
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:36 +0000 (17:15 -0800)]
mm: add FGP_ENTRY

The functionality of find_lock_entry() and find_get_entry() can be
provided by pagecache_get_page(), which lets us delete find_lock_entry()
and make find_get_entry() static.
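
For illustration (assumed usage, not a quote from the patch), a former
find_lock_entry(mapping, index) caller becomes roughly:

  page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_LOCK, 0);

with FGP_ENTRY making shadow/swap entries visible to the caller instead of
being filtered out.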

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/swap: optimise get_shadow_from_swap_cache
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:33 +0000 (17:15 -0800)]
mm/swap: optimise get_shadow_from_swap_cache

There's no need to get a reference to the page, just load the entry and
see if it's a shadow entry.
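
The resulting lookup is roughly (a sketch, details simplified):

  void *get_shadow_from_swap_cache(swp_entry_t entry)
  {
          struct address_space *address_space = swap_address_space(entry);
          pgoff_t idx = swp_offset(entry);
          void *shadow;

          shadow = xa_load(&address_space->i_pages, idx);
          if (xa_is_value(shadow))
                  return shadow;  /* a shadow entry */
          return NULL;            /* a real page (or nothing) is cached */
  }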

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm/shmem: use pagevec_lookup in shmem_unlock_mapping
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:29 +0000 (17:15 -0800)]
mm/shmem: use pagevec_lookup in shmem_unlock_mapping

The comment shows that the reason for using find_get_entries() is now
stale; find_get_pages() will not return 0 if it hits a consecutive run of
swap entries, and I don't believe it has since 2011.  pagevec_lookup() is
a simpler function to use than find_get_pages(), so use it instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: make pagecache tagged lookups return only head pages
Matthew Wilcox (Oracle) [Fri, 26 Feb 2021 01:15:25 +0000 (17:15 -0800)]
mm: make pagecache tagged lookups return only head pages

Patch series "Overhaul multi-page lookups for THP", v4.

This THP prep patchset changes several page cache iteration APIs to only
return head pages.

 - It's only possible to tag head pages in the page cache, so only
   return head pages, not all their subpages.
 - Factor a lot of common code out of the various batch lookup routines
 - Add mapping_seek_hole_data()
 - Unify find_get_entries() and pagevec_lookup_entries()
 - Make find_get_entries only return head pages, like find_get_entry().

These are only loosely connected, but they seem to make sense together as
a series.

This patch (of 14):

Pagecache tags are used for dirty page writeback.  Since dirtiness is
tracked on a per-THP basis, we only want to return the head page rather
than each subpage of a tagged page.  All the filesystems which use huge
pages today are in-memory, so there are no tagged huge pages today.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agoMerge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Linus Torvalds [Fri, 26 Feb 2021 17:17:24 +0000 (09:17 -0800)]
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS Client Updates from Anna Schumaker:
 "New Features:
   - Support for eager writes, and the write=eager and write=wait mount
     options

- Other Bugfixes and Cleanups:
   - Fix typos in some comments
   - Fix up fall-through warnings for Clang
   - Cleanups to the NFS readpage codepath
   - Remove FMR support in rpcrdma_convert_iovs()
   - Various other cleanups to xprtrdma
   - Fix xprtrdma pad optimization for servers that don't support
     RFC 8797
   - Improvements to rpcrdma tracepoints
   - Fix up nfs4_bitmask_adjust()
   - Optimize sparse writes past the end of files"

* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
  NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
  NFS: Set the stable writes flag when initialising the super block
  NFS: Add mount options supporting eager writes
  NFS: Add support for eager writes
  NFS: 'flags' field should be unsigned in struct nfs_server
  NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
  NFS: Always clear an invalid mapping when attempting a buffered write
  NFS: Optimise sparse writes past the end of file
  NFS: Fix documenting comment for nfs_revalidate_file_size()
  NFSv4: Fixes for nfs4_bitmask_adjust()
  xprtrdma: Clean up rpcrdma_prepare_readch()
  rpcrdma: Capture bytes received in Receive completion tracepoints
  xprtrdma: Pad optimization, revisited
  rpcrdma: Fix comments about reverse-direction operation
  xprtrdma: Refactor invocations of offset_in_page()
  xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
  xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
  NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
  NFS: Call readpage_async_filler() from nfs_readpage_async()
  NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
  ...

4 years agoswiotlb: Validate bounce size in the sync/unmap path
Martin Radev [Tue, 12 Jan 2021 15:07:29 +0000 (16:07 +0100)]
swiotlb: Validate bounce size in the sync/unmap path

The size of the buffer being bounced is not checked if it happens
to be larger than the size of the mapped buffer. Because the size
can be controlled by a device, as is the case with virtio devices,
this can lead to memory corruption.

This patch saves the remaining buffer memory for each slab and uses
that information for validation in the sync/unmap paths before
swiotlb_bounce is called.

Validating this argument is important under the threat models of
AMD SEV-SNP and Intel TDX, where the HV is considered untrusted.

Signed-off-by: Martin Radev <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
4 years agonvme-pci: set min_align_mask
Jianxiong Gao [Mon, 1 Feb 2021 18:30:17 +0000 (10:30 -0800)]
nvme-pci: set min_align_mask

The PRP addressing scheme requires all PRP entries except for the
first one to have a zero offset into the NVMe controller pages (which
can be different from the Linux PAGE_SIZE).  Use the min_align_mask
device parameter to ensure that swiotlb does not change the address
of the buffer modulo the device page size, so that the PRPs won't
be malformed.
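
Concretely, this amounts to a one-line setting in the PCI probe path
(sketch):

  dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);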

Signed-off-by: Jianxiong Gao <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jianxiong Gao <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
4 years agoswiotlb: respect min_align_mask
Christoph Hellwig [Mon, 22 Feb 2021 19:39:44 +0000 (14:39 -0500)]
swiotlb: respect min_align_mask

Respect the min_align_mask in struct device_dma_parameters in swiotlb.

There are two parts to it:
 1) for the lower bits of the alignment inside the io tlb slot, just
    extend the size of the allocation and leave the start of the slot
    empty
 2) for the high bits ensure we find a slot that matches the high bits
    of the alignment to avoid wasting too much memory

Based on an earlier patch from Jianxiong Gao <[email protected]>.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Jianxiong Gao <[email protected]>
Tested-by: Jianxiong Gao <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
4 years agobtrfs: use copy_highpage() instead of 2 kmaps()
Ira Weiny [Wed, 10 Feb 2021 06:22:20 +0000 (22:22 -0800)]
btrfs: use copy_highpage() instead of 2 kmaps()

There are many places where kmap/memmove/kunmap patterns occur.

This pattern exists in the core common function copy_highpage().

Use copy_highpage() to avoid open coding the use of kmap and to leverage
the core function's use of kmap_local_page().

Development of this patch was aided by the following coccinelle script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/copypage/kunmap pattern and replace with copy_highpage calls
//
// NOTE: The expressions in the copy page version of this kmap pattern are
// overly complex and so these all need individual attention.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// Then a copy_page where we have 2 pages involved.
//
@ copy_page_rule @
expression page, page2, To, From, Size;
identifier ptr, ptr2;
type VP, VP2;
@@

/* kmap */
(
-VP ptr = kmap(page);
...
-VP2 ptr2 = kmap(page2);
|
-VP ptr = kmap_atomic(page);
...
-VP2 ptr2 = kmap_atomic(page2);
|
-ptr = kmap(page);
...
-ptr2 = kmap(page2);
|
-ptr = kmap_atomic(page);
...
-ptr2 = kmap_atomic(page2);
)

// 1 or more copy versions of the entire page
<+...
(
-copy_page(To, From);
+copy_highpage(To, From);
|
-memmove(To, From, Size);
+memmoveExtra(To, From, Size);
)
...+>

/* kunmap */
(
-kunmap(page2);
...
-kunmap(page);
|
-kunmap(page);
...
-kunmap(page2);
|
-kmap_atomic(ptr2);
...
-kmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on copy_page_rule
@
identifier copy_page_rule.ptr;
identifier copy_page_rule.ptr2;
type VP, VP1;
type VP2, VP21;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;
-VP2 ptr2;
... when != ptr2;
? VP21 ptr2;

// </smpl>

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agobtrfs: use memcpy_[to|from]_page() and kmap_local_page()
Ira Weiny [Wed, 10 Feb 2021 06:22:19 +0000 (22:22 -0800)]
btrfs: use memcpy_[to|from]_page() and kmap_local_page()

There are many places where the pattern kmap/memcpy/kunmap occurs.

This pattern was lifted to the core common functions
memcpy_[to|from]_page().

Use these new functions to reduce the code, eliminate direct uses of
kmap, and leverage the new core functions' use of kmap_local_page().

Also, there is 1 place where a kmap/memcpy is followed by an
optional memset.  Here we leave the kmap open coded to avoid remapping
the page but use kmap_local_page() directly.

Development of this patch was aided by the coccinelle script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate.  Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// simple memcpy version
//
@ memcpy_rule1 @
expression page, T, F, B, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memcpy(ptr + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(ptr, F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, ptr + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, ptr, B);
+memcpy_from_page(T, page, 0, B);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule1
@
identifier memcpy_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

//
// Some callers kmap without a temp pointer
//
@ memcpy_rule2 @
expression page, T, Off, F, B;
@@

<+...
(
-memcpy(kmap(page) + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(kmap(page), F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, kmap(page) + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, kmap(page), B);
+memcpy_from_page(T, page, 0, B);
)
...+>
-kunmap(page);
// No need for the ptr variable removal

//
// Catch all
//
@ memcpy_rule3 @
expression page;
expression GenTo, GenFrom, GenSize;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memcpy
// match a catch all to be evaluated by hand.
//
-memcpy(GenTo, GenFrom, GenSize);
+memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
+memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule3
@
identifier memcpy_rule3.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

// </smpl>

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agoi2c: exynos5: Preserve high speed master code
Mårten Lindahl [Tue, 16 Feb 2021 22:25:38 +0000 (23:25 +0100)]
i2c: exynos5: Preserve high speed master code

When the driver starts to send a message with the MASTER_ID field
set (high speed), the whole I2C_ADDR register is overwritten, including
MASTER_ID, when the SLV_ADDR_MAS field is set.

This patch preserves already written fields in I2C_ADDR when writing
SLV_ADDR_MAS.
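
A sketch of the fix (register and field names approximated, not quoted from
the patch): read-modify-write I2C_ADDR instead of overwriting the whole
register:

  u32 i2c_addr = readl(i2c->regs + HSI2C_ADDR);

  i2c_addr &= ~HSI2C_SLV_ADDR_MAS(0x3ff);   /* clear only the slave-address field */
  i2c_addr |= HSI2C_SLV_ADDR_MAS(addr);     /* MASTER_ID and friends stay intact */
  writel(i2c_addr, i2c->regs + HSI2C_ADDR);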

Fixes: 8a73cd4cfa15 ("i2c: exynos5: add High Speed I2C controller driver")
Signed-off-by: Mårten Lindahl <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Tested-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
4 years agoRevert "i2c: i2c-qcom-geni: Add shutdown callback for i2c"
Wolfram Sang [Wed, 24 Feb 2021 09:23:13 +0000 (10:23 +0100)]
Revert "i2c: i2c-qcom-geni: Add shutdown callback for i2c"

This reverts commit e0371298ddc51761be257698554ea507ac8bf831. It was
accidentally applied despite discussion still going on.

Signed-off-by: Wolfram Sang <[email protected]>
Acked-by: Stephen Boyd <[email protected]>
4 years agoi2c: designware: Get right data length
Liguang Zhang [Thu, 25 Feb 2021 14:26:31 +0000 (22:26 +0800)]
i2c: designware: Get right data length

IC_DATA_CMD[11] indicates the first data byte received after the address
phase for a receive transfer in Master receiver or Slave receiver mode,
and this bit is set in some transfer flows. IC_DATA_CMD[7:0] contains the
data to be transmitted or received on the I2C bus, so we should use the
lower 8 bits to get the real data length.
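
Illustratively (the mask below is shorthand for the driver's data-field
definition, not the literal diff), the received length has to be masked
down to IC_DATA_CMD[7:0]:

  len = dw_readl(dev, DW_IC_DATA_CMD) & 0xff;   /* bits [7:0] hold the data */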

Signed-off-by: Liguang Zhang <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
4 years agoi2c: brcmstb: Fix brcmstd_send_i2c_cmd condition
Maxime Ripard [Thu, 25 Feb 2021 16:11:01 +0000 (17:11 +0100)]
i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition

The brcmstb_send_i2c_cmd function currently has a condition that reads
(CMD_RD || CMD_WR), which always evaluates to true; the obvious fix is to
test whether the cmd variable passed as a parameter holds one of these two
values.
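
That is, as a before/after sketch:

  - if (cmd == CMD_RD || CMD_WR)          /* always true: CMD_WR is non-zero */
  + if (cmd == CMD_RD || cmd == CMD_WR)   /* test cmd against both values */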

Fixes: dd1aa2524bc5 ("i2c: brcmstb: Add Broadcom settop SoC i2c controller driver")
Reported-by: Dave Stevenson <[email protected]>
Signed-off-by: Maxime Ripard <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
4 years agoKVM: xen: flush deferred static key before checking it
Paolo Bonzini [Fri, 26 Feb 2021 09:49:06 +0000 (04:49 -0500)]
KVM: xen: flush deferred static key before checking it

A missing flush would cause the static branch to trigger incorrectly.

Cc: David Woodhouse <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled
Sean Christopherson [Thu, 25 Feb 2021 20:47:26 +0000 (12:47 -0800)]
KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled

Check that PML is actually enabled before setting the mask to force a
SPTE to be write-protected.  The bits used for the !AD_ENABLED case are
in the upper half of the SPTE.  With 64-bit paging and EPT, these bits
are ignored, but with 32-bit PAE paging they are reserved.  Setting them
for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.

Fixes: 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration with PML")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20210225204749.1512652[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref
Wanpeng Li [Fri, 26 Feb 2021 07:59:59 +0000 (15:59 +0800)]
KVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref

Reported by syzkaller:

    KASAN: null-ptr-deref in range [0x0000000000000140-0x0000000000000147]
    CPU: 1 PID: 8370 Comm: syz-executor859 Not tainted 5.11.0-syzkaller #0
    RIP: 0010:synic_get arch/x86/kvm/hyperv.c:165 [inline]
    RIP: 0010:kvm_hv_set_sint_gsi arch/x86/kvm/hyperv.c:475 [inline]
    RIP: 0010:kvm_hv_irq_routing_update+0x230/0x460 arch/x86/kvm/hyperv.c:498
    Call Trace:
     kvm_set_irq_routing+0x69b/0x940 arch/x86/kvm/../../../virt/kvm/irqchip.c:223
     kvm_vm_ioctl+0x12d0/0x2800 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3959
     vfs_ioctl fs/ioctl.c:48 [inline]
     __do_sys_ioctl fs/ioctl.c:753 [inline]
     __se_sys_ioctl fs/ioctl.c:739 [inline]
     __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
     do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xae

The Hyper-V context is allocated lazily, only when Hyper-V specific MSRs
are accessed or SynIC is enabled. However, the syzkaller testcase sets
the irq routing table directly without enabling SynIC. This results in a
null-ptr-deref when accessing the SynIC Hyper-V context. This patch
fixes it.
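
A hedged sketch of the kind of check that avoids the dereference; the
helper names follow common KVM Hyper-V conventions but are assumptions
here:

    static struct kvm_vcpu_hv_synic *synic_get(struct kvm *kvm, u32 vpidx)
    {
            struct kvm_vcpu *vcpu = get_vcpu_by_vpidx(kvm, vpidx);

            /* Bail out if the Hyper-V context was never allocated */
            if (!vcpu || !to_hv_vcpu(vcpu))
                    return NULL;
            return to_hv_synic(vcpu);
    }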

syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=163342ccd00000

Reported-by: [email protected]
Fixes: 8f014550dfb1 ("KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional")
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <1614326399[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: x86: remove misplaced comment on active_mmu_pages
Dongli Zhang [Fri, 26 Feb 2021 06:19:45 +0000 (22:19 -0800)]
KVM: x86: remove misplaced comment on active_mmu_pages

The 'mmu_page_hash' is used as a hash table, while 'active_mmu_pages' is
a list. Remove the misplaced comment, as it mostly states the obvious
anyway.

Signed-off-by: Dongli Zhang <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <20210226061945[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoKVM: Documentation: rectify rst markup in kvm_run->flags
Chenyi Qiang [Fri, 26 Feb 2021 07:55:41 +0000 (15:55 +0800)]
KVM: Documentation: rectify rst markup in kvm_run->flags

Commit c32b1b896d2a ("KVM: X86: Add the Document for
KVM_CAP_X86_BUS_LOCK_EXIT") added a new flag to the kvm_run->flags
documentation and caused warnings in make htmldocs:

  Documentation/virt/kvm/api.rst:5004: WARNING: Unexpected indentation
  Documentation/virt/kvm/api.rst:5004: WARNING: Inline emphasis start-string without end-string

Fix this rst markup issue.

Signed-off-by: Chenyi Qiang <[email protected]>
Message-Id: <20210226075541[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
4 years agoDocumentation: kvm: fix messy conversion from .txt to .rst
Paolo Bonzini [Thu, 25 Feb 2021 13:37:01 +0000 (08:37 -0500)]
Documentation: kvm: fix messy conversion from .txt to .rst

Building the documentation gives a warning that the KVM_PPC_RESIZE_HPT_PREPARE
label is defined twice.  The root cause is that the KVM_PPC_RESIZE_HPT_PREPARE
API is documented twice, the second instance being a mix of the prepare and
commit APIs.  Fix it.

Signed-off-by: Paolo Bonzini <[email protected]>
4 years agocifs: update internal version number
Steve French [Tue, 16 Feb 2021 05:58:58 +0000 (23:58 -0600)]
cifs: update internal version number

To 2.31

Signed-off-by: Steve French <[email protected]>
4 years agocifs: use discard iterator to discard unneeded network data more efficiently
David Howells [Thu, 4 Feb 2021 06:15:21 +0000 (00:15 -0600)]
cifs: use discard iterator to discard unneeded network data more efficiently

The ITER_DISCARD iterator, which can only be used in READ mode and
simply discards any data copied to it, was added to allow a network
filesystem to discard any unwanted data sent by a server.
Convert cifs_discard_from_socket() to use it.
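
A hedged sketch of the general ITER_DISCARD pattern (not the exact cifs
implementation):

    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
    int ret;

    /* READ-only iterator that simply drops whatever is copied into it */
    iov_iter_discard(&msg.msg_iter, READ, to_discard);
    ret = sock_recvmsg(socket, &msg, msg.msg_flags);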

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Steve French <[email protected]>
4 years agovmlinux.lds.h: Define SANITIZER_DISCARDS with CONFIG_GCOV_KERNEL=y
Nathan Chancellor [Sat, 30 Jan 2021 00:46:51 +0000 (17:46 -0700)]
vmlinux.lds.h: Define SANITIZER_DISCARDS with CONFIG_GCOV_KERNEL=y

clang produces .eh_frame sections when CONFIG_GCOV_KERNEL is enabled,
even when -fno-asynchronous-unwind-tables is in KBUILD_CFLAGS:

$ make CC=clang vmlinux
...
ld: warning: orphan section `.eh_frame' from `init/main.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/version.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/do_mounts.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/do_mounts_initrd.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/initramfs.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/calibrate.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/init_task.o' being placed in section `.eh_frame'
...

$ rg "GCOV_KERNEL|GCOV_PROFILE_ALL" .config
CONFIG_GCOV_KERNEL=y
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
CONFIG_GCOV_PROFILE_ALL=y

This was already handled for a couple of other options in
commit d812db78288d ("vmlinux.lds.h: Avoid KASAN and KCSAN's unwanted
sections") and there is an open LLVM bug for this issue. Take advantage
of that section for this config as well so that there are no more orphan
warnings.
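
A hedged sketch of the resulting preprocessor condition in vmlinux.lds.h
(the exact condition in the tree may differ):

    #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KCSAN) || \
        defined(CONFIG_GCOV_KERNEL)
    # define SANITIZER_DISCARDS	*(.eh_frame)
    #else
    # define SANITIZER_DISCARDS
    #endif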

Link: https://bugs.llvm.org/show_bug.cgi?id=46478
Link: https://github.com/ClangBuiltLinux/linux/issues/1069
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Fangrui Song <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Fixes: d812db78288d ("vmlinux.lds.h: Avoid KASAN and KCSAN's unwanted sections")
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoMerge tag 'pwm/for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry...
Linus Torvalds [Thu, 25 Feb 2021 20:23:49 +0000 (12:23 -0800)]
Merge tag 'pwm/for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
 "The ZTE ZX platform is being removed, so the PWM driver is no longer
  needed and removed as well.

  Other than that this contains a small set of fixes and cleanups across
  a couple of drivers"

* tag 'pwm/for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
  pwm: lpc18xx-sct: remove unneeded semicolon
  pwm: iqs620a: Correct a stale state variable
  pwm: iqs620a: Fix overflow and optimize calculations
  pwm: rockchip: Enable clock before calling clk_get_rate()
  pwm: rockchip: Eliminate potential race condition when probing
  pwm: rockchip: Replace "bus clk" with "PWM clk"
  pwm: rockchip: rockchip_pwm_probe(): Remove superfluous clk_unprepare()
  pwm: rockchip: Enable APB clock during register access while probing
  pwm: Remove ZTE ZX driver

4 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Thu, 25 Feb 2021 20:21:08 +0000 (12:21 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - new vdpa features to allow creation and deletion of new devices

 - virtio-blk support per-device queue depth

 - fixes, cleanups all over the place

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (31 commits)
  virtio-input: add multi-touch support
  virtio_mmio: fix one typo
  vdpa/mlx5: fix param validation in mlx5_vdpa_get_config()
  virtio_net: Fix fall-through warnings for Clang
  virtio_input: Prevent EV_MSC/MSC_TIMESTAMP loop storm for MT.
  virtio-blk: support per-device queue depth
  virtio_vdpa: don't warn when fail to disable vq
  virtio-pci: introduce modern device module
  virito-pci-modern: rename map_capability() to vp_modern_map_capability()
  virtio-pci-modern: introduce helper to get notification offset
  virtio-pci-modern: introduce helper for getting queue nums
  virtio-pci-modern: introduce helper for setting/geting queue size
  virtio-pci-modern: introduce helper to set/get queue_enable
  virtio-pci-modern: introduce vp_modern_queue_address()
  virtio-pci-modern: introduce vp_modern_set_queue_vector()
  virtio-pci-modern: introduce vp_modern_generation()
  virtio-pci-modern: introduce helpers for setting and getting features
  virtio-pci-modern: introduce helpers for setting and getting status
  virtio-pci-modern: introduce helper to set config vector
  virtio-pci-modern: introduce vp_modern_remove()
  ...

4 years agokbuild: Move .thinlto-cache removal to 'make clean'
Masahiro Yamada [Thu, 25 Feb 2021 19:39:12 +0000 (04:39 +0900)]
kbuild: Move .thinlto-cache removal to 'make clean'

Instead of 'make distclean', 'make clean' should remove build artifacts
that are not needed by external module builds. There is obviously no need
to keep the .thinlto-cache directory for that.

Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO")
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoparisc: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
Sami Tolvanen [Wed, 24 Feb 2021 22:57:06 +0000 (14:57 -0800)]
parisc: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY

parisc uses -fpatchable-function-entry with dynamic ftrace, which means we
don't need recordmcount. Select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
to tell that to the build system.

Reported-by: Guenter Roeck <[email protected]>
Fixes: 3b15cdc15956 ("tracing: move function tracer options to Kconfig")
Signed-off-by: Sami Tolvanen <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Tested-by: Kees Cook <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agoMerge tag 'mips_5.12_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Linus Torvalds [Thu, 25 Feb 2021 20:18:21 +0000 (12:18 -0800)]
Merge tag 'mips_5.12_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull more MIPS updates from Thomas Bogendoerfer:

 - added n64 block driver

 - fix for ubsan warnings

 - fix for bcm63xx platform

 - update of linux-mips mailinglist

* tag 'mips_5.12_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  arch: mips: update references to current linux-mips list
  mips: bmips: init clocks earlier
  vmlinux.lds.h: catch even more instrumentation symbols into .data
  n64: store dev instance into disk private data
  n64: cleanup n64cart_probe()
  n64: cosmetics changes
  n64: remove curly brackets
  n64: use sector SECTOR_SHIFT instead 512
  n64: use enums for reg
  n64: move module param at the top
  n64: move module info at the end
  n64: use pr_fmt to avoid duplicate string
  block: Add n64 cart driver

4 years agodocs: proc.rst: fix indentation warning
Randy Dunlap [Tue, 23 Feb 2021 06:04:18 +0000 (22:04 -0800)]
docs: proc.rst: fix indentation warning

Fix indentation snafu in proc.rst as reported by Stephen.

next-20210219/Documentation/filesystems/proc.rst:697: WARNING: Unexpected indentation.

Fixes: 93ea4a0b8fce ("Documentation: proc.rst: add more about the 6 fields in loadavg")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoMerge tag 'drm-next-2021-02-26' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Thu, 25 Feb 2021 20:10:22 +0000 (12:10 -0800)]
Merge tag 'drm-next-2021-02-26' of git://anongit.freedesktop.org/drm/drm

Pull more drm updates from Dave Airlie:
 "This is mostly fixes but I missed msm-next pull last week. It's been
  in drm-next.

  Otherwise it's a selection of i915, amdgpu and misc fixes, one TTM
  memory leak, nothing really major stands out otherwise.

  core:
   - vblank fence timing improvements

  dma-buf:
   - improve error handling

  ttm:
   - memory leak fix

  msm:
   - a6xx speedbin support
   - a508, a509, a512 support
   - various a5xx fixes
   - various dpu fixes
   - qseed3lite support for sm8250
   - dsi fix for msm8994
   - mdp5 fix for framerate bug with cmd mode panels
   - a6xx GMU OOB race fixes that were showing up in CI
   - various addition and removal of semicolons
   - gem submit fix for legacy userspace relocs path

  amdgpu:
   - clang warning fix
   - S0ix platform shutdown/poweroff fix
   - misc display fixes

  i915:
   - color format fix
   - -Wuninitialised reenabled
   - GVT ww locking, cmd parser fixes

  atyfb:
   - fix build

  rockchip:
   - AFBC modifier fix"

* tag 'drm-next-2021-02-26' of git://anongit.freedesktop.org/drm/drm: (60 commits)
  drm/panel: kd35t133: allow using non-continuous dsi clock
  drm/rockchip: Require the YTR modifier for AFBC
  drm/ttm: Fix a memory leak
  drm/drm_vblank: set the dma-fence timestamp during send_vblank_event
  dma-fence: allow signaling drivers to set fence timestamp
  dma-buf: heaps: Rework heap allocation hooks to return struct dma_buf instead of fd
  dma-buf: system_heap: Make sure to return an error if we abort
  drm/amd/display: Fix system hang after multiple hotplugs (v3)
  drm/amdgpu: fix shutdown and poweroff process failed with s0ix
  drm/i915: Nuke INTEL_OUTPUT_FORMAT_INVALID
  drm/i915: Enable -Wuninitialized
  drm/amd/display: Remove Assert from dcn10_get_dig_frontend
  drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1
  Revert "drm/amd/display: reuse current context instead of recreating one"
  drm/amd/pm/swsmu: Avoid using structure_size uninitialized in smu_cmn_init_soft_gpu_metrics
  fbdev: atyfb: add stubs for aty_{ld,st}_lcd()
  drm/i915/gvt: Introduce per object locking in GVT scheduler.
  drm/i915/gvt: Purge dev_priv->gt
  drm/i915/gvt: Parse default state to update reg whitelist
  dt-bindings: dp-connector: Drop maxItems from -supply
  ...

4 years agoDocumentation: cgroup-v2: fix path to example BPF program
Antonio Terceiro [Wed, 24 Feb 2021 13:16:31 +0000 (10:16 -0300)]
Documentation: cgroup-v2: fix path to example BPF program

This file has been moved into the "progs" subdirectory, together with all
test BPF programs.

Fixes: bd4aed0ee73c ("selftests: bpf: centre kernel bpf objects under new subdir "progs"")
Signed-off-by: Antonio Terceiro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jiong Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>