Git Repo - linux.git/log

kasan: add compiler barriers to KUNIT_EXPECT_KASAN_FAIL

It might not be obvious to the compiler that the expression must be
executed between writing and reading to fail_data. In this case, the
compiler might reorder or optimize away some of the accesses, and
the tests will fail.

Add compiler barriers around the expression in KUNIT_EXPECT_KASAN_FAIL
and use READ/WRITE_ONCE() for accessing fail_data fields.

Link: https://linux-review.googlesource.com/id/I046079f48641a1d36fe627fc8827a9249102fd50
Link: https://lkml.kernel.org/r/6f11596f367d8ae8f71d800351e9a5d91eda19f6.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: rename CONFIG_TEST_KASAN_MODULE

Rename CONFIG_TEST_KASAN_MODULE to CONFIG_KASAN_MODULE_TEST.

This naming is more consistent with the existing CONFIG_KASAN_KUNIT_TEST.

Link: https://linux-review.googlesource.com/id/Id347dfa5fe8788b7a1a189863e039f409da0ae5f
Link: https://lkml.kernel.org/r/f08250246683981bcf8a094fbba7c361995624d2.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan, arm64: allow using KUnit tests with HW_TAGS mode

On a high level, this patch allows running KUnit KASAN tests with the
hardware tag-based KASAN mode.

Internally, this change reenables tag checking at the end of each KASAN
test that triggers a tag fault and leads to tag checking being disabled.

Also simplify is_write calculation in report_tag_fault.

With this patch KASAN tests are still failing for the hardware tag-based
mode; fixes come in the next few patches.

[[email protected]: export HW_TAGS symbols for KUnit tests]
Link: https://lkml.kernel.org/r/e7eeb252da408b08f0c81b950a55fb852f92000b.1613155970.git.andreyknvl@google.com
Link: https://linux-review.googlesource.com/id/Id94dc9eccd33b23cda4950be408c27f879e474c8
Link: https://lkml.kernel.org/r/51b23112cf3fd62b8f8e9df81026fa2b15870501.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: add match-all tag tests

Add 3 new tests for tag-based KASAN modes:

1. Check that match-all pointer tag is not assigned randomly.
2. Check that 0xff works as a match-all pointer tag.
3. Check that there are no match-all memory tags.

Note, that test #3 causes a significant number (255) of KASAN reports
to be printed during execution for the SW_TAGS mode.

[[email protected]: export kasan_poison]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: s/EXPORT_SYMBOL_GPL/EXPORT_SYMBOL/, per Andrey]

Link: https://linux-review.googlesource.com/id/I78f1375efafa162b37f3abcb2c5bc2f3955dfd8e
Link: https://lkml.kernel.org/r/da841a5408e2204bf25f3b23f70540a65844e8a4.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: add macros to simplify checking test constraints

Some KASAN tests require specific kernel configs to be enabled.
Instead of copy-pasting the checks for these configs add a few helper
macros and use them.

Link: https://linux-review.googlesource.com/id/I237484a7fddfedf4a4aae9cc61ecbcdbe85a0a63
Link: https://lkml.kernel.org/r/6a0fcdb9676b7e869cfc415893ede12d916c246c.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Suggested-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: clean up comments in tests

Clarify and update comments in KASAN tests.

Link: https://linux-review.googlesource.com/id/I6c816c51fa1e0eb7aa3dead6bda1f339d2af46c8
Link: https://lkml.kernel.org/r/ba6db104d53ae0e3796f80ef395f6873c1c1282f.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: clarify HW_TAGS impact on TBI

Mention in the documentation that enabling CONFIG_KASAN_HW_TAGS always
results in in-kernel TBI (Top Byte Ignore) being enabled.

Also do a few minor documentation cleanups.

Link: https://linux-review.googlesource.com/id/Iba2a6697e3c6304cb53f89ec61dedc77fa29e3ae
Link: https://lkml.kernel.org/r/3b4ea6875bb14d312092ad14ac55cb456c83c08e.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: prefix global functions with kasan_

Patch series "kasan: HW_TAGS tests support and fixes", v4.

This patchset adds support for running KASAN-KUnit tests with the
hardware tag-based mode and also contains a few fixes.

This patch (of 15):

There's a number of internal KASAN functions that are used across multiple
source code files and therefore aren't marked as static inline.  To avoid
littering the kernel function names list with generic function names,
prefix all such KASAN functions with kasan_.

As a part of this change:

- Rename internal (un)poison_range() to kasan_(un)poison() (no _range)
   to avoid name collision with a public kasan_unpoison_range().

- Rename check_memory_region() to kasan_check_range(), as it's a more
   fitting name.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://linux-review.googlesource.com/id/I719cc93483d4ba288a634dba80ee6b7f2809cd26
Link: https://lkml.kernel.org/r/13777aedf8d3ebbf35891136e1f2287e2f34aaba.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Suggested-by: Marco Elver <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vmalloc: remove redundant NULL check

Fix below warnings reported by coccicheck:

fs/proc/vmcore.c:1503:2-7: WARNING: NULL check before some freeing functions is not needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: "Uladzislau Rezki (Sony)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_reporting: use list_entry_is_head() in page_reporting_cycle()

Replace '&next->lru != list' with list_entry_is_head(). No functional
change.

Link: https://lkml.kernel.org/r/20201222182735.GA1257912@ubuntu-A520I-AC
Signed-off-by: sh <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Alexander Duyck <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: mremap: unlink anon_vmas when mremap with MREMAP_DONTUNMAP success

mremap with MREMAP_DONTUNMAP can move all page table entries to new vma,
which means all pages allocated for the old vma are not relevant to it
anymore, and the relevant anon_vma links needs to be unlinked, in nature
the old vma is much like been freshly created and have no pages been fault
in.

But we should not do unlink, if the new vma has effectively merged with
the old one.

[[email protected]: v2]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: rmap: explicitly reset vma->anon_vma in unlink_anon_vmas()

In case the vma will continue to be used after unlink its relevant
anon_vma, we need to reset the vma->anon_vma pointer to NULL. So, later
when fault happen within this vma again, a new anon_vma will be prepared.

By this way, the vma will only be checked for reverse mapping of pages
which been fault in after the unlink_anon_vmas call.

Currently, the mremap with MREMAP_DONTUNMAP scenario will continue use the
vma after moved its page table entries to a new vma. For other scenarios,
the vma itself will be freed after call unlink_anon_vmas.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mprotect.c: optimize error detection in do_mprotect_pkey()

Obviously, the error variable detection of the if statement is
for the mprotect callback function, so it is also put into the
scope of calling callbck.

This is a cleanup which makes this site consistent with the rest of this
function's error handling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tianjia Zhang <[email protected]>
Reported-by: Jia Zhang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory.c: fix potential pte_unmap_unlock pte error

If all pte entry is none in 'non-create' case, we would break the loop with
pte unchanged. Then the wrong pte - 1 would be passed to pte_unmap_unlock.
This is a theoretical issue which may not be a real bug. So it's not worth
cc stable.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: aee16b3cee27 ("Add apply_to_page_range() which applies a function to a pte range")
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ian Pratt <[email protected]>
Cc: Chris Wright <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/pgtable-generic.c: optimize the VM_BUG_ON condition in pmdp_huge_clear_flush()

The developer will have trouble figuring out why the BUG actually
triggered when there is a complex expression in the VM_BUG_ON. Because we
can only identify the condition triggered BUG via line number provided by
VM_BUG_ON. Optimize this by spliting such a complex expression into two
simple conditions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/pgtable-generic.c: simplify the VM_BUG_ON condition in pmdp_huge_clear_flush()

The condition (A && !C && !D) || !A is equivalent to !A || (A && !C && !D)
and can be further simplified to !A || (!C && !D).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory.c: fix potential pte_unmap_unlock pte error

Since commit 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged
high MMIO PROT_NONE mappings"), when the first pfn modify is not allowed,
we would break the loop with pte unchanged.  Then the wrong pte - 1 would
be passed to pte_unmap_unlock.

Andi said:

"While the fix is correct, I'm not sure if it actually is a real bug.
  Is there any architecture that would do something else than unlocking
  the underlying page? If it's just the underlying page then it should
  be always the same page, so no bug"

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 42e4089c789 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings")
Signed-off-by: Hongxiang Lou <[email protected]>
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: remove unnecessary local variable

The local variable 'retval' is assigned just for once in __do_sys_brk(),
and the function returns the value of the local variable right after the
assignment. Remove unnecessary assignment and local variable declaration.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Adrian Huang <[email protected]>
Acked-by: Souptick Joarder <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: fix slub memory accounting

SLUB currently account kmalloc() and kmalloc_node() allocations larger
than order-1 page per-node.  But it forget to update the per-memcg
vmstats.  So it can lead to inaccurate statistics of "slab_unreclaimable"
which is from memory.stat.  Fix it by using mod_lruvec_page_state instead
of mod_node_page_state.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: fix get_active_memcg return value

We use a global percpu int_active_memcg variable to store the remote memcg
when we are in the interrupt context.  But get_active_memcg always return
the current->active_memcg or root_mem_cgroup.  The remote memcg (set in
the interrupt context) is ignored.  This is not what we want.  So fix it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: fix swap undercounting in cgroup2

When pages are swapped in, the VM may retain the swap copy to avoid
repeated writes in the future.  It's also retained if shared pages are
faulted back in some processes, but not in others.  During that time we
have an in-memory copy of the page, as well as an on-swap copy.  Cgroup1
and cgroup2 handle these overlapping lifetimes slightly differently due to
the nature of how they account memory and swap:

Cgroup1 has a unified memory+swap counter that tracks a data page
regardless whether it's in-core or swapped out.  On swapin, we transfer
the charge from the swap entry to the newly allocated swapcache page, even
though the swap entry might stick around for a while.  That's why we have
a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().

Cgroup2 tracks memory and swap as separate, independent resources and thus
has split memory and swap counters.  On swapin, we charge the newly
allocated swapcache page as memory, while the swap slot in turn must
remain charged to the swap counter as long as its allocated too.

The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol: make
swap tracking an integral part of memory control"), because it
accidentally removed the do_memsw_account() check in the branch inside
mem_cgroup_uncharge() that was supposed to tell the difference between the
charge transfer in cgroup1 and the separate counters in cgroup2.

As a result, cgroup2 currently undercounts retained swap to varying
degrees: swap slots are cached up to 50% of the configured limit or total
available swap space; partially faulted back shared pages are only limited
by physical capacity.  This in turn allows cgroups to significantly
overconsume their alloted swap space.

Add the do_memsw_account() check back to fix this problem.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]> [5.8+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs: buffer: use raw page_memcg() on locked page

alloc_page_buffers() currently uses get_mem_cgroup_from_page() for
charging the buffers to the page owner, which does an rcu-protected
page->memcg lookup and acquires a reference. But buffer allocation has
the page lock held throughout, which pins the page to the memcg and
thereby the memcg - neither rcu nor holding an extra reference during the
allocation are necessary. Use a raw page_memcg() instead.

This was the last user of get_mem_cgroup_from_page(), delete it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/list_lru.c: remove kvfree_rcu_local()

The list_lru file used to have local kvfree_rcu() which was renamed by
commit e0feed08ab41 ("mm/list_lru.c: Rename kvfree_rcu() to local
variant") to introduce the globally visible kvfree_rcu().

Now we have global kvfree_rcu(), so remove the local kvfree_rcu_local()
and just use the global one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Uladzislau Rezki <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: replace the loop with a list_for_each_entry()

The rule of list walk has gone since commit a9d5adeeb4b2
("mm/memcontrol: allow to uncharge page without using page->lru field")

So remove the strange comment and replace the loop with a
list_for_each_entry().

There is only one caller of the uncharge_list(). So just fold it into
mem_cgroup_uncharge_list() and remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol: remove redundant NULL check

Fix below warnings reported by coccicheck:

mm/memcontrol.c:451:3-9: WARNING: NULL check before some freeing functions is not needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: page_counter: re-layout structure to reduce false sharing

When checking a memory cgroup related performance regression [1], from the
perf c2c profiling data, we found high false sharing for accessing 'usage'
and 'parent'.

On 64 bit system, the 'usage' and 'parent' are close to each other, and
easy to be in one cacheline (for cacheline size == 64+ B).  'usage' is
usally written, while 'parent' is usually read as the cgroup's
hierarchical counting nature.

So move the 'parent' to the end of the structure to make sure they
are in different cache lines.

Following are some performance data with the patch, against v5.11-rc1.  [
In the data, A means a platform with 2 sockets 48C/96T, B is a platform of
4 sockests 72C/144T, and if a %stddev will be shown bigger than 2%,
P100/P50 means number of test tasks equals to 100%/50% of nr_cpu]

will-it-scale/malloc1
---------------------
   v5.11-rc1 v5.11-rc1+patch

A-P100      15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
A-P50      21511            +8.9%      23432        will-it-scale.per_process_ops
B-P100       9155            +2.2%       9357        will-it-scale.per_process_ops
B-P50      10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops

will-it-scale/pagefault2
------------------------
   v5.11-rc1 v5.11-rc1+patch

A-P100      79028            +3.0%      81411        will-it-scale.per_process_ops
A-P50     183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
B-P100      85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
B-P50     198195            +9.8%     217526        will-it-scale.per_process_ops

fio (4k/1M is block size)
-------------------------
   v5.11-rc1 v5.11-rc1+patch

A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps

[1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmem: make __memcg_kmem_(un)charge static

I've noticed that __memcg_kmem_charge() and __memcg_kmem_uncharge() are
not used anywhere except memcontrol.c. Yet they are not declared as
non-static and are declared in memcontrol.h.

This patch makes them static.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg: add swapcache stat for memcg v2

This patch adds swapcache stat for the cgroup v2.  The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup.  The main motivation behind exposing the
swapcache stat is for enabling users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.

Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
workload but without control on the exact proportion of memory and swap.
Cgroup v2 provides separate limits for memory and swap which enables more
control on the exact usage of memory and swap individually for the
workload.

With some little subtleties, the v1's memsw limit can be switched with the
sum of the v2's memory and swap limits.  However the alternative for memsw
usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
stat enables that alternative.  Adding the memory usage and swap usage and
subtracting the swapcache will approximate the memsw usage.  This will
help in the transparent migration of the workloads depending on memsw
usage and limit to v2' memory and swap counters.

The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics.  A single usage metric is more
simple to use and reason about for them.

(2) The memsw usage metric hides the underlying system's swap setup from
the applications.  Applications with multiple instances running in a
datacenter with heterogeneous systems (some have swap and some don't) will
keep seeing a consistent view of their usage.

[[email protected]: fix CONFIG_SWAP=n build]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcg: remove rcu locking for lock_page_lruvec function series

lock_page_lruvec() and its variants used rcu_read_lock() with the
intention of safeguarding against the mem_cgroup being destroyed
concurrently; but so long as they are called under the specified
conditions (as they are), there is no way for the page's mem_cgroup to be
destroyed. Delete the unnecessary rcu_read_lock() and _unlock().

Hugh Dickins polished the commit log. Thanks a lot!

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcg: revise the using condition of lock_page_lruvec function series

lock_page_lruvec() and its variants are safe to use under the same
conditions as commit_charge(): add lock_page_memcg() to the comment.

Polished with Hugh Dickins' suggestions, thanks!

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: make the slab calculation consistent

Although the ratio of the slab is one, we also should read the ratio from
the related memory_stats instead of hard-coding.  And the local variable
of size is already the value of slab_unreclaimable.  So we do not need to
read again.

To do this we need some code like below:

if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
- size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
-        memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
+       VM_BUG_ON(i < 1);
+       VM_BUG_ON(memory_stats[i - 1].idx != NR_SLAB_RECLAIMABLE_B);
+ size += memcg_page_state(memcg, memory_stats[i - 1].idx) *
+ memory_stats[i - 1].ratio;

It requires a series of VM_BUG_ONs or comments to ensure these two items
are actually adjacent and in the right order.  So it would probably be
easier to implement this using a wrapper that has a big switch() for unit
conversion.

More details about this discussion can refer to:

    https://lore.kernel.org/patchwork/patch/1348611/

This would fix the ratio inconsistency and get rid of the order
guarantee.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: convert NR_SHMEM_THPS account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_THPS account to pages.  This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: convert NR_FILE_THPS account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with if hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_FILE_THPS account to pages.  This patch is consistent
with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also can make the unit of vmstat counters more unified.
Finally, the unit of the vmstat counters are pages, kB and bytes.  The
B/KB suffix can tell us that the unit is bytes or kB.  The rest which is
without suffix are pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: convert NR_ANON_THPS account to pages

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_ANON_THPS account to pages.  This patch is consistent
with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also can make the unit of vmstat counters more unified.
Finally, the unit of the vmstat counters are pages, kB and bytes.  The
B/KB suffix can tell us that the unit is bytes or kB.  The rest which is
without suffix are pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: fix NR_ANON_THPS accounting in charge moving

Patch series "Convert all THP vmstat counters to pages", v6.

This patch series is aimed to convert all THP vmstat counters to pages.

The unit of some vmstat counters are pages, some are bytes, some are
HPAGE_PMD_NR, and some are KiB. When we want to expose these vmstat
counters to the userspace, we have to know the unit of the vmstat counters
is which one. When the unit is bytes or kB, both clearly distinguishable
by the B/KB suffix. But for the THP vmstat counters, we may make mistakes.

For example, the below is some bug fix for the THP vmstat counters:

  - 7de2e9f195b9 ("mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg")
  - The first commit in this series ("fix NR_ANON_THPS accounting in charge moving")

This patch series can make the code clear. And make all the unit of the THP
vmstat counters in pages. Finally, the unit of the vmstat counters are
pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes
or kB. The rest which is without suffix are pages.

In this series, I changed the following vmstat counters unit from HPAGE_PMD_NR
to pages. However, there is no change to the print format of output to user
space.

  - NR_ANON_THPS
  - NR_FILE_THPS
  - NR_SHMEM_THPS
  - NR_SHMEM_PMDMAPPED
  - NR_FILE_PMDMAPPED

Doing this also can make the statistics more accuracy for the THP vmstat
counters. This series is consistent with 8f182270dfec ("mm/swap.c: flush lru
pvecs on compound page arrival").

Because we use struct per_cpu_nodestat to cache the vmstat counters, which
leads to inaccurate statistics especially THP vmstat counters. In the systems
with hundreds of processors it can be GBs of memory. For example, for a 96
CPUs system, the threshold is the maximum number of 125. And the per cpu
counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth of
memory in one go) so skipping the batching seems like sensible. Although every
THP stats update overflows the per-cpu counter, resorting to atomic global
updates. But it can make the statistics more accuracy for the THP vmstat
counters. From this point of view, I think that do this converting is
reasonable.

Thanks Hugh for mentioning this. This was inspired by Johannes and Roman.
Thanks to them.

This patch (of 7):

The unit of NR_ANON_THPS is HPAGE_PMD_NR already.  So it should inc/dec by
one rather than nr_pages.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Pankaj Gupta <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Rafael. J. Wysocki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: optimize per-lruvec stats counter memory usage

The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold
can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE.  It still fits into s32.
So introduce struct batched_lruvec_stat to optimize memory usage.

The size of struct lruvec_stat is 304 bytes on 64 bit systems.  As it is a
per-cpu structure.  So with this patch, we can save 304 / 2 * ncpu bytes
per-memcg per-node where ncpu is the number of the possible CPU.  If there
are c memory cgroup (include dying cgroup) and n NUMA node in the system.
Finally, we can save (152 * ncpu * c * n) bytes.

[[email protected]: fix typo in comment]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNT

In general it's unknown in advance if a slab page will contain accounted
objects or not.  In order to avoid memory waste, an obj_cgroup vector is
allocated dynamically when a need to account of a new object arises.  Such
approach is memory efficient, but requires an expensive cmpxchg() to set
up the memcg/objcgs pointer, because an allocation can race with a
different allocation on another cpu.

But in some common cases it's known for sure that a slab page will contain
accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT
flag set.  It includes such popular objects like vm_area_struct, anon_vma,
task_struct, etc.

In such cases we can pre-allocate the objcgs vector and simple assign it
to the page without any atomic operations, because at this early stage the
page is not visible to anyone else.

A very simplistic benchmark (allocating 10000000 64-bytes objects in a
row) shows ~15% win.  In the real life it seems that most workloads are
not very sensitive to the speed of (accounted) slab allocations.

[[email protected]: open-code set_page_objcgs() and add some comments, by Johannes]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swap: don't SetPageWorkingset unconditionally during swapin

We are capable of SetPageWorkingset based on refault distances after
commit aae466b0052e ("mm/swap: implement workingset detection for
anonymous LRU").  This is done by workingset_refault(), which is right
above the unconditional SetPageWorkingset deleted by this patch.

The unconditional SetPageWorkingset miscategorizes pages that are read
ahead or never belonged to the working set (e.g., tmpfs pages accessed
only once by fd).  When those pages are swapped in (after they were
swapped out) for the first time, they skew PSI (when using async swap).
When this happens again, depending on their refault distances, they might
skew workingset_restore_anon counter in addition to PSI because their
shadows indicate they were part of the working set.

Historically, SetPageWorkingset was added as part of the PSI series, and
Johannes said:
"It was meant to mark incoming pages under IO with SetPageWorkingset
  when waiting for them constituted a memory stall.

  On the page cache side, because we HAVE workingset detection, this was
  specific to recently evicted pages that had been active in their
  previous life. On the anon side, the aging algorithm had no
  distinction between workingset and sporadically used pages. Given the
  choice between a) no swapin stalls are pressure and b) all swapin
  stalls are pressure, I went with the latter in order to detect swap
  storms. The false positive case - high rate of swapin without severe
  memory pressure - was relatively unlikely, because we tried to avoid
  swapping until everything was completely on fire in the first place."

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swap_state: constify static struct attribute_group

The only usage of swap_attr_group is to pass its address to
sysfs_create_group() which takes a pointer to const attribute_group. Make
it const to allow the compiler to put it in read-only memory.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rikard Falkeborn <[email protected]>
Reviewed-by: Amy Parker <[email protected]>
Acked-by: "Huang, Ying" <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_io: use pr_alert_ratelimited for swap read/write errors

If there are errors during swap read or write, they can easily fill the
log buffer and remove any previous messages that might be useful for
debugging, especially on systems that rely for logging only on the kernel
ring-buffer.

For example, on a systems using zram as swap, we are more likely to see
any page allocation errors preceding the swap write errors if the alerts
are ratelimited.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Georgi Djakov <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swapfile.c: fix debugging information problem

Once the function name is changed, it may be easy to forget to modify the
corresponding code here.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Zhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swap_slots.c: remove redundant NULL check

Fix below warnings reported by coccicheck:

mm/swap_slots.c:197:3-9: WARNING: NULL check before some freeing functions is not needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: backing-dev: Remove duplicated macro definition

Move the K() macro a little forward to remove the same macro definition.

Link: https://lkml.kernel.org/r/d1ccdf2d3116dce9814f2bcc1f0415ecb4c76ea5.1612862230.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/buffer.c: add checking buffer head stat before clear

clear_buffer_new() is used to clear buffer new stat. When PAGE_SIZE is
64K, most buffer heads in the list are not needed to clear.
clear_buffer_new() has an enpensive atomic modification operation, Let's
add checking buffer head before clear it as __block_write_begin_int does
which is good for performance.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Guo <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: simplify generic_file_read_iter

Avoid the pointless goto out just for returning retval.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: rename generic_file_buffered_read to filemap_read

Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: don't relock the page after calling readpage

We don't need to get the page lock again; we just need to wait for the I/O
to finish, so use wait_on_page_locked_killable() like the other callers of
->readpage.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: restructure filemap_get_pages

Remove the got_pages label, remove indentation, rename find_page to retry,
simplify error handling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: split filemap_readahead out of filemap_get_pages

This simplifies the error handling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: add filemap_range_uptodate

Move the complicated condition and the calculations out of
filemap_update_page() into its own function.

[[email protected]: unlock page before dropping its refcount]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: move the iocb checks into filemap_update_page

We don't need to give up when a non-blocking request sees a !Uptodate
page. We may be able to satisfy the read from a partially-uptodate page.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: convert filemap_update_page to return an errno

Use AOP_TRUNCATED_PAGE to indicate that no error occurred, but the page we
looked up is no longer valid. In this case, the reference to the page
will have been removed; if we hit any other error, the caller will release
the reference.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: change filemap_create_page calling conventions

By moving the iocb flag checks to the caller, we can pass the file and the
page index instead of the iocb. It never needed the iter. By passing the
pagevec, we can return an errno (or AOP_TRUNCATED_PAGE) instead of an
ERR_PTR.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: change filemap_read_page calling conventions

Make this function more generic by passing the file instead of the iocb.
Check in the callers whether we should call readpage or not. Also make it
return an errno / 0 / AOP_TRUNCATED_PAGE, and make calling put_page() the
caller's responsibility.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: don't call ->readpage if IOCB_WAITQ is set

The readpage operation can block in many (most?) filesystems, so we should
punt to a work queue instead of calling it. This was the last caller of
lock_page_for_iocb(), so remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: inline __wait_on_page_locked_async into caller

The previous patch removed wait_on_page_locked_async(), so inline
__wait_on_page_locked_async into __lock_page_async().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: support readpage splitting a page

For page splitting to succeed, the thread asking to split the page has to
be the only one with a reference to the page.  Calling
wait_on_page_locked() while holding a reference to the page will
effectively prevent this from happening with sufficient threads waiting on
the same page.  Use put_and_wait_on_page_locked() to sleep without holding
a reference to the page, then retry the page lookup after the page is
unlocked.

Since we now get the page lock a little earlier in filemap_update_page(),
we can eliminate a number of duplicate checks.  The original intent
(commit ebded02788b5 ("avoid unnecessary calls to lock_page when waiting
for IO to complete during a read")) behind getting the page lock later was
to avoid re-locking the page after it has been brought uptodate by another
thread.  We still avoid that because we go through the normal lookup path
again after the winning thread has brought the page uptodate.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: pass a sleep state to put_and_wait_on_page_locked

This is prep work for the next patch, but I think at least one of the
current callers would prefer a killable sleep to an uninterruptible one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: use head pages in generic_file_buffered_read

Add filemap_get_read_batch() which returns the head pages which represent
a contiguous array of bytes in the file. It also stops when encountering
a page marked as Readahead or !Uptodate (but does return that page) so it
can be handled appropriately by filemap_get_pages(). That lets us remove
the loop in filemap_get_pages() and check only the last page.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: convert filemap_get_pages to take a pagevec

Using a pagevec lets us keep the pages and the number of pages together
which simplifies a lot of the calling conventions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: remove dynamically allocated array from filemap_read

Increasing the batch size runs into diminishing returns. It's probably
better to make, eg, three calls to filemap_get_pages() than it is to call
into kmalloc().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: rename generic_file_buffered_read subfunctions

Patch series "Refactor generic_file_buffered_read", v5.

This is a combination of Christoph's work to refactor
generic_file_buffered_read() and some of my large-page support
which was disrupted by Kent's refactoring of generic_file_buffered_read.

This patch (of 18):

The recent split of generic_file_buffered_read() created some very long
function names which are hard to distinguish from each other. Rename as
follows:

generic_file_buffered_read_readpage -> filemap_read_page
generic_file_buffered_read_pagenotuptodate -> filemap_update_page
generic_file_buffered_read_no_cached_page -> filemap_create_page
generic_file_buffered_read_get_pages -> filemap_get_pages

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: don't revert iter on -EIOCBQUEUED

Currently, if I/O is enqueued for async execution direct paths of
generic_file_{read,write}_iter() will always revert the iter. There are
no users expecting that, and that is also costly. Leave iterators as is
on -EIOCBQUEUED.

Link: https://lkml.kernel.org/r/f5247b60e7abbd2ff850cd108491f53a2e0c501a.1610207781.git.asml.silence@gmail.com
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap: remove unused parameter and change to void type for replace_page_cache_page()

Since commit 74d609585d8b ("page cache: Add and replace pages using the
XArray") was merged, the replace_page_cache_page() can not fail and always
return 0, we can remove the redundant return value and void it. Moreover
remove the unused gfp_mask.

Link: https://lkml.kernel.org/r/609c30e5274ba15d8b90c872fd0d8ac437a9b2bb.1610071401.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_owner: use helper function zone_end_pfn() to get end_pfn

Commit 108bcc96ef70 ("mm: add & use zone_end_pfn() and zone_spans_pfn()")
introduced the helper zone_end_pfn() to calculate the zone end pfn. But
pagetypeinfo_showmixedcount_print forgot to use it. And the
initialization of local variable pfn is duplicated, remove one.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/debug_vm_pgtable/basic: iterate over entire protection_map[]

Currently the basic tests just validate various page table transformations
after starting with vm_get_page_prot(VM_READ|VM_WRITE|VM_EXEC) protection.
Instead scan over the entire protection_map[] for better coverage.  It
also makes sure that all these basic page table tranformations checks hold
true irrespective of the starting protection value for the page table
entry.  There is also a slight change in the debug print format for basic
tests to capture the protection value it is being tested with.  The
modified output looks something like

[pte_basic_tests          ]: Validating PTE basic ()
[pte_basic_tests          ]: Validating PTE basic (read)
[pte_basic_tests          ]: Validating PTE basic (write)
[pte_basic_tests          ]: Validating PTE basic (read|write)
[pte_basic_tests          ]: Validating PTE basic (exec)
[pte_basic_tests          ]: Validating PTE basic (read|exec)
[pte_basic_tests          ]: Validating PTE basic (write|exec)
[pte_basic_tests          ]: Validating PTE basic (read|write|exec)
[pte_basic_tests          ]: Validating PTE basic (shared)
[pte_basic_tests          ]: Validating PTE basic (read|shared)
[pte_basic_tests          ]: Validating PTE basic (write|shared)
[pte_basic_tests          ]: Validating PTE basic (read|write|shared)
[pte_basic_tests          ]: Validating PTE basic (exec|shared)
[pte_basic_tests          ]: Validating PTE basic (read|exec|shared)
[pte_basic_tests          ]: Validating PTE basic (write|exec|shared)
[pte_basic_tests          ]: Validating PTE basic (read|write|exec|shared)

This adds a missing argument 'struct mm_struct *' in pud_basic_tests()
test .  This never got exposed before as PUD based THP is available only
on X86 platform where mm_pmd_folded(mm) call gets macro replaced without
requiring the mm_struct i.e __is_defined(__PAGETABLE_PMD_FOLDED).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Tested-by: Gerald Schaefer <[email protected]> [s390]
Reviewed-by: Steven Price <[email protected]>
Suggested-by: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/debug_vm_pgtable/basic: add validation for dirtiness after write protect

Patch series "mm/debug_vm_pgtable: Some minor updates", v3.

This series contains some cleanups and new test suggestions from Catalin
from an earlier discussion.

https://lore.kernel.org/linux-mm/20201123142237.GF17833@gaia/

This patch (of 2):

This adds validation tests for dirtiness after write protect conversion
for each page table level.  There are two new separate test types involved
here.

The first test ensures that a given page table entry does not become dirty
after pxx_wrprotect().  This is important for platforms like arm64 which
transfers and drops the hardware dirty bit (!PTE_RDONLY) to the software
dirty bit while making it an write protected one.  This test ensures that
no fresh page table entry could be created with hardware dirty bit set.
The second test ensures that a given page table entry always preserve the
dirty information across pxx_wrprotect().

This adds two previously missing PUD level basic tests and while here
fixes pxx_wrprotect() related typos in the documentation file.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Suggested-by: Catalin Marinas <[email protected]>
Tested-by: Gerald Schaefer <[email protected]> [s390]
Cc: Christophe Leroy <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Steven Price <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/debug: improve memcg debugging

The memcg_data is only valid on the head page, not the tail pages. Change
the format and location of the printout within the dump to match the other
parts of struct page better.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/slub: minor coding style tweaks

Add whitespace to fix coding style issues, improve code reading.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhiyuan Dai <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slub: remove slub_memcg_sysfs boot param and CONFIG_SLUB_MEMCG_SYSFS_ON

The boot param and config determine the value of memcg_sysfs_enabled,
which is unused since commit 10befea91b61 ("mm: memcg/slab: use a single
set of kmem_caches for all allocations") as there are no per-memcg kmem
caches anymore.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slub: splice cpu and page freelists in deactivate_slab()

In deactivate_slab() we currently move all but one objects on the cpu
freelist to the page freelist one by one using the costly cmpxchg_double()
operation.  Then we unfreeze the page while moving the last object on page
freelist, with a final cmpxchg_double().

This can be optimized to avoid the cmpxchg_double() per object.  Just
count the objects on cpu freelist (to adjust page->inuse properly) and
also remember the last object in the chain.  Then splice page->freelist to
the last object and effectively add the whole cpu freelist to
page->freelist while unfreezing the page, with a single cmpxchg_double().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slab, slub: stop taking cpu hotplug lock

SLAB has been using get/put_online_cpus() around creating, destroying and
shrinking kmem caches since 95402b382901 ("cpu-hotplug: replace
per-subsystem mutexes with get_online_cpus()") in 2008, which is supposed
to be replacing a private mutex (cache_chain_mutex, called slab_mutex
today) with system-wide mechanism, but in case of SLAB it's in fact used
in addition to the existing mutex, without explanation why.

SLUB appears to have avoided the cpu hotplug lock initially, but gained it
due to common code unification, such as 20cea9683ecc ("mm, sl[aou]b: Move
kmem_cache_create mutex handling to common code").

Regardless of the history, checking if the hotplug lock is actually needed
today suggests that it's not, and therefore it's better to avoid this
system-wide lock and the ordering this imposes wrt other locks (such as
slab_mutex).

Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache()
protected by slab_mutex, and cpu hotplug callbacks that also take the
slab_mutex, which is also taken by the common slab function that currently
also take the hotplug lock.  Thus the slab_mutex protection should be
sufficient.  Also per-cpu array caches are allocated for each possible
cpu, so not affected by their online/offline state.

In SLUB we have for_each_online_cpu() in functions that show statistics
and are already unprotected today, as racing with hotplug is not harmful.
Otherwise SLUB relies on percpu allocator.  The slub_cpu_dead() hotplug
callback takes the slab_mutex.

To sum up, this patch removes get/put_online_cpus() calls from slab as it
should be safe without further adjustments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slab, slub: stop taking memory hotplug lock

Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
SLAB and SLUB when creating, destroying or shrinking a cache.  It is quite
a heavy lock and it's best to avoid it if possible, as we had several
issues with lockdep complaining about ordering in the past, see e.g.
e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").

The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
can be summarized as follows: while there's slab_mutex synchronizing new
kmem cache creation and SLUB's MEM_GOING_ONLINE callback
slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
for the hotplugged node in the new kmem cache, because the hotplug
callback doesn't yet see the new cache, and cache creation in
init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
as that happens only later after the MEM_GOING_ONLINE callback.

Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the
per-node kmem_cache_node structures.  This nodemask would generally mirror
the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
control in its memory hotplug callbacks under the slab_mutex.  This patch
adds such nodemask and its handling.

Commit 03afc0e25f7f mentiones "issues like [the one above]", but there
don't appear to be further issues.  All the paths (shared for SLAB and
SLUB) taking the memory hotplug locks are also taking the slab_mutex,
except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().

We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
SLUB can enters the function from a write to sysfs 'shrink' file, thus
holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
within slab_mutex.  But on closer inspection we don't actually need to
protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
__kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
added in parallel hotplug is not fatal, and parallel hotremove does not
free kmem_cache_node's anymore after the previous patch, so use-after free
cannot happen.  The per-node shrinking itself is protected by
n->list_lock.  Same is true for SLAB, and SLOB is no-op.

SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c.  Its memory
hotplug callbacks are also protected by slab_mutex against races with
these paths.  The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
node is already set there during the MEM_GOING_ONLINE callback, so no
special care is needed for SLAB.

As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slub: stop freeing kmem_cache_node structures on node offline

Patch series "mm, slab, slub: remove cpu and memory hotplug locks".

Some related work caused me to look at how we use get/put_mems_online()
and get/put_online_cpus() during kmem cache
creation/descruction/shrinking, and realize that it should be actually
safe to remove all of that with rather small effort (as e.g.  Michal Hocko
suspected in some of the past discussions already).  This has the benefit
to avoid rather heavy locks that have caused locking order issues already
in the past.  So this is the result, Patches 2 and 3 remove memory hotplug
and cpu hotplug locking, respectively.  Patch 1 is due to realization that
in fact some races exist despite the locks (even if not removed), but the
most sane solution is not to introduce more of them, but rather accept
some wasted memory in scenarios that should be rare anyway (full memory
hot remove), as we do the same in other contexts already.

This patch (of 3):

Commit e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()") has
fixed a problematic locking order by removing the memory hotplug lock
get/put_online_mems() from show_slab_objects().  During the discussion, it
was argued [1] that this is OK, because existing slabs on the node would
prevent a hotremove to proceed.

That's true, but per-node kmem_cache_node structures are not necessarily
allocated on the same node and may exist even without actual slab pages on
the same node.  Any path that uses get_node() directly or via
for_each_kmem_cache_node() (such as show_slab_objects()) can race with
freeing of kmem_cache_node even with the !NULL check, resulting in
use-after-free.

To that end, commit e4f8e513c3d3 argues in a comment that:

* We don't really need mem_hotplug_lock (to hold off
* slab_mem_going_offline_callback) here because slab's memory hot
* unplug code doesn't destroy the kmem_cache->node[] data.

While it's true that slab_mem_going_offline_callback() doesn't free the
kmem_cache_node, the later callback slab_mem_offline_callback() actually
does, so the race and use-after-free exists.  Not just for
show_slab_objects() after commit e4f8e513c3d3, but also many other places
that are not under slab_mutex.  And adding slab_mutex locking or other
synchronization to SLUB paths such as get_any_partial() would be bad for
performance and error-prone.

The easiest solution is therefore to make the abovementioned comment true
and stop freeing the kmem_cache_node structures, accepting some wasted
memory in the full memory node removal scenario.  Analogically we also
don't free hotremoved pgdat as mentioned in [1], nor the similar per-node
structures in SLAB.  Importantly this approach will not block the
hotremove, as generally such nodes should be movable in order to succeed
hotremove in the first place, and thus the GFP_KERNEL allocated
kmem_cache_node will come from elsewhere.

[1] https://lore.kernel.org/linux-mm/20190924151147 [email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/slub: disable user tracing for kmemleak caches by default

If kmemleak is enabled, it uses a kmem cache for its own objects. These
objects are used to hold information kmemleak uses, including a stack
trace. If slub_debug is also turned on, each of them has *another* stack
trace, so the overhead adds up, and on my tests (on ARCH=um, admittedly)
2/3rds of the allocations end up being doing the stack tracing.

Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid storing
the essentially same data twice.

Link: https://lkml.kernel.org/r/20210113215114.d94efa13ba30.I117b6764e725b3192318bbcf4269b13b709539ae@changeid
Signed-off-by: Johannes Berg <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/slab: minor coding style tweaks

Fix some coding style issues, improve code reading. Adds whitespace to
clearly separate the parameters.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhiyuan Dai <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sl?b.c: remove ctor argument from kmem_cache_flags

This argument hasn't been used since e153362a50a3 ("slub: Remove objsize
check in kmem_cache_flags()") so simply remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, tracing: record slab name for kmem_cache_free()

Currently, a trace record generated by the RCU core is as below.

... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f3b49a66

It doesn't tell us what the RCU core has freed.

This patch adds the slab name to trace_kmem_cache_free().
The new format is as follows.

... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000037f79c8d name=dentry
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f78cb7b5 name=sock_inode_cache
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000018768985 name=pool_workqueue
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=000000006a6cb484 name=radix_tree_node

We can use it to understand what the RCU core is going to free. For
example, some users maybe interested in when the RCU core starts
freeing reclaimable slabs like dentry to reduce memory pressure.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jacob Wen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ramfs: support O_TMPFILE

[[email protected]: update inode_operations.tmpfile]

Link: http://lkml.kernel.org/r/20190206073349.GA15311@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs: delete repeated words in comments

Delete duplicate words in fs/*.c.
The doubled words that are being dropped are:
that, be, the, in, and, for

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: simplify the calculation of variables

Fix the following coccicheck warnings:

fs/ocfs2/refcounttree.c:981:16-18: WARNING !A || A && B is equivalent to !A || B.

Link: https://lkml.kernel.org/r/1612235424-80367-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jiapeng Chong <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: fix a use after free on error

The error handling in this function frees "reg" but it is still on the
"o2hb_all_regions" list so it will lead to a use after freew. Joseph Qi
points out that we need to clear the bit in the "o2hb_region_bitmap" as
well

Link: https://lkml.kernel.org/r/YBk4M6HUG8jB/jc7@mwanda
Fixes: 1cf257f51191 ("ocfs2: fix memory leak")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: clean up some definitions which are not used any more

There are some definitions which is not used anymore in OCFS2 module, so
as to be removed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guozhonghua <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: remove redundant conditional before iput

iput handles NULL pointers gracefully, so there's no need to check the
pointer before the call.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yi Li <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ntfs: check for valid standard information attribute

Mounting a corrupted filesystem with NTFS resulted in a kernel crash.

We should check for valid STANDARD_INFORMATION attribute offset and length
before trying to access it

Link: https://lkml.kernel.org/r/[email protected]
Link: https://syzkaller.appspot.com/bug?extid=c584225dabdea2f71969
Signed-off-by: Rustam Kovhaev <[email protected]>
Reported-by: [email protected]
Tested-by: [email protected]
Acked-by: Anton Altaparmakov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ntfs: layout.h: delete duplicated words

Drop the repeated words "the" and "in" in comments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Anton Altaparmakov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: add more spellings to spelling.txt

Here are some of the more common spelling mistakes and typos that I've
found while fixing up spelling mistakes in the kernel since September 2020

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: add "allocted" and "exeeds" typo

Increase "allocted" and "exeeds" spelling error check.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: dingsenjie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: check for "exeeds"

Increase exeeds spelling error check.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: zuoqilin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spelling.txt: increase error-prone spell checking

Increase maping spelling error check.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: WenZhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hexagon: remove CONFIG_EXPERIMENTAL from defconfigs

Since CONFIG_EXPERIMENTAL was removed in 2013, go ahead and drop it
from any defconfig files.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3d374d09f16f ("final removal of CONFIG_EXPERIMENTAL")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Brian Cain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring updates from David Howells:
"Here's a set of minor keyrings fixes/cleanups that I've collected from
  various people for the upcoming merge window.

  A couple of them might, in theory, be visible to userspace:

   - Make blacklist_vet_description() reject uppercase letters as they
     don't match the all-lowercase hex string generated for a blacklist
     search.

     This may want reconsideration in the future, but, currently, you
     can't add to the blacklist keyring from userspace and the only
     source of blacklist keys generates lowercase descriptions.

   - Fix blacklist_init() to use a new KEY_ALLOC_* flag to indicate that
     it wants KEY_FLAG_KEEP to be set rather than passing KEY_FLAG_KEEP
     into keyring_alloc() as KEY_FLAG_KEEP isn't a valid alloc flag.

     This isn't currently a problem as the blacklist keyring isn't
     currently writable by userspace.

  The rest of the patches are cleanups and I don't think they should
  have any visible effect"

* tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  watch_queue: rectify kernel-doc for init_watch()
  certs: Replace K{U,G}IDT_INIT() with GLOBAL_ROOT_{U,G}ID
  certs: Fix blacklist flag type confusion
  PKCS#7: Fix missing include
  certs: Fix blacklisted hexadecimal hash string check
  certs/blacklist: fix kernel doc interface issue
  crypto: public_key: Remove redundant header file from public_key.h
  keys: remove trailing semicolon in macro definition
  crypto: pkcs7: Use match_string() helper to simplify the code
  PKCS#7: drop function from kernel-doc pkcs7_validate_trust_one
  encrypted-keys: Replace HTTP links with HTTPS ones
  crypto: asymmetric_keys: fix some comments in pkcs7_parser.h
  KEYS: remove redundant memset
  security: keys: delete repeated words in comments
  KEYS: asymmetric: Fix kerneldoc
  security/keys: use kvfree_sensitive()
  watch_queue: Drop references to /dev/watch_queue
  keys: Remove outdated __user annotations
  security: keys: Fix fall-through warnings for Clang

Merge tag 'clang-lto-v5.12-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more clang LTO updates from Kees Cook:
"Clang LTO x86 enablement.

  Full disclosure: while this has _not_ been in linux-next (since it
  initially looked like the objtool dependencies weren't going to make
  v5.12), it has been under daily build and runtime testing by Sami for
  quite some time. These x86 portions have been discussed on lkml, with
  Peter, Josh, and others helping nail things down.

  The bulk of the changes are to get objtool working happily. The rest
  of the x86 enablement is very small.

  Summary:

   - Generate __mcount_loc in objtool (Peter Zijlstra)

   - Support running objtool against vmlinux.o (Sami Tolvanen)

   - Clang LTO enablement for x86 (Sami Tolvanen)"

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
* tag 'clang-lto-v5.12-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kbuild: lto: force rebuilds when switching CONFIG_LTO
  x86, build: allow LTO to be selected
  x86, cpu: disable LTO for cpu.c
  x86, vdso: disable LTO only for vDSO
  kbuild: lto: postpone objtool
  objtool: Split noinstr validation from --vmlinux
  x86, build: use objtool mcount
  tracing: add support for objtool mcount
  objtool: Don't autodetect vmlinux.o
  objtool: Fix __mcount_loc generation with Clang's assembler
  objtool: Add a pass for generating __mcount_loc

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc updates from David Miller:
"A host of mall cleanups and adjustments that have accumulated while I
  was away, nothing major"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: (26 commits)
  sparc: make xchg() into a statement expression
  sparc64: Use arch_validate_flags() to validate ADI flag
  sparc32: Fix comparing pointer to 0 coccicheck warning
  sparc: fix led.c driver when PROC_FS is not enabled
  sparc: Fix handling of page table constructor failure
  sparc64: only select COMPAT_BINFMT_ELF if BINFMT_ELF is set
  tty: hvcs: Drop unnecessary if block
  tty: vcc: Drop unnecessary if block
  tty: vcc: Drop impossible to hit WARN_ON
  sparc: sparc64_defconfig: add necessary configs for qemu
  sparc64: switch defconfig from the legacy ide driver to libata
  sparc32: Preserve clone syscall flags argument for restarts due to signals
  sparc32: Limit memblock allocation to low memory
  sparc: Replace test_ti_thread_flag() with test_tsk_thread_flag()
  sbus: char: Remove meaningless jump label out_free
  sparc32: signal: Fix stack trampoline for RT signals
  sparc: remove SA_STATIC_ALLOC macro definition
  sparc: use for_each_child_of_node() macro
  sparc: Use fallthrough pseudo-keyword
  sparc32: srmmu: improve type safety of __nocache_fix()
  ...

Merge tag 'dmaengine-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
"We have couple of drivers removed a new driver and bunch of new device
  support and few updates to drivers for this round.

  New drivers/devices:
   - Intel LGM SoC DMA driver
   - Actions Semi S500 DMA controller
   - Renesas r8a779a0 dma controller
   - Ingenic JZ4760(B) dma controller
   - Intel KeemBay AxiDMA controller

  Removed:
   - Coh901318 dma driver
   - Zte zx dma driver
   - Sirfsoc dma driver

  Updates:
   - mmp_pdma, mmp_tdma gained module support
   - imx-sdma become modern and dropped platform data support
   - dw-axi driver gained slave and cyclic dma support"

* tag 'dmaengine-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (58 commits)
  dmaengine: dw-axi-dmac: remove redundant null check on desc
  dmaengine: xilinx_dma: Alloc tx descriptors GFP_NOWAIT
  dmaengine: dw-axi-dmac: Virtually split the linked-list
  dmaengine: dw-axi-dmac: Set constraint to the Max segment size
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA BYTE and HALFWORD registers
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA handshake
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA support
  dmaengine: drivers: Kconfig: add HAS_IOMEM dependency to DW_AXI_DMAC
  dmaengine: dw-axi-dmac: Add Intel KeemBay DMA register fields
  dt-binding: dma: dw-axi-dmac: Add support for Intel KeemBay AxiDMA
  dmaengine: dw-axi-dmac: Support burst residue granularity
  dmaengine: dw-axi-dmac: Support of_dma_controller_register()
  dmaegine: dw-axi-dmac: Support device_prep_dma_cyclic()
  dmaengine: dw-axi-dmac: Support device_prep_slave_sg
  dmaengine: dw-axi-dmac: Add device_config operation
  dmaengine: dw-axi-dmac: Add device_synchronize() callback
  dmaengine: dw-axi-dmac: move dma_pool_create() to alloc_chan_resources()
  dmaengine: dw-axi-dmac: simplify descriptor management
  dt-bindings: dma: Add YAML schemas for dw-axi-dmac
  dmaengine: ti: k3-psil: optimize struct psil_endpoint_config for size
  ...

Merge tag 'acpi-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more ACPI updates from Rafael Wysocki:
"Fix race condition in generic_serial_bus (I2C) and GPIO Operation
  Region handling in ACPICA and reduce some related code duplication
  (Hans de Goede)"

* tag 'acpi-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPICA: Remove some code duplication from acpi_ev_address_space_dispatch
  ACPICA: Fix race in generic_serial_bus (I2C) and GPIO op_region parameter handling

Merge tag 'pm-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
"These are fixes and cleanups on top of the power management material
  for 5.12-rc1 merged previously.

  Specifics:

   - Address cpufreq regression introduced in 5.11 that causes CPU
     frequency reporting to be distorted on systems with CPPC that use
     acpi-cpufreq as the scaling driver (Rafael Wysocki).

   - Fix regression introduced during the 5.10 development cycle related
     to CPU hotplug and policy recreation in the qcom-cpufreq-hw driver
     (Shawn Guo).

   - Fix recent regression in the operating performance points (OPP)
     framework that may cause frequency updates to be skipped by mistake
     in some cases (Jonathan Marek).

   - Simplify schedutil governor code and remove a misleading comment
     from it (Yue Hu).

   - Fix kerneldoc comment typo in the cpufreq core (Yue Hu)"

* tag 'pm-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Fix typo in kerneldoc comment
  cpufreq: schedutil: Remove update_lock comment from struct sugov_policy definition
  cpufreq: schedutil: Remove needless sg_policy parameter from ignore_dl_rate_limit()
  cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
  cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks
  opp: Don't skip freq update for different frequency

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:
"Mostly existing driver fixes plus a new driver for game controllers
  directly connected to Nintendo 64, and an enhancement for keyboards
  driven by Chrome OS EC to communicate layout of the top row to
  userspace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (47 commits)
  Input: st1232 - fix NORMAL vs. IDLE state handling
  Input: aiptek - convert sysfs sprintf/snprintf family to sysfs_emit
  Input: alps - fix spelling of "positive"
  ARM: dts: cros-ec-keyboard: Use keymap macros
  dt-bindings: input: Fix the keymap for LOCK key
  dt-bindings: input: Create macros for cros-ec keymap
  Input: cros-ec-keyb - expose function row physical map to userspace
  dt-bindings: input: cros-ec-keyb: Add a new property describing top row
  Input: applespi - fix occasional crc errors under load.
  Input: applespi - don't wait for responses to commands indefinitely.
  Input: st1232 - add IDLE state as ready condition
  Input: zinitix - fix return type of zinitix_init_touch()
  Input: i8042 - add ASUS Zenbook Flip to noselftest list
  Input: add missing dependencies on CONFIG_HAS_IOMEM
  Input: joydev - prevent potential read overflow in ioctl
  Input: elo - fix an error code in elo_connect()
  Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X|S
  Input: sur40 - fix an error code in sur40_probe()
  Input: elants_i2c - detect enum overflow
  Input: zinitix - remove unneeded semicolon
  ...