Git Repo - linux.git/log

mm: use CPU_BITS_NONE to initialize init_mm.cpu_bitmask

Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with
neat CPU_BITS_NONE macro.

And, since init_mm.cpu_bitmask is statically set to zero, there is no need
to clear it again in start_kernel().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc.c: move 'area->pages' after if statement

If the !area->pages check is true because memory allocation failed, area is
freed.

In this case 'area->pages = pages' should not be executed. So move
'area->pages = pages' after the if statement.

[[email protected]: give area->pages the same treatment]
Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
Signed-off-by: Austin Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Roman Penyaev <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc: modify struct vmap_area to reduce its size

Objective
---------

The current implementation of struct vmap_area wasted space.

After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.

Description
-----------

1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
because

A) "subtree_max_size" is only used when vmap_area is in "free" tree

B) "vm" is only used when vmap_area is in "busy" tree

C) "purge_list" is only used when vmap_area is in vmap_purge_list

2) Eliminate "flags".

Since only one flag, VM_VM_AREA, is being used, and the same thing can be
done by checking whether "vm" is NULL, "flags" can be eliminated.
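
The packing described above can be sketched in user space. This is only a model: the *_model types are stand-ins assumed to have the same word counts as the kernel's rb_node, list_head and llist_node, and the field order of the "old" layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;	/* one machine word */

/* Hypothetical stand-ins for kernel types, sized in whole words. */
struct rb_node_model { word_t parent_color; void *right; void *left; };
struct list_head_model { void *next; void *prev; };
struct llist_node_model { void *next; };
struct vm_struct_model;		/* opaque, only used through a pointer */

/* Model of the layout before the change: 11 words. */
struct vmap_area_old {
	word_t va_start;
	word_t va_end;
	word_t subtree_max_size;
	word_t flags;
	struct rb_node_model rb_node;
	struct list_head_model list;
	struct llist_node_model purge_list;
	struct vm_struct_model *vm;
};

/* Model of the layout after the change: 8 words. */
struct vmap_area_new {
	word_t va_start;
	word_t va_end;
	struct rb_node_model rb_node;
	struct list_head_model list;
	/* Only one of these is live at a time, so they can share a word. */
	union {
		word_t subtree_max_size;		/* in the "free" tree */
		struct vm_struct_model *vm;		/* in the "busy" tree */
		struct llist_node_model purge_list;	/* in vmap_purge_list */
	};
};

/* Stands in for the old VM_VM_AREA flag test (hypothetical helper name). */
static int va_is_vm_area(const struct vmap_area_new *va)
{
	return va->vm != NULL;
}
```

The union is only safe because the three members are never needed at the same time, which is exactly the argument made in points A) to C).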

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pengfei Li <[email protected]>
Suggested-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc: do not keep unpurged areas in the busy tree

The busy tree can be quite big: even after an area is freed or unmapped,
it still stays there until the "purge" logic removes it.

1) Optimize and reduce the size of "busy" tree by removing a node from
   it right away as soon as user triggers free paths.  It is possible to
   do so, because the allocation is done using another augmented tree.

The vmalloc test driver shows the difference, for example the
"fix_size_alloc_test" is ~11% better compared with the default configuration:

sudo ./test_vmalloc.sh performance

<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>

<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>

2) Since the busy tree now contains allocated areas only and does not
   interfere with lazily free nodes, introduce the new function
   show_purge_info() that dumps "unpurged" areas that is propagated
   through "/proc/vmallocinfo".

3) Eliminate VM_LAZY_FREE flag.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Pengfei Li <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: remove NULL check in clear_hwpoisoned_pages()

There is no possibility for memmap to be NULL in the current codebase.

This check was added in commit 95a4774d055c ("memory-hotplug: update
mce_bad_pages when removing the memory") where memmap was originally
inited to NULL, and only conditionally given a value.

The code that could have passed a NULL has been removed by commit
ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
longer a possibility that memmap can be NULL.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alastair D'Silva <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: don't manually decrement num_poisoned_pages

Use the function written to do it instead.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alastair D'Silva <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: use __nr_to_section(section_nr) to get mem_section

__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)).

Since we already have section_nr, it is not necessary to get mem_section
from start_pfn. By doing so, we drop one redundant operation.
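
A rough user-space model of the relationship between the two lookups follows. The shift value, the table size and the *_model names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define PFN_SECTION_SHIFT_MODEL 15UL	/* assumed value, illustration only */
#define NR_SECTIONS_MODEL 8UL

struct mem_section_model { unsigned long section_mem_map; };

static struct mem_section_model sections_model[NR_SECTIONS_MODEL];

/* Section number that a PFN falls into. */
static unsigned long pfn_to_section_nr_model(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT_MODEL;
}

/* Direct lookup by section number. */
static struct mem_section_model *nr_to_section_model(unsigned long nr)
{
	return nr < NR_SECTIONS_MODEL ? &sections_model[nr] : NULL;
}

/* Lookup by PFN: one extra shift on top of the direct lookup. */
static struct mem_section_model *pfn_to_section_model(unsigned long pfn)
{
	return nr_to_section_model(pfn_to_section_nr_model(pfn));
}
```

When the caller already holds the section number, calling the direct lookup skips the shift that the PFN-based lookup would redo.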

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Tested-by: Anshuman Khandual <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: fix ALIGN() without power of 2 in sparse_buffer_alloc()

The size argument passed into sparse_buffer_alloc() has already been
aligned with PAGE_SIZE or PMD_SIZE.

If the aligned size is not a power of 2 (e.g. 0x480000), PTR_ALIGN()
will return a wrong value. Use roundup() to round sparsemap_buf
up to the next multiple of size.
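
To illustrate why mask-based alignment breaks for a size such as 0x480000, here is a minimal user-space sketch. The helper names are hypothetical; they mirror the idea behind ALIGN() and roundup() rather than the kernel macros themselves.

```c
#include <assert.h>

/* Mask-based alignment: only correct when a is a power of 2. */
static unsigned long align_mask_model(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Division-based rounding: correct for any non-zero a. */
static unsigned long round_up_model(unsigned long x, unsigned long a)
{
	return ((x + a - 1) / a) * a;
}
```

For a power-of-2 size both helpers agree. For 0x480000 the mask ~(a - 1) clears the wrong bits, so the mask-based result is not a multiple of the size.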

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lecopzer Chen <[email protected]>
Signed-off-by: Mark-PK Tsai <[email protected]>
Cc: YJ Chiang <[email protected]>
Cc: Lecopzer Chen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: fix memory leak of sparsemap_buf in aligned memory

sparse_buffer_alloc(xsize) gets the size of memory from sparsemap_buf
after being aligned with the size.  However, the size is at least
PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
than PAGE_SIZE.

Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
sparsemap_buf_end. Since sparsemap_buf may be changed by PTR_ALIGN()
first, the aligned space before sparsemap_buf is wasted and no one will
touch it.

In our ARM32 platform (without SPARSEMEM_VMEMMAP)
  sparse_buffer_init
    Reserve d359c000 - d3e9c000 (9M)
  sparse_buffer_alloc
    Alloc   d3a00000 - d3e80000 (4.5M)
  sparse_buffer_fini
    Free    d3e80000 - d3e9c000 (~=100k)
The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.

In ARM64 platform (with SPARSEMEM_VMEMMAP)

  sparse_buffer_init
    Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
  sparse_buffer_alloc
    Alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
  sparse_buffer_fini
    Free    ffffffc07f600000 - ffffffc07f623000 (140K)
The reserved memory between ffffffc07d623000 - ffffffc07d800000
(~=1.9M) is unfreed.

Let's explicitly free the redundant aligned memory.
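
The arithmetic behind the leaked region can be sketched as follows. The struct and helper names are hypothetical, and the code only models the address calculation, not the kernel's actual freeing path.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range_model { uint64_t start; uint64_t end; };

/*
 * The part of the reserved buffer that alignment skips over:
 * everything from the old buffer start up to the aligned pointer.
 */
static struct range_model leading_gap_model(uint64_t buf, uint64_t aligned)
{
	struct range_model gap = { buf, aligned > buf ? aligned : buf };

	return gap;
}

static uint64_t range_size_model(struct range_model r)
{
	return r.end - r.start;
}
```

With the ARM32 numbers above (buffer at d359c000, aligned pointer at d3a00000) the gap is 0x464000 bytes, which matches the ~4.4M reported as unfreed.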

[[email protected]: mark sparse_buffer_free as __meminit]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lecopzer Chen <[email protected]>
Signed-off-by: Mark-PK Tsai <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: YJ Chiang <[email protected]>
Cc: Lecopzer Chen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: s/is/if

Correct typo in comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Souptick Joarder <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: online_pages cannot be 0 in online_pages()

walk_system_ram_range() will fail with -EINVAL in case
online_pages_range() was never called (== no resource applicable in the
range). Otherwise, we will always call online_pages_range() with nr_pages
> 0 and, therefore, have online_pages > 0.

Remove that special handling.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: make sure the pfn is aligned to the order when onlining

Commit a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as
higher order") assumed that any PFN we get via memory resources is aligned
to MAX_ORDER - 1. I am not convinced that is always true. Let's play
safe, check the alignment and fall back to single pages.

akpm: warn in this situation so we get to find out if and why this ever
occurs.
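
A minimal sketch of the fallback, assuming a MAX_ORDER of 11 and using hypothetical helper names. The real patch also warns once when the PFN is unaligned, which this model omits.

```c
#include <assert.h>

#define MAX_ORDER_MODEL 11U	/* assumed value, illustration only */

/* True if pfn is a multiple of 2^order. */
static int pfn_aligned_model(unsigned long pfn, unsigned int order)
{
	return (pfn & ((1UL << order) - 1)) == 0;
}

/* Pick the order used to online pages starting at pfn. */
static unsigned int pick_online_order_model(unsigned long pfn)
{
	unsigned int order = MAX_ORDER_MODEL - 1;

	if (!pfn_aligned_model(pfn, order))
		order = 0;	/* unaligned start: fall back to single pages */
	return order;
}
```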

[[email protected]: add WARN_ON_ONCE()]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: simplify online_pages_range()

online_pages always corresponds to nr_pages. Simplify the code, getting
rid of online_pages_blocks(). Add some comments.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: drop PageReserved() check in online_pages_range()

move_pfn_range_to_zone() will set all pages to PG_reserved via
memmap_init_zone(). The only way a page could no longer be reserved would
be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
done (the online_page callback is used for that purpose by e.g., Hyper-V
instead). walk_system_ram_range() will never call online_pages_range()
with duplicate PFNs, so drop the PageReserved() check.

This seems to be a leftover from ancient times where the memmap was
initialized when adding memory and we wanted to check for already onlined
memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()

Patch series "mm/memory_hotplug: online_pages() cleanups", v2.

Some cleanups (+ one fix for a special case) in the context of
online_pages().

This patch (of 5):

This makes it clearer that we will never call func() with duplicate PFNs
in case we have multiple sub-page memory resources. All unaligned parts
of PFNs are completely discarded.
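
A small model of the rounding, assuming 4K pages; the helper names are hypothetical. The start of a resource is rounded up and its end is rounded down, so partial pages at either edge are never reported.

```c
#include <assert.h>

#define PAGE_SHIFT_MODEL 12UL	/* assumed 4K pages */
#define PAGE_SIZE_MODEL (1UL << PAGE_SHIFT_MODEL)

/* First PFN that starts at or after addr. */
static unsigned long pfn_up_model(unsigned long addr)
{
	return (addr + PAGE_SIZE_MODEL - 1) >> PAGE_SHIFT_MODEL;
}

/* PFN of the page that contains addr. */
static unsigned long pfn_down_model(unsigned long addr)
{
	return addr >> PAGE_SHIFT_MODEL;
}

/* Number of whole pages inside the inclusive byte range [start, end]. */
static unsigned long whole_pages_model(unsigned long start, unsigned long end)
{
	unsigned long first = pfn_up_model(start);
	unsigned long limit = pfn_down_model(end + 1);

	return limit > first ? limit - first : 0;
}
```

A resource smaller than a page yields zero whole pages, so two sub-page resources that share a page can never cause that PFN to be passed to the callback twice.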

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: prevent memory leak when reusing pgdat

When offlining a node in try_offline_node(), pgdat is not released, so
that pgdat can be reused in hotadd_new_pgdat(). However, we reallocate
pgdat->per_cpu_nodestats even if this pgdat is reused.

This patch prevents the memory leak by just allocating per_cpu_nodestats
when it is a new pgdat.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/memory.c: don't store end_section_nr in memory blocks

Each memory block spans the same amount of sections/pages/bytes.  The size
is determined before the first memory block is created.  No need to store
what we can easily calculate - and the calculations even look simpler now.
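
As a sketch of "calculate instead of store", assuming a fixed number of sections per block. The value 4 and the *_model names are made up for illustration.

```c
#include <assert.h>

/* Assumed to be fixed before the first memory block is created. */
static const unsigned long sections_per_block_model = 4;

/* Only the start is stored; the end is derived on demand. */
struct memory_block_model {
	unsigned long start_section_nr;
};

/* Last section number covered by the block (inclusive). */
static unsigned long block_end_section_nr_model(const struct memory_block_model *mem)
{
	return mem->start_section_nr + sections_per_block_model - 1;
}
```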

Michal brought up the idea of variable-sized memory blocks.  However, if
we ever implement something like this, we will need an API compatibility
switch and reworks at various places (most code assumes a fixed memory
block size).  So let's cleanup what we have right now.

While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

driver/base/memory.c: validate memory block size early

Let's validate the memory block size early, when initializing the memory
device infrastructure. Fail hard in case the value is not suitable.

As nobody checks the return value of memory_dev_init(), turn it into a
void function and fail with a panic in all scenarios instead. Otherwise,
we'll crash later during boot when core/drivers expect that the memory
device infrastructure (including memory_block_size_bytes()) works as
expected.

I think long term, we should move the whole memory block size
configuration (set_memory_block_size_order() and
memory_block_size_bytes()) into drivers/base/memory.c.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/memory.c: fixup documentation of removable/phys_index/block_size_bytes

Let's rephrase to memory block terminology and add some further
clarifications.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/node.c: simplify unregister_memory_block_under_nodes()

We don't allow offlining memory block devices that belong to multiple
NUMA nodes.  Therefore, such devices can never get removed.  It is
sufficient to process a single node when removing the memory block.  No
need to iterate over each and every PFN.

We already have the nid stored for each memory block.  Make sure that the
nid always has a sane value.

Please note that checking for node_online(nid) is not required.  If we
had a memory block belonging to a node that is no longer online,
then we would have a BUG in the node offlining code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: remove move_pfn_range()

Let's remove this indirection. We need the zone in the caller either way,
so let's just detect it there. Add some documentation for
move_pfn_range_to_zone() instead.

[[email protected]: restore newline, per David]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: do not hash address in print_bad_pte()

Use %px to show the actual address in print_bad_pte()
to help us debug issues.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: consolidate pgtable_cache_init() and pgd_cache_init()

Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
cache for page table allocations on several architectures that do not use
PAGE_SIZE tables for one or more levels of the page table hierarchy.

Most architectures do not implement these functions and use __weak default
NOP implementation of pgd_cache_init(). Since there is no such default
for pgtable_cache_init(), its empty stub is duplicated among most
architectures.

Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
drop empty stubs of pgtable_cache_init().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Will Deacon <[email protected]> [arm64]
Acked-by: Thomas Gleixner <[email protected]> [x86]
Cc: Catalin Marinas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

microblaze: switch to generic version of pte allocation

The microblaze implementation of pte_alloc_one() has a provision to
allocate PTEs from high memory, but neither CONFIG_HIGHPTE nor pte_map*()
versions suitable for HIGHPTE are defined.

Apart from that, the microblaze version of pte_alloc_one() is identical to
the generic one, as are the implementations of pte_free() and
pte_free_kernel().

Switch microblaze to use the generic versions of these functions. Also
remove pte_free_slow() that is not referenced anywhere in the code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Michal Simek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sh: switch to generic version of pte allocation

The sh implementations of pte_alloc_one(), pte_alloc_one_kernel(),
pte_free_kernel() and pte_free() are identical to the generic ones except
for the lack of __GFP_ACCOUNT for user PTE allocations.

Switch sh to use the generic versions of these functions.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ia64: switch to generic version of pte allocation

The ia64 implementations of pte_alloc_one(), pte_alloc_one_kernel(),
pte_free_kernel() and pte_free() are identical to the generic ones except
for the lack of __GFP_ACCOUNT for user PTE allocations.

Switch ia64 to use the generic versions of these functions.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove quicklist page table caches

Patch series "mm: remove quicklist page table caches".

A while ago Nicholas proposed to remove quicklist page table caches [1].

I've rebased his patch on the current upstream and switched ia64 and sh to
use generic versions of PTE allocation.

[1] https://lore.kernel.org/linux-mm/20190711030339 [email protected]

This patch (of 3):

Remove page table allocator "quicklists". These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.

The numbers in the initial commit look interesting but probably don't
apply anymore. If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.

Also it might be better to instead make more general improvements to page
allocator if this is still so slow.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: release the spinlock on zap_pte_range

In our testing (camera recording), Miguel and Wei found that
unmap_page_range() easily takes more than 6ms with preemption disabled.
When I looked into it, the reason was that it holds the page table
spinlock during the entire 512-page operation in a PMD.  6.2ms is never
trivial for user experience if an RT task cannot run in that time, because
it could cause frame drops or audio glitches.

I had time to benchmark it by adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range.  The test
device is a 2018 premium mobile device.

I can rather easily get a 2ms delay when releasing 2M (i.e. 512 pages)
while the task runs on a little core, even though there is no IPI or LRU
lock contention.  That is already too heavy.

If I remove activate_page, 35-40% of the zap_pte_range overhead is gone, so
most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed.  Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.

So this patch adds a check for need_resched() in the loop, and a
preemption point if necessary.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Miguel de Dios <[email protected]>
Reported-by: Wei Wang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove redundant assignment of entry

Since ptent will not be changed after previous assignment of entry, it is
not necessary to do the assignment again.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

net/xdp: convert put_page() to put_user_page*()

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
introduce put_user_page*(), placeholder versions").

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Magnus Karlsson <[email protected]>
Cc: David S. Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/gpu/drm/via: convert put_page() to put_user_page*()

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
introduce put_user_page*(), placeholder versions").

Also reverse the order of a comparison, in order to placate checkpatch.pl.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: add make_dirty arg to put_user_pages_dirty_lock()

Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.

There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:

* The block/bio related changes (Jerome mostly wrote those, but I've had
  to move stuff around extensively, and add a little code)

* mm/ changes

* other subsystem patches

* an RFC that shows the current state of the tracking patch set.  That
  can only be applied after all call sites are converted, but it's good to
  get an early look at it.

This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
introduce put_user_page*(), placeholder versions").

This patch (of 3):

Provide a more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty().  This is based on the following:

1.  Lots of call sites become simpler if a bool is passed into
   put_user_page*(), instead of making the call site choose which
   put_user_page*() variant to call.

2.  Christoph Hellwig's observation that set_page_dirty_lock() is
   usually correct, and set_page_dirty() is usually a bug, or at least
   questionable, within a put_user_page*() calling chain.

This leads to the following API choices:

    * put_user_pages_dirty_lock(page, npages, make_dirty)

    * There is no put_user_pages_dirty(). You have to
      hand code that, in the rare case that it's
      required.
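
The shape of the resulting API can be modelled in user space roughly as below. The *_model names and the struct are hypothetical: a plain flag stands in for the page's dirty state and a counter stands in for its reference count, so this shows only the calling convention, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: a reference count and a dirty flag. */
struct page_model {
	int refcount;
	bool dirty;
};

/* Drop one reference on a single page. */
static void put_user_page_model(struct page_model *page)
{
	page->refcount--;
}

/*
 * Single entry point: the caller passes make_dirty instead of
 * choosing between several put_user_page*() variants.
 */
static void put_user_pages_dirty_lock_model(struct page_model **pages,
					    size_t npages, bool make_dirty)
{
	for (size_t i = 0; i < npages; i++) {
		if (make_dirty && !pages[i]->dirty)
			pages[i]->dirty = true;	/* stands in for set_page_dirty_lock() */
		put_user_page_model(pages[i]);
	}
}
```

A call site that previously branched on a "dirty" condition to pick a variant can pass that condition straight through as make_dirty.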

[[email protected]: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: do not share cgroup iteration between reclaimers

One of our services observed a high rate of cgroup OOM kills in the
presence of large amounts of clean cache.  Debugging showed that the
culprit is the shared cgroup iteration in page reclaim.

Under high allocation concurrency, multiple threads enter reclaim at the
same time.  Fearing overreclaim when we first switched from the single
global LRU to cgrouped LRU lists, we introduced a shared iteration state
for reclaim invocations - whether 1 or 20 reclaimers are active
concurrently, we only walk the cgroup tree once: the 1st reclaimer
reclaims the first cgroup, the second the second one etc.  With more
reclaimers than cgroups, we start another walk from the top.

This sounded reasonable at the time, but the problem is that reclaim
concurrency doesn't scale with allocation concurrency.  As reclaim
concurrency increases, the amount of memory individual reclaimers get to
scan gets smaller and smaller.  Individual reclaimers may only see one
cgroup per cycle, and that may not have much reclaimable memory.  We see
individual reclaimers declare OOM when there is plenty of reclaimable
memory available in cgroups they didn't visit.

This patch does away with the shared iterator, and every reclaimer is
allowed to scan the full cgroup tree and see all of reclaimable memory,
just like it would on a non-cgrouped system.  This way, when OOM is
declared, we know that the reclaimer actually had a chance.

To still maintain fairness in reclaim pressure, disallow cgroup reclaim
from bailing out of the tree walk early.  Kswapd and regular direct
reclaim already don't bail, so it's not clear why limit reclaim would have
to, especially since it only walks subtrees to begin with.

This change completely eliminates the OOM kills on our service, while
showing no signs of overreclaim - no increased scan rates, %sys time, or
abrupt free memory spikes.  I tested across 100 machines that have 64G of
RAM and host about 300 cgroups each.

[ It's possible overreclaim never was a *practical* issue to begin
  with - it was simply a concern we had on the mailing lists at the
  time, with no real data to back it up. But we have also added more
  bail-out conditions deeper inside reclaim (e.g. the proportional
  exit in shrink_node_memcg) since. Regardless, now we have data that
  suggests full walks are more reliable and scale just fine. ]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: switch to rcu protection in drain_all_stock()

Commit 72f0184c8a00 ("mm, memcg: remove hotplug locking from try_charge")
introduced css_tryget()/css_put() calls in drain_all_stock(), which are
supposed to protect the target memory cgroup from being released during
the mem_cgroup_is_descendant() call.

However, it's not completely safe.  In theory, memcg can go away between
reading stock->cached pointer and calling css_tryget().

This can happen if drain_all_stock() races with drain_local_stock()
performed on the remote cpu as a result of a work item scheduled by the
previous invocation of drain_all_stock().

The race is a bit theoretical and there are few chances to trigger it, but
the current code looks a bit confusing, so it makes sense to fix it
anyway.  The code looks as if css_tryget() and css_put() are used to
protect stock draining.  That's not necessary because stocked pages
hold references to the cached cgroup.  And it obviously won't work for
work items scheduled on other cpus.

So, let's read the stock->cached pointer and evaluate the memory cgroup
inside a rcu read section, and get rid of css_tryget()/css_put() calls.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, memcg: throttle allocators when failing reclaim over memory.high

We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup.  This happens especially with swap space -- either
when none is configured, or swap is full.  These failures often also don't
have enough warning to allow one to react, whether for a human or for a
daemon monitoring PSI.

Here is output from a simple program showing how long it takes in usec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95  1035
    96  1038
    97  1000
    98  1036
    99  1048
    100 1590
    101 1968
    102 1776
    103 1863
    104 1757
    105 1921
    106 1893
    107 1760
    108 1748
    109 1843
    110 1716
    111 1924
    112 1776
    113 1831
    114 1766
    115 1836
    116 1588
    117 1912
    118 1802
    119 1857
    120 1731
    [...]
    [System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode, but
it's nowhere near enough to contain growth.  It also doesn't get worse the
more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers.  In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached.  Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads as
they exceed their high threshold.  However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with no
swap, or where swap has become exhausted, we can end up with memory.high
being breached and us having no weapons left in our arsenal to combat
runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad the
situation is, as "high" events for the memcg may in some cases be benign,
and in others be catastrophic.  The current status quo is that we fail
containment in a way that doesn't provide any advance warning that things
are about to go horribly wrong (for example, we are about to invoke the
kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to keep
memcg size contained at the memory.high setting.  It does so by applying
an exponential delay curve derived from the memcg's overage compared to
memory.high.  In the normal case where the memcg is either below or only
marginally over its memory.high setting, no throttling will be performed.

This composes well with system health monitoring and remediation, as these
allocator delays are factored into PSI's memory pressure calculations.
This both creates a mechanism for system administrators or applications
consuming the PSI interface to trivially see that the memcg in question is
struggling and use that to make more reasonable decisions, and permits
them enough time to act.  Either of these can act with significantly more
nuance than we can provide using the system OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would put
the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and also
doesn't schedule indefinitely, which previously made tracing and other
introspection difficult (i.e. it's clamped at 2*HZ per allocation through
MEMCG_MAX_HIGH_DELAY_JIFFIES).
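
A minimal user-space sketch of an overage-based penalty with a clamp.  Only
the 2*HZ clamp comes from the text above; the HZ value, the fixed-point
constants, and the function name are assumptions.  The curve here is a plain
square of the overage, picked for the sketch, since this message calls the
curve exponential without spelling out the formula:

```c
#include <stdint.h>

/* Hypothetical constants; only the 2*HZ clamp is taken from the text. */
#define SKETCH_HZ           1000ULL
#define SKETCH_MAX_DELAY    (2 * SKETCH_HZ)  /* clamp: 2*HZ per allocation */
#define SKETCH_PRECISION    20               /* fixed-point fraction bits */
#define SKETCH_SCALING      14               /* assumed damping factor */
#define SKETCH_OVERAGE_CAP  (1ULL << 24)     /* keeps the square in 64 bits */

/*
 * Return a delay in jiffies that grows faster than linearly with the
 * fraction by which 'usage' exceeds 'high' (both in pages). Assumes
 * page counts stay far below 2^44 so the shift cannot overflow.
 */
uint64_t sketch_high_delay(uint64_t usage, uint64_t high)
{
	uint64_t excess, overage, penalty;

	if (usage <= high)
		return 0;		/* at or below memory.high: no delay */

	excess = usage - high;
	if (high == 0)
		high = 1;		/* avoid dividing by zero */

	/* Overage as a fixed-point fraction of memory.high. */
	overage = (excess << SKETCH_PRECISION) / high;
	if (overage > SKETCH_OVERAGE_CAP)
		overage = SKETCH_OVERAGE_CAP;

	/* Square the overage so small excursions cost almost nothing. */
	penalty = (overage * overage * SKETCH_HZ) >>
		  (SKETCH_PRECISION + SKETCH_SCALING);

	return penalty < SKETCH_MAX_DELAY ? penalty : SKETCH_MAX_DELAY;
}
```

With these made-up constants a cgroup 1% over its limit is delayed by a few
jiffies, while one at twice its limit hits the clamp.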

Contrast the previous results with a kernel with this patch:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95  1002
    96  1000
    97  1002
    98  1003
    99  1000
    100 1043
    101 84724
    102 330628
    103 610511
    104 1016265
    105 1503969
    106 2391692
    107 2872061
    108 3248003
    109 4791904
    110 5759832
    111 6912509
    112 8127818
    113 9472203
    114 12287622
    115 12480079
    116 14144008
    117 15808029
    118 16384500
    119 16383242
    120 16384979
    [...]

As you can see, in the normal case, memory allocation takes around 1000
usec.  However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first.  Our first megabyte over
memory.high takes us about 0.08 seconds, then the next is 0.33 seconds,
then 0.61 seconds, and the one after that just over a second.  This gets
worse until we reach our eventual 2*HZ clamp per batch, resulting in 16
seconds per megabyte.
However, this is still making forward progress, so permits tracing or
further analysis with programs like GDB.

We use an exponential curve for our delay penalty for a few reasons:

1. We run mem_cgroup_handle_over_high to potentially do reclaim after
   we've already performed allocations, which means that temporarily
   going over memory.high by a small amount may be perfectly legitimate,
   even for compliant workloads. We don't want to unduly penalise such
   cases.
2. An exponential curve (as opposed to a static or linear delay) allows
   ramping up memory pressure stats more gradually, which can be useful
   to work out that you have set memory.high too low, without destroying
   application performance entirely.

This patch expands on earlier work by Johannes Weiner. Thanks!

[[email protected]: fix max() warning]
[[email protected]: fix __udivdi3 ref on 32-bit]
[[email protected]: fix it even more]
[[email protected]: fix 64-bit divide even more]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Down <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: page cache: store only head pages in i_pages

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.
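
A toy sketch of how a lookup could recover the subpage once every slot
covered by a huge page holds the head pointer.  struct toy_page and
toy_find_subpage() are hypothetical stand-ins, and the sketch assumes the
subpages are contiguous after the head and that the head's index is aligned
to the compound size:

```c
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel's definition. */
struct toy_page {
	size_t index;	/* file offset of this page, in pages */
	size_t nr;	/* pages in the compound page (head only), power of two */
};

/*
 * Every slot of a huge page stores 'head'. The subpage for 'index' is
 * found by offsetting from the head by the low bits of the index.
 */
struct toy_page *toy_find_subpage(struct toy_page *head, size_t index)
{
	return head + (index & (head->nr - 1));
}
```

For a four-page compound page starting at index 8, a lookup of index 10
fetches the head from the slot and returns head + 2.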

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819 [email protected]/

Kirill and Huang Ying contributed several fixes.

[[email protected]: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Jan Kara <[email protected]>
Reviewed-by: Kirill Shutemov <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Tested-by: William Kucharski <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Tested-by: Qian Cai <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap.c: rewrite mapping_needs_writeback in less fancy manner

This actually checks that writeback is needed or in progress.

Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/filemap.c: don't initiate writeback if mapping has no dirty pages

Functions like filemap_write_and_wait_range() should do nothing if the
inode has no dirty pages or pages currently under writeback.  But they
construct struct writeback_control anyway, and this does some atomic
operations if CONFIG_CGROUP_WRITEBACK=y - on the fast path it locks
inode->i_lock and updates the state of writeback ownership, and on the
slow path there might be more work.  Currently this path is safely avoided
only when the inode mapping has no pages.

For example generic_file_read_iter() calls filemap_write_and_wait_range()
at each O_DIRECT read - pretty hot path.

This patch skips starting new writeback if mapping has no dirty tags set.
If writeback is already in progress filemap_write_and_wait_range() will
wait for it.
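
The resulting decision can be sketched as below.  This is a toy model with
hypothetical names (struct toy_mapping, toy_write_and_wait()), not the
filemap code itself:

```c
#include <stdbool.h>

/* Toy stand-in for the tag state of a mapping; names are hypothetical. */
struct toy_mapping {
	bool has_dirty;		/* at least one page tagged dirty */
	bool has_writeback;	/* at least one page under writeback */
};

enum toy_action { TOY_NOTHING, TOY_WAIT_ONLY, TOY_WRITE_AND_WAIT };

/* Decide what a write-and-wait call has to do for this mapping. */
enum toy_action toy_write_and_wait(const struct toy_mapping *m)
{
	if (m->has_dirty)
		return TOY_WRITE_AND_WAIT;	/* start writeback, then wait */
	if (m->has_writeback)
		return TOY_WAIT_ONLY;		/* nothing to start, just wait */
	return TOY_NOTHING;			/* clean and idle: fast exit */
}
```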

Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, page_owner, debug_pagealloc: save and dump freeing stack trace

The debug_pagealloc functionality is useful to catch buggy page allocator
users that cause e.g.  use after free or double free.  When page
inconsistency is detected, debugging is often simpler by knowing the call
stack of process that last allocated and freed the page.  When page_owner
is also enabled, we record the allocation stack trace, but not freeing.

This patch therefore adds recording of freeing process stack trace to page
owner info, if both page_owner and debug_pagealloc are configured and
enabled.  With only page_owner enabled, this info is not useful for the
memory leak debugging use case.  dump_page() is adjusted to print the
info.  An example result of calling __free_pages() twice may look like
this (note the page last free stack trace):

BUG: Bad page state in process bash  pfn:13d8f8
page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1affff800000000()
raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
prep_new_page+0x143/0x150
get_page_from_freelist+0x289/0x380
__alloc_pages_nodemask+0x13c/0x2d0
khugepaged+0x6e/0xc10
kthread+0xf9/0x130
ret_from_fork+0x3a/0x50
page last free stack trace:
free_pcp_prepare+0x134/0x1e0
free_unref_page+0x18/0x90
khugepaged+0x7b/0xc10
kthread+0xf9/0x130
ret_from_fork+0x3a/0x50
Modules linked in:
CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc0
bad_page.cold+0xba/0xbf
rmqueue_pcplist.isra.0+0x6c5/0x6d0
rmqueue+0x2d/0x810
get_page_from_freelist+0x191/0x380
__alloc_pages_nodemask+0x13c/0x2d0
__get_free_pages+0xd/0x30
__pud_alloc+0x2c/0x110
copy_page_range+0x4f9/0x630
dup_mmap+0x362/0x480
dup_mm+0x68/0x110
copy_process+0x19e1/0x1b40
_do_fork+0x73/0x310
__x64_sys_clone+0x75/0x80
do_syscall_64+0x6e/0x1e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f10af854a10
...

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, page_owner: keep owner info when freeing the page

For debugging purposes it might be useful to keep the owner info even
after page has been freed, and include it in e.g.  dump_page() when
detecting a bad page state.  For that, change the PAGE_EXT_OWNER flag
meaning to "page owner info has been set at least once" and add new
PAGE_EXT_OWNER_ACTIVE for tracking whether the page is currently
considered allocated or free.  Adjust dump_page() accordingly,
distinguishing free and allocated pages.  In the page_owner debugfs file,
keep printing only allocated pages so that existing scripts are not
confused, and also because free pages are irrelevant for the memory
statistics or leak detection that's the typical use case of the file,
anyway.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, page_owner: record page owner for each subpage

Patch series "debug_pagealloc improvements through page_owner", v2.

The debug_pagealloc functionality serves a similar purpose on the page
allocator level that slub_debug does on the kmalloc level, which is to
detect bad users.  One notable feature that slub_debug has is storing
stack traces of who last allocated and freed the object.  On page level we
track allocations via page_owner, but that info is discarded when freeing,
and we don't track freeing at all.  This series improves those aspects.
With both debug_pagealloc and page_owner enabled, we can then get bug
reports such as the example in Patch 4.

SLUB debug tracking additionally stores cpu, pid and timestamp.  This could
be added later, if deemed useful enough to justify the additional page_ext
structure size.

This patch (of 3):

Currently, page owner info is only recorded for the first page of a
high-order allocation, and copied to tail pages in the event of a split
page.  With the plan to keep previous owner info after freeing the page,
it would be beneficial to record page owner for each subpage upon
allocation.  This increases the overhead for high orders, but that should
be acceptable for a debugging option.

The order stored for each subpage is the order of the whole allocation.
This makes it possible to calculate the "head" pfn and to recognize "tail"
pages (quoted because not all high-order allocations are compound pages
with true head and tail pages).  When reading the page_owner debugfs file,
keep skipping the "tail" pages so that stats gathered by existing scripts
don't get inflated.
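
The "head" pfn calculation can be sketched as follows.  The helper names are
hypothetical, and the sketch assumes the allocation is naturally aligned to
its size, as buddy allocator blocks are:

```c
/*
 * Hypothetical helper: given a pfn and the order recorded for it, return
 * the first pfn of the allocation by clearing the low 'order' bits.
 */
unsigned long toy_head_pfn(unsigned long pfn, unsigned int order)
{
	return pfn & ~((1UL << order) - 1);
}

/* A "tail" page is any page of the allocation other than the first. */
int toy_is_tail(unsigned long pfn, unsigned int order)
{
	return pfn != toy_head_pfn(pfn, order);
}
```

A reader of the debugfs file could then skip every pfn for which
toy_is_tail() is true.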

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: replace list_move_tail() with add_page_to_lru_list_tail()

This is a cleanup patch that replaces two historical uses of
list_move_tail() with relatively recent add_page_to_lru_list_tail().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce compound_nr()

Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce page_shift()

Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
function. Minor improvements in readability.

[[email protected]: fix build in tce_page_is_contained()]
Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce page_size()

Patch series "Make working with compound pages easier", v2.

These three patches add three helpers and convert the appropriate
places to use them.

This patch (of 3):

It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
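
A user-space sketch of the three helpers in this series, built on a toy
page type.  The toy_ names and the 4 KiB base page size are assumptions; the
expressions mirror the open-coded forms quoted in the three commit messages:

```c
#define TOY_PAGE_SHIFT	12			/* assumed 4 KiB base pages */
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)

/* Toy stand-in for a (possibly compound) page. */
struct toy_cpage {
	unsigned int order;	/* 0 for a base page */
};

static unsigned int toy_compound_order(const struct toy_cpage *page)
{
	return page->order;
}

/* PAGE_SIZE << compound_order(page) */
unsigned long toy_page_size(const struct toy_cpage *page)
{
	return TOY_PAGE_SIZE << toy_compound_order(page);
}

/* 1 << compound_order(page) */
unsigned long toy_compound_nr(const struct toy_cpage *page)
{
	return 1UL << toy_compound_order(page);
}

/* PAGE_SHIFT + compound_order(page) */
unsigned int toy_page_shift(const struct toy_cpage *page)
{
	return TOY_PAGE_SHIFT + toy_compound_order(page);
}
```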

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/rmap.c: remove set but not used variable 'cstart'

Fixes gcc '-Wunused-but-set-variable' warning:

mm/rmap.c: In function page_mkclean_one:
mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]

It is not used any more since
commit cdb07bdea28e ("mm/rmap.c: remove redundant variable cend")

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: YueHaibing <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_poison.c: fix a typo in a comment

s/posioned/poisoned/

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_kasan.c: add roundtrip tests

In several places we need to be able to operate on pointers which have
gone via a roundtrip:

virt -> {phys,page} -> virt

With KASAN_SW_TAGS, we can't preserve the tag for SLUB objects, and the
{phys,page} -> virt conversion will use KASAN_TAG_KERNEL.

This patch adds tests to ensure that this works as expected, without
false positives which have recently been spotted [1,2] in testing.

[1] https://lore.kernel.org/linux-arm-kernel/20190819114420 [email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/20190819132347 [email protected]/

[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: add memory corruption identification for software tag-based mode

Add memory corruption identification to the bug report for software
tag-based mode.  The report shows whether it is a "use-after-free" or
"out-of-bound" error instead of an "invalid-access" error.  This will make
it easier for programmers to see the memory corruption problem.

We extend the slab to store five old free pointer tags and free
backtraces, so we can check if the tagged address is in the slab record and
make a good guess whether the object is more like "use-after-free" or
"out-of-bound".  Therefore every slab memory corruption can be identified
as either "use-after-free" or "out-of-bound".

[[email protected]: simplify & cleanup code]
Link: https://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Walter Wu <[email protected]>
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kmemleak: increase the max mem pool to 1M

There are some machines with slow disks and fast CPUs.  When they are under
memory pressure, it could take a long time to swap before the OOM kicks in
to free up some memory.  As a result, kmemleak needs a large mem pool or
it suffers from a higher chance of a kmemleak metadata allocation
failure.  524288 proves to be a good number for all architectures here.
Increase the upper bound to 1M to leave some room for the future.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kmemleak.c: record the current memory pool size

The only way to obtain the current memory pool size for a running kernel
is to check the kernel config file which is inconvenient. Record it in
the kernel messages.

[[email protected]: s/memory pool size/memory pool/available/, per Catalin]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: use the memory pool for early allocations

Currently kmemleak uses a static early_log buffer to trace all memory
allocation/freeing before the slab allocator is initialised.  Such early
log is replayed during kmemleak_init() to properly initialise the kmemleak
metadata for objects allocated up to that point.  With a memory pool that
does not rely on the slab allocator, it is possible to skip this early log
entirely.

In order to remove the early logging, consider kmemleak_enabled == 1 by
default while the kmem_cache availability is checked directly on the
object_cache and scan_area_cache variables.  The RCU callback is only
invoked after object_cache has been initialised as we wouldn't have any
concurrent list traversal before this.

In order to reduce the number of callbacks before kmemleak is fully
initialised, move the kmemleak_init() call to mm_init().

[[email protected]: coding-style fixes]
[[email protected]: remove WARN_ON(), per Catalin]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: simple memory allocation pool for kmemleak objects

Add a memory pool for struct kmemleak_object in case the normal
kmem_cache_alloc() fails under the gfp constraints passed by the caller.
The mem_pool[] array size is currently fixed at 16000.

We are not using the existing mempool kernel API since this requires
the slab allocator to be available (for pool->elements allocation). A
subsequent kmemleak patch will replace the static early log buffer with
the pool allocation introduced here and this functionality is required
to be available before the slab was initialised.
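
A minimal user-space sketch of a static pool of this kind: a fixed array
plus a list of returned objects.  The toy_ names, the tiny pool size, and
the absence of locking are assumptions made to keep the example short:

```c
#include <stddef.h>

#define TOY_POOL_SIZE 4	/* tiny for illustration; the commit uses 16000 */

struct toy_object {
	struct toy_object *next_free;	/* links freed objects together */
	int payload;
};

static struct toy_object toy_pool[TOY_POOL_SIZE];
static size_t toy_pool_unused = TOY_POOL_SIZE;	/* never-used slots left */
static struct toy_object *toy_free_list;	/* objects handed back */

/* Take an object from the pool, or return NULL when it is exhausted. */
struct toy_object *toy_pool_alloc(void)
{
	struct toy_object *obj = toy_free_list;

	if (obj) {
		toy_free_list = obj->next_free;
		return obj;
	}
	if (toy_pool_unused > 0)
		return &toy_pool[--toy_pool_unused];
	return NULL;
}

/* Hand an object back so a later allocation can reuse it. */
void toy_pool_free(struct toy_object *obj)
{
	obj->next_free = toy_free_list;
	toy_free_list = obj;
}
```

Because the array is static, nothing here depends on another allocator
being up, which is the property the paragraph above asks for.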

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: make the tool tolerant to struct scan_area allocation failures

Patch series "mm: kmemleak: Use a memory pool for kmemleak object
allocations", v3.

Following the discussions on v2 of this patch(set) [1], this series takes
a slightly different approach:

- it implements its own simple memory pool that does not rely on the
  slab allocator

- drops the early log buffer logic entirely since it can now allocate
  metadata from the memory pool directly before kmemleak is fully
  initialised

- CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option is renamed to
  CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE

- moves the kmemleak_init() call earlier (mm_init())

- to avoid a separate memory pool for struct scan_area, it makes the
  tool robust when such allocations fail as scan areas are rather an
  optimisation

[1] http://lkml.kernel.org/r/20190727132334 [email protected]

This patch (of 3):

Object scan areas are an optimisation aimed to decrease the false
positives and slightly improve the scanning time of large objects known to
only have a few specific pointers.  If a struct scan_area fails to
allocate, kmemleak can still function normally by scanning the full
object.

Introduce an OBJECT_FULL_SCAN flag and mark objects as such when scan_area
allocation fails.
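
A toy sketch of the fallback.  The toy_ types and the flag's bit value are
hypothetical, and treating an object with no scan areas as fully scanned is
an assumption of the sketch:

```c
#include <stddef.h>

#define TOY_OBJECT_FULL_SCAN (1U << 0)	/* hypothetical bit value */

struct toy_scan_area {
	size_t start, size;
	struct toy_scan_area *next;
};

struct toy_kobject {
	unsigned int flags;
	struct toy_scan_area *areas;
};

/* 'area' is the result of an allocation attempt; NULL means it failed. */
void toy_add_scan_area(struct toy_kobject *obj, struct toy_scan_area *area,
		       size_t start, size_t size)
{
	if (!area) {
		/* No metadata for the area: scan the whole object instead. */
		obj->flags |= TOY_OBJECT_FULL_SCAN;
		return;
	}
	area->start = start;
	area->size = size;
	area->next = obj->areas;
	obj->areas = area;
}

/* Non-zero when the scanner should ignore the area list. */
int toy_scan_whole_object(const struct toy_kobject *obj)
{
	return (obj->flags & TOY_OBJECT_FULL_SCAN) || !obj->areas;
}
```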

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kmemleak: increase DEBUG_KMEMLEAK_EARLY_LOG_SIZE default to 16K

The current default value (400) is too low on many systems (e.g. some
ARM64 platforms take up 1000+ entries).

syzbot uses 16000 as its default value, which has proved to be enough on
beefy configurations, so let's pick that value.

This consumes more RAM on boot (each entry is 160 bytes, so in total
~2.5MB of RAM), but the memory would later be freed (early_log is
__initdata).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nicolas Boichat <[email protected]>
Suggested-by: Dmitry Vyukov <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Joe Lawrence <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/slub.c: fix -Wunused-function compiler warnings

tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default, Clang
will complain about those unused functions.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slab: move memcg_cache_params structure to mm/slab.h

The memcg_cache_params structure is only embedded into the kmem_cache of
slab and slub allocators as defined in slab_def.h and slub_def.h and used
internally by mm code. There is no need to expose it in a public
header. So move it from include/linux/slab.h to mm/slab.h. It is just a
refactoring patch with no code change.

In fact both the slub_def.h and slab_def.h should be moved into the mm
directory as well, but that will probably cause many merge conflicts.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Waiman Long <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, slab: extend slab/shrink to shrink all memcg caches

Currently, a value of "1" is written to /sys/kernel/slab/<slab>/shrink
file to shrink the slab by flushing out all the per-cpu slabs and free
slabs in partial lists.  This can be useful to squeeze out a bit more
memory under extreme conditions as well as making the active object counts
in /proc/slabinfo more accurate.

This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
option is usually not enabled and "slub_memcg_sysfs=1" not set.  Even if
memcg sysfs is turned on, it is too cumbersome and impractical to manage
all those per-memcg sysfs files in a real production system.

So there is no practical way to shrink memcg caches.  Fix this by enabling
a proper write to the shrink sysfs file of the root cache to scan all the
available memcg caches and shrink them as well.  For a non-root memcg
cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
cache will be shrunk when written.

On a 2-socket 64-core 256-thread arm64 system with 64k pages after
a parallel kernel build, the amount of memory occupied by slabs
before shrinking slabs was:

# grep task_struct /proc/slabinfo
task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
# grep "^S[lRU]" /proc/meminfo
Slab:            3936832 kB
SReclaimable:     399104 kB
SUnreclaim:      3537728 kB

After shrinking slabs (by echoing "1" to all shrink files):

# grep "^S[lRU]" /proc/meminfo
Slab:            1356288 kB
SReclaimable:     263296 kB
SUnreclaim:      1092992 kB
# grep task_struct /proc/slabinfo
task_struct         2764   6832   4288   61    4 : tunables    0    0    0 : slabdata    112    112      0

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: fix spelling mistake "ambigous" -> "ambiguous"

There is a spelling mistake in a mlog_bug_on_msg message. Fix it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: checkpoint appending truncate log transaction before flushing

Appending to the truncate log (TA) and flushing the truncate log (TF) are
two separate transactions. Both can be committed but not yet checkpointed.
If a crash occurs then, both transactions will be replayed with several
clusters already released to the global bitmap. The truncate log will then
be replayed, resulting in a cluster double free.

To reproduce this issue, just crash the host while punching holes in files.

Signed-off-by: Changwei Ge <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: wait for recovering done after direct unlock request

There is a scenario causing ocfs2 umount hang when multiple hosts are
rebooting at the same time.

NODE1                           NODE2               NODE3
send unlock request to NODE2
                                dies
                                                    become recovery master
                                                    recover NODE2
find NODE2 dead
mark resource RECOVERING
directly remove lock from grant list
calculate usage but RECOVERING marked
**miss the window of purging
clear RECOVERING

To reproduce this issue, crash a host and then umount ocfs2
from another node.

To solve this, just let the unlock process wait for recovery to finish.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Changwei Ge <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: delete unnecessary checks before brelse()

brelse() tests whether its argument is NULL and then returns immediately.
Thus the tests around the shown calls are not needed.
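
As a rough userspace illustration of why such guards are redundant, the
sketch below uses a hypothetical demo_buf type and demo_release() helper
(neither is kernel code) whose release function already returns early on
NULL, which is the contract described for brelse() above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head; not the kernel's struct. */
struct demo_buf {
    int refcount;
};

/* NULL-safe release, modelling the contract described for brelse(). */
static void demo_release(struct demo_buf *b)
{
    if (!b)
        return;         /* nothing to drop */
    b->refcount--;
}

/* Before: the caller guards the call itself. */
static void cleanup_with_check(struct demo_buf *b)
{
    if (b)
        demo_release(b);
}

/* After: the guard is dropped; the behaviour is the same. */
static void cleanup_without_check(struct demo_buf *b)
{
    demo_release(b);
}
```

Both cleanup variants behave identically for NULL and non-NULL
arguments, so the outer test only adds noise.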

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Markus Elfring <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ocfs2/dir.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/dir.c: In function ocfs2_dx_dir_transfer_leaf:
fs/ocfs2/dir.c:3653:42: warning: variable new_list set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Signed-off-by: Joseph Qi <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Changwei Ge <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ocfs2/file.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/file.c: In function ocfs2_prepare_inode_for_write:
fs/ocfs2/file.c:2143:9: warning: variable end set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Signed-off-by: Joseph Qi <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Changwei Ge <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ocfs2/namei.c: remove set but not used variables

Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/namei.c: In function ocfs2_create_inode_in_orphan:
fs/ocfs2/namei.c:2503:23: warning: variable di set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhengbin <[email protected]>
Signed-off-by: Joseph Qi <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Changwei Ge <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: remove unused ocfs2_orphan_scan_exit() declaration

ocfs2_orphan_scan_exit() is declared but not implemented. Also perform a
minor cleanup in ocfs2_link_credits().

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4014FC208AC@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: remove unused ocfs2_calc_tree_trunc_credits()

ocfs2_calc_tree_trunc_credits() is not called anywhere.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4014FC2050F@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: further debugfs cleanups

There is no need to check return value of debugfs_create functions, but
the last sweep through ocfs missed a number of places where this was
happening. There is also no need to save the individual dentries for the
debugfs files, as everything can just be removed at once when the
directory is removed.

By getting rid of the file dentries for the debugfs entries, a bit of
local memory can be saved as well.

[[email protected]: ensure ret is set to zero before returning]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Jia Guo <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

jbd2: remove jbd2_journal_inode_add_[write|wait]

Since ext4/ocfs2 are using jbd2_inode dirty range scoping APIs now,
jbd2_journal_inode_add_[write|wait] are not used any more, remove them.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joseph Qi <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: use jbd2_inode dirty range scoping

Commit 6ba0e7dc64a5 ("jbd2: introduce jbd2_inode dirty range scoping")
allows us to scope each of the inode dirty ranges associated with a given
transaction, and ext4 already does so.

Now let's also use the newly introduced jbd2_inode dirty range scoping to
prevent us from waiting forever when trying to complete a journal
transaction in ocfs2.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joseph Qi <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Changwei Ge <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kbuild: clean compressed initramfs image

Since 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
"make clean" leaves behind compressed initramfs images.  Example:

  $ make defconfig
  $ sed -i 's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config
  $ make olddefconfig
  $ make -s
  $ make -s clean
  $ git clean -ndxf | grep initramfs
  Would remove usr/initramfs_data.cpio.gz

clean rules do not have CONFIG_* context so they do not know which
compression format was used.  Thus they don't know which files to delete.

Tell clean to delete all possible compression formats.

Once patched, usr/initramfs_data.cpio.gz and friends are deleted by
"make clean".

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Greg Thelen <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

z3fold: fix retry mechanism in page reclaim

z3fold_page_reclaim()'s retry mechanism is broken: on a second iteration
it will have zhdr from the first one so that zhdr is no longer in line
with struct page. That leads to crashes when the system is stressed.

Fix that by moving zhdr assignment up.

While at it, protect against using already freed handles by using a
local slots structure in z3fold_page_reclaim().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Reported-by: Markus Linnala <[email protected]>
Reported-by: Chris Murphy <[email protected]>
Reported-by: Agustin Dall'Alba <[email protected]>
Cc: "Maciej S. Szmigiero" <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Henry Burns <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: add dummy can_do_mlock() helper

On kernels without CONFIG_MMU, we get a link error for the siw driver:

drivers/infiniband/sw/siw/siw_mem.o: In function `siw_umem_get':
siw_mem.c:(.text+0x4c8): undefined reference to `can_do_mlock'

This is probably not the only driver that needs the function and could
otherwise build correctly without CONFIG_MMU, so add a dummy variant that
always returns false.
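
A minimal sketch of the stub pattern, written as standalone C rather
than kernel code. DEMO_HAS_MMU, demo_can_do_mlock() and
demo_pin_buffer() are hypothetical names chosen for illustration;
DEMO_HAS_MMU stands in for CONFIG_MMU and is left undefined here to
model a nommu build:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * DEMO_HAS_MMU is a hypothetical stand-in for CONFIG_MMU; leave it
 * undefined to model a build without an MMU.
 */
#ifdef DEMO_HAS_MMU
bool demo_can_do_mlock(void);   /* real implementation lives elsewhere */
#else
/* Dummy variant: always reports that locking memory is not allowed. */
static inline bool demo_can_do_mlock(void)
{
    return false;
}
#endif

/* A caller, such as a driver, that now links in either configuration. */
static int demo_pin_buffer(void)
{
    if (!demo_can_do_mlock())
        return -1;      /* a real driver would return an errno value */
    return 0;
}
```

With the stub in place the caller needs no #ifdef of its own; it simply
takes the failure path when no MMU is configured.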

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 2251334dcac9 ("rdma/siw: application buffer management")
Signed-off-by: Arnd Bergmann <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Bernard Metzler <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Revert "mm/z3fold.c: fix race between migration and destruction"

With the original commit applied, z3fold_zpool_destroy() may get blocked
on wait_event() for an indefinite time. Revert this commit for the time
being to get rid of this problem since the issue the original commit
addresses is less severe.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: d776aaa9895eb6eb77 ("mm/z3fold.c: fix race between migration and destruction")
Reported-by: Agustín Dall'Alba <[email protected]>
Signed-off-by: Vitaly Wool <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Jonathan Adams <[email protected]>
Cc: Henry Burns <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fat: work around race with userspace's read via blockdev while mounting

If userspace reads the buffer via blockdev while mounting,
sb_getblk()+modify can race with buffer read via blockdev.

For example,

            FS                               userspace
    bh = sb_getblk()
    modify bh->b_data
                                  read
    ll_rw_block(bh)
      fill bh->b_data by on-disk data
      /* lost modified data by FS */
      set_buffer_uptodate(bh)
    set_buffer_uptodate(bh)

Userspace should not use the blockdev while mounting, though udev seems
to be doing this already.  Although I think udev should try to avoid
this, work around the race at the cost of a small overhead.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: OGAWA Hirofumi <[email protected]>
Reported-by: Jan Stancek <[email protected]>
Tested-by: Jan Stancek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'microblaze-v5.4-rc1' of git://git.monstr.eu/linux-2.6-microblaze

Pull Microblaze updates from Michal Simek:

- clean up reset gpio handler

- defconfig updates

- add support for 8 byte get_user()

- switch to generic dma code

* tag 'microblaze-v5.4-rc1' of git://git.monstr.eu/linux-2.6-microblaze:
  microblaze: Switch to standard restart handler
  microblaze: defconfig synchronization
  microblaze: Enable Xilinx AXI emac driver by default
  arch/microblaze: support get_user() of size 8 bytes
  microblaze: remove ioremap_fullcache
  microblaze: use the generic dma coherent remap allocator
  microblaze/nommu: use the generic uncached segment support

Merge tag 'platform-drivers-x86-v5.4-2' of git://git.infradead.org/linux-platform-drivers-x86

Pull x86 platform-drivers fixes from Andy Shevchenko:

- Fix compilation error of ASUS WMI driver when CONFIG_ACPI_BATTERY=n

- Fix I²C multi-instantiate driver to work with several USB PD devices

- Fix boot issue on Siemens SIMATIC IPC277E when PMC critical clock is
   being disabled

- Plenty of fixes to Intel Speed-Select Technology tools

* tag 'platform-drivers-x86-v5.4-2' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/x86: i2c-multi-instantiate: Derive the device name from parent
  platform/x86: pmc_atom: Add Siemens SIMATIC IPC277E to critclk_systems DMI table
  tools/power/x86/intel-speed-select: Fix perf-profile command output
  tools/power/x86/intel-speed-select: Extend core-power command set
  tools/power/x86/intel-speed-select: Fix some debug prints
  tools/power/x86/intel-speed-select: Format get-assoc information
  tools/power/x86/intel-speed-select: Allow online/offline based on tdp
  tools/power/x86/intel-speed-select: Fix high priority core mask over count
  platform/x86: asus-wmi: Make it depend on ACPI battery API

Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V updates from Sasha Levin:

- first round of vmbus hibernation support (Dexuan Cui)

- remove dependencies on PAGE_SIZE (Maya Nakamura)

- move the hyper-v tools/ code into the tools build system (Andy
   Shevchenko)

- hyper-v balloon cleanups (Dexuan Cui)

* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: vmbus: Resume after fixing up old primary channels
  Drivers: hv: vmbus: Suspend after cleaning up hv_sock and sub channels
  Drivers: hv: vmbus: Clean up hv_sock channels by force upon suspend
  Drivers: hv: vmbus: Suspend/resume the vmbus itself for hibernation
  Drivers: hv: vmbus: Ignore the offers when resuming from hibernation
  Drivers: hv: vmbus: Implement suspend/resume for VSC drivers for hibernation
  Drivers: hv: vmbus: Add a helper function is_sub_channel()
  Drivers: hv: vmbus: Suspend/resume the synic for hibernation
  Drivers: hv: vmbus: Break out synic enable and disable operations
  HID: hv: Remove dependencies on PAGE_SIZE for ring buffer
  Tools: hv: move to tools buildsystem
  hv_balloon: Reorganize the probe function
  hv_balloon: Use a static page for the balloon_up send buffer

Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull more mount API conversions from Al Viro:
"Assorted conversions of options parsing to new API.

  gfs2 is probably the most serious one here; the rest is trivial stuff.

  Other things in what used to be #work.mount are going to wait for the
  next cycle (and preferably go via git trees of the filesystems
  involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API

ia64: Fix some warnings introduced in merge window

Fix

  arch/ia64/kernel/irq_ia64.c:586:1: warning: no return statement in function returning non-void [-Wreturn-type]
  arch/ia64/mm/contig.c:111:6: warning: unused variable 'rc' [-Wunused-variable]
  arch/ia64/mm/discontig.c:189:39: warning: unused variable 'rc' [-Wunused-variable]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fuse: Make fuse_args_to_req static

Fix sparse warning:

fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static?

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Fixes: 68583165f962 ("fuse: add pages to fuse_args")
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: fix memleak in cuse_channel_open

If cuse_send_init fails, need to fuse_conn_put cc->fc.

cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1)
->fuse_dev_alloc->fuse_conn_get
->fuse_dev_free->fuse_conn_put
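
The reference counting in the chain above can be modelled with a small
standalone sketch. The demo_* names and the send_ok flag are
hypothetical and only mimic the get/put sequence; this is not the cuse
code itself:

```c
#include <assert.h>

/* Hypothetical connection object with a plain integer refcount. */
struct demo_conn {
    int count;
    int freed;
};

static void demo_conn_init(struct demo_conn *c)
{
    c->count = 1;       /* like refcount_set(&fc->count, 1) */
    c->freed = 0;
}

static void demo_conn_get(struct demo_conn *c)
{
    c->count++;
}

static void demo_conn_put(struct demo_conn *c)
{
    if (--c->count == 0)
        c->freed = 1;   /* last reference gone: object is released */
}

/* send_ok models whether the init request could be sent. */
static int demo_channel_open(struct demo_conn *c, int send_ok)
{
    demo_conn_init(c);          /* count == 1 */
    demo_conn_get(c);           /* device takes a reference: count == 2 */
    if (!send_ok) {
        demo_conn_put(c);       /* device freed: count == 1 */
        demo_conn_put(c);       /* the extra put: count == 0 */
        return -1;
    }
    return 0;
}
```

Without the second put on the failure path the count would stay at 1
and the connection would never be released.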

Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: zhengbin <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: fix beyond-end-of-page access in fuse_parse_cache()

With DEBUG_PAGEALLOC on, the following triggers.

  BUG: unable to handle page fault for address: ffff88859367c000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
  CPU: 38 PID: 3110657 Comm: python2.7
  RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse]
  Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 <8b> 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89
  RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286
  RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907
  RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004
  R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0
  R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020
  FS:  00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   iterate_dir+0x122/0x180
   __x64_sys_getdents+0xa6/0x140
   do_syscall_64+0x42/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's in fuse_parse_cache().  %rbx (ffff88859367bff0) is fuse_dirent
pointer - addr + offset.  FUSE_DIRENT_SIZE() is trying to dereference
namelen off of it but that derefs into the next page which is disabled
by pagealloc debug causing a PF.

This is caused by dirent->namelen being accessed before ensuring that
there are enough bytes in the page for the dirent.  Fix it by pushing
down reclen calculation.
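
The ordering problem can be illustrated with a simplified userspace
parser. struct demo_dirent, DEMO_HDR_SIZE and demo_count_records() are
hypothetical and do not match the FUSE structures (for instance, no
record alignment is applied); the point is only that the header bounds
check comes before any read of namelen:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: fixed header followed by namelen name bytes. */
struct demo_dirent {
    uint64_t ino;
    uint32_t namelen;
    uint32_t type;
};

#define DEMO_HDR_SIZE sizeof(struct demo_dirent)

/* Count complete records in buf[0..size) without reading past the end. */
static size_t demo_count_records(const unsigned char *buf, size_t size)
{
    size_t offset = 0, count = 0;

    while (offset < size) {
        size_t nbytes = size - offset;
        struct demo_dirent d;
        size_t reclen;

        /* Make sure the header fits before touching namelen. */
        if (nbytes < DEMO_HDR_SIZE)
            break;
        memcpy(&d, buf + offset, DEMO_HDR_SIZE);
        if (d.namelen == 0)
            break;

        /* Only now is it safe to compute the record length. */
        reclen = DEMO_HDR_SIZE + d.namelen;
        if (reclen > nbytes)
            break;

        count++;
        offset += reclen;
    }
    return count;
}
```

Computing reclen at the top of the loop, before the size check, would
read namelen from bytes that may lie beyond the buffer, which is the
kind of access the trace above shows.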

Signed-off-by: Tejun Heo <[email protected]>
Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache")
Cc: [email protected] # v4.20+
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: unexport fuse_put_request

This function has been made static, which now causes a compile-time
warning:

WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL

Remove the unneeded export.

Fixes: 66abc3599c3c ("fuse: unexport request ops")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: kmemcg account fs data

account per-file, dentry, and inode data

blockdev/superblock and temporary per-request data was left alone, as
this usually isn't accounted

Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Khazhismel Kumykov <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: on 64-bit store time in d_fsdata directly

Implements the optimization noted in commit f75fdf22b0a8 ("fuse: don't
use ->d_time"), as the additional memory can be significant. (In
particular, on SLAB configurations this 8-byte alloc becomes 32 bytes).
Per-dentry, this can consume significant memory.
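
A minimal userspace sketch of the idea, assuming a hypothetical
demo_dentry with a pointer-sized d_fsdata field. The demo_* helpers
are illustrative only; the 64-bit branch stores the value in the
pointer itself, while the fallback branch allocates storage for it:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical dentry with one pointer-sized private field. */
struct demo_dentry {
    void *d_fsdata;
};

#if UINTPTR_MAX >= UINT64_MAX
/* 64-bit: the pointer-sized field holds the 64-bit value directly. */
static int demo_set_time(struct demo_dentry *e, uint64_t t)
{
    e->d_fsdata = (void *)(uintptr_t)t;
    return 0;
}

static uint64_t demo_get_time(const struct demo_dentry *e)
{
    return (uint64_t)(uintptr_t)e->d_fsdata;
}

static void demo_clear_time(struct demo_dentry *e)
{
    e->d_fsdata = NULL;         /* nothing was allocated */
}
#else
/* 32-bit: the value does not fit, so keep it in a separate allocation. */
static int demo_set_time(struct demo_dentry *e, uint64_t t)
{
    if (!e->d_fsdata) {
        e->d_fsdata = malloc(sizeof(uint64_t));
        if (!e->d_fsdata)
            return -1;
    }
    *(uint64_t *)e->d_fsdata = t;
    return 0;
}

static uint64_t demo_get_time(const struct demo_dentry *e)
{
    return *(const uint64_t *)e->d_fsdata;
}

static void demo_clear_time(struct demo_dentry *e)
{
    free(e->d_fsdata);
    e->d_fsdata = NULL;
}
#endif
```

On 64-bit builds this avoids one small allocation per dentry, which is
where the memory saving described above comes from.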

Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Khazhismel Kumykov <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>

fuse: fix missing unlock_page in fuse_writepage()

unlock_page() was missing in case of an already in-flight write against the
same page.

Signed-off-by: Vasily Averin <[email protected]>
Fixes: ff17be086477 ("fuse: writepage: skip already in flight")
Cc: <[email protected]> # v3.13
Signed-off-by: Miklos Szeredi <[email protected]>

ALSA: usb-audio: Add DSD support for EVGA NU Audio

EVGA NU Audio is actually a USB audio device on a PCI Express card,
with its own USB controller. It supports both PCM and DSD.

Signed-off-by: Jussi Laako <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

Merge tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"New Drivers:
   - Add support for Merrifield Basin Cove PMIC

  New Device Support:
   - Add support for Intel Tiger Lake to Intel LPSS PCI
   - Add support for Intel Sky Lake to Intel LPSS PCI
   - Add support for ST-Ericsson DB8520 to DB8500 PRCMU

  New Functionality:
   - Add RTC and PWRC support to MT6323

  Fix-ups:
   - Clean-up include files; davinci_voicecodec, asic3, sm501, mt6397
   - Ignore return values from debugfs_create*(); ab3100-*, ab8500-debugfs, aat2870-core
   - Device Tree changes; rn5t618, mt6397
   - Use new I2C API; tps80031, 88pm860x-core, ab3100-core, bcm590xx,
                      da9150-core, max14577, max77693, max77843, max8907,
                      max8925-i2c, max8997, max8998, palmas, twl-core
   - Remove obsolete code; da9063, jz4740-adc
   - Simplify semantics; timberdale, htc-i2cpld
   - Add 'fall-through' tags; omap-usb-host, db8500-prcmu
   - Remove superfluous prints; ab8500-debugfs, db8500-prcmu, fsl-imx25-tsadc,
                                intel_soc_pmic_bxtwc, qcom_rpm, sm501
   - Trivial rename/whitespace/typo fixes; mt6397-core, MAINTAINERS
   - Reorganise code structure; mt6397-*
   - Improve code consistency; intel-lpss
   - Use MODULE_SOFTDEP() helper; intel-lpss
   - Use DEFINE_RES_*() helpers; mt6397-core

  Bug Fixes:
   - Clean-up resources; max77620
   - Prevent input events being dropped on resume; intel-lpss-pci
   - Prevent sleeping in IRQ context; ezx-pcap"

* tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (48 commits)
  mfd: mt6323: Add MT6323 RTC and PWRC
  mfd: mt6323: Replace boilerplate resource code with DEFINE_RES_* macros
  mfd: mt6397: Add mutex include
  dt-bindings: mfd: mediatek: Add MT6323 Power Controller
  dt-bindings: mfd: mediatek: Update RTC to include MT6323
  dt-bindings: mfd: mediatek: mt6397: Change to relative paths
  mfd: db8500-prcmu: Support the higher DB8520 ARMSS
  mfd: intel-lpss: Use MODULE_SOFTDEP() instead of implicit request
  mfd: htc-i2cpld: Drop check because i2c_unregister_device() is NULL safe
  mfd: sm501: Include the GPIO driver header
  mfd: intel-lpss: Add Intel Skylake ACPI IDs
  mfd: intel-lpss: Consistently use GENMASK()
  mfd: Add support for Merrifield Basin Cove PMIC
  mfd: ezx-pcap: Replace mutex_lock with spin_lock
  mfd: asic3: Include the right header
  MAINTAINERS: altera-sysmgr: Fix typo in a filepath
  mfd: mt6397: Extract IRQ related code from core driver
  mfd: mt6397: Rename macros to something more readable
  mfd: Remove dev_err() usage after platform_get_irq()
  mfd: db8500-prcmu: Mark expected switch fall-throughs
  ...

Merge tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
"Core Frameworks
   - Obtain scale type through sysfs

  New Functionality:
   - Provide Device Tree functionality in rave-sp-backlight
   - Calculate if scale type is (non-)linear in pwm_bl

  Fix-ups:
   - Simplify code in lm3630a_bl
   - Trivial rename/whitespace/typo fixes in lms283gf05
   - Remove superfluous NULL check in tosa_lcd
   - Fix power state initialisation in gpio_backlight
   - List supported file in MAINTAINERS

  Bug Fixes:
   - Kconfig - default to not building unless requested in
     {LED,BACKLIGHT}_CLASS_DEVICE"

* tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: pwm_bl: Set scale type for brightness curves specified in the DT
  backlight: pwm_bl: Set scale type for CIE 1931 curves
  backlight: Expose brightness curve type through sysfs
  MAINTAINERS: Add entry for stable backlight sysfs ABI documentation
  backlight: gpio-backlight: Correct initial power state handling
  video: backlight: tosa_lcd: drop check because i2c_unregister_device() is NULL safe
  video: backlight: Drop default m for {LCD,BACKLIGHT_CLASS_DEVICE}
  backlight: lms283gf05: Fix a typo in the description passed to 'devm_gpio_request_one()'
  backlight: lm3630a: Switch to use fwnode_property_count_uXX()
  backlight: rave-sp: Leave initial state and register with correct device

Merge tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
"Enumeration:

   - Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it
     (Krzysztof Wilczynski)

   - Fix incorrect PCIe device types and remove dev->has_secondary_link
     to simplify code that deals with upstream/downstream ports (Mika
     Westerberg)

   - After suspend, restore Resizable BAR size bits correctly for 1MB
     BARs (Sumit Saxena)

   - Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra)

  Virtualization:

   - Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna
     Labs (Ali Saidi)

   - Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg)

   - Remove group write permissions from sysfs sriov_numvfs,
     sriov_drivers_autoprobe (Kelsey Skunberg)

  Hotplug:

   - Simplify pciehp indicator control (Denis Efremov)

  Peer-to-peer DMA:

   - Allow P2P DMA between root ports for whitelisted bridges (Logan
     Gunthorpe)

   - Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe)

   - DMA map P2P DMA requests that traverse host bridge (Logan
     Gunthorpe)

  Amazon Annapurna Labs host bridge driver:

   - Add DT binding and controller driver (Jonathan Chocron)

  Hyper-V host bridge driver:

   - Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui)

   - Fix PCI domain number collisions (Haiyang Zhang)

   - Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang)

   - Fix build errors on non-SYSFS config (Randy Dunlap)

  i.MX6 host bridge driver:

   - Limit DBI register length (Stefan Agner)

  Intel VMD host bridge driver:

   - Fix config addressing issues (Jon Derrick)

  Layerscape host bridge driver:

   - Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao)

   - Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately
     (Xiaowei Bao)

  Mediatek host bridge driver:

   - Add MT7629 controller support (Jianjun Wang)

  Mobiveil host bridge driver:

   - Fix CPU base address setup (Hou Zhiqiang)

   - Make "num-lanes" property optional (Hou Zhiqiang)

  Tegra host bridge driver:

   - Fix OF node reference leak (Nishka Dasgupta)

   - Disable MSI for root ports to work around design problem (Vidya
     Sagar)

   - Add Tegra194 DT binding and controller support (Vidya Sagar)

   - Add support for sideband pins and slot regulators (Vidya Sagar)

   - Add PIPE2UPHY support (Vidya Sagar)

  Misc:

   - Remove unused pci_block_cfg_access() et al (Kelsey Skunberg)

   - Unexport pci_bus_get(), etc (Kelsey Skunberg)

   - Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in
     the PCI core (Kelsey Skunberg)

   - Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg)

   - Mark expected switch fall-through (Gustavo A. R. Silva)

   - Propagate errors for optional regulators and PHYs (Thierry Reding)

   - Fix kernel command line resource_alignment parameter issues (Logan
     Gunthorpe)"

* tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits)
  PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI
  arm64: tegra: Add PCIe slot supply information in p2972-0000 platform
  arm64: tegra: Add configuration for PCIe C5 sideband signals
  PCI: tegra: Add support to enable slot regulators
  PCI: tegra: Add support to configure sideband pins
  PCI: vmd: Fix shadow offsets to reflect spec changes
  PCI: vmd: Fix config addressing when using bus offsets
  PCI: dwc: Add validation that PCIe core is set to correct mode
  PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver
  dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding
  PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
  PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port
  PCI: Add ACS quirk for Amazon Annapurna Labs root ports
  PCI: Add Amazon's Annapurna Labs vendor ID
  MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer
  PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
  dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries
  dt-bindings: PCI: tegra: Add sideband pins configuration entries
  PCI: tegra: Add Tegra194 PCIe support
  PCI: Get rid of dev->has_secondary_link flag
  ...

Merge tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next

Pull smack updates from Casey Schaufler:
"Four patches for v5.4. Nothing is major.

  All but one are in response to mechanically detected potential issues.
  The remaining patch cleans up kernel-doc notations"

* tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next:
  smack: use GFP_NOFS while holding inode_smack::smk_lock
  security: smack: Fix possible null-pointer dereferences in smack_socket_sock_rcv_skb()
  smack: fix some kernel-doc notations
  Smack: Don't ignore other bprm->unsafe flags if LSM_UNSAFE_PTRACE is set

Merge branch 'pci/trivial'

  - Fix typos and whitespace errors (Bjorn Helgaas, Krzysztof Wilczynski)

  - Remove unnecessary "return" statements (Krzysztof Wilczynski)

  - Correct of_irq_parse_pci() function documentation (Lubomir Rintel)

* pci/trivial:
  PCI: Remove unnecessary returns
  PCI: OF: Correct of_irq_parse_pci() documentation
  PCI: Fix typos and whitespace errors

Merge branch 'remotes/lorenzo/pci/vmd'

  - Fix VMD config addressing to ignore starting bus offset (Jon Derrick)

  - Fix VMD shadow offset scratchpad address (Jon Derrick)

* remotes/lorenzo/pci/vmd:
  PCI: vmd: Fix shadow offsets to reflect spec changes
  PCI: vmd: Fix config addressing when using bus offsets

Merge branch 'lorenzo/pci/tegra'

  - Fix Tegra OF node reference leak (Nishka Dasgupta)

  - Add #defines for PCIe Data Link Feature and Physical Layer 16.0 GT/s
    features (Vidya Sagar)

  - Disable MSI for Tegra Root Ports since they don't support using MSI for
    all Root Port events (Vidya Sagar)

  - Group DesignWare write-protected register writes together (Vidya Sagar)

  - Move DesignWare capability search interfaces so they can be used by
    both host and endpoint drivers (Vidya Sagar)

  - Add DesignWare extended capability search interfaces (Vidya Sagar)

  - Export dw_pcie_wait_for_link() so drivers can be modules (Vidya Sagar)

  - Add "snps,enable-cdm-check" DT binding for Configuration Dependent
    Module (CDM) register checking (Vidya Sagar)

  - Add DesignWare support for "snps,enable-cdm-check" CDM checking (Vidya
    Sagar)

  - Add "supports-clkreq" DT binding for host drivers to decide whether to
    advertise low power features (Vidya Sagar)

  - Add DT binding for Tegra194 (Vidya Sagar)

  - Add DT binding for Tegra194 P2U (PIPE to UPHY) block (Vidya Sagar)

  - Add support for Tegra194 P2U (PIPE to UPHY) (Vidya Sagar)

  - Add support for Tegra194 host controller (Vidya Sagar)

  - Add Tegra support for sideband PERST# and CLKREQ# for C5 (Vidya Sagar)

  - Add Tegra support for slot regulators for p2972-0000 platform (Vidya
    Sagar)

* lorenzo/pci/tegra:
  arm64: tegra: Add PCIe slot supply information in p2972-0000 platform
  arm64: tegra: Add configuration for PCIe C5 sideband signals
  PCI: tegra: Add support to enable slot regulators
  PCI: tegra: Add support to configure sideband pins
  dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries
  dt-bindings: PCI: tegra: Add sideband pins configuration entries
  PCI: tegra: Add Tegra194 PCIe support
  phy: tegra: Add PCIe PIPE2UPHY support
  dt-bindings: PHY: P2U: Add Tegra194 P2U block
  dt-bindings: PCI: tegra: Add device tree support for Tegra194
  dt-bindings: Add PCIe supports-clkreq property
  PCI: dwc: Add support to enable CDM register check
  dt-bindings: PCI: designware: Add binding for CDM register check
  PCI: dwc: Export dw_pcie_wait_for_link() API
  PCI: dwc: Add extended configuration space capability search API
  PCI: dwc: Move config space capability search API
  PCI: dwc: Group DBI registers writes requiring unlocking
  PCI: Disable MSI for Tegra root ports
  PCI: Add #defines for some of PCIe spec r4.0 features
  PCI: tegra: Fix OF node reference leak

Merge branch 'remotes/lorenzo/pci/mobiveil'

  - Fix mobiveil inbound window CPU base address setup (Hou Zhiqiang)

* remotes/lorenzo/pci/mobiveil:
  PCI: mobiveil: Fix the CPU base address setup in inbound window

Merge branch 'remotes/lorenzo/pci/misc'

  - Propagate regulator_get_optional() errors so callers can distinguish
    real errors from optional regulators that are absent (Thierry Reding)

  - Propagate devm_of_phy_get() errors so callers can distinguish
    real errors from optional PHYs that are absent (Thierry Reding)

  - Add Andrew Murray as PCI native driver reviewer (Lorenzo Pieralisi)

* remotes/lorenzo/pci/misc:
  MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer
  PCI: iproc: Propagate errors for optional PHYs
  PCI: histb: Propagate errors for optional regulators
  PCI: armada8x: Propagate errors for optional PHYs
  PCI: imx6: Propagate errors for optional regulators
  PCI: exynos: Propagate errors for optional PHYs
  PCI: rockchip: Propagate errors for optional regulators

Merge branch 'remotes/lorenzo/pci/mediatek'

  - Add mediatek support for MT7629 (Jianjun Wang)

* remotes/lorenzo/pci/mediatek:
  PCI: mediatek: Add controller support for MT7629
  dt-bindings: PCI: Add support for MT7629

Merge branch 'remotes/lorenzo/pci/layerscape'

  - Mark Layerscape endpoint BARs 2 and 4 as 64-bit (Xiaowei Bao)

  - Add CONFIG_PCI_LAYERSCAPE_EP so EP/RC can be built separately (Xiaowei
    Bao)

* remotes/lorenzo/pci/layerscape:
  PCI: layerscape: Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC separately
  PCI: layerscape: Add the bar_fixed_64bit property to the endpoint driver

Merge branch 'remotes/lorenzo/pci/imx'

  - Reduce i.MX 6Quad DBI register length to avoid aborts from accessing
    invalid registers (Stefan Agner)

* remotes/lorenzo/pci/imx:
  PCI: imx6: Limit DBI register length