]> Git Repo - linux.git/log
linux.git
6 months agomm: move can_modify_vma to mm/vma.h
Pedro Falcato [Sat, 17 Aug 2024 00:18:28 +0000 (01:18 +0100)]
mm: move can_modify_vma to mm/vma.h

Patch series "mm: Optimize mseal checks", v3.

Optimize mseal checks by removing the separate can_modify_mm() step, and
just doing checks on the individual vmas, when various operations are
themselves iterating through the tree.  This provides a nice speedup and
restores performance parity with pre-mseal[3].

will-it-scale mmap1_process[1] -t 1 results:

commit 3450fe2b574b4345e4296ccae395149e1a357fee:

min:277605 max:277605 total:277605
min:281784 max:281784 total:281784
min:277238 max:277238 total:277238
min:281761 max:281761 total:281761
min:274279 max:274279 total:274279
min:254854 max:254854 total:254854
measurement
min:269143 max:269143 total:269143
min:270454 max:270454 total:270454
min:243523 max:243523 total:243523
min:251148 max:251148 total:251148
min:209669 max:209669 total:209669
min:190426 max:190426 total:190426
min:231219 max:231219 total:231219
min:275364 max:275364 total:275364
min:266540 max:266540 total:266540
min:242572 max:242572 total:242572
min:284469 max:284469 total:284469
min:278882 max:278882 total:278882
min:283269 max:283269 total:283269
min:281204 max:281204 total:281204

After this patch set:

min:280580 max:280580 total:280580
min:290514 max:290514 total:290514
min:291006 max:291006 total:291006
min:290352 max:290352 total:290352
min:294582 max:294582 total:294582
min:293075 max:293075 total:293075
measurement
min:295613 max:295613 total:295613
min:294070 max:294070 total:294070
min:293193 max:293193 total:293193
min:291631 max:291631 total:291631
min:295278 max:295278 total:295278
min:293782 max:293782 total:293782
min:290361 max:290361 total:290361
min:294517 max:294517 total:294517
min:293750 max:293750 total:293750
min:293572 max:293572 total:293572
min:295239 max:295239 total:295239
min:292932 max:292932 total:292932
min:293319 max:293319 total:293319
min:294954 max:294954 total:294954

This was a Completely Unscientific test but seems to show there were
around 5-10% gains on ops per second.

Oliver performed his own tests and showed[3] a similar ~5% gain in them.

[1]: mmap1_process does mmap and munmap in a loop. I didn't bother testing multithreading cases.
[2]: https://lore.kernel.org/all/20240807124103[email protected]/
[3]: https://lore.kernel.org/all/ZrMMJfe9aXSWxJz6@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/all/[email protected]/
This patch (of 7):

Move can_modify_vma to vma.h so it can be inlined properly (with the
intent to remove can_modify_mm callsites).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pedro Falcato <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: allow read-ahead with IOCB_NOWAIT set
Yafang Shao [Tue, 20 Aug 2024 02:26:39 +0000 (10:26 +0800)]
mm: allow read-ahead with IOCB_NOWAIT set

Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set").  However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation.  This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").

To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior.  This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.

While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.

A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock.  A simple example is as follows:

retry:
    if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
        do_other_work();
        goto retry;
    }

Link: https://lore.gnuweeb.org/io-uring/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: remove migration for HugePage in isolate_single_pageblock()
Kefeng Wang [Tue, 20 Aug 2024 03:26:30 +0000 (11:26 +0800)]
mm: remove migration for HugePage in isolate_single_pageblock()

The gigantic page size may larger than memory block size, so memory
offline always fails in this case after commit b2c9e2fbba32 ("mm: make
alloc_contig_range work at pageblock granularity"),

offline_pages
  start_isolate_page_range
    start_isolate_page_range(isolate_before=true)
      isolate [isolate_start, isolate_start + pageblock_nr_pages)
    start_isolate_page_range(isolate_before=false)
      isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock
        __alloc_contig_migrate_range
          isolate_migratepages_range
            isolate_migratepages_block
              isolate_or_dissolve_huge_page
                if (hstate_is_gigantic(h))
                    return -ENOMEM;

[   15.815756] memory offlining [mem 0x3c0000000-0x3c7ffffff] failed due to failure to isolate range

Gigantic PageHuge is bigger than a pageblock, but since it is freed as
order-0 pages, its pageblocks after being freed will get to the right
free list.  There is no need to have special handling code for them in
start_isolate_page_range().  For both alloc_contig_range() and memory
offline cases, the migration code after start_isolate_page_range() will
be able to migrate gigantic PageHuge when possible.  Let's clean up
start_isolate_page_range() and fix the aforementioned memory offline
failure issue all together.

Let's clean up start_isolate_page_range() and fix the aforementioned
memory offline failure issue all together.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shrinker: use min() to improve shrinker_debugfs_scan_write()
Thorsten Blum [Tue, 20 Aug 2024 04:22:55 +0000 (06:22 +0200)]
mm: shrinker: use min() to improve shrinker_debugfs_scan_write()

Use the min() macro to simplify the shrinker_debugfs_scan_write() function
and improve its readability.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thorsten Blum <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoselftests: mm: support shmem mTHP collapse testing
Baolin Wang [Tue, 20 Aug 2024 09:49:17 +0000 (17:49 +0800)]
selftests: mm: support shmem mTHP collapse testing

Add shmem mTHP collpase testing.  Similar to the anonymous page, users can
use the '-s' parameter to specify the shmem mTHP size for testing.

Link: https://lkml.kernel.org/r/fa44bfa20ca5b9fd6f9163a048f3d3c1e53cd0a8.1724140601.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: khugepaged: support shmem mTHP collapse
Baolin Wang [Tue, 20 Aug 2024 09:49:16 +0000 (17:49 +0800)]
mm: khugepaged: support shmem mTHP collapse

Shmem already supports the allocation of mTHP, but khugepaged does not yet
support collapsing mTHP folios.  Now khugepaged is ready to support mTHP,
and this patch enables the collapse of shmem mTHP.

Link: https://lkml.kernel.org/r/b9da76aab4276eb6e5d12c479af2b5eea5b4575d.1724140601.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: khugepaged: support shmem mTHP copy
Baolin Wang [Tue, 20 Aug 2024 09:49:15 +0000 (17:49 +0800)]
mm: khugepaged: support shmem mTHP copy

Iterate each subpage in the large folio to copy, as preparation for
supporting shmem mTHP collapse.

Link: https://lkml.kernel.org/r/222d615b7c837eabb47a238126c5fdeff8aa5283.1724140601.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: khugepaged: use the number of pages in the folio to check the reference count
Baolin Wang [Tue, 20 Aug 2024 09:49:14 +0000 (17:49 +0800)]
mm: khugepaged: use the number of pages in the folio to check the reference count

Use the number of pages in the folio to check the reference count as
preparation for supporting shmem mTHP collapse.

Link: https://lkml.kernel.org/r/9ea49262308de28957596cc6e8edc2d3a4f54659.1724140601.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: khugepaged: expand the is_refcount_suitable() to support file folios
Baolin Wang [Tue, 20 Aug 2024 09:49:13 +0000 (17:49 +0800)]
mm: khugepaged: expand the is_refcount_suitable() to support file folios

Patch series "support shmem mTHP collapse", v2.

Shmem already supports mTHP allocation[1], and this patchset adds support
for shmem mTHP collapse, as well as adding relevant test cases.

This patch (of 5):

Expand the is_refcount_suitable() to support reference checks for file
folios, as preparation for supporting shmem mTHP collapse.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/eae4cb3195ebbb654bfb7967cb7261d4e4e7c7fa.1724140601.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
David Hildenbrand [Tue, 20 Aug 2024 12:22:10 +0000 (14:22 +0200)]
mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

We already force-inline page_fixed_fake_head(), page_is_fake_head() and
PageTail(), however the compiler might decide that _compound_head() is not
worthy to be inlined, because of page_fixed_fake_head().

The result is that, for example, PageAnonExclusive() now might involve a
function call when checking PageHuge(), which performs a
page_folio()->_compound_head() call.  This can lead to a slight regression
of the stress-ng.clone benchmark.

This is not super-urgent to fix, but always inlining _compound_head()
seems like the obvious thing to do for this primitive, similar to the
other ones.

This change restores the slight regression and a compilation with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2]:

add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
...
Total: Before=32786363, After=32785282, chg -0.00%

[1] https://lkml.kernel.org/r/817150f2-abf7-430f-9973-540bd6cdd26f@intel.com
[2] https://lore.kernel.org/all/116e117c-2821-401d-8e62-b85cdec37f4a@redhat.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Tested-by: Yin Fengwei <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space
Uros Bizjak [Sun, 18 Aug 2024 21:01:52 +0000 (23:01 +0200)]
mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space

Use IS_ERR_PCPU() instead of IS_ERR() for pointers in the percpu address
space.  The patch also fixes following sparse warnings:

kmemleak.c:1063:39: warning: cast removes address space '__percpu' of expression
kmemleak.c:1138:37: warning: cast removes address space '__percpu' of expression

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uros Bizjak <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoerr.h: add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros
Uros Bizjak [Sun, 18 Aug 2024 21:01:51 +0000 (23:01 +0200)]
err.h: add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros

Add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros that operate
on pointers in the percpu address space.

These macros remove the need for (__force void *) function argument casts
(to avoid sparse -Wcast-from-as warnings).

The patch will also avoid future build errors due to pointer address space
mismatch with enabled strict percpu address space checks.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uros Bizjak <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoselftests/mm: remove unnecessary ia64 code and comment
Jinjiang Tu [Mon, 19 Aug 2024 13:06:09 +0000 (21:06 +0800)]
selftests/mm: remove unnecessary ia64 code and comment

IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64)
architecture"), so remove unnecessary ia64 special mm code and comment in
selftests too.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nanyong Sun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: krealloc: clarify valid usage of __GFP_ZERO
Danilo Krummrich [Mon, 12 Aug 2024 22:34:35 +0000 (00:34 +0200)]
mm: krealloc: clarify valid usage of __GFP_ZERO

Properly document that if __GFP_ZERO logic is requested, callers must
ensure that, starting with the initial memory allocation, every subsequent
call to this API for the same memory allocation is flagged with
__GFP_ZERO.  Otherwise, it is possible that __GFP_ZERO is not fully
honored by this API.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Danilo Krummrich <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: krealloc: consider spare memory for __GFP_ZERO
Danilo Krummrich [Mon, 12 Aug 2024 22:34:34 +0000 (00:34 +0200)]
mm: krealloc: consider spare memory for __GFP_ZERO

As long as krealloc() is called with __GFP_ZERO consistently, starting
with the initial memory allocation, __GFP_ZERO should be fully honored.

However, if for an existing allocation krealloc() is called with a
decreased size, it is not ensured that the spare portion the allocation is
zeroed.  Thus, if krealloc() is subsequently called with a larger size
again, __GFP_ZERO can't be fully honored, since we don't know the previous
size, but only the bucket size.

Example:

buf = kzalloc(64, GFP_KERNEL);
memset(buf, 0xff, 64);

buf = krealloc(buf, 48, GFP_KERNEL | __GFP_ZERO);

/* After this call the last 16 bytes are still 0xff. */
buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO);

Fix this, by explicitly setting spare memory to zero, when shrinking an
allocation with __GFP_ZERO flag set or init_on_alloc enabled.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Danilo Krummrich <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm/rmap: use folio->_mapcount for small folios
David Hildenbrand [Fri, 16 Aug 2024 10:32:46 +0000 (12:32 +0200)]
mm/rmap: use folio->_mapcount for small folios

We have some cases left whereby we operate on small folios and still refer
to page->_mapcount.  Let's just use folio->_mapcount instead, which
currently still overlays page->_mapcount, so no change.

This change will make it easier to later spot any remaining users of
page->_mapcount that target tail pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm/hugetlb: use __GFP_COMP for gigantic folios
Yu Zhao [Wed, 14 Aug 2024 03:54:51 +0000 (21:54 -0600)]
mm/hugetlb: use __GFP_COMP for gigantic folios

Use __GFP_COMP for gigantic folios to greatly reduce not only the amount
of code but also the allocation and free time.

LOC (approximately): +60, -240

Allocate and free 500 1GB hugeTLB memory without HVO by:
  time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

       Before  After
Alloc  ~13s    ~10s
Free   ~15s    <1s

The above magnitude generally holds for multiple x86 and arm64 CPU models.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Frank van der Linden <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm/cma: add cma_{alloc,free}_folio()
Yu Zhao [Wed, 14 Aug 2024 03:54:50 +0000 (21:54 -0600)]
mm/cma: add cma_{alloc,free}_folio()

With alloc_contig_range() and free_contig_range() supporting large folios,
CMA can allocate and free large folios too, by cma_alloc_folio() and
cma_free_folio().

[[email protected]: fix WARN in cma_alloc_folio()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm/contig_alloc: support __GFP_COMP
Yu Zhao [Wed, 14 Aug 2024 03:54:49 +0000 (21:54 -0600)]
mm/contig_alloc: support __GFP_COMP

Patch series "mm/hugetlb: alloc/free gigantic folios", v2.

Use __GFP_COMP for gigantic folios can greatly reduce not only the amount
of code but also the allocation and free time.

Approximate LOC to mm/hugetlb.c: +60, -240

Allocate and free 500 1GB hugeTLB memory without HVO by:
  time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

       Before  After
Alloc  ~13s    ~10s
Free   ~15s    <1s

The above magnitude generally holds for multiple x86 and arm64 CPU
models.

Perf profile before:
  Alloc
    - 99.99% alloc_pool_huge_folio
       - __alloc_fresh_hugetlb_folio
          - 83.23% alloc_contig_pages_noprof
             - 47.46% alloc_contig_range_noprof
                - 20.96% isolate_freepages_range
                     16.10% split_page
                - 14.10% start_isolate_page_range
                - 12.02% undo_isolate_page_range

  Free
    - update_and_free_pages_bulk
       - 87.71% free_contig_range
          - 76.02% free_unref_page
             - 41.30% free_unref_page_commit
                - 32.58% free_pcppages_bulk
                   - 24.75% __free_one_page
               13.96% _raw_spin_trylock
         12.27% __update_and_free_hugetlb_folio

Perf profile after:
  Alloc
    - 99.99% alloc_pool_huge_folio
         alloc_gigantic_folio
       - alloc_contig_pages_noprof
          - 59.15% alloc_contig_range_noprof
             - 20.72% start_isolate_page_range
               20.64% prep_new_page
             - 17.13% undo_isolate_page_range

  Free
    - update_and_free_pages_bulk
       - __folio_put
       - __free_pages_ok
            7.46% free_tail_page_prepare
          - 1.97% free_one_page
               1.86% __free_one_page

This patch (of 3):

Support __GFP_COMP in alloc_contig_range().  When the flag is set, upon
success the function returns a large folio prepared by prep_new_page(),
rather than a range of order-0 pages prepared by split_free_pages() (which
is renamed from split_map_pages()).

alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios.  So on the free path,
free_one_page() needs to handle that by split_large_buddy().

[[email protected]: fix folio_alloc_gigantic_noprof() WARN expression, per Yu Liao]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Frank van der Linden <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm,memcg: provide per-cgroup counters for NUMA balancing operations
Kaiyang Zhao [Wed, 14 Aug 2024 17:42:27 +0000 (17:42 +0000)]
mm,memcg: provide per-cgroup counters for NUMA balancing operations

The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on machines equipped with tiered memory.

Different containers in the system may experience drastically different
memory tiering actions that cannot be distinguished from the global
counters alone.

For example, a container running a workload that has a much hotter memory
accesses will likely see more promotions and fewer demotions, potentially
depriving a colocated container of top tier memory to such an extent that
its performance degrades unacceptably.

For another example, some containers may exhibit longer periods between
data reuse, causing much more numa_hint_faults than numa_pages_migrated.
In this case, tuning hot_threshold_ms may be appropriate, but the signal
can easily be lost if only global counters are available.

In the long term, we hope to introduce per-cgroup control of promotion and
demotion actions to implement memory placement policies in tiering.

This patch set adds seven counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
pgdemote_khugepaged, pgdemote_direct and pgpromote_success.  pgdemote_*
and pgpromote_success are also available in memory.numa_stat.

count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated.  The accounting of PGDEMOTE_* is moved to
shrink_inactive_list() before being changed to per-cgroup.

[[email protected]: add documentation of the memcg counters in cgroup-v2.rst]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaiyang Zhao <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agokasan: simplify and clarify Makefile
Andrey Konovalov [Tue, 13 Aug 2024 22:40:27 +0000 (00:40 +0200)]
kasan: simplify and clarify Makefile

When KASAN support was being added to the Linux kernel, GCC did not yet
support all of the KASAN-related compiler options.  Thus, the KASAN
Makefile had to probe the compiler for supported options.

Nowadays, the Linux kernel GCC version requirement is 5.1+, and thus we
don't need the probing of the -fasan-shadow-offset parameter: it exists in
all 5.1+ GCCs.

Simplify the KASAN Makefile to drop CFLAGS_KASAN_MINIMAL.

Also add a few more comments and unify the indentation.

[[email protected]: comments fixes per Miguel]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Miguel Ojeda <[email protected]>
Acked-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Matthew Maurer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: support large folio swap out
Baolin Wang [Mon, 12 Aug 2024 07:42:10 +0000 (15:42 +0800)]
mm: shmem: support large folio swap out

Shmem will support large folio allocation [1] [2] to get a better
performance, however, the memory reclaim still splits the precious large
folios when trying to swap out shmem, which may lead to the memory
fragmentation issue and can not take advantage of the large folio for
shmeme.

Moreover, the swap code already supports for swapping out large folio
without split, hence this patch set supports the large folio swap out for
shmem.

Note the i915_gem_shmem driver still need to be split when swapping, thus
add a new flag 'split_large_folio' for writeback_control to indicate
spliting the large folio.

[1] https://lore.kernel.org/all/cover.1717495894[email protected]/
[2] https://lore.kernel.org/all/20240515055719[email protected]/

[[email protected]: shmem_writepage() split folio at EOF before swapout]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: remove the wbc->split_large_folio per Hugh]
Link: https://lkml.kernel.org/r/1236a002daa301b3b9ba73d6c0fab348427cf295.1724833399.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/d80c21abd20e1b0f5ca66b330f074060fb2f082d.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: split large entry if the swapin folio is not large
Baolin Wang [Mon, 12 Aug 2024 07:42:09 +0000 (15:42 +0800)]
mm: shmem: split large entry if the swapin folio is not large

Now the swap device can only swap-in order 0 folio, even though a large
folio is swapped out.  This requires us to split the large entry
previously saved in the shmem pagecache to support the swap in of small
folios.

[[email protected]: fix warnings from kmalloc_fix_flags()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: drop the 'new_order' parameter]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/4a0f12f27c54a62eb4d9ca1265fed3a62531a63e.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
Baolin Wang [Mon, 12 Aug 2024 07:42:08 +0000 (15:42 +0800)]
mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()

To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from shmem pagecache after adding into
swap cache.

Link: https://lkml.kernel.org/r/b371eadb27f42fc51261c51008fbb9a334985b4c.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: support large folio allocation for shmem_replace_folio()
Baolin Wang [Mon, 12 Aug 2024 07:42:07 +0000 (15:42 +0800)]
mm: shmem: support large folio allocation for shmem_replace_folio()

To support large folio swapin for shmem in the following patches, add
large folio allocation for the new replacement folio in
shmem_replace_folio().  Moreover large folios occupy N consecutive entries
in the swap cache instead of using multi-index entries like the page
cache, therefore we should replace each consecutive entries in the swap
cache instead of using the shmem_replace_entry().

As well as updating statistics and folio reference count using the number
of pages in the folio.

[[email protected]: fix the gfp flag for large folio allocation]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix build without CONFIG_TRANSPARENT_HUGEPAGE]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/a41138ecc857ef13e7c5ffa0174321e9e2c9970a.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: use swap_free_nr() to free shmem swap entries
Baolin Wang [Mon, 12 Aug 2024 07:42:06 +0000 (15:42 +0800)]
mm: shmem: use swap_free_nr() to free shmem swap entries

As a preparation for supporting shmem large folio swapout, use
swap_free_nr() to free some continuous swap entries of the shmem large
folio when the large folio was swapped in from the swap cache.  In
addition, the index should also be round down to the number of pages when
adding the swapin folio into the pagecache.

Link: https://lkml.kernel.org/r/342207fa679fc88a447dac2e101ad79e6050fe79.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: filemap: use xa_get_order() to get the swap entry order
Baolin Wang [Mon, 12 Aug 2024 07:42:05 +0000 (15:42 +0800)]
mm: filemap: use xa_get_order() to get the swap entry order

In the following patches, shmem will support the swap out of large folios,
which means the shmem mappings may contain large order swap entries, so
using xa_get_order() to get the folio order of the shmem swap entry to
update the '*start' correctly.

[[email protected]: use xa_get_order() to get the swap entry order]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/6876d55145c1cc80e79df7884aa3a62e397b101d.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: return number of pages beeing freed in shmem_free_swap
Daniel Gomez [Mon, 12 Aug 2024 07:42:04 +0000 (15:42 +0800)]
mm: shmem: return number of pages beeing freed in shmem_free_swap

Both shmem_free_swap callers expect the number of pages being freed.  In
the large folios context, this needs to support larger values other than 0
(used as 1 page being freed) and -ENOENT (used as 0 pages being freed).
In preparation for large folios adoption, make shmem_free_swap routine
return the number of pages being freed.  So, returning 0 in this context,
means 0 pages being freed.

While we are at it, changing to use free_swap_and_cache_nr() to free large
order swap entry by Baolin Wang.

Link: https://lkml.kernel.org/r/9623e863c83d749d5ab407f6fdf0a8e5a3bdf052.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Daniel Gomez <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: shmem: extend shmem_partial_swap_usage() to support large folio swap
Baolin Wang [Mon, 12 Aug 2024 07:42:03 +0000 (15:42 +0800)]
mm: shmem: extend shmem_partial_swap_usage() to support large folio swap

To support shmem large folio swapout in the following patches, using
xa_get_order() to get the order of the swap entry to calculate the swap
usage of shmem.

Link: https://lkml.kernel.org/r/60b130b9fc3e422bb91293a172c2113c85e9233a.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
Baolin Wang [Mon, 12 Aug 2024 07:42:02 +0000 (15:42 +0800)]
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting

Patch series "support large folio swap-out and swap-in for shmem", v5.

Shmem will support large folio allocation [1] [2] to get a better
performance, however, the memory reclaim still splits the precious large
folios when trying to swap-out shmem, which may lead to the memory
fragmentation issue and can not take advantage of the large folio for
shmeme.

Moreover, the swap code already supports for swapping out large folio
without split, and large folio swap-in[3] series is queued into
mm-unstable branch.  Hence this patch set also supports the large folio
swap-out and swap-in for shmem.

This patch (of 9):

To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.

While we are at it, using folio_nr_pages() to get the number of pages of
the folio as a preparation.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: attempt to batch free swap entries for zap_pte_range()
Barry Song [Wed, 7 Aug 2024 21:58:59 +0000 (09:58 +1200)]
mm: attempt to batch free swap entries for zap_pte_range()

Zhiguo reported that swap release could be a serious bottleneck during
process exits[1].  With mTHP, we have the opportunity to batch free swaps.

Thanks to the work of Chris and Kairui[2], I was able to achieve this
optimization with minimal code changes by building on their efforts.

If swap_count is 1, which is likely true as most anon memory are private,
we can free all contiguous swap slots all together.

Ran the below test program for measuring the bandwidth of munmap
using zRAM and 64KiB mTHP:

 #include <sys/mman.h>
 #include <sys/time.h>
 #include <stdlib.h>

 unsigned long long tv_to_ms(struct timeval tv)
 {
        return tv.tv_sec * 1000 + tv.tv_usec / 1000;
 }

 main()
 {
        struct timeval tv_b, tv_e;
        int i;
 #define SIZE 1024*1024*1024
        void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (!p) {
                perror("fail to get memory");
                exit(-1);
        }

        madvise(p, SIZE, MADV_HUGEPAGE);
        memset(p, 0x11, SIZE); /* write to get mem */

        madvise(p, SIZE, MADV_PAGEOUT);

        gettimeofday(&tv_b, NULL);
        munmap(p, SIZE);
        gettimeofday(&tv_e, NULL);

        printf("munmap in bandwidth: %ld bytes/ms\n",
                        SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b)));
 }

The result is as below (munmap bandwidth):
                mm-unstable  mm-unstable-with-patch
   round1       21053761      63161283
   round2       21053761      63161283
   round3       21053761      63161283
   round4       20648881      67108864
   round5       20648881      67108864

munmap bandwidth becomes 3X faster.

[1] https://lore.kernel.org/linux-mm/20240731133318[email protected]/
[2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/

[[email protected]: check all swaps belong to same swap_cgroup in swap_pte_batch()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add mem_cgroup_disabled() check]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add missing zswap_invalidate()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: rename instances of swap_info_struct to meaningful 'si'
Barry Song [Wed, 7 Aug 2024 21:58:58 +0000 (09:58 +1200)]
mm: rename instances of swap_info_struct to meaningful 'si'

Patch series "mm: batch free swaps for zap_pte_range()", v3.

Batch free swap slots for zap_pte_range(), making munmap three times
faster when the page table entries are filled with swap entries to
be freed. This is likely another advantage of using mTHP.

This patch (of 3):

"p" means "pointer to something", rename it to a more meaningful
identifier - "si".  We also have a case with the name "sis", rename it to
"si" as well.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Zhiguo Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agodocs: move numa=fake description to kernel-parameters.txt
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:10 +0000 (09:41 +0300)]
docs: move numa=fake description to kernel-parameters.txt

NUMA emulation can be now enabled on arm64 and riscv in addition to x86.

Move description of numa=fake parameters from x86 documentation of
admin-guide/kernel-parameters.txt

Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Zi Yan <[email protected]>
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: make range-to-target_node lookup facility a part of numa_memblks
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:09 +0000 (09:41 +0300)]
mm: make range-to-target_node lookup facility a part of numa_memblks

The x86 implementation of range-to-target_node lookup (i.e.
phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
numa_memblks.

Since numa_memblks are now part of the generic code, move these functions
from x86 to mm/numa_memblks.c and select CONFIG_NUMA_KEEP_MEMINFO when
CONFIG_NUMA_MEMBLKS=y for dax and cxl.

[[email protected]: fix build]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoarch_numa: switch over to numa_memblks
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:08 +0000 (09:41 +0300)]
arch_numa: switch over to numa_memblks

Until now arch_numa was directly translating firmware NUMA information
to memblock.

Using numa_memblks as an intermediate step has a few advantages:
* alignment with more battle tested x86 implementation
* availability of NUMA emulation
* maintaining node information for not yet populated memory

Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t and
replace current functionality related to numa_add_memblk() and
__node_distance() in arch_numa with the implementation based on
numa_memblks and add functions required by numa_emulation.

[[email protected]: fix section mismatch]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: PFN_PHYS() translation is unnecessary here]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoof, numa: return -EINVAL when no numa-node-id is found
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:07 +0000 (09:41 +0300)]
of, numa: return -EINVAL when no numa-node-id is found

Currently of_numa_parse_memory_nodes() returns 0 if no "memory" node in
device tree contains "numa-node-id" property.  This makes of_numa_init()
to return "success" despite no NUMA nodes were actually parsed and set up.

arch_numa workarounds this by returning an error if numa_nodes_parsed is
empty.

numa_memblks however would WARN() in such case and since it will be used
by arch_numa shortly, such warning is not desirable.

Make sure of_numa_init() returns -EINVAL when no NUMA node information was
found in the device tree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:06 +0000 (09:41 +0300)]
mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo

numa_cleanup_meminfo() moves blocks outside system RAM to
numa_reserved_meminfo and it uses 0 and PFN_PHYS(max_pfn) to determine the
memory boundaries.

Replace the memory range boundaries with more portable
memblock_start_of_DRAM() and memblock_end_of_DRAM().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: numa_memblks: make several functions and variables static
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:05 +0000 (09:41 +0300)]
mm: numa_memblks: make several functions and variables static

Make functions and variables that are exclusively used by numa_memblks
static.

Move numa_nodemask_from_meminfo() before its callers to avoid forward
declaration.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: numa_memblks: introduce numa_memblks_init
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:04 +0000 (09:41 +0300)]
mm: numa_memblks: introduce numa_memblks_init

Move most of x86::numa_init() to numa_memblks so that the latter will be
more self-contained.

With this numa_memblk data structures should not be exposed to the
architecture specific code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: introduce numa_emulation
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:03 +0000 (09:41 +0300)]
mm: introduce numa_emulation

Move numa_emulation code from arch/x86 to mm/numa_emulation.c

This code will be later reused by arch_numa.

No functional changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: move numa_distance and related code from x86 to numa_memblks
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:02 +0000 (09:41 +0300)]
mm: move numa_distance and related code from x86 to numa_memblks

Move code dealing with numa_distance array from arch/x86 to
mm/numa_memblks.c

This code will be later reused by arch_numa.

No functional changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: introduce numa_memblks
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:01 +0000 (09:41 +0300)]
mm: introduce numa_memblks

Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
options to let x86 select it in its Kconfig.

This code will be later reused by arch_numa.

No functional changes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:41:00 +0000 (09:41 +0300)]
x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned

CPU id cannot be negative.

Making it unsigned also aligns with declarations in
include/asm-generic/numa.h used by arm64 and riscv and allows sharing numa
emulation code with these architectures.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa_emu: use a helper function to get MAX_DMA32_PFN
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:59 +0000 (09:40 +0300)]
x86/numa_emu: use a helper function to get MAX_DMA32_PFN

This is required to make numa emulation code architecture independent so
that it can be moved to generic code in following commits.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa_emu: split __apicid_to_node update to a helper function
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:58 +0000 (09:40 +0300)]
x86/numa_emu: split __apicid_to_node update to a helper function

This is required to make numa emulation code architecture independent so
that it can be moved to generic code in following commits.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa_emu: simplify allocation of phys_dist
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:57 +0000 (09:40 +0300)]
x86/numa_emu: simplify allocation of phys_dist

By the time numa_emulation() is called, all physical memory is already
mapped in the direct map and there is no need to define limits for
memblock allocation.

Replace memblock_phys_alloc_range() with memblock_alloc().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa: move FAKE_NODE_* defines to numa_emu
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:56 +0000 (09:40 +0300)]
x86/numa: move FAKE_NODE_* defines to numa_emu

The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are only
used by numa emulation code, make them local to
arch/x86/mm/numa_emulation.c

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa: use get_pfn_range_for_nid to verify that node spans memory
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:55 +0000 (09:40 +0300)]
x86/numa: use get_pfn_range_for_nid to verify that node spans memory

Instead of looping over numa_meminfo array to detect node's start and
end addresses use get_pfn_range_for_init().

This is shorter and make it easier to lift numa_memblks to generic code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agox86/numa: simplify numa_distance allocation
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:54 +0000 (09:40 +0300)]
x86/numa: simplify numa_distance allocation

Allocation of numa_distance uses memblock_phys_alloc_range() to limit
allocation to be below the last mapped page.

But NUMA initializaition runs after the direct map is populated and there
is also code in setup_arch() that adjusts memblock limit to reflect how
much memory is already mapped in the direct map.

Simplify the allocation of numa_distance and use plain memblock_alloc().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoarch, mm: pull out allocation of NODE_DATA to generic code
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:53 +0000 (09:40 +0300)]
arch, mm: pull out allocation of NODE_DATA to generic code

Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting of
the addresses where the memory was allocated.

Use x86 version as the basis for the generic alloc_node_data() function
and call this function in architecture specific numa initialization.

Round up node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE like
x86 used to do since the bootmem era when allocation granularity was
PAGE_SIZE anyway.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:52 +0000 (09:40 +0300)]
mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION

There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so
arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore.

Replace the call to arch_alloc_nodedata() in free_area_init() with a new
helper alloc_offline_node_data(), remove arch_refresh_nodedata() and
cleanup include/linux/memory_hotplug.h from the associated ifdefery.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoarch, mm: move definition of node_data to generic code
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:51 +0000 (09:40 +0300)]
arch, mm: move definition of node_data to generic code

Every architecture that supports NUMA defines node_data in the same way:

struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoMIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:50 +0000 (09:40 +0300)]
MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION

Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27")
added HAVE_ARCH_NODEDATA_EXTENSION to loongson64 to silence a compilation
error that happened because loongson64 didn't define array of pg_data_t as
node_data like most other architectures did.

After rename of __node_data to node_data arch_alloc_nodedata() and
HAVE_ARCH_NODEDATA_EXTENSION can be dropped from loongson64.

Since it was the only user of HAVE_ARCH_NODEDATA_EXTENSION config option
also remove this option from arch/mips/Kconfig.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoMIPS: loongson64: rename __node_data to node_data
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:49 +0000 (09:40 +0300)]
MIPS: loongson64: rename __node_data to node_data

Make definition of node_data match other architectures.  This will allow
pulling declaration of node_data to the generic mm code in the following
commit.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jiaxun Yang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoMIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:48 +0000 (09:40 +0300)]
MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION

Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27")
added HAVE_ARCH_NODEDATA_EXTENSION to sgi-ip27 to silence a compilation
error that happened because sgi-ip27 didn't define array of pg_data_t as
node_data like most other architectures did.

After addition of node_data array that matches other architectures and
after ensuring that offline nodes do not appear on node_possible_map, it
is safe to drop arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION
from sgi-ip27.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoMIPS: sgi-ip27: ensure node_possible_map only contains valid nodes
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:47 +0000 (09:40 +0300)]
MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes

For SGI IP27 machines node_possible_map is statically set to NODE_MASK_ALL
and it is not updated during NUMA initialization.

Ensure that it only contains nodes present in the system.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agoMIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:46 +0000 (09:40 +0300)]
MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures

sgi-ip27 is the only system that defines NODE_DATA() differently than the
rest of NUMA machines.

Add node_data array of struct pglist pointers that will point to
__node_data[node]->pglist and redefine NODE_DATA() to use node_data array.

This will allow pulling declaration of node_data to the generic mm code in
the next commit.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: move kernel/numa.c to mm/
Mike Rapoport (Microsoft) [Wed, 7 Aug 2024 06:40:45 +0000 (09:40 +0300)]
mm: move kernel/numa.c to mm/

Patch series "mm: introduce numa_memblks", v4.

Following the discussion about handling of CXL fixed memory windows on
arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
the generic code so they will be available on arm64/riscv and maybe on
loongarch sometime later.

While it could be possible to use memblock to describe CXL memory windows,
it currently lacks notion of unpopulated memory ranges and numa_memblks
does implement this.

Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use trimmed copy of x86 code although there is no
fundamental reason why the same code cannot be used on all these
platforms.  Having numa_memblks in mm/ will make it's interaction with
ACPI and FDT more consistent and I believe will reduce maintenance burden.

And with generic numa_memblks it is (almost) straightforward to enable
NUMA emulation on arm64 and riscv.

The first 9 commits in this series are cleanups that are not strictly
related to numa_memblks.
Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.
Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22
does some aftermath cleanups.
Commit 23 updates of_numa_init() to return error of no NUMA nodes were
found in the device tree.
Commit 24 switches arch_numa to numa_memblks.
Commit 25 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.
Commit 26 moves the description for numa=fake from x86 to admin-guide.

[1] https://lore.kernel.org/all/20240529171236[email protected]/

This patch (of 26):

The stub functions in kernel/numa.c belong to mm/ rather than to kernel/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: add a adaptive full cluster cache reclaim
Kairui Song [Wed, 31 Jul 2024 06:49:21 +0000 (23:49 -0700)]
mm: swap: add a adaptive full cluster cache reclaim

Link all full cluster with one full list, and reclaim from it when the
allocation have ran out of all usable clusters.

There are many reason a folio can end up being in the swap cache while
having no swap count reference.  So the best way to search for such slots
is still by iterating the swap clusters.

With the list as an LRU, iterating from the oldest cluster and keep them
rotating is a very doable and clean way to free up potentially not inuse
clusters.

When any allocation failure, try reclaim and rotate only one cluster.
This is adaptive for high order allocations they can tolerate fallback.
So this avoids latency, and give the full cluster list an fair chance to
get reclaimed.  It release the usage stress for the fallback order 0
allocation or following up high order allocation.

If the swap device is getting very full, reclaim more aggresively to
ensure no OOM will happen.  This ensures order 0 heavy workload won't go
OOM as order 0 won't fail if any cluster still have any space.

[[email protected]: fix discard of full cluster]
Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: relaim the cached parts that got scanned
Kairui Song [Wed, 31 Jul 2024 06:49:20 +0000 (23:49 -0700)]
mm: swap: relaim the cached parts that got scanned

This commit implements reclaim during scan for cluster allocator.

Cluster scanning were unable to reuse SWAP_HAS_CACHE slots, which could
result in low allocation success rate or early OOM.

So to ensure maximum allocation success rate, integrate reclaiming with
scanning.  If found a range of suitable swap slots but fragmented due to
HAS_CACHE, just try to reclaim the slots.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: add a fragment cluster list
Kairui Song [Wed, 31 Jul 2024 06:49:19 +0000 (23:49 -0700)]
mm: swap: add a fragment cluster list

Now swap cluster allocator arranges the clusters in LRU style, so the
"cold" cluster stay at the head of nonfull lists are the ones that were
used for allocation long time ago and still partially occupied.  So if
allocator can't find enough contiguous slots to satisfy an high order
allocation, it's unlikely there will be slot being free on them to satisfy
the allocation, at least in a short period.

As a result, nonfull cluster scanning will waste time repeatly scanning
the unusable head of the list.

Also, multiple CPUs could content on the same head cluster of nonfull
list.  Unlike free clusters which are removed from the list when any CPU
starts using it, nonfull cluster stays on the head.

So introduce a new list frag list, all scanned nonfull clusters will be
moved to this list.  Both for avoiding repeated scanning and contention.

Frag list is still used as fallback for allocations, so if one CPU failed
to allocate one order of slots, it can still steal other CPU's clusters.
And order 0 will favor the fragmented clusters to better protect nonfull
clusters

If any slots on a fragment list are being freed, move the fragment list
back to nonfull list indicating it worth another scan on the cluster.
Compared to scan upon freeing a slot, this keep the scanning lazy and save
some CPU if there are still other clusters to use.

It may seems unneccessay to keep the fragmented cluster on list at all if
they can't be used for specific order allocation.  But this will start to
make sense once reclaim dring scanning is ready.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: allow cache reclaim to skip slot cache
Kairui Song [Wed, 31 Jul 2024 06:49:18 +0000 (23:49 -0700)]
mm: swap: allow cache reclaim to skip slot cache

Currently we free the reclaimed slots through slot cache even if the slot
is required to be empty immediately.  As a result the reclaim caller will
see the slot still occupied even after a successful reclaim, and need to
keep reclaiming until slot cache get flushed.  This caused ineffective or
over reclaim when SWAP is under stress.

So introduce a new flag allowing the slot to be emptied bypassing the slot
cache.

[[email protected]: small folios should have nr_pages == 1 but not nr_page == 0]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: skip slot cache on freeing for mTHP
Kairui Song [Wed, 31 Jul 2024 06:49:17 +0000 (23:49 -0700)]
mm: swap: skip slot cache on freeing for mTHP

Currently when we are freeing mTHP folios from swap cache, we free then
one by one and put each entry into swap slot cache.  Slot cache is
designed to reduce the overhead by batching the freeing, but mTHP swap
entries are already continuous so they can be batch freed without it
already, it saves litle overhead, or even increase overhead for larger
mTHP.

What's more, mTHP entries could stay in swap cache for a while.
Contiguous swap entry is an rather rare resource so releasing them
directly can help improve mTHP allocation success rate when under
pressure.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Acked-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: clean up initialization helper
Kairui Song [Wed, 31 Jul 2024 06:49:16 +0000 (23:49 -0700)]
mm: swap: clean up initialization helper

At this point, alloc_cluster is never called already, and
inc_cluster_info_page is called by initialization only, a lot of dead code
can be dropped.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: separate SSD allocation from scan_swap_map_slots()
Chris Li [Wed, 31 Jul 2024 06:49:15 +0000 (23:49 -0700)]
mm: swap: separate SSD allocation from scan_swap_map_slots()

Previously the SSD and HDD share the same swap_map scan loop in
scan_swap_map_slots().  This function is complex and hard to flow the
execution flow.

scan_swap_map_try_ssd_cluster() can already do most of the heavy lifting
to locate the candidate swap range in the cluster.  However it needs to go
back to scan_swap_map_slots() to check conflict and then perform the
allocation.

When scan_swap_map_try_ssd_cluster() failed, it still depended on the
scan_swap_map_slots() to do brute force scanning of the swap_map.  When
the swapfile is large and almost full, it will take some CPU time to go
through the swap_map array.

Get rid of the cluster allocation dependency on the swap_map scan loop in
scan_swap_map_slots().  Streamline the cluster allocation code path.  No
more conflict checks.

For order 0 swap entry, when run out of free and nonfull list.  It will
allocate from the higher order nonfull cluster list.

Users should see less CPU time spent on searching the free swap slot when
swapfile is almost full.

[[email protected]: fix array-bounds error with CONFIG_THP_SWAP=n]
Link: https://lkml.kernel.org/r/CAMgjq7Bz0DY+rY0XgCoH7-Q=uHLdo3omi8kUr4ePDweNyofsbQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: mTHP allocate swap entries from nonfull list
Chris Li [Wed, 31 Jul 2024 06:49:14 +0000 (23:49 -0700)]
mm: swap: mTHP allocate swap entries from nonfull list

Track the nonfull cluster as well as the empty cluster on lists.  Each
order has one nonfull cluster list.

The cluster will remember which order it was used during new cluster
allocation.

When the cluster has free entry, add to the nonfull[order] list.  Â When
the free cluster list is empty, also allocate from the nonempty list of
that order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of numbers of different orders
of mTHP changes a lot.  e.g.  there are a lot of nonfull cluster assign to
order A while later time there are a lot of order B allocation while very
little allocation in order A.  Currently the cluster used by order A will
not reused by order B unless the cluster is 100% empty.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: swap: swap cluster switch to double link list
Chris Li [Wed, 31 Jul 2024 06:49:13 +0000 (23:49 -0700)]
mm: swap: swap cluster switch to double link list

Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
v5.

This is the short term solutions "swap cluster order" listed in my "Swap
Abstraction" discussion slice 8 in the recent LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders" is
introduced, it only allocates the mTHP swap entries from the new empty
cluster list.  Â It has a fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The reason is that all the empty clusters have been exhausted while there
are plenty of free swap entries in the cluster that are not 100% free.

Remember the swap allocation order in the cluster.  Keep track of the per
order non full cluster list for later allocation.

This series gives the swap SSD allocation a new separate code path from
the HDD allocation.  The new allocator use cluster list only and do not
global scan swap_map[] without lock any more.

This streamline the swap allocation for SSD.  The code matches the
execution flow much better.

User impact: For users that allocate and free mix order mTHP swapping, It
greatly improves the success rate of the mTHP swap allocation after the
initial phase.

It also performs faster when the swapfile is close to full, because the
allocator can get the non full cluster from a list rather than scanning a
lot of swap_map entries. 

With Barry's mthp test program V2:

Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

With: # with all 0.00% filter out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all result are 0.00%

$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%

$ ./thp_swap_allocator_test
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

$ ./thp_swap_allocator_test -s
 ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%

=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 core 24 hyperthreading, 32 jobs. 10 Run each group

SSD swap 10 runs average, 20G swap partition:
With:
user    2929.064
system  1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
1504.47 1531.4 1532.92 1551.57
real    1441.324

Without:
user    2910.872
system  1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
1470.19 1496.32 1544.1 1559.01
real    1580.822

Two zram swap: zram0 3.0G zram1 20G.

The idea is forcing the zram0 almost full then overflow to zram1:

With:
user    4320.301
system  4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
4279.85 4285.98 4289.64 4293.13
real    431.759

Without
user    4301.393
system  4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
4389.76 4397.16 4398.23 4403.9
real    433.979

------ more test result from Kaiui ----------

Test with build linux kernel using a 4G ZRAM, 1G memory.max limit on top of shmem:

System info: 32 Core AMD Zen2, 64G total memory.

Test 3 times using only 4K pages:
=================================

With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k

Summary (~4.6% improment of system time):
User: 1839.62
System: 2443.89: 2465.77 2454.68 2411.21
Real: 158.88

Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k

Summary:
User: 1839.78
System: 2564.22: 2575.95 2555.15 2561.55
Real: 162.99

Test 5 times using enabled all mTHP pages:
==========================================

With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k

Summary (~8.4% improvement of system time):
User: 1803.25
System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
Real: 175.90

mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541

Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k

Summary:
User: 1811.61
System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
Real: 185.356

mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757

This patch (of 9):

Previously, the swap cluster used a cluster index as a pointer to
construct a custom single link list type "swap_cluster_list".  The next
cluster pointer is shared with the cluster->count.  It prevents puting the
non free cluster into a list.

Change the cluster to use the standard double link list instead.  This
allows tracing the nonfull cluster in the follow up patch.  That way, it
is faster to get to the nonfull cluster of that order.

Remove the cluster getter/setter for accessing the cluster struct member.

The list operation is protected by the swap_info_struct->lock.

Change cluster code to use "struct swap_cluster_info *" to reference the
cluster rather than by using index.  That is more consistent with the list
manipulation.  It avoids the repeat adding index to the cluser_info.  The
code is easier to understand.

Remove the cluster next pointer is NULL flag, the double link list can
handle the empty list pretty well.

The "swap_cluster_info" struct is two pointer bigger, because 512 swap
entries share one swap_cluster_info struct, it has very little impact on
the average memory usage per swap entry.  For 1TB swapfile, the swap
cluster data structure increases from 8MB to 24MB.

Other than the list conversion, there is no real function change in this
patch.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: initiate deprecation of pressure_level
Shakeel Butt [Wed, 14 Aug 2024 22:00:21 +0000 (15:00 -0700)]
memcg: initiate deprecation of pressure_level

The pressure_level in memcg v1 provides memory pressure notifications to
the user space.  At the moment it provides notifications for three levels
of memory pressure i.e.  low, medium and critical, which are defined based
on internal memory reclaim implementation details.  More specifically the
ratio of scanned and reclaimed pages during a memory reclaim.  However
this is not robust as there are workloads with mostly unreclaimable user
memory or kernel memory.

For v2, the users can use PSI for memory pressure status of the system or
the cgroup.  Let's start the deprecation process for pressure_level and
add warnings to gather the info on how the current users are using this
interface and how they can be used to PSI.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: initiate deprecation of oom_control
Shakeel Butt [Wed, 14 Aug 2024 22:00:20 +0000 (15:00 -0700)]
memcg: initiate deprecation of oom_control

The oom_control provides functionality to disable memcg oom-killer,
notifications on oom-kill and reading the stats regarding oom-kills.  This
interface was mainly introduced to provide functionality for userspace
oom-killers.  However it is not robust enough and only supports OOM
handling in the page fault path.

For v2, the users can use the combination of memory.events notifications,
memory.high and PSI to provide userspace OOM-killing functionality.
Actually LMKD in Android and OOMd in systemd and Meta infrastructure
already use PSI in combination with other stats to implement userspace
OOM-killing.

Let's start the deprecation process for v1 and gather the info on how the
current users are using this interface and work on providing a more robust
functionality in v2.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: initiate deprecation of v1 soft limit
Shakeel Butt [Wed, 14 Aug 2024 22:00:19 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 soft limit

Memcg v1 provides soft limit functionality for the best effort memory
sharing between multiple workloads on a system.  It is usually triggered
through kswapd and at the moment does not reclaim kernel memory.

Memcg v2 provides more straightforward best effort (memory.low) and hard
protection (memory.min) functionalities.  Let's initiate the deprecation
of soft limit from v1 and gather if v2 needs something more to move the
existing v1 users to v2 regarding soft limit.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: initiate deprecation of v1 tcp accounting
Shakeel Butt [Wed, 14 Aug 2024 22:00:18 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 tcp accounting

Patch series "memcg: initiate deprecation of v1 features", v2.

Start the deprecation process of the memcg v1 features which we discussed
during LSFMMBPF 2024 [1].  For now add the warnings to collect the
information on how the current users are using these features.  Next we
will work on providing better alternatives in v2 (if needed) and fully
deprecate these features.

Link: https://lwn.net/Articles/974575
This patch (of 4):

Memcg v1 provides opt-in TCP memory accounting feature.  However it is
mostly unused due to its performance impact on the network traffic.  In
v2, the TCP memory is accounted in the regular memory usage and is
transparent to the users but they can observe the TCP memory usage through
memcg stats.

Let's initiate the deprecation process of memcg v1's tcp accounting
functionality and add warnings to gather if there are any users and if
there are, collect how they are using it and plan to provide them better
alternative in v2.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: make PGPGIN and PGPGOUT v1 only
Shakeel Butt [Thu, 15 Aug 2024 05:04:53 +0000 (22:04 -0700)]
memcg: make PGPGIN and PGPGOUT v1 only

Currently PGPGIN and PGPGOUT are used and exposed in the memcg v1 only
code.  So, let's put them under CONFIG_MEMCG_V1.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: allocate v1 event percpu only on v1 deployment
Shakeel Butt [Thu, 15 Aug 2024 05:04:52 +0000 (22:04 -0700)]
memcg: allocate v1 event percpu only on v1 deployment

Currently memcg->events_percpu gets allocated on v2 deployments.  Let's
move the allocation to v1 only codebase.  This is not needed in v2.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: make v1 only functions static
Shakeel Butt [Thu, 15 Aug 2024 05:04:51 +0000 (22:04 -0700)]
memcg: make v1 only functions static

The functions memcg1_charge_statistics() and memcg1_check_events() are
never used outside of v1 source file.  So, make them static.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: move v1 events and statistics code to v1 file
Shakeel Butt [Thu, 15 Aug 2024 05:04:50 +0000 (22:04 -0700)]
memcg: move v1 events and statistics code to v1 file

Currently the common code path for charge commit, swapout and batched
uncharge are executing v1 only code which is completely useless for the v2
deployments where CONFIG_MEMCG_V1 is disabled.  In addition, it is mucking
with IRQs which might be slow on some architectures.  Let's move all of
this code to v1 only code and remove them from v2 only deployments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: move mem_cgroup_charge_statistics to v1 code
Shakeel Butt [Thu, 15 Aug 2024 05:04:49 +0000 (22:04 -0700)]
memcg: move mem_cgroup_charge_statistics to v1 code

There are no callers of mem_cgroup_charge_statistics() in the v2 code
base, so move it to the v1 only code and rename it to
memcg1_charge_statistics().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: move mem_cgroup_event_ratelimit to v1 code
Shakeel Butt [Thu, 15 Aug 2024 05:04:48 +0000 (22:04 -0700)]
memcg: move mem_cgroup_event_ratelimit to v1 code

There are no callers of mem_cgroup_event_ratelimit() in the v2 code.  Move
it to v1 only code and rename it to memcg1_event_ratelimit().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: move v1 only percpu stats in separate struct
Shakeel Butt [Thu, 15 Aug 2024 05:04:47 +0000 (22:04 -0700)]
memcg: move v1 only percpu stats in separate struct

Patch series "memcg: further decouple v1 code from v2".

Some of the v1 code is still in v2 code base due to v1 fields in the
struct memcg_vmstats_percpu.  This field decouples those fileds from v2
struct and move all the related code into v1 only code base.

This patch (of 7):

At the moment struct memcg_vmstats_percpu contains two v1 only fields
which consumes memory even when CONFIG_MEMCG_V1 is not enabled.  In
addition there are v1 only functions accessing them and are in the main
memcontrol source file and can not be moved to v1 only source file due to
these fields.  Let's move these fields into their own struct.  Later
patches will move the functions accessing them to v1 source file and only
allocate these fields when CONFIG_MEMCG_V1 is enabled.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: override mTHP "enabled" defaults at kernel cmdline
Ryan Roberts [Wed, 14 Aug 2024 02:02:47 +0000 (14:02 +1200)]
mm: override mTHP "enabled" defaults at kernel cmdline

Add thp_anon= cmdline parameter to allow specifying the default enablement
of each supported anon THP size.  The parameter accepts the following
format and can be provided multiple times to configure each size:

thp_anon=<size>,<size>[KMG]:<value>;<size>-<size>[KMG]:<value>

An example:

thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never

See Documentation/admin-guide/mm/transhuge.rst for more details.

Configuring the defaults at boot time is useful to allow early user space
to take advantage of mTHP before its been configured through sysfs.

[[email protected]: use get_oder() and check size is is_power_of_2]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: some minor cleanup according to David's comments]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryan Roberts <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: make write helper functions void
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:44 +0000 (12:19 -0400)]
maple_tree: make write helper functions void

The return value of various write helper functions are not checked. We
can safely change the return type of these functions to be void.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:43 +0000 (12:19 -0400)]
maple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc()

Users of mas_store_prealloc() enter this function with nodes already
preallocated. This means the store type must be already set. We can then
remove the call to mas_wr_store_type() and initialize the write state to
continue the partial walk that was done when determining the store type.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: remove repeated sanity checks from write helper functions
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:42 +0000 (12:19 -0400)]
maple_tree: remove repeated sanity checks from write helper functions

These sanity checks are now redundant as they are already checked in
mas_wr_store_type(). We can remove them from mas_wr_append() and
mas_wr_node_store().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: remove node allocations from various write helper functions
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:41 +0000 (12:19 -0400)]
maple_tree: remove node allocations from various write helper functions

These write helper functions are all called from store paths which
preallocate enough nodes that will be needed for the write. There is no
more need to allocate within the functions themselves.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: have mas_store() allocate nodes if needed
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:40 +0000 (12:19 -0400)]
maple_tree: have mas_store() allocate nodes if needed

Not all users of mas_store() enter with nodes already preallocated.
Check for the MA_STATE_PREALLOC flag to decide whether to preallocate nodes
within mas_store() rather than relying on future write helper functions
to perform the allocations. This allows the write helper functions to be
simplified as they do not have to do checks to make sure there are
enough allocated nodes to perform the write.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: remove mas_wr_modify()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:39 +0000 (12:19 -0400)]
maple_tree: remove mas_wr_modify()

There are no more users of the function, safely remove it.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: simplify mas_commit_b_node()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:38 +0000 (12:19 -0400)]
maple_tree: simplify mas_commit_b_node()

The only callers of mas_commit_b_node() are those with store type of
wr_rebalance and wr_split_store. Use mas->store_type to dispatch to the
correct helper function. This allows the removal of mas_reuse_node() as
it is no longer used.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: convert mas_insert() to preallocate nodes
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:37 +0000 (12:19 -0400)]
maple_tree: convert mas_insert() to preallocate nodes

By setting the store type in mas_insert(), we no longer need to use
mas_wr_modify() to determine the correct store function to use. Instead,
set the store type and call mas_wr_store_entry(). Also, pass in the
requested gfp flags to mas_insert() so they can be passed to the call to
mas_wr_preallocate().

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: use store type in mas_wr_store_entry()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:36 +0000 (12:19 -0400)]
maple_tree: use store type in mas_wr_store_entry()

When storing an entry, we can read the store type that was set from a
previous partial walk of the tree. Now that the type of store is known,
select the correct write helper function to use to complete the store.

Also noinline mas_wr_spanning_store() to limit stack frame usage in
mas_wr_store_entry() as it allocates a maple_big_node on the stack.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: print store type in mas_dump()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:35 +0000 (12:19 -0400)]
maple_tree: print store type in mas_dump()

Knowing the store type of the maple state could be helpful for debugging.
Have mas_dump() print mas->store_type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: use mas_store_gfp() in mtree_store_range()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:34 +0000 (12:19 -0400)]
maple_tree: use mas_store_gfp() in mtree_store_range()

Refactor mtree_store_range() to use mas_store_gfp() which will abstract
the store, memory allocation, and error handling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: preallocate nodes in mas_erase()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:33 +0000 (12:19 -0400)]
maple_tree: preallocate nodes in mas_erase()

Use mas_wr_preallocate() in mas_erase() to preallocate enough nodes to
complete the erase.  Add error handling by skipping the store if the
preallocation lead to some error besides no memory.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: remove mas_destroy() from mas_nomem()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:32 +0000 (12:19 -0400)]
maple_tree: remove mas_destroy() from mas_nomem()

Separate call to mas_destroy() from mas_nomem() so we can check for no
memory errors without destroying the current maple state in
mas_store_gfp().  We then add calls to mas_destroy() to callers of
mas_nomem().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: introduce mas_wr_store_type()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:31 +0000 (12:19 -0400)]
maple_tree: introduce mas_wr_store_type()

Introduce mas_wr_store_type() which will set the correct store type based
on a walk of the tree.  In mas_wr_node_store() the <= min_slots condition
is changed to < as if new_end is = to mt_min_slots then there is not
enough room.

mas_prealloc_calc() is also introduced to abstract the calculation used to
determine the number of nodes needed for a store operation.

In this change a call to mas_reset() is removed in the error case of
mas_prealloc().  This is only needed in the MA_STATE_REBALANCE case of
mas_destroy().  We can move the call to mas_reset() directly to
mas_destroy().

Also, add a test case to validate the order that we check the store type
in is correct.  This test models a vma expanding and then shrinking which
is part of the boot process.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: move up mas_wr_store_setup() and mas_wr_prealloc_setup()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:30 +0000 (12:19 -0400)]
maple_tree: move up mas_wr_store_setup() and mas_wr_prealloc_setup()

Subsequent patches require these definitions to be higher, no functional
changes intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: introduce mas_wr_prealloc_setup()
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:29 +0000 (12:19 -0400)]
maple_tree: introduce mas_wr_prealloc_setup()

Introduce a helper function, mas_wr_prealoc_setup(), that will set up a
maple write state in order to start a walk of a maple tree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomaple_tree: introduce store_type enum
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:28 +0000 (12:19 -0400)]
maple_tree: introduce store_type enum

Patch series "Introduce a store type enum for the Maple tree", v4.

================================ OVERVIEW ================================

This series implements two work items[3]: "aligning mas_store_gfp() with
mas_preallocate()" and "enum for store type".

mas_store_gfp() is modified to preallocate nodes.  This simplies many of
the write helper functions by allowing them to use mas_store_gfp() rather
than open coding node allocation and error handling.

The enum defines the following store types:

enum store_type {
    wr_invalid,
    wr_new_root,
    wr_store_root,
    wr_exact_fit,
    wr_spanning_store,
    wr_split_store,
    wr_rebalance,
    wr_append,
    wr_node_store,
    wr_slot_store,
};

In the current maple tree code, a walk down the tree is done in
mas_preallocate() to determine the number of nodes needed for this write.
After node allocation, mas_wr_store_entry() will perform another walk to
determine which write helper function to use to complete the write.

Rather than performing the second walk, we can store the type of write in
the maple write state during node allocation and read this field to
complete the write.

Patches 1-16 implement this store type feature.
Patch 17 is a cleanup patch to change functions that have unused return
types to be void.

================================ RESULTS =================================

Phoronix t-test-1 (Seconds < Lower Is Better)
    v6.10-rc6
        Threads: 1
            33.15

        Threads: 2
            10.81

    v6.10-rc6 + this series
            Threads: 1
            32.69

        Threads: 2
            10.45

Stress-ng mmap
                    6.10_base  store_type_v4
Duration User        2744.65     2769.40
Duration System     10862.69    10817.59
Duration Elapsed     1477.58     1478.35

================================ TESTING =================================

Testing was done with the maple tree test suite.  A new test case is also
added to validate the order in which we test for and assign the store
type.

[1]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#m81044feb66765265f8ca7f21e4b4b3725b18780a
[2]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#mb36c6526486638e82518c0f37a428fb279c84d8a
[3]: https://lists.infradead.org/pipermail/maple-tree/2023-December/003098.html

This patch (of 17):

Add a store_type enum that is stored in ma_state.  This will be used to
keep track of partial walks of the tree so that subsequent walks can pick
up where a previous walk left off.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: kmem: add lockdep assertion to obj_cgroup_memcg
Muchun Song [Wed, 14 Aug 2024 09:34:15 +0000 (17:34 +0800)]
mm: kmem: add lockdep assertion to obj_cgroup_memcg

obj_cgroup_memcg() is supposed to safe to prevent the returned memory
cgroup from being freed only when the caller is holding the rcu read lock
or objcg_lock or cgroup_mutex.  It is very easy to ignore thoes conditions
when users call some upper APIs which call obj_cgroup_memcg() internally
like mem_cgroup_from_slab_obj() (See the link below).  So it is better to
add lockdep assertion to obj_cgroup_memcg() to find those issues ASAP.

Because there is no user of obj_cgroup_memcg() holding objcg_lock to make
the returned memory cgroup safe, do not add objcg_lock assertion (We
should export objcg_lock if we really want to do).  Additionally, this is
some internal implementation detail of memcg and should not be accessible
outside memcg code.

Some users like __mem_cgroup_uncharge() do not care the lifetime of the
returned memory cgroup, which just want to know if the folio is charged to
a memory cgroup, therefore, they do not need to hold the needed locks.  In
which case, introduce a new helper folio_memcg_charged() to do this.
Compare it to folio_memcg(), it could eliminate a memory access of
objcg->memcg for kmem, actually, a really small gain.

[[email protected]: fix split_page_memcg()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomemcg: use ratelimited stats flush in the reclaim
Shakeel Butt [Tue, 13 Aug 2024 21:53:58 +0000 (14:53 -0700)]
memcg: use ratelimited stats flush in the reclaim

The Meta prod is seeing large amount of stalls in memcg stats flush from
the memcg reclaim code path.  At the moment, this specific callsite is
doing a synchronous memcg stats flush.  The rstat flush is an expensive
and time consuming operation, so concurrent relaimers will busywait on the
lock potentially for a long time.  Actually this issue is not unique to
Meta and has been observed by Cloudflare [1] as well.  For the Cloudflare
case, the stalls were due to contention between kswapd threads running on
their 8 numa node machines which does not make sense as rstat flush is
global and flush from one kswapd thread should be sufficient for all.
Simply replace the synchronous flush with the ratelimited one.

One may raise a concern on potentially using 2 sec stale (at worst) stats
for heuristics like desirable inactive:active ratio and preferring
inactive file pages over anon pages but these specific heuristics do not
require very precise stats and also are ignored under severe memory
pressure.

More specifically for this code path, the stats are needed for two
specific heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable inactive:active
ratio of the LRUs.  The specific stats needed are WORKINGSET_ACTIVATE* and
the hierarchical LRU size.  The WORKINGSET_ACTIVATE* is needed to check if
there is a refault since last snapshot and the LRU size are needed for the
desirable ratio between inactive and active LRUs.  See the table below on
how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundary of 1 GiB, 10 GiB, 100
GiB, 1 TiB and 10 TiB.  There is no need for the precise and accurate LRU
size information to calculate this ratio.  In addition, if deactivation is
skipped for some LRU, the kernel will force deactive on the severe memory
pressure situation.

For the cache trim mode, inactive file LRU size is read and the kernel
scales it down based on the reclaim iteration (file >> sc->priority) and
only checks if it is zero or not.  Again precise information is not
needed.

This patch has been running on Meta fleet for several months and we have
not observed any issues.  Please note that MGLRU is not impacted by this
issue at all as it avoids rstat flushing completely.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agomm: remove legacy install_special_mapping() code
Linus Torvalds [Tue, 20 Aug 2024 22:14:47 +0000 (15:14 -0700)]
mm: remove legacy install_special_mapping() code

All relevant architectures had already been converted to the new interface
(which just has an underscore in front of the name - not very imaginative
naming), this just force-converts the stragglers.

The modern interface is almost identical to the old one, except instead of
the page pointer it takes a "struct vm_special_mapping" that describes the
mapping (and contains the page pointer as one member), and it returns the
resulting 'vma' instead of just the error code.

Getting rid of the old interface also gets rid of some special casing,
which had caused problems with the mremap extensions to "struct
vm_special_mapping".

[[email protected]: coding-style cleanups]
Link: https://lkml.kernel.org/r/CAHk-=whvR+z=0=0gzgdfUiK70JTa-=+9vxD-4T=3BagXR6dciA@mail.gmail.comTested-by:
Link: https://lore.kernel.org/all/20240819195120.GA1113263@thelio-3990X/
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
6 months agopowerpc/vdso: refactor error handling
Michael Ellerman [Mon, 12 Aug 2024 08:26:05 +0000 (18:26 +1000)]
powerpc/vdso: refactor error handling

Linus noticed that the error handling in __arch_setup_additional_pages()
fails to clear the mm VDSO pointer if _install_special_mapping() fails.
In practice there should be no actual bug, because if there's an error the
VDSO pointer is cleared later in arch_setup_additional_pages().

However it's no longer necessary to set the pointer before installing the
mapping.  Commit c1bab64360e6 ("powerpc/vdso: Move to
_install_special_mapping() and remove arch_vma_name()") reworked the code
so that the VMA name comes from the vm_special_mapping.name, rather than
relying on arch_vma_name().

So rework the code to only set the VDSO pointer once the mappings have
been installed correctly, and remove the stale comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This page took 0.160063 seconds and 4 git commands to generate.