Eric Biggers [Wed, 12 Aug 2020 01:35:33 +0000 (18:35 -0700)]
fs/minix: set s_maxbytes correctly
The minix filesystem leaves super_block::s_maxbytes at MAX_NON_LFS rather
than setting it to the actual filesystem-specific limit. This is broken
because it means userspace doesn't see the standard behavior like getting
EFBIG and SIGXFSZ when exceeding the maximum file size.
Eric Biggers [Wed, 12 Aug 2020 01:35:30 +0000 (18:35 -0700)]
fs/minix: reject too-large maximum file size
If the minix filesystem tries to map a very large logical block number to
its on-disk location, block_to_path() can return offsets that are too
large, causing out-of-bounds memory accesses when accessing indirect index
blocks. This should be prevented by the check against the maximum file
size, but this doesn't work because the maximum file size is read directly
from the on-disk superblock and isn't validated itself.
Fix this by validating the maximum file size at mount time.
Eric Biggers [Wed, 12 Aug 2020 01:35:27 +0000 (18:35 -0700)]
fs/minix: don't allow getting deleted inodes
If an inode has no links, we need to mark it bad rather than allowing it
to be accessed. This avoids WARNINGs in inc_nlink() and drop_nlink() when
doing directory operations on a fuzzed filesystem.
Eric Biggers [Wed, 12 Aug 2020 01:35:24 +0000 (18:35 -0700)]
fs/minix: check return value of sb_getblk()
Patch series "fs/minix: fix syzbot bugs and set s_maxbytes".
This series fixes all syzbot bugs in the minix filesystem:
KASAN: null-ptr-deref Write in get_block
KASAN: use-after-free Write in get_block
KASAN: use-after-free Read in get_block
WARNING in inc_nlink
KMSAN: uninit-value in get_block
WARNING in drop_nlink
It also fixes the minix filesystem to set s_maxbytes correctly, so that
userspace sees the correct behavior when exceeding the max file size.
Quentin Monnet [Wed, 12 Aug 2020 01:35:13 +0000 (18:35 -0700)]
checkpatch: fix CONST_STRUCT when const_structs.checkpatch is missing
Checkpatch reports warnings when some specific structs are not declared as
const in the code. The list of structs to consider was initially defined
in the checkpatch.pl script itself, but it was later moved to an external
file (scripts/const_structs.checkpatch), in commit bf1fa1dae68e
("checkpatch: externalize the structs that should be const"). This
introduced two minor issues:
- When file scripts/const_structs.checkpatch is not present (for
example, if checkpatch is run outside of the kernel directory with the
"--no-tree" option), a warning is printed to stderr to tell the user
that "No structs that should be const will be found". This is fair,
but the warning is printed unconditionally, even if the option
"--ignore CONST_STRUCT" is passed. In the latter case, we explicitly
ask checkpatch to skip this check, so no warning should be printed.
- When scripts/const_structs.checkpatch is missing, or even when trying
to silence the warning by adding an empty file, $const_structs is set
to "", and the regex used for finding structs that should be const,
"$line =~ /struct\s+($const_structs)(?!\s*\{)/)", matches all
structs found in the code, thus reporting a number of false positives.
Let's fix the first item by skipping scripts/const_structs.checkpatch
processing if "CONST_STRUCT" checks are ignored, and the second one by
skipping the test if $const_structs is not defined. Since we modify the
read_words() function a little bit, update the checks for
$typedefsfile/$typeOtherTypedefs as well.
Kars Mulder [Wed, 12 Aug 2020 01:34:56 +0000 (18:34 -0700)]
kstrto*: do not describe simple_strto*() as obsolete/replaced
The documentation of the kstrto*() functions describes kstrto*() as
"replacements" of the "obsolete" simple_strto*() functions. Both of these
terms are inaccurate: they're not replacements because they have different
behaviour, and the simple_strto*() are not obsolete because there are
cases where they have benefits over kstrto*().
Remove usage of the terms "replacement" and "obsolete" in reference to
simple_strto*(), and instead use the term "preferred over".
Kars Mulder [Wed, 12 Aug 2020 01:34:53 +0000 (18:34 -0700)]
kstrto*: correct documentation references to simple_strto*()
The documentation of the kstrto*() functions reference the simple_strtoull
function by "used as a replacement for [the obsolete] simple_strtoull".
All these functions describes themselves as replacements for the function
simple_strtoull, even though a function like kstrtol() would be more aptly
described as a replacement of simple_strtol().
Fix these references by making the documentation of kstrto*() reference
the closest simple_strto*() equivalent available. The functions
kstrto[u]int() do not have direct simple_strto[u]int() equivalences, so
these are made to refer to simple_strto[u]l() instead.
Furthermore, add parentheses after function names, as is standard in
kernel documentation.
Tiezhu Yang [Wed, 12 Aug 2020 01:34:47 +0000 (18:34 -0700)]
lib/test_lockup.c: fix return value of test_lockup_init()
Since filp_open() returns an error pointer, we should use IS_ERR() to
check the return value and then return PTR_ERR() if failed to get the
actual return value instead of always -EINVAL.
E.g. without this patch:
[root@localhost loongson]# ls no_such_file
ls: cannot access no_such_file: No such file or directory
[root@localhost loongson]# modprobe test_lockup file_path=no_such_file lock_sb_umount time_secs=60 state=S
modprobe: ERROR: could not insert 'test_lockup': Invalid argument
[root@localhost loongson]# dmesg | tail -1
[ 126.100596] test_lockup: cannot find file_path
With this patch:
[root@localhost loongson]# ls no_such_file
ls: cannot access no_such_file: No such file or directory
[root@localhost loongson]# modprobe test_lockup file_path=no_such_file lock_sb_umount time_secs=60 state=S
modprobe: ERROR: could not insert 'test_lockup': Unknown symbol in module, or unknown parameter (see dmesg)
[root@localhost loongson]# dmesg | tail -1
[ 95.134362] test_lockup: failed to open no_such_file: -2
Tiezhu Yang [Wed, 12 Aug 2020 01:34:44 +0000 (18:34 -0700)]
lib/Kconfig.debug: make TEST_LOCKUP depend on module
Since test_lockup is a test module to generate lockups, it is better to
limit TEST_LOCKUP to module (=m) or disabled (=n) because we can not use
the module parameters when CONFIG_TEST_LOCKUP=y.
lib/test_bitops: do the full test during module init
Currently, the bitops test consists of two parts: one part is executed
during module load, the second part during module unload. This is
cumbersome for the user, as he has to perform two steps to execute all
tests, and is different from most (all?) other tests.
Merge the two parts, so both are executed during module load.
struct __genradix is defined as having its member 'root'
annotated as __rcu. But in the corresponding API RCU is not used.
Sparse reports this type mismatch as:
lib/generic-radix-tree.c:56:35: warning: incorrect type in initializer (different address spaces)
lib/generic-radix-tree.c:56:35: expected struct genradix_root *r
lib/generic-radix-tree.c:56:35: got struct genradix_root [noderef] <asn:4> *__val
with 6 other ones.
So, correct root's type by removing this unneeded __rcu.
Stefano Brivio [Wed, 12 Aug 2020 01:34:32 +0000 (18:34 -0700)]
lib/test_bitmap.c: add test for bitmap_cut()
Inspired by an original patch from Yury Norov: introduce a test for
bitmap_cut() that also makes sure functionality is as described for
partially overlapping src and dst.
Stefano Brivio [Wed, 12 Aug 2020 01:34:29 +0000 (18:34 -0700)]
lib/bitmap.c: fix bitmap_cut() for partial overlapping case
Patch series "lib: Fix bitmap_cut() for overlaps, add test"
This patch (of 2):
Yury Norov reports that bitmap_cut() will not produce the right outcome if
src and dst partially overlap, with src pointing at some location after
dst, because the memmove() affects src before we store the bits that we
need to keep, that is, the bits preceding the cut -- as long as we the
beginning of the cut is not aligned to a long.
Fix this by storing those bits before the memmove().
Note that this is just a theoretical concern so far, as the only user of
this function, pipapo_drop() from the nftables set back-end implemented in
net/netfilter/nft_set_pipapo.c, always supplies entirely overlapping src
and dst.
Feng Tang [Wed, 12 Aug 2020 01:34:13 +0000 (18:34 -0700)]
./Makefile: add debug option to enable function aligned on 32 bytes
Recently 0day reported many strange performance changes (regression or
improvement), in which there was no obvious relation between the culprit
commit and the benchmark at the first look, and it causes people to doubt
the test itself is wrong.
Upon further check, many of these cases are caused by the change to the
alignment of kernel text or data, as whole text/data of kernel are linked
together, change in one domain may affect alignments of other domains.
gcc has an option '-falign-functions=n' to force text aligned, and with
that option enabled, some of those performance changes will be gone, like
[1][2][3].
Add this option so that developers and 0day can easily find performance
bump caused by text alignment change, as tracking these strange bump is
quite time consuming. Though it can't help in other cases like data
alignment changes like [4].
Following is some size data for v5.7 kernel built with a RHEL config used
in 0day:
Add a helper that waits for a pid and stores the status in the passed in
kernel pointer. Use it to fix the usage of kernel_wait4 in
call_usermodehelper_exec_sync that only happens to work due to the
implicit set_fs(KERNEL_DS) for kernel threads.
alpha: fix annotation of io{read,write}{16,32}be()
These accessors must be used to read/write a big-endian bus. The value
returned or written is native-endian.
However, these accessors are defined using be{16,32}_to_cpu() or
cpu_to_be{16,32}() to make the endian conversion but these expect a
__be{16,32} when none is present. Keeping them would need a force cast
that would solve nothing at all.
So, do the conversion using swab{16,32}, like done in asm-generic for
similar situations.
exec: use force_uaccess_begin during exec and exit
Both exec and exit want to ensure that the uaccess routines actually do
access user pointers. Use the newly added force_uaccess_begin helper
instead of an open coded set_fs for that to prepare for kernel builds
where set_fs() does not exist.
Add helpers to wrap the get_fs/set_fs magic for undoing any damange done
by set_fs(KERNEL_DS). There is no real functional benefit, but this
documents the intent of these calls better, and will allow stubbing the
functions out easily for kernels builds that do not allow address space
overrides in the future.
syscalls: use uaccess_kernel in addr_limit_user_check
Patch series "clean up address limit helpers", v2.
In preparation for eventually phasing out direct use of set_fs(), this
series removes the segment_eq() arch helper that is only used to implement
or duplicate the uaccess_kernel() API, and then adds descriptive helpers
to force the kernel address limit.
This patch (of 6):
Use the uaccess_kernel helper instead of duplicating it.
mm, memory_hotplug: update pcp lists everytime onlining a memory block
When onlining a first memory block in a zone, pcp lists are not updated
thus pcp struct will have the default setting of ->high = 0,->batch = 1.
This means till the second memory block in a zone(if it have) is onlined
the pcp lists of this zone will not contain any pages because pcp's
->count is always greater than ->high thus free_pcppages_bulk() is called
to free batch size(=1) pages every time system wants to add a page to the
pcp list through free_unref_page().
To put this in a word, system is not using benefits offered by the pcp
lists when there is a single onlineable memory block in a zone. Correct
this by always updating the pcp lists when memory block is onlined.
Daniel Jordan [Wed, 12 Aug 2020 01:32:12 +0000 (18:32 -0700)]
x86/mm: use max memory block size on bare metal
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes. The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM. The
search for each memory block is noticeable even with commit 4fb6eabf1037
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").
Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end. That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.
Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines. Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.
mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_bgin'
mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_registr'
mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'
The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In that
function, current->flags is accessed 3 times. That may lead to duplicated
access of the same memory location.
This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses. With most of
the debug config options on, however, that may not be the case. For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes. With this patch
applied, the object size is reduced to 202 bytes. This is a saving of 107
bytes and will probably be slightly faster too.
Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.
Mike Kravetz [Wed, 12 Aug 2020 01:32:03 +0000 (18:32 -0700)]
cma: don't quit at first error when activating reserved areas
The routine cma_init_reserved_areas is designed to activate all
reserved cma areas. It quits when it first encounters an error.
This can leave some areas in a state where they are reserved but
not activated. There is no feedback to code which performed the
reservation. Attempting to allocate memory from areas in such a
state will result in a BUG.
Modify cma_init_reserved_areas to always attempt to activate all
areas. The called routine, cma_activate_area is responsible for
leaving the area in a valid state. No one is making active use
of returned error codes, so change the routine to void.
How to reproduce: This example uses kernelcore, hugetlb and cma
as an easy way to reproduce. However, this is a more general cma
issue.
Two node x86 VM 16GB total, 8GB per node
Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
Related boot time messages,
hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
cma: Reserved 4096 MiB at 0x0000000100000000
hugetlb_cma: reserved 4096 MiB on node 0
cma: Reserved 4096 MiB at 0x0000000300000000
hugetlb_cma: reserved 4096 MiB on node 1
cma: CMA area hugetlb could not be activated
Barry Song [Wed, 12 Aug 2020 01:31:57 +0000 (18:31 -0700)]
mm: cma: fix the name of CMA areas
Patch series "mm: fix the names of general cma and hugetlb cma", v2.
The current code of CMA can only work when users pass a const string as
name parameter. we need to fix the way to handle names in CMA. On the
other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each
hugetlb should get a different CMA name.
This patch (of 2):
If users give a name saved in stack, the current code will generate magic
pointer. if users don't give a name(NULL), kasprintf() will always return
NULL as we are at the early stage. that means cma_init_reserved_mem()
will return -ENOMEM if users set name parameter as NULL.
Jianqun Xu [Wed, 12 Aug 2020 01:31:54 +0000 (18:31 -0700)]
mm/cma.c: fix NULL pointer dereference when cma could not be activated
In some case the cma area could not be activated, but the cma_alloc be
used under this case, then the kernel will crash caused by NULL pointer
dereference.
Add bitmap valid check in cma_alloc to avoid this issue.
mm/vmstat: add events for THP migration without split
Add following new vmstat events which will help in validating THP
migration without split. Statistics reported through these new VM events
will help in performance debugging.
In addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
this updates current trace event 'mm_migrate_pages' to accommodate now
available THP statistics.
Yang Shi [Wed, 12 Aug 2020 01:31:48 +0000 (18:31 -0700)]
mm: thp: remove debug_cow switch
Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
anon-THP"), the CoW page fault of THP has been rewritten, debug_cow is not
used anymore. So, just remove it.
Ralph Campbell [Wed, 12 Aug 2020 01:31:41 +0000 (18:31 -0700)]
mm/migrate: optimize migrate_vma_setup() for holes
Patch series "mm/migrate: optimize migrate_vma_setup() for holes".
A simple optimization for migrate_vma_*() when the source vma is not an
anonymous vma and a new test case to exercise it.
This patch (of 2):
When migrating system memory to device private memory, if the source
address range is a valid VMA range and there is no memory or a zero page,
the source PFN array is marked as valid but with no PFN.
This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
page if the VMA is an anonymous range.
There is no point in telling the device driver to allocate device private
memory and then not migrate the page. Instead, mark the source PFN array
entries as not migrating to avoid this overhead.
Mike Kravetz [Wed, 12 Aug 2020 01:31:38 +0000 (18:31 -0700)]
hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.
Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.
Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.
To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.
Mike Kravetz [Wed, 12 Aug 2020 01:31:35 +0000 (18:31 -0700)]
hugetlbfs: prevent filesystem stacking of hugetlbfs
syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1]. Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking. There are no
know use cases for hugetlbfs stacking. Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.
Yafang Shao [Wed, 12 Aug 2020 01:31:32 +0000 (18:31 -0700)]
mm, oom: show process exiting information in __oom_kill_process()
When the OOM killer finds a victim and tryies to kill it, if the victim is
already exiting, the task mm will be NULL and no process will be killed.
But the dump_header() has been already executed, so it will be strange to
dump so much information without killing a process. We'd better show some
helpful information to indicate why this happens.
Michal Hocko [Wed, 12 Aug 2020 01:31:28 +0000 (18:31 -0700)]
doc, mm: clarify /proc/<pid>/oom_score value range
The exported value includes oom_score_adj so the range is no [0, 1000] as
described in the previous section but rather [0, 2000]. Mention that fact
explicitly.
Michal Hocko [Wed, 12 Aug 2020 01:31:25 +0000 (18:31 -0700)]
doc, mm: sync up oom_score_adj documentation
There are at least two notes in the oom section. The 3% discount for root
processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
CAP_SYS_ADMIN processes").
Likewise children of the selected oom victim are not sacrificed since bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic")
Yafang Shao [Wed, 12 Aug 2020 01:31:22 +0000 (18:31 -0700)]
mm, oom: make the calculation of oom badness more accurate
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.
We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.
The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.
To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.
Wenchao Hao [Wed, 12 Aug 2020 01:31:16 +0000 (18:31 -0700)]
mm/mempolicy.c: check parameters first in kernel_get_mempolicy
Previous implementatoin calls untagged_addr() before error check, while if
the error check failed and return EINVAL, the untagged_addr() call is just
useless work.
mm: mempolicy: fix kerneldoc of numa_map_to_online_node()
Fix W=1 compile warnings (invalid kerneldoc):
mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'
Nitin Gupta [Wed, 12 Aug 2020 01:31:07 +0000 (18:31 -0700)]
mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.
Nitin Gupta [Wed, 12 Aug 2020 01:31:04 +0000 (18:31 -0700)]
mm: fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.
Nitin Gupta [Wed, 12 Aug 2020 01:31:00 +0000 (18:31 -0700)]
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
Michal Koutný [Wed, 12 Aug 2020 01:30:57 +0000 (18:30 -0700)]
/proc/PID/smaps: consistent whitespace output format
The keys in smaps output are padded to fixed width with spaces. All
except for THPeligible that uses tabs (only since commit c06306696f83
("mm: thp: fix false negative of shmem vma's THP eligibility")).
Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)
Joonsoo Kim [Wed, 12 Aug 2020 01:30:54 +0000 (18:30 -0700)]
mm/vmscan: restore active/inactive ratio for anonymous LRU
Now that workingset detection is implemented for anonymous LRU, we don't
need large inactive list to allow detecting frequently accessed pages
before they are reclaimed, anymore. This effectively reverts the
temporary measure put in by commit "mm/vmscan: make active/inactive ratio
as 1:1 for anon lru".
Joonsoo Kim [Wed, 12 Aug 2020 01:30:50 +0000 (18:30 -0700)]
mm/swap: implement workingset detection for anonymous LRU
This patch implements workingset detection for anonymous LRU. All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.
Joonsoo Kim [Wed, 12 Aug 2020 01:30:47 +0000 (18:30 -0700)]
mm/swapcache: support to handle the shadow entries
Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache. This patch implements an infrastructure to store the shadow
entry in the swapcache.
Joonsoo Kim [Wed, 12 Aug 2020 01:30:43 +0000 (18:30 -0700)]
mm/workingset: prepare the workingset detection infrastructure for anon LRU
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.
Joonsoo Kim [Wed, 12 Aug 2020 01:30:40 +0000 (18:30 -0700)]
mm/vmscan: protect the workingset on anonymous LRU
In current implementation, newly created or swap-in anonymous page is
started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue. Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list. They are
promoted to active list if enough reference happens. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on active list would be protected.
Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.
Joonsoo Kim [Wed, 12 Aug 2020 01:30:36 +0000 (18:30 -0700)]
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
Patch series "workingset protection/detection on the anonymous LRU list", v7.
* PROBLEM
In current implementation, newly created or swap-in anonymous page is
started on the active list. Growing the active list results in
rebalancing active/inactive list so old pages on the active list are
demoted to the inactive list. Hence, hot page on the active list isn't
protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list and system can contain total 100
pages. Numbers denote the number of pages on active/inactive list (active
| inactive). (h) stands for hot pages and (uo) stands for used-once
pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, hot pages are swapped-out and it would cause swap-in later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. Like as the file LRU list, newly created or swap-in anonymous
page is started on the inactive list. Also, like as the file LRU list, if
enough reference happens, the page will be promoted. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
hot pages remains in the active list. :)
* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.
* SUBJECT
workingset detection
* PROBLEM
Later part of the patchset implements the workingset detection for the
anonymous LRU list. There is a corner case that workingset protection
could cause thrashing. If we can avoid thrashing by workingset detection,
we can get the better performance.
Following is an example of thrashing due to the workingset protection.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, this kind of workload cannot be promoted and
thrashing happens forever.
* SOLUTION
Therefore, this patchset implements workingset detection. All the
infrastructure for workingset detecion is already implemented, so there is
not much work to do. First, extend workingset detection code to deal with
the anonymous LRU list. Then, make swap cache handles the exceptional
value for the shadow entry. Lastly, install/retrieve the shadow value
into/from the swap cache and check the refault distance.
* EXPERIMENT
I made a test program to imitates above scenario and confirmed that
problem exists. Then, I checked that this patchset fixes it.
My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
the amount of the memory that the test program can use is about 280 MB.
This is because the system uses large ram-backed swap and large ramdisk to
capture the trace.
Since hot-1 memory (96MB) is on the active list, the inactive list can
contains roughly 190MB pages. hot-2 memory's re-access interval (96+128
MB) is more 190MB, so it cannot be promoted without workingset detection
and swap-in/out happens repeatedly. With this patchset, workingset
detection works and promotion happens. Therefore, swap-in/out occurs
less.
Here is the result. (average of 5 runs)
type swap-in swap-out
base 863240 989945
patch 681565 809273
As we can see, patched kernel do less swap-in/out.
* OVERALL TEST (ebizzy using modified random function)
ebizzy is the test program that main thread allocates lots of memory and
child threads access them randomly during the given times. Swap-in will
happen if allocated memory is larger than the system memory.
The random function that represents the zipf distribution is used to make
hot/cold memory. Hot/cold ratio is controlled by the parameter. If the
parameter is high, hot memory is accessed much larger than cold one. If
the parameter is low, the number of access on each memory would be
similar. I uses various parameters in order to show the effect of
patchset on various hot/cold ratio workload.
My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
ram swap.
Result format is as following.
param: 1-1024-0.1
- 1 (number of thread)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha,
0.1 works like as roughly uniform random,
1.3 works like as small portion of memory is hot and the others are cold)
pswpin: smaller is better
std: standard deviation
improvement: negative is better
As we can see, all the cases show improvement. Especially, test case with
zipf distribution 1.3 show more improvements. It means that if there is a
hot/cold tendency in anon pages, this patchset works better.
This patch (of 6):
Current implementation of LRU management for anonymous page has some
problems. Most important one is that it doesn't protect the workingset,
that is, pages on the active LRU list. Although, this problem will be
fixed in the following patchset, the preparation is required and this
patch does it.
What following patch does is to implement workingset protection. After
the following patchset, newly created or swap-in pages will start their
lifetime on the inactive list. If inactive list is too small, there is
not enough chance to be referenced and the page cannot become the
workingset.
In order to provide the newly anonymous or swap-in pages enough chance to
be referenced again, this patch makes active/inactive LRU ratio as 1:1.
This is just a temporary measure. Later patch in the series introduces
workingset detection for anonymous LRU that will be used to better decide
if pages should start on the active and inactive list. Afterwards this
patch is effectively reverted.
Muchun Song [Wed, 12 Aug 2020 01:30:32 +0000 (18:30 -0700)]
mm/hugetlb: add mempolicy check in the reservation routine
In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements. But we ignore the mempolicy of MPOL_BIND
case. If someone mmap hugetlb succeeds, but the subsequent memory
allocation may fail due to mempolicy restrictions and receives the SIGBUS
signal. This can be reproduced by the follow steps.
1) Compile the test case.
cd tools/testing/selftests/vm/
gcc map_hugetlb.c -o map_hugetlb
2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
system. Each node will pre-allocate one huge page.
echo 2 > /proc/sys/vm/nr_hugepages
3) Run test case(mmap 4MB). We receive the SIGBUS signal.
numactl --membind=3D0 ./map_hugetlb 4
With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".
Roman Gushchin [Wed, 12 Aug 2020 01:30:29 +0000 (18:30 -0700)]
kselftests: cgroup: add perpcu memory accounting test
Add a simple test to check the percpu memory accounting. The test creates
a cgroup tree with 1000 child cgroups and checks values of memory.current
and memory.stat::percpu.
Roman Gushchin [Wed, 12 Aug 2020 01:30:25 +0000 (18:30 -0700)]
mm: memcg: charge memcg percpu memory to the parent cgroup
Memory cgroups are using large chunks of percpu memory to store vmstat
data. Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.
Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simply ignore it. It actually breaks the isolation
between cgroups.
Let's account the consumed percpu memory to the parent cgroup.
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.
A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.
Roman Gushchin [Wed, 12 Aug 2020 01:30:17 +0000 (18:30 -0700)]
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
Roman Gushchin [Wed, 12 Aug 2020 01:30:14 +0000 (18:30 -0700)]
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().