* emailed patches from Andrew Morton <[email protected]>: (126 commits)
ipc: optimize semget/shmget/msgget for lots of keys
ipc/sem: play nicer with large nsops allocations
ipc/sem: drop sem_checkid helper
ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t
ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t
ipc: convert ipc_namespace.count from atomic_t to refcount_t
kcov: support compat processes
sh: defconfig: cleanup from old Kconfig options
mn10300: defconfig: cleanup from old Kconfig options
m32r: defconfig: cleanup from old Kconfig options
drivers/pps: use surrounding "if PPS" to remove numerous dependency checks
drivers/pps: aesthetic tweaks to PPS-related content
cpumask: make cpumask_next() out-of-line
kmod: move #ifdef CONFIG_MODULES wrapper to Makefile
kmod: split off umh headers into its own file
MAINTAINERS: clarify kmod is just a kernel module loader
kmod: split out umh code into its own file
test_kmod: flip INT checks to be consistent
test_kmod: remove paranoid UINT_MAX check on uint range processing
vfat: deduplicate hex2bin()
...
ipc: optimize semget/shmget/msgget for lots of keys
ipc_findkey() used to scan all objects to look for the wanted key. This
is slow when using a high number of keys. This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).
This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]
Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:
for (int i = 0, k=0x424242; i < KEYS; ++i)
semget(k++, 1, IPC_CREAT | 0600);
total total max single max single
KEYS without with call without call with
Replacing semop()'s kmalloc for kvmalloc was originally proposed by
Manfred on the premise that it can be called for large (than order-1)
sizes. For example, while Oracle recommends setting SEMOPM to a _minimum_
of 100, some distros[1] encourage the setting to be a factor of the amount
of db tasks (PROCESSES), which can get fishy for large systems (easily
going beyond 1000).
[1] An Example of Semaphore Settings
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html
So let's just convert this to kvmalloc, just like the rest of the
allocations we do in ipc. While the fallback vmalloc obviously involves
more overhead, this by far the uncommon path, and it's better for the user
than just erroring out with kmalloc.
Elena Reshetova [Fri, 8 Sep 2017 23:17:45 +0000 (16:17 -0700)]
ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.
Elena Reshetova [Fri, 8 Sep 2017 23:17:42 +0000 (16:17 -0700)]
ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.
Elena Reshetova [Fri, 8 Sep 2017 23:17:38 +0000 (16:17 -0700)]
ipc: convert ipc_namespace.count from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.
Support compat processes in KCOV by providing compat_ioctl callback.
Compat mode uses the same ioctl callback: we have 2 commands that do not
use the argument and 1 that already checks that the arg does not overflow
INT_MAX. This allows to use KCOV-guided fuzzing in compat processes.
mn10300: defconfig: cleanup from old Kconfig options
Remove old, dead Kconfig options (in order appearing in this commit):
- EXPERIMENTAL is gone since v3.9;
- INET_LRO: commit 7bbf3cae65b6 ("ipv4: Remove inet_lro library");
- MTD_PARTITIONS: commit 6a8a98b22b10 ("mtd: kill
CONFIG_MTD_PARTITIONS");
- MTD_CHAR: commit 660685d9d1b4 ("mtd: merge mtdchar module with
mtdcore");
- NETDEV_1000 and NETDEV_10000: commit f860b0522f65 ("drivers/net:
Kconfig and Makefile cleanup"); NET_ETHERNET should be replaced with
just ETHERNET but that is separate change;
- HID_SUPPORT: commit 1f41a6a99476 ("HID: Fix the generic Kconfig
options");
- RCU_CPU_STALL_DETECTOR: commit a00e0d714fbd ("rcu: Remove conditional
compilation for RCU CPU stall warnings");
drivers/pps: aesthetic tweaks to PPS-related content
Collection of aesthetic adjustments to various PPS-related files,
directories and Documentation, some quite minor just for the sake of
consistency, including:
* Updated example of pps device tree node (courtesy Rodolfo G.)
* "PPS-API" -> "PPS API"
* "pps_source_info_s" -> "pps_source_info"
* "ktimer driver" -> "pps-ktimer driver"
* "ppstest /dev/pps0" -> "ppstest /dev/pps1" to match example
* Add missing PPS-related entries to MAINTAINERS file
* Other trivialities
kmod: move #ifdef CONFIG_MODULES wrapper to Makefile
The entire file is now conditionally compiled only when CONFIG_MODULES is
enabled, and this this is a bool. Just move this conditional to the
Makefile as its easier to read this way.
Patch series "kmod: few code cleanups to split out umh code"
The usermode helper has a provenance from the old usb code which first
required a usermode helper. Eventually this was shoved into kmod.c and
the kernel's modprobe calls was converted over eventually to share the
same code. Over time the list of usermode helpers in the kernel has grown
-- so kmod is just but one user of the API.
This series is a simple logical cleanup which acknowledges the code
evolution of the usermode helper and shoves the UMH API into its own
dedicated file. This way users of the API can later just include umh.h
instead of kmod.h.
Note despite the diff state the first patch really is just a code shove,
no functional changes are done there. I did use git format-patch -M to
generate the patch, but in the end the split was not enough for git to
consider it a rename hence the large diffstat.
I've put this through 0-day and it gives me their machine compilation
blessings with all tests as OK.
This patch (of 4):
There's a slew of usermode helper users and kmod is just one of them.
Split out the usermode helper code into its own file to keep the logic and
focus split up.
Dan Carpenter [Fri, 8 Sep 2017 23:16:53 +0000 (16:16 -0700)]
test_kmod: remove paranoid UINT_MAX check on uint range processing
The UINT_MAX comparison is not needed because "max" is already an unsigned
int, and we expect developer C code max value input to have a sensible 0 -
UINT_MAX range. Note that if it so happens to be UINT_MAX + 1 it would
lead to an issue, but we expect the developer to know this.
autofs: use unsigned int/long instead of uint/ulong for ioctl args
The standard types unsigned int and unsigned long should be used for
.compat_ioctl. autofs is the only fs using uing/ulong for this, and these
are even the only uint/ulong in the entire autofs code.
Drop unneeded long cast in return value of autofs_dev_ioctl_compat().
It's already long.
This comment was correct when it was added in 8d7b48e0 ("autofs4: add
miscellaneous device for ioctls") in 2008, but not after 4e44b685 "Get rid
of path_lookup in autofs4" in 2009 which introduced find_autofs_mount().
Use a macro which defines misc-dev ioctl parameter size (excluding a path
beyond &path[0]) since it's been used to initialize and copy this
structure ever since it first appeared in 8d7b48e0 in 2008.
(or simply get rid of this if this is just unnecessary abstraction when
all it needs is sizeof(struct autofs_dev_ioctl))
These are not used by either kernel or userspace, although
AUTOFS_IOC_EXPIRE_DIRECT once seems to have been used by userspace in
around 2006-2008, which was technically just an alias of the existing
ioctl AUTOFS_IOC_EXPIRE_MULTI.
ioctls for autofs are already complicated enough that they could be
removed unless these are staying here to be able to compile userspace code
of certain period of time from a decade ago.
Edit: [email protected]
Yes, this is indeed very old and anything that still uses must
be updated becuase it will be using broken functionality.
End edit: [email protected]
Ian Kent [Fri, 8 Sep 2017 23:16:30 +0000 (16:16 -0700)]
autofs: make dev ioctl version and ismountpoint user accessible
Some of the autofs miscellaneous device ioctls need to be accessable to
user space applications without CAP_SYS_ADMIN to get information about
autofs mounts.
Ian Kent [Fri, 8 Sep 2017 23:16:27 +0000 (16:16 -0700)]
autofs: make disc device user accessible
The autofs miscellanous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in order
to be able to get information about autofs mounts.
The module checks capabilities so the miscelaneous device should be fine
with broad permissions.
Ian Kent [Fri, 8 Sep 2017 23:16:24 +0000 (16:16 -0700)]
autofs: fix AT_NO_AUTOMOUNT not being honored
The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call. But this flag is unconditionally cleared for all
stat family system calls except statx().
stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
clear).
AFAICT this change doesn't have any noticable side effects and may, in
some use cases (although I didn't see it in testing) prevent unnecessary
callbacks to the automount daemon.
It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process. But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.
This is the same semantic as the positive dentry case already handled.
Daniel Micay [Fri, 8 Sep 2017 23:16:20 +0000 (16:16 -0700)]
init/main.c: extract early boot entropy from the passed cmdline
Feed the boot command-line as to the /dev/random entropy pool
Existing Android bootloaders usually pass data which may not be known by
an external attacker on the kernel command-line. It may also be the
case on other embedded systems. Sample command-line from a Google Pixel
running CopperheadOS....
Among other things, it contains a value unique to the device
(androidboot.serialno=FA6CE0305299), unique to the OS builds for the
device variant (veritykeyid=id:dfcb9db0089e5b3b4090a592415c28e1cb4545ab)
and timings from the bootloader stages in milliseconds
(androidboot.boottime=1BLL:85,1BLE:669,2BLL:0,2BLE:1777,SW:6,KL:8136).
Laura Abbott [Fri, 8 Sep 2017 23:16:17 +0000 (16:16 -0700)]
init: move stack canary initialization after setup_arch
Patch series "Command line randomness", v3.
A series to add the kernel command line as a source of randomness.
This patch (of 2):
Stack canary intialization involves getting a random number. Getting this
random number may involve accessing caches or other architectural specific
features which are not available until after the architecture is setup.
Move the stack canary initialization later to accommodate this.
Jean Delvare [Fri, 8 Sep 2017 23:16:11 +0000 (16:16 -0700)]
checkpatch: add 6 missing types to --list-types
Unlike all other types, LONG_LINE, LONG_LINE_COMMENT and LONG_LINE_STRING
are passed to WARN() through a variable. This causes the parser in
list_types() to miss them and consequently they are not present in the
output of --list-types.
Additionally, types TYPO_SPELLING, FSF_MAILING_ADDRESS and AVOID_BUG are
passed with a variable level, causing the parser to miss them too.
So modify the regex to also catch these special cases.
Jean Delvare [Fri, 8 Sep 2017 23:16:07 +0000 (16:16 -0700)]
checkpatch: rename variables to avoid confusion
The variable name "$msg_type" is sometimes used to set the message type,
and sometimes used to set the message level. This works but is kind of
confusing. Use "$msg_level" in the latter case instead, to make the code
clearer.
lib/oid_registry.c: X.509: fix the buffer overflow in the utility function for OID string
The sprint_oid() utility function doesn't properly check the buffer size
that it causes that the warning in vsnprintf() be triggered. For
example on v4.1 kernel:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2357 at lib/vsprintf.c:1867 vsnprintf+0x5a7/0x5c0()
...
We can trigger this issue by injecting maliciously crafted x509 cert in
DER format. Just using hex editor to change the length of OID to over
the length of the SEQUENCE container. For example:
0:d=0 hl=4 l= 980 cons: SEQUENCE
4:d=1 hl=4 l= 700 cons: SEQUENCE
8:d=2 hl=2 l= 3 cons: cont [ 0 ]
10:d=3 hl=2 l= 1 prim: INTEGER :02
13:d=2 hl=2 l= 9 prim: INTEGER :9B47FAF791E7D1E3
24:d=2 hl=2 l= 13 cons: SEQUENCE
26:d=3 hl=2 l= 9 prim: OBJECT :sha256WithRSAEncryption
37:d=3 hl=2 l= 0 prim: NULL
39:d=2 hl=2 l= 121 cons: SEQUENCE
41:d=3 hl=2 l= 22 cons: SET
43:d=4 hl=2 l= 20 cons: SEQUENCE <=== the SEQ length is 20
45:d=5 hl=2 l= 3 prim: OBJECT :organizationName
<=== the original length is 3, change the length of OID to over the length of SEQUENCE
Pawel Wieczorkiewicz reported this problem and Takashi Iwai provided
patch to fix it by checking the bufsize in sprint_oid().
Dan Carpenter [Fri, 8 Sep 2017 23:15:48 +0000 (16:15 -0700)]
lib/string.c: check for kmalloc() failure
This is mostly to keep the number of static checker warnings down so we
can spot new bugs instead of them being drowned in noise. This function
doesn't return normal kernel error codes but instead the return value is
used to display exactly which memory failed. I chose -1 as hopefully
that's a helpful thing to print.
The macro is the compile-time analogue of bitmap_from_u64() with the same
purpose: convert the 64-bit number to the properly ordered pair of 32-bit
parts, suitable for filling the bitmap in 32-bit BE environment.
Use it to make test_bitmap_parselist() correct for 32-bit BE ABIs.
lib/bitmap.c: make bitmap_parselist() thread-safe and much faster
Current implementation of bitmap_parselist() uses a static variable to
save local state while setting bits in the bitmap. It is obviously wrong
if we assume execution in multiprocessor environment. Fortunately, it's
possible to rewrite this portion of code to avoid using the static
variable.
It is also possible to set bits in the mask per-range with bitmap_set(),
not per-bit, as it is implemented now, with set_bit(); which is way
faster.
The important side effect of this change is that setting bits in this
function from now is not per-bit atomic and less memory-ordered. This is
because set_bit() guarantees the order of memory accesses, while
bitmap_set() does not. I think that it is the advantage of the new
approach, because the bitmap_parselist() is intended to initialise bit
arrays, and user should protect the whole bitmap during initialisation if
needed. So protecting individual bits looks expensive and useless. Also,
other range-oriented functions in lib/bitmap.c don't worry much about
atomicity.
With all that, setting 2k bits in map with the pattern like 0-2047:128/256
becomes ~50 times faster after applying the patch in my testing
environment (arm64 hosted on qemu).
The second patch of the series adds the test for bitmap_parselist(). It's
not intended to cover all tricky cases, just to make sure that I didn't
screw up during rework.
Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works
correctly, at least that it can catch invalid calls to virt_to_phys()
against the non-linear kernel virtual address map.
For the same reasons we already cache the leftmost pointer, apply the same
optimization for rb_last() calls. Users must explicitly do this as
rb_root_cached only deals with the smallest node.
Such that we can optimize __mem_cgroup_largest_soft_limit_node(). The
only overhead is the extra footprint for the cached pointer, but this
should not be an issue for mem_cgroup_tree_per_node.
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for procfs.
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().
As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().
We can work with a single rb_root_cached root to test both cached and
non-cached rbtrees. In addition, also add a test to measure latencies
between rb_first and its fast counterpart.
rbtree: add some additional comments for rebalancing cases
While overall the code is very nicely commented, it might not be
immediately obvious from the diagrams what is going on. Add a very
brief summary of each case. Opposite cases where the node is the left
child are left untouched.
rbtree: optimize root-check during rebalancing loop
The only times the nil-parent (root node) condition is true is when the
node is the first in the tree, or after fixing rbtree rule #4 and the
case 1 rebalancing made the node the root. Such conditions do not apply
most of the time:
(i) The common case in an rbtree is to have more than a single node,
so this is only true for the first rb_insert().
(ii) While there is a chance only one first rotation is needed, cases
where the node's uncle is black (cases 2,3) are more common as we can
have the following scenarios during the rotation looping:
case1 only, case1+1, case2+3, case1+2+3, case3 only, etc.
This patch, therefore, adds an unlikely() optimization to this
conditional. When profiling with CONFIG_PROFILE_ANNOTATED_BRANCHES, a
kernel build shows that the incorrect rate is less than 15%, and for
workloads that involve insert mostly trees overtime tend to have less
than 2% incorrect rate.
Patch series "rbtree: Cache leftmost node internally", v4.
A series to extending rbtrees to internally cache the leftmost node such
that we can have fast overlap check optimization for all interval tree
users[1]. The benefits of this series are that:
(i) Unify users that do internal leftmost node caching.
(ii) Optimize all interval tree users.
(iii) Convert at least two new users (epoll and procfs) to the new interface.
This patch (of 16):
Red-black tree semantics imply that nodes with smaller or greater (or
equal for duplicates) keys always be to the left and right,
respectively. For the kernel this is extremely evident when considering
our rb_first() semantics. Enabling lookups for the smallest node in the
tree in O(1) can save a good chunk of cycles in not having to walk down
the tree each time. To this end there are a few core users that
explicitly do this, such as the scheduler and rtmutexes. There is also
the desire for interval trees to have this optimization allowing faster
overlap checking.
This patch introduces a new 'struct rb_root_cached' which is just the
root with a cached pointer to the leftmost node. The reason why the
regular rb_root was not extended instead of adding a new structure was
that this allows the user to have the choice between memory footprint
and actual tree performance. The new wrappers on top of the regular
rb_root calls are:
- rb_first_cached(cached_root) -- which is a fast replacement
for rb_first.
- rb_insert_color_cached(node, cached_root, new)
- rb_erase_cached(node, cached_root)
In addition, augmented cached interfaces are also added for basic
insertion and deletion operations; which becomes important for the
interval tree changes.
With the exception of the inserts, which adds a bool for updating the
new leftmost, the interfaces are kept the same. To this end, porting rb
users to the cached version becomes really trivial, and keeping current
rbtree semantics for users that don't care about the optimization
requires zero overhead.
GENMASK(_ULL) performs a left-shift of ~0UL(L), which technically
results in an integer overflow. clang raises a warning if the overflow
occurs in a preprocessor expression. Clear the low-order bits through a
substraction instead of the left-shift to avoid the overflow.
include: warn for inconsistent endian config definition
We have seen some generic code use config parameter CONFIG_CPU_BIG_ENDIAN
to decide the endianness.
Here are the few examples.
include/asm-generic/qrwlock.h
drivers/of/base.c
drivers/of/fdt.c
drivers/tty/serial/earlycon.c
drivers/tty/serial/serial_core.c
Display warning if CPU_BIG_ENDIAN is not defined on big endian
architecture and also warn if it defined on little endian architectures.
Here is our original discussion
https://lkml.org/lkml/2017/5/24/620
arch/microblaze: add choice for endianness and update Makefile
microblaze architectures can be configured for either little or big endian
formats. Add a choice option for the user to select the correct endian
format(default to big endian).
Also update the Makefile so toolchain can compile for the format it is
configured for.
arch: define CPU_BIG_ENDIAN for all fixed big endian archs
Patch series "Define CPU_BIG_ENDIAN or warn for inconsistencies", v3.
While working on enabling queued rwlock on SPARC, found this following
code in include/asm-generic/qrwlock.h which uses CONFIG_CPU_BIG_ENDIAN to
clear a byte.
Problem is many of the fixed big endian architectures don't define
CPU_BIG_ENDIAN and clears the wrong byte.
Define CPU_BIG_ENDIAN for all the fixed big endian architecture to fix it.
Also found few more references of this config parameter in
drivers/of/base.c
drivers/of/fdt.c
drivers/tty/serial/earlycon.c
drivers/tty/serial/serial_core.c
Be aware that this may cause regressions if someone has worked-around
problems in the above code already. Remove the work-around.
Here is our original discussion
https://lkml.org/lkml/2017/5/24/620
Matthew Wilcox [Fri, 8 Sep 2017 23:14:15 +0000 (16:14 -0700)]
vga: optimise console scrolling
Where possible, call memset16(), memmove() or memcpy() instead of using
open-coded loops. I don't like the calling convention that uses a byte
count instead of a count of u16s, but it's a little late to change that.
Reduces code size of fbcon.o by almost 400 bytes on my laptop build.
Matthew Wilcox [Fri, 8 Sep 2017 23:14:04 +0000 (16:14 -0700)]
alpha: add support for memset16
Alpha already had an optimised fill-memory-with-16-bit-quantity
assembler routine called memsetw(). It has a slightly different calling
convention from memset16() in that it takes a byte count, not a count of
words. That's the same convention used by ARM's __memset routines, so
rename Alpha's routine to match and add a memset16() wrapper around it.
Then convert Alpha's scr_memsetw() to call memset16() instead of
memsetw().
Matthew Wilcox [Fri, 8 Sep 2017 23:13:56 +0000 (16:13 -0700)]
x86: implement memset16, memset32 & memset64
These are single instructions on x86. There's no 64-bit instruction for
x86-32, but we don't yet have any user for memset64() on 32-bit
architectures, so don't bother to implement it.
Matthew Wilcox [Fri, 8 Sep 2017 23:13:48 +0000 (16:13 -0700)]
lib/string.c: add multibyte memset functions
Patch series "Multibyte memset variations", v4.
A relatively common idiom we're missing is a function to fill an area of
memory with a pattern which is larger than a single byte. I first
noticed this with a zram patch which wanted to fill a page with an
'unsigned long' value. There turn out to be quite a few places in the
kernel which can benefit from using an optimised function rather than a
loop; sometimes text size, sometimes speed, and sometimes both. The
optimised PowerPC version (not included here) improves performance by
about 30% on POWER8 on just the raw memset_l().
Most of the extra lines of code come from the three testcases I added.
This patch (of 8):
memset16(), memset32() and memset64() are like memset(), but allow the
caller to fill the destination with a value larger than a single byte.
memset_l() and memset_p() allow the caller to use unsigned long and
pointer values respectively.
David Rientjes [Fri, 8 Sep 2017 23:13:41 +0000 (16:13 -0700)]
fs, proc: unconditional cond_resched when reading smaps
If there are large numbers of hugepages to iterate while reading
/proc/pid/smaps, the page walk never does cond_resched(). On archs
without split pmd locks, there can be significant and observable
contention on mm->page_table_lock which cause lengthy delays without
rescheduling.
Always reschedule in smaps_pte_range() if necessary since the pagewalk
iteration can be expensive.
Michal Hocko [Fri, 8 Sep 2017 23:13:35 +0000 (16:13 -0700)]
fs, proc: remove priv argument from is_stack
Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks")
removed the priv parameter user in is_stack so the argument is
redundant. Drop it.
mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced()
VMA and its address bounds checks are too late in this function. They
must have been verified earlier in the page fault sequence. Hence just
remove them.
Darrick J. Wong [Fri, 8 Sep 2017 23:13:25 +0000 (16:13 -0700)]
mm: kvfree the swap cluster info if the swap file is unsatisfactory
If initializing a small swap file fails because the swap file has a
problem (holes, etc.) then we need to free the cluster info as part of
cleanup. Unfortunately a previous patch changed the code to use kvzalloc
but did not change all the vfree calls to use kvfree.
mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt
We are by error initializing alloc_flags before gfp_allowed_mask is
applied. This could cause problems after pm_restrict_gfp_mask() is called
during suspend operation. Apply gfp_allowed_mask before initializing
alloc_flags so that the first allocation attempt uses correct flags.
KCMP's KCMP_EPOLL_TFD mode merged in commit 0791e3644e5ef2 ("kcmp: add
KCMP_EPOLL_TFD mode to compare epoll target files") we've had no selftest
for it yet (except in criu development list). Thus add it.
Michal Hocko [Fri, 8 Sep 2017 23:13:15 +0000 (16:13 -0700)]
mm/sparse.c: fix typo in online_mem_sections
online_mem_sections() accidentally marks online only the first section
in the given range. This is a typo which hasn't been noticed because I
haven't tested large 2GB blocks previously. All users of
pfn_to_online_page would get confused on the the rest of the pfn range
in the block.
All we need to fix this is to use iterator (pfn) rather than start_pfn.
Seen while reading the code, in handle_mm_fault(), in the case
arch_vma_access_permitted() is failing the call to
mem_cgroup_oom_disable() is not made.
To fix that, move the call to mem_cgroup_oom_enable() after calling
arch_vma_access_permitted() as it should not have entered the memcg OOM.
To address this issue, the existing per-cpu stock infrastructure can be
used.
refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
charge to a per-cpu stock instead of calling atomic
page_counter_uncharge().
To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
will explicitly drain the cached charge, if the cached value exceeds
CHARGE_BATCH.
mm: fadvise: avoid fadvise for fs without backing device
The fadvise() manpage is silent on fadvise()'s effect on memory-based
filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
sysfs, kernfs). The current implementaion of fadvise is mostly a noop
for such filesystems except for FADV_DONTNEED which will trigger
expensive remote LRU cache draining. This patch makes the noop of
fadvise() on such file systems very explicit.
However this change has two side effects for ramfs and one for tmpfs.
First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
pages of ramfs (allocated through read, readahead & read fault) and
tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could
create such clean zero'ed pages for ramfs. This change removes those
possibilities.
One of our generic libraries does fadvise(FADV_DONTNEED). Recently we
observed high latency in fadvise() and noticed that the users have
started using tmpfs files and the latency was due to expensive remote
LRU cache draining. For normal tmpfs files (have data written on them),
fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
draining.
zs_stat_inc/dec/get() uses enum zs_stat_type for the stat type, however
some callers pass an enum fullness_group value. Change the type to int to
reflect the actual use of the functions and get rid of 'enum-conversion'
warnings
Joonsoo Kim [Fri, 8 Sep 2017 23:12:59 +0000 (16:12 -0700)]
mm/mlock.c: use page_zone() instead of page_zone_id()
page_zone_id() is a specialized function to compare the zone for the pages
that are within the section range. If the section of the pages are
different, page_zone_id() can be different even if their zone is the same.
This wrong usage doesn't cause any actual problem since
__munlock_pagevec_fill() would be called again with failed index.
However, it's better to use more appropriate function here.
Kemi Wang [Fri, 8 Sep 2017 23:12:55 +0000 (16:12 -0700)]
mm: consider the number in local CPUs when reading NUMA stats
To avoid deviation, the per cpu number of NUMA stats in
vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.
Since NUMA stats does not be read by users frequently, and kernel does not
need it to make a decision, it will not be a problem to make the readers
more expensive.
Kemi Wang [Fri, 8 Sep 2017 23:12:52 +0000 (16:12 -0700)]
mm: update NUMA counter threshold size
There is significant overhead in cache bouncing caused by zone counters
(NUMA associated counters) update in parallel in multi-threaded page
allocation (suggested by Dave Hansen).
This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2,
as a small threshold greatly increases the update frequency of the global
counter from local per cpu counter(suggested by Ying Huang).
The rationality is that these statistics counters don't affect the
kernel's decision, unlike other VM counters, so it's not a problem to use
a large threshold.
With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per
single page allocation and reclaim on Jesper's page_bench03 benchmark.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
bench
Kemi Wang [Fri, 8 Sep 2017 23:12:48 +0000 (16:12 -0700)]
mm: change the call sites of numa statistics items
Patch series "Separate NUMA statistics from zone statistics", v2.
Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).
A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang). The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.
With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark. Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).
I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.
userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS
This is an enhancement to avoid a non cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity completed.
The UFFDIO_UNREGISTER would serialize against the handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect the uffd was closed and so the regular page fault paths
should takeover.
mm/hmm: avoid bloating arch that do not make use of HMM
This moves all new code including new page migration helper behind kernel
Kconfig option so that there is no codee bloat for arch or user that do
not want to use HMM or any of its associated features.
mm/hmm: add new helper to hotplug CDM memory region
Unlike unaddressable memory, coherent device memory has a real resource
associated with it on the system (as CPU can address it). Add a new
helper to hotplug such memory within the HMM framework.