Git Repo - linux.git/log

lib/vsprintf: refactor duplicate code to special_hex_number()

special_hex_number() is a helper to print a fixed size type in a hex
format with '0x' prefix, zero padding, and small letters.  In the module
we have already several copies of such code.  Consolidate them under
special_hex_number() helper.

There are couple of differences though.

It seems nobody cared about the output in case of CONFIG_KALLSYMS=n,
when printing symbol address, because the asked field width is not
enough to care last 2 characters in the string represantation of the
pointer.  Fixed here.

The %pNF specifier used to be allowed with a specific field width,
though there is neither any user of it nor mention the possibility in
the documentation.

Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk-formats.txt: remove unimplemented %pT

%pT for task->comm has been proposed (several times, I think), but is
not actually implemented. Remove it from printk-formats.txt and add it
back if/when it gets implemented.

Signed-off-by: Rasmus Villemoes <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: help pr_debug and pr_devel to optimize out arguments

Currently, pr_debug and pr_devel will not elide function call arguments
appearing in calls to the no_printk for these macros. This is because
all side effects must be honored before proceeding to the 0-value
assignment in no_printk.

The behavior is contrary to documentation found in the CodingStyle and
the header file where these functions are declared.

This patch corrects that behavior by shunting out the call to no_printk
completely. The format string is still checked by gcc for correctness,
but no code seems to be emitted in common cases.

[[email protected]: remove braces, per Joe]
Fixes: 5264f2f75d86 ("include/linux/printk.h: use and neaten no_printk")
Signed-off-by: Aaron Conole <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Jason Baron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: test dentry printing

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: add test for large bitmaps

Following "lib/vsprintf.c: expand field_width to 24 bits", let's add a
test to see that we now actually support bitmaps with 65536 bits.

Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: account for kvasprintf tests

These should also count as performed tests.

Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: add a few number() tests

This adds a few tests to test_number, one of which serves to document
another deviation from POSIX/C99 (printing 0 with an explicit precision
of 0).

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: test precision quirks

The kernel's printf doesn't follow the standards in a few corner cases
(which are probably mostly irrelevant). Add tests that document the
current behaviour.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: check for out-of-bound writes

Add a few padding bytes on either side of the test buffer, and check
that these (and the part of the buffer not used) are untouched by
vsnprintf.

Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_printf.c: don't BUG

BUG is a completely unnecessarily big hammer, and we're more likely to
get the internal bug reported if we just pr_err() and ensure the test
suite fails.

Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/kasprintf.c: add sanity check to kvasprintf

kasprintf relies on being able to replay the formatting and getting the
same result (in particular, the same length). This will almost always
work, but it is possible that the object pointed to by a %s or %p
argument changed under us (so we might get truncated output). Add a
somewhat paranoid sanity check and let's see if it ever triggers.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: warn about too large precisions and field widths

The field width is overloaded to pass some extra information for some %p
extensions (e.g.  #bits for %pb).  But we might silently truncate the
passed value when we stash it in struct printf_spec (see e.g.
"lib/vsprintf.c: expand field_width to 24 bits").  Hopefully 23 value
bits should now be enough for everybody, but if not, let's make some
noise.

Do the same for the precision.  In both cases, clamping seems more
sensible than truncating.  While, according to POSIX, "A negative
precision is taken as if the precision were omitted.", the kernel's
printf has always treated that case as if the precision was 0, so we use
that as lower bound.  For the field width, the smallest representable
value is actually -(1<<23), but a negative field width means 'set the
LEFT flag and use the absolute value', so we want the absolute value to
fit.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: help gcc make number() smaller

One consequence of the reorganization of struct printf_spec to make
field_width 24 bits was that number() gained about 180 bytes. Since
spec is never passed to other functions, we can help gcc make number()
lose most of that extra weight by using local variables for the field
width and precision.

Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: expand field_width to 24 bits

Maurizio Lombardi reported a problem [1] with the %pb extension: It
doesn't work for sufficiently large bitmaps, since the size is stashed
in the field_width field of the struct printf_spec, which is currently
an s16.  Concretely, this manifested itself in
/sys/bus/pseudo/drivers/scsi_debug/map being empty, since the bitmap
printer got a size of 0, which is the 16 bit truncation of the actual
bitmap size.

We do want to keep struct printf_spec at 8 bytes so that it can cheaply
be passed by value.  The qualifier field is only used for internal
bookkeeping in format_decode, so we might as well use a local variable
for that.  This gives us an additional 8 bits, which we can then use for
the field width.

To stay in 8 bytes, we need to do a little rearranging and make the type
member a bitfield as well.  For consistency, change all the members to
bit fields.  gcc doesn't generate much worse code with these changes (in
fact, bloat-o-meter says we save 300 bytes - which I think is a little
surprising).

I didn't find a BUILD_BUG/compiletime_assertion/... which would work
outside function context, so for now I just open-coded it.

[1] http://thread.gmane.org/gmane.linux.kernel/2034835

[[email protected]: avoid open-coded BUILD_BUG_ON]
Signed-off-by: Rasmus Villemoes <[email protected]>
Reported-by: Maurizio Lombardi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: eliminate potential race in string()

If the string corresponding to a %s specifier can change under us, we
might end up copying a \0 byte to the output buffer. There might be
callers who expect the output buffer to contain a genuine C string whose
length is exactly the snprintf return value (assuming truncation hasn't
happened or has been checked for).

We can avoid this by only passing over the source string once, stopping
the first time we meet a nul byte (or when we reach the given
precision), and then letting widen_string() handle left/right space
padding. As a small bonus, this code reuse also makes the generated
code slightly smaller.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: move string() below widen_string()

This is pure code movement, making sure the widen_string() helper is
defined before the string() function.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/vsprintf.c: pull out padding code from dentry_name()

Pull out the logic in dentry_name() which handles field width space
padding, in preparation for reusing it from string(). Rename the
widen() helper to move_right(), since it is used for handling the
!(flags & LEFT) case.

Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: do cond_resched() between lines while outputting to consoles

@console_may_schedule tracks whether console_sem was acquired through
lock or trylock.  If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched().  This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.

However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines.  Also, only a few drivers call
console_conditional_schedule() to begin with.  This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.

If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.

Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Calvin Owens <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: only unregister boot consoles when necessary

Boot consoles are typically replaced by proper consoles during the boot
process.  This can be problematic if the boot console data is part of
the init section that is reclaimed late during boot.  If the proper
console does not register before this point in time, the boot console
will need to be removed (so that the freed memory is not accessed),
leaving the system without output for some time.

There are various reasons why the proper console may not register early
enough, such as deferred probe or the driver being a loadable module.
If that happens, there is some amount of time where no console messages
are visible to the user, which in turn can mean that they won't see
crashes or other potentially useful information.

To avoid this situation, only remove the boot console when it resides in
the init section.  Code exists to replace the boot console by the proper
console when it is registered, keeping a seamless transition between the
boot and proper consoles.

Signed-off-by: Thierry Reding <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm/sections: add helpers to check for section data

Add a helper to check if an object (given an address and a size) is part
of a section (given beginning and end addresses). For convenience, also
provide a helper that performs this check for __init data using the
__init_begin and __init_end limits.

Signed-off-by: Thierry Reding <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/stop_machine.c: remove CONFIG_SMP dependencies

stop_machine.o is only built if CONFIG_SMP=y, so this ifdef always
evaluates to true.

[[email protected]: remove now-unneeded ifdef]
Reported-by: Valentin Rothberg <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uselib: default depending if libc5 was used

uselib hasn't been used since libc5; glibc does not use it.  Deprecate
uselib a bit more, by making the default y only if libc5 was widely used
on the plaform.

This makes arm64 kernel built with defconfig slightly smaller

bloat-o-meter:
  add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
  function                                     old     new   delta
  kernel_config_data                         18164   18162      -2
  uselib_flags                                  20       -     -20
  padzero                                      216     192     -24
  sys_uselib                                   380       -    -380
  load_elf_library                             964       -    -964

Signed-off-by: Riku Voipio <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

err.h: add (missing) unlikely() to IS_ERR_OR_NULL()

IS_ERR_VALUE() already contains it and so we need to add this only to
the !ptr check. That will allow users of IS_ERR_OR_NULL(), to not add
this compiler flag.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Kconfig: remove HAVE_LATENCYTOP_SUPPORT

As illustrated by commit a3afe70b83fd ("[S390] latencytop s390
support."), HAVE_LATENCYTOP_SUPPORT is defined by an architecture to
advertise an implementation of save_stack_trace_tsk.

However, as of 9212ddb5eada ("stacktrace: provide save_stack_trace_tsk()
weak alias") a dummy implementation is provided if STACKTRACE=y. Given
that LATENCYTOP already depends on STACKTRACE_SUPPORT and selects
STACKTRACE, we can remove HAVE_LATENCYTOP_SUPPORT altogether.

Signed-off-by: Will Deacon <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Helge Deller <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/kdev_t.h: remove new_valid_dev()

As all new_valid_dev() checks have been removed it's time to drop
new_valid_dev() itself.

No functional change.

Signed-off-by: Yaowei Bai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/stat.c: drop the last new_valid_dev check

New_valid_dev() always returns true, so that's unnecessary to perform
new_valid_dev() checks in some filesystems. Most checks of
new_valid_dev() have been removed so let's drop this last one and then
we can remove new_valid_dev() from the source code.

No functional change.

Signed-off-by: Yaowei Bai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/kernel.h: change abs() macro so it uses consistent return type

Rewrite abs() so that its return type does not depend on the
architecture and no unexpected type conversion happen inside of it.  The
only conversion is from unsigned to signed type.  char is left as a
return type but treated as a signed type regradless of it's actual
signedness.

With the old version, int arguments were promoted to long and depending
on architecture a long argument might result in s64 or long return type
(which may or may not be the same).

This came after some back and forth with Nicolas.  The current macro has
different return type (for the same input type) depending on
architecture which might be midly iritating.

An alternative version would promote to int like so:

#define abs(x) __abs_choose_expr(x, long long, \
__abs_choose_expr(x, long, \
__builtin_choose_expr( \
sizeof(x) <= sizeof(int), \
({ int __x = (x); __x<0?-__x:__x; }), \
((void)0))))

I have no preference but imagine Linus might.  :] Nicolas argument against
is that promoting to int causes iconsistent behaviour:

int main(void) {
unsigned short a = 0, b = 1, c = a - b;
unsigned short d = abs(a - b);
unsigned short e = abs(c);
printf("%u %u\n", d, e);  // prints: 1 65535
}

Then again, no sane person expects consistent behaviour from C integer
arithmetic.  ;)

Note:

  __builtin_types_compatible_p(unsigned char, char) is always false, and
  __builtin_types_compatible_p(signed char, char) is also always false.

Signed-off-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Wey-Yi Guy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/poison.h: use POISON_POINTER_DELTA for poison pointers

TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which
should point to nowhere. Redefine them using POISON_POINTER_DELTA
arithmetics to make sure they really point to non-mappable area declared
by the target architecture.

Signed-off-by: Vasily Kulikov <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Solar Designer <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

parisc: Protect huge page pte changes with spinlocks

PA-RISC doesn't have atomic instructions to modify page table entries, so it
takes spinlock in the TLB handler and modifies the page table entry
non-atomically. If you modify the page table entry without the spinlock, you
may race with TLB handler on another CPU and your modification may be lost.
Protect against that with usage of purge_tlb_start() and purge_tlb_end() which
handles the TLB spinlock.

Signed-off-by: Helge Deller <[email protected]>
Cc: [email protected] # v4.4

batman-adv: Drop immediate orig_node free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_node_free_ref.

Fixes: 72822225bd41 ("batman-adv: Fix rcu_barrier() miss due to double call_rcu() in TT code")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Drop immediate batadv_hard_iface free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_hardif_free_ref.

Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Drop immediate neigh_ifinfo free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_ifinfo_free_ref.

Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Drop immediate batadv_hardif_neigh_node free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_hardif_neigh_free_ref.

Fixes: cef63419f7db ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Drop immediate batadv_neigh_node free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_node_free_ref.

Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Drop immediate batadv_orig_ifinfo free function

It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.

But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_ifinfo_free_ref.

Fixes: 7351a4822d42 ("batman-adv: split out router from orig_node")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Avoid recursive call_rcu for batadv_nc_node

The batadv_nc_node_free_ref function uses call_rcu to delay the free of the
batadv_nc_node object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_nc_node also contains a
reference to orig_node which must be removed.

The reference drop of orig_node was done in the call_rcu function
batadv_nc_node_free_rcu but should actually be done in the
batadv_nc_node_release function to avoid nested call_rcus. This is
important because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will
not detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.

Fixes: d56b1705e28c ("batman-adv: network coding - detect coding nodes and remove these after timeout")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

batman-adv: Avoid recursive call_rcu for batadv_bla_claim

The batadv_claim_free_ref function uses call_rcu to delay the free of the
batadv_bla_claim object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_bla_claim also contains a
reference to backbone_gw which must be removed.

The reference drop of backbone_gw was done in the call_rcu function
batadv_claim_free_rcu but should actually be done in the
batadv_claim_release function to avoid nested call_rcus. This is important
because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will not
detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.

Fixes: 23721387c409 ("batman-adv: add basic bridge loop avoidance code")
Signed-off-by: Sven Eckelmann <[email protected]>
Acked-by: Simon Wunderlich <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>

bna: fix Rx data corruption with VLAN stripping enabled and MTU > 4096

The multi-buffer Rx mode implemented in the past introduced
a regression that causes a data corruption for received VLAN
traffic when VLAN tag stripping is enabled. This mode is supported
only be newer chipsets (1860) and is enabled when MTU > 4096.

When this mode is enabled Rx queue contains buffers with fixed size
2048 bytes. Any incoming packet larger than 2048 is divided into
multiple buffers that are attached as skb frags in polling routine.

The driver assumes that all buffers associated with a packet except
the last one is fully used (e.g. packet with size 5000 are divided
into 3 buffers 2048 + 2048 + 904 bytes) and ignores true size reported
in completions. This assumption is usually true but not when VLAN
packet is received and VLAN tag stripping is enabled. In this case
the first buffer is 2044 bytes long but as the driver always assumes
2048 bytes then 4 extra random bytes are included between the first
and the second frag. Additionally the driver sets checksum as correct
so the packet is properly processed by the core.

The driver needs to check the size of used space in each Rx buffer
reported by FW and not blindly use the fixed value.

Cc: Rasesh Mody <[email protected]>
Signed-off-by: Ivan Vecera <[email protected]>
Reviewed-by: Rasesh Mody <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'clk-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk framework updates from Michael Turquette:
"The clk framework and driver changes for 4.5 look pretty typical.  The
  bulk of the changes are to clk controller drivers, though some
  improvements to the core and some re-usable blocks/templates also
  received some love.

  In this past cycle the clk maintainers developed a good workflow for
  handling the common case of patch submissions containing a new
  drivers, new shared Device Tree header and a new Device Tree binding
  description.  This requires coordination with the Device Tree
  maintainers and with the architecture maintainers (typically the
  arm-soc tree in our case).

  This explains the increase in changes to include/dt-bindings/... and
  to Documentation/devicetree/bindings/clock/... coming from the clk
  tree.  The same commits can be expected to come through those trees on
  occasion, through the use of shared, immutable branches"

* tag 'clk-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (125 commits)
  clk: remove duplicated COMMON_CLK_NXP record from clk/Kconfig
  clk: fix clk-gpio.c with optional clock= DT property
  clk: rockchip: fix section mismatches with new child-clocks
  clk: gpio: handle error codes for of_clk_get_parent_count()
  clk: gpio: fix memory leak
  clk: shmobile: r8a7795: Add SATA0 clock
  clk: bcm2835: Add PWM clock support
  clk: bcm2835: Support for clock parent selection
  clk: bcm2835: add a round up ability to the clock divisor
  clk: lpc32xx: add common clock framework driver
  clk: lpc18xx: add NXP specific COMMON_CLK_NXP configuration symbol
  dt-bindings: clock: add NXP LPC32xx clock list for consumers
  dt-bindings: clock: add description of LPC32xx USB clock controller
  dt-bindings: clock: add description of LPC32xx clock controller
  clk: rockchip: rk3036: include downstream muxes into fractional dividers
  clk: add flag for clocks that need to be enabled on rate changes
  clk: rockchip: Allow the RK3288 SPDIF clocks to change their parent
  clk: rockchip: include downstream muxes into fractional dividers
  clk: rockchip: handle mux dependency of fractional dividers
  clk: bcm2835: Add a driver for the auxiliary peripheral clock gates.
  ...

Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging

Pull dmi updates from Jean Delvare.

* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  firmware: dmi_scan: Save SMBIOS Type 9 System Slots
  firmware: dmi_scan: Fix dmi_find_device description
  firmware: dmi_scan: Clarify dmi_save_extended_devices
  firmware: dmi_scan: Optimize dmi_save_extended_devices

memcg: only free spare array when readers are done

A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Signed-off-by: Martijn Coenen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: soft-offline: exit with failure for non anonymous thp

Currently memory_failure() doesn't handle non anonymous thp case,
because we can hardly expect the error handling to be successful, and it
can just hit some corner case which results in BUG_ON or something
severe like that. This is also the case for soft offline code, so let's
make it in the same way.

Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
but it's unnecessary because get_any_page() is already called when
running on this code, which takes a refcount of the target page
regardress of the flag. So this patch also removes it.

[[email protected]: fix build]
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: soft-offline: clean up soft_offline_page()

soft_offline_page() has some deeply indented code, that's the sign of
demand for cleanup. So let's do this. No functionality change.

Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlbfs: unmap pages if page fault raced with hole punch

Page faults can race with fallocate hole punch.  If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole.  This is not the desired behavior.  The race is
difficult to detect in user level code as even in the non-race case, a
page within the hole could be faulted back in before fallocate returns.
If userfaultfd is expanded to support hugetlbfs in the future, this race
will be easier to observe.

If this race is detected and a page is mapped, the remove operation
(remove_inode_hugepages) will unmap the page before removing.  The unmap
within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
so that no other faults will be processed until the page is removed.

The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.

[[email protected]: move hugetlb_vmdelete_list()]
Signed-off-by: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list()

Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine.  The
argument end is of type pgoff_t.  It was being converted to a vaddr
offset and passed to unmap_hugepage_range.  However, end was also being
used as an argument to the vma_interval_tree_foreach controlling loop.
In addition, the conversion of end to vaddr offset was incorrect.

hugetlb_vmtruncate_list is called as part of a file truncate or
fallocate hole punch operation.

When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped.  This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings.  The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT).  In this case, the pages in B would not be unmapped.  If
pages are not properly unmapped during truncate, the following BUG is
hit:

kernel BUG at fs/hugetlbfs/inode.c:428!

In the fallocate hole punch case, this bug could prevent pages from
being unmapped as in the truncate case.  However, for hole punch the
result is that unmapped pages will not be removed during the operation.
For hole punch, it is also possible that more pages than desired will be
unmapped.  This unnecessary unmapping will cause page faults to
reestablish the mappings on subsequent page access.

Fixes: 1bfad99ab (" hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")Reported-by: Hillf Danton <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]> [4.3]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: make swapoff more robust against soft dirty

Both s390 and powerpc have hit the issue of swapoff hanging, when
CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
quite as x86_64 had them. I think it would be much clearer if
HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
determine whether the MEM_SOFT_DIRTY option should be offered, and the
actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.

But won't embark on that change myself: instead make swapoff more
robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That being a no-op,
whether the bit in question is defined as 0 or the asm-generic fallback
is used, unless soft dirty is fully turned on.

Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().

Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix locking order in mm_take_all_locks()

Dmitry Vyukov has reported[1] possible deadlock (triggered by his
syzkaller fuzzer):

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&hugetlbfs_i_mmap_rwsem_key);
                               lock(&mapping->i_mmap_rwsem);
                               lock(&hugetlbfs_i_mmap_rwsem_key);
  lock(&mapping->i_mmap_rwsem);

Both traces points to mm_take_all_locks() as a source of the problem.
It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mapping) vs.  i_mmap_rwsem.

huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and allocator can take i_mmap_rwsem if it hit reclaim.  So we need to
take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
rest of VMAs.

The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: mempolicy: skip non-migratable VMAs when setting MPOL_MF_LAZY

MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
VM_HUGETLB VMAs, and avoid useless overhead of minor faults.

Signed-off-by: Liang Chen <[email protected]>
Signed-off-by: Gavin Guo <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc.c: remove unused struct zone *z variable

Remove unused struct zone *z variable which appeared in 86051ca5eaf5
("mm: fix usemap initialization").

Signed-off-by: Alexander Kuleshov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mlock.c: change can_do_mlock return value type to boolean

Since can_do_mlock only return 1 or 0, so make it boolean.

No functional change.

[[email protected]: update declaration in mm.h]
Signed-off-by: Wang Xiaoqiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment

Just cleanup, no functional change.

Signed-off-by: Wang Xiaoqiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page()

In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_isolation: do some cleanup in "undo_isolate_page_range"

Use "IS_ALIGNED" to judge the alignment, rather than directly judging.

Signed-off-by: Wang Xiaoqiang <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memblock: fix section mismatch

allmodconfig produces following warning for me:

  WARNING: vmlinux.o(.text.unlikely+0x10314): Section mismatch in reference from the function movable_node_is_enabled() to the variable .meminit.data:movable_node_enabled
  The function movable_node_is_enabled() references
  the variable __meminitdata movable_node_enabled.
  This is often because movable_node_is_enabled lacks a __meminitdata
  annotation or the annotation of movable_node_enabled is wrong.

Let's mark the function with __meminit.  It fixes the warning.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/mm: enable fixup_user_fault retrying

By passing a non-null flag we allow fixup_user_fault to retry, which
enables userfaultfd. As during these retries we might drop the mmap_sem
we need to check if that happened and redo the complete chain of
actions.

Signed-off-by: Dominik Dingel <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "Jason J. Herne" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: bring in additional flag for fixup_user_fault to signal unlock

During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.

The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault.  Till now we never cared about retries, but as the
userfaultfd code kind of relies on it.  this needs some fix.

This patchset does not take care of the futex code.  I will now look
closer at this.

This patch (of 2):

With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.

This patch brings in the logic to handle retries as well as it cleans up
the current documentation.  fixup_user_fault was not having the same
semantics as filemap_fault.  It never indicated if a retry happened and
so a caller wasn't able to handle that case.  So we now changed the
behaviour to always retry a locked mmap_sem.

Signed-off-by: Dominik Dingel <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "Jason J. Herne" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: re-enable dax pmd mappings

Now that the get_user_pages() path knows how to handle dax-pmd mappings,
remove the protections that disabled dax-pmd support.

Tests available from github.com/pmem/ndctl:

make TESTS="lib/test-dax.sh lib/test-mmap.sh" check

Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: provide diagnostics for pmd mapping failures

There is a wide gamut of conditions that can trigger the dax pmd path to
fallback to pte mappings. Ideally we'd have a syscall interface to
determine mapping characteristics after the fact. In the meantime
provide debug messages.

Signed-off-by: Dan Williams <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, x86: get_user_pages() for dax mappings

A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().

Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd

A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page.  The distinction is especially important in the
get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.

Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.

[[email protected]: fix regression in handling mlocked pages in  __split_huge_pmd()]
Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup

get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert.  This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.

Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

libnvdimm, pmem: move request_queue allocation earlier in probe

Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown. Towards that goal reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).

Signed-off-by: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax: convert vmf_insert_pfn_pmd() to pfn_t

Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.

Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax, gpu: convert vm_insert_mixed to pfn_t

Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.

There are no functional changes to the gpu drivers as a result of this
conversion.

Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Airlie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86, mm: introduce _PAGE_DEVMAP

_PAGE_DEVMAP is a hardware-unused pte bit that will later be used in the
get_user_pages() path to identify pfns backed by the dynamic allocation
established by devm_memremap_pages. Upon seeing that bit the gup path
will lookup and pin the allocation while the pages are in use.

Since the _PAGE_DEVMAP bit is > 32 it must be cast to u64 instead of a
pteval_t to allow pmd_flags() usage in the realmode boot code to build.

Signed-off-by: Dan Williams <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: fix compiler warning from definition of __pmd()

Take into account that the pmd_t type is a array inside a struct, so it
needs two levels of brackets to initialize. Otherwise, a usage of __pmd
generates a warning:

include/linux/mm.h:986:2: warning: missing braces around initializer [-Wmissing-braces]

Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: fix compile error on tile

Inlude asm/pgtable.h to get the definition for pud_t to fix:

include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'

Signed-off-by: Dan Williams <[email protected]>
Cc: Liviu Dudau <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

avr32: convert to asm-generic/memory_model.h

Switch avr32/include/asm/page.h to use the common defintions for
pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.

Signed-off-by: Dan Williams <[email protected]>
Cc: Haavard Skinnemoen <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

libnvdimm, pfn, pmem: allocate memmap array in persistent memory

Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.

[[email protected]: mn10300: declare __pfn_to_phys() to fix build error]
Signed-off-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86, mm: introduce vmem_altmap to augment vmemmap_populate()

In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: introduce find_dev_pagemap()

There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.

Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: skip memory block registration for ZONE_DEVICE

Prevent userspace from trying and failing to online ZONE_DEVICE pages
which are meant to never be onlined.

For example on platforms with a udev rule like the following:

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

...will generate futile attempts to online the ZONE_DEVICE sections.
Example kernel messages:

    Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1004747
    Policy zone: Normal
    online_pages [mem 0x248000000-0x24fffffff] failed

Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax, pmem: introduce pfn_t

For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kvm: rename pfn_t to kvm_pfn_t

To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

um: kill pfn_t

The core has developed a need for a "pfn_t" type [1]. Convert the usage
of pfn_t by usermode-linux to an unsigned long, and update pfn_to_phys()
to drop its expectation of a typed pfn.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html

Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: Split pmd map when fallback on COW

An infinite loop of PMD faults was observed when attempted to mlock() a
private read-only PMD mmap'd range of a DAX file.

__dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when falling
back to PTE on COW. However, __handle_mm_fault() returns without
falling back to handle_pte_fault() because a PMD map is present in this
case.

Change __dax_pmd_fault() to split the PMD map, if present, before
returning with VM_FAULT_FALLBACK.

Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, dax: fix livelock, allow dax pmd mappings to become writeable

Prior to this change DAX PMD mappings that were made read-only were
never able to be made writable again. This is because the code in
insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
these calls if the PMD already existed in the page table.

Instead, if we are doing a write always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark
the PMD as read-only, and then on a subsequent write fault we get into
an infinite loop of PMD faults where we try unsuccessfully to make the
PMD writeable.

Signed-off-by: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Reported-by: Jeff Moyer <[email protected]>
Reported-by: Toshi Kani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[[email protected]: fix read() of a hole]
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: guarantee page aligned results from bdev_direct_access()

If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies
error checking in upper layers.

Signed-off-by: Dan Williams <[email protected]>
Reported-by: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dax: increase granularity of dax_clear_blocks() operations

dax_clear_blocks is currently performing a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently, for example md-raid
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping move the call to cond_resched() to the outer loop.

The worst case latency between calls to cond_resched() after this change
is 500us the average latency is 133us.  This is up from a 10us max and
4us average.

Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pmem, dax: clean up clear_pmem()

To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 25):

Both __dax_pmd_fault, and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use non-temporal
instructions for clear_page(), and arch_clear_pmem() was always
incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Signed-off-by: Dan Williams <[email protected]>
Reported-by: Dave Hansen <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Toshi Kani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: fix split_huge_page() after mremap() of THP

Sasha Levin has reported KASAN out-of-bounds bug[1].  It points to "if
(!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.

The cause is that split_huge_page() doesn't handle THP correctly if it's
not allingned to PMD boundary.  It can happen after mremap().

Test-case (not always triggers the bug):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MB (1024UL*1024)
#define SIZE (2*MB)
#define BASE ((void *)0x400000000000)

int main()
{
char *p;

p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
-1, 0);
if (p == MAP_FAILED)
perror("mmap"), exit(1);
p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
BASE + SIZE + 8192);
if (p == MAP_FAILED)
perror("mremap"), exit(1);
system("echo 1 > /sys/kernel/debug/split_huge_pages");
return 0;
}

The patch fixes freeze and unfreeze paths to handle page table boundary
crossing.

It also makes mapcount vs count check in split_huge_page_to_list()
stricter:
- after freeze we don't expect any subpage mapped as we remove them
   from rmap when setting up migration entries;
- count must be 1, meaning only caller has reference to the page;

[1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size.  The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy.  With
that, we could avoid unnecessary THP split.

For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void.  So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty.  With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.

Signed-off-by: Minchan Kim <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/arm64/include/asm/pgtable.h: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
of the contents since MADV_FREE syscall is called for THP page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/arm/include/asm/pgtable-3level.h: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
of the contents since MADV_FREE syscall is called for THP page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Russell King <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/powerpc/include/asm/pgtable-ppc64.h: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
of the contents since MADV_FREE syscall is called for THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/sparc/include/asm/pgtable_64.h: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
of the contents since MADV_FREE syscall is called for THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/x86/include/asm/pgtable.h: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
of the contents since MADV_FREE syscall is called for THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/ksm.c: mark stable page dirty

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but KSM
uses clean write-protected ptes to reference the stable ksm page. So be
sure to mark that page dirty, so it's never mistakenly discarded.

[[email protected]: adjusted comments]
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
them so there is no value keeping them in the active anonymous LRU so
this patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.

An arguable issue for the approach would be whether we should put the
page to the head or tail of the inactive list. I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase but at least we know it's not *hot*, so landing of inactive head
would be a comprimise for various usecases.

This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily. This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/madvise.c: free swp_entry in madvise_free

When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is
siginficat slower (ie, 2x times) than madvise_dontneed.

     loop = 5;
     mmap(512M);
     while (loop--) {
             memset(512M);
             madvise(MADV_FREE or MADV_DONTNEED);
     }

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find hinted pages were already swapped out when syscall is called,
it's pointless to keep the swapped-out pages in pte.  Instead, let's
free the cold page because swapin is more expensive than (alloc page +
zeroing).

With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/*/include/uapi/asm/mman.h: : let MADV_FREE have same value for all architectures

For uapi, need try to let all macros have same value, and MADV_FREE is
added into main branch recently, so need redefine MADV_FREE for it.

At present, '8' can be shared with all architectures, so redefine it to
'8'.

[[email protected]: correct uniform value of MADV_FREE]
Signed-off-by: Chen Gang <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc, xtensa need
their own definitions.

This patch defines MADV_FREE for them so it should fix build break for
their architectures.

Maybe, I should split and feed pieces to arch maintainers but included
here for mmotm convenience.

[[email protected]: let MADV_FREE have same value for all architectures]
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Max Filippov <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Chen Gang <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: support madvise(MADV_FREE)

Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation +
zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When madvise syscall is called, VM clears dirty bit of ptes of the
range.  If memory pressure happens, VM checks dirty bit of page table
and if it found still "clean", it means it's a "lazyfree pages" so VM
could discard the page instead of swapping out.  Once there was store
operation for the page before VM peek a page to reclaim, dirty bit is
set so VM can swap out the page instead of discarding.

One thing we should notice is that basically, MADV_FREE relies on dirty
bit in page table entry to decide whether VM allows to discard the page
or not.  IOW, if page table entry includes marked dirty bit, VM
shouldn't discard the page.

However, as a example, if swap-in by read fault happens, page table
entry doesn't have dirty bit so MADV_FREE could discard the page
wrongly.

For avoiding the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache.  It worked out because swapped-in page lives on swap
cache and since it is evicted from the swap cache, the page has PG_dirty
flag.  So both page flags check effectively prevent wrong discarding by
MADV_FREE.

However, a problem in above logic is that swapped-in page has PG_dirty
still after they are removed from swap cache so VM cannot consider the
page as freeable any more even if madvise_free is called in future.

Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

To solve the problem, this patch clears PG_dirty if only the page is
owned exclusively by current process when madvise is called because
PG_dirty represents ptes's dirtiness in several processes so we could
clear it only if we own it exclusively.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

  barrios@blaptop:~/benchmark/ebizzy$ lscpu
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                12
  On-line CPU(s) list:   0-11
  Thread(s) per core:    1
  Core(s) per socket:    1
  Socket(s):             12
  NUMA node(s):          1
  Vendor ID:             GenuineIntel
  CPU family:            6
  Model:                 2
  Stepping:              3
  CPU MHz:               3200.185
  BogoMIPS:              6400.53
  Virtualization:        VT-x
  Hypervisor vendor:     KVM
  Virtualization type:   full
  L1d cache:             32K
  L1i cache:             32K
  L2 cache:              4096K
  NUMA node0 CPU(s):     0-11
  ebizzy benchmark(./ebizzy -S 10 -n 512)

  Higher avg is better.

   vanilla-jemalloc             MADV_free-jemalloc

  1 thread
  records: 10                   records: 10
  avg:   2961.90                avg:  12069.70
  std:     71.96(2.43%)         std:    186.68(1.55%)
  max:   3070.00                max:  12385.00
  min:   2796.00                min:  11746.00

  2 thread
  records: 10                   records: 10
  avg:   5020.00                avg:  17827.00
  std:    264.87(5.28%)         std:    358.52(2.01%)
  max:   5244.00                max:  18760.00
  min:   4251.00                min:  17382.00

  4 thread
  records: 10                   records: 10
  avg:   8988.80                avg:  27930.80
  std:   1175.33(13.08%)        std:   3317.33(11.88%)
  max:   9508.00                max:  30879.00
  min:   5477.00                min:  21024.00

  8 thread
  records: 10                   records: 10
  avg:  13036.50                avg:  33739.40
  std:    170.67(1.31%)         std:   5146.22(15.25%)
  max:  13371.00                max:  40572.00
  min:  12785.00                min:  24088.00

  16 thread
  records: 10                   records: 10
  avg:  11092.40                avg:  31424.20
  std:    710.60(6.41%)         std:   3763.89(11.98%)
  max:  12446.00                max:  36635.00
  min:   9949.00                min:  25669.00

  32 thread
  records: 10                   records: 10
  avg:  11067.00                avg:  34495.80
  std:    971.06(8.77%)         std:   2721.36(7.89%)
  max:  12010.00                max:  38598.00
  min:   9002.00                min:  30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

This patch (of 12):

Add core MADV_FREE implementation.

[[email protected]: small cleanups]
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Mika Penttil <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Shaohua Li" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Russell King <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: don't call idr_remove() from zram_remove()

The use of idr_remove() is forbidden in the callback functions of
idr_for_each().  It is therefore unsafe to call idr_remove in
zram_remove().

This patch moves the call to idr_remove() from zram_remove() to
hot_remove_store().  In the detroy_devices() path, idrs are removed by
idr_destroy().  This solves an use-after-free detected by KASan.

[[email protected]: fix coding stype, per Sergey]
Signed-off-by: Jerome Marchand <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: <[email protected]> [4.2+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: add page_check_address_transhuge() helper

page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
code for looking up pte of a (possibly transhuge) page. Move this code
to a new helper function, page_check_address_transhuge(), and make the
above mentioned functions use it.

This is just a cleanup, no functional changes are intended.

Signed-off-by: Vladimir Davydov <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: increase split_huge_page() success rate

During freeze_page(), we remove the page from rmap.  It munlocks the
page if it was mlocked.  clear_page_mlock() uses thelru cache, which
temporary pins the page.

Let's drain the lru cache before checking page's count vs.  mapcount.
The change makes mlocked page split on first attempt, if it was not
pinned by somebody else.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Sasha Levin <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: add debugfs handle to split all huge pages

Writing 1 into 'split_huge_pages' will try to find and split all huge
pages in the system. This is useful for debuging.

[[email protected]: fix printk text, per Vlastimil]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: prepare page_referenced() and page_idle to new THP refcounting

Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages. That's no true anymore: THP can be mapped
with PTEs too.

The patch removes PageTransHuge() test from the functions and opencode
page table check.

[[email protected]: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: allow mlocked THP again

Before THP refcounting rework, THP was not allowed to cross VMA
boundary.  So, if we have THP and we split it, PG_mlocked can be safely
transferred to small pages.

With new THP refcounting and naive approach to mlocking we can end up
with this scenario:
1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
2. the process does munlock() on the *part* of the THP:
      - the VMA is split into two, one of them VM_LOCKED;
      - huge PMD split into PTE table;
      - THP is still mlocked;
3. split_huge_page():
      - it transfers PG_mlocked to *all* small pages regrardless if it
blong to any VM_LOCKED VMA.

We probably could munlock() all small pages on split_huge_page(), but I
think we have accounting issue already on step two.

Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().

This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan.  After the split vmscan will detect
unevictable small pages and mlock them.

With this approach we shouldn't hit situation like described above.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>