Git Repo - linux.git/log

netconsole: remove unnecessary netconsole_target_get/out() from write_msg()

write_msg() grabs target_list_lock and walks target_list invoking
netpool_send_udp() on each target. Curiously, it protects each iteration
with netconsole_target_get/put() even though it never releases
target_list_lock which protects all the members.

While this doesn't harm anything, it doesn't serve any purpose either.
The items on the list can't go away while target_list_lock is held.
Remove the unnecessary get/put pair.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Miller <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: implement support for extended console drivers

printk log_buf keeps various metadata for each message including its
sequence number and timestamp.  The metadata is currently available only
through /dev/kmsg and stripped out before passed onto console drivers.  We
want this metadata to be available to console drivers too so that console
consumers can get full information including the metadata and dictionary,
which among other things can be used to detect whether messages got lost
in transit.

This patch implements support for extended console drivers.  Consoles can
indicate that they want extended messages by setting the new CON_EXTENDED
flag and they'll be fed messages formatted the same way as /dev/kmsg.

"<level>,<sequnum>,<timestamp>,<contflag>;<message text>\n"

If extended consoles exist, in-kernel fragment assembly is disabled.  This
ensures that all messages emitted to consoles have full metadata including
sequence number.  The contflag carries enough information to reassemble
the fragments from the reader side trivially.  Note that this only affects
/dev/kmsg.  Regular console and /proc/kmsg outputs are not affected by
this change.

* Extended message formatting for console drivers is enabled iff there
  are registered extended consoles.

* Comment describing /dev/kmsg message format updated to add missing
  contflag field and help distinguishing variable from verbatim terms.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Miller <[email protected]>
Cc: Kay Sievers <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: factor out message formatting from devkmsg_read()

The extended message formatting used for /dev/kmsg will be used implement
extended consoles. Factor out msg_print_ext_header() and
msg_print_ext_body() from devkmsg_read().

This is pure restructuring.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Miller <[email protected]>
Cc: Kay Sievers <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

printk: guard the amount written per line by devkmsg_read()

This patchset updates netconsole so that it can emit messages with the
same header as used in /dev/kmsg which gives neconsole receiver full log
information which enables things like structured logging and detection
of lost messages.

This patch (of 7):

devkmsg_read() uses 8k buffer and assumes that the formatted output
message won't overrun which seems safe given LOG_LINE_MAX, the current use
of dict and the escaping method being used; however, we're planning to use
devkmsg formatting wider and accounting for the buffer size properly isn't
that complicated.

This patch defines CONSOLE_EXT_LOG_MAX as 8192 and updates devkmsg_read()
so that it limits output accordingly.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Miller <[email protected]>
Cc: Kay Sievers <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/misc/altera-stapl/altera.c: remove extraneous KERN_INFO prefix

The KERN_INFO prefix is being prepended to KERN_DEBUG when using the
dprink macro, Remove it as it is extraneous since we are printing the
message out as debug via dprintk().

Fixes smatch warning:

drivers/misc/altera-stapl/altera.c:2454 altera_init()
warn: KERN_* level not at start of string

Signed-off-by: Colin Ian King <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Igor M. Liplianin <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

clone: support passing tls argument via C rather than pt_regs magic

clone has some of the quirkiest syscall handling in the kernel, with a
pile of special cases, historical curiosities, and architecture-specific
calling conventions.  In particular, clone with CLONE_SETTLS accepts a
parameter "tls" that the C entry point completely ignores and some
assembly entry points overwrite; instead, the low-level arch-specific
code pulls the tls parameter out of the arch-specific register captured
as part of pt_regs on entry to the kernel.  That's a massive hack, and
it makes the arch-specific code only work when called via the specific
existing syscall entry points; because of this hack, any new clone-like
system call would have to accept an identical tls argument in exactly
the same arch-specific position, rather than providing a unified system
call entry point across architectures.

The first patch allows architectures to handle the tls argument via
normal C parameter passing, if they opt in by selecting
HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
into this.

These two patches came out of the clone4 series, which isn't ready for
this merge window, but these first two cleanup patches were entirely
uncontroversial and have acks.  I'd like to go ahead and submit these
two so that other architectures can begin building on top of this and
opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
send these through the next merge window (along with v3 of clone4) if
anyone would prefer that.

This patch (of 2):

clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread.  sys_clone declares an int argument
tls_val in the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass along
that argument.  Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes tls
in).

Apart from being awful and inscrutable, that also only works because only
one code path into copy_thread can pass the CLONE_SETTLS flag, and that
code path comes from sys_clone with its architecture-specific
argument-passing order.  This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.

However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.

Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
and a new copy_thread_tls that accepts the tls parameter as an additional
unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
argument to an unsigned long (which does not change the ABI), and pass
that down to copy_thread_tls.

Architectures that don't opt into copy_thread_tls will continue to ignore
the C argument to sys_clone in favor of the pt_regs captured at kernel
entry, and thus will be unable to introduce new versions of the clone
syscall.

Patch co-authored by Josh Triplett and Thiago Macieira.

Signed-off-by: Josh Triplett <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thiago Macieira <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

stddef.h: move offsetofend inside #ifndef/#endif guard, neaten

Commit 3876488444e7 ("include/stddef.h: Move offsetofend() from vfio.h
to a generic kernel header") added offsetofend outside the normal
include #ifndef/#endif guard. Move it inside.

Miscellanea:

o remove unnecessary blank line
o standardize offsetof macros whitespace style

Signed-off-by: Joe Perches <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mailmap: add rdunlap email auto-correction

To avoid having xenotime bounce when things like get_maintainers gives
me addresses, add Randy's current address.

Signed-off-by: Kees Cook <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Mohit Kumar has moved

Mohit's email-id doesn't exist anymore as he has left the company.
Replace ST's id with [email protected].

Signed-off-by: Pratyush Anand <[email protected]>
Cc: Mohit Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Pratyush Anand has moved

[email protected] email-id doesn't exist anymore as I have left the
company. Replace ST's id with [email protected].

Signed-off-by: Pratyush Anand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler-intel: fix wrong compiler barrier() macro

Cleanup commit 73679e508201 ("compiler-intel.h: Remove duplicate
definition") removed the double definition of __memory_barrier()
intrinsics.

However, in doing so, it also removed the preceding #undef barrier by
accident, meaning, the actual barrier() macro from compiler-gcc.h with
inline asm is still in place as __GNUC__ is provided.

Subsequently, barrier() can never be defined as __memory_barrier() from
compiler.h since it already has a definition in place and if we trust
the comment in compiler-intel.h, ecc doesn't support gcc specific asm
statements.

I don't have an ecc at hand (unsure if that's still used in the field?)
and only found this by accident during code review, a revert of that
cleanup would be simplest option.

Fixes: 73679e508201 ("compiler-intel.h: Remove duplicate definition")
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Pranith Kumar <[email protected]>
Cc: Pranith Kumar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: mancha security <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler-gcc: integrate the various compiler-gcc[345].h files

As gcc major version numbers are going to advance rather rapidly in the
future, there's no real value in separate files for each compiler
version.

Deduplicate some of the macros #defined in each file too.

Neaten comments using normal kernel commenting style.

Signed-off-by: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Segher Boessenkool <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Alan Modra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler-gcc.h: neatening

- Move the inline and noinline blocks together

- Comment neatening

- Alignment of __attribute__ uses

- Consistent naming of __must_be_array macro argument

- Multiline macro neatening

Signed-off-by: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Segher Boessenkool <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Alan Modra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs, proc: introduce CONFIG_PROC_CHILDREN

Commit 818411616baf ("fs, proc: introduce /proc/<pid>/task/<tid>/children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.

This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt.  The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process.  I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.

This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
/proc/<pid>/task/<tid>/children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

Alban tested that /proc/<pid>/task/<tid>/children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE

Signed-off-by: Iago López Galeiras <[email protected]>
Tested-by: Alban Crequy <[email protected]>
Reviewed-by: Cyrill Gorcunov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Djalal Harouni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: fix PAGE_SIZE limit of /proc/$PID/cmdline

/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with

$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null

However, command line size was never limited to PAGE_SIZE but to 128 KB
and relatively recently limitation was removed altogether.

People noticed and ask questions:
http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit

seq file interface is not OK, because it kmalloc's for whole output and
open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To
not do that, limit must be imposed which is incompatible with arbitrary
sized command lines.

I apologize for hairy code, but this it direct consequence of command line
layout in memory and hacks to support things like "init [3]".

The loops are "unrolled" otherwise it is either macros which hide control
flow or functions with 7-8 arguments with equal line count.

There should be real setproctitle(2) or something.

[[email protected]: fix a billion min() warnings]
Signed-off-by: Alexey Dobriyan <[email protected]>
Tested-by: Jarod Wilson <[email protected]>
Acked-by: Jarod Wilson <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Jan Stancek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

prctl: more prctl(PR_SET_MM_*) checks

Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
consistent view of mm->arg_start et al fields, but not enough.  In
particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
but don't check that the start address is lower than the end address _at
all_.

Consolidate all consistency checks, so there will be no difference in
the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.

The program below makes both ARGV and ENVP areas be reversed.  It makes
/proc/$PID/cmdline show garbage (it doesn't oops by luck).

#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

enum {PAGE_SIZE=4096};

int main(void)
{
void *p;

p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

#define PR_SET_MM               35
#define PR_SET_MM_ARG_START     8
#define PR_SET_MM_ARG_END       9
#define PR_SET_MM_ENV_START     10
#define PR_SET_MM_ENV_END       11
prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ARG_END,   (unsigned long)p, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
prctl(PR_SET_MM, PR_SET_MM_ENV_END,   (unsigned long)p, 0, 0);

pause();
return 0;
}

[[email protected]: tidy code, tweak comment]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Cc: Jarod Wilson <[email protected]>
Cc: Jan Stancek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

avr32: use for_each_sg()

This replaces the plain loop over the sglist array with for_each_sg()
macro which consists of sg_next() function calls. Since avr32 doesn't
select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
order to loop over each sg element. But this can help find problems
with drivers that do not properly initialize their sg tables when
CONFIG_DEBUG_SG is enabled.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Haavard Skinnemoen <[email protected]>
Acked-by: Hans-Christian Egtvedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: use for_each_sg()

This replaces the plain loop over the sglist array with for_each_sg()
macro which consists of sg_next() function calls. Since frv doesn't
select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
order to loop over each sg element. But this can help find problems
with drivers that do not properly initialize their sg tables when
CONFIG_DEBUG_SG is enabled.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: remove unused inline function is_in_rom()

The function is not used anywhere in the tree (anymore) and this is the
last remaining instance, so remove it.

Signed-off-by: Tobias Klauser <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zpool: remove zpool_evict()

Remove zpool_evict() helper function.  As zbud is currently the only
zpool implementation that supports eviction, add zpool and zpool_ops
references to struct zbud_pool and directly call zpool_ops->evict(zpool,
handle) on eviction.

Currently zpool provides the zpool_evict helper which locks the zpool
list lock and searches through all pools to find the specific one
matching the caller, and call the corresponding zpool_ops->evict
function.  However, this is unnecessary, as the zbud pool can simply
keep a reference to the zpool that created it, as well as the zpool_ops,
and directly call the zpool_ops->evict function, when it needs to evict
a page.  This avoids a spinlock and list search in zpool for each
eviction.

Signed-off-by: Dan Streetman <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zpool: change pr_info to pr_debug

Change the pr_info() calls to pr_debug(). There's no need for the extra
verbosity in the log. Also change the msg formats to be consistent.

Signed-off-by: Dan Streetman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ganesh Mahendran <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: runtime enable/disable

Change the "enabled" parameter to be configurable at runtime. Remove the
enabled check from init(), and move it to the frontswap store() function;
when enabled, pages will be stored, and when disabled, pages won't be
stored.

This is almost identical to Seth's patch from 2 years ago:
http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html

[[email protected]: tweak documentation]
Signed-off-by: Dan Streetman <[email protected]>
Suggested-by: Seth Jennings <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: check comp algorithm availability earlier

Improvement idea by Marcin Jabrzyk.

comp_algorithm_store() silently accepts any supplied algorithm name,
because zram performs algorithm availability check later, during the
device configuration phase in disksize_store() and emits the following
error:

  "zram: Cannot initialise %s compressing backend"

this error line is somewhat generic and, besides, can indicate a failed
attempt to allocate compression backend's working buffers.

add algorithm availability check to comp_algorithm_store():

  echo lzz > /sys/block/zram0/comp_algorithm
  -bash: echo: write error: Invalid argument

Signed-off-by: Sergey Senozhatsky <[email protected]>
Reported-by: Marcin Jabrzyk <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: cut trailing newline in algorithm name

Supplied sysfs values sometimes contain new-line symbols (echo vs.  echo
-n), which we also copy as a compression algorithm name.  it works fine
when we lookup for compression algorithm, because we use sysfs_streq()
which takes care of new line symbols.  however, it doesn't look nice when
we print compression algorithm name if zcomp_create() failed:

zram: Cannot initialise LXZ
            compressing backend

cut trailing new-line, so the error string will look like

  zram: Cannot initialise LXZ compressing backend

we also now can replace sysfs_streq() in zcomp_available_show() with
strcmp().

Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: cosmetic zram_bvec_write() cleanup

`bool locked' local variable tells us if we should perform
zcomp_strm_release() or not (jumped to `out' label before
zcomp_strm_find() occurred), which is equivalent to `zstrm' being or not
being NULL. remove `locked' and check `zstrm' instead.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: add dynamic device add/remove functionality

We currently don't support on-demand device creation.  The one and only
way to have N zram devices is to specify num_devices module parameter
(default value: 1).  IOW if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1.  And do
this again, if needed.

This patch introduces zram control sysfs class, which has two sysfs
attrs:
- hot_add      -- add a new zram device
- hot_remove   -- remove a specific (device_id) zram device

hot_add sysfs attr is read-only and has only automatic device id
assignment mode (as requested by Minchan Kim).  read operation performed
on this attr creates a new zram device and returns back its device_id or
error status.

Usage example:
# add a new specific zram device
cat /sys/class/zram-control/hot_add
2

# remove a specific zram device
echo 4 > /sys/class/zram-control/hot_remove

Returning zram_add() error code back to user (-ENOMEM in this case)

cat /sys/class/zram-control/hot_add
cat: /sys/class/zram-control/hot_add: Cannot allocate memory

NOTE, there might be users who already depend on the fact that at least
zram0 device gets always created by zram_init(). Preserve this behavior.

[[email protected]: use zram->claim to avoid lockdep splat]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: close race by open overriding

[ Original patch from Minchan Kim <[email protected]> ]

Commit ba6b17d68c8e ("zram: fix umount-reset_store-mount race
condition") introduced bdev->bd_mutex to protect a race between mount
and reset.  At that time, we don't have dynamic zram-add/remove feature
so it was okay.

However, as we introduce dynamic device feature, bd_mutex became
trouble.

CPU 0

echo 1 > /sys/block/zram<id>/reset
  -> kernfs->s_active(A)
    -> zram:reset_store->bd_mutex(B)

CPU 1

echo <id> > /sys/class/zram/zram-remove
  ->zram:zram_remove: bd_mutex(B)
  -> sysfs_remove_group
    -> kernfs->s_active(A)

IOW, AB -> BA deadlock

The reason we are holding bd_mutex for zram_remove is to prevent
any incoming open /dev/zram[0-9]. Otherwise, we could remove zram
others already have opened. But it causes above deadlock problem.

To fix the problem, this patch overrides block_device.open and
it returns -EBUSY if zram asserts he claims zram to reset so any
incoming open will be failed so we don't need to hold bd_mutex
for zram_remove ayn more.

This patch is to prepare for zram-add/remove feature.

[[email protected]: simplify reset_store()]
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: return zram device_id from zram_add()

This patch prepares zram to enable on-demand device creation.
zram_add() performs automatic device_id assignment and returns
new device id (>= 0) or error code (< 0).

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: trivial: correct flag operations comment

We don't have meta->tb_lock anymore and use meta table entry bit_spin_lock
instead. update corresponding comment.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: report every added and removed device

With dynamic device creation/removal (which will be introduced later in
the series) printing num_devices in zram_init() will not make a lot of
sense, as well as printing the number of destroyed devices in
destroy_devices(). Print per-device action (added/removed) in zram_add()
and zram_remove() instead.

Example:

[ 3645.259652] zram: Added device: zram5
[ 3646.152074] zram: Added device: zram6
[ 3650.585012] zram: Removed device: zram5
[ 3655.845584] zram: Added device: zram8
[ 3660.975223] zram: Removed device: zram6

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: remove max_num_devices limitation

Limiting the number of zram devices to 32 (default max_num_devices value)
is confusing, let's drop it. A user with 2TB or 4TB of RAM, for example,
can request as many devices as he can handle.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: reorganize code layout

This patch looks big, but basically it just moves code blocks.
No functional changes.

Our current code layout looks like a sandwitch.

For example,
a) between read/write handlers, we have update_used_max() helper function:

static int zram_decompress_page
static int zram_bvec_read
static inline void update_used_max
static int zram_bvec_write
static int zram_bvec_rw

b) RW request handlers __zram_make_request/zram_bio_discard are divided by
sysfs attr reset_store() function and corresponding zram_reset_device()
handler:

static void zram_bio_discard
static void zram_reset_device
static ssize_t disksize_store
static ssize_t reset_store
static void __zram_make_request

c) we first a bunch of sysfs read/store functions. then a number of
one-liners, then helper functions, RW functions, sysfs functions, helper
functions again, and so on.

Reorganize layout to be more logically grouped (a brief description,
`cat zram_drv.c | grep static` gives a bigger picture):

-- one-liners: zram_test_flag/etc.

-- helpers: is_partial_io/update_position/etc

-- sysfs attr show/store functions + ZRAM_ATTR_RO() generated stats
show() functions
exception: reset and disksize store functions are required to be after
meta() functions. because we do device create/destroy actions in these
sysfs handlers.

-- "mm" functions: meta get/put, meta alloc/free, page free
static inline bool zram_meta_get
static inline void zram_meta_put
static void zram_meta_free
static struct zram_meta *zram_meta_alloc
static void zram_free_page

-- a block of I/O functions
static int zram_decompress_page
static int zram_bvec_read
static int zram_bvec_write
static void zram_bio_discard
static int zram_bvec_rw
static void __zram_make_request
static void zram_make_request
static void zram_slot_free_notify
static int zram_rw_page

-- device contol: add/remove/init/reset functions (+zram-control class
will sit here)
static int zram_reset_device
static ssize_t reset_store
static ssize_t disksize_store
static int zram_add
static void zram_remove
static int __init zram_init
static void __exit zram_exit

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: use idr instead of `zram_devices' array

This patch makes some preparations for on-demand device add/remove
functionality.

Remove `zram_devices' array and switch to id-to-pointer translation (idr).
idr doesn't bloat zram struct with additional members, f.e. list_head,
yet still provides ability to match the device_id with the device pointer.

No user-space visible changes.

[[email protected]: return -ENOMEM when `queue' alloc fails]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Reported-by: Julia Lawall <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: cosmetic ZRAM_ATTR_RO code formatting tweak

Fix a misplaced backslash.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: add `compact` sysfs entry to documentation

We currently don't support zram on-demand device creation.  The only way
to have N zram devices is to specify num_devices module parameter (default
value 1).  That means that if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1.

This patchset introduces zram-control sysfs class, which has two sysfs
attrs:

- hot_add     -- add a new zram device
- hot_remove  -- remove a specific (device_id) zram device

    Usage example:
        # add a new specific zram device
        cat /sys/class/zram-control/hot_add
        1

        # remove a specific zram device
        echo 4 > /sys/class/zram-control/hot_remove

This patch (of 10):

Briefly describe missing `compact` sysfs entry.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zsmalloc: remove obsolete ZSMALLOC_DEBUG

The DEBUG define in zsmalloc is useless, there is no usage of it at all.

Signed-off-by: Marcin Jabrzyk <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: remove obsolete ZRAM_DEBUG option

This config option doesn't provide any usage for zram.

Signed-off-by: Marcin Jabrzyk <[email protected]>
Acked-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/mm: change HPAGE_SHIFT type to int

With making HPAGE_SHIFT an unsigned integer we also accidentally changed
pageblock_order. In order to avoid compiler warnings we make
HPAGE_SHFIT an int again.

Signed-off-by: Dominik Dingel <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/mm: forward check for huge pmds to pmd_large()

We already do the check in pmd_large, so we can just forward the call.

Signed-off-by: Dominik Dingel <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/hugetlb: remove dead code for sw emulated huge pages

We now support only hugepages on hardware with EDAT1 support. So we
remove the prepare/release_hugepage hooks and simplify set_huge_pte_at
and huge_ptep_get.

Signed-off-by: Dominik Dingel <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: remove arch_prepare/release_hugepage from arch headers

Nobody used these hooks so they were removed from common code, and can now
be removed from the architectures.

Signed-off-by: Dominik Dingel <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: remove unused arch hook prepare/release_hugepage

With s390 dropping support for emulated hugepages, the last user of
arch_prepare_hugepage and arch_release_hugepage is gone.

Signed-off-by: Dominik Dingel <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/mm: make hugepages_supported a boot time decision

There is a potential bug with KVM and hugetlbfs if the hardware does not
support hugepages (EDAT1). We fix this by making EDAT1 a hard requirement
for hugepages and therefore removing and simplifying code.

As s390, with the sw-emulated hugepages, was the only user of
arch_prepare/release_hugepage I also removed theses calls from common and
other architecture code.

This patch (of 5):

By dropping support for hugepages on machines which do not have the
hardware feature EDAT1, we fix a potential s390 KVM bug.

The bug would happen if a guest is backed by hugetlbfs (not supported
currently), but does not get pagetables with PGSTE. This would lead to
random memory overwrites.

Signed-off-by: Dominik Dingel <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata

Pull libata updates from Tejun Heo:

- a number of libata core changes to better support NCQ TRIM.

- ahci now supports MSI-X in single IRQ mode to support a new
   controller which doesn't implement MSI or INTX.

- ahci now supports edge-triggered IRQ mode to support a new controller
   which for some odd reason did edge-triggered IRQ.

- the usual controller support additions and changes.

* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (27 commits)
  libata: Do not blacklist Micron M500DC
  ata: ahci_mvebu: add suspend/resume support
  ahci, msix: Fix build error for !PCI_MSI
  ahci: Add support for Cavium's ThunderX host controller
  ahci: Add generic MSI-X support for single interrupts to SATA PCI driver
  libata: finally use __initconst in ata_parse_force_one()
  drivers: ata: add support for Ceva sata host controller
  devicetree:bindings: add devicetree bindings for ceva ahci
  ahci: added support for Freescale AHCI sata
  ahci: Store irq number in struct ahci_host_priv
  ahci: Move interrupt enablement code to a separate function
  Doc: libata: Fix spelling typo found in libata.xml
  ata:sata_nv - Change 1 to true for bool type variable.
  ata: add Broadcom AHCI SATA3 driver for STB chips
  Documentation: devicetree: add Broadcom SATA binding
  libata: Fix regression when the NCQ Send and Receive log page is absent
  ata: hpt366: fix constant cast warning
  ata: ahci_xgene: potential NULL dereference in probe
  ata: ahci_xgene: Add AHCI Support for 2nd HW version of APM X-Gene SoC AHCI SATA Host controller.
  libahci: Add support to handle HOST_IRQ_STAT as edge trigger latch.
  ...

Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- DM core cleanups:

     * blk-mq request-based DM no longer uses any mempools now that
       partial completions are no longer handled as part of cloned
       requests

- DM raid cleanups and support for MD raid0

- DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy

     * smq is the new default dm-cache policy

- DM thinp cleanups and much more efficient large discard support

- DM statistics support for request-based DM and nanosecond resolution
   timestamps

- Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt

* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
  dm stats: add support for request-based DM devices
  dm stats: collect and report histogram of IO latencies
  dm stats: support precise timestamps
  dm stats: fix divide by zero if 'number_of_areas' arg is zero
  dm cache: switch the "default" cache replacement policy from mq to smq
  dm space map metadata: fix occasional leak of a metadata block on resize
  dm thin metadata: fix a race when entering fail mode
  dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
  dm thin: range discard support
  dm thin metadata: add dm_thin_remove_range()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm btree: add dm_btree_remove_leaves()
  dm stats: Use kvfree() in dm_kvfree()
  dm cache: age and write back cache entries even without active IO
  dm cache: prefix all DMERR and DMINFO messages with cache device name
  dm cache: add fail io mode and needs_check flag
  dm cache: wake the worker thread every time we free a migration object
  dm cache: add stochastic-multi-queue (smq) policy
  dm cache: boost promotion of blocks that will be overwritten
  dm cache: defer whole cells
  ...

Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block

Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...

Merge branch 'for-4.2/sg' of git://git.kernel.dk/linux-block

Pull asm/scatterlist.h removal from Jens Axboe:
"We don't have any specific arch scatterlist anymore, since parisc
  finally switched over.  Kill the include"

* 'for-4.2/sg' of git://git.kernel.dk/linux-block:
  remove scatterlist.h generation from arch Kbuild files
  remove <asm/scatterlist.h>

tracing: Fix typo from "static inlin" to "static inline"

The trace.h header when called without CONFIG_EVENT_TRACING enabled
(seldom done), will not compile because of a typo in the protocol
of trace_event_enum_update().

Cc: [email protected] # 4.1+
Signed-off-by: Steven Rostedt <[email protected]>

tracing/filter: Do not allow infix to exceed end of string

While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:

# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has

ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];

The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.

Luckily, only root can write to the filter.

Cc: [email protected] # 2.6.33+
Signed-off-by: Steven Rostedt <[email protected]>

Merge branch 'for-4.2/drivers' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"This contains:

   - a few race fixes for null_blk, from Akinobu Mita.

   - a series of fixes for mtip32xx, from Asai Thambi and Selvan Mani at
     Micron.

   - NVMe:
        * Fix for missing error return on allocation failure, from Axel
          Lin.

        * Code consolidation and cleanups from Christoph.

        * Memory barrier addition, syncing queue count and queue
          pointers. From Jon Derrick.

        * Various fixes from Keith, an addition to support user
          issue reset from sysfs or ioctl, and automatic namespace
          rescan.

        * Fix from Matias, avoiding losing some request flags when
          marking the request failfast.

   - small cleanups and sparse fixups for ps3vram.  From Geert
     Uytterhoeven and Geoff Lavand.

   - s390/dasd dead code removal, from Jarod Wilson.

   - a set of fixes and optimizations for loop, from Ming Lei.

   - conversion to blkdev_reread_part() of loop, dasd, ndb.  From Ming
     Lei.

   - updates to cciss.  From Tomas Henzl"

* 'for-4.2/drivers' of git://git.kernel.dk/linux-block: (44 commits)
  mtip32xx: Fix accessing freed memory
  block: nvme-scsi: Catch kcalloc failure
  NVMe: Fix IO for extended metadata formats
  nvme: don't overwrite req->cmd_flags on sync cmd
  mtip32xx: increase wait time for hba reset
  mtip32xx: fix minor number
  mtip32xx: remove unnecessary sleep in mtip_ftl_rebuild_poll()
  mtip32xx: fix crash on surprise removal of the drive
  mtip32xx: Abort I/O during secure erase operation
  mtip32xx: fix incorrectly setting MTIP_DDF_SEC_LOCK_BIT
  mtip32xx: remove unused variable 'port->allocated'
  mtip32xx: fix rmmod issue
  MAINTAINERS: Update ps3vram block driver
  block/ps3vram: Remove obsolete reference to MTD
  block/ps3vram: Fix sparse warnings
  NVMe: Automatic namespace rescan
  NVMe: Memory barrier before queue_count is incremented
  NVMe: add sysfs and ioctl controller reset
  null_blk: restart request processing on completion handler
  null_blk: prevent timer handler running on a different CPU where started
  ...

tracing/filter: Do not WARN on operand count going below zero

When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).

# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Vince Weaver <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: [email protected] # 2.6.33+
Signed-off-by: Steven Rostedt <[email protected]>

Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block

Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...

Merge tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:
"Minor fixes for UBI and UBIFS"

* tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs:
  UBI: Remove unnecessary `\'
  UBI: Use static class and attribute groups
  UBI: add a helper function for updatting on-flash layout volumes
  UBI: Fastmap: Do not add vol if it already exists
  UBI: Init vol->reserved_pebs by assignment
  UBI: Fastmap: Rename variables to make them meaningful
  UBI: Fastmap: Remove unnecessary `\'
  UBI: Fastmap: Use max() to get the larger value
  ubifs: fix to check error code of register_shrinker
  UBI: block: Dynamically allocate minor numbers

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"A very large number of cleanups and bug fixes --- in particular for
  the ext4 encryption patches, which is a new feature added in the last
  merge window.  Also fix a number of long-standing xfstest failures.
  (Quota writes failing due to ENOSPC, a race between truncate and
  writepage in data=journalled mode that was causing generic/068 to
  fail, and other corner cases.)

  Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
  performance eliminating locking when a buffer is modified more than
  once during a transaction (which is very common for allocation
  bitmaps, for example), in which case the state of the journalled
  buffer head doesn't need to change"

[ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
  the merge resolution, to make it clear that that function is _only_
  used for encrypted symlinks.  The function doesn't actually work for
  non-encrypted symlinks at all, and they use the generic helpers
                                         - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: set lazytime on remount if MS_LAZYTIME is set by mount
  ext4: only call ext4_truncate when size <= isize
  ext4: make online defrag error reporting consistent
  ext4: minor cleanup of ext4_da_reserve_space()
  ext4: don't retry file block mapping on bigalloc fs with non-extent file
  ext4: prevent ext4_quota_write() from failing due to ENOSPC
  ext4: call sync_blockdev() before invalidate_bdev() in put_super()
  jbd2: speedup jbd2_journal_dirty_metadata()
  jbd2: get rid of open coded allocation retry loop
  ext4: improve warning directory handling messages
  jbd2: fix ocfs2 corrupt when updating journal superblock fails
  ext4: mballoc: avoid 20-argument function call
  ext4: wait for existing dio workers in ext4_alloc_file_blocks()
  ext4: recalculate journal credits as inode depth changes
  jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
  ext4: use swap() in mext_page_double_lock()
  ext4: use swap() in memswap()
  ext4: fix race between truncate and __ext4_journalled_writepage()
  ext4 crypto: fail the mount if blocksize != pagesize
  ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc fixes from David Miller:
"Sparc perf stack traversal fixes from David Ahern"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: perf: Use UREG_FP rather than UREG_I6
  sparc64: perf: Add sanity checking on addresses in user stack
  sparc64: Convert BUG_ON to warning
  sparc: perf: Disable pagefaults while walking userspace stacks

um: Don't pollute kernel namespace with uapi

Don't include ptrace uapi stuff in arch headers, it will
pollute the kernel namespace and conflict with existing
stuff.
In this case it fixes clashes with common names like R8.

Signed-off-by: Richard Weinberger <[email protected]>

um: Include sys/types.h for makedev(), major(), minor()

The functions in question are not part of the POSIX standard,
documentation however hints that the corresponding header shall
be sys/types.h. C libraries other than glibc, namely musl, did
not include that header via other ways and complained.

Signed-off-by: Hans-Werner Hilse <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>

um: Do not use stdin and stdout identifiers for struct members

stdin, stdout and stderr are macros according to C89/C99.
Thus do not use them as struct member identifiers to avoid
bad results from macro expansion.

Signed-off-by: Hans-Werner Hilse <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>

um: Do not use __ptr_t type for stack_t's .ss pointer

__ptr_t type is a glibc-specific type, while the generally
documented type is a void*. That's what other C libraries use,
too.

Signed-off-by: Hans-Werner Hilse <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>

Merge tag 'for-4.2' of git://git.sourceforge.jp/gitroot/uclinux-h8/linux

Pull Renesas H8/300 architecture re-introduction from Yoshinori Sato.

We dropped arch/h8300 two years ago as stale and old, this is a new and
more modern rewritten arch support for the same architecture.

* tag 'for-4.2' of git://git.sourceforge.jp/gitroot/uclinux-h8/linux: (27 commits)
  h8300: fix typo.
  h8300: Always build dtb
  h8300: Remove ARCH_WANT_IPC_PARSE_VERSION
  sh-sci: Get register size from platform device
  clk: h8300: fix error handling in h8s2678_pll_clk_setup()
  h8300: Symbol name fix
  h8300: devicetree source
  h8300: configs
  h8300: IRQ chip driver
  h8300: clocksource
  h8300: clock driver
  h8300: Build scripts
  h8300: library functions
  h8300: Memory management
  h8300: miscellaneous functions
  h8300: process helpers
  h8300: compressed image support
  h8300: Low level entry
  h8300: kernel startup
  h8300: Interrupt and exceptions
  ...

crypto: rsa - add .gitignore for crypto/*.-asn1.[ch] files

There are two generated files: crypto/rsakey-asn1.c and crypto/raskey-asn1.h,
after the cfc2bb32b31371d6bffc6bf2da3548f20ad48c83 commit. Let's add
.gitignore to ignore *-asn1.[ch] files.

Signed-off-by: Alexander Kuleshov <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: asymmetric_keys/rsa - Use non-conflicting variable name

arm64:allmodconfig fails to build as follows.

In file included from include/acpi/platform/aclinux.h:74:0,
                 from include/acpi/platform/acenv.h:173,
                 from include/acpi/acpi.h:56,
                 from include/linux/acpi.h:37,
                 from ./arch/arm64/include/asm/dma-mapping.h:21,
                 from include/linux/dma-mapping.h:86,
                 from include/linux/skbuff.h:34,
                 from include/crypto/algapi.h:18,
                 from crypto/asymmetric_keys/rsa.c:16:
include/linux/ctype.h:15:12: error: expected ‘;’, ‘,’ or ‘)’
before numeric constant
#define _X 0x40 /* hex digit */
            ^
crypto/asymmetric_keys/rsa.c:123:47: note: in expansion of macro ‘_X’
static int RSA_I2OSP(MPI x, size_t xLen, u8 **_X)
                                               ^
crypto/asymmetric_keys/rsa.c: In function ‘RSA_verify_signature’:
crypto/asymmetric_keys/rsa.c:256:2: error:
implicit declaration of function ‘RSA_I2OSP’

The problem is caused by an unrelated include file change, resulting in
the inclusion of ctype.h on arm64. This in turn causes the local variable
_X to conflict with macro _X used in ctype.h.

Fixes: b6197b93fa4b ("arm64 : Introduce support for ACPI _CCA object")
Cc: Suthikulpanit, Suravee <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: testmgr - don't print info about missing test for gcm-aes-aesni

Don't print info about missing test for the internal
helper __driver-gcm-aes-aesni

changes in v2:
- marked test as fips allowed

Signed-off-by: Tadeusz Struk <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: jitterentropy - Delete unnecessary checks before the function call "kzfree"

The kzfree() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: akcipher - fix spelling cihper -> cipher

Signed-off-by: Tadeusz Struk <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: jitterentropy - avoid compiler warnings

The core of the Jitter RNG is intended to be compiled with -O0. To
ensure that the Jitter RNG can be compiled on all architectures,
separate out the RNG core into a stand-alone C file that can be compiled
with -O0 which does not depend on any kernel include file.

As no kernel includes can be used in the C file implementing the core
RNG, any dependencies on kernel code must be extracted.

A second file provides the link to the kernel and the kernel crypto API
that can be compiled with the regular compile options of the kernel.

Signed-off-by: Stephan Mueller <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

Merge branch 'sparc-perf-stack'

David Ahern says:

====================
sparc64: perf fixes for userspace stacks

Coming back to the perf userspace callchain problem. As a reminder there are
a series of problems trying to use perf to collect callchains with scheduling
tracepoints, e.g., perf sched record -g -- <cmd>.

The first patch disables pagefaults while walking the user stack. As discussed
a couple of months ago this is the right fix, but I was puzzled as to why
processes were terminating with sigbus (and sometimes sigsegv). I believe the
root of this problem is bad addresses trying to walk the frames using frame
pointers. The bad addresses lead to faults that get handled by do_sparc64_fault
and it aborts the task though I am still puzzled as to why it gets past this
check in do_sparc64_fault:

if (in_atomic() || !mm)
goto intr_or_no_mm;

pagefault_disable bumps the preempt_count which should make in_atomic return != 0
(building kernels with preemption set to voluntar, CONFIG_PREEMPT_VOLUNTARY=y).

While this set does not fully solve the problem it does prevent a number of
pain points with the current code, most notably able to lock up the system.
====================

Signed-off-by: David S. Miller <[email protected]>

sparc64: perf: Use UREG_FP rather than UREG_I6

perf walks userspace callchains by following frame pointers. Use the
UREG_FP macro to make it clearer that the %fp is being used.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sparc64: perf: Add sanity checking on addresses in user stack

Processes are getting killed (sigbus or segv) while walking userspace
callchains when using perf. In some instances I have seen ufp = 0x7ff
which does not seem like a proper stack address.

This patch adds a function to run validity checks against the address
before attempting the copy_from_user. The checks are copied from the
x86 version as a start point with the addition of a 4-byte alignment
check.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sparc64: Convert BUG_ON to warning

Pagefault handling has a BUG_ON path that panics the system. Convert it to
a warning instead. There is no need to bring down the system for this kind
of failure.

The following was hit while running:
    perf sched record -g -- make -j 16

[3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
[3609412.782833]               \|/ ____ \|/
[3609412.782833]               "@'/ .. \`@"
[3609412.782833]               /_| \__/ |_\
[3609412.782833]                  \__U_/
[3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
[3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
[3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
[3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
[3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
[3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
[3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
[3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
[3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
[3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
[3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
[3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
[3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
[3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
[3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
[3609412.783308] Call Trace:
[3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
[3609412.783353] Disabling lock debugging due to kernel taint
[3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
[3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
[3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d02005> 82086100  02f87f7b  808a2873  81cfe008  01000000
[3609412.783542] Kernel panic - not syncing: Fatal exception
[3609412.784605] Press Stop-A (L1-A) to return to the boot prom
[3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception

With this patch rather than a panic I occasionally get something like this:
    perf sched record -g -m 1024  -- make -j N

where N is based on number of cpus (128 to 1024 for a T7-4 and 8 for an 8 cpu
VM on a T5-2).

WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
address (7feffcd6000) != regs->tpc (fff80001004873c0)
Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
Call Trace:
[000000000045ce30] warn_slowpath_common+0x7c/0xa0
[000000000045ceec] warn_slowpath_fmt+0x30/0x40
[000000000098ad64] do_sparc64_fault+0x340/0x70c
[0000000000407c2c] sparc64_realfault_common+0x10/0x20
---[ end trace 62ee02065a01a049 ]---
ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]

The segfault is horrible, but better than a system panic.

An 8-cpu VM on a T5-2 also showed the above traces from time to time,
so it is a general problem and not specific to the T7 or baremetal.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sparc: perf: Disable pagefaults while walking userspace stacks

Page faults generated walking userspace stacks can call schedule to switch
out the task. When collecting callchains for scheduler tracepoints this
causes a deadlock as the tracepoints can be hit with the runqueue lock held:

[ 8138.159054] WARNING: CPU: 758 PID: 12488 at /opt/dahern/linux.git/arch/sparc/kernel/nmi.c:80 perfctr_irq+0x1f8/0x2b4()

[ 8138.203152] Watchdog detected hard LOCKUP on cpu 758

[ 8138.410969] CPU: 758 PID: 12488 Comm: perf Not tainted 4.0.0-rc6+ #6
[ 8138.437146] Call Trace:
[ 8138.447193]  [000000000045cdd4] warn_slowpath_common+0x7c/0xa0
[ 8138.471238]  [000000000045ce90] warn_slowpath_fmt+0x30/0x40
[ 8138.494189]  [0000000000983e38] perfctr_irq+0x1f8/0x2b4
[ 8138.515716]  [00000000004209f4] tl0_irq15+0x14/0x20
[ 8138.535791]  [00000000009839ec] _raw_spin_trylock_bh+0x68/0x108
[ 8138.560180]  [0000000000980018] __schedule+0xcc/0x710
[ 8138.580981]  [00000000009806dc] preempt_schedule_common+0x10/0x3c
[ 8138.606082]  [000000000098077c] _cond_resched+0x34/0x44
[ 8138.627603]  [0000000000565990] kmem_cache_alloc_node+0x24/0x1a0
[ 8138.652345]  [0000000000450b60] tsb_grow+0xac/0x488
[ 8138.672429]  [0000000000985040] do_sparc64_fault+0x4dc/0x6e4
[ 8138.695736]  [0000000000407c2c] sparc64_realfault_common+0x10/0x20
[ 8138.721202]  [00000000006f2e24] NG4copy_from_user+0xa4/0x3c0
[ 8138.744510]  [000000000044f900] perf_callchain_user+0x5c/0x6c
[ 8138.768182]  [0000000000517b5c] perf_callchain+0x16c/0x19c
[ 8138.790774]  [0000000000515f84] perf_prepare_sample+0x68/0x218
[ 8138.814801] ---[ end trace 42ca6294b1ff7573 ]---

As with PowerPC (b59a1bfcc240, "powerpc/perf: Disable pagefaults during
callchain stack read") disable pagefaults while walking userspace stacks.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

firmware: dmi: struct dmi_header should be packed

Apparently the compiler does fine without it, but it feels safer and
clearer to add the missing attribute.

Signed-off-by: Jean Delvare <[email protected]>

firmware: dmi_scan: Coding style cleanups

Signed-off-by: Jean Delvare <[email protected]>

Documentation: ABI: sysfs-firmware-dmi: add -entries suffix to file name

The dmi-sysfs module adds DMI table structures entries under
/sys/firmware/dmi/entries only, so rename documentation file to
sysfs-firmware-dmi-entries as more appropriate. Without renaming it's
confusing to differ this from sysfs-firmware-dmi-tables that adds raw
DMI table and actually adds "dmi" kobject.

Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>

firmware: dmi_scan: add SBMIOS entry and DMI tables

Some utils, like dmidecode and smbios, need to access SMBIOS entry
table area in order to get information like SMBIOS version, size, etc.
Currently it's done via /dev/mem. But for situation when /dev/mem
usage is disabled, the utils have to use dmi sysfs instead, which
doesn't represent SMBIOS entry and adds code/delay redundancy when direct
access for table is needed.

So this patch creates dmi/tables and adds SMBIOS entry point to allow
utils in question to work correctly without /dev/mem. Also patch adds
raw dmi table to simplify dmi table processing in user space, as
proposed by Jean Delvare.

Tested-by: Roy Franz <[email protected]>
Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>

firmware: dmi_scan: Trim DMI table length before exporting it

The SMBIOS v3 entry points specify a maximum length for the DMI table,
not the exact length. Thus there may be garbage after the end-of-table
marker, which we don't want to export to user-space. Adjust dmi_len
when we find the end-of-table marker, so that only the actual table
payload is exported.

Signed-off-by: Jean Delvare <[email protected]>
Cc: Ivan Khoronzhuk <[email protected]>

firmware: dmi_scan: Rename dmi_table to dmi_decode_table

The "dmi_table" function looks like data instance, but it does DMI
table decode. This patch renames it to "dmi_decode_table" name as
more appropriate. That allows us to use "dmi_table" name for correct
purposes.

Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>

firmware: dmi: List my quilt tree

I'll be maintaining the pending patches to the dmi_scan and dmi-id
drivers as a quilt tree.

Signed-off-by: Jean Delvare <[email protected]>

firmware: dmi_scan: Only honor end-of-table for 64-bit tables

A 32-bit entry point to a DMI table says how many structures the table
contains. The SMBIOS specification explicitly says that end-of-table
markers should be ignored if they are not actually at the end of the
DMI table. So only honor the end-of-table marker for tables accessed
through 64-bit entry points, as they do not specify a structure count.

Fixes: fc43026278 ("dmi: add support for SMBIOS 3.0 64-bit entry point")
Signed-off-by: Jean Delvare <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Cc: Leif Lindholm <[email protected]>
Cc: Matt Fleming <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge first patchbomb from Andrew Morton:

- a few misc things

- ocfs2 udpates

- kernel/watchdog.c feature work (took ages to get right)

- most of MM.  A few tricky bits are held up and probably won't make 4.2.

* emailed patches from Andrew Morton <[email protected]>: (91 commits)
  mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
  mm, thp: respect MPOL_PREFERRED policy with non-local node
  tmpfs: truncate prealloc blocks past i_size
  mm/memory hotplug: print the last vmemmap region at the end of hot add memory
  mm/mmap.c: optimization of do_mmap_pgoff function
  mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
  mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
  mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
  mm: kmemleak: fix delete_object_*() race when called on the same memory block
  mm: kmemleak: allow safe memory scanning during kmemleak disabling
  memcg: convert mem_cgroup->under_oom from atomic_t to int
  memcg: remove unused mem_cgroup->oom_wakeups
  frontswap: allow multiple backends
  x86, mirror: x86 enabling - find mirrored memory ranges
  mm/memblock: allocate boot time data structures from mirrored memory
  mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
  mm: do not ignore mapping_gfp_mask in page cache allocation paths
  mm/cma.c: fix typos in comments
  mm/oom_kill.c: print points as unsigned int
  mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
  ...

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore updates from Tony Luck:
"Miscellaneous pstore improvements"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  ramoops: make it possible to change mem_type param.
  pstore/ram: verify ramoops header before saving record
  fs/pstore: Optimization function ramoops_init_przs
  fs/pstore: update the backend parameter in pstore module
  pstore: do not use message compression without lock

Merge tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"New features:
   - per-file encryption (e.g., ext4)
   - FALLOC_FL_ZERO_RANGE
   - FALLOC_FL_COLLAPSE_RANGE
   - RENAME_WHITEOUT

  Major enhancement/fixes:
   - recovery broken superblocks
   - enhance f2fs_trim_fs with a discard_map
   - fix a race condition on dentry block allocation
   - fix a deadlock during summary operation
   - fix a missing fiemap result

  .. and many minor bug fixes and clean-ups were done"

* tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
  f2fs: do not trim preallocated blocks when truncating after i_size
  f2fs crypto: add alloc_bounce_page
  f2fs crypto: fix to handle errors likewise ext4
  f2fs: drop the volatile_write flag only
  f2fs: skip committing valid superblock
  f2fs: setting discard option in parse_options()
  f2fs: fix to return exact trimmed size
  f2fs: support FALLOC_FL_INSERT_RANGE
  f2fs: hide common code in f2fs_replace_block
  f2fs: disable the discard option when device doesn't support
  f2fs crypto: remove alloc_page for bounce_page
  f2fs: fix a deadlock for summary page lock vs. sentry_lock
  f2fs crypto: clean up error handling in f2fs_fname_setup_filename
  f2fs crypto: avoid f2fs_inherit_context for symlink
  f2fs crypto: do not set encryption policy for non-directory by ioctl
  f2fs crypto: allow setting encryption policy once
  f2fs crypto: check context consistent for rename2
  f2fs: avoid duplicated code by reusing f2fs_read_end_io
  f2fs crypto: use per-inode tfm structure
  f2fs: recovering broken superblock during mount
  ...

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF fixes and cleanups from Jan Kara:
"The contains some small fixes and improvements in error handling for
  UDF.

  Bundled is also one ext3 coding style fix and a fix in quota
  documentation"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: fix udf_load_pvoldesc()
  udf: remove double err declaration in udf_file_write_iter()
  UDF: support NFSv2 export
  fs: ext3: super: fixed a space coding style issue
  quota: Update documentation
  udf: Return error from udf_find_entry()
  udf: Make udf_get_filename() return error instead of 0 length file name
  udf: bug on exotic flag in udf_get_filename()
  udf: improve error management in udf_CS0toNLS()
  udf: improve error management in udf_CS0toUTF8()
  udf: unicode: update function name in comments
  udf: remove unnecessary test in udf_build_ustr_exact()
  udf: Return -ENOMEM when allocation fails in udf_get_filename()

Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6

Pull documentation updates from Jonathan Corbet:
"The main thing here is Ingo's big subdirectory documenting feature
  support for each architecture.  Beyond that, it's the usual pile of
  fixes, tweaks, and small additions"

* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (79 commits)
  doc:md: fix typo in md.txt.
  Documentation/mic/mpssd: don't build x86 userspace when cross compiling
  Documentation/prctl: don't build tsc tests when cross compiling
  Documentation/vDSO: don't build tests when cross compiling
  Doc:ABI/testing: Fix typo in sysfs-bus-fcoe
  Doc: Docbook: Change wikipedia's URL from http to https in scsi.tmpl
  Doc: Change wikipedia's URL from http to https
  Documentation/kernel-parameters: add missing pciserial to the earlyprintk
  Doc:pps: Fix typo in pps.txt
  kbuild : Fix documentation of INSTALL_HDR_PATH
  Documentation: filesystems: updated struct file_operations documentation in vfs.txt
  kbuild: edit explanation of clean-files variable
  Doc: ja_JP: Fix typo in HOWTO
  Move freefall program from Documentation/ to tools/
  Documentation: ARM: EXYNOS: Describe boot loaders interface
  Doc:nfc: Fix typo in nfc-hci.txt
  vfs: Minor documentation fix
  Doc: networking: txtimestamp: fix printf format warning
  Documentation, intel_pstate: Improve legacy mode internal governors description
  Documentation: extend use case for EXPORT_SYMBOL_GPL()
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input subsystem updates from Dmitry Torokhov:
"Thanks to Samuel Thibault input device (keyboard) LEDs are no longer
  hardwired within the input core but use LED subsystem and so allow use
  of different triggers; Hans de Goede did a large update for the ALPS
  touchpad driver; we have new TI drv2665 haptics driver and DA9063
  OnKey driver, and host of other drivers got various fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (55 commits)
  Input: pixcir_i2c_ts - fix receive error
  MAINTAINERS: remove non existent input mt git tree
  Input: improve usage of gpiod API
  tty/vt/keyboard: define LED triggers for VT keyboard lock states
  tty/vt/keyboard: define LED triggers for VT LED states
  Input: export LEDs as class devices in sysfs
  Input: cyttsp4 - use swap() in cyttsp4_get_touch()
  Input: goodix - do not explicitly set evbits in input device
  Input: goodix - export id and version read from device
  Input: goodix - fix variable length array warning
  Input: goodix - fix alignment issues
  Input: add OnKey driver for DA9063 MFD part
  Input: elan_i2c - add product IDs FW names
  Input: elan_i2c - add support for multi IC type and iap format
  Input: focaltech - report finger width to userspace
  tty: remove platform_sysrq_reset_seq
  Input: synaptics_i2c - use proper boolean values
  Input: psmouse - use true instead of 1 for boolean values
  Input: cyapa - fix a few typos in comments
  Input: stmpe-ts - enforce device tree only mode
  ...

Merge tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp

Pull EDAC updates from Borislav Petkov:

- New APM X-Gene SoC EDAC driver (Loc Ho)

- AMD error injection module improvements (Aravind Gopalakrishnan)

- Altera Arria 10 support (Thor Thayer)

- misc fixes and cleanups all over the place

* tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits)
  EDAC: Update Documentation/edac.txt
  EDAC: Fix typos in Documentation/edac.txt
  EDAC, mce_amd_inj: Set MISCV on injection
  EDAC, mce_amd_inj: Move bit preparations before the injection
  EDAC, mce_amd_inj: Cleanup and simplify README
  EDAC, altera: Do not allow suspend when EDAC is enabled
  EDAC, mce_amd_inj: Make inj_type static
  arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support
  EDAC, altera: Add Arria10 EDAC support
  EDAC, altera: Refactor for Altera CycloneV SoC
  EDAC, altera: Generalize driver to use DT Memory size
  EDAC, mce_amd_inj: Add README file
  EDAC, mce_amd_inj: Add individual permissions field to dfs_node
  EDAC, mce_amd_inj: Modify flags attribute to use string arguments
  EDAC, mce_amd_inj: Read out number of MCE banks from the hardware
  EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too
  EDAC, xgene: Fix cpuid abuse
  EDAC, mpc85xx: Extend error address to 64 bit
  EDAC, mpc8xxx: Adapt for FSL SoC
  EDAC, edac_stub: Drop arch-specific include
  ...

Merge tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
"Here is the bulk of pin control changes for the v4.2 series: Quite a
  lot of new SoC subdrivers and two new main drivers this time, apart
  from that business as usual.

  Details:

  Core functionality:
   - Enable exclusive pin ownership: it is possible to flag a pin
     controller so that GPIO and other functions cannot use a single pin
     simultaneously.

  New drivers:
   - NXP LPC18xx System Control Unit pin controller
   - Imagination Pistachio SoC pin controller

  New subdrivers:
   - Freescale i.MX7d SoC
   - Intel Sunrisepoint-H PCH
   - Renesas PFC R8A7793
   - Renesas PFC R8A7794
   - Mediatek MT6397, MT8127
   - SiRF Atlas 7
   - Allwinner A33
   - Qualcomm MSM8660
   - Marvell Armada 395
   - Rockchip RK3368

  Cleanups:
   - A big cleanup of the Marvell MVEBU driver rectifying it to
     correspond to reality
   - Drop platform device probing from the SH PFC driver, we are now a
     DT only shop for SuperH
   - Drop obsolte multi-platform check for SH PFC
   - Various janitorial: constification, grammar etc

  Improvements:
   - The AT91 GPIO portions now supports the set_multiple() feature
   - Split out SPI pins on the Xilinx Zynq
   - Support DTs without specific function nodes in the i.MX driver"

* tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits)
  pinctrl: rockchip: add support for the rk3368
  pinctrl: rockchip: generalize perpin driver-strength setting
  pinctrl: sh-pfc: r8a7794: add SDHI pin groups
  pinctrl: sh-pfc: r8a7794: add MMCIF pin groups
  pinctrl: sh-pfc: add R8A7794 PFC support
  pinctrl: make pinctrl_register() return proper error code
  pinctrl: mvebu: armada-39x: add support for Armada 395 variant
  pinctrl: mvebu: armada-39x: add missing SATA functions
  pinctrl: mvebu: armada-39x: add missing PCIe functions
  pinctrl: mvebu: armada-38x: add ptp functions
  pinctrl: mvebu: armada-38x: add ua1 functions
  pinctrl: mvebu: armada-38x: add nand functions
  pinctrl: mvebu: armada-38x: add sata functions
  pinctrl: mvebu: armada-xp: add dram functions
  pinctrl: mvebu: armada-xp: add nand rb function
  pinctrl: mvebu: armada-xp: add spi1 function
  pinctrl: mvebu: armada-39x: normalize ref clock naming
  pinctrl: mvebu: armada-xp: rename spi to spi0
  pinctrl: mvebu: armada-370: align spi1 clock pin naming
  pinctrl: mvebu: armada-370: align VDD cpu-pd pin naming with datasheet
  ...

drm/dp/mst: close deadlock in connector destruction.

I've only seen this once, and I failed to capture the
lockdep backtrace, but I did some investigations.

If we are calling into the MST layer from EDID probing,
we have the mode_config mutex held, if during that EDID
probing, the MST hub goes away, then we can get a deadlock
where the connector destruction function in the driver
tries to retake the mode config mutex.

This offloads connector destruction to a workqueue,
and avoid the subsequenct lock ordering issue.

Acked-by: Daniel Vetter <[email protected]>
Cc: [email protected]
Signed-off-by: Dave Airlie <[email protected]>

Merge tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
"Changes to existing drivers:

   - supply MODULE_DEVICE_TABLE() to ensure probing
   - constify struct; da9052_bl
   - enable compile test; lcd_l4f00242t03, lcd_lms283fg05, backlight_gpio
   - suspend/resume bugfix; lp855x_bl
   - devm_gpiod_get_optional() API fixup; pwm_bl
   - error handling fixup; backlight"

* tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: Change the return type of backlight_update_status() to int
  backlight: pwm_bl: Simplify usage of devm_gpiod_get_optional
  backlight: lp855x: Don't clear level on suspend/blank
  backlight: Allow compile test of GPIO consumers if !GPIOLIB
  video: backlight: da9052: Constify platform_device_id
  gpio-backlight: Discover driver during boot time

mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()

Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
following INFO splat is logged:

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.1.0-rc7-next-20150612 #1 Not tainted
  -------------------------------
  kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
  other info that might help us debug this:
  rcu_scheduler_active = 1, debug_locks = 0
   3 locks held by systemd/1:
   #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
   #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
   #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
  stack backtrace:
  CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
  Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
  Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

Additional backtrace lines are truncated.  In addition, the above splat
is followed by several "BUG: sleeping function called from invalid
context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
uses GFP_KERNEL for its allocations, whereas it should follow the gfp
from its callers.

Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Kamalesh Babulal <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Larry Finger <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: <[email protected]> [3.18+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: respect MPOL_PREFERRED policy with non-local node

Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
local node"), we handle THP allocations on page fault in a special way -
for non-interleave memory policies, the allocation is only attempted on
the node local to the current CPU, if the policy's nodemask allows the
node.

This is motivated by the assumption that THP benefits cannot offset the
cost of remote accesses, so it's better to fallback to base pages on the
local node (which might still be available, while huge pages are not due
to fragmentation) than to allocate huge pages on a remote node.

The nodemask check prevents us from violating e.g.  MPOL_BIND policies
where the local node is not among the allowed nodes.  However, the
current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different than the
current CPU's local node.

In such case we should honor the preferred node and not use the local
node, which is what this patch does.  If hugepage allocation on the
preferred node fails, we fall back to base pages and don't try other
nodes, with the same motivation as is done for the local node hugepage
allocations.  The patch also moves the MPOL_INTERLEAVE check around to
simplify the hugepage specific test.

The difference can be demonstrated using in-tree transhuge-stress test
on the following 2-node machine where half memory on one node was
occupied to show the difference.

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 7878 MB
node 0 free: 3623 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 8045 MB
node 1 free: 7818 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Before the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages

Number of successful THP allocations corresponds to free memory on node 0 in
the first case and node 1 in the second case, i.e. -p parameter is ignored and
cpu binding "wins".

After the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -p1 -C0 ./transhuge-stress
transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages

The -p parameter is respected regardless of cpu binding.

> numactl -C0 ./transhuge-stress
transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -C12 ./transhuge-stress
transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages

Without -p parameter, hugepage restriction to CPU-local node works as before.

Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]> [4.0+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tmpfs: truncate prealloc blocks past i_size

One of the rocksdb people noticed that when you do something like this

    fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
    pwrite(fd, buf, 5M, 0)
    ftruncate(5M)

on tmpfs, the file would still take up 10M: which led to super fun
issues because we were getting ENOSPC before we thought we should be
getting ENOSPC.  This patch fixes the problem, and mirrors what all the
other fs'es do (and was agreed to be the correct behaviour at LSF).

I tested it locally to make sure it worked properly with the following

    xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file

Without the patch we have "Blocks: 20480", with the patch we have the
correct value of "Blocks: 10240".

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory hotplug: print the last vmemmap region at the end of hot add memory

When hot add two nodes continuously, we found the vmemmap region info is
a bit messed.  The last region of node 2 is printed when node 3 hot
added, like the following:

  Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
   On node 2 totalpages: 0
   Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
   Policy zone: Normal
   init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
    [mem 0x40000000000-0x407ffffffff] page 1G
    [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
    [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
  ...
    [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
    [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
  Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
   On node 3 totalpages: 0
   Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
   Policy zone: Normal
   init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
    [mem 0x60000000000-0x607ffffffff] page 1G
    [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
    [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
    [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
    [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
  ...

The cause is the last region was missed at the and of hot add memory,
and p_start, p_end, node_start were not reset, so when hot add memory to
a new node, it will consider they are not contiguous blocks and print
the previous one.  So we print the last vmemmap region at the end of hot
add memory to avoid the confusion.

Signed-off-by: Zhu Guihua <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: optimization of do_mmap_pgoff function

The simple check for zero length memory mapping may be performed
earlier. So that in case of zero length memory mapping some unnecessary
code is not executed at all. It does not make the code less readable
and saves some CPU cycles.

Signed-off-by: Piotr Kwapulinski <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan

The kmemleak memory scanning uses finer grained object->lock spinlocks
primarily to avoid races with the memory block freeing.  However, the
pointer lookup in the rb tree requires the kmemleak_lock to be held.
This is currently done in the find_and_get_object() function for each
pointer-like location read during scanning.  While this allows a low
latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
slower.

This patch moves the kmemleak_lock outside the scan_block() loop,
acquiring/releasing it only once per scanned memory block.  The
allow_resched logic is moved outside scan_block() and a new
scan_large_block() function is implemented which splits large blocks in
MAX_SCAN_SIZE chunks with cond_resched() calls in-between.  A redundant
(object->flags & OBJECT_NO_SCAN) check is also removed from
scan_object().

With this patch, the kmemleak scanning performance is significantly
improved: at least 50% with lock debugging disabled and over an order of
magnitude with lock proving enabled (on an arm64 system).

Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: avoid deadlock on the kmemleak object insertion error path

While very unlikely (usually kmemleak or sl*b bug), the create_object()
function in mm/kmemleak.c may fail to insert a newly allocated object into
the rb tree.  When this happens, kmemleak disables itself and prints
additional information about the object already found in the rb tree.
Such printing is done with the parent->lock acquired, however the
kmemleak_lock is already held.  This is a potential race with the scanning
thread which acquires object->lock and kmemleak_lock in a

This patch removes the locking around the 'parent' object information
printing.  Such object cannot be freed or removed from object_tree_root
and object_list since kmemleak_lock is already held.  There is a very
small risk that some of the object data is being modified on another CPU
but the only downside is inconsistent information printing.

Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()

The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
thread to finish via kthread_stop(). Waiting in kthread_stop() while
scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
waits to acquire for scan_mutex.

Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: fix delete_object_*() race when called on the same memory block

Calling delete_object_*() on the same pointer is not a standard use case
(unless there is a bug in the code calling kmemleak_free()). However,
during kmemleak disabling (error or user triggered via /sys), there is a
potential race between kmemleak_free() calls on a CPU and
__kmemleak_do_cleanup() on a different CPU.

The current delete_object_*() implementation first performs a look-up
holding kmemleak_lock, increments the object->use_count and then
re-acquires kmemleak_lock to remove the object from object_tree_root and
object_list.

This patch simplifies the delete_object_*() mechanism to both look up
and remove an object from the object_tree_root and object_list
atomically (guarded by kmemleak_lock). This allows safe concurrent
calls to delete_object_*() on the same pointer without additional
locking for synchronising the kmemleak_free_enabled flag.

A side effect is a slight improvement in the delete_object_*() performance
by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
object->use_count.

Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: allow safe memory scanning during kmemleak disabling

The kmemleak scanning thread can run for minutes.  Callbacks like
kmemleak_free() are allowed during this time, the race being taken care
of by the object->lock spinlock.  Such lock also prevents a memory block
from being freed or unmapped while it is being scanned by blocking the
kmemleak_free() -> ...  -> __delete_object() function until the lock is
released in scan_object().

When a kmemleak error occurs (e.g.  it fails to allocate its metadata),
kmemleak_enabled is set and __delete_object() is no longer called on
freed objects.  If kmemleak_scan is running at the same time,
kmemleak_free() no longer waits for the object scanning to complete,
allowing the corresponding memory block to be freed or unmapped (in the
case of vfree()).  This leads to kmemleak_scan potentially triggering a
page fault.

This patch separates the kmemleak_free() enabling/disabling from the
overall kmemleak_enabled nob so that we can defer the disabling of the
object freeing tracking until the scanning thread completed.  The
kmemleak_free_part() is deliberately ignored by this patch since this is
only called during boot before the scanning thread started.

Signed-off-by: Catalin Marinas <[email protected]>
Reported-by: Vignesh Radhakrishnan <[email protected]>
Tested-by: Vignesh Radhakrishnan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: convert mem_cgroup->under_oom from atomic_t to int

memcg->under_oom tracks whether the memcg is under OOM conditions and is
an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
atomic_t appears to be simple synchronization-wise, when used as a
synchronization construct like here, it's trickier and more error-prone
due to weak memory ordering rules, especially around atomic_read(), and
false sense of security.

For example, both non-trivial read sites of memcg->under_oom are a bit
problematic although not being actually broken.

* mem_cgroup_oom_register_event()

  It isn't explicit what guarantees the memory ordering between event
  addition and memcg->under_oom check.  This isn't broken only because
  memcg_oom_lock is used for both event list and memcg->oom_lock.

* memcg_oom_recover()

  The lockless test doesn't have any explanation why this would be
  safe.

mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
in avoiding locking memcg_oom_lock there.  This patch converts
memcg->under_oom from atomic_t to int, puts their modifications under
memcg_oom_lock and documents why the lockless test in
memcg_oom_recover() is safe.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>