Git Repo - linux.git/log

kthread: kthread worker API cleanup

A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[[email protected]: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kthread: rename probe_kthread_data() to kthread_probe_data()

Patch series "kthread: Kthread worker API improvements"

The intention of this patchset is to make it easier to manipulate and
maintain kthreads.  Especially, I want to replace all the custom main
cycles with a generic one.  Also I want to make the kthreads sleep in a
consistent state in a common place when there is no work.

This patch (of 11):

A good practice is to prefix the names of functions by the name of the
subsystem.

This patch fixes the name of probe_kthread_data().  The other wrong
functions names are part of the kthread worker API and will be fixed
separately.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/tags.sh: enable code completion in VIM

Vim, with the omnicppcomplete(#1) plugin, can do code completion using
information build by ctags. Add flags needed by omnicppcomplete(#2) to
have completion on member of structure.

1: https://github.com/vim-scripts/omnicppcomplete
2: https://github.com/vim-scripts/OmniCppComplete/blob/master/doc/omnicppcomplete.txt#L93

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mathieu Maret <[email protected]>
Cc: Michal Marek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping

Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a
physical address to a virtual one using __va(). However, such physical
addresses may sometimes be located in highmem and using __va() is
incorrect, leading to inconsistent object tracking in kmemleak.

The following functions have been added to the kmemleak API and they take
a physical address as the object pointer. They only perform the
corresponding action if the address has a lowmem mapping:

kmemleak_alloc_phys
kmemleak_free_part_phys
kmemleak_not_leak_phys
kmemleak_ignore_phys

The affected calling places have been updated to use the new kmemleak
API.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Reported-by: Vignesh R <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kdump, vmcoreinfo: report memory sections virtual addresses

KASLR memory randomization can randomize the base of the physical memory
mapping (PAGE_OFFSET), vmalloc (VMALLOC_START) and vmemmap
(VMEMMAP_START). Adding these variables on VMCOREINFO so tools can easily
identify the base of each memory section.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Garnier <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Xunlei Pang <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Eugene Surovegin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/sem.c: add cond_resched in exit_sme

In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in
exit_sem.  Apparently it's possible for the loop to take quite a long time
and it doesn't have a scheduling point in it.  Since the codes is
executing under an rcu read section this may also cause rcu stalls, which
in turn block synchronize_rcu operations, which more or less de-stabilises
the whole system.

Fix this by introducing a cond_resched() at the beginning of the loop.

So this patch fixes the following:

  NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
  CPU: 10 PID: 18119 Comm: httpd Tainted: G           O    4.4.20-clouder2 #6
  Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
  task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
  RIP: 0010:[<ffffffff81614bc7>]  [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30
  RSP: 0018:ffff881c95553e40  EFLAGS: 00000246
  RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
  RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
  RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
  R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
  R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
  FS:  0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
  Call Trace:
    ? exit_sem+0x7c/0x280
    do_exit+0x338/0xb40
    do_group_exit+0x43/0xd0
    SyS_exit_group+0x14/0x20
    entry_SYSCALL_64_fastpath+0x16/0x6e

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nikolay Borisov <[email protected]>
Cc: Herton R. Krzesinski <[email protected]>
Cc: Fabian Frederick <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/msg: avoid waking sender upon full queue

Blocked tasks queued in q_senders waiting for their message to fit in the
queue are blindly awoken every time we think there's a remote chance this
might happen.  This could cause numerous (and expensive -- thundering
herd-ish) bogus wakeups if the queue is still really full.  Adding to the
scheduling cost/overhead, there's also the fact that we need to take the
ipc object lock and requeue ourselves in the q_senders list.

By keeping track of the blocked sender's message size, we can know
previously if the wakeup ought to occur or not.  Otherwise, to maintain
the current wakeup order we just move it to the tail.  This is exactly
what occurs right now if the sender needs to go back to sleep.

The case of EIDRM is left completely untouched, as we need to wakeup all
the tasks, and shouldn't be playing games in the first place.

This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context
switches (avg out of ten runs).  Although these tests are really about
functionality (as opposed to performance), is does show the direct
benefits of the optimization.

[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/msg: make ss_wakeup() kill arg boolean

... 'tis annoying.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/msg: batch queue sender wakeups

Currently the use of wake_qs in sysv msg queues are only for the receiver
tasks that are blocked on the queue.  But blocked sender tasks (due to
queue size constraints) still are awoken with the ipc object lock held,
which can be a problem particularly for small sized queues and far from
gracious for -rt (just like it was for the receiver side).

The paths that actually wakeup a sender are obviously related to when we
are either getting rid of the queue or after (some) space is freed-up
after a receiver takes the msg (msgrcv).  Furthermore, with the exception
of msgrcv, we can always piggy-back on expunge_all that has its own tasks
lined-up for waking.  Finally, upon unlinking the message, it should be no
problem delaying the wakeups a bit until after we've released the lock.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/msg: implement lockless pipelined wakeups

This patch moves the wakeup_process() invocation so it is not done under
the ipc global lock by making use of a lockless wake_q.  With this change,
the waiter is woken up once the message has been assigned and it does not
need to loop on SMP if the message points to NULL.  In the signal case we
still need to check the pointer under the lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -RT
which avoids a busy-loop which pools for the NULL -> !NULL change if the
waiter has a higher priority compared to the waker.

By making use of wake_qs, the logic of sysv msg queues is greatly
simplified (and very well suited as we can batch lockless wakeups),
particularly around the lockless receive algorithm.

This has been tested with Manred's pmsg-shared tool on a "AMD A10-7800
Radeon R7, 12 Compute Cores 4C+8G":

test             |   before   |   after    | diff
-----------------|------------|------------|----------
pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
pmsg-shared 2 60 | 22,884,224 | 24,278,200 | +  ~6.09 %

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/sem.c: fix complex_count vs. simple op race

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:

sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions.  See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

The alternative is to revert the patch that introduced the race.

The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().

Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]> [3.10+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kcov: do not instrument lib/stackdepot.c

There's no point in collecting coverage from lib/stackdepot.c, as it is
not a function of syscall inputs. Disabling kcov instrumentation for that
file will reduce the coverage noise level.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: syzkaller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

config: android: enable CONFIG_SECCOMP

As of Android N, SECCOMP is required. Without it, we will get
mediaextractor error:

E /system/bin/mediaextractor: libminijail: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER): Invalid argument

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
Acked-by: John Stultz <[email protected]>
Cc: Amit Pundir <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

config: android: set SELinux as default security mode

Android won't boot without SELinux enabled, so make it the default.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

config: android: move device mapper options to recommended

CONFIG_MD is in recommended, but other dependent options like DM_CRYPT and
DM_VERITY options are in base. The result is the options in base don't
get enabled when applying both base and recommended fragments. Move all
the options to recommended.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
Acked-by: John Stultz <[email protected]>
Cc: Amit Pundir <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

config/android: Remove CONFIG_IPV6_PRIVACY

Option is long gone, see commit 5d9efa7ee99e ("ipv6: Remove privacy
config option.")

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Rob Herring <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

relay: Use irq_work instead of plain timer for deferred wakeup

Relay avoids calling wake_up_interruptible() for doing the wakeup of
readers/consumers, waiting for the generation of new data, from the
context of a process which produced the data.  This is apparently done to
prevent the possibility of a deadlock in case Scheduler itself is is
generating data for the relay, after acquiring rq->lock.

The following patch used a timer (to be scheduled at next jiffy), for
delegating the wakeup to another context.
commit 7c9cb38302e78d24e37f7d8a2ea7eed4ae5f2fa7
Author: Tom Zanussi <[email protected]>
Date:   Wed May 9 02:34:01 2007 -0700

relay: use plain timer instead of delayed work

relay doesn't need to use schedule_delayed_work() for waking readers
when a simple timer will do.

Scheduling a plain timer, at next jiffies boundary, to do the wakeup
causes a significant wakeup latency for the Userspace client, which makes
relay less suitable for the high-frequency low-payload use cases where the
data gets generated at a very high rate, like multiple sub buffers getting
filled within a milli second.  Moreover the timer is re-scheduled on every
newly produced sub buffer so the timer keeps getting pushed out if sub
buffers are filled in a very quick succession (less than a jiffy gap
between filling of 2 sub buffers).  As a result relay runs out of sub
buffers to store the new data.

By using irq_work it is ensured that wakeup of userspace client, blocked
in the poll call, is done at earliest (through self IPI or next timer
tick) enabling it to always consume the data in time.  Also this makes
relay consistent with printk & ring buffers (trace), as they too use
irq_work for deferred wake up of readers.

[[email protected]: select CONFIG_IRQ_WORK]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Akash Goel <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pps: kc: fix non-tickless system config dependency

CONFIG_NO_HZ currently only sets the default value of dynticks config so
if PPS kernel consumer needs periodic timer ticks it should depend on
!CONFIG_NO_HZ_COMMON instead of !CONFIG_NO_HZ.

Otherwise it is possible to enable it even on tickless system which has
CONFIG_NO_HZ not set and CONFIG_NO_HZ_IDLE (or CONFIG_NO_HZ_FULL) set.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Acked-by: Rodolfo Giometti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips/panic: replace smp_send_stop() with kdump friendly version in panic path

Daniel Walker reported problems which happens when
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online.  As the result, kdump
routines fail to save other CPUs' registers.  Additionally for MIPS
OCTEON, it misses to stop the watchdog timer.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a
weak function, and it just call smp_send_stop().  Architecture
codes should override it so that kdump can work appropriately.
This patch provides MIPS version.

Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080950.11028.28000.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai <[email protected]>
Reported-by: Daniel Walker <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Daniel Walker <[email protected]>
Cc: Xunlei Pang <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Daney <[email protected]>
Cc: Aaro Koskinen <[email protected]>
Cc: "Steven J. Hill" <[email protected]>
Cc: Corey Minyard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/panic: replace smp_send_stop() with kdump friendly version in panic path

Daniel Walker reported problems which happens when
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online.  As the result, for x86, kdump
routines fail to save other CPUs' registers and disable virtualization
extensions.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a weak
function, and it just call smp_send_stop().  Architecture codes should
override it so that kdump can work appropriately.  This patch only
provides x86-specific version.

For Xen's PV kernel, just keep the current behavior.

NOTES:

- Right solution would be to place crash_smp_send_stop() before
  __crash_kexec() invocation in all cases and remove smp_send_stop(), but
  we can't do that until all architectures implement own
  crash_smp_send_stop()

- crash_smp_send_stop()-like work is still needed by
  machine_crash_shutdown() because crash_kexec() can be called without
  entering panic()

Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai <[email protected]>
Reported-by: Daniel Walker <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Daniel Walker <[email protected]>
Cc: Xunlei Pang <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Daney <[email protected]>
Cc: Aaro Koskinen <[email protected]>
Cc: "Steven J. Hill" <[email protected]>
Cc: Corey Minyard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nvme: use the DMA_ATTR_NO_WARN attribute

Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: implement the DMA_ATTR_NO_WARN attribute

Add support for the DMA_ATTR_NO_WARN attribute on powerpc iommu code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dma-mapping: introduce the DMA_ATTR_NO_WARN attribute

Introduce the DMA_ATTR_NO_WARN attribute, and document it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

random: remove unused randomize_range()

All call sites for randomize_range have been updated to use the much
simpler and more robust randomize_addr(). Remove the now unnecessary
code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

unicore32: use simpler API for random address requests

Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Guan Xuetao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tile: use simpler API for random address requests

Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: use simpler API for random address requests

Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Russell King - ARM Linux" <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ARM: use simpler API for random address requests

Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Russell King - ARM Linux" <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86: use simpler API for random address requests

Currently, all callers to randomize_range() set the length to 0 and
calculate end by adding a constant to the start address. We can simplify
the API to remove a bunch of needless checks and variables.

Use the new randomize_addr(start, range) call to set the requested
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

random: simplify API for random address requests

To date, all callers of randomize_range() have set the length to 0, and
check for a zero return value.  For the current callers, the only way to
get zero returned is if end <= start.  Since they are all adding a
constant to the start address, this is unnecessary.

We can remove a bunch of needless checks by simplifying the API to do just
what everyone wants, return an address between [start, start + range).

While we're here, s/get_random_int/get_random_long/.  No current call site
is adversely affected by get_random_int(), since all current range
requests are < UINT_MAX.  However, we should match caller expectations to
avoid coming up short (ha!) in the future.

All current callers to randomize_range() chose to use the start address if
randomize_range() failed.  Therefore, we simplify things by just returning
the start address on error.

randomize_range() will be removed once all callers have been converted
over to randomize_addr().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Cooper <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Roberts, William C" <[email protected]>
Cc: Yann Droneaud <[email protected]>
Cc: Russell King <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: Nick Kralevich <[email protected]>
Cc: Jeffrey Vander Stoep <[email protected]>
Cc: Daniel Cashman <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Guan Xuetao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio/rio_cm: use memdup_user() instead of duplicating code

Fix coccinelle warning about duplicating existing memdup_user function.

Link: http://lkml.kernel.org/r/[email protected]
Link: https://lkml.org/lkml/2016/8/11/29
Signed-off-by: Alexandre Bounine <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Andre van Herk <[email protected]>
Cc: Barry Wood <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ptrace: clear TIF_SYSCALL_TRACE on ptrace detach

On __ptrace_detach(), called from do_exit()->exit_notify()->
forget_original_parent()->exit_ptrace(), the TIF_SYSCALL_TRACE in
thread->flags of the tracee is not cleared up. This results in the
tracehook_report_syscall_* being called (though there's no longer a tracer
listening to that) upon its further syscalls.

Example scenario - attach "strace" to a running process and kill it (the
strace) with SIGKILL. You'll see that the syscall trace hooks are still
being called.

The clearing of this flag should be moved from ptrace_detach() to
__ptrace_detach().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ales Novak <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Jiri Kosina <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: cap initial pipe capacity according to pipe-max-size limit

This is a patch that provides behavior that is more consistent, and
probably less surprising to users. I consider the change optional, and
welcome opinions about whether it should be applied.

By default, pipes are created with a capacity of 64 kiB.  However,
/proc/sys/fs/pipe-max-size may be set smaller than this value.  In this
scenario, an unprivileged user could thus create a pipe whose initial
capacity exceeds the limit. Therefore, it seems logical to cap the
initial pipe capacity according to the value of pipe-max-size.

The test program shown earlier in this patch series can be used to
demonstrate the effect of the change brought about with this patch:

    # cat /proc/sys/fs/pipe-max-size
    1048576
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536
    # echo 10000 > /proc/sys/fs/pipe-max-size
    # cat /proc/sys/fs/pipe-max-size
    16384
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 16384
    # ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536

The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
caps the initial allocation for a new pipe for unprivileged users, but
not for privileged users.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: make account_pipe_buffers() return a value, and use it

This is an optional patch, to provide a small performance
improvement. Alter account_pipe_buffers() so that it returns the
new value in user->pipe_bufs. This means that we can refactor
too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
avoid the costs of repeated use of atomic_long_read() to get the
value user->pipe_bufs.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: fix limit checking in alloc_pipe_info()

The limit checking in alloc_pipe_info() (used by pipe(2) and when
opening a FIFO) has the following problems:

(1) When checking capacity required for the new pipe, the checks against
    the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
    against existing consumption, and exclude the memory required for
    the new pipe capacity. As a consequence: (1) the memory allocation
    throttling provided by the soft limit does not kick in quite as
    early as it should, and (2) the user can overrun the hard limit.

(2) As currently implemented, accounting and checking against the limits
    is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch addresses the above problems as follows:

* Alter the checks against limits to include the memory required for the
  new pipe.
* Re-order the accounting step so that it precedes the buffer allocation.
  If the accounting step determines that a limit has been reached, revert
  the accounting and cause the operation to fail.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: simplify logic in alloc_pipe_info()

Replace an 'if' block that covers most of the code in this function
with a 'goto'. This makes the code a little simpler to read, and also
simplifies the next patch (fix limit checking in alloc_pipe_info())

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: fix limit checking in pipe_set_size()

The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:

(1) When increasing the pipe capacity, the checks against the limits in
    /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
    consumption, and exclude the memory required for the increased pipe
    capacity. The new increase in pipe capacity can then push the total
    memory used by the user for pipes (possibly far) over a limit. This
    can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity is
    less than the existing pipe capacity. This can lead to problems if a
    user sets a large pipe capacity, and then the limits are lowered,
    with the result that the user will no longer be able to decrease the
    pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a)
    simultaneously, and then allocate pipe buffers that are accounted
    for only in step (c).  The race means that the user's pipe buffer
    allocation could be pushed over the limit (by an arbitrary amount,
    depending on how unlucky we were in the race). [Thanks to Vegard
    Nossum for spotting this point, which I had missed.]

This patch addresses the above problems as follows:

* Perform checks against the limits only when increasing a pipe's
  capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
  the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
  allocation. If the accounting step determines that a limit has
  been reached, revert the accounting and cause the operation to fail.

The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)

Problem 1
=========

Using the test program on an unpatched kernel, we first set some
limits:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Then show that we can set a pipe with capacity (100MB) that is
over the hard limit

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728

Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728
        Loop 2: set pipe capacity to 100000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

With a patched kernel, setting a capacity over the limit fails at the
first attempt:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and the default soft/hard limits are high/unlimited.  Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).

Problem 2
=========

Running the test program on an unpatched kernel, we first set some limits:

    # getconf PAGESIZE
    4096
    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 748
    # Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard      # 4.096 MB
    #     Loop 2: set pipe capacity to 5000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

In this case, the user should be able to lower the limit.

With a kernel that has the patch below, the second fcntl()
succeeds:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 3215
    # Initial pipe capacity: 65536
    #     Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard

    #     Loop 2: set pipe capacity to 5000000 bytes
            F_SETPIPE_SZ returned 8388608

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

/* test_F_SETPIPE_SZ.c

   (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

   Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
   and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int (*pfd)[2];
    int npipes;
    int pcap, rcap;
    int j, p, s, stime, loop;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s num-pipes "
                "[pipe-capacity sleep-time]...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    npipes = atoi(argv[1]);

    pfd = calloc(npipes, sizeof (int [2]));
    if (pfd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    for (j = 0; j < npipes; j++) {
        if (pipe(pfd[j]) == -1) {
            fprintf(stderr, "Loop %d: pipe() failed: ", j);
            perror("pipe");
            exit(EXIT_FAILURE);
        }
    }

    printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

    for (j = 2; j < argc; j += 2 ) {
        loop = j / 2;
        pcap = atoi(argv[j]);
        printf("    Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

        for (p = 0; p < npipes; p++) {
            s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
            if (s == -1) {
                fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                        "failed: ", loop, p);
                perror("fcntl");
                exit(EXIT_FAILURE);
            }

            if (p == 0) {
                printf("        F_SETPIPE_SZ returned %d\n", s);
                rcap = s;
            } else {
                if (s != rcap) {
                    fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                            "unexpected return: %d\n", loop, p, s);
                    exit(EXIT_FAILURE);
                }
            }

            stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
            if (stime > 0) {
                printf("        Sleeping %d seconds\n", stime);
                sleep(stime);
            }
        }
    }

    exit(EXIT_SUCCESS);
}

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

Patch history:

v2
   * Switch order of test in 'if' statement to avoid function call
      (to capability()) in normal path. [This is a fix to a preexisting
      wart in the code. Thanks to Willy Tarreau]
    * Perform (size > pipe_max_size) check before calling
      account_pipe_buffers().  [Thanks to Vegard Nossum]
      Quoting Vegard:

        The potential problem happens if the user passes a very large number
        which will overflow pipe->user->pipe_bufs.

        On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
        then round_pipe_size() returns INT_MAX. Although it's true that the
        accounting is done in terms of pages and not bytes, so you'd need on
        the order of (1 << 13) = 8192 processes hitting the limit at the same
        time in order to make it overflow, which seems a bit unlikely.

        (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
        limit checking)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: refactor argument for account_pipe_buffers()

This is a preparatory patch for following work. account_pipe_buffers()
performs accounting in the 'user_struct'. There is no need to pass a
pointer to a 'pipe_inode_info' struct (which is then dereferenced to
obtain a pointer to the 'user' field). Instead, pass a pointer directly
to the 'user_struct'. This change is needed in preparation for a
subsequent patch that the fixes the limit checking in alloc_pipe_info()
(and the resulting code is a little more logical).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: move limit checking logic into pipe_set_size()

This is a preparatory patch for following work. Move the F_SETPIPE_SZ
limit-checking logic from pipe_fcntl() into pipe_set_size(). This
simplifies the code a little, and allows for reworking required in
a later patch that fixes the limit checking in pipe_set_size()

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pipe: relocate round_pipe_size() above pipe_set_size()

Patch series "pipe: fix limit handling", v2.

When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.

While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:

(1) When increasing the pipe capacity, the checks against the limits
    in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
    existing consumption, and exclude the memory required for the
    increased pipe capacity. The new increase in pipe capacity can then
    push the total memory used by the user for pipes (possibly far) over
    a limit. This can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity
    is less than the existing pipe capacity. This can lead to problems
    if a user sets a large pipe capacity, and then the limits are
    lowered, with the result that the user will no longer be able to
    decrease the pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch series addresses these three problems.

This patch (of 8):

This is a minor preparatory patch.  After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michael Kerrisk <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: refactor ioctl fn vector in iookup_dev_ioctl()

cmd part of this struct is the same as an index of itself within
_ioctls[]. In fact this cmd is unused, so we can drop this part.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: remove possibly misleading /* #define DEBUG */

Having this in autofs_i.h gives illusion that uncommenting this enables
pr_debug(), but it doesn't enable all the pr_debug() in autofs because
inclusion order matters.

XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS
seems to have a rule to include this prior to other XFS headers as well as
kernel headers. This is not the case with autofs, and DEBUG could be
enabled via Makefile, so autofs should just get rid of this comment to
make the code less confusing. It's a comment, so there is literally no
functional difference.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs4: move linux/auto_dev-ioctl.h to uapi/linux

Since linux/auto_dev-ioctl.h wasn't included in include/linux/Kbuild
it wasn't moved to uapi/linux as part of the uapi series.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Kent <[email protected]>
Cc: Tomohiro Kusumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: move inclusion of linux/limits.h to uapi

linux/limits.h should be included by uapi instead of linux/auto_fs.h
so as not to cause compile error in userspace.

# cat << EOF > ./test1.c
> #include <stdio.h>
> #include <linux/auto_fs.h>
> int main(void) {
>     return 0;
> }
> EOF
# gcc -Wall -g ./test1.c
In file included from ./test1.c:2:0:
/usr/include/linux/auto_fs.h:54:12: error: 'NAME_MAX' undeclared here (not in a function)
   char name[NAME_MAX+1];
             ^

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix print format for ioctl warning message

All other warnings use "cmd(0x%08x)" and this is the only one with
"cmd(%d)". (below comes from my userspace debug program, but not
automount daemon)

[ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMD

No functional changes, based on the following justification.

1. Make the code more consistent using the ioctl vector _ioctls[],
   rather than assigning NULL only for this ioctl command.
2. Remove goto done; for better maintainability in the long run.
3. The existing code is based on the fact that validate_dev_ioctl()
   sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD
   should explicitly set it regardless of the default behavior.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix dev ioctl number range check

The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong.

The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD
and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between
AUTOFS_IOC_COUNT and 11 (21).

[[email protected]: fix typo that made the count macro negative]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Kent <[email protected]>
Cc: Tomohiro Kusumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix pr_debug() message

This isn't a return value, so change the message to indicate the status is
the result of may_umount().

(or locate pr_debug() after put_user() with the same message)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: update struct autofs_dev_ioctl in Documentation

Sync with changes made by commit 730c9eeca980 ("autofs4: improve
parameter usage") which introduced an union for various ioctl commands
instead of having statically named arg1,2.

This commit simply replaces arg1,2 with the corresponding fields without
changing semantics.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix Documentation regarding devid on ioctl

The explanation on how ioctl handles devid seems incorrect. Userspace who
calls this ioctl has no input regarding devid, and ioctl implementation
retrieves devid via superblock.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: remove AUTOFS_DEVID_LEN

This macro was never used by neither kernel nor userspace, and also
doesn't represent "devid length" in bytes. (unless it was added to mean
something else).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: don't fail to free_dev_ioctl(param)

Returning -ENOTTY here fails to free dynamically allocated param.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: remove obsolete sb fields

These two were left from commit aa55ddf340c9 ("autofs4: remove unused
ioctls") which removed unused ioctls.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: use autofs4_free_ino() to kfree dentry data

kfree dentry data allocated by autofs4_new_ino() with autofs4_free_ino()
instead of raw kfree. (since we have the interface to free autofs_info*)

This patch was modified to remove the need to set the dentry info field to
NULL dew to a change in the previous patch.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: remove ino free in autofs4_dir_symlink()

The inode allocation failure case in autofs4_dir_symlink() frees the
autofs dentry info of the dentry without setting ->d_fsdata to NULL.

That could lead to a double free so just get rid of the free and leave it
to ->d_release().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Kent <[email protected]>
Cc: Tomohiro Kusumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: add WARN_ON(1) for non dir/link inode case

It's invalid if the given mode is neither dir nor link, so warn on else
case.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix autofs4_fill_super() error exit handling

Somewhere along the line the error handling gotos have become incorrect.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Kent <[email protected]>
Cc: Tomohiro Kusumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: test autofs versions first on sb initialization

This patch does what the below comment says. It could be and it's
considered better to do this first before various functions get called
during initialization.

/* Couldn't this be tested earlier? */

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: drop unnecessary extern in autofs_i.h

autofs4_kill_sb() doesn't need to be declared as extern, and no other
functions in .h are explicitly declared as extern.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

autofs: fix typos in Documentation/filesystems/autofs4.txt

plus minor whitespace fixes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tomohiro Kusumi <[email protected]>
Signed-off-by: Ian Kent <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kprobes: include <asm/sections.h> instead of <asm-generic/sections.h>

asm-generic headers are generic implementations for architecture specific
code and should not be included by common code. Thus use the asm/ version
of sections.h to get at the linker sections.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve the octal permissions tests

The function calls with octal permissions commonly span multiple lines.
The current test is line oriented and fails to find some matches.

Make the test use the $stat variable instead of the $line variable to span
multiple lines.

Also add a few functions to the known functions with permissions list.

Move the SYMBOLIC_PERMS test to a separate section to find all the S_<FOO>
permissions in any form not just those that have specific function names.

This can now find and fix permissions uses like:
.mode = S_<FOO> | S_<BAR>;

Link: http://lkml.kernel.org/r/b51bab60530912aae4ac420119d465c5b206f19f.1475030406.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Tested-by: Ramiro Oliveira <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: add warning for unnamed function definition arguments

Function definitions without identifiers like
int foo(int)
are not preferred. Emit a warning when they occur.

Link: http://lkml.kernel.org/r/94fe6378504745991b650f48fc92bb4648f25706.1474925354.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve MACRO_ARG_PRECEDENCE test

It is possible for a multiple line macro definition to have a false positive
report when an argument is used on a line after a continuation \.

This line might have a leading '+' as the initial character that could be
confused by checkpatch as an operator.

Avoid the leading character on multiple line macro definitions.

Link: http://lkml.kernel.org/r/60229d13399f9b6509db5a32e30d4c16951a60cd.1473836073.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: add --strict test for precedence challenged macro arguments

Add a test for macro arguents that have a non-comma leading or trailing
operator where the argument isn't parenthesized to avoid possible precedence
issues.

Link: http://lkml.kernel.org/r/47715508972f8d786f435e583ff881dbeee3a114.1473745855.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: add --strict test for macro argument reuse

If a macro argument is used multiple times in the macro definition, the
macro argument may have an unexpected side-effect.

Add a test (MACRO_ARG_REUSE) for that condition which is only
emitted with command-line option --strict.

Link: http://lkml.kernel.org/r/b6d67a87cafcafd15499e91780dc63b15dec0aa0.1473744906.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve the block comment * alignment test

An "uninitialized value" is emitted when a block comment starts on
the same line as a statement.

Fix this and make the test use a little fewer cpu cycles too.

Link: http://lkml.kernel.org/r/3c9993320c2182d37f53ac540878cfef59c3f62d.1473365956.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reported-by: Charlemagne Lasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: speed up checking for filenames in sections marked obsolete

Adding -f to the get_maintainer.pl invocation means git isn't invoked
by get_maintainer.pl for known filenames.

This reduces the overall time to run checkpatch.

Link: http://lkml.kernel.org/r/22991e3a295aeb399b43af0478b6e5809106ccee.1472684066.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

const_structs.checkpatch: add frequently used from Julia Lawall's list

Using const is generally a good idea.

Julia Lawall has created a list of always const and almost always const
structs in the kernel sources.

Link: https://lkml.org/lkml/2016/8/28/95
Add the most frequently used (> 50 cases) that are almost always or
always const.

Link: http://lkml.kernel.org/r/1e16020f8027654db0095bbfbcc11da51025365c.1472664220.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: externalize the structs that should be const

Make it easier to add new structs that should be const.

Link: http://lkml.kernel.org/r/e5a8da43e7c11525bafbda1ca69a8323614dd942.1472664220.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: don't test for prefer ether_addr_<foo>

< sigh > Comment these tests out.

These are just too enticing to people that don't verify that
both source and dest addresses really must be __aligned(2).

It helps make Dan Carpenter happy too.

Link: http://lkml.kernel.org/r/dc32ec66d24647f4cdf824c8dfbbc59aa7ce7b7d.1472665676.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Greg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: test multiple line block comment alignment

Warn when block comments are not aligned on the *

/*
* block comment, no warning
*/

/*
* block comment, emit warning
*/

Link: http://lkml.kernel.org/r/edb57bd330adfe024b95ec2a807d4aa7f0c8b112.1472261299.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reported-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: look for symbolic permissions and suggest octal instead

S_<FOO> uses should be avoided where octal is more intelligible.

Linus didst say:

: It's *much* easier to parse and understand the octal numbers, while the
: symbolic macro names are just random line noise and hard as hell to
: understand. You really have to think about it.
:
: So we should rather go the other way: convert existing bad symbolic
: permission bit macro use to just use the octal numbers.
:
: The symbolic names are good for the *other* bits (ie sticky bit, and the
: inode mode _type_ numbers etc), but for the permission bits, the symbolic
: names are just insane crap. Nobody sane should ever use them. Not in the
: kernel, not in user space.
(http://lkml.kernel.org/r/CA+55aFw5v23T-zvDZp-MmD_EYxF8WbafwwB59934FV7g21uMGQ@mail.gmail.com)

Link: http://lkml.kernel.org/r/7232ef011d05a92f4caa86a5e9830d87966a2eaf.1470180926.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: see if modified files are marked obsolete in MAINTAINERS

Use get_maintainer to check the status of individual files. If
"obsolete", suggest leaving the files alone.

Link: http://lkml.kernel.org/r/7ceaa510dc9d2df05ec4b456baed7bb1415550b3.1471889575.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: SF Markus Elfring <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/bitmap.c: enhance bitmap syntax

Today there are platforms with many CPUs (up to 4K).  Trying to boot only
part of the CPUs may result in too long string.

For example lets take NPS platform that is part of arch/arc.  This
platform have SMP system with 256 cores each with 16 HW threads (SMT
machine) where HW thread appears as CPU to the kernel.  In this example
there is total of 4K CPUs.  When one tries to boot only part of the HW
threads from each core the string representing the map may be long...  For
example if for sake of performance we decided to boot only first half of
HW threads of each core the map will look like:
0-7,16-23,32-39,...,4080-4087

This patch introduce new syntax to accommodate with such use case.  I
added an optional postfix to a range of CPUs which will choose according
to given modulo the desired range of reminders i.e.:

    <cpus range>:sed_size/group_size

For example, above map can be described in new syntax like this:
0-4095:8/16

Note that this patch is backward compatible with current syntax.

[[email protected]: rework documentation]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Noam Camus <[email protected]>
Cc: David Decotigny <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Pan Xinhui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/kstrtox.c: smaller _parse_integer()

Set "overflow" bit upon encountering it instead of postponing to the end
of the conversion. Somehow gcc unwedges itself and generates better code:

$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
_parse_integer 177 139 -38

Inspired by patch from Zhaoxiu Zeng.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/ctype.h: make isdigit() table lookupless

Make isdigit into a simple range checking inline function:

return '0' <= c && c <= '9';

This code is 1 branch, not 2 because any reasonable compiler can
optimize this code into SUB+CMP, so the code

while (isdigit((c = *s++)))
...

remains 1 branch per iteration HOWEVER it suddenly doesn't do table
lookup priming cacheline nobody cares about.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib: harden strncpy_from_user

The strncpy_from_user() accessor is effectively a copy_from_user()
specialised to copy strings, terminating early at a NUL byte if possible.
In other respects it is identical, and can be used to copy an arbitrarily
large buffer from userspace into the kernel.  Conceptually, it exposes a
similar attack surface.

As with copy_from_user(), we check the destination range when the kernel
is built with KASAN, but unlike copy_from_user() we do not check the
destination buffer when using HARDENED_USERCOPY.  As strncpy_from_user()
calls get_user() in a loop, we must call check_object_size() explicitly.

This patch adds this instrumentation to strncpy_from_user(), per the same
rationale as with the regular copy_from_user().  In the absence of
hardened usercopy this will have no impact as the instrumentation expands
to an empty static inline function.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mark Rutland <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

radix-tree tests: properly initialize mutex

The pthread_mutex_t in regression1.c wasn't being initialized properly.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

radix-tree tests: add iteration test

There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot().  This unit test exercises all four of them, making
sure that if in the future we have an unsafe path through
radix_tree_next_slot(), we'll catch it.

Here are details on the four cases:

1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets

    iter->next_index = iter->index;

which means that in in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.

2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged().  This case was giving us NULL pointer
dereferences in testing, and was fixed with this commit:

commit 3cb9185c6730 ("radix-tree: fix radix_tree_iter_retry() for tagged
iterators.")

This fix doesn't explicitly check for 'slot' being NULL, though, it works
around the NULL pointer dereference by instead zeroing iter->tags in
radix_tree_iter_retry(), which makes us bail out of the if() case in
radix_tree_next_slot() before we dereference 'slot'.

3) radix_tree_iter_next() via via a non-tagged iteration like
radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().

As with non-tagged iteration, 'count' in the else case of
radix_tree_next_slot() is zero, so we skip over the while() loop and
effectively just return NULL without ever dereferencing 'slot'.

4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().

radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:

    if (flags & RADIX_TREE_ITER_TAGGED) {
    void *canon = slot;

    iter->tags >>= 1;
    if (unlikely(!iter->tags))
    return NULL;

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

radix-tree: 'slot' can be NULL in radix_tree_next_slot()

There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot().  Yet radix_tree_next_slot() never actually checks
whether 'slot' is NULL.  It just happens that for the cases where 'slot'
is NULL, some other combination of factors prevents us from dereferencing
it.

It would be very easy for someone to unwittingly change one of these
factors without realizing that we are implicitly depending on it to save
us from a NULL pointer dereference.

Add a comment documenting the things that allow 'slot' to be safely passed
as NULL to radix_tree_next_slot().

Here are details on the four cases:

1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets

iter->next_index = iter->index;

which means that in in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.

2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged().  This case was giving us NULL pointer
dereferences in testing, and was fixed with this commit:

commit 3cb9185c6730 ("radix-tree: fix radix_tree_iter_retry() for tagged
iterators.")

This fix doesn't explicitly check for 'slot' being NULL, though, it works
around the NULL pointer dereference by instead zeroing iter->tags in
radix_tree_iter_retry(), which makes us bail out of the if() case in
radix_tree_next_slot() before we dereference 'slot'.

3) radix_tree_iter_next() via via a non-tagged iteration like
radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().

As with non-tagged iteration, 'count' in the else case of
radix_tree_next_slot() is zero, so we skip over the while() loop and
effectively just return NULL without ever dereferencing 'slot'.

4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().

radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:

if (flags & RADIX_TREE_ITER_TAGGED) {
void *canon = slot;

iter->tags >>= 1;
if (unlikely(!iter->tags))
return NULL;

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/select: add vmalloc fallback for select(2)

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
contiguous and the allocation is temporary for the duration of the syscall
only. There were some concerns, whether this would have negative impact on the
system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
can cause some system wide performance issues - TLB flushes etc. - a large
order allocation is not for free either and an excessive reclaim/compaction can
have a similar effect. Also note that the size is effectively limited by
RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
bitmaps will fit well within single page and thus the vmalloc() fallback could
be only excercised for processes where root allows a higher limit.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[[email protected]: fix failure path logic]
[[email protected]: use proper type for size]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David Laight <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Jason Baron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

block: implement (some of) fallocate for block devices

After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
for zeroing SCSI UNMAP.  Punch still requires that FALLOC_FL_KEEP_SIZE is
set.  A length that goes past the end of the device will be clamped to the
device size if KEEP_SIZE is set; or will return -EINVAL if not.  Both
start and length must be aligned to the device's logical block size.

Since the semantics of fallocate are fairly well established already, wire
up the two pieces.  The other fallocate variants (collapse range, insert
range, and allocate blocks) are not supported.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Cc: Mike Snitzer <[email protected]> # tweaked header
Cc: Brian Foster <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

block: require write_same and discard requests align to logical block size

Make sure that the offset and length arguments that we're using to
construct WRITE SAME and DISCARD requests are actually aligned to the
logical block size. Failure to do this causes other errors in other parts
of the block layer or the SCSI layer because disks don't support partial
logical block writes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Mike Snitzer <[email protected]> # tweaked header
Cc: Brian Foster <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

block: invalidate the page cache when issuing BLKZEROOUT

Patch series "fallocate for block devices", v11.

This is a patchset to fix page cache coherency with BLKZEROOUT and
implement fallocate for block devices.

The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
the page cache if the zeroing command to the underlying device succeeds.
Without this patch we still have the pagecache coherence bug that's been
in the kernel forever.

The second patch changes the internal block device functions to reject
attempts to discard or zeroout that are not aligned to the logical block
size. Previously, we only checked that the start/len parameters were
512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
devices.

The third patch creates an fallocate handler for block devices, wires up
the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
fallocate interface between files and block devices. It also allows the
combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.

Test cases for the new block device fallocate are now in xfstests as
generic/349-351.

This patch (of 3):

Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
returning stale cache contents at a later time.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: fix memory leak in dlm_migrate_request_handler()

In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle
should be freed, otherwise the memory will be leaked.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Cc: Eric Ren <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'tegra-for-4.8-i2c' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into i2c/for-next

[wsa: fell through the cracks, applied to 4.9 now]

Signed-off-by: Wolfram Sang <[email protected]>
i2c: 'i2c-bus' node support for v4.8-rc1

This includes the device tree binding and I2C core changes to support
the i2c-bus subnode that I2C masters can use to describe their slaves
in a separate namespace and therefore avoid clashing with potentially
other subnodes.

powerpc/mm/hash64: Fix might_have_hea() check

In commit 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type
to detect HEA special case") we changed the logic in might_have_hea()
to check FW_FEATURE_SPLPAR rather than machine_is(pseries).

However the check was incorrectly negated, leading to crashes on
machines with HEA adapters, such as:

  mm: Hashing failure ! EA=0xd000080080004040 access=0x800000000000000e current=NetworkManager
      trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 pte=0xc0003cc033e701ae
  Unable to handle kernel paging request for data at address 0xd000080080004040
  Call Trace:
    .ehea_create_cq+0x148/0x340 [ehea] (unreliable)
    .ehea_up+0x258/0x1200 [ehea]
    .ehea_open+0x44/0x1a0 [ehea]
    ...

Fix it by removing the negation.

Fixes: 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case")
Cc: [email protected] # v4.8+
Reported-by: Denis Kirjanov <[email protected]>
Reported-by: Jan Stancek <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/64: Fix incorrect return value from __copy_tofrom_user

Debugging a data corruption issue with virtio-net/vhost-net led to
the observation that __copy_tofrom_user was occasionally returning
a value 16 larger than it should.  Since the return value from
__copy_tofrom_user is the number of bytes not copied, this means
that __copy_tofrom_user can occasionally return a value larger
than the number of bytes it was asked to copy.  In turn this can
cause higher-level copy functions such as copy_page_to_iter_iovec
to corrupt memory by copying data into the wrong memory locations.

It turns out that the failing case involves a fault on the store
at label 79, and at that point the first unmodified byte of the
destination is at R3 + 16.  Consequently the exception handler
for that store needs to add 16 to R3 before using it to work out
how many bytes were not copied, but in this one case it was not
adding the offset to R3.  To fix it, this moves the label 179 to
the point where we add 16 to R3.  I have checked manually all the
exception handlers for the loads and stores in this code and the
rest of them are correct (it would be excellent to have an
automated test of all the exception cases).

This bug has been present since this code was initially
committed in May 2002 to Linux version 2.5.20.

Cc: [email protected]
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

gpio: pca953x: add a comment explaining the need for a lockdep subclass

This is a follow-up to commit 559b46990e76 ("gpio: pca953x: fix an
incorrect lockdep warning"). The reason for calling
lockdep_set_subclass() in pca953x_probe() is not explained in
the code.

Add a comment describing the problem, partial solution and required
future extensions.

Signed-off-by: Bartosz Golaszewski <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

ACPI / property: Allow holes in reference properties

DT allows holes or empty phandles for references. This is used for example
in SPI subsystem where some chip selects are native and others are regular
GPIOs. In ACPI _DSD we currently do not support this but instead the
preceding reference consumes all following integer arguments.

For example we would like to support something like the below ASL fragment
for SPI:

  Package () {
      "cs-gpios",
      Package () {
          ^GPIO, 19, 0, 0, // GPIO CS0
          0,               // Native CS
          ^GPIO, 20, 0, 0, // GPIO CS1
      }
  }

The zero in the middle means "no entry" or NULL reference. To support this
we change acpi_data_get_property_reference() to take firmware node and
num_args as argument and rename it to __acpi_node_get_property_reference().
The function returns -ENOENT if the given index resolves to "no entry"
reference and -ENODATA when there are no more entries in the property.

We then add static inline wrapper acpi_node_get_property_reference() that
passes MAX_ACPI_REFERENCE_ARGS as num_args to support the existing
behaviour which some drivers have been relying on.

Signed-off-by: Mika Westerberg <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

Merge tag 'media/v4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

- Documentation improvements: conversion of all non-DocBook documents
   to Sphinx and lots of fixes to the uAPI media book

- New PCI driver for Techwell TW5864 media grabber boards

- New SoC driver for ATMEL Image Sensor Controller

- Removal of some obsolete SoC drivers (s5p-tv driver and soc_camera
   drivers)

- Addition of ST CEC driver

- Lots of drivers fixes, improvements and additions

* tag 'media/v4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (464 commits)
  [media] ttusb_dec: avoid the risk of go past buffer
  [media] cx23885: Fix some smatch warnings
  [media] si2165: switch to regmap
  [media] si2165: use i2c_client->dev instead of i2c_adapter->dev for logging
  [media] si2165: Remove legacy attach
  [media] cx231xx: attach si2165 driver via i2c_client
  [media] cx231xx: Prepare for attaching new style i2c_client DVB demod drivers
  [media] cx23885: attach si2165 driver via i2c_client
  [media] si2165: support i2c_client attach
  [media] si2165: avoid division by zero
  [media] rcar-vin: add R-Car gen2 fallback compatibility string
  [media] lgdt3306a: remove 20*50 msec unnecessary timeout
  [media] cx25821: Remove deprecated create_singlethread_workqueue
  [media] cx25821: Drop Freeing of Workqueue
  [media] cxd2841er: force 8MHz bandwidth for DVB-C if specified bw not supported
  [media] redrat3: hardware-specific parameters
  [media] redrat3: remove hw_timeout member
  [media] cxd2841er: BER and SNR reading for ISDB-T
  [media] dvb-usb: avoid link error with dib3000m{b,c|
  [media] dvb-usb: split out common parts of dibusb
  ...

Merge tag 'topic/drm-misc-2016-10-11' of git://anongit.freedesktop.org/drm-intel into drm-next

Just flushing out my -misc queue. Slightly important are the prime
refcount/unload fixes from Chris.

There's also the reservation stuff from Chris still pending, and Sumits
hasn't landed that yet. Might get another pull for that, but pls don't
hold up the main pull for it ;-)

* tag 'topic/drm-misc-2016-10-11' of git://anongit.freedesktop.org/drm-intel:
  drm/crtc: constify drm_crtc_index parameter
  drm: use the right function name in documentation
  drm: Release resources with a safer function
  drm: Fix up kerneldoc for new drm_gem_dmabuf_export()
  drm/bridge: Drop drm_connector_unregister and call drm_connector_cleanup directly
  drm/fb-helper: fix sphinx markup for DRM_FB_HELPER_DEFAULT_OPS
  drm/bridge: Add RGB to VGA bridge support
  drm/prime: Take a ref on the drm_dev when exporting a dma_buf
  drm/prime: Pass the right module owner through to dma_buf_export()
  drm/bridge: Call drm_connector_cleanup directly
  drm: simple_kms_helper: Add prepare_fb and cleanup_fb hooks
  drm: Release resources with a safer function

Merge tag 'iommu-updates-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU updates from Joerg Roedel:

- support for interrupt virtualization in the AMD IOMMU driver. These
   patches were shared with the KVM tree and are already merged through
   that tree.

- generic DT-binding support for the ARM-SMMU driver. With this the
   driver now makes use of the generic DMA-API code. This also required
   some changes outside of the IOMMU code, but these are acked by the
   respective maintainers.

- more cleanups and fixes all over the place.

* tag 'iommu-updates-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (40 commits)
  iommu/amd: No need to wait iommu completion if no dte irq entry change
  iommu/amd: Free domain id when free a domain of struct dma_ops_domain
  iommu/amd: Use standard bitmap operation to set bitmap
  iommu/amd: Clean up the cmpxchg64 invocation
  iommu/io-pgtable-arm: Check for v7s-incapable systems
  iommu/dma: Avoid PCI host bridge windows
  iommu/dma: Add support for mapping MSIs
  iommu/arm-smmu: Set domain geometry
  iommu/arm-smmu: Wire up generic configuration support
  Docs: dt: document ARM SMMU generic binding usage
  iommu/arm-smmu: Convert to iommu_fwspec
  iommu/arm-smmu: Intelligent SMR allocation
  iommu/arm-smmu: Add a stream map entry iterator
  iommu/arm-smmu: Streamline SMMU data lookups
  iommu/arm-smmu: Refactor mmu-masters handling
  iommu/arm-smmu: Keep track of S2CR state
  iommu/arm-smmu: Consolidate stream map entry state
  iommu/arm-smmu: Handle stream IDs more dynamically
  iommu/arm-smmu: Set PRIVCFG in stage 1 STEs
  iommu/arm-smmu: Support non-PCI devices with SMMUv3
  ...

Merge tag 'drm-intel-next-fixes-2016-10-11' of git://anongit.freedesktop.org/drm-intel into drm-next

A big bunch of i915 fixes for drm-next / v4.9 merge window, with more
than half of them also cc: stable. We also continue to have more Fixes:
annotations for our fixes, which should help the backporters and
archeologists.

* tag 'drm-intel-next-fixes-2016-10-11' of git://anongit.freedesktop.org/drm-intel: (27 commits)
  drm/i915: Fix conflict resolution from backmerge of v4.8-rc8 to drm-next
  drm/i915/guc: Unwind GuC workqueue reservation if request construction fails
  drm/i915: Reset the breadcrumbs IRQ more carefully
  drm/i915: Force relocations via cpu if we run out of idle aperture
  drm/i915: Distinguish last emitted request from last submitted request
  drm/i915: Allow DP to work w/o EDID
  drm/i915: Move long hpd handling into the hotplug work
  drm/i915/execlists: Reinitialise context image after GPU hang
  drm/i915: Use correct index for backtracking HUNG semaphores
  drm/i915: Unalias obj->phys_handle and obj->userptr
  drm/i915: Just clear the mmiodebug before a register access
  drm/i915/gen9: only add the planes actually affected by ddb changes
  drm/i915: Allow PCH DPLL sharing regardless of DPLL_SDVO_HIGH_SPEED
  drm/i915/bxt: Fix HDMI DPLL configuration
  drm/i915/gen9: fix the watermark res_blocks value
  drm/i915/gen9: fix plane_blocks_per_line on watermarks calculations
  drm/i915/gen9: minimum scanlines for Y tile is not always 4
  drm/i915/gen9: fix the WaWmMemoryReadLatency implementation
  drm/i915/kbl: KBL also needs to run the SAGV code
  drm/i915: introduce intel_has_sagv()
  ...

Merge tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
"Aside from the recently added pmem sub-division support these have
  been in -next for several releases with no reported issues. The sub-
  division support was included in next-20161010 with no reported
  issues. It passes all unit tests including new tests for all the new
  functionality below.

  Summary:

   - PMEM sub-division support: Allow a single PMEM region to be divided
     into multiple namespaces. Originally, ~2 years ago, it was thought
     that partitions of a /dev/pmemX block device could handle
     sub-allocations of persistent memory for different use cases. With
     the decision to not support DAX mappings of raw block-devices, and
     the genesis of device-dax, the need for having multiple
     pmem-namespace per region has grown.

   - Device-DAX unified inode: In support of dynamic-resizing of a
     device-dax instance the kernel arranges for all mappings of a
     device-dax node to share the same inode. This allows unmap /
     truncate / invalidation events to affect all instances of the
     device similar to the behavior of mmap on block devices.

   - Hardware error scrubbing reworks: The original address-range-scrub
     and badblocks tracking solution allowed clearing entries at the
     individual namespace level, but it failed to clear the internal
     list of media errors maintained at the bus level. The result was
     that the next scrub or namespace disable/re-enable event would
     restore the cleared badblocks, but now that is fixed. The v4.8
     kernel introduced an auto-scrub-on-machine-check behavior to
     repopulate the badblocks list. Now, in v4.9, the auto-scrub
     behavior can be disabled and simply arrange for the error reported
     in the machine-check to be added to the list.

   - DIMM health-event notification support: ACPI 6.1 defines a
     notification event code that can be send to ACPI NVDIMM devices. A
     poll(2) capable file descriptor for these events can be obtained
     from the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory
     device.

   - Miscellaneous fixes: NVDIMM-N probe error, device-dax build error,
     and a change to dedup the flush hint list to not flush the memory
     controller more than necessary"

* tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  /dev/dax: fix Kconfig dependency build breakage
  dax: use correct dev_t value
  dax: convert devm_create_dax_dev to PTR_ERR
  libnvdimm, namespace: allow creation of multiple pmem-namespaces per region
  libnvdimm, namespace: lift single pmem limit in scan_labels()
  libnvdimm, namespace: filter out of range labels in scan_labels()
  libnvdimm, namespace: enable allocation of multiple pmem namespaces
  libnvdimm, namespace: update label implementation for multi-pmem
  libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
  libnvdimm, region: update nd_region_available_dpa() for multi-pmem support
  libnvdimm, namespace: sort namespaces by dpa at init
  libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time
  tools/testing/nvdimm: support for sub-dividing a pmem region
  libnvdimm, namespace: unify blk and pmem label scanning
  libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper
  libnvdimm, label: convert label tracking to a linked list
  libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
  nvdimm: reduce duplicated wpq flushes
  libnvdimm: clear the internal poison_list when clearing badblocks
  pmem: reduce kmap_atomic sections to the memcpys only
  ...

parisc: Show trap name in kernel crash

Show the real trap name when the kernel crashes.

Signed-off-by: Helge Deller <[email protected]>

parisc: Zero-initialize newly alloced memblock

Commit 4fe9e1d957e4 ("parisc: Drop bootmem and switch to memblock")
switched to the memblock allocator, but missed to zero-initialize the
newly allocated memblocks. This lead to crashes on some machines like
the rp3410.

Fixes: 4fe9e1d957e4 ("parisc: Drop bootmem and switch to memblock")
Signed-off-by: Helge Deller <[email protected]>

Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs updates from Chris Mason:
"This is a big variety of fixes and cleanups.

  Liu Bo continues to fixup fuzzer related problems, and some of Josef's
  cleanups are prep for his bigger extent buffer changes (slated for
  v4.10)"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
  Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
  Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
  Btrfs: don't BUG() during drop snapshot
  btrfs: fix btrfs_no_printk stub helper
  Btrfs: memset to avoid stale content in btree leaf
  btrfs: parent_start initialization cleanup
  btrfs: Remove already completed TODO comment
  btrfs: Do not reassign count in btrfs_run_delayed_refs
  btrfs: fix a possible umount deadlock
  Btrfs: fix memory leak in do_walk_down
  btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
  btrfs: convert send's verbose_printk to btrfs_debug
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: unsplit printed strings
  btrfs: clean the old superblocks before freeing the device
  Btrfs: kill BUG_ON in run_delayed_tree_ref
  Btrfs: don't leak reloc root nodes on error
  btrfs: squash lines for simple wrapper functions
  Btrfs: improve check_node to avoid reading corrupted nodes
  ...

Merge tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:
"This pull request contains:

   - Fixes for both UBI and UBIFS
   - overlayfs support (O_TMPFILE, RENAME_WHITEOUT/EXCHANGE)
   - Code refactoring for the upcoming MLC support"

[ Ugh, we just got rid of the "rename2()" naming for the extended rename
  functionality. And this re-introduces it in ubifs with the cross-
  renaming and whiteout support.

  But rather than do any re-organizations in the merge itself, the
  naming can be cleaned up later ]

* tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs: (27 commits)
  UBIFS: improve function-level documentation
  ubifs: fix host xattr_len when changing xattr
  ubifs: Use move variable in ubifs_rename()
  ubifs: Implement RENAME_EXCHANGE
  ubifs: Implement RENAME_WHITEOUT
  ubifs: Implement O_TMPFILE
  ubi: Fix Fastmap's update_vol()
  ubi: Fix races around ubi_refill_pools()
  ubi: Deal with interrupted erasures in WL
  UBI: introduce the VID buffer concept
  UBI: hide EBA internals
  UBI: provide an helper to query LEB information
  UBI: provide an helper to check whether a LEB is mapped or not
  UBI: add an helper to check lnum validity
  UBI: simplify LEB write and atomic LEB change code
  UBI: simplify recover_peb() code
  UBI: move the global ech and vidh variables into struct ubi_attach_info
  UBI: provide helpers to allocate and free aeb elements
  UBI: fastmap: use ubi_io_{read, write}_data() instead of ubi_io_{read, write}()
  UBI: fastmap: use ubi_rb_for_each_entry() in unmap_peb()
  ...