Git Repo - linux.git/log

coredump: change wait_for_dump_helpers() to use wait_event_interruptible()

wait_for_dump_helpers() calls wake_up/kill_fasync from inside the
wait_event-like loop.  This is not needed and in fact this is not
strictly correct, we can/should do this only once after we change
pipe->writers.  We could even check if it becomes zero.

Change this code to use wait_event_interruptible(); this can also
help to make this wait freezable.

With this patch we check pipe->readers without pipe_lock(), this is
fine.  Once we see pipe->readers == 1 we know that the handler
decremented the counter, this is all we need.

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Mandeep Singh Baines <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: factor out the setting of PF_DUMPCORE

Cleanup. Every linux_binfmt->core_dump() sets PF_DUMPCORE, move this into
zap_threads() called by do_coredump().

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Mandeep Singh Baines <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: introduce dump_interrupted()

By discussion with Mandeep.

Change dump_write(), dump_seek() and do_coredump() to check
signal_pending() and abort if it is true.  dump_seek() does this only
before f_op->llseek(), otherwise it relies on dump_write().

We need this change to ensure that the coredump won't delay suspend, and
to ensure it reacts to SIGKILL "quickly enough", a core dump can take a
lot of time.  In particular this can help oom-killer.

We add the new trivial helper, dump_interrupted() to add the comments and
to simplify the potential freezer changes.  Perhaps it will have more
callers.

Ideally it should do try_to_freeze() but then we need the unpleasant
changes in dump_write() and wait_for_dump_helpers().  It is not trivial to
change dump_write() to restart if f_op->write() fails because of
freezing().  We need to handle the short writes, we need to clear
TIF_SIGPENDING (and we can't rely on recalc_sigpending() unless we change
it to check PF_DUMPCORE).  And if the buggy f_op->write() sets
TIF_SIGPENDING we can not distinguish this case from the race with
freeze_task() + __thaw_task().

So we simply accept the fact that the freezer can truncate a core-dump but
at least you can reliably suspend.  Hopefully we can tolerate this
unlikely case; the necessary complications are not worth the trouble.
But if we decide to make the coredumping freezable later we can do this on
top of this change.

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Mandeep Singh Baines <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: sanitize the setting of signal->group_exit_code

Now that the coredumping process can be SIGKILL'ed, the setting of
->group_exit_code in do_coredump() can race with complete_signal() and
SIGKILL or 0x80 can be "lost", or wait(status) can report status ==
SIGKILL | 0x80.

But the main problem is that it is not clear to me what we should do if
binfmt->core_dump() succeeds but SIGKILL was sent, that is why this patch
comes as a separate change.

This patch adds 0x80 if ->core_dump() succeeds and the process was not
killed. But perhaps we can (should?) re-set ->group_exit_code changed by
SIGKILL back to "siginfo->si_signo |= 0x80" in case when core_dumped == T.
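
For readers less familiar with the encoding discussed here, the following is a userspace sketch (not kernel code) of how a parent sees these bits through wait(2) on Linux: the terminating signal sits in the low 7 bits of the status and 0x80 is the core-dump flag. The helper name describe_status() is hypothetical.

```c
#define _DEFAULT_SOURCE          /* for WCOREDUMP on glibc */
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: decode a wait(2) status the way a parent would.
 * Returns the terminating signal, or 0 if the child was not killed by
 * a signal. *dumped is set to 1 when the core-dump flag is present. */
static int describe_status(int status, int *dumped)
{
    *dumped = 0;
    if (!WIFSIGNALED(status))
        return 0;
    *dumped = WCOREDUMP(status) ? 1 : 0;
    return WTERMSIG(status);
}
```

A status of SIGKILL | 0x80, the mixed value this change tries to avoid, would decode as "killed by SIGKILL, core dumped".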

Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Mandeep Singh Baines <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: ensure that SIGKILL always kills the dumping thread

prepare_signal() blesses SIGKILL sent to the dumping process but this
signal can be "lost" anyway.  The problem is, complete_signal() sees
SIGNAL_GROUP_EXIT and skips the "kill them all" logic.  And even if the
dumping process is single-threaded (so the target is always "correct"),
the group-wide SIGKILL is not recorded in task->pending and thus
__fatal_signal_pending() won't be true.  A multi-threaded case has even
more problems.

And even ignoring all technical details, SIGNAL_GROUP_EXIT doesn't look
right to me.  This coredumping process is not exiting yet, it can do a lot
of work dumping the core.

With this patch the dumping process doesn't have SIGNAL_GROUP_EXIT, we set
signal->group_exit_task instead.  This makes signal_group_exit() true and
thus this should equally close the races with exit/exec/stop, but allows
the dumping thread to be killed reliably.

Notes:
- It is not clear what we should do with ->group_exit_code
  if the dumper was killed, see the next change.

- we need more (hopefully straightforward) changes to ensure
  that SIGKILL actually interrupts the coredump. Basically we
  need to check __fatal_signal_pending() in dump_write() and
  dump_seek().

Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Mandeep Singh Baines <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: only SIGKILL should interrupt the coredumping task

There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:

- do_coredump() clears TIF_SIGPENDING but of course this can't help
  if, say, SIGCHLD comes after that.

  In this case the coredump can fail unexpectedly. See for example the
  wait_for_dump_helpers()->signal_pending() check, but there are other
  reasons.

- At the same time, dumping a huge core on slow media can take a
  lot of time/resources and there is no way to kill the coredumping
  task reliably. In particular this is not oom_kill-friendly.

This patch tries to fix the 1st problem, and makes the preparation for the
next changes.

We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core.  prepare_signal() checks this flag and
nacks any signal except SIGKILL.

Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion.  See marc.info/?l=linux-kernel&m=120508897917439

Notes:
- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
  The patch assumes that dump_write/etc paths should never
  call it, but we can change it as well.

- There is another source of TIF_SIGPENDING, freezer. This
  will be addressed separately.

Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Mandeep Singh Baines <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kmod: remove call_usermodehelper_fns()

This function suffers from not being able to determine if the cleanup is
called in case it returns -ENOMEM. Nobody is using it anymore, so let's
remove it.

Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

usermodehelper: split remaining calls to call_usermodehelper_fns()

These are the only users of call_usermodehelper_fns(). This function
suffers from not being able to determine if the cleanup is called. Even
though the cleanup pointer is NULL in these places, convert them to use the
separate call_usermodehelper_setup() + call_usermodehelper_exec()
functions so we can remove the _fns variant.

Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

coredump: remove trailing whitespace

Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

KEYS: split call to call_usermodehelper_fns()

Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns(). In case there's an OOM in this last
function the cleanup function may not be called - in this case we would
miss a call to key_put().

Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Acked-by: David Howells <[email protected]>
Acked-by: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kmod: split call to call_usermodehelper_fns()

Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns(). In case the latter returns -ENOMEM the
cleanup function may not have been called - in this case we would not free
argv and module_name.

Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

usermodehelper: export call_usermodehelper_exec() and call_usermodehelper_setup()

call_usermodehelper_setup() + call_usermodehelper_exec() need to be
called instead of call_usermodehelper_fns() when the cleanup function
needs to be called even when an ENOMEM error occurs. In this case using
call_usermodehelper_fns() the user can't distinguish if the cleanup
function was called or not.

[[email protected]: export call_usermodehelper_setup() to modules]
Signed-off-by: Lucas De Marchi <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest: add a test case for PTRACE_PEEKSIGINFO

* Dump signals from process-wide and per-thread queues with
different sizes of buffers.
* Check error paths for buffers with restricted permissions. Part of
the buffer or the whole buffer is read-only.
* Try to get a nonexistent signal.

Signed-off-by: Andrew Vagin <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Pedro Alves <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ptrace: add ability to retrieve signals without removing from a queue (v4)

This patch adds a new ptrace request PTRACE_PEEKSIGINFO.

This request is used to retrieve information about pending signals
starting with the specified sequence number.  Siginfo_t structures are
copied from the child into the buffer starting at "data".

The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
struct ptrace_peeksiginfo_args {
    u64 off;    /* from which siginfo to start */
    u32 flags;
    s32 nr;     /* how many siginfos to take */
};

"nr" has type "s32", because ptrace() returns "long", which has 32 bits on
i386, and negative values are used for errors.

Currently there is only one flag PTRACE_PEEKSIGINFO_SHARED for dumping
signals from process-wide queue.  If this flag is not set, signals are
read from a per-thread queue.

The request PTRACE_PEEKSIGINFO returns the number of dumped signals.  If a
signal with the specified sequence number doesn't exist, ptrace returns
zero.  The request returns an error only if a failure occurs before any
signal has been dumped.

Errors:
EINVAL - one or more specified flags are not supported or nr is negative
EFAULT - buf or addr is outside your accessible address space.

A result siginfo contains the kernel part of si_code, which is usually stripped,
but it's required for queuing the same siginfo back during restore of
pending signals.
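
A minimal userspace sketch of the request, assuming a Linux kernel that carries this patch. The request number (0x4209) and the SHARED flag (bit 0) are the kernel UAPI values, defined locally here so the sketch does not depend on the age of the libc headers; struct peek_args mirrors the layout above, and the function name is hypothetical. The child blocks a real-time signal, queues it twice on itself and stops; the parent then peeks at the process-wide queue without dequeuing anything.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PEEKSIGINFO_REQUEST 0x4209      /* PTRACE_PEEKSIGINFO */
#define PEEKSIGINFO_SHARED  (1u << 0)   /* PTRACE_PEEKSIGINFO_SHARED */

/* Local mirror of struct ptrace_peeksiginfo_args. */
struct peek_args {
    uint64_t off;    /* from which siginfo to start */
    uint32_t flags;
    int32_t  nr;     /* how many siginfos to take */
};

/* Hypothetical demo. Returns the number of siginfos peeked, or -1 if
 * anything fails (for example when ptrace is not permitted). */
static long peek_two_queued_signals(void)
{
    struct peek_args args = { .off = 0, .flags = PEEKSIGINFO_SHARED, .nr = 4 };
    int sig = SIGRTMIN, status;
    siginfo_t buf[4];
    long n;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        sigset_t set;
        union sigval v = { .sival_int = 1 };

        sigemptyset(&set);
        sigaddset(&set, sig);
        sigprocmask(SIG_BLOCK, &set, NULL);
        sigqueue(getpid(), sig, v);      /* blocked, so it stays pending */
        v.sival_int = 2;
        sigqueue(getpid(), sig, v);      /* real-time signals queue up */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == 0)
            raise(SIGSTOP);              /* hand control to the tracer */
        _exit(1);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;                       /* child exited; nothing to peek */

    memset(buf, 0, sizeof(buf));
    n = ptrace(PEEKSIGINFO_REQUEST, pid, &args, buf);
    if (n > 0 && buf[0].si_signo != sig)
        n = -1;                          /* unexpected content */

    kill(pid, SIGKILL);                  /* tear the tracee down */
    waitpid(pid, &status, 0);
    return n;
}
```

Dropping PEEKSIGINFO_SHARED from args.flags would read the per-thread queue instead, which is empty in this scenario because sigqueue() targets the whole process.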

This functionality is required for checkpointing pending signals.  Pedro
Alves suggested using it in "gdb" to peek at pending signals.  gdb already
uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
dequeued.  This functionality allows gdb to look at the pending signals
which were not reported yet.

The prototype of this code was developed by Oleg Nesterov.

Signed-off-by: Andrew Vagin <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Pedro Alves <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfsplus: remove duplicated message prefix in hfsplus_block_free()

Signed-off-by: Vyacheslav Dubeyko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Hin-Tak Leung <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfsplus: add error propagation to __hfsplus_ext_write_extent()

__hfsplus_ext_write_extent() suppresses errors coming from
hfs_brec_find(). The patch implements error code propagation.

Signed-off-by: Alexey Khoroshilov <[email protected]>
Reviewed-by: Vyacheslav Dubeyko <[email protected]>
Cc: Hin-Tak Leung <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfs/hfsplus: convert printks to pr_<level>

Use a more current logging style.

Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
hfsplus now uses "hfsplus: " for all messages.
Coalesce formats.
Prefix debugging messages too.
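
The effect of the pr_fmt() definition can be sketched in userspace. KBUILD_MODNAME is normally supplied by the build system; it is hard-coded here as an assumption, and pr_err_buf() is a hypothetical stand-in that formats into a buffer rather than the kernel log.

```c
#include <stdio.h>

/* Stand-in for the name the build system would supply (assumption). */
#define KBUILD_MODNAME "hfsplus"

/* Same definition as in the commit: the prefix is glued onto the format
 * at compile time through string-literal concatenation. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Hypothetical userspace stand-in for pr_err(). */
#define pr_err_buf(buf, size, fmt, ...) \
    snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix is part of the format string, call sites no longer need to spell out "hfsplus: " themselves.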

Signed-off-by: Joe Perches <[email protected]>
Cc: Vyacheslav Dubeyko <[email protected]>
Cc: Hin-Tak Leung <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfs/hfsplus: convert dprint to hfs_dbg

Use a more current logging style.

Rename macro and uses.
Add do {} while (0) to macro.
Add DBG_ to macro.
Add and use hfs_dbg_cont variant where appropriate.

Signed-off-by: Joe Perches <[email protected]>
Cc: Vyacheslav Dubeyko <[email protected]>
Cc: Hin-Tak Leung <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfsplus: fix warnings in fs/hfsplus/bfind.c

fs/hfsplus/bfind.c: In function 'hfs_find_1st_rec_by_cnid':
(1) include/uapi/linux/swab.h:60:2: warning: 'search_cnid' may be used uninitialized in this function [-Wmaybe-uninitialized]
(2) include/uapi/linux/swab.h:60:2: warning: 'cur_cnid' may be used uninitialized in this function [-Wmaybe-uninitialized]

[[email protected]: make the workaround more explicit]
Signed-off-by: Vyacheslav Dubeyko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hfs: add error checking for hfs_find_init()

hfs_find_init() may fail with ENOMEM, but there are places where the
returned value is not checked. The consequences can be very unpleasant,
e.g. kfree() of an uninitialized pointer and inappropriate mutex unlocking.

The patch adds checks for errors in hfs_find_init().

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <[email protected]>
Reviewed-by: Vyacheslav Dubeyko <[email protected]>
Cc: Hin-Tak Leung <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nilfs2: remove unneeded test in nilfs_writepage()

page->mapping->host cannot be NULL in nilfs_writepage(), so remove the
unneeded test.

This fixes the smatch warning: "fs/nilfs2/inode.c:211 nilfs_writepage()
error: we previously assumed 'inode' could be null (see line 195)".

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Vyacheslav Dubeyko <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nilfs2: fix using of PageLocked() in nilfs_clear_dirty_page()

Change test_bit(PG_locked, &page->flags) to PageLocked().

Signed-off-by: Vyacheslav Dubeyko <[email protected]>
Cc: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption

The NILFS2 driver remounts itself in RO mode in the case of discovering
metadata corruption (for example, discovering a broken bmap).  But
usually, this takes place when there have been file system operations
before remounting in RO mode.

Thereby, the NILFS2 driver can be in RO mode with dirty pages present in
modified inodes' address spaces.  This results in the flush kernel thread
endlessly trying to flush dirty pages in RO mode.  As a result, it is
possible to see such side effects as: (1) the flush kernel thread occupies
50% - 99% of CPU time; (2) the system can't be shut down without a manual
power switch off.

SYMPTOMS:
(1) System log contains error message: "Remounting filesystem read-only".
(2) The flush kernel thread occupies 50% - 99% of CPU time.
(3) The system can't be shut down without a manual power switch off.

REPRODUCTION PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <[email protected]>):

  ----------------[BEGIN SCRIPT]--------------------
  #!/bin/bash

  VG=unencrypted
  #apt-get install nilfs-tools darcs
  lvcreate --size 2G --name ntest $VG
  mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
  mkdir /var/tmp/n
  mkdir /var/tmp/n/ntest
  mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
  mkdir /var/tmp/n/ntest/thedir
  cd /var/tmp/n/ntest/thedir
  sleep 2
  date
  darcs init
  sleep 2
  dmesg|tail -n 5
  date
  darcs whatsnew || true
  date
  sleep 2
  dmesg|tail -n 5
  ----------------[END SCRIPT]--------------------

(3) Try to shut down the system.

REPRODUCIBILITY: 100%

FIX:

This patch implements checking mount state of NILFS2 driver in
nilfs_writepage(), nilfs_writepages() and nilfs_mdt_write_page()
methods.  If the RO mount state is detected, all dirty pages are simply
discarded and warning messages are written to the system log.

[[email protected]: fix printk warning]
Signed-off-by: Vyacheslav Dubeyko <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Cc: Anthony Doggett <[email protected]>
Cc: ARAI Shun-ichi <[email protected]>
Cc: Piotr Szymaniak <[email protected]>
Cc: Zahid Chowdhury <[email protected]>
Cc: Elmer Zhang <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

i2o: check copy_from_user() size parameter

Limit the size of the copy so we don't corrupt memory. Hopefully this
can only be called by root, but fixing this makes the static checkers
happier.

Signed-off-by: Dan Carpenter <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Masanari Iida <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Guenter Roeck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dmi_scan: refactor dmi_scan_machine(), {smbios,dmi}_present()

Move the calls to memcpy_fromio() up into the loop in
dmi_scan_machine(), and move the signature checks back down into
dmi_decode(). We need to check at 16-byte intervals but keep a 32-byte
buffer for an SMBIOS entry, so shift the buffer after each iteration.

Merge smbios_present() into dmi_present(), so we look for an SMBIOS
signature at the beginning of the given buffer and then for a DMI
signature at an offset of 16 bytes.
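
The sliding-window scan described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: the names present(), scan() and the FOUND_* codes are hypothetical, plain memory stands in for memcpy_fromio(), and checksum validation is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum { FOUND_NONE = 0, FOUND_SMBIOS = 1, FOUND_DMI = 2 };

/* Look for an SMBIOS anchor at the start of the 32-byte window, then
 * for a legacy DMI anchor 16 bytes in. */
static int present(const unsigned char *buf)
{
    if (memcmp(buf, "_SM_", 4) == 0)
        return FOUND_SMBIOS;
    if (memcmp(buf + 16, "_DMI_", 5) == 0)
        return FOUND_DMI;
    return FOUND_NONE;
}

/* Step through mem in 16-byte chunks while keeping a 32-byte window.
 * Returns the offset of the matching anchor, or -1 if none is found. */
static long scan(const unsigned char *mem, size_t len, int *kind)
{
    unsigned char buf[32];
    size_t q;

    memset(buf, 0, 16);                 /* nothing precedes the first chunk */
    for (q = 0; q + 16 <= len; q += 16) {
        memcpy(buf + 16, mem + q, 16);  /* newest chunk goes in the upper half */
        *kind = present(buf);
        if (*kind == FOUND_SMBIOS)
            return (long)q - 16;        /* anchor sits in the lower half */
        if (*kind == FOUND_DMI)
            return (long)q;
        memcpy(buf, buf + 16, 16);      /* shift the window by 16 bytes */
    }
    *kind = FOUND_NONE;
    return -1;
}
```

An SMBIOS anchor is only recognized once the 16 bytes after it have been read, so in this sketch an anchor in the final chunk is never reported.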

[[email protected]: use proper buf type in dmi_present()]
Signed-off-by: Ben Hutchings <[email protected]>
Reported-by: Tim McGrath <[email protected]>
Tested-by: Tim Mcgrath <[email protected]>
Cc: Zhenzhong Duan <[email protected]>
Signed-off-by: Artem Savkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

binfmt_elf: PIE: make PF_RANDOMIZE check comment more accurate

The comment I originally added in commit a3defbe5c337 ("binfmt_elf: fix
PIE execution with randomization disabled") is not really 100% accurate
-- sysctl is not the only way PF_RANDOMIZE could be forcibly unset
at runtime.

Another option of course is direct modification of personality flags
(i.e. running through setarch wrapper).

Make the comment more explicit and accurate.

Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs: make binfmt support for #! scripts modular and removable

Add a new configuration option CONFIG_BINFMT_SCRIPT to configure support
for interpreted scripts starting with "#!"; allow compiling out that
support, or building it as a module. Embedded systems running exclusively
compiled binaries could leave this support out, and systems that don't
need scripts before mounting the root filesystem can build this as a
module.

Signed-off-by: Josh Triplett <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

epoll: cleanup: use RCU_INIT_POINTER when nulling

It is always safe to use RCU_INIT_POINTER to NULL a pointer. This results
in slightly smaller/faster code.

Signed-off-by: Eric Wong <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

epoll: cleanup: hoist out f_op->poll calls

This reduces the amount of code inside the ready list iteration loops for
better readability IMHO.

Signed-off-by: Eric Wong <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

epoll: lock ep->mtx in ep_free to silence lockdep

Technically we do not need to hold ep->mtx during ep_free since we are
certain there are no other users of ep at that point. However, lockdep
complains with a "suspicious rcu_dereference_check() usage!" message; so
lock the mutex before ep_remove to silence the warning.

Signed-off-by: Eric Wong <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

epoll: use RCU to protect wakeup_source in epitem

This prevents wakeup_source destruction when a user hits the item with
EPOLL_CTL_MOD while ep_poll_callback is running.

Tested with CONFIG_SPARSE_RCU_POINTER=y and "make fs/eventpoll.o C=2"

Signed-off-by: Eric Wong <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

epoll: trim epitem by one cache line

It is common for epoll users to have thousands of epitems, so saving a
cache line on every allocation leads to large memory savings.

Since epitem allocations are cache-aligned, reducing sizeof(struct
epitem) from 136 bytes to 128 bytes will allow it to squeeze under a
cache line boundary on x86_64.

Via /sys/kernel/slab/eventpoll_epi, I see the following changes on my
x86_64 Core2 Duo (which has 64-byte cache alignment):

object_size : 192 => 128
objs_per_slab: 21 => 32

Also, add a BUILD_BUG_ON() to check for future accidental breakage.
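
The numbers above can be reproduced with a little arithmetic. The sketch below assumes a 64-byte cache line and a 4096-byte slab; the slab size is an assumption that happens to match the reported objs_per_slab values, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (a power of two). */
static size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Objects that fit in a slab once each one is padded to a full
 * cache line. */
static size_t objs_per_slab(size_t obj, size_t line, size_t slab_bytes)
{
    return slab_bytes / align_up(obj, line);
}
```

With these helpers, 136 bytes pads to 192 and gives 21 objects per slab, while 128 bytes needs no padding and gives 32.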

[[email protected]: use __packed, for all architectures]
Signed-off-by: Eric Wong <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/timer.c: move some non timer related syscalls to kernel/sys.c

Andrew Morton noted:

akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
SYSCALL_DEFINE1(alarm, unsigned int, seconds)
SYSCALL_DEFINE0(getpid)
SYSCALL_DEFINE0(getppid)
SYSCALL_DEFINE0(getuid)
SYSCALL_DEFINE0(geteuid)
SYSCALL_DEFINE0(getgid)
SYSCALL_DEFINE0(getegid)
SYSCALL_DEFINE0(gettid)
SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

Only one of those should be in kernel/timer.c. Who wrote this thing?

[[email protected]: coding-style fixes]
Signed-off-by: Stephen Rothwell <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/timer.c: convert compat_sys_sysinfo to COMPAT_SYSCALL_DEFINE

Signed-off-by: Stephen Rothwell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/compat.c: make do_sysinfo() static

The only use outside of kernel/timer.c was in kernel/compat.c, so move
compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.

Signed-off-by: Stephen Rothwell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Al Viro <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

binfmt_misc: reuse string_unescape_inplace()

There is a string_unescape_inplace() function which decodes strings in a
generic way. Let's use it.

Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

dynamic_debug: reuse generic string_unescape function

There is a kernel function to do the job in a generic way. Let's use it.

Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Jason Baron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

staging: speakup: remove custom string_unescape_any_inplace

There is a generic implementation of the function to unescape strings.

Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Samuel Thibault <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: William Hubbs <[email protected]>
Cc: Chris Brannon <[email protected]>
Cc: Kirk Reiser <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/string_helpers: introduce generic string_unescape

There are several places in the kernel where modules unescape input to
convert C-style escape sequences into byte codes.

The patch provides a generic implementation of such an approach. Test cases
are also included in the patch.
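
To illustrate what "unescaping" means here, the following is a small userspace sketch. It is not the kernel implementation: the function names are hypothetical and only a handful of escapes (\n, \r, \t, \\, \" and \xHH) are handled. Anything else is copied through unchanged.

```c
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not one. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decoded byte for a single-character escape, or -1 if unknown. */
static int simple_escape(char c)
{
    switch (c) {
    case 'n':  return '\n';
    case 'r':  return '\r';
    case 't':  return '\t';
    case '\\': return '\\';
    case '"':  return '"';
    default:   return -1;
    }
}

/* Decode escapes in place; the output never outruns the input, so one
 * buffer suffices. Returns the length of the decoded string. */
static size_t unescape_inplace(char *s)
{
    char *out = s;
    const char *in = s;

    while (*in) {
        int c, hi, lo;

        if (in[0] == '\\' && (c = simple_escape(in[1])) >= 0) {
            *out++ = (char)c;
            in += 2;
        } else if (in[0] == '\\' && in[1] == 'x' &&
                   (hi = hexval(in[2])) >= 0 && (lo = hexval(in[3])) >= 0) {
            *out++ = (char)(hi * 16 + lo);
            in += 4;
        } else {
            *out++ = *in++;    /* ordinary byte or unrecognized escape */
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Decoding in place works because every escape sequence is at least as long as the byte it decodes to.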

[[email protected]: clarify comment]
[[email protected]: export get_random_int() to modules]
Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Samuel Thibault <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: William Hubbs <[email protected]>
Cc: Chris Brannon <[email protected]>
Cc: Kirk Reiser <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/smp.c: cleanups

We sometimes use "struct call_single_data *data" and sometimes "struct
call_single_data *csd". Use "csd" consistently.

We sometimes use "struct call_function_data *data" and sometimes "struct
call_function_data *cfd". Use "cfd" consistently.

Also, avoid some 80-col layout tricks.

Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/fs.h: disable preempt when acquire i_size_seqcount write lock

Two rt tasks are bound to one CPU core.

The higher-priority rt task B preempts the lower-priority rt task A, which
has already taken the write seq lock.  When task B then tries to acquire
the read seq lock, it is doomed to lock up.

rt task A with lower priority: call write           rt task B with higher priority: call sync, and preempt task A
i_size_write
  write_seqcount_begin(&inode->i_size_seqcount);    i_size_read
  inode->i_size = i_size;                             read_seqcount_begin <-- lockup here...

So disabling preemption whenever the i_size_seqcount *write* lock is
acquired cures the problem.
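
The lockup can be seen in a single-threaded model of a sequence counter. This is a sketch with hypothetical names, not the kernel's seqcount API, and memory barriers are omitted because nothing here runs concurrently.

```c
#include <stdbool.h>

/* Minimal model: an odd sequence number means a write is in progress. */
struct seq_model {
    unsigned seq;
    long value;
};

static void model_write_begin(struct seq_model *s) { s->seq++; }  /* now odd */
static void model_write_end(struct seq_model *s)   { s->seq++; }  /* even again */

/* One reader attempt. A real reader loops until this succeeds. */
static bool model_read_try(const struct seq_model *s, long *out)
{
    unsigned start = s->seq;

    if (start & 1)
        return false;           /* writer active: must retry */
    *out = s->value;
    return s->seq == start;     /* retry if a write slipped in */
}
```

If the reader preempts the writer between model_write_begin() and model_write_end() on the same CPU and has the higher rt priority, the writer never runs again, so every retry fails and the reader spins forever.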

Signed-off-by: Fan Du <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/smp.c: remove 'priv' of call_single_data

The 'priv' field is redundant; we can pass data via 'info'.

Signed-off-by: liguang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/smp.c: use '|=' for csd_lock

csd_lock() uses assignment to data->flags rather than |=. That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.

But it will become buggy if we later add another flag, so fix it now.
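
A minimal userspace sketch of the difference, assuming a second flag bit
existed.  SKETCH_CSD_FLAG_OTHER and the other sketch_ names are
hypothetical; only the existence of a single lock bit comes from the
description above.

```c
#include <assert.h>

#define SKETCH_CSD_FLAG_LOCK	0x01u
#define SKETCH_CSD_FLAG_OTHER	0x02u	/* hypothetical second flag */

struct sketch_csd {
	unsigned int flags;
};

/* Old behaviour: plain assignment overwrites every other bit. */
void sketch_lock_assign(struct sketch_csd *csd)
{
	/* Models the caller having waited for the lock to be free. */
	assert(!(csd->flags & SKETCH_CSD_FLAG_LOCK));
	csd->flags = SKETCH_CSD_FLAG_LOCK;
}

/* New behaviour: OR-ing in the lock bit leaves other bits alone. */
void sketch_lock_or(struct sketch_csd *csd)
{
	assert(!(csd->flags & SKETCH_CSD_FLAG_LOCK));
	csd->flags |= SKETCH_CSD_FLAG_LOCK;
}
```

With only one flag defined the two functions leave identical state, which
is why the assignment is harmless today.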

Signed-off-by: liguang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

writeback: set worker desc to identify writeback workers in task dumps

Writeback has been recently converted to use workqueue instead of its
private thread pool implementation.  One negative side effect of this
conversion is that there's no easy way to tell which backing device a
writeback work item was working on at the time of task dump, be it
sysrq-t, BUG, WARN or whatever, which, according to our writeback
brethren, is important in tracking down issues with a lot of mounted
file systems on a lot of different devices.

This patch restores that information using the new worker description
facility.  bdi_writeback_workfn() calls set_worker_desc() to identify
which bdi it's working on.  The description is printed out together with
the workqueue name and worker function as in the following example dump.

WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
Modules linked in:
Pid: 28, comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 empty empty/S3992
Workqueue: writeback bdi_writeback_workfn (flush-8:16)
  ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
  ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
  ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
Call Trace:
  [<ffffffff81c61855>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81200144>] bdi_writeback_workfn+0x2b4/0x3c0
  [<ffffffff810b4c87>] process_one_work+0x1d7/0x660
  [<ffffffff810b5c72>] worker_thread+0x122/0x380
  [<ffffffff810bdfea>] kthread+0xea/0xf0
  [<ffffffff81c6cedc>] ret_from_fork+0x7c/0xb0

Signed-off-by: Tejun Heo <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

workqueue: include workqueue info when printing debug dump of a worker task

One of the problems that arise when converting a dedicated custom
thread pool to workqueue is that the shared worker pool used by workqueue
anonymizes each worker, making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.

This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description.  When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.

The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info().  print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields.  It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.

The description is currently limited to 24 bytes including the
terminating \0.  worker->desc_valid and worker->desc[] are added and
the 64 bytes marker which was already incorrect before adding the new
fields is moved to the correct position.
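
The fixed-size description can be pictured with the following userspace
sketch.  It is not the kernel implementation: struct sketch_worker and
sketch_set_worker_desc() are hypothetical stand-ins, and only the 24-byte
limit and the desc_valid/desc[] field names are taken from the text above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_WORKER_DESC_LEN	24	/* includes the terminating \0 */

struct sketch_worker {
	bool desc_valid;
	char desc[SKETCH_WORKER_DESC_LEN];
};

/* Format a description into the fixed buffer, truncating if needed. */
void sketch_set_worker_desc(struct sketch_worker *w, const char *fmt, ...)
{
	va_list args;

	assert(w && fmt);
	va_start(args, fmt);
	/* vsnprintf() never writes more than sizeof(w->desc) bytes. */
	vsnprintf(w->desc, sizeof(w->desc), fmt, args);
	va_end(args);
	w->desc_valid = true;
}
```

A call such as sketch_set_worker_desc(&w, "flush-%d:%d", 8, 16) stores
"flush-8:16"; anything longer than 23 characters is silently cut off.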

Here's an example dump with writeback updated to set the bdi name as
worker desc.

Hardware name: Bochs
Modules linked in:
Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
Workqueue: writeback bdi_writeback_workfn (flush-8:0)
  ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
  ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
  ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
Call Trace:
  [<ffffffff81c61845>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
...

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kthread: implement probe_kthread_data()

One of the problems that arise when converting a dedicated custom thread
pool to workqueue is that the shared worker pool used by workqueue
anonymizes each worker, making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug dump
from oops, BUG() and friends.

For example, after writeback is converted to use workqueue instead of
private thread pool, there's no easy way to tell which backing device a
writeback work item was working on at the time of task dump, which,
according to our writeback brethren, is important in tracking down issues
with a lot of mounted file systems on a lot of different devices.

This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.

An example WARN dump would look like the following.

WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
Modules linked in:
CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
Workqueue: writeback bdi_writeback_workfn (flush-8:16)
  ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
  ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
  ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
Call Trace:
  [<ffffffff81c61855>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  ...

This patch:

Implement probe_kthread_data() which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread or its vfork_done is already cleared rendering
struct kthread inaccessible.  In the former case, probe_kthread_data() may
return any value.  In the latter, NULL.

This will be used to safely print debug information without affecting
synchronization in the normal paths.  Workqueue debug info printing on
dump_stack() and friends will make use of it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arc, print-fatal-signals: reduce duplicated information

After the recent generic debug info on dump_stack() and friends, arc
is printing duplicate information on debug dumps.

[ARCLinux]$ ./crash
crash/50: potentially unexpected fatal signal 11. <-- [1]
/sbin/crash, TGID 50 <-- [2]
Pid: 50, comm: crash Not tainted 3.9.0-rc4+ #132 <-- [3]
...

Remove them.

[[email protected]: updated patch desc]
Signed-off-by: Vineet Gupta <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dump_stack: unify debug information printed by show_regs()

show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms.  This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.

show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.

* Archs which didn't print debug info now do.

  alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
  metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
  um, xtensa

* Archs which already printed debug info.  Replaced with
  show_regs_print_info().  The printed information is a superset of what
  used to be there.

  arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86

* s390 is special in that it used to print arch-specific information
  along with generic debug info.  Heiko and Martin think that the
  arch-specific extra isn't worth keeping an s390-specific
  implementation.  Converted to use the generic version.

Note that now all archs print the debug info before actual register
dumps.

An example BUG() dump follows.

kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>]  [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8  EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
  ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
  0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
  ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
  [<ffffffff81000312>] do_one_initcall+0x122/0x170
  [<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  [<ffffffff81c4776e>] kernel_init+0xe/0xf0
  [<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  ...

v2: Typo fix in x86-32.

v3: CPU number dropped from show_regs_print_info() as
    dump_stack_print_info() has been updated to print it.  s390
    specific implementation dropped as requested by s390 maintainers.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Jesper Nilsson <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Acked-by: Chris Metcalf <[email protected]> [tile bits]
Acked-by: Richard Kuo <[email protected]> [hexagon bits]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dump_stack: implement arch-specific hardware description in task dumps

x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.

* x86 show_regs() collects vendor, product and board strings and prints
  them out with PID, comm and utsname.  Some of the information is
  printed again later in the same dump.

* warn_slowpath_common() explicitly accesses the DMI board and prints
  it out with "Hardware name:" label.  This applies to both x86 and
  ia64 but is irrelevant on all other archs.

* ia64 doesn't show DMI information on other non-WARN dumps.

This patch introduces an arch-specific hardware description used by
dump_stack().  It can be set by calling dump_stack_set_arch_desc()
during boot and, if it exists, is printed out on a separate line with
a "Hardware name:" label.

dmi_set_dump_stack_arch_desc() is added which sets the arch-specific
description from DMI data.  It uses dmi_ids_string[] which is set from
dmi_present() used for DMI debug message.  It is a superset of the
information x86 show_regs() is using.  The function is called from x86
and ia64 boot code right after dmi_scan_machine().
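
The set-once, print-if-present behaviour described above can be sketched
in userspace as follows.  This is an illustration under assumptions, not
the kernel code: the sketch_ names and the 128-byte buffer size are
hypothetical, and only the "Hardware name:" label comes from the
description.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static char sketch_arch_desc[128];	/* empty until set during "boot" */

/* Record the hardware description once; later dumps reuse it. */
void sketch_set_arch_desc(const char *fmt, ...)
{
	va_list args;

	assert(fmt);
	va_start(args, fmt);
	vsnprintf(sketch_arch_desc, sizeof(sketch_arch_desc), fmt, args);
	va_end(args);
}

/*
 * Write the "Hardware name:" line into buf if a description exists.
 * Returns 0 if no description was set, otherwise snprintf()'s result.
 */
int sketch_format_hw_line(char *buf, size_t len)
{
	if (sketch_arch_desc[0] == '\0')
		return 0;
	return snprintf(buf, len, "Hardware name: %s\n", sketch_arch_desc);
}
```

Archs that never call the setter simply produce no such line, which is
how the label can stay irrelevant on archs without DMI.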

This makes the explicit DMI handling in warn_slowpath_common()
unnecessary.  Removed.

show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs().  The next patch will unify show_regs() handling and
remove the duplication.

An example WARN dump follows.

WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
  ...

v2: Use the same string as the debug message from dmi_present() which
    also contains BIOS information.  Move hardware name into its own
    line as warn_slowpath_common() did.  This change was suggested by
    Bjorn Helgaas.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dmi: morph dmi_dump_ids() into dmi_format_ids() which formats into a buffer

We're going to use DMI identification for other purposes too. Morph
dmi_dump_ids() which is used to print DMI identification as a debug
message during boot into dmi_format_ids() which formats the same
information sans the leading "DMI:" tag into a string buffer.

dmi_present() is updated to format the information into dmi_ids_string[]
using the new function and print it with "DMI:" prefix.

dmi_ids_string[] will be used for another purpose by a future patch.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dump_stack: consolidate dump_stack() implementations and unify their behaviors

Both dump_stack() and show_stack() are currently implemented by each
architecture.  show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack().  On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.

The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().

There's no reason to require archs to implement two separate but mostly
identical functions.  It leads to unnecessary subtle information.

This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin.  Blackfin's
dump_stack() does something wonky that I don't understand.

Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information.  This is used
in blackfin.
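
The resulting split can be sketched as below.  This is a userspace model
with hypothetical sketch_ names; a string buffer stands in for the kernel
log so that the call order is visible.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char sketch_log[256];	/* stands in for the kernel log */

static void sketch_log_append(const char *line)
{
	size_t used = strlen(sketch_log);

	assert(used + strlen(line) < sizeof(sketch_log));
	strcpy(sketch_log + used, line);
}

/* Generic task info; usable from an arch-specific dump_stack() too. */
void sketch_dump_stack_print_info(void)
{
	sketch_log_append("info;");
}

/* Arch-provided bare backtrace, as show_stack(NULL, NULL) would print. */
void sketch_show_stack(void)
{
	sketch_log_append("backtrace;");
}

/* Generic dump_stack(): debug info first, then the bare backtrace. */
void sketch_dump_stack(void)
{
	sketch_dump_stack_print_info();
	sketch_show_stack();
}
```

An arch that keeps its own dump_stack() can still call the info helper
directly and then do its own backtrace.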

This patch brings the following behavior changes.

* On some archs, an extra level in backtrace for show_stack() could be
  printed.  This is because the top frame was determined in
  dump_stack() on those archs while generic dump_stack() can't do that
  reliably.  It can be compensated by inlining dump_stack() but not
  sure whether that'd be necessary.

* Most archs didn't use to print debug info on dump_stack().  They do
  now.

An example WARN dump follows.

WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Hardware name: empty
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a071>] init_workqueues+0x35/0x505
  ...

v2: CPU number added to the generic debug info as requested by s390
    folks and dropped the s390 specific dump_stack().  This loses %ksp
    from the debug message which the maintainers think isn't important
    enough to keep the s390-specific dump_stack() implementation.

    dump_stack_print_info() is moved to kernel/printk.c from
    lib/dump_stack.c.  Because linkage is per object file,
    dump_stack_print_info() living in the same lib file as generic
    dump_stack() means that archs which implement custom dump_stack()
    - at this point, only blackfin - can't use dump_stack_print_info()
    as that will bring in the generic version of dump_stack() too.
    The v1 patch broke build on blackfin due to this issue.  The build
    breakage was reported by Fengguang Wu.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Vineet Gupta <[email protected]>
Acked-by: Jesper Nilsson <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]> [s390 bits]
Cc: Heiko Carstens <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Acked-by: Richard Kuo <[email protected]> [hexagon bits]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sparc32: make show_stack() acquire %fp if @_ksp is not specified

show_stack(current or NULL, NULL) is used by arch-independent code to dump
backtrace of the current task; however, sparc32 show_stack() doesn't
implement it and wouldn't print any backtrace when NULL @_ksp is specified.

Make show_stack() acquire and use %fp if @tsk is NULL or current and @_ksp
is NULL. This makes %fp fetching in dump_stack() unnecessary. Make it
use NULL for @_ksp instead.

Only compile tested.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David S. Miller <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86: don't show trace beyond show_stack(NULL, NULL)

There are multiple ways a task can be dumped - explicit call to
dump_stack(), triggering WARN() or BUG(), through sysrq-t and so on.
Most of what gets printed is up to each architecture and the current
state is not particularly pretty.  Different pieces of information are
presented differently depending on which path the dump takes and which
architecture it's running on.  This is messy for no good reason and
makes it exceedingly difficult to add or modify debug information to
task dumps.

In all archs except for s390, there's nothing arch-specific about the
printed debug information.  This patchset updates all those archs to use
the same helpers to consistently print out the same debug information.

An example WARN dump after this patchset.

WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
  ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
  0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
  [<ffffffff81c614dc>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
  ...

And BUG dump.

kernel BUG at kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>]  [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8  EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
  ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
  0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
  ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
  [<ffffffff81000312>] do_one_initcall+0x122/0x170
  [<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  [<ffffffff81c4776e>] kernel_init+0xe/0xf0
  [<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81c47760>] ? rest_init+0x140/0x140
  ...

This patchset contains the following seven patches.

0001-x86-don-t-show-trace-beyond-show_stack-NULL-NULL.patch
0002-sparc32-make-show_stack-acquire-fp-if-_ksp-is-not-sp.patch
0003-dump_stack-consolidate-dump_stack-implementations-an.patch
0004-dmi-morph-dmi_dump_ids-into-dmi_format_ids-which-for.patch
0005-dump_stack-implement-arch-specific-hardware-descript.patch
0006-dump_stack-unify-debug-information-printed-by-show_r.patch
0007-arc-print-fatal-signals-reduce-duplicated-informatio.patch

0001-0002 update stack dumping functions in x86 and sparc32 in
preparation.

0003 makes all arches except blackfin use generic dump_stack().
blackfin still uses the generic helper to print the same info.

0004-0005 properly abstract DMI identifier printing in WARN() and
show_regs() so that all dumps print out the information.  This enables
show_regs() to use the same debug info message.

0006 updates show_regs() of all arches to use a common generic helper
to print debug info.

0007 removes some duplicate information from arc dumps.

While this patchset changes how debug info is printed on some archs,
the printed information is always a superset of what used to be there.

This patchset makes task dump debug messages consistent and enables
adding more information.  Workqueue is scheduled to add worker
information including the workqueue in use and work item specific
description.

While this patch touches a lot of archs, it isn't too likely to cause
non-trivial conflicts with arch-specific changes and would probably be
best to route together through -mm.

x86 is tested but other archs are either only compile tested or not
tested at all.  Changes to most archs are generally trivial.

This patch:

show_stack(current or NULL, NULL) is used to print the backtrace of the
current task.  As trace beyond the function itself isn't of much
interest to anyone, don't show it by determining sp and bp in
show_stack()'s frame and passing them to show_stack_log_lvl().

This brings show_stack(NULL, NULL)'s behavior in line with
dump_stack().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/range.c: subtract_range: fix the broken phrase issued by printk

Also replace the deprecated printk(KERN_ERR...) with pr_err() as suggested
by Yinghai, and include the function name in the message for context.

Signed-off-by: Lin Feng <[email protected]>
Cc: Yinghai Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftest: add simple test for soft-dirty bit

It creates a mapping of 3 pages and checks that reads, writes and
clear-refs result in the present and soft-dirty bits reported from
pagemap2 being set as expected.

[[email protected]: alphasort the Makefile TARGETS to reduce rejects]
Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

staging: zcache: enable zcache to be built/loaded as a module

Allow zcache to be built/loaded as a module. Note runtime dependency
disallows loading if cleancache/frontswap lazy initialization patches
are not present. Zsmalloc support has not yet been merged into zcache
but, once merged, could now easily be selected via a module_param.

If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.

Note that module unload is explicitly not yet supported.

Signed-off-by: Dan Magenheimer <[email protected]>
[v1: Rebased with different order of patches]
[v2: Removed [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v3: Rebased on top of ramster->zcache move]
[v4: Redid the Makefile]
[v5: s/ZCACHE2/ZCACHE/]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

staging: zcache: enable ramster to be built/loaded as a module

Enable module support for ramster. Note runtime dependency disallows
loading if cleancache/frontswap lazy initialization patches are not
present.

If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.

Note that module unload is explicitly not yet supported.

[v1: Fixed compile issues since ramster_init now has four arguments]
[v2: Fixed rebase on ramster->zcache move]
[[email protected]: use_frontswap_selfshrink cannot be __initdata]
Signed-off-by: Dan Magenheimer <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zcache/tmem: Better error checking on frontswap_register_ops return value.

In the past it either used to be NULL or the "older" backend. Now we
also return -Exx error codes.
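
A caller therefore has three cases to tell apart.  The userspace sketch
below is modeled on the kernel's ERR_PTR() convention, on the assumption
that this is how the -Exx codes are carried in the returned pointer.  The
sketch_ helpers are hypothetical re-creations for illustration, not the
zcache code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO	4095

/* Encode a negative errno value in a pointer. */
void *sketch_err_ptr(long error)
{
	assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* True if ptr falls in the top SKETCH_MAX_ERRNO addresses. */
int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the negative errno value from an error pointer. */
long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

enum sketch_result { SKETCH_FIRST, SKETCH_REPLACED, SKETCH_FAILED };

/* Classify the three kinds of return value a caller now has to handle. */
enum sketch_result sketch_classify(const void *old_ops)
{
	if (sketch_is_err(old_ops))
		return SKETCH_FAILED;	/* -Exx error code */
	if (old_ops)
		return SKETCH_REPLACED;	/* an older backend was registered */
	return SKETCH_FIRST;		/* NULL: no previous backend */
}
```

The error test has to come first: an error pointer is also non-NULL, so a
plain NULL check would mistake it for an older backend.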

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

xen: tmem: enable Xen tmem shim to be built/loaded as a module

Allow Xen tmem shim to be built/loaded as a module. Xen self-ballooning
and frontswap-selfshrinking are now also "lazily" initialized when the
Xen tmem shim is loaded as a module, unless explicitly disabled by
module parameters.

Note runtime dependency disallows loading if cleancache/frontswap lazy
initialization patches are not present.

If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.

Note that module unload is explicitly not yet supported.

[v1: Removed the [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v2: Squashed the xen/tmem: Remove the subsys call patch in]
[[email protected]: fix build (disable_frontswap_selfshrinking undeclared)]
Signed-off-by: Dan Magenheimer <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cleancache: clean up cleancache_enabled

cleancache_ops is used to decide whether a backend is registered.
So now cleancache_enabled is always true if CONFIG_CLEANCACHE is defined.

Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cleancache: Make cleancache_init use a pointer for the ops

Use the cleancache_ops pointer, instead of a backend_registered flag, to
determine whether a backend is enabled. This allows us to remove the
backend_registered check and just do 'if (cleancache_ops)'
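
A minimal userspace sketch of the pattern, not the kernel code itself.
struct cache_ops, cache_register_ops(), cache_get_page() and the dummy
backend are hypothetical stand-ins chosen for illustration; only the idea
of testing the ops pointer instead of a separate flag comes from the
description above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a backend operations table. */
struct cache_ops {
        int (*get_page)(int pool_id, unsigned long index);
};

/* NULL means "no backend registered"; no separate boolean is needed. */
struct cache_ops *cache_ops;

/* Install a backend and return the previously installed one (or NULL). */
struct cache_ops *cache_register_ops(struct cache_ops *ops)
{
        struct cache_ops *old = cache_ops;

        cache_ops = ops;
        return old;
}

/* Front-end call: fails with -1 until a backend has registered. */
int cache_get_page(int pool_id, unsigned long index)
{
        if (!cache_ops)
                return -1;
        return cache_ops->get_page(pool_id, index);
}

/* Trivial example backend that pretends every page is present. */
int dummy_get_page(int pool_id, unsigned long index)
{
        (void)pool_id;
        (void)index;
        return 0;
}

struct cache_ops dummy_ops = { .get_page = dummy_get_page };
```

Calling cache_get_page() before cache_register_ops(&dummy_ops) returns -1;
afterwards it is forwarded to the backend.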

[v1: Rebase on top of b97c4b430b0a (ramster->zcache move)]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cleancache: lazy initialization to allow tmem backends to build/run as modules

With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to cleancache even after filesystems were mounted. Calls to
init_fs and init_shared_fs are remembered as fake poolids but no real
tmem_pools created. On backend registration the fake poolids are mapped
to real poolids and respective tmem_pools.
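
The mapping described above can be sketched in userspace C as follows.
This is a simplified model under stated assumptions, not the cleancache
implementation: MAX_FS, FAKE_POOL_BASE, the slot states and every
function name here are hypothetical, and the "backend" is reduced to a
single init_fs callback that hands out real pool ids.

```c
#include <stddef.h>

#define MAX_FS          8       /* hypothetical cap on tracked filesystems */
#define FAKE_POOL_BASE  1000    /* hypothetical offset marking fake pool ids */

enum slot_state { SLOT_FREE, SLOT_PENDING, SLOT_REAL };

/* Hypothetical backend: init_fs() creates a pool, returns its real id. */
struct backend_ops {
        int (*init_fs)(void);
};

struct backend_ops *backend;            /* NULL until a backend registers */
enum slot_state slot_state[MAX_FS];
int real_poolid[MAX_FS];                /* valid only for SLOT_REAL slots */

/* Mount-time call: always returns a fake pool id (or -1 if full). */
int lazy_init_fs(void)
{
        for (int i = 0; i < MAX_FS; i++) {
                if (slot_state[i] != SLOT_FREE)
                        continue;
                if (backend) {
                        real_poolid[i] = backend->init_fs();
                        slot_state[i] = SLOT_REAL;
                } else {
                        slot_state[i] = SLOT_PENDING;   /* remember only */
                }
                return FAKE_POOL_BASE + i;
        }
        return -1;
}

/* Backend registration: create the pools that were only remembered. */
void lazy_register_backend(struct backend_ops *ops)
{
        backend = ops;
        for (int i = 0; i < MAX_FS; i++) {
                if (slot_state[i] == SLOT_PENDING) {
                        real_poolid[i] = ops->init_fs();
                        slot_state[i] = SLOT_REAL;
                }
        }
}

/* Translate a fake pool id; -1 while no real pool exists for it. */
int lazy_real_poolid(int fake)
{
        int i = fake - FAKE_POOL_BASE;

        if (i < 0 || i >= MAX_FS || slot_state[i] != SLOT_REAL)
                return -1;
        return real_poolid[i];
}

/* Example backend that numbers its pools 0, 1, 2, ... */
int demo_next_pool;
int demo_init_fs(void) { return demo_next_pool++; }
struct backend_ops demo_backend = { .init_fs = demo_init_fs };
```

Filesystems "mounted" before lazy_register_backend() keep their fake ids;
only the translation to a real pool changes once the backend arrives.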

Signed-off-by: Stefan Hengelein <[email protected]>
Signed-off-by: Florian Schmaus <[email protected]>
Signed-off-by: Andor Daam <[email protected]>
Signed-off-by: Dan Magenheimer <[email protected]>
[v1: Minor fixes: used #define for some values and bools]
[v2: Removed CLEANCACHE_HAS_LAZY_INIT]
[v3: Added more comments, added a lock for [shared_|]fs_poolid_map]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frontswap: get rid of swap_lock dependency

The frontswap initialization routine depends on swap_lock, which is meant
to make frontswap's first appearance atomic. IOW, either frontswap is not
present and fails all calls, or frontswap is fully functional. But until a
new swap_info_struct is registered by enable_swap_info, the swap subsystem
doesn't start I/O, so there is no race between the init procedure and page
I/O working on frontswap.

So let's remove unnecessary swap_lock dependency.

Cc: Dan Magenheimer <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[v1: Rebased on my branch, reworked to work with backends loading late]
[v2: Added a check for !map]
[v3: Made the invalidate path follow the init path]
[v4: Address comments by Wanpeng Li <[email protected]>]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: frontswap: cleanup code

After allowing tmem backends to build/run as modules, frontswap_enabled is
always true if CONFIG_FRONTSWAP is defined. But frontswap_test() depends on
whether a backend is registered, so move it into frontswap.c and use
frontswap_ops to make the decision.

frontswap_set/clear are not used outside frontswap, so don't export them.

Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frontswap: make frontswap_init use a pointer for the ops

This simplifies the code in the frontswap - we can get rid of the
'backend_registered' test and instead check against frontswap_ops.

[v1: Rebase on top of 703ba7fe5e0 (ramster->zcache move)]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Andor Daam <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Florian Schmaus <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Stefan Hengelein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: frontswap: lazy initialization to allow tmem backends to build/run as modules

With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to frontswap even after swapon was run. Before a backend
registers all calls to init are recorded and the creation of tmem_pools
delayed until a backend registers or until a frontswap store is
attempted.

Signed-off-by: Stefan Hengelein <[email protected]>
Signed-off-by: Florian Schmaus <[email protected]>
Signed-off-by: Andor Daam <[email protected]>
Signed-off-by: Dan Magenheimer <[email protected]>
[v1: Fixes per Seth Jennings suggestions]
[v2: Removed FRONTSWAP_HAS_.. ]
[v3: Fix up per Bob Liu <[email protected]> recommendations]
[v4: Fix up per Andrew's comments]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/dcache.c: add cond_resched() to shrink_dcache_parent()

Call cond_resched() in shrink_dcache_parent() to maintain interactivity.

Before this patch:

void shrink_dcache_parent(struct dentry * parent)
{
        while ((found = select_parent(parent, &dispose)) != 0)
                shrink_dentry_list(&dispose);
}

select_parent() populates the dispose list with dentries which
shrink_dentry_list() then deletes.  select_parent() carefully uses
need_resched() to avoid doing too much work at once.  But neither
shrink_dcache_parent() nor its called functions call cond_resched().  So
once need_resched() is set, select_parent() returns a dispose list
holding a single dentry, which is then deleted by shrink_dentry_list().
This is inefficient when there are a lot of dentries to process.  It can
cause soft lockups and hurts interactivity on non-preemptible kernels.

This change adds cond_resched() in shrink_dcache_parent().  The benefit
of this is that need_resched() is quickly cleared so that future calls
to select_parent() are able to efficiently return a big batch of
dentries.
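
The batching effect can be illustrated with a small userspace model.
This is only a sketch under simplifying assumptions: struct
resched_model and the model_*() helpers are hypothetical, the
need_resched flag is set once and never re-raised, and "rescheduling"
is reduced to clearing that flag.

```c
#include <stdbool.h>

/* Hypothetical model state; none of this is kernel code. */
struct resched_model {
        int remaining;          /* dentries still to be pruned */
        bool need_resched;      /* set by "other system activity" */
        int passes;             /* select/shrink rounds performed */
};

/*
 * Model of select_parent(): gathers everything it can, but stops after
 * the first dentry when a reschedule is pending.
 */
int model_select(struct resched_model *m)
{
        int found = m->remaining;

        if (m->need_resched && found > 1)
                found = 1;
        m->remaining -= found;
        return found;
}

/* Model of cond_resched(): yielding clears the pending flag. */
void model_cond_resched(struct resched_model *m)
{
        m->need_resched = false;
}

/* Model of shrink_dcache_parent(), with or without the added call. */
int model_shrink(struct resched_model *m, bool use_cond_resched)
{
        while (model_select(m) != 0) {
                m->passes++;    /* stands in for shrink_dentry_list() */
                if (use_cond_resched)
                        model_cond_resched(m);
        }
        return m->passes;
}
```

In this model, 1000 dentries with the flag set take 1000 passes without
the cond_resched() step and 2 passes with it.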

These additional cond_resched() do not seem to impact performance, at
least for the workload below.

Here is a program which can cause soft lockup if other system activity
sets need_resched().

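#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)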
{
        struct rlimit rlim;
        int i;
        int f[100000];
        char buf[20];
        struct timeval t1, t2;
        double diff;

        /* cleanup past run */
        system("rm -rf x");

        /* boost nfile rlimit */
        rlim.rlim_cur = 200000;
        rlim.rlim_max = 200000;
        if (setrlimit(RLIMIT_NOFILE, &rlim))
                err(1, "setrlimit");

        /* make directory for files */
        if (mkdir("x", 0700))
                err(1, "mkdir");

        if (gettimeofday(&t1, NULL))
                err(1, "gettimeofday");

        /* populate directory with open files */
        for (i = 0; i < 100000; i++) {
                snprintf(buf, sizeof(buf), "x/%d", i);
                f[i] = open(buf, O_CREAT, 0600);
                if (f[i] == -1)
                        err(1, "open");
        }

        /* close some of the files */
        for (i = 0; i < 85000; i++)
                close(f[i]);

        /* unlink all files, even open ones */
        system("rm -rf x");

        if (gettimeofday(&t2, NULL))
                err(1, "gettimeofday");

        diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) -
                ((double)t1.tv_sec * 1000000 + t1.tv_usec));

        printf("done: %g elapsed\n", diff/1e6);
        return 0;
}

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/block_dev.c: no need to check inode->i_bdev in bd_forget()

Its only caller evict() has promised a non-NULL inode->i_bdev.

Signed-off-by: Yan Hong <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

inotify: invalid mask should return an error number but not set it

When we run the crackerjack testsuite, the inotify_add_watch test is
stalled.

This is caused by the invalid mask 0 - the task is waiting for the event
but it never comes. inotify_add_watch() should return -EINVAL as it did
before commit 676a0675cf92 ("inotify: remove broken mask checks causing
unmount to be EINVAL"). That commit removes the invalid mask check, but
that check is needed.

Check the mask's ALL_INOTIFY_BITS before the inotify_arg_to_mask() call.
If none are set, just return -EINVAL.

Because IN_UNMOUNT is in ALL_INOTIFY_BITS, this change will not trigger
the problem that the above commit fixed.
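
A userspace sketch of the check described above. MODEL_ALL_INOTIFY_BITS
is an assumed approximation of the kernel-internal ALL_INOTIFY_BITS,
built from the public <sys/inotify.h> constants, and model_check_mask()
is a hypothetical helper, not the function the patch touches.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>

/* Assumed approximation of the kernel-internal ALL_INOTIFY_BITS. */
#define MODEL_ALL_INOTIFY_BITS \
        (IN_ALL_EVENTS | IN_UNMOUNT | IN_Q_OVERFLOW | IN_IGNORED | \
         IN_ONLYDIR | IN_DONT_FOLLOW | IN_EXCL_UNLINK | IN_MASK_ADD | \
         IN_ISDIR | IN_ONESHOT)

/*
 * Reject a mask that carries none of the known inotify bits, before any
 * further translation of the mask would take place.
 */
int model_check_mask(uint32_t mask)
{
        if (!(mask & MODEL_ALL_INOTIFY_BITS))
                return -EINVAL;
        return 0;
}
```

A mask of 0 is rejected with -EINVAL, while a mask containing only
IN_UNMOUNT still passes.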

[[email protected]: fix build]
Signed-off-by: Zhao Hongjiang <[email protected]>
Acked-by: Jim Somerville <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memory hotplug: fix warnings

Fix the following compilation warnings:

  mm/slab.c: In function `kmem_cache_init_late':
  mm/slab.c:1778:2: warning: statement with no effect [-Wunused-value]

  mm/page_cgroup.c: In function `page_cgroup_init':
  mm/page_cgroup.c:305:2: warning: statement with no effect [-Wunused-value]

Signed-off-by: Vincent Stehlé <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/usb/storage/realtek_cr.c: fix build

Remove unused local `us', which broke the build. Also nuke an unneeded
cast.

Repairs commit 191648d03d20 ("usb: storage: Convert US_DEBUGP to
usb_stor_dbg").

Cc: Joe Perches <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull security subsystem update from James Morris:
"Just some minor updates across the subsystem"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  ima: eliminate passing d_name.name to process_measurement()
  TPM: Retry SaveState command in suspend path
  tpm/tpm_i2c_infineon: Add small comment about return value of __i2c_transfer
  tpm/tpm_i2c_infineon.c: Add OF attributes type and name to the of_device_id table entries
  tpm_i2c_stm_st33: Remove duplicate inclusion of header files
  tpm: Add support for new Infineon I2C TPM (SLB 9645 TT 1.2 I2C)
  char/tpm: Convert struct i2c_msg initialization to C99 format
  drivers/char/tpm/tpm_ppi: use strlcpy instead of strncpy
  tpm/tpm_i2c_stm_st33: formatting and white space changes
  Smack: include magic.h in smackfs.c
  selinux: make security_sb_clone_mnt_opts return an error on context mismatch
  seccomp: allow BPF_XOR based ALU instructions.
  Fix NULL pointer dereference in smack_inode_unlink() and smack_inode_rmdir()
  Smack: add support for modification of existing rules
  smack: SMACK_MAGIC to include/uapi/linux/magic.h
  Smack: add missing support for transmute bit in smack_str_from_perm()
  Smack: prevent revoke-subject from failing when unseen label is written to it
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss
  tomoyo: use DEFINE_SRCU() to define tomoyo_ss

Merge tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev

Pull libata update from Jeff Garzik:

- More ACPI fixes, cleanups

- Minor cleanups for sata_highbank, pata_at32, pata_octeon_cf,
   sata_rcar

- pata_legacy: small bug found in opti chipset code (untested fix, due
   to ancient h/w)

- sata_fsl: RX water mark config knob, some h/w needs it

- pata_imx: cleanups, DeviceTree support

- SCSI<->ATA translator: properly export translator version, not device
   firmware version

* tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  sata_highbank: Rename proc_name to the module name
  ACPI/libata: Restore libata.noacpi support
  [libata] acpi: make ata_ap_acpi_handle not block
  [libata] SCSI: really use SATL version in VPD
  pata_imx: add devicetree support
  pata_imx: use void __iomem * for regs
  pata_imx: cleanup error path
  pata_imx: Use devm_clk_get
  sata_rcar: Convert to devm_ioremap_resource()
  fsl/sata: create a sysfs entry for rx water mark
  libata-acpi: remove redundent code for power resource handling
  sata_highbank: make ahci_highbank_pm_ops static
  pata_octeon_cf: Use resource_size function
  pata_legacy: bogus clock in opti82c46x_set_piomode()
  pata_at32: use module_platform_driver_probe()

mlx4_en: fix a build error on 32bit arches

commit b6c39bfcf1d7d63 ("net/mlx4_en: Add a service task")
added a build error on 32bit arches.

ERROR: "__udivdi3" [drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko]
undefined!

Fix this problem by using do_div()
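
For context: on 32-bit targets gcc turns a plain 64-bit division into a
call to the libgcc helper __udivdi3, which the kernel does not link
against. The kernel's do_div(n, base) divides the 64-bit n in place and
returns the remainder. The sketch below models only that calling
contract in userspace; model_do_div() is a hypothetical name, and it
uses ordinary C division, which is fine outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the do_div() contract: *n is replaced by the
 * quotient and the 32-bit remainder is returned.
 */
uint32_t model_do_div(uint64_t *n, uint32_t base)
{
        uint32_t rem;

        assert(base != 0);      /* division by zero is the caller's bug */
        rem = (uint32_t)(*n % base);
        *n /= base;
        return rem;
}
```

The in-place update is the usual pitfall: after the call the first
argument holds the quotient, not the original value.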

Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Cc: Amir Vadai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "bnx2x: allow nvram test to run when device is down"

This reverts commit d2d2d87dfd1a25ee270994c5b9e3eb4690428d32
("bnx2x: allow nvram test to run when device is down").

Since it makes access to the device in D3 state possible.
More work is required to make sure device is not set to D3
during ifdown. Until this is done the nvram-test should simply
exit if device is down like it did before.

Signed-off-by: Dmitry Kravkov <[email protected]>
Signed-off-by: Eilon Greenstein <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael J Wysocki:

- ARM big.LITTLE cpufreq driver from Viresh Kumar.

- exynos5440 cpufreq driver from Amit Daniel Kachhap.

- cpufreq core cleanup and code consolidation from Viresh Kumar and
   Stratos Karafotis.

- cpufreq scalability improvement from Nathan Zimmer.

- AMD "frequency sensitivity feedback" powersave bias for the ondemand
   cpufreq governor from Jacob Shin.

- cpuidle code consolidation and cleanups from Daniel Lezcano.

- ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.

- ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim, Lv
   Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.

- ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
   Yasuaki Ishimatsu, and Rafael J Wysocki.

- Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements from
   Rafael J Wysocki and Andy Shevchenko.

* tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (192 commits)
  cpufreq: Revert incorrect commit 5800043
  cpufreq: MAINTAINERS: Add co-maintainer
  cpuidle: add maintainer entry
  ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
  ARM: s3c64xx: cpuidle: use init/exit common routine
  cpufreq: pxa2xx: initialize variables
  ACPI: video: correct acpi_video_bus_add error processing
  SH: cpuidle: use init/exit common routine
  ARM: S5pv210: compiling issue, ARM_S5PV210_CPUFREQ needs CONFIG_CPU_FREQ_TABLE=y
  ACPI: Fix wrong parameter passed to memblock_reserve
  cpuidle: fix comment format
  pnp: use %*phC to dump small buffers
  isapnp: remove debug leftovers
  ARM: imx: cpuidle: use init/exit common routine
  ARM: davinci: cpuidle: use init/exit common routine
  ARM: kirkwood: cpuidle: use init/exit common routine
  ARM: calxeda: cpuidle: use init/exit common routine
  ARM: tegra: cpuidle: use init/exit common routine for tegra3
  ARM: tegra: cpuidle: use init/exit common routine for tegra2
  ARM: OMAP4: cpuidle: use init/exit common routine
  ...

Merge tag 'for-v3.10' of git://git.infradead.org/battery-2.6

Pull battery updates from Anton Vorontsov:
"Highlights:

   - OpenFirmware/DeviceTree support for the Power Supply core: the core
     now automatically populates supplied_from hierarchy from the device
     tree.  With these patches chargers and batteries can now lookup
     each other without the board files support shim.  Rhyland Klein at
     NVIDIA did the work

   - New ST-Ericsson ABX500 hwmon driver.  The driver is heavily using
     the AB85xx core and depends on some recent changes to it, so that
     is why the driver comes through the battery tree.  It has an
     appropriate ack from the hwmon maintainer (i.e.  Guenter Roeck).
     Martin Persson at ST-Ericsson and Hongbo Zhang at Linaro authored
     the driver

   - Final bits to sync AB85xx ST-Ericsson changes into mainline.  The
     changes touch mfd parts, but these were acked by the appropriate
     MFD maintainer (ie Samuel Ortiz).  Lee Jones at Linaro did most of
     the work and led the submission process.

  Minor changes, but still worth mentioning:

   - Battery temperature reporting fix for Nokia N900 phones
   - Versatile Express poweroff driver moved into drivers/power/reset/
   - Tree-wide: use devm_kzalloc() where appropriate
   - Tree-wide: dev_pm_ops cleanups/fixes"

* tag 'for-v3.10' of git://git.infradead.org/battery-2.6: (112 commits)
  pm2301-charger: Fix suspend/resume
  charger-manager: Use kmemdup instead of kzalloc + memcpy
  power_supply: Populate supplied_from hierarchy from the device tree
  power_supply: Add core support for supplied_from
  power_supply: Define Binding for power-supplies
  rx51_battery: Fix reporting temperature
  hwmon: Add ST-Ericsson ABX500 hwmon driver
  ab8500_bmdata: Export abx500_res_to_temp tables for hwmon
  ab8500_{bmdata,fg}: Add const attributes to some data arrays
  ab8500_bmdata: Eliminate CamelCase warning of some variables
  ab8500_btemp: Make ab8500_btemp_get* interfaces public
  goldfish_battery: Use resource_size()
  lp8788-charger: Use PAGE_SIZE for the sysfs read operation
  max8925_power: Use devm_kzalloc()
  da9030_battery: Use devm_kzalloc()
  da9052-battery: Use devm_kzalloc()
  ds2760_battery: Use devm_kzalloc()
  ds2780_battery: Use devm_kzalloc()
  gpio-charger: Use devm_kzalloc()
  isp1704_charger: Use devm_kzalloc()
  ...

sata_highbank: Rename proc_name to the module name

mkinitrd looks at /sys/class/scsi_host/host$hostnum/proc_name to find
the module name of a disk driver. Current name is "highbank-ahci" but
the module is "sata_highbank". Rename it to match the module name.

Cc: Rob Herring <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: <[email protected]> v3.7..
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>

ACPI/libata: Restore libata.noacpi support

This patch restores libata.noacpi support to libata-acpi.c.
There are broken optional control methods for ATA controller devices in the
real world.  libata.noacpi has long been used as a workaround for issues
caused by broken ASL code.
1. The "noacpi" option was introduced by the following commit:
   commit 11ef697b37e3c85ce1ac21f7711babf1f5b12784
   Date: Thu, 28 Sep 2006 11:29:01 -0700
   Subject: libata: ACPI and _GTF support
2. The "noacpi" option was renamed to "libata_noacpi" by the following
   commit:
   commit d7d0dad62a641c156386288a747c1a2f6bb2e42d
   Date: Wed, 28 Mar 2007 01:57:37 -0400
   Subject: [libata] Disable ACPI by default; fix namespace problems
3. Some of its logic changed over time; it has relied on the
   "acpi_handle" bound to the ATA devices since this commit:
   commit fafbae87db88a73b166d3bc3294d209207f27056
   Date: Tue, 15 May 2007 03:28:16 +0900
   Subject: libata-acpi: implement ata_acpi_associate()
4. The option was deleted by the following commit:
   commit 30dcf76acc695cbd2fa919e294670fe9552e16e7
   Date: Mon, 25 Jun 2012 16:13:04 +0800
   Subject: libata: migrate ACPI code over to new bindings
But the libata.noacpi setup is still left in the kernel without any code to
implement it, so the deletion introduced a regression.
This patch disables ATA_ACPI support at runtime by stopping ACPI binding
on the ATA devices to fix this regression.
This patch was tested by booting a SATA x86-64 kernel or a PATA x86 kernel
with or without the "libata.noacpi=1" kernel command line argument.

Signed-off-by: Lv Zheng <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>

[libata] acpi: make ata_ap_acpi_handle not block

Since commit 30dcf76acc, ata_ap_acpi_handle always does a namespace
walk, which requires acquiring an ACPI namespace mutex. This makes it
impossible to use when the calling path already holds a spinlock.

For example, it can occur in the following code path for pata_acpi:
ata_scsi_queuecmd (ap->lock is acquired)
  __ata_scsi_queuecmd
    ata_scsi_translate
      ata_qc_issue
        pacpi_qc_issue
          ata_acpi_stm
            ata_ap_acpi_handle
              acpi_get_child
                acpi_walk_namespace
                  acpi_ut_acquire_mutex (acquire mutex while holding lock)
This caused a scheduling-while-atomic bug, as reported in bug #56781.

Actually, ata_ap_acpi_handle doesn't have to walk the namespace every
time it is called; it can simply return the ACPI handle bound to the
corresponding SCSI host. It was not done this way previously because
ata_ap_acpi_handle is used by ata_acpi_gtm in the binding function
ata_acpi_bind_host, when the handle is not yet bound to the SCSI host.
Since we already have the ATA port's handle in its binding function, we
can simply use it instead of calling ata_ap_acpi_handle there. So
introduce a new function, __ata_acpi_gtm, which receives an ACPI handle
parameter in addition to the ATA port; the port is used solely for the
debug statement. With this change, ata_ap_acpi_handle can simply return
the handle bound to the SCSI host instead of walking the ACPI namespace.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=56781
Reported-and-tested-by: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Pull fixup for trivial branch from Jiri Kosina:
"Unfortunately I made a mistake when merging into for-linus branch, and
  omitted one pre-requisity patch for a few other patches (which have
  been Acked by the appropriate maintainers) in the series.  Mea culpa
  maxima, sorry for that."

The trivial branch added %pSR usage before actually teaching vsnprintf()
about the 'R' part of %pSR.  The 'R' causes the symbol translation to do
a "__builtin_extract_return_addr()" before symbol lookup.

That said, on most architectures __builtin_extract_return_addr() isn't
likely to do anything special, so it probably is not normally
noticeable.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  vsprintf: Add extension %pSR - print_symbol replacement

vsprintf: Add extension %pSR - print_symbol replacement

print_symbol takes a long and converts it to a function
name and offset. %pS does something similar, but doesn't
translate the address via __builtin_extract_return_addr.
%pSR does the translation.

This will enable replacing multiple calls like
        printk(...);
        printk_symbol(addr);
        printk("\n");
with a single call whose output cannot be interleaved in dmesg:
        printk("... %pSR\n", (void *)addr);

Update documentation too.

Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull first round of SCSI updates from James "Jej B" Bottomley:
"The patch set is mostly driver updates (qla4, qla2 [ISF support
  updates], lpfc, aacraid [dual firmware image support]) and a few bug
  fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (47 commits)
  [SCSI] iscsi_tcp: support PF_MEMALLOC/__GFP_MEMALLOC
  [SCSI] libiscsi: avoid unnecessary multiple NULL assignments
  [SCSI] qla4xxx: Update driver version to 5.03.00-k8
  [SCSI] qla4xxx: Added print statements to display AENs
  [SCSI] qla4xxx: Use correct value for max flash node entries
  [SCSI] qla4xxx: Restrict logout from boot target session using session id
  [SCSI] qla4xxx: Use correct flash ddb offset for ISP40XX
  [SCSI] isci: add CONFIG_PM_SLEEP to suspend/resume functions
  [SCSI] scsi_dh_alua: Add module parameter to allow failover to non preferred path without STPG
  [SCSI] qla2xxx: Update the driver version to 8.05.00.03-k.
  [SCSI] qla2xxx: Obtain loopback iteration count from bsg request.
  [SCSI] qla2xxx: Add clarifying printk to thermal access fail cases.
  [SCSI] qla2xxx: Remove duplicated include form qla_isr.c
  [SCSI] qla2xxx: Enhancements to support ISPFx00.
  [SCSI] qla4xxx: Update driver version to 5.03.00-k7
  [SCSI] qla4xxx: Replace dev type macros with generic portal type macros
  [SCSI] scsi_transport_iscsi: Declare portal type string macros for generic use
  [SCSI] qla4xxx: Add flash node mgmt support
  [SCSI] libiscsi: export function iscsi_switch_str_param
  [SCSI] scsi_transport_iscsi: Add flash node mgmt support
  ...

Merge branch 'for-next-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending

Pull SCSI target update from Nicholas Bellinger:
"The highlights this round include:

   - Add fileio support for WRITE_SAME w/ UNMAP=1 discard (asias)
   - Add fileio support for UNMAP discard (asias)
   - Add tcm_vhost hotplug support to work with upstream QEMU
     vhost-scsi-pci code (asias + mst)
   - Check for aborted sequence in tcm_fc response path (mdr)
   - Add initial iscsit_transport support into iscsi-target code (nab)
   - Refactor iscsi-target RX PDU logic + export request PDU handling
     (nab)
   - Refactor iscsi-target TX queue logic + export response PDU creation
     (nab)
   - Add new iSCSI Extensions for RDMA (ISER) target driver (Or + nab)

  The biggest changes revolve around iscsi-target refactoring in order
  to support the iser-target driver.  This includes the conversion of
  the iscsi-target data-path to use modern se_cmd->cmd_kref counting,
  and allowing transport independent aspects of RX/TX PDU
  request/response handling be shared across existing traditional
  iscsi-target code, and the new iser-target code.

  Thanks to Or Gerlitz + Mellanox for supporting the iser-target
  development effort!"

* 'for-next-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (25 commits)
  iser-target: Add iSCSI Extensions for RDMA (iSER) target driver
  tcm_vhost: Enable VIRTIO_SCSI_F_HOTPLUG
  tcm_vhost: Add ioctl to get and set events missed flag
  tcm_vhost: Add hotplug/hotunplug support
  tcm_vhost: Refactor the lock nesting rule
  tcm_fc: Check for aborted sequence
  iscsi-target: Add iser network portal attribute
  iscsi-target: Refactor TX queue logic + export response PDU creation
  iscsi-target: Refactor RX PDU logic + export request PDU handling
  iscsi-target: Add per transport iscsi_cmd alloc/free
  iscsi-target: Add iser-target parameter keys + setup during login
  iscsi-target: Initial traditional TCP conversion to iscsit_transport
  iscsi-target: Add iscsit_transport API template
  target: Add export of target_get_sess_cmd symbol
  target: Change default sense key of NOT_READY
  target/file: Set is_nonrot attribute
  target: Add sbc_execute_unmap() helper
  target/iblock: Add iblock_do_unmap() helper
  target/file: Add fd_do_unmap() helper
  target/file: Add UNMAP emulation support
  ...

bridge: avoid OOPS if root port not found

The bridge can crash while trying to send a topology change packet.
This happens if the root port can't be found. This was reported by a user,
but it cannot currently be reproduced easily. The STP conditions that cause
this are not known yet, but the problem doesn't have to be fatal.

Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers: net: cpsw: fix kernel warn on cpsw irq enable

With commit a11fbba (net/cpsw: fix irq_disable() with threaded interrupts)
from Sebastian Siewior, the kernel warning below is generated. The warning
occurs because irq_enabled is not initialized for the primary interface;
in probe it is initialized for the second interface instead. This patch
moves the irq_enabled initialization from the second interface to the
primary interface.

[    3.049173] net eth0: phy found : id is : 0x4dd074
[    3.054552] net eth0: phy found : id is : 0x4dd074
[    3.070421] ------------[ cut here ]------------
[    3.075308] WARNING: at kernel/irq/manage.c:437 enable_irq+0x3c/0x74()
[    3.082173] Unbalanced enable for IRQ 56
[    3.086299] Modules linked in:
[    3.089557] [<c001abcc>] (unwind_backtrace+0x0/0xf0) from [<c004294c>] (warn_slowpath_common+0x4c/0x68)
[    3.099450] [<c004294c>] (warn_slowpath_common+0x4c/0x68) from [<c00429fc>] (warn_slowpath_fmt+0x30/0x40)
[    3.109521] [<c00429fc>] (warn_slowpath_fmt+0x30/0x40) from [<c00a29fc>] (enable_irq+0x3c/0x74)
[    3.118681] [<c00a29fc>] (enable_irq+0x3c/0x74) from [<c03a7818>] (cpsw_ndo_open+0x61c/0x684)
[    3.127669] [<c03a7818>] (cpsw_ndo_open+0x61c/0x684) from [<c0445c08>] (__dev_open+0x9c/0xf8)
[    3.136646] [<c0445c08>] (__dev_open+0x9c/0xf8) from [<c0445e34>] (__dev_change_flags+0x78/0x13c)
[    3.145988] [<c0445e34>] (__dev_change_flags+0x78/0x13c) from [<c0445f64>] (dev_change_flags+0x10/0x48)
[    3.155884] [<c0445f64>] (dev_change_flags+0x10/0x48) from [<c0736d88>] (ip_auto_config+0x198/0x111c)
[    3.165592] [<c0736d88>] (ip_auto_config+0x198/0x111c) from [<c00086a4>] (do_one_initcall+0x34/0x180)
[    3.175309] [<c00086a4>] (do_one_initcall+0x34/0x180) from [<c07078f8>] (kernel_init_freeable+0xfc/0x1c8)
[    3.185393] [<c07078f8>] (kernel_init_freeable+0xfc/0x1c8) from [<c04f36ec>] (kernel_init+0x8/0xe4)
[    3.194929] [<c04f36ec>] (kernel_init+0x8/0xe4) from [<c00133d0>] (ret_from_fork+0x14/0x24)
[    3.203712] ---[ end trace d6f979da080bc391 ]---

Cc: Sebastian Siewior <[email protected]>
Signed-off-by: Mugunthan V N <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sh_eth: use random MAC address if no valid one supplied

On Renesas R-Car based development boards, although a MAC address is printed on
all the Ethernet port labels, U-Boot doesn't write a valid MAC address to the
Ether MAHR/MALR registers (there's no storage provided for the Ether MAC address
either), so we have to resort to using a random MAC address...
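
The fallback can be sketched in userspace C as below. This is a model
under stated assumptions rather than the driver code: the model_*()
names and MODEL_ETH_ALEN are hypothetical, rand() stands in for the
kernel's random source, and the validity rule used is the usual one
(not all zeroes and not a multicast address).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_ETH_ALEN 6        /* octets in an Ethernet MAC address */

/* Valid here means: not all zeroes and multicast bit clear. */
bool model_is_valid_ether_addr(const uint8_t *addr)
{
        bool all_zero = true;

        for (int i = 0; i < MODEL_ETH_ALEN; i++)
                if (addr[i])
                        all_zero = false;
        return !all_zero && !(addr[0] & 0x01);
}

/* Fill addr with a random unicast, locally administered address. */
void model_random_ether_addr(uint8_t *addr)
{
        for (int i = 0; i < MODEL_ETH_ALEN; i++)
                addr[i] = (uint8_t)rand();
        addr[0] &= 0xfe;        /* clear the multicast bit */
        addr[0] |= 0x02;        /* set the locally administered bit */
}

/* Keep a valid address from the registers, else fall back to random. */
void model_fixup_mac(uint8_t *addr)
{
        if (!model_is_valid_ether_addr(addr))
                model_random_ether_addr(addr);
}
```

Setting the locally administered bit marks the address as not assigned
by a vendor, so it cannot collide with a burned-in address.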

Signed-off-by: Sergei Shtylyov <[email protected]>
Acked-by: Laurent Pinchart <[email protected]>
Acked-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)

The venerable 3c509 driver only sets its device parent in one case, the ISAPnP one.
It does this with the SET_NETDEV_DEV function. It should register with the device
hierarchy in two additional cases: standard (non-PnP) ISA and EISA.

- Currently they appear here:
/sys/devices/virtual/net/eth0 (standard ISA)
/sys/devices/virtual/net/eth1 (EISA)

- Rather, they should instead be here:
/sys/devices/isa/3c509.0/net/eth0 (standard ISA)
/sys/devices/pci0000:00/0000:00:07.0/00:04/net/eth1 (EISA)

Tested on ISA and EISA boards.

Signed-off-by: Matthew Whitehead <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tg3: fix to append hardware time stamping flags

The commit f233a976ad15c3b8c54c0157f3c41d23f7514279 (tg3: shows
HW time stamping support only if ptp_capable is present) didn't
append hardware flags correctly. This patch fixes it.

Signed-off-by: Flavio Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'dlm-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm update from David Teigland:
"This includes a single patch to avoid fully processing a posix unlock
from close when no posix locks exist on the file"

* tag 'dlm-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: avoid unnecessary posix unlock

Merge tag 'nfs-for-3.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes and cleanups from Trond Myklebust:

- NLM: stable fix for NFSv2/v3 blocking locks

- NFSv4.x: stable fixes for the delegation recall error handling code

- NFSv4.x: Security flavour negotiation fixes and cleanups by Chuck
   Lever

- SUNRPC: A number of RPCSEC_GSS fixes and cleanups also from Chuck

- NFSv4.x assorted state management and reboot recovery bugfixes

- NFSv4.1: In cases where we have already looked up a file, and hold a
   valid filehandle, use the new open-by-filehandle operation instead of
   opening by name.

- Allow the NFSv4.1 callback thread to freeze

- NFSv4.x: ensure that file unlock waits for readahead to complete

- NFSv4.1: ensure that the RPC layer doesn't override the NFS session
   table size negotiation by limiting the number of slots.

- NFSv4.x: Fix SETATTR spec compatibility issues

* tag 'nfs-for-3.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
  NFSv4: Warn once about servers that incorrectly apply open mode to setattr
  NFSv4: Servers should only check SETATTR stateid open mode on size change
  NFSv4: Don't recheck permissions on open in case of recovery cached open
  NFSv4.1: Don't do a delegated open for NFS4_OPEN_CLAIM_DELEG_CUR_FH modes
  NFSv4.1: Use the more efficient open_noattr call for open-by-filehandle
  NFS: Retry SETCLIENTID with AUTH_SYS instead of AUTH_NONE
  NFSv4: Ensure that we clear the NFS_OPEN_STATE flag when appropriate
  LOCKD: Ensure that nlmclnt_block resets block->b_status after a server reboot
  NFSv4: Ensure the LOCK call cannot use the delegation stateid
  NFSv4: Use the open stateid if the delegation has the wrong mode
  nfs: Send atime and mtime as a 64bit value
  NFSv4: Record the OPEN create mode used in the nfs4_opendata structure
  NFSv4.1: Set the RPC_CLNT_CREATE_INFINITE_SLOTS flag for NFSv4.1 transports
  SUNRPC: Allow rpc_create() to request that TCP slots be unlimited
  SUNRPC: Fix a livelock problem in the xprt->backlog queue
  NFSv4: Fix handling of revoked delegations by setattr
  NFSv4 release the sequence id in the return on close case
  nfs: remove unnecessary check for NULL inode->i_flock from nfs_delegation_claim_locks
  NFS: Ensure that NFS file unlock waits for readahead to complete
  NFS: Add functionality to allow waiting on all outstanding reads to complete
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull GFS2 updates from Steven Whitehouse:
"There is not a whole lot of change this time - there are some further
  changes which are in the works, but those will be held over until next
  time.

  Here there are some clean ups to inode creation, the addition of an
  origin (local or remote) indicator to glock demote requests, removal
  of one of the remaining GFP_NOFAIL allocations during log flushes, one
  minor clean up, and a one liner bug fix."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Flush work queue before clearing glock hash tables
  GFS2: Add origin indicator to glock demote tracing
  GFS2: Add origin indicator to glock callbacks
  GFS2: replace gfs2_ail structure with gfs2_trans
  GFS2: Remove vestigial parameter ip from function rs_deltree
  GFS2: Use gfs2_dinode_out() in the inode create path
  GFS2: Remove gfs2_refresh_inode from inode creation path
  GFS2: Clean up inode creation path

Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64

Pull arm64 update from Catalin Marinas:
"Main features:

   - Versatile Express SoC (model) support - DT files and Kconfig
     entries (there are no arch/arm64/mach-* directories).  The bulk of
     the code has already been moved to drivers/ as part of the ARM SoC
     clean-up.

   - Basic multi-cluster support (CPU logical map initialised from the
     DT)

   - Simple earlyprintk support for UART 8250/16550 and FastModel
     console output

   - Optimised kernel library bitops and string functions.

   - Automatic initialisation of the irqchip and clocks via DT"

* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (26 commits)
  arm64: Use acquire/release semantics instead of explicit DMB
  arm64: klib: bitops: fix unpredictable stxr usage
  arm64: vexpress: Enable ARMv8 RTSM model (SoC) support
  arm64: vexpress: Add dts files for the ARMv8 RTSM models
  arm64: Survive invalid cpu enable-methods
  arm64: mm: Correct show_pte behaviour
  arm64: Fix compat types affecting struct compat_stat
  arm64: Execute DSB during thread switching for TLB/cache maintenance
  arm64: compiling issue, need add include/asm/vga.h file
  arm64: smp: honour #address-size when parsing CPU reg property
  arm64: Define cmpxchg64 and cmpxchg64_local for outside use
  arm64: Define readq and writeq for driver module using
  arm64: Fix task tracing
  arm64: add explicit symbols to ESR_EL1 decoding
  arm64: Use irqchip_init() for interrupt controller initialisation
  arm64: psci: Use the MPIDR values from cpu_logical_map for cpu ids.
  arm64: klib: Optimised atomic bitops
  arm64: klib: Optimised string functions
  arm64: klib: Optimised memory functions
  arm64: head: match all affinity levels in the pen of the secondaries
  ...

Merge tag 'metag-for-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag

Pull arch/metag update from James Hogan:

- Various fixes for the interrupting perf counter handling in metag's
   perf backend.

- Add OProfile support based on perf.

- Sets up cache partitions for SMP so bootloader doesn't have to.

- Patch from Paul Bolle to remove ARCH_POPULATES_NODE_MAP again
   (touches microblaze too).

- Add TLS pointer regset to metag ptrace api.

- Add exported metag DSP extended context handling header <asm/ech.h>.

- Increase defconfig log buffer size to 128KiB.

- Various fixes, typos, missing exports.

* tag 'metag-for-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
  metag: defconfigs: increase log buffer 8KiB => 128KiB
  metag: avoid unnecessary builtin dtb rebuilds
  metag: add exported <asm/ech.h> for extended context handling
  metag: export _metag_da_present and cpu_2_hwthread_id
  metag: ptrace: Implement NT_METAG_TLS
  memblock: Kill ARCH_POPULATES_NODE_MAP once more
  metag: cachepart: fix get_global_dcache_size() typo
  metag: cachepart: take into account small cache bits
  metag: smp: copy cache partition and enable GCOn
  metag: OProfile support
  metag: perf: prepare for use by oprofile
  metag: perf: don't reset TXTACTCYC
  metag: perf: use hard_processor_id() to get thread
  metag: perf: fix frequency sampling (dynamic period)
  metag: perf: add missing prev_count updates
  metag: perf: fixes for interrupting perf counters
  metag: perf: fix wrap handling in delta calculation
  metag: perf: fix core internal / perf channel mux

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k

Pull m68k update from Geert Uytterhoeven.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: Remove inline strlen() implementation
  m68k/atari: USB - add platform devices for EtherNAT/NetUSBee ISP1160 HCD
  m68k: Implement ndelay() based on the existing udelay() logic
  m68k/atari: EtherNAT - add interrupt chip definition for CPLD interrupts
  m68k/atari: EtherNEC - add platform device support
  m68k/atari: EtherNAT - platform device and IRQ support code
  m68k/atari: use dedicated irq_chip for timer D interrupts
  m68k/atari: ROM port ISA adapter support
  m68k: Add missing cmpxchg64() if CONFIG_RMW_INSNS=y

Merge branch 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac

Pull edac fixes from Mauro Carvalho Chehab:
"Two edac fixes:

   - i7300_edac currently reports a wrong number of DIMMs when the
     memory controller is in single channel mode

   - on some Sandy Bridge machines, the EDAC driver bails out because one
     of the PCI IDs used by the driver is hidden by the BIOS.  As the
     driver uses it only to detect the type of memory, make it optional
     in the driver"

* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
  edac: sb_edac.c should not require prescence of IMC_DDRIO device
  i7300_edac: Fix memory detection in single mode

Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media update from Mauro Carvalho Chehab:

- OF documentation and patches at core and drivers, to be used by
   embedded media systems

- some I2C drivers used on go7007 were rewritten/promoted from staging:
   sony-btf-mpx, tw2804, tw9903, tw9906, wis-ov7640, wis-uda1342

- add fimc-is driver (Exynos)

- add a new radio driver: radio-si476x

- add two new tuners: r820t and tuner_it913x

- split camera code on em28xx driver and add more models

- the cypress firmware load is used outside dvb usb drivers.  So, move
   it to a common directory to make it easier to re-use

- siano media driver updated to work with sms2270 devices

- a lot of work done in order to promote go7007 and solo6x1x out of
   staging (still, there are some pending issues)

- several API compliance fixes at v4l2 drivers that don't behave as
   expected

- as usual, lots of driver fixes, improvements, cleanups and new device
   additions in the existing drivers.

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (831 commits)
  [media] cx88: make core less verbose
  [media] em28xx: fix oops at em28xx_dvb_bus_ctrl()
  [media] s5c73m3: fix indentation of the help section in Kconfig
  [media] cx25821-alsa: get rid of a __must_check warning
  [media] cx25821-video: declare cx25821_vidioc_s_std as static
  [media] cx25821-video: remove maxw from cx25821_vidioc_try_fmt_vid_cap
  [media] r820t: Remove a warning for an unused value
  [media] dib0090: Fix a warning at dib0090_set_EFUSE
  [media] dib8000: fix a warning
  [media] dib8000: Fix sub-channel range
  [media] dib8000: store dtv_property_cache in a temp var
  [media] dib8000: warning fix: declare internal functions as static
  [media] r820t: quiet gcc warning on n_ring
  [media] r820t: memory leak in release()
  [media] r820t: precendence bug in r820t_xtal_check()
  [media] videodev2.h: Remove the unused old V4L1 buffer types
  [media] anysee: Grammar s/report the/report to/
  [media] anysee: Initialize ret = 0 in anysee_frontend_attach()
  [media] media: videobuf2: fix the length check for mmap
  [media] em28xx: save isoc endpoint number for DVB only if endpoint has alt settings with xMaxPacketSize != 0
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid

Pull HID updates from Jiri Kosina:

- hid driver transport cleanup, finalizing the long-desired decoupling
   of core from transport layers, by Benjamin Tissoires and Henrik
   Rydberg

- support for hybrid finger/pen multitouch HID devices, by Benjamin
   Tissoires

- fix for long-standing issue in Logitech unifying driver sometimes not
   initializing properly due to device specifics, by Andrew de los Reyes

- Wii remote driver updates to support 2nd generation of devices, by
   David Herrmann

- support for Apple IR remote

- roccat driver now supports new devices (Roccat Kone Pure, IskuFX), by
   Stefan Achatz

- debugfs locking fixes in hid debug interface, by Jiri Kosina

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (43 commits)
  HID: protect hid_debug_list
  HID: debug: break out hid_dump_report() into hid-debug
  HID: Add PID for Japanese version of NE4K keyboard
  HID: hid-lg4ff add support for new version of DFGT wheel
  HID: icade: u16 which never < 0
  HID: clarify Magic Mouse Kconfig description
  HID: appleir: add support for Apple ir devices
  HID: roccat: added media key support for Kone
  HID: hid-lenovo-tpkbd: remove doubled hid_get_drvdata
  HID: i2c-hid: fix length for set/get report in i2c hid
  HID: wiimote: parse reduced status reports
  HID: wiimote: add 2nd generation Wii Remote IDs
  HID: wiimote: use unique battery names
  HID: hidraw: warn if userspace headers are outdated
  HID: multitouch: force BTN_STYLUS for pen devices
  HID: multitouch: append " Pen" to the name of the stylus input
  HID: multitouch: add handling for pen in dual-sensors device
  HID: multitouch: change touch sensor detection in mt_input_configured()
  HID: multitouch: do not map usage from non used reports
  HID: multitouch: breaks out touch handling in specific functions
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

Pull trivial tree updates from Jiri Kosina:
"Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
  code cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
  mm: Convert print_symbol to %pSR
  gfs2: Convert print_symbol to %pSR
  m32r: Convert print_symbol to %pSR
  iostats.txt: add easy-to-find description for field 6
  x86 cmpxchg.h: fix wrong comment
  treewide: Fix typo in printk and comments
  doc: devicetree: Fix various typos
  docbook: fix 8250 naming in device-drivers
  pata_pdc2027x: Fix compiler warning
  treewide: Fix typo in printks
  mei: Fix comments in drivers/misc/mei
  treewide: Fix typos in kernel messages
  pm44xx: Fix comment for "CONFIG_CPU_IDLE"
  doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
  mmzone: correct "pags" to "pages" in comment.
  kernel-parameters: remove outdated 'noresidual' parameter
  Remove spurious _H suffixes from ifdef comments
  sound: Remove stray pluses from Kconfig file
  radio-shark: Fix printk "CONFIG_LED_CLASS"
  doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
  ...

Merge branch 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RAS changes from Ingo Molnar:

- Add an Intel CMCI hotplug fix

- Add AMD family 16h EDAC support

- Make the AMD MCE banks code more flexible for virtual environments

* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  amd64_edac: Add Family 16h support
  x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  x86, MCE, AMD: Use MCG_CAP MSR to find out number of banks on AMD
  x86, MCE, AMD: Replace shared_bank array with is_shared_bank() helper