Git Repo - linux.git/log

userns: rename is_owner_or_cap to inode_owner_or_capable

And give it a kernel-doc comment.

[[email protected]: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: userns: check user namespace for task->file uid equivalence checks

Cheat for now and say all files belong to init_user_ns. Next step will be
to let superblocks belong to a user_ns, and derive inode_userns(inode)
from inode->i_sb->s_user_ns. Finally we'll introduce more flexible
arrangements.

Changelog:
Feb 15: make is_owner_or_cap take const struct inode
Feb 23: make is_owner_or_cap bool

[[email protected]: coding-style fixes]
Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: user namespaces: convert several capable() calls

CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
because the resource comes from current's own ipc namespace.

setuid/setgid are to uids in own namespace, so again checks can be against
current_user_ns().

Changelog:
Jan 11: Use task_ns_capable() in place of sched_capable().
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Clarify (hopefully) some logic in futex and sched.c
Feb 15: use ns_capable for ipc, not nsown_capable
Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
Feb 23: pass ns down rather than taking it from current

[[email protected]: coding-style fixes]
Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: add a user namespace owner of ipc ns

Changelog:
Feb 15: Don't set new ipc->user_ns if we didn't create a new
ipc_ns.
Feb 23: Move extern declaration to ipc_namespace.h, and group
fwd declarations at top.

Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: user namespaces: convert all capable checks in kernel/sys.c

This allows setuid/setgid in containers. It also fixes some corner cases
where kernel logic foregoes capability checks when uids are equivalent.
The latter will need to be done throughout the whole kernel.

Changelog:
Jan 11: Use nsown_capable() as suggested by Bastian Blank.
Jan 11: Fix logic errors in uid checks pointed out by Bastian.
Feb 15: allow prlimit to current (was regression in previous version)
Feb 23: remove debugging printks, uninline set_one_prio_perm and
make it bool, and document its return value.

Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: make has_capability* into real functions

So we can let type safety keep things sane, and as a bonus we can remove
the declaration of init_user_ns in capability.h.

Signed-off-by: Serge E. Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to the
usual rules (i.e.  the same rules as for two tasks in the init user
namespace).  ptrace is also allowed to a user namespace to which the
current task the has CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.
Jan  1: Use const cred struct
Jan 11: use task_ns_capable() in place of ptrace_capable().
Feb 23: same_or_ancestore_user_ns() was not an appropriate
check to constrain cap_issubset.  Rather, cap_issubset()
only is meaningful when both capsets are in the same
user_ns.

Signed-off-by: Serge E. Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: allow killing tasks in your own or child userns

Changelog:
Dec  8: Fixed bug in my check_kill_permission pointed out by
        Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
        for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.
Jan  1: use const cred struct.
Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred().
Feb 16: kill_ok_by_cred: fix bad parentheses
Feb 23: per akpm, let compiler inline kill_ok_by_cred

Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: allow sethostname in a container

Changelog:
Feb 23: let clone_uts_ns() handle setting uts->user_ns
To do so we need to pass in the task_struct who'll
get the utsname, so we can get its user_ns.
Feb 23: As per Oleg's coment, just pass in tsk, instead of two
of its members.

Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: security: make capabilities relative to the user namespace

- Introduce ns_capable to test for a capability in a non-default
  user namespace.
- Teach cap_capable to handle capabilities in a non-default
  user namespace.

The motivation is to get to the unprivileged creation of new
namespaces.  It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created.  THis is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator.  Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case
01/11/2011: [serge] add task_ns_capable helper
01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
02/16/2011: [serge] fix a logic bug: the root user is always creator of
    init_user_ns, but should not always have capabilities to
    it!  Fix the check in cap_capable().
02/21/2011: Add the required user_ns parameter to security_capable,
    fixing a compile failure.
02/23/2011: Convert some macros to functions as per akpm comments.  Some
    couldn't be converted because we can't easily forward-declare
    them (they are inline if !SECURITY, extern if SECURITY).  Add
    a current_user_ns function so we can use it in capability.h
    without #including cred.h.  Move all forward declarations
    together to the top of the #ifdef __KERNEL__ section, and use
    kernel-doc format.
02/23/2011: Per dhowells, clean up comment in cap_capable().
02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable.

(Original written and signed off by Eric;  latest, modified version
acked by him)

[[email protected]: fix build]
[[email protected]: export current_user_ns() for ecryptfs]
[[email protected]: remove unneeded extra argument in selinux's task_has_capability]
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

userns: add a user_namespace as creator/owner of uts_namespace

The expected course of development for user namespaces targeted
capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

Goals:

- Make it safe for an unprivileged user to unshare namespaces.  They
  will be privileged with respect to the new namespace, but this should
  only include resources which the unprivileged user already owns.

- Provide separate limits and accounting for userids in different
  namespaces.

Status:

  Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
  get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
  CAP_SETGID capabilities.  What this gets you is a whole new set of
  userids, meaning that user 500 will have a different 'struct user' in
  your namespace than in other namespaces.  So any accounting information
  stored in struct user will be unique to your namespace.

  However, throughout the kernel there are checks which

  - simply check for a capability.  Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

  - simply compare uid1 == uid2.  Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

  As a result, the lxc implementation at lxc.sf.net does not use user
  namespaces.  This is actually helpful because it leaves us free to
  develop user namespaces in such a way that, for some time, user
  namespaces may be unuseful.

Bugs aside, this patchset is supposed to not at all affect systems which
are not actively using user namespaces, and only restrict what tasks in
child user namespace can do.  They begin to limit privilege to a user
namespace, so that root in a container cannot kill or ptrace tasks in the
parent user namespace, and can only get world access rights to files.
Since all files currently belong to the initila user namespace, that means
that child user namespaces can only get world access rights to *all*
files.  While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.

I've run the 'runltplite.sh' with and without this patchset and found no
difference.

This patch:

copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
will have the new user namespace as its owner.  That is what we want,
since we want root in that new userns to be able to have privilege over
it.

Changelog:
Feb 15: don't set uts_ns->user_ns if we didn't create
a new uts_ns.
Feb 23: Move extern init_user_ns declaration from
init/version.c to utsname.h.

Signed-off-by: Serge E. Hallyn <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

procfs: kill the global proc_mnt variable

After the previous cleanup in proc_get_sb() the global proc_mnt has no
reasons to exists, kill it.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pidns: call pid_ns_prepare_proc() from create_pid_namespace()

Reorganize proc_get_sb() so it can be called before the struct pid of the
first process is allocated.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

pid: remove the child_reaper special case in init/main.c

This patchset is a cleanup and a preparation to unshare the pid namespace.
These prerequisites prepare for Eric's patchset to give a file descriptor
to a namespace and join an existing namespace.

This patch:

It turns out that the existing assignment in copy_process of the
child_reaper can handle the initial assignment of child_reaper we just
need to generalize the test in kernel/fork.c

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl: restrict write access to dmesg_restrict

When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel
ring buffer.  But a root user without CAP_SYS_ADMIN is able to reset
dmesg_restrict to 0.

This is an issue when e.g.  LXC (Linux Containers) are used and complete
user space is running without CAP_SYS_ADMIN.  A unprivileged and jailed
root user can bypass the dmesg_restrict protection.

With this patch writing to dmesg_restrict is only allowed when root has
CAP_SYS_ADMIN.

Signed-off-by: Richard Weinberger <[email protected]>
Acked-by: Dan Rosenberg <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: James Morris <[email protected]>
Cc: Eugene Teo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl: add some missing input constraint checks

Add boundaries of allowed input ranges for: dirty_expire_centisecs,
drop_caches, overcommit_memory, page-cluster and panic_on_oom.

Signed-off-by: Petr Holasek <[email protected]>
Acked-by: Dave Young <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl_check: drop dead code

Drop dead code.

Signed-off-by: Denis Kirjanov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sysctl_check: drop table->procname checks

Since the for loop checks for the table->procname drop useless
table->procname checks inside the loop body

Signed-off-by: Denis Kirjanov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: fix potential null deref on failure path

If rio is not a switch then "rswitch" is null.

Signed-off-by: Dan Carpenter <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Kumar Gala <[email protected]>
Signed-off-by: Alexandre Bounine <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: remove mport resource reservation from common RIO code

Removes resource reservation from the common sybsystem initialization code
and make it part of mport driver initialization. This resolves conflict
with resource reservation by device specific mport drivers.

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: modify mport ID assignment

Changes mport ID and host destination ID assignment to implement unified
method common to all mport drivers. Makes "riohdid=" kernel command line
parameter common for all architectures with support for more that one host
destination ID assignment.

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: modify subsystem and driver initialization sequence

Subsystem initialization sequence modified to support presence of multiple
RapidIO controllers in the system. The new sequence is compatible with
initialization of PCI devices.

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: modify configuration to support PCI-SRIO controller

1. Add an option to include RapidIO support if the PCI is available.
2. Add FSL_RIO configuration option to enable controller selection.
3. Add RapidIO support option into x86 and MIPS architectures.

Signed-off-by: Alexandre Bounine <[email protected]>
Acked-by: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: add architecture specific callbacks

This set of patches eliminates RapidIO dependency on PowerPC architecture
and makes it available to other architectures (x86 and MIPS). It also
enables support of new platform independent RapidIO controllers such as
PCI-to-SRIO and PCI Express-to-SRIO.

This patch:

Extend number of mport callback functions to eliminate direct linking of
architecture specific mport operations.

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: add RapidIO documentation

Add RapidIO documentation files as it was discussed earlier (see thread
http://marc.info/?l=linux-kernel&m=129202338918062&w=2)

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: add new sysfs attributes

Add new sysfs attributes.

1. Routing information required to to reach the RIO device:
destid - device destination ID (real for for endpoint, route for switch)
hopcount - hopcount for maintenance requests (switches only)

2. device linking information:
lprev - name of device that precedes the given device in the enumeration
        or discovery order (displayed along with of the port to which it
        is attached).
lnext - names of devices (with corresponding port numbers) that are
        attached to the given device as next in the enumeration or
        discovery order (switches only)

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Thomas Moll <[email protected]>
Cc: Micha Nelissen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/char/mem.c: clean up the code

Reduce the lines of code and simplify the logic.

Signed-off-by: Changli Gao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/staging/tty/specialix.c: convert func_enter to func_exit

Convert calls to func_enter on leaving a function to func_exit.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
@@

- func_enter();
+ func_exit();
return...;
// </smpl>

Signed-off-by: Julia Lawall <[email protected]>
Cc: Roger Wolff <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/tty/bfin_jtag_comm.c: avoid calling put_tty_driver on NULL

put_tty_driver calls tty_driver_kref_put on its argument, and then
tty_driver_kref_put calls kref_put on the address of a field of this
argument.  kref_put checks for NULL, but in this case the field is likely
to have some offset and so the result of taking its address will not be
NULL.  Labels are added to be able to skip over the call to put_tty_driver
when the argument will be NULL.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression *x;
@@

*if (x == NULL)
{ ...
* put_tty_driver(x);
  ...
  return ...;
}
// </smpl>

Signed-off-by: Julia Lawall <[email protected]>
Cc: Torben Hohn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/char: add MSM smd_pkt driver

Add smd_pkt driver which provides device interface to smd packet ports.

Signed-off-by: Niranjana Vishwanathapura <[email protected]>
Cc: Brian Swetland <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: David Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/char/ipmi/ipmi_si_intf.c: fix cleanup_one_si section mismatch

commit d2478521afc2022 ("char/ipmi: fix OOPS caused by
pnp_unregister_driver on unregistered driver") introduced a section
mismatch by calling __exit cleanup_ipmi_si from __devinit init_ipmi_si.

Remove __exit annotation from cleanup_ipmi_si.

Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Corey Minyard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: protect mm start_code/end_code in /proc/pid/stat

While mm->start_stack was protected from cross-uid viewing (commit
f83ce3e6b02d5 ("proc: avoid information leaks to non-privileged
processes")), the start_code and end_code values were not. This would
allow the text location of a PIE binary to leak, defeating ASLR.

Note that the value "1" is used instead of "0" for a protected value since
"ps", "killall", and likely other readers of /proc/pid/stat, take
start_code of "0" to mean a kernel thread and will misbehave. Thanks to
Brad Spengler for pointing this out.

Addresses CVE-2011-0726

Signed-off-by: Kees Cook <[email protected]>
Cc: <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: David Howells <[email protected]>
Cc: Eugene Teo <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Brad Spengler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: make struct proc_dir_entry::namelen unsigned int

1. namelen is declared "unsigned short" which hints for "maybe space savings".
   Indeed in 2.4 struct proc_dir_entry looked like:

        struct proc_dir_entry {
                unsigned short low_ino;
                unsigned short namelen;

   Now, low_ino is "unsigned int", all savings were gone for a long time.
   "struct proc_dir_entry" is not that countless to worry about it's size,
   anyway.

2. converting from unsigned short to int/unsigned int can only create
   problems, we better play it safe.

Space is not really conserved, because of natural alignment for the next
field.  sizeof(struct proc_dir_entry) remains the same.

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

procfs: fix some wrong error code usage

[root@wei 1]# cat /proc/1/mem
cat: /proc/1/mem: No such process

error code -ESRCH is wrong in this situation. Return -EPERM instead.

Signed-off-by: Jovi Zhang <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

procfs: fix /proc/<pid>/maps heap check

The current code fails to print the "[heap]" marking if the heap is split
into multiple mappings.

Fix the check so that the marking is displayed in all possible cases:
1. vma matches exactly the heap
2. the heap vma is merged e.g. with bss
3. the heap vma is splitted e.g. due to locked pages

Test cases. In all cases, the process should have mapping(s) with
[heap] marking:

(1) vma matches exactly the heap

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main (void)
{
if (sbrk(4096) != (void *)-1) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}

# ./test1
check /proc/553/maps
[1] + Stopped                    ./test1
# cat /proc/553/maps | head -4
00008000-00009000 r-xp 00000000 01:00 3113640    /test1
00010000-00011000 rw-p 00000000 01:00 3113640    /test1
00011000-00012000 rw-p 00000000 00:00 0          [heap]
4006f000-40070000 rw-p 00000000 00:00 0

(2) the heap vma is merged

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

char foo[4096] = "foo";
char bar[4096];

int main (void)
{
if (sbrk(4096) != (void *)-1) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}

# ./test2
check /proc/556/maps
[2] + Stopped                    ./test2
# cat /proc/556/maps | head -4
00008000-00009000 r-xp 00000000 01:00 3116312    /test2
00010000-00012000 rw-p 00000000 01:00 3116312    /test2
00012000-00014000 rw-p 00000000 00:00 0          [heap]
4004a000-4004b000 rw-p 00000000 00:00 0

(3) the heap vma is splitted (this fails without the patch)

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int main (void)
{
if ((sbrk(4096) != (void *)-1) && !mlockall(MCL_FUTURE) &&
    (sbrk(4096) != (void *)-1)) {
printf("check /proc/%d/maps\n", (int)getpid());
while (1)
sleep(1);
}
return 0;
}

# ./test3
check /proc/559/maps
[1] + Stopped                    ./test3
# cat /proc/559/maps|head -4
00008000-00009000 r-xp 00000000 01:00 3119108    /test3
00010000-00011000 rw-p 00000000 01:00 3119108    /test3
00011000-00012000 rw-p 00000000 00:00 0          [heap]
00012000-00013000 rw-p 00000000 00:00 0          [heap]

It looks like the bug has been there forever, and since it only results in
some information missing from a procfile, it does not fulfil the -stable
"critical issue" criteria.

Signed-off-by: Aaro Koskinen <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: hide kernel addresses via %pK in /proc/<pid>/stack

This file is readable for the task owner. Hide kernel addresses from
unprivileged users, leave them function names and offsets.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpuset: hold callback_mutex in cpuset_post_clone()

Chaning cpuset->mems/cpuset->cpus should be protected under
callback_mutex.

cpuset_clone() doesn't follow this rule. It's ok because it's
called when creating and initializing a cgroup, but we'd better
hold the lock to avoid subtil break in the future.

Signed-off-by: Li Zefan <[email protected]>
Acked-by: Paul Menage <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Miao Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpuset: fix unchecked calls to NODEMASK_ALLOC()

Those functions that use NODEMASK_ALLOC() can't propagate errno
to users, but will fail silently.

Fix it by using a static nodemask_t variable for each function, and
those variables are protected by cgroup_mutex;

[[email protected]: fix comment spelling, strengthen cgroup_lock comment]
Signed-off-by: Li Zefan <[email protected]>
Cc: Paul Menage <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Miao Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach()

oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
to cpuset_migrate_mm().

Signed-off-by: Li Zefan <[email protected]>
Cc: Paul Menage <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Miao Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist()

It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

As spotted by Paul, a side effect is we fix a bug that the function can
return -ENOMEM but the caller doesn't expect negative return value.
Therefore change the return value of cpuset_sprintf_cpulist() and
cpuset_sprintf_memlist() from int to size_t.

Signed-off-by: Li Zefan <[email protected]>
Acked-by: Paul Menage <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Miao Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: give current access to memory reserves if it's trying to die

When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.

This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.

Signed-off-by: David Rientjes <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: fix leak on wrong LRU with FUSE

fs/fuse/dev.c::fuse_try_move_page() does

   (1) remove a page by ->steal()
   (2) re-add the page to page cache
   (3) link the page to LRU if it was not on LRU at (1)

This implies the page is _on_ LRU when it's added to radix-tree.  So, the
page is added to memory cgroup while it's on LRU.  because LRU is lazy and
no one flushs it.

This is the same behavior as SwapCache and needs special care as
- remove page from LRU before overwrite pc->mem_cgroup.
- add page to LRU after overwrite pc->mem_cgroup.

And we need to taking care of pagevec.

If PageLRU(page) is set before we add PCG_USED bit, the page will not be
added to memcg's LRU (in short period).  So, regardlress of PageLRU(page)
value before commit_charge(), we need to check PageLRU(page) after
commit_charge().

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Balbir Singh <[email protected]>
Reported-by: Daniel Poelzleithner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: page_cgroup array is never stored on reserved pages

KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
PageReserved because we never store the array on reserved pages (neither
alloc_pages_exact nor vmalloc use those pages).

So we can replace the check by a BUG_ON.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

page_cgroup: reduce allocation overhead for page_cgroup array for CONFIG_SPARSEMEM

Currently we are allocating a single page_cgroup array per memory section
(stored in mem_section->base) when CONFIG_SPARSEMEM is selected.  This is
correct but memory inefficient solution because the allocated memory
(unless we fall back to vmalloc) is not kmalloc friendly:

        - 32b - 16384 entries (20B per entry) fit into 327680B so the
          524288B slab cache is used
        - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
        - 64b - 32768 entries (40B per entry) fit into 2097152 cache

This is ~37% wasted space per memory section and it sumps up for the whole
memory.  On a x86_64 machine it is something like 6MB per 1GB of RAM.

We can reduce the internal fragmentation by using alloc_pages_exact which
allocates PAGE_SIZE aligned blocks so we will get down to <4kB wasted
memory per section which is much better.

We still need a fallback to vmalloc because we have no guarantees that we
will have a continuous memory of that size (order-10) later on during the
hotplug events.

[[email protected]: do not define unused free_page_cgroup() without memory hotplug]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol.c: suppress uninitialized-var warning with older gcc's

mm/memcontrol.c: In function 'mem_cgroup_force_empty':
mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

It's a false positive.

Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: use native word page statistics counters

The statistic counters are in units of pages, there is no reason to make
them 64-bit wide on 32-bit machines.

Make them native words. Since they are signed, this leaves 31 bit on
32-bit machines, which can represent roughly 8TB assuming a page size of
4k.

[[email protected]: coding-style fixes]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: break out event counters from other stats

For increasing and decreasing per-cpu cgroup usage counters it makes sense
to use signed types, as single per-cpu values might go negative during
updates.  But this is not the case for only-ever-increasing event
counters.

All the counters have been signed 64-bit so far, which was enough to count
events even with the sign bit wasted.

This patch:
- divides s64 counters into signed usage counters and unsigned
  monotonically increasing event counters.
- converts unsigned event counters into 'unsigned long' rather than
  'u64'.  This matches the type used by the /proc/vmstat event counters.

The next patch narrows the signed usage counters type (on 32-bit CPUs,
that is).

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: unify charge/uncharge quantities to units of pages

There is no clear pattern when we pass a page count and when we pass a
byte count that is a multiple of PAGE_SIZE.

We never charge or uncharge subpage quantities, so convert it all to page
counts.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: convert uncharge batching from bytes to page granularity

We never uncharge subpage quantities.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: convert per-cpu stock from bytes to page granularity

We never keep subpage quantities in the per-cpu stock.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: keep only one charge cancelling function

We have two charge cancelling functions: one takes a page count, the other
a page size. The second one just divides the parameter by PAGE_SIZE and
then calls the first one. This is trivial, no need for an extra function.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove memcg->reclaim_param_lock

The reclaim_param_lock is only taken around single reads and writes to
integer variables and is thus superfluous. Drop it.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: charged pages always have valid per-memcg zone info

page_cgroup_zoneinfo() will never return NULL for a charged page, remove
the check for it in mem_cgroup_get_reclaim_stat_from_page().

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove direct page_cgroup-to-page pointer

In struct page_cgroup, we have a full word for flags but only a few are
reserved. Use the remaining upper bits to encode, depending on
configuration, the node or the section, to enable page_cgroup-to-page
lookups without a direct pointer.

This saves a full word for every page in a system with memory cgroups
enabled.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: condense page_cgroup-to-page lookup points

The per-cgroup LRU lists string up 'struct page_cgroup's. To get from
those structures to the page they represent, a lookup is required.
Currently, the lookup is done through a direct pointer in struct
page_cgroup, so a lot of functions down the callchain do this lookup by
themselves instead of receiving the page pointer from their callers.

The next patch removes this pointer, however, and the lookup is no longer
that straight-forward. In preparation for that, this patch only leaves
the non-optional lookups when coming directly from the LRU list and passes
the page down the stack.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: fold __mem_cgroup_move_account into caller

It is one logical function, no need to have it split up.

Also, get rid of some checks from the inner function that ensured the
sanity of the outer function.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: change page_cgroup_zoneinfo signature

Instead of passing a whole struct page_cgroup to this function, let it
take only what it really needs from it: the struct mem_cgroup and the
page.

This has the advantage that reading pc->mem_cgroup is now done at the same
place where the ordering rules for this pointer are enforced and
explained.

It is also in preparation for removing the pc->page backpointer.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: no uncharged pages reach page_cgroup_zoneinfo

This patch series removes the direct page pointer from struct page_cgroup,
which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu
enable memcg per default, openSUSE apparently too).

The node id or section number is encoded in the remaining free bits of
pc->flags which allows calculating the corresponding page without the
extra pointer.

I ran, what I think is, a worst-case microbenchmark that just cats a large
sparse file to /dev/null, because it means that walking the LRU list on
behalf of per-cgroup reclaim and looking up pages from page_cgroups is
happening constantly and at a high rate. But it made no measurable
difference. A profile reported a 0.11% share of the new
lookup_cgroup_page() function in this benchmark.

This patch:

All callsites check PCG_USED before passing pc->mem_cgroup, so the latter
is never NULL.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: add memcg sanity checks at allocating and freeing pages

Add checks at allocating or freeing a page whether the page is used (iow,
charged) from the view point of memcg.

This check may be useful in debugging a problem and we did similar checks
before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot).

This patch adds some overheads at allocating or freeing memory, so it's
enabled only when CONFIG_DEBUG_VM is enabled.

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove NULL check from lookup_page_cgroup() result

The page_cgroup array is set up before even fork is initialized. I
seriously doubt that this code executes before the array is alloc'd.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove impossible conditional when committing

No callsite ever passes a NULL pointer for a struct mem_cgroup * to the
committing function. There is no need to check for it.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove unused page flag bitfield defines

These definitions have been unused since '4b3bde4 memcg: remove the
overhead associated with the root cgroup'.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: simplify the way memory limits are checked

Since transparent huge pages, checking whether memory cgroups are below
their limits is no longer enough, but the actual amount of chargeable
space is important.

To not have more than one limit-checking interface, replace
memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
single memory_cgroup_margin() that returns the chargeable space and leaves
the comparison to the callsite.

Soft limits are now checked the other way round, by using the already
existing function that returns the amount by which soft limits are
exceeded: res_counter_soft_limit_excess().

Also remove all the corresponding functions on the res_counter side that
are now no longer used.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: soft limit reclaim should end at limit not below

Soft limit reclaim continues until the usage is below the current soft
limit, but the documented semantics are actually that soft limit reclaim
will push usage back until the soft limits are met again.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: fix ugly initialization of return value is in caller

Remove initialization of vaiable in caller of memory cgroup function.
Actually, it's return value of memcg function but it's initialized in
caller.

Some memory cgroup uses following style to bring the result of start
function to the end function for avoiding races.

   mem_cgroup_start_A(&(*ptr))
   /* Something very complicated can happen here. */
   mem_cgroup_end_A(*ptr)

In some calls, *ptr should be initialized to NULL be caller.  But it's
ugly.  This patch fixes that *ptr is initialized by _start function.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: res_counter_read_u64(): fix potential races on 32-bit machines

res_counter_read_u64 reads u64 value without lock. It's dangerous in a
32bit environment. Add locking.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitops: remove minix bitops from asm/bitops.h

minix bit operations are only used by minix filesystem and useless by
other modules. Because byte order of inode and block bitmaps is different
on each architecture like below:

m68k:
big-endian 16bit indexed bitmaps

h8300, microblaze, s390, sparc, m68knommu:
big-endian 32 or 64bit indexed bitmaps

m32r, mips, sh, xtensa:
big-endian 32 or 64bit indexed bitmaps for big-endian mode
little-endian bitmaps for little-endian mode

Others:
little-endian bitmaps

In order to move minix bit operations from asm/bitops.h to architecture
independent code in minix filesystem, this provides two config options.

CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
m32r, mips, sh, xtensa). The architectures which always use little-endian
bitmaps do not select these options.

Finally, we can remove minix bit operations from asm/bitops.h for all
architectures.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Greg Ungerer <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Roman Zippel <[email protected]>
Cc: Andreas Schwab <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Hirokazu Takata <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Acked-by: Paul Mundt <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

m68k: remove inline asm from minix_find_first_zero_bit

As a preparation for moving minix bit operations from asm/bitops.h to
architecture independent code in minix filesystem, this removes inline asm
from minix_find_first_zero_bit() for m68k.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Roman Zippel <[email protected]>
Cc: Andreas Schwab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitops: remove ext2 non-atomic bitops from asm/bitops.h

As the result of conversions, there are no users of ext2 non-atomic bit
operations except for ext2 filesystem itself. Now we can put them into
architecture independent code in ext2 filesystem, and remove from
asm/bitops.h for all architectures.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dm: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

md: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: NeilBrown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ufs: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

udf: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

reiserfs: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nilfs2: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Joel Becker <[email protected]>
Cc: Mark Fasheh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ext4: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ext3: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rds: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Andy Grover <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kvm: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm-generic: use little-endian bitops

As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitops: introduce little-endian bitops for most architectures

Introduce little-endian bit operations to the big-endian architectures
which do not have native little-endian bit operations and the
little-endian architectures. (alpha, avr32, blackfin, cris, frv, h8300,
ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa)

These architectures can just include generic implementation
(asm-generic/bitops/le.h).

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: David Howells <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Kazumoto Kojima <[email protected]>
Cc: Hirokazu Takata <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Acked-by: Hans-Christian Egtvedt <[email protected]>
Acked-by: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

m68knommu: introduce little-endian bitops

Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 bit operations are kept as wrapper macros using
little-endian bit operations to maintain bisectability until the
conversions are finished.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Greg Ungerer <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Roman Zippel <[email protected]>
Cc: Andreas Schwab <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitops: introduce CONFIG_GENERIC_FIND_BIT_LE

This introduces CONFIG_GENERIC_FIND_BIT_LE to tell whether to use generic
implementation of find_*_bit_le() in lib/find_next_bit.c or not.

For now we select CONFIG_GENERIC_FIND_BIT_LE for all architectures which
enable CONFIG_GENERIC_FIND_NEXT_BIT.

But m68knommu wants to define own faster find_next_zero_bit_le() and
continues using generic find_next_{,zero_}bit().
(CONFIG_GENERIC_FIND_NEXT_BIT and !CONFIG_GENERIC_FIND_BIT_LE)

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

m68k: introduce little-endian bitops

Introduce little-endian bit operations by renaming native ext2 bit
operations and changing find_*_bit_le() to take a "void *". The ext2 bit
operations are kept as wrapper macros using little-endian bit operations
to maintain bisectability until the conversions are finished.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Roman Zippel <[email protected]>
Cc: Andreas Schwab <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm: introduce little-endian bitops

Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 and minix bit operations are kept as wrapper macros
using little-endian bit operations to maintain bisectability until the
conversions are finished.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Russell King <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390: introduce little-endian bitops

Introduce little-endian bit operations by renaming native ext2 bit
operations. The ext2 bit operations are kept as wrapper macros using
little-endian bit operations to maintain bisectability until the
conversions are finished.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: introduce little-endian bitops

Introduce little-endian bit operations by renaming existing powerpc native
little-endian bit operations and changing them to take any pointer types.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm-generic: change little-endian bitops to take any pointer types

This makes the little-endian bitops take any pointer types by changing the
prototypes and adding casts in the preprocessor macros.

That would seem to at least make all the filesystem code happier, and they
can continue to do just something like

#define ext2_set_bit __test_and_set_bit_le

(or whatever the exact sequence ends up being).

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: David Howells <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Kazumoto Kojima <[email protected]>
Cc: Hirokazu Takata <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm-generic: rename generic little-endian bitops functions

As a preparation for providing little-endian bitops for all architectures,
This renames generic implementation of little-endian bitops. (remove
"generic_" prefix and postfix "_le")

s/generic_find_next_le_bit/find_next_bit_le/
s/generic_find_next_zero_le_bit/find_next_zero_bit_le/
s/generic_find_first_zero_le_bit/find_first_zero_bit_le/
s/generic___test_and_set_le_bit/__test_and_set_bit_le/
s/generic___test_and_clear_le_bit/__test_and_clear_bit_le/
s/generic_test_le_bit/test_bit_le/
s/generic___set_le_bit/__set_bit_le/
s/generic___clear_le_bit/__clear_bit_le/
s/generic_test_and_set_le_bit/test_and_set_bit_le/
s/generic_test_and_clear_le_bit/test_and_clear_bit_le/

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Hans-Christian Egtvedt <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Roman Zippel <[email protected]>
Cc: Andreas Schwab <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitops: merge little and big endian definisions in asm-generic/bitops/le.h

This patch series introduces little-endian bit operations in asm/bitops.h
for all architectures and converts all ext2 non-atomic and minix bit
operations to use little-endian bit operations.  It enables us to remove
ext2 non-atomic and minix bit operations from asm/bitops.h.  The reason
they should be removed from asm/bitops.h is as follows:

For ext2 non-atomic bit operations, they are used for little-endian byte
order bitmap access by some filesystems and modules.  But using ext2_*()
functions on a module other than ext2 filesystem makes some feel strange.

For minix bit operations, they are only used by minix filesystem and are
useless by other modules.  Because byte order of inode and block bitmap is

This patch:

In order to make the forthcoming changes smaller, this merges macro
definisions in asm-generic/bitops/le.h for big-endian and little-endian as
much as possible.

This also removes unused BITOP_WORD macro.

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rds: stop including asm-generic/bitops/le.h directly

asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.

This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.

It seems odd to use ext2_*_bit() on rds, but it will replaced with
__{set,clear,test}_bit_le() after introducing little endian bit operations
for all architectures. This indirect step is necessary to maintain
bisectability for some architectures which have their own little-endian
bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Andy Grover <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kvm: stop including asm-generic/bitops/le.h directly

asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.

This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.

It seems odd to use ext2_set_bit() on kvm, but it will replaced with
__set_bit_le() after introducing little endian bit operations for all
architectures. This indirect step is necessary to maintain bisectability
for some architectures which have their own little-endian bit operations.

Signed-off-by: Akinobu Mita <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/adfs/adfs.h: fix unsigned comparison

fs/adfs/adfs.h: In function 'append_filetype_suffix':
fs/adfs/adfs.h:115: warning: comparison is always false due to limited range of data type

Reported-by: Geert Uytterhoeven <[email protected]>
Cc: Stuart Swales <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ia64: fix build breakage in asm/thread_info.h

In commit 504f52b5439aaf26d3e2c1d45ec10fce38c8dd27
mm: NUMA aware alloc_task_struct_node()

Eric Dumazet forgot a "\". Add it.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Revert "drm/i915: Don't save/restore hardware status page address register"

This reverts commit a7a75c8f70d6f6a2f16c9f627f938bbee2d32718.

There are two different variations on how Intel hardware addresses the
"Hardware Status Page". One as a location in physical memory and the
other as an offset into the virtual memory of the GPU, used in more
recent chipsets. (The HWS itself is a cacheable region of memory which
the GPU can write to without requiring CPU synchronisation, used for
updating various details of hardware state, such as the position of
the GPU head in the ringbuffer, the last breadcrumb seqno, etc).

These two types of addresses were updated in different locations of code
- one inline with the ringbuffer initialisation, and the other during
device initialisation. (The HWS page is logically associated with
the rings, and there is one HWS page per ring.) During resume, only the
ringbuffers were being re-initialised along with the virtual HWS page,
leaving the older physical address HWS untouched. This then caused a
hang on the older gen3/4 (915GM, 945GM, 965GM) the first time we tried
to synchronise the GPU as the breadcrumbs were never being updated.

Reported-and-tested-by: Linus Torvalds <[email protected]>
Reported-by: Jan Niehusmann <[email protected]>
Reported-by: Justin P. Mattock <[email protected]>
Reported-and-tested-by: Michael "brot" Groh <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: HDA: Realtek: Avoid unnecessary volume control index on Surround/Side
  ASoC: Support !REGULATOR build for sgtl5000
  ALSA: hda - VIA: Fix VT1708 can't build up Headphone control issue
  ALSA: hda - VIA: Correct stream names for VT1818S
  ALSA: hda - VIA: Fix codec type for VT1708BCE at the right timing
  ALSA: hda - VIA: Fix invalid A-A path volume adjust issue
  ALSA: hda - VIA: Add missing support for VT1718S in A-A path
  ALSA: hda - VIA: Fix independent headphone no sound issue
  ALSA: hda - VIA: Fix stereo mixer recording no sound issue
  ALSA: hda - Set EAPD for Realtek ALC665
  ALSA: usb - Remove trailing spaces from USB card name strings
  sound: read i_size with i_size_read()
  ASoC: Remove bogus check for register validity in debugfs write
  ASoC: mini2440: Fix uda134x codec problem.

sys_swapon: fix inode locking

A conflict between 52c50567d8ab ("mm: swap: unlock swapfile inode mutex
before closing file on bad swapfiles") and 83ef99befc32 ("sys_swapon:
remove did_down variable") caused a double unlock of the inode mutex
(once in bad_swap: before the filp_close, once at the end just before
returning).

The patch which added the extra unlock cleared did_down to avoid
unlocking twice, but the other patch removed the did_down variable.

To fix, set inode to NULL after the first unlock, since it will be used
after that point only for the final unlock.

While checking this patch, I found a path which could unlock without
locking, in case the same inode was added as a swapfile twice. To fix,
move the setting of the inode variable further down, to just before
claim_swapfile, which will lock the inode before doing anything else.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

smp: add missing init.h include

Commit 34db18a054c6 ("smp: move smp setup functions to kernel/smp.c")
causes this build error on s390 because of a missing init.h include:

  CC      arch/s390/kernel/asm-offsets.s
  In file included from /home2/heicarst/linux-2.6/arch/s390/include/asm/spinlock.h:14:0,
  from include/linux/spinlock.h:87,
  from include/linux/seqlock.h:29,
  from include/linux/time.h:8,
  from include/linux/timex.h:56,
  from include/linux/sched.h:57,
  from arch/s390/kernel/asm-offsets.c:10:
  include/linux/smp.h:117:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'setup_nr_cpu_ids'
  include/linux/smp.h:118:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'smp_init'

Fix it by adding the include statement.

Signed-off-by: Heiko Carstens <[email protected]>
Acked-by: WANG Cong <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'topic/asoc' into for-linus