Git Repo - linux.git/log

memcg: rename mem_control_xxx to memcg_xxx

Replace memory_cgroup_xxx() with memcg_xxx()

Signed-off-by: Wanpeng Li <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: fix bad behavior in use_hierarchy file

I have an application that does the following:

* copy the state of all controllers attached to a hierarchy
* replicate it as a child of the current level.

I would expect writes to the files to mostly succeed, since they are
inheriting sane values from parents.

But that is not the case for use_hierarchy.  If it is set to 0, we succeed
ok.  If we're set to 1, the value of the file is automatically set to 1 in
the children, but if userspace tries to write the very same 1, it will
fail.  That same situation happens if we set use_hierarchy, create a
child, and then try to write 1 again.

Now, there is no reason whatsoever for failing to write a value that is
already there.  It doesn't even match the comment, which states:

/* If parent's use_hierarchy is set, we can't make any modifications
  * in the child subtrees...

since we are not changing anything.

So test the new value against the one we're storing, and automatically
return 0 if we're not proposing a change.
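
A minimal sketch of the check described above, written as plain C outside
the kernel.  The struct and function names (toy_memcg,
toy_write_use_hierarchy) are hypothetical, and the refusal conditions are
simplified assumptions rather than the actual memcg code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; not the kernel's struct. */
struct toy_memcg {
    struct toy_memcg *parent;   /* NULL for the root group */
    bool use_hierarchy;
    bool has_children;
};

/*
 * Write handler sketch: returns 0 on success or a negative errno.
 * Assumed rules: only 0 or 1 is valid, and a real change is refused when
 * the parent already uses hierarchy or when children exist.
 */
int toy_write_use_hierarchy(struct toy_memcg *cg, unsigned long val)
{
    if (val > 1)
        return -EINVAL;

    /* The fix: writing the value that is already stored changes nothing. */
    if (cg->use_hierarchy == (bool)val)
        return 0;

    if ((cg->parent && cg->parent->use_hierarchy) || cg->has_children)
        return -EBUSY;

    cg->use_hierarchy = (bool)val;
    return 0;
}
```

With this ordering, a tool that copies a parent's settings into a freshly
created child can write the inherited 1 back without seeing an error.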

Signed-off-by: Glauber Costa <[email protected]>
Cc: Dhaval Giani <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Ying Han <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove unused LRU_ALL_EVICTABLE

Signed-off-by: Wanpeng Li <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: rename config variables

Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[[email protected]: fix missed bits]
Cc: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: clean up __count_immobile_pages()

The __count_immobile_pages() naming is rather awkward. Choose a
clearer name and add a comment.

Signed-off-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Bartlomiej Zolnierkiewicz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: do not use page_count() without a page pin

d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed
this problem in vmscan.c, but the same problem exists in
__count_immobile_pages().

The description from d179e84ba is quoted below.

"It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU."

Signed-off-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Bartlomiej Zolnierkiewicz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: replace some information in tasklist dump

The number of ptes and swap entries are used in the oom killer's badness
heuristic, so they should be shown in the tasklist dump.

This patch adds those fields and replaces the cpu and oom_adj values that
are currently emitted. Cpu isn't interesting, and oom_adj is deprecated and
will be removed later this year; the same information is already displayed
as oom_score_adj, which is used internally.

At the same time, make the documentation a little more clear to state this
information is helpful to determine why the oom killer chose the task it
did to kill.

Signed-off-by: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: fix potential killing of thread that is disabled from oom killing

/proc/sys/vm/oom_kill_allocating_task will immediately kill current when
the oom killer is called to avoid a potentially expensive tasklist scan
for large systems.

Currently, however, it is not checking current's oom_score_adj value which
may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
killing.

This patch avoids killing current in such a condition and simply falls
back to the tasklist scan since memory still needs to be freed.
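
The decision described above can be sketched in a few lines of standalone
C.  The types and names here (toy_task, toy_pick_action) are hypothetical;
only the idea of comparing against OOM_SCORE_ADJ_MIN comes from the text,
and the value -1000 is stated as an assumption about that constant:

```c
#include <stdbool.h>

/* Assumption: mirrors the value of the kernel's OOM_SCORE_ADJ_MIN. */
#define TOY_OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical reduction of a task to the one field that matters here. */
struct toy_task {
    int oom_score_adj;
};

enum toy_oom_action {
    TOY_KILL_CURRENT,     /* fast path: kill the allocating task */
    TOY_SCAN_TASKLIST     /* slow path: pick a victim by scanning */
};

/*
 * Take the fast path only when the sysctl is enabled and the allocating
 * task has not been exempted from oom killing.
 */
enum toy_oom_action toy_pick_action(bool kill_allocating_task,
                                    const struct toy_task *current_task)
{
    if (kill_allocating_task &&
        current_task->oom_score_adj != TOY_OOM_SCORE_ADJ_MIN)
        return TOY_KILL_CURRENT;

    /* Memory still needs to be freed, so fall back to the scan. */
    return TOY_SCAN_TASKLIST;
}
```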

Signed-off-by: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again

commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds
pages to the buddy allocator again") fixed one free_pcppages_bulk()
misuse. But two other misuses still exist.

This patch fixes them.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Wu Fengguang <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()

Eric Wong reported that his test suite fails when /tmp is tmpfs.

https://lkml.org/lkml/2012/2/24/479

Currently the input check of POSIX_FADV_WILLNEED has two problems.

- requires a_ops->readpage.  But in fact, force_page_cache_readahead()
  requires that the target filesystem has either ->readpage or ->readpages.

- returns -EINVAL when the filesystem doesn't have ->readpage.  But
  POSIX says that fadvise is merely a hint.  Thus fadvise() should return
  0 if the filesystem has no means of implementing fadvise().  The userland
  application should not know nor care which type of filesystem backs the
  TMPDIR directory, as Eric pointed out.  There is nothing which userspace
  can do to solve this error.

So change the return value to 0 when the filesystem doesn't support readahead.
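
A rough sketch of the resulting behavior, as standalone C.  The toy_aops
struct reduces the address space operations to the two hooks named above;
every identifier here is hypothetical and this is not the kernel's fadvise
code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of the a_ops table to the two hooks that matter. */
struct toy_aops {
    int (*readpage)(void);
    int (*readpages)(void);
};

/* Readahead is possible if either hook is present. */
bool toy_can_readahead(const struct toy_aops *ops)
{
    return ops->readpage != NULL || ops->readpages != NULL;
}

/*
 * POSIX_FADV_WILLNEED sketch: the advice is only a hint, so a filesystem
 * that cannot do readahead yields 0 instead of -EINVAL.
 * *started reports whether readahead would actually be kicked off.
 */
int toy_fadvise_willneed(const struct toy_aops *ops, bool *started)
{
    *started = toy_can_readahead(ops);
    return 0;
}

/* Placeholder hook used only to populate a toy_aops in examples. */
int toy_noop_read(void)
{
    return 0;
}
```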

[[email protected]: checkpatch fixes]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Eric Wong <[email protected]>
Tested-by: Eric Wong <[email protected]>
Reviewed-by: Wanlong Gao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: cleanup on compaction_deferred

When CONFIG_COMPACTION is enabled, compaction_deferred() tries to
recalculate the deferred limit again, which isn't necessary.

When CONFIG_COMPACTION is disabled, compaction_deferred() should return
"true" or "false" since it has "bool" for its return value.

Signed-off-by: Gavin Shan <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: make mem_cgroup_force_empty_list() return bool

mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY
indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return
a bool to simplify the logic.

[[email protected]: rework mem_cgroup_force_empty_list()'s comment]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: mem_cgroup_move_parent() doesn't need gfp_mask

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: clean up force_empty_list() return value check

After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()"
mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove
the -ENOMEM and -EINTR checks.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove check for signal_pending() during rmdir()

After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim
will occur when removing a memory cgroup. If -EINTR is returned here,
cgroup will show a warning.

We don't need to handle any user interruption signal. Remove this.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memblock.c:memblock_double_array(): cosmetic cleanups

This function is an 80-column eyesore, quite unnecessarily. Clean that
up, and use standard comment layout style.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greg Pearson <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yinghai Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: do not schedule if current has been killed

The oom killer currently schedules away from current in an uninterruptible
sleep if it does not have access to memory reserves. It's possible that
current was killed because it shares memory with the oom killed thread or
because it was killed by the user in the interim, however.

This patch only schedules away from current if it does not have a pending
kill, i.e. if it does not share memory with the oom killed thread. It's
possible that it will immediately retry its memory allocation and fail,
but it will immediately be given access to memory reserves if it calls the
oom killer again.

This prevents the delay of memory freeing when threads that share memory
with the oom killed thread get unnecessarily scheduled.

Signed-off-by: David Rientjes <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate

We already hold the hugetlb_lock. That should prevent a parallel cgroup
rmdir from touching page's hugetlb cgroup. So remove the exclude and
wakeup calls.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.

A page's hugetlb cgroup assignment and movement to the active list should
occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup
we will iterate the active list and find pages with NULL hugetlb cgroup
values.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: move all the in use pages to active list

When we fail to allocate pages from the reserve pool, hugetlb tries to
allocate huge pages using alloc_buddy_huge_page. Add these to the active
list. We also need to add to the active list the huge page we allocate
when we soft offline the old page.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: add HugeTLB controller documentation

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

With HugeTLB pages, hugetlb cgroup is uncharged in compound page
destructor. Since we are holding a hugepage reference, we can be sure
that old page won't get uncharged till the last put_page().

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: add hugetlb cgroup control files

Add the control files for hugetlb controller

[[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
[[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: add support for cgroup removal

Add support for cgroup removal. If we don't have parent cgroup, the
charges are moved to root cgroup.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup

Add the charge and uncharge routines for hugetlb cgroup. We do cgroup
charging in page alloc and uncharge in compound page destructor.
Assigning page's hugetlb cgroup is protected by hugetlb_lock.

[[email protected]: add huge_page_order check to avoid incorrect uncharge]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb/cgroup: add the cgroup pointer to page lru

Add the hugetlb cgroup pointer to 3rd page lru.next. This limits the usage
of hugetlb cgroup to only hugepages with 3 or more normal pages. I guess
that is an acceptable limitation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: add new HugeTLB cgroup

Implement a new controller that allows us to control HugeTLB allocations.
The extension allows limiting the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that
the application will get a SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
many HugeTLB pages it would require for its use.

The charge/uncharge calls will be added to HugeTLB code in later patch.
Support for cgroup removal will be added in later patches.
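
To make the limit enforcement concrete, here is a small standalone C
sketch.  The struct and function are hypothetical, and charging every
ancestor up to the root is an assumption made for illustration; the text
above only says that a limit is enforced at page fault time:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group accounting, counted in huge pages. */
struct toy_hugetlb_cg {
    unsigned long usage;             /* huge pages currently charged */
    unsigned long limit;             /* maximum huge pages allowed */
    struct toy_hugetlb_cg *parent;   /* NULL for the root group */
};

/*
 * Try to charge nr_pages to cg and (by assumption) to all its ancestors.
 * Returns false without changing anything if any group would exceed its
 * limit; a fault handler would then deliver SIGBUS, since HugeTLB pages
 * cannot be reclaimed to make room.
 */
bool toy_charge(struct toy_hugetlb_cg *cg, unsigned long nr_pages)
{
    struct toy_hugetlb_cg *c;

    /* First pass: check every group on the path to the root. */
    for (c = cg; c; c = c->parent)
        if (c->usage > c->limit || nr_pages > c->limit - c->usage)
            return false;

    /* Second pass: commit the charge. */
    for (c = cg; c; c = c->parent)
        c->usage += nr_pages;

    return true;
}
```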

[[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
[[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g]
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: make some static variables global

We will use them later in hugetlb_cgroup.c

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: add a list for tracking in-use HugeTLB pages

hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
On cgroup removal we update the page's HugeTLB cgroup to point to parent
cgroup.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: simplify migrate_huge_page()

Since we migrate only one hugepage, don't use a linked list for passing the
page around. Directly pass the page that needs to be migrated as an argument.
This also removes the usage of page->lru in the migrate path.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

Use a mmu_gather instead of a temporary linked list for accumulating pages
when we unmap a hugepage range

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: add an inline helper for finding hstate index

Add an inline helper and use it in the code.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: don't use ERR_PTR with VM_FAULT* values

The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.
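
The coupling can be illustrated with a userspace imitation of the
error-pointer encoding.  All toy_ names are hypothetical, the fault code
values are illustrative only, and the errno-to-fault mapping in
toy_fault_from_alloc() is an assumed shape for the decoupled design, not a
copy of the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095   /* largest errno an error pointer can carry */

/* Encode a negative errno in a pointer value. */
void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno from an error pointer. */
long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Only the top TOY_MAX_ERRNO addresses are treated as errors. */
bool toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-(intptr_t)TOY_MAX_ERRNO);
}

/* Illustrative fault codes; free to grow since they never enter a pointer. */
enum { TOY_FAULT_OOM = 0x1, TOY_FAULT_SIGBUS = 0x2 };

/* The allocator returns an errno pointer; the caller picks the fault code. */
int toy_fault_from_alloc(const void *page)
{
    if (!toy_is_err(page))
        return 0;
    return toy_ptr_err(page) == -ENOMEM ? TOY_FAULT_OOM : TOY_FAULT_SIGBUS;
}
```

A value whose magnitude exceeds TOY_MAX_ERRNO would not be recognized by
toy_is_err(), which is why fault codes stored this way had to stay small.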

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: rename max_hstate to hugetlb_max_hstate

This patchset implements a cgroup resource controller for HugeTLB pages.
The controller allows limiting the HugeTLB usage per control group and
enforces the controller limit during page fault.  Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that
the application will get a SIGBUS signal if it tries to access HugeTLB pages
beyond its limit.  This requires the application to know beforehand how
many HugeTLB pages it would require for its use.

The goal is to control how many HugeTLB pages a group of tasks can
allocate.  It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock.  HPC job schedulers require jobs to specify their resource
requirements in the job file.  Once their requirements can be met, job
schedulers (like SLURM) will schedule the job.  We need to make sure that
the jobs won't consume more resources than requested.  If they do we
should either error out or kill the application.

This patch:

Rename max_hstate to hugetlb_max_hstate.  We will be using this from other
subsystems like hugetlb controller in later patches.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads

Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

For backward compatibility, print a warning via printk and return 2 to
notify users that the interface is removed.

Signed-off-by: Wanpeng Li <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/buddy: cleanup on should_fail_alloc_page

Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.

The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.

Signed-off-by: Gavin Shan <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: account the total_vm in the vm_stat_account()

vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.

Signed-off-by: Huang Shijie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

documentation: update how page-cluster affects swap I/O

Fix the documentation of /proc/sys/vm/page-cluster to match the
behavior of the code and add some comments about what the tunable will
change in that behavior.

Signed-off-by: Christian Ehrhardt <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

swap: allow swap readahead to be merged

Swap readahead works fine, but the I/O to disk is almost always done in
page size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.

On older kernels the old per device plugging behavior might have captured
this and merged the requests, but currently it all comes down to many more
I/Os than required.

On a single device this might not be an issue, but as soon as a server
runs on shared SAN resources, saving I/Os not only improves swapin
throughput but also provides lower resource utilization.

With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly
and the load feels more responsive as well as achieves more throughput.

In a test setup with 16 swap disks running blocktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

I got ~10% to ~40% more throughput in my cases and at the same time much
lower cpu consumption when broken down per transferred kilobyte (the
majority of that due to saved interrupts and better cache handling).  In a
shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.
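
The effect on merging can be read off the read-side figures in the two
tables with a little arithmetic.  The helpers below are a hypothetical
sketch (the function names are not from any tool); the numbers in the
comments are the ones quoted above:

```c
/* Fraction of queued requests that were merged into another request. */
double toy_merge_ratio(unsigned long merges, unsigned long queued)
{
    return queued ? (double)merges / (double)queued : 0.0;
}

/* Average size of one dispatched request in KiB, given the total in MiB. */
double toy_avg_dispatch_kib(unsigned long total_mib, unsigned long dispatches)
{
    return dispatches ? (double)total_mib * 1024.0 / (double)dispatches : 0.0;
}

/*
 * Read side, prior:      toy_merge_ratio(16187, 560888)
 *                        toy_avg_dispatch_kib(2243, 544701)
 * Read side, with patch: toy_merge_ratio(519343, 734315)
 *                        toy_avg_dispatch_kib(2937, 214972)
 */
```

On these inputs the read merge ratio goes from under 3% to roughly 70%,
and the average read dispatch grows from a little over one 4 KiB page to
about 14 KiB.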

Signed-off-by: Christian Ehrhardt <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: remove MEM_CGROUP_CHARGE_TYPE_FORCE

There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE").

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANON

Now, in memcg, 2 "MAPPED" enum/macro are found
MEM_CGROUP_CHARGE_TYPE_MAPPED
MEM_CGROUP_STAT_FILE_MAPPED

Their names look similar to each other, but the former is used for
accounting anonymous memory. Rename it to TYPE_ANON.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAP

MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than
the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: make vb_alloc() more foolproof

If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate
0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT
and interesting stuff happens. So make debugging such problems easier and
warn about 0-size allocation.
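
The wraparound can be reproduced in userspace with a shift-and-count order
helper.  This is a sketch under assumptions: toy_get_order() is a
hypothetical imitation of such a helper, not the kernel's code, and
TOY_PAGE_SHIFT assumes 4 KiB pages:

```c
#include <limits.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Smallest order such that (1 << order) pages hold size bytes. */
int toy_get_order(unsigned long size)
{
    int order = -1;

    /* For size == 0 the subtraction wraps around to ULONG_MAX. */
    size = (size - 1) >> (TOY_PAGE_SHIFT - 1);
    do {
        size >>= 1;
        order++;
    } while (size);

    return order;
}

/* The guard the message asks for: reject 0-size requests up front. */
bool toy_alloc_size_ok(unsigned long size)
{
    return size != 0;
}
```

Here toy_get_order(0) evaluates to the number of bits in an unsigned long
minus TOY_PAGE_SHIFT, matching the value named in the message, so an
explicit size check before computing the order is the cheap way to catch
the mistake.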

[[email protected]: use WARN_ON-return-value feature]
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vmalloc: walk vmap_areas by sorted list instead of rb_next()

There's a walk that repeatedly calls rb_next to find a suitable hole.  It
can simply be replaced by a walk on the sorted vmap_area_list, which is
simpler and more efficient.

Mutation of the list and tree only happens in pair within
__insert_vmap_area and __free_vmap_area, under protection of
vmap_area_lock.  The patch code is also under vmap_area_lock, so the list
walk is safe, and consistent with the tree walk.

Tested on SMP by repeating batch of vmalloc and vfree for random sizes and
rounds for hours.
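
The idea of finding a hole by walking a sorted list can be sketched in
standalone C.  The toy_area struct and toy_find_hole() are hypothetical
simplifications (no alignment, no locking, no overflow handling) and do
not reproduce the vmalloc code:

```c
#include <stddef.h>

/* Hypothetical area covering [start, end); the list is sorted by start. */
struct toy_area {
    unsigned long start;
    unsigned long end;
    struct toy_area *next;
};

/*
 * Return the lowest address >= vstart where size bytes fit below vend
 * without overlapping any area, or 0 if there is no such hole.
 */
unsigned long toy_find_hole(const struct toy_area *head,
                            unsigned long vstart, unsigned long vend,
                            unsigned long size)
{
    unsigned long addr = vstart;
    const struct toy_area *a;

    for (a = head; a; a = a->next) {
        if (a->end <= addr)
            continue;               /* area lies entirely below addr */
        if (addr + size <= a->start)
            break;                  /* the gap before this area fits */
        addr = a->end;              /* otherwise retry just past it */
    }

    return (addr + size <= vend) ? addr : 0;
}
```

Each step just follows the next pointer, where a tree walk would need to
compute the in-order successor.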

Signed-off-by: Hong Zhiguo <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/media/video/v4l2-ioctl.c: fix build

Fix zillions of these:

drivers/media/video/v4l2-ioctl.c:1848: error: unknown field 'func' specified in initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: missing braces around initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: (near initialization for 'v4l2_ioctls[0].<anonymous>')
drivers/media/video/v4l2-ioctl.c:1848: warning: initialization makes integer from pointer without a cast
drivers/media/video/v4l2-ioctl.c:1848: error: initializer element is not computable at load time
drivers/media/video/v4l2-ioctl.c:1848: error: (near initialization for 'v4l2_ioctls[0].<anonymous>.offset')

Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

xtensa: select generic atomic64_t support

This will fix build errors:

block/blk-cgroup.c:609:2: error: unknown type name 'atomic64_t'
block/blk-cgroup.c:609:2: error: implicit declaration of function 'ATOMIC64_INIT' [-Werror=implicit-function-declaration]

Signed-off-by: Fengguang Wu <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fault-injection: fix failcmd.sh warning

"fault-injection: add tool to run command with failslab or
fail_page_alloc" added tools/testing/fault-injection/failcmd.sh to make it
easier to inject slab/page allocation failures by fault injection.

failcmd.sh prints the following warning when running with arguments
for command.

# ./failcmd.sh echo aaa
failcmd.sh: line 209: [: echo: binary operator expected
aaa

This warning is caused by an improper check whether at least one
parameter is left after parsing command options.

Fix it by testing the length of $1 instead of $@

Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6

Pull battery updates from Anton Vorontsov:
"The tag contains just a few battery-related changes for v3.6.  It is
  all pretty straightforward, except one thing.

  One of our patches added thermal support for power supply class, but
  thermal/ subsystem changed under our feet.  We (well, Stephen, that
  is) caught the issue and it was decided[1] that I'd just delay the
  battery pull request, and then will fix it up by merging upstream back
  into battery tree at the specific commit.

  That's not all though: another[2] small fixup for thermal subsystem
  was needed to get rid of a warning in power supply subsystem (the
  warning was not drivers/power's "fault", the thermal registration
  function just needed a proper const annotation), which is also done by
  a small commit on top of the merge.

  So, to sum this up:
   - The 'master' branch of the battery tree was in the -next tree for
     weeks, was never rebased, altered etc.  It should be all OK;
   - Although, for-v3.6 tag contains the 'master' branch + merge + the
     warning fix.

  [1] http://lkml.org/lkml/2012/6/19/23
  [2] http://lkml.org/lkml/2012/6/18/28"

* tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits)
  thermal: Constify 'type' argument for the registration routine
  olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries
  olpc-battery: Add VOLTAGE_MAX_DESIGN property
  charger-manager: Fix build break related to EXTCON
  lp8727_charger: Move header file into platform_data directory
  power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT
  bq27x00_battery: Add support for BQ27425 chip
  charger-manager: Set current limit of regulator for over current protection
  charger-manager: Use EXTCON Subsystem to detect charger cables for charging
  test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties
  test_power: Add support for USB AC source
  gpio-charger: Use cansleep version of gpio_set_value
  bq27x00_battery: Add support for power average and health properties
  sbs-battery: Don't trigger false supply_changed event
  twl4030_charger: Allow charger to control the regulator that feeds it
  twl4030_charger: Add backup-battery charging
  twl4030_charger: Fix some typos
  max17042_battery: Support CHARGE_COUNTER power supply attribute
  smb347-charger: Add constant charge and current properties
  power_supply: Add constant charge_current and charge_voltage properties
  ...

Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
  updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
  fixes) now that Arnaldo is back from vacation."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  uprobes: __replace_page() needs munlock_vma_page()
  uprobes: Rename vma_address() and make it return "unsigned long"
  uprobes: Fix register_for_each_vma()->vma_address() check
  uprobes: Introduce vaddr_to_offset(vma, vaddr)
  uprobes: Teach build_probe_list() to consider the range
  uprobes: Remove insert_vm_struct()->uprobe_mmap()
  uprobes: Remove copy_vma()->uprobe_mmap()
  uprobes: Fix overflow in vma_address()/find_active_uprobe()
  uprobes: Suppress uprobe_munmap() from mmput()
  uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
  uprobes: Clean up and document write_opcode()->lock_page(old_page)
  uprobes: Kill write_opcode()->lock_page(new_page)
  uprobes: __replace_page() should not use page_address_in_vma()
  uprobes: Don't recheck vma/f_mapping in write_opcode()
  perf/x86: Fix missing struct before structure name
  perf/x86: Fix format definition of SNB-EP uncore QPI box
  perf/x86: Make bitfield unsigned
  perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
  perf/x86: Add Intel Nehalem-EX uncore support
  perf/x86: Fix typo in format definition of uncore PCU filter
  ...

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc

Pull powerpc updates from Benjamin Herrenschmidt:
"Kumar sent me a handful of Freescale related fixes and I added another
  regression fix to the pile.

  PS.  I -will- eventually learn about that signed tag business :-)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/kvm/book3s_32: Fix MTMSR_EERI macro
  powerpc/85xx: p1022ds: fix DIU/LBC switching with NAND enabled
  powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled
  powerpc/85xx: Fix sram_offset parameter type
  powerpc/85xx: P3041DS - change espi input-clock from 40MHz to 35MHz
  powerpc/85xx: Fix pci base address error for p2020rdb-pc in dts

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Martin Schwidefsky:
"This is the second batch of s390 patches for the 3.6 merge window.
  Included is enablement for two common code changes, killable page
  faults and sorted exception tables.  And the regular set of cleanup
  and bug fix patches."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390: make use of user_mode() macro where possible
  s390/mm: rename user_mode variable to addressing_mode
  s390/mm: fix fault handling for page table walk case
  s390/mm: make page faults killable
  s390: update defconfig
  s390/mm: downgrade page table after fork of a 31 bit process
  s390/ipl: Use diagnose 8 command separation
  s390/linker script: use RO_DATA_SECTION
  s390/exceptions: sort exception table at build time
  s390/debug: remove module_exit function / move EXPORT_SYMBOLs

ipv4: Properly purge netdev references on uncached routes.

When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.

If a route is uncached, we currently have no way of accomplishing
this.

So create a global list that is scanned when a network device goes
down. This mirrors the logic in net/core/dst.c's dst_ifdown().

Signed-off-by: David S. Miller <[email protected]>

ipv4: Cache routes in nexthop exception entries.

Signed-off-by: David S. Miller <[email protected]>

Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux

Pull nfsd changes from J. Bruce Fields:
"This has been an unusually quiet cycle--mostly bugfixes and cleanup.
  The one large piece is Stanislav's work to containerize the server's
  grace period--but that in itself is just one more step in a
  not-yet-complete project to allow fully containerized nfs service.

  There are a number of outstanding delegation, container, v4 state, and
  gss patches that aren't quite ready yet; 3.7 may be wilder."

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
  NFSd: make boot_time variable per network namespace
  NFSd: make grace end flag per network namespace
  Lockd: move grace period management from lockd() to per-net functions
  LockD: pass actual network namespace to grace period management functions
  LockD: manage grace list per network namespace
  SUNRPC: service request network namespace helper introduced
  NFSd: make nfsd4_manager allocated per network namespace context.
  LockD: make lockd manager allocated per network namespace
  LockD: manage grace period per network namespace
  Lockd: add more debug to host shutdown functions
  Lockd: host complaining function introduced
  LockD: manage used host count per networks namespace
  LockD: manage garbage collection timeout per networks namespace
  LockD: make garbage collector network namespace aware.
  LockD: mark host per network namespace on garbage collect
  nfsd4: fix missing fault_inject.h include
  locks: move lease-specific code out of locks_delete_lock
  locks: prevent side-effects of locks_release_private before file_lock is initialized
  NFSd: set nfsd_serv to NULL after service destruction
  NFSd: introduce nfsd_destroy() helper
  ...

ipv4: percpu nh_rth_output cache

Input path is mostly run under RCU and doesn't touch dst refcnt.

But output path on forwarding or UDP workloads hits the dst refcount
badly, and we have a lot of false sharing, for example
in ipv4_mtu() when reading rt->rt_pmtu.

Using a percpu cache for nh_rth_output gives a nice performance
increase at a small cost.

24 udpflood test on my 24 cpu machine (dummy0 output device)
(each process sends 1.000.000 udp frames, 24 processes are started)

before : 5.24 s
after : 2.06 s
For reference, time on linux-3.5 : 6.60 s

Signed-off-by: Eric Dumazet <[email protected]>
Tested-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: Restore old dst_free() behavior.

commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
to solve a race but added a problem at device/fib dismantle time :

We really want to call dst_free() as soon as possible, even if sockets
still have dst in their cache.
dst_release() calls in free_fib_info_rcu() are not welcome.

Root of the problem was that now we also cache output routes (in
nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
rt_free(), because output route lookups are done in process context.

Based on feedback and initial patch from David Miller (adding another
call_rcu_bh() call in fib, but it appears it was not the right fix)

I left the inet_sk_rx_dst_set() helper and added __rcu attributes
to nh_rth_output and nh_rth_input to better document what is going on in
this code.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph changes from Sage Weil:
"Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...

[media] radio-tea5777: use library for 64bits div

drivers/built-in.o: In function `radio_tea5777_set_freq':
radio-tea5777.c:(.text+0x4d8704): undefined reference to `__udivdi3'

Reported-by: Randy Dunlap <[email protected]>
Cc: Hans de Goede <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

nfs: explicitly reject LOCK_MAND flock() requests

We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly
return -EINVAL if someone requests it.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>

nfs: increase number of permitted callback connections.

By default a sunrpc service is limited to (N+3)*20 connections
where N is the number of threads.  This is 80 when N==1.
If this number is exceeded a warning is printed suggesting that
the number of threads be increased.  However with services which
run a single thread, this is impossible.

For such services there is a ->sv_maxconn setting that can be
used to forcibly increase the limit, and silence the message.
This is used by lockd.

The nfs client uses a sunrpc service to handle callbacks and
it too is single-threaded, so to avoid the useless messages,
and to allow a reasonable number of concurrent connections,
we need to set ->sv_maxconn.  1024 seems like a good number.
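
The arithmetic above can be sketched in user-space C. This is only an
illustration: default_maxconn() and effective_maxconn() are hypothetical
names, and treating a non-zero override as taking precedence is an
assumption about how ->sv_maxconn is consulted, not a copy of the sunrpc
code.

```c
/* Hypothetical helper: default connection limit for a service with
 * nrthreads threads, following the (N+3)*20 rule quoted above. */
unsigned int default_maxconn(unsigned int nrthreads)
{
	return (nrthreads + 3) * 20;
}

/* Hypothetical helper: a non-zero override (standing in for
 * ->sv_maxconn) is assumed to replace the computed default. */
unsigned int effective_maxconn(unsigned int nrthreads,
			       unsigned int maxconn_override)
{
	if (maxconn_override != 0)
		return maxconn_override;
	return default_maxconn(nrthreads);
}
```

With one thread and no override the limit is 80; with the override set
to 1024 the limit becomes 1024 regardless of the thread count.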

Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>

vfio: Add PCI device driver

Add PCI device support for VFIO.  PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device.  PCI config access is virtualized in the kernel,
allowing us to ensure the integrity of the system, by preventing
various accesses while reducing duplicate support across various
userspace drivers.  I/O port supports read/write access while
MMIO also supports mmap of sufficiently sized regions.  Support
for INTx, MSI, and MSI-X interrupts is provided using eventfds to
userspace.

Signed-off-by: Alex Williamson <[email protected]>

vfio: Type1 IOMMU implementation

This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel
VT-d hardware, but is potentially usable by anything supporting
similar mapping functionality.  We arbitrarily call this a Type1
backend for lack of a better name.  This backend has no IOVA
or host memory mapping restrictions for the user and is optimized
for relatively static mappings.  Mapped areas are pinned into system
memory.

Signed-off-by: Alex Williamson <[email protected]>

vfio: Add documentation

Signed-off-by: Alex Williamson <[email protected]>

vfio: VFIO core

VFIO is a secure user level driver for use with both virtual machines
and user level drivers.  VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access.  It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface.  We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model.  IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms.  VFIO
users are able to query and initialize the IOMMU model of their
choice.

Please see the follow-on Documentation commit for further description
and usage example.

Signed-off-by: Alex Williamson <[email protected]>

thermal: Constify 'type' argument for the registration routine

thermal_zone_device_register() does not modify 'type' argument, so it is
safe to declare it as const. Otherwise, if we pass a const string, we are
getting the ugly warning:

CC drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function 'psy_register_thermal':
drivers/power/power_supply_core.c:204:6: warning: passing argument 1 of 'thermal_zone_device_register' discards 'const' qualifier from pointer target type [enabled by default]
include/linux/thermal.h:140:29: note: expected 'char *' but argument is of type 'const char *'
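
A minimal user-space sketch of the same idea, assuming a hypothetical
demo_zone_register() that only reads its 'type' argument (the struct,
the length limit and the function name are illustrative and not taken
from the thermal code):

```c
#include <string.h>

#define DEMO_NAME_LEN 20	/* hypothetical limit, for illustration only */

/* Hypothetical zone object that keeps its own copy of the type string. */
struct demo_zone {
	char type[DEMO_NAME_LEN];
};

/* 'type' is only read, so it is declared const.  Callers can then pass
 * a const string without a "discards 'const' qualifier" warning. */
int demo_zone_register(struct demo_zone *tz, const char *type)
{
	if (type == NULL || strlen(type) >= DEMO_NAME_LEN)
		return -1;	/* missing or too long to store */

	strcpy(tz->type, type);	/* fits: length was checked above */
	return 0;
}
```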

Signed-off-by: Anton Vorontsov <[email protected]>
Acked-by: Jean Delvare <[email protected]>

Merge with upstream to accommodate with thermal changes

This merge is performed to take commit c56f5c0342dfee11a1 ("Thermal: Make
Thermal trip points writeable") out of Linus' tree and then fixup power
supply class. This is needed since thermal stuff added a new argument:

CC drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function ‘psy_register_thermal’:
drivers/power/power_supply_core.c:204:6: warning: passing argument 3 of ‘thermal_zone_device_register’ makes integer from pointer without a cast [enabled by default]
include/linux/thermal.h:154:29: note: expected ‘int’ but argument is of type ‘struct power_supply *’
drivers/power/power_supply_core.c:204:6: error: too few arguments to function ‘thermal_zone_device_register’
include/linux/thermal.h:154:29: note: declared here
make[1]: *** [drivers/power/power_supply_core.o] Error 1
make: *** [drivers/power/] Error 2

Signed-off-by: Anton Vorontsov <[email protected]>

powerpc/kvm/book3s_32: Fix MTMSR_EERI macro

Commit b38c77d82e4 moved the MTMSR_EERI macro from the KVM code to generic
ppc_asm.h code. However, while adding it in the headers for the ppc32 case,
it failed to remove the former definition in the KVM code.

This patch fixes compilation on server type PPC32 targets with CONFIG_KVM
enabled.

Signed-off-by: Alexander Graf <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>

Merge remote-tracking branch 'kumar/merge' into merge

Kumar says:

"A few patches that missed the initial 3.6 window. These are bug fixes at
this point."

Merge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback updates from Wu Fengguang:
"Use time based periods to age the writeback proportions, which can
  adapt equally well to fast/slow devices."

Fix up trivial conflict in comment in fs/sync.c

* tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Fix some comment errors
  block: Convert BDI proportion calculations to flexible proportions
  lib: Fix possible deadlock in flexible proportion code
  lib: Proportions with flexible period

[media] tlg2300: Declare MODULE_FIRMWARE usage

Cc: Huang Shijie <[email protected]>
Cc: Kang Yong <[email protected]>
Cc: Zhang Xiaobing <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: [email protected]
Signed-off-by: Tim Gardner <[email protected]>
Acked-by: Huang Shijie <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] lgs8gxx: Declare MODULE_FIRMWARE usage

Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: [email protected]
Signed-off-by: Tim Gardner <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] xc5000: Add MODULE_FIRMWARE statements

This will make modinfo more useful with regard
to discovering necessary firmware files.

Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Michael Krufky <[email protected]>
Cc: Eddi De Pieri <[email protected]>
Cc: [email protected]
Signed-off-by: Tim Gardner <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] s2255drv: Add MODULE_FIRMWARE statement

Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Dean Anderson <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: [email protected]
Signed-off-by: Tim Gardner <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
"Features include:
   - More preparatory patches for modularising NFSv2/v3/v4.  Split out
     the various NFSv2/v3/v4-specific code into separate files
   - More preparation for the NFSv4 migration code
   - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold
     parameters
   - pNFS fast failover when the data servers are down
   - Various cleanups and debugging patches"

* tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
  nfs: fix fl_type tests in NFSv4 code
  NFS: fix pnfs regression with directio writes
  NFS: fix pnfs regression with directio reads
  sunrpc: clnt: Add missing braces
  nfs: fix stub return type warnings
  NFS: exit_nfs_v4() shouldn't be an __exit function
  SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors
  NFS: Split out NFS v4 client functions
  NFS: Split out the NFS v4 filesystem types
  NFS: Create a single nfs_clone_super() function
  NFS: Split out NFS v4 server creating code
  NFS: Initialize the NFS v4 client from init_nfs_v4()
  NFS: Move the v4 getroot code to nfs4getroot.c
  NFS: Split out NFS v4 file operations
  NFS: Initialize v4 sysctls from nfs_init_v4()
  NFS: Create an init_nfs_v4() function
  NFS: Split out NFS v4 inode operations
  NFS: Split out NFS v3 inode operations
  NFS: Split out NFS v2 inode operations
  NFS: Clean up nfs4_proc_setclientid() and friends
  ...

[media] dib8000: move dereference after check for NULL

My static checker complains that we dereference "state" inside the call
to fft_to_mode() before checking for NULL. The comments say that it is
possible for "state" to be NULL so I have moved the dereference after
the check.
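
The pattern can be sketched as follows. The types and functions here
are hypothetical stand-ins (demo_state, demo_fft_to_mode), not the
dib8000 code; the point is only the ordering of the check and the
dereference.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver state. */
struct demo_state {
	int fft;
};

/* Hypothetical stand-in for fft_to_mode(): dereferences its argument. */
int demo_fft_to_mode(const struct demo_state *state)
{
	return state->fft + 1;
}

/* Check for NULL first; compute anything that needs 'state' afterwards. */
int demo_get_mode(const struct demo_state *state)
{
	int mode;

	if (state == NULL)
		return -1;	/* nothing has been dereferenced yet */

	mode = demo_fft_to_mode(state);
	return mode;
}
```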

Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] Documentation: Update cardlists

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] bttv: add support for Aposonic W-DVR

Forwarded-by: Gerd Hoffmann <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Merge tag 'mfd-for-linus-3.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6

Pull MFD fix from Samuel Ortiz:
"This one fixes an s5m8767 regulator build breakage due to a merge
  conflict caused by the MFD s5m API changes."

* tag 'mfd-for-linus-3.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
  regulator: Fix an s5m8767 build failure

[media] cx25821: Remove bad strcpy to read-only char*

The strcpy was being used to set the name of the board.
This was both wrong and redundant,
since the destination char* was read-only and
the name is set statically at compile time.

The type of the name field is changed to const char*
to prevent future errors.
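
A small sketch of why the const field helps, using a hypothetical
demo_board table rather than the real cx25821 structures:

```c
#include <stddef.h>

/* Hypothetical board descriptor; the real driver struct has more fields. */
struct demo_board {
	const char *name;	/* const: points at a string literal */
};

/* Names are fixed at compile time, so no run-time strcpy() is needed. */
static const struct demo_board demo_boards[] = {
	{ .name = "UNKNOWN/GENERIC" },
	{ .name = "Demo board" },
};

/* Returns the board name, or NULL for an index outside the table. */
const char *demo_board_name(size_t idx)
{
	if (idx >= sizeof(demo_boards) / sizeof(demo_boards[0]))
		return NULL;

	/* Passing this const field as the destination of strcpy() would
	 * draw a compiler diagnostic, which is the safety net the type
	 * change provides. */
	return demo_boards[idx].name;
}
```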

Reported-by: Radek Masin <[email protected]>
Signed-off-by: Ezequiel Garcia <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:
"This is the first part of the media patches for v3.6.

  This patch series contains:
   - new DVB frontend: rtl2832
   - new video drivers: adv7393
   - some unused files got removed
   - a selection API cleanup between V4L2 and V4L2 subdev API's
   - a major redesign at v4l-ioctl2, in order to clean it up
   - several driver fixes and improvements."

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (174 commits)
  v4l: Export v4l2-common.h in include/linux/Kbuild
  media: Revert "[media] Terratec Cinergy S2 USB HD Rev.2"
  [media] media: Use pr_info not homegrown pr_reg macro
  [media] Terratec Cinergy S2 USB HD Rev.2
  [media] v4l: Correct conflicting V4L2 subdev selection API documentation
  [media] Feature removal: V4L2 selections API target and flag definitions
  [media] v4l: Unify selection flags documentation
  [media] v4l: Unify selection flags
  [media] v4l: Common documentation for selection targets
  [media] v4l: Unify selection targets across V4L2 and V4L2 subdev interfaces
  [media] v4l: Remove "_ACTUAL" from subdev selection API target definition names
  [media] V4L: Remove "_ACTIVE" from the selection target name definitions
  [media] media: dvb-usb: print mac address via native %pM
  [media] s5p-tv: Use module_i2c_driver in sii9234_drv.c file
  [media] media: gpio-ir-recv: add allowed_protos for platform data
  [media] s5p-jpeg: Use module_platform_driver in jpeg-core.c file
  [media] saa7134: fix spelling of detach in label
  [media] cx88-blackbird: replace ioctl by unlocked_ioctl
  [media] cx88: don't use current_norm
  [media] cx88: fix a number of v4l2-compliance violations
  ...

[media] pms.c: remove duplicated include

Signed-off-by: Duan Jiong <[email protected]>
Acked-by: Hans Verkuil <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

[media] smiapp-core.c: remove duplicated include

Signed-off-by: Duan Jiong <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rbd: create rbd_refresh_helper()

Create a simple helper that handles the common case of calling
__rbd_refresh_header() while holding the ctl_mutex.

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: return obj version in __rbd_refresh_header()

Add a new parameter to __rbd_refresh_header() through which the
version of the header object is passed back to the caller. In most
cases this isn't needed. The main motivation is to normalize
(almost) all calls to __rbd_refresh_header() so they are all
wrapped immediately by mutex_lock()/mutex_unlock().

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: fixes in rbd_header_from_disk()

This fixes a few issues in rbd_header_from_disk():
    - There is a check intended to catch overflow, but it's wrong in
      two ways.
        - First, the type we don't want to overflow is size_t, not
          unsigned int, and there is now a SIZE_MAX we can use
          with that type.
        - Second, we're allocating the snapshot ids and snapshot
          image sizes separately (each has type u64; on disk they
          are grouped together as a rbd_image_header_ondisk structure).
          So we can use the size of u64 in this overflow check.
    - If there are no snapshots, then there should be no snapshot
      names.  Enforce this, and issue a warning if we encounter a
      header with no snapshots but a non-zero snap_names_len.
    - When saving the snapshot names into the header, be more direct
      in defining the offset in the on-disk structure from which
      they're being copied by using "snap_count" rather than "i"
      in the array index.
    - If an error occurs, the "snapc" and "snap_names" fields are
      freed at the end of the function.  Make those fields be null
      pointers after they're freed, to be explicit that they are
      no longer valid.
    - Finally, move the definition of the local variable "i" to the
      innermost scope in which it's needed.
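
The overflow check described in the first item can be sketched in
user-space C. The struct layout and the names demo_snap_context and
demo_snapc_size() are hypothetical; only the shape of the check
(compare against SIZE_MAX, using the size of a 64-bit id) follows the
description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size header that precedes the snapshot id array. */
struct demo_snap_context {
	uint64_t seq;
	uint32_t num_snaps;
};

/* Returns the bytes needed for the header plus snap_count 64-bit ids,
 * or 0 if that total would not fit in a size_t. */
size_t demo_snapc_size(size_t snap_count)
{
	size_t max_ids = (SIZE_MAX - sizeof(struct demo_snap_context))
			 / sizeof(uint64_t);

	if (snap_count > max_ids)
		return 0;	/* header + ids would overflow size_t */

	return sizeof(struct demo_snap_context)
	       + snap_count * sizeof(uint64_t);
}
```

Dividing the remaining space by the element size keeps the comparison
itself free of overflow, which a direct multiplication would not.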

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: always pass ops array to rbd_req_sync_op()

All of the callers of rbd_req_sync_op() except one pass a non-null
"ops" pointer. The only one that does not is rbd_req_sync_read(),
which passes CEPH_OSD_OP_READ as its "opcode" and CEPH_OSD_FLAG_READ
for "flags".

By allocating the ops array in rbd_req_sync_read() and moving the
special case code for the null ops pointer into it, it becomes
clear that much of that code is not even necessary.

In addition, the "opcode" argument to rbd_req_sync_op() is never
actually used, so get rid of that.

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: pass null version pointer in add_snap()

rbd_header_add_snap() passes the address of a version variable to
rbd_req_sync_exec(), but it ignores the result. Just pass a null
pointer instead.

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: make rbd_create_rw_ops() return a pointer

Either rbd_create_rw_ops() will succeed, or it will fail because a
memory allocation failed. Have it just return a valid pointer or
null rather than stuffing a pointer into a provided address and
returning an errno.
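
A sketch of the resulting calling convention, with a hypothetical
demo_op type and demo_create_ops() function standing in for the rbd
code (calloc() replaces the kernel allocator here):

```c
#include <stdlib.h>

/* Hypothetical op record; not the real ceph op structure. */
struct demo_op {
	int opcode;
	unsigned int payload_len;
};

/* Returns a zeroed array of num_ops entries with the first entry
 * filled in, or NULL if nothing could be allocated.  There is no
 * output parameter and no separate errno to interpret. */
struct demo_op *demo_create_ops(size_t num_ops, int opcode,
				unsigned int payload_len)
{
	struct demo_op *ops;

	if (num_ops == 0)
		return NULL;

	ops = calloc(num_ops, sizeof(*ops));
	if (ops == NULL)
		return NULL;

	ops[0].opcode = opcode;
	ops[0].payload_len = payload_len;
	return ops;
}
```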

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

rbd: have __rbd_add_snap_dev() return a pointer

It's not obvious whether the snapshot pointer whose address is
provided to __rbd_add_snap_dev() will be assigned by that function.
Change it to return the snapshot, or a pointer-coded errno in the
event of a failure.
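
For readers unfamiliar with pointer-coded errnos, here is a user-space
sketch of the idea. The demo_* helpers imitate the kernel's ERR_PTR(),
PTR_ERR() and IS_ERR() but are written out by hand, and demo_add_snap()
is a hypothetical stand-in for the rbd function.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095	/* top 4095 addresses encode errors */

static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* Hypothetical snapshot object. */
struct demo_snap {
	int id;
};

/* Returns the new snapshot, or a pointer-coded negative errno. */
struct demo_snap *demo_add_snap(int id)
{
	struct demo_snap *snap;

	if (id < 0)
		return demo_err_ptr(-EINVAL);

	snap = malloc(sizeof(*snap));
	if (snap == NULL)
		return demo_err_ptr(-ENOMEM);

	snap->id = id;
	return snap;
}
```

The caller tests the single return value with demo_is_err() and never
has to wonder whether an output pointer was assigned.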

Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>

libceph: recheck con state after allocating incoming message

We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem. If the state does change, -EAGAIN
will get us back to con_work() and loop.
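
The two rules can be sketched with a pthread mutex in user space. All
names are hypothetical (demo_con, demo_in_msg_alloc), malloc() stands in
for the ->alloc_msg() con op, and the real messenger code is more
involved than this.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

enum demo_con_state { DEMO_CON_OPEN, DEMO_CON_CLOSED };

/* Hypothetical connection object. */
struct demo_con {
	pthread_mutex_t mutex;
	enum demo_con_state state;
	void *in_msg;
};

/* Called with con->mutex held; the mutex is held again on return. */
int demo_in_msg_alloc(struct demo_con *con, size_t len)
{
	void *msg;

	pthread_mutex_unlock(&con->mutex);
	msg = malloc(len);		/* stands in for ->alloc_msg() */
	pthread_mutex_lock(&con->mutex);

	/* (b) the state may have changed while the lock was dropped */
	if (con->state != DEMO_CON_OPEN) {
		free(msg);
		return -EAGAIN;		/* caller loops and re-evaluates */
	}
	if (msg == NULL)
		return -ENOMEM;

	/* (a) in_msg is written only while the mutex is held */
	con->in_msg = msg;
	return 0;
}
```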

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: change ceph_con_in_msg_alloc convention to be less weird

This function's calling convention is very limiting. In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).

Instead, return a normal 0 or error code, and make the skip a pointer
output parameter. Drop the useless in_hdr argument (we have the con
pointer).

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: avoid dropping con mutex before fault

The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it. This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).

Add annotation so that lockdep realizes we will drop the mutex before
returning.

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: verify state after retaking con lock after dispatch

We drop the con mutex when delivering a message. When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: revoke mon_client messages on session restart

Revoke all mon_client messages when we shut down the old connection.
This is mostly moot since we are re-using the same ceph_connection,
but it is cleaner.

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: fix handling of immediate socket connect failure

If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

ceph: update MAINTAINERS file

* shiny new inktank.com email addresses
* add include/linux/crush directory (previous oversight)

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: be less chatty about stray replies

There are many (normal) conditions that can lead to us getting
unexpected replies, including cluster topology changes, osd failures,
and timeouts. There's no need to spam the console about it.

Signed-off-by: Sage Weil <[email protected]>
Reviewed-by: Alex Elder <[email protected]>

libceph: clear all flags on con_close

Signed-off-by: Sage Weil <[email protected]>

libceph: clean up con flags

Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.

Signed-off-by: Sage Weil <[email protected]>

libceph: replace connection state bits with states

Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits. All of the con->state checks are
now under the protection of the con mutex, so this is safe. It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.

This appears to hold up well to stress testing both with and without socket
failure injection on the server side.
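
As an illustration of why a single enumerated state simplifies the
checks, here is a sketch with six hypothetical state names. The DEMO_*
identifiers are placeholders chosen for this example; the message above
only says there are six CON_STATE_* values.

```c
/* Six mutually exclusive states; exactly one is current at a time. */
enum demo_state {
	DEMO_STATE_CLOSED = 1,
	DEMO_STATE_PREOPEN,
	DEMO_STATE_CONNECTING,
	DEMO_STATE_NEGOTIATING,
	DEMO_STATE_OPEN,
	DEMO_STATE_STANDBY,
};

/* "Anything other than the expected state" is a single comparison,
 * instead of testing a set of independent bits for each known race. */
int demo_can_read_tag(enum demo_state state)
{
	return state == DEMO_STATE_OPEN;
}
```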

Signed-off-by: Sage Weil <[email protected]>