Yafang Shao [Tue, 14 May 2019 00:19:14 +0000 (17:19 -0700)]
mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin template
There are three tracepoints using this template, which are
mm_vmscan_direct_reclaim_begin,
mm_vmscan_memcg_reclaim_begin,
mm_vmscan_memcg_softlimit_reclaim_begin.
Regarding mm_vmscan_direct_reclaim_begin,
sc.may_writepage is !laptop_mode, that's a static setting, and
reclaim_idx is derived from gfp_mask which is already show in this
tracepoint.
Regarding mm_vmscan_memcg_reclaim_begin,
may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1),
which are both static value.
mm_vmscan_memcg_softlimit_reclaim_begin is the same with
mm_vmscan_memcg_reclaim_begin.
MADV_DONTNEED is handled with mmap_sem taken in read mode. We call
page_mkclean without holding mmap_sem.
MADV_DONTNEED implies that pages in the region are unmapped and subsequent
access to the pages in that range is handled as a new page fault. This
implies that if we don't have parallel access to the region when
MADV_DONTNEED is run we expect those range to be unallocated.
w.r.t page_mkclean() we need to make sure that we don't break the
MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding
pmd_lock. This implies we skip the pmd if we temporarily mark pmd none.
Avoid doing that while marking the page clean.
Keep the sequence same for dax too even though we don't support
MADV_DONTNEED for dax mapping
The bug was noticed by code review and I didn't observe any failures w.r.t
test run. This is similar to
As mentioned in patch 0001, the steps are to fix the problem are:
1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
Overview
========
Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time,
the API to achieve that was get_user_pages ("GUP") and its variations.
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.
GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
code to get the struct page behind a virtual address and to let storage
hardware perform a direct copy to or from that page. This is a
short-lived access pattern, and as such, the window for a concurrent
writeback of GUP'd page was small enough that there were not (we think)
any reported problems. Also, userspace was expected to understand and
accept that Direct IO was not synchronized with memory-mapped access to
that data, nor with any process address space changes such as munmap(),
mremap(), etc.
Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning
makes an underlying design problem more obvious.
In fact, there are a number of key problems inherent to GUP:
Interactions with file systems
==============================
File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:
"The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
Long term GUP
=============
Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can
lead to filesystem corruption. What happens is that when a file-backed
page is being written back, it is first mapped read-only in all of the CPU
page tables; the file system then assumes that nobody can write to the
page, and that the page content is therefore stable. Unfortunately, the
GUP callers generally do not monitor changes to the CPU pages tables; they
instead assume that the following pattern is safe (it's not):
get_user_pages()
Hardware can keep a reference to those pages for a very long time,
and write to it at any time. Because "hardware" here means "devices
that are not a CPU", this activity occurs without any interaction with
the kernel's file system code.
for each page
set_page_dirty
put_page()
In fact, the GUP documentation even recommends that pattern.
Anyway, the file system assumes that the page is stable (nothing is
writing to the page), and that is a problem: stable page content is
necessary for many filesystem actions during writeback, such as checksum,
encryption, RAID striping, etc. Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page for
as memory for that memory range inside the file.
Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a
bounce page to write stable data to the filesystem. The filesystem would
work on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but
note that it is only part of the overall solution, because other problems
remain.
Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be
to require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.
Direct IO
=========
Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls GUP
before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to do anything with
the page contents. However, it's still a real problem. The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages. That way, GUP will
properly synchronize with the active write back. This does not change the
required GUP behavior, it just avoids that race.
Details
=======
Introduces put_user_page(), which simply calls put_page(). This provides
a way to update all get_user_pages*() callers, so that they call
put_user_page(), instead of put_page().
Also introduces put_user_pages(), and a few dirty/locked variations, as a
replacement for release_pages(), and also as a replacement for open-coded
loops that release multiple pages. These may be used for subsequent
performance improvements, via batching of pages to be released.
This is the first step of fixing a problem (also described in [1] and [2])
with interactions between get_user_pages ("gup") and filesystems.
Problem description: let's start with a bug report. Below, is what
happens sometimes, under memory pressure, when a driver pins some pages
via gup, and then marks those pages dirty, and releases them. Note that
the gup documentation actually recommends that pattern. The problem is
that the filesystem may do a writeback while the pages were gup-pinned,
and then the filesystem believes that the pages are clean. So, when the
driver later marks the pages as dirty, that conflicts with the
filesystem's page tracking and results in a BUG(), like this one that I
experienced:
"The fundamental issue is that ->page_mkwrite must be called on
every write access to a clean file backed page, not just the first
one. How long the GUP reference lasts is irrelevant, if the page is
clean and you need to dirty it, you must call ->page_mkwrite before it
is marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
The steps are to fix it are:
1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Alexandre Ghiti [Tue, 14 May 2019 00:19:04 +0000 (17:19 -0700)]
hugetlb: allow to free gigantic pages regardless of the configuration
On systems without CONTIG_ALLOC activated but that support gigantic pages,
boottime reserved gigantic pages can not be freed at all. This patch
simply enables the possibility to hand back those pages to memory
allocator.
Alexandre Ghiti [Tue, 14 May 2019 00:18:53 +0000 (17:18 -0700)]
sh: advertise gigantic page support
Patch series "Fix free/allocation of runtime gigantic pages", v8.
This series fixes sh and sparc that did not advertise their gigantic page
support and then were not able to allocate and free those pages at
runtime. It renames MEMORY_ISOLATION && COMPACTION || CMA condition into
the more accurate CONTIG_ALLOC, since it allows the definition of
alloc_contig_range function.
Finally, it then fixes the wrong definition of ARCH_HAS_GIGANTIC_PAGE
config that, without MEMORY_ISOLATION && COMPACTION || CMA defined, did
not allow architectures to free boottime allocated gigantic pages although
unrelated.
This patch (of 4):
sh actually supports gigantic pages and selecting ARCH_HAS_GIGANTIC_PAGE
allows it to allocate and free gigantic pages at runtime.
At least sdk7786_defconfig exposes such a configuration with huge pages of
64MB, pages of 4KB and MAX_ORDER = 11: HPAGE_SHIFT (26) - PAGE_SHIFT (12)
= 14 >= MAX_ORDER (11)
Mike Rapoport [Tue, 14 May 2019 00:18:46 +0000 (17:18 -0700)]
init: free_initmem: poison freed init memory
Various architectures including x86 poison the freed init memory. Do the
same in the generic free_initmem implementation and switch sparc32
architecture that is identical to the generic code over to it now.
Mike Rapoport [Tue, 14 May 2019 00:18:40 +0000 (17:18 -0700)]
init: provide a generic free_initmem implementation
Patch series "provide a generic free_initmem implementation", v2.
Many architectures implement free_initmem() in exactly the same or very
similar way: they wrap the call to free_initmem_default() with sometimes
different 'poison' parameter.
These patches switch those architectures to use a generic implementation
that does free_initmem_default(POISON_FREE_INITMEM).
This was inspired by Christoph's patches for free_initrd_mem [1] and I
shamelessly copied changelog entries from his patches :)
Various architectures including x86 poison the freed initrd memory. Do
the same in the generic free_initrd_mem implementation and switch a few
more architectures that are identical to the generic code over to it now.
The code for kernels that support ramdisks or not is mostly the same.
Unify it by using an IS_ENABLED for the info message, and moving the error
message into a stub for populate_initrd_image.
initramfs: free initrd memory if opening /initrd.image fails
Patch series "initramfs tidyups".
I've spent some time chasing down behavior in initramfs and found
plenty of opportunity to improve the code. A first stab on that is
contained in this series.
This patch (of 7):
We free the initrd memory for all successful or error cases except for the
case where opening /initrd.image fails, which looks like an oversight.
Steven said:
: This also changes the behaviour when CONFIG_INITRAMFS_FORCE is enabled
: - specifically it means that the initrd is freed (previously it was
: ignored and never freed). But that seems like reasonable behaviour and
: the previous behaviour looks like another oversight.
Yue Hu [Tue, 14 May 2019 00:18:14 +0000 (17:18 -0700)]
mm/cma.c: fix crash on CMA allocation if bitmap allocation fails
f022d8cb7ec7 ("mm: cma: Don't crash on allocation if CMA area can't be
activated") fixes the crash issue when activation fails via setting
cma->count as 0, same logic exists if bitmap allocation fails.
Johannes Weiner [Tue, 14 May 2019 00:18:08 +0000 (17:18 -0700)]
mm: memcontrol: push down mem_cgroup_nr_lru_pages()
mem_cgroup_nr_lru_pages() is just a convenience wrapper around
memcg_page_state() that takes bitmasks of lru indexes and aggregates the
counts for those.
Replace callsites where the bitmask is simple enough with direct
memcg_page_state() call(s).
Johannes Weiner [Tue, 14 May 2019 00:18:05 +0000 (17:18 -0700)]
mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
counts for those.
Replace callsites where the bitmask is simple enough with direct
lruvec_page_state() calls.
This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
make that function private again, too.
Johannes Weiner [Tue, 14 May 2019 00:17:57 +0000 (17:17 -0700)]
mm: memcontrol: track LRU counts in the vmstats array
Patch series "mm: memcontrol: clean up the LRU counts tracking".
The memcg LRU stats usage is currently a bit messy. Memcg has private
per-zone counters because reclaim needs zone granularity sometimes, but we
also have plenty of users that need to awkwardly sum them up to node or
memcg granularity. Meanwhile the canonical per-memcg vmstats do not track
the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
This series enables LRU count tracking in the per-memcg vmstats array such
that lruvec_page_state() and memcg_page_state() work on the enum
node_stat_item items for the LRU counters. Then it converts all the
callers that don't specifically need per-zone numbers over to that.
This patch (of 6):
The memcg code currently maintains private per-zone breakdowns of the LRU
counters. This is necessary for reclaim decisions which are still
zone-based, but there are a variety of users of these counters that only
want the aggregate per-lruvec or per-memcg LRU counts, and they need to
painfully sum up the zone counters on each request for that.
These would be better served using the memcg vmstats arrays, which track
VM statistics at the desired scope already. They just don't have the LRU
counts right now.
So to kick off the conversion, begin tracking LRU counts in those.
Yafang Shao [Tue, 14 May 2019 00:17:53 +0000 (17:17 -0700)]
mm/vmscan: add tracepoints for node reclaim
The page alloc fast path it may perform node reclaim, which may cause a
latency spike. We should add tracepoint for this event, and also measure
the latency it causes.
So bellow two tracepoints are introduced,
mm_vmscan_node_reclaim_begin
mm_vmscan_node_reclaim_end
mm/page_isolation.c: remove redundant pfn_valid_within() in __first_valid_page()
pfn_valid_within() calls pfn_valid() when CONFIG_HOLES_IN_ZONE making it
redundant for both definitions (w/wo CONFIG_MEMORY_HOTPLUG) of the helper
pfn_to_online_page() which either calls pfn_valid() or pfn_valid_within().
pfn_valid_within() being 1 when !CONFIG_HOLES_IN_ZONE is irrelevant
either way. This does not change functionality.
Yafang Shao [Tue, 14 May 2019 00:17:47 +0000 (17:17 -0700)]
mm, compaction: some tracepoints should be defined only when CONFIG_COMPACTION is set
Only mm_compaction_isolate_{free, migrate}pages may be used when
CONFIG_COMPACTION is not set. All others are used only when
CONFIG_COMPACTION is set.
After this change, if CONFIG_COMPACTION is not set, the tracepoints that
only work when CONFIG_COMPACTION is set will not be exposed to userspace.
Without this change, they will always be exposed in debugfs whether
CONFIG_COMPACTION is set or not. This is an improvement.
Yue Hu [Tue, 14 May 2019 00:17:41 +0000 (17:17 -0700)]
mm/cma.c: fix the bitmap status to show failed allocation reason
Currently one bit in cma bitmap represents number of pages rather than
one page, cma->count means cma size in pages. So to find available pages
via find_next_zero_bit()/find_next_bit() we should use cma size not in
pages but in bits although current free pages number is correct due to
zero value of order_per_bit. Once order_per_bit is changed the bitmap
status will be incorrect.
The size input in cma_debug_show_areas() is not correct. It will
affect the available pages at some position to debug the failure issue.
This is an example with order_per_bit = 1
Before this change:
[ 4.120060] cma: number of available pages: 1@93+4@108+7@121+7@137+7@153+7@169+7@185+7@201+3@213+3@221+3@229+3@237+3@245+3@253+3@261+3@269+3@277+3@285+3@293+3@301+3@309+3@317+3@325+19@333+15@369+512@512=> 638 free of 1024 total pages
After this change:
[ 4.143234] cma: number of available pages: 2@93+8@108+14@121+14@137+14@153+14@169+14@185+14@201+6@213+6@221+6@229+6@237+6@245+6@253+6@261+6@269+6@277+6@285+6@293+6@301+6@309+6@317+6@325+38@333+30@369=> 252 free of 1024 total pages
Qian Cai [Tue, 14 May 2019 00:17:38 +0000 (17:17 -0700)]
mm/compaction.c: fix an undefined behaviour
In a low-memory situation, cc->fast_search_fail can keep increasing as it
is unable to find an available page to isolate in
fast_isolate_freepages(). As the result, it could trigger an error below,
so just compare with the maximum bits can be shifted first.
UBSAN: Undefined behaviour in mm/compaction.c:1160:30
shift exponent 64 is too large for 64-bit type 'unsigned long'
CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G
W L 5.0.0+ #17
Call trace:
dump_backtrace+0x0/0x450
show_stack+0x20/0x2c
dump_stack+0xc8/0x14c
__ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
compaction_alloc+0x2344/0x2484
unmap_and_move+0xdc/0x1dbc
migrate_pages+0x274/0x1310
compact_zone+0x26ec/0x43bc
kcompactd+0x15b8/0x1a24
kthread+0x374/0x390
ret_from_fork+0x10/0x18
Baoquan He [Tue, 14 May 2019 00:17:35 +0000 (17:17 -0700)]
mm/memory_hotplug.c: fix the wrong usage of N_HIGH_MEMORY
In node_states_check_changes_online(), N_HIGH_MEMORY is used to substitute
ZONE_HIGHMEM directly. This is not right. N_HIGH_MEMORY is to mark the
memory state of node. Here zone index is checked, which should be
compared with 'ZONE_HIGHMEM' accordingly.
Replace it with ZONE_HIGHMEM.
This is a code cleanup - no known runtime effects.
Oscar Salvador [Tue, 14 May 2019 00:17:32 +0000 (17:17 -0700)]
mm,memory_hotplug: drop redundant hugepage_migration_supported check
has_unmovable_pages() already checks whether the hugetlb page supports
migration, so all non-migratable hugetlb pages should have been caught
there. Let us drop the check from scan_movable_pages() as is redundant.
Oscar Salvador [Tue, 14 May 2019 00:17:29 +0000 (17:17 -0700)]
mm,memory_hotplug: unlock 1GB-hugetlb on x86_64
On x86_64, 1GB-hugetlb pages could never be offlined due to the fact
that hugepage_migration_supported() returned false for PUD_SHIFT.
So whenever we wanted to offline a memblock containing a gigantic
hugetlb page, we never got beyond has_unmovable_pages() check.
This changed with [1], where now we also return true for PUD_SHIFT.
After that patch, the check in has_unmovable_pages() and scan_movable_pages()
returned true, but we still had a final barrier in do_migrate_range():
if (compound_order(head) > PFN_SECTION_SHIFT) {
ret = -EBUSY;
break;
}
This is not really nice, and we do not really need it.
It is perfectly possible to migrate a gigantic page as long as another node has
a spare gigantic page for us.
In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages,
and if any, we try to dequeue one from another node.
This all works fine when we do have another node with a spare gigantic page,
but if that is not the case, alloc_huge_page_nodemask() ends up calling
alloc_migrate_huge_page() which bails out if the wanted page is gigantic.
That is mainly because finding a 1GB (or even 16GB on powerpc) contiguous
memory is quite unlikely when the system has been running for a while.
In that situation, we will keep looping forever because scan_movable_pages()
will give us the same page and we will fail again because there is no node
where we can dequeue a gigantic page from.
This is not nice, and it has been raised that we might want to treat -ENOMEM
as a fatal error in do_migrate_range(), but this has to be checked further.
Anyway, I would tend say that this is the administrator's job, to make sure
that the system can keep up with the memory to be offlined, so that would mean
that if we want to use gigantic pages, make sure that the other nodes have at
least enough gigantic pages to keep up in case we need to offline memory.
Just for the sake of completeness, this is one of the tests done:
Ira Weiny [Tue, 14 May 2019 00:17:11 +0000 (17:17 -0700)]
mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.
This patch does not change any functionality. New functionality will
follow in subsequent patches.
Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.
NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted. This breaks the current
GUP call site convention of having the returned pages be the final
parameter. So the suggestion was rejected.
Ira Weiny [Tue, 14 May 2019 00:17:03 +0000 (17:17 -0700)]
mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".
HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.
Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]
In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.
[1] https://lkml.org/lkml/2019/3/19/939
"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.
Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
This patch (of 7):
This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.
Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.
This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".
FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.
NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.
As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.
In review[1] it was asked:
<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?
A couple of points.
First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.
Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
Kirill Tkhai [Tue, 14 May 2019 00:17:00 +0000 (17:17 -0700)]
mm: generalize putback scan functions
This combines two similar functions move_active_pages_to_lru() and
putback_inactive_pages() into single move_pages_to_lru(). This remove
duplicate code and makes object file size smaller.
Before:
text data bss dec hex filename
57082 4732 128 61942 f1f6 mm/vmscan.o
After:
text data bss dec hex filename
55112 4600 128 59840 e9c0 mm/vmscan.o
Note, that now we are checking for !page_evictable() coming from
shrink_active_list(), which shouldn't change any behavior since that path
works with evictable pages only.
Kirill Tkhai [Tue, 14 May 2019 00:16:51 +0000 (17:16 -0700)]
mm: move recent_rotated pages calculation to shrink_inactive_list()
Patch series "mm: Generalize putback functions"]
putback_inactive_pages() and move_active_pages_to_lru() are almost
similar, so this patchset merges them ina single function.
This patch (of 4):
The patch moves the calculation from putback_inactive_pages() to
shrink_inactive_list(). This makes putback_inactive_pages() looking more
similar to move_active_pages_to_lru().
To do that, we account activated pages in reclaim_stat::nr_activate.
Since a page may change its LRU type from anon to file cache inside
shrink_page_list() (see ClearPageSwapBacked()), we have to account pages
for the both types. So, nr_activate becomes an array.
Previously we used nr_activate to account PGACTIVATE events, but now we
account them into pgactivate variable (since they are about number of
pages in general, not about sum of hpage_nr_pages).
Vlastimil Babka [Tue, 14 May 2019 00:16:47 +0000 (17:16 -0700)]
mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact()
alloc_pages_exact*() allocates a page of sufficient order and then splits
it to return only the number of pages requested. That makes it
incompatible with __GFP_COMP, because compound pages cannot be split.
As shown by [1] things may silently work until the requested size
(possibly depending on user) stops being power of two. Then for
CONFIG_DEBUG_VM, BUG_ON() triggers in split_page(). Without
CONFIG_DEBUG_VM, consequences are unclear.
There are several options here, none of them great:
1) Don't do the splitting when __GFP_COMP is passed, and return the
whole compound page. However if caller then returns it via
free_pages_exact(), that will be unexpected and the freeing actions
there will be wrong.
2) Warn and remove __GFP_COMP from the flags. But the caller may have
really wanted it, so things may break later somewhere.
3) Warn and return NULL. However NULL may be unexpected, especially
for small sizes.
This patch picks option 2, because as Michal Hocko put it: "callers wanted
it" is much less probable than "caller is simply confused and more gfp
flags is surely better than fewer".
Matthew Wilcox [Tue, 14 May 2019 00:16:44 +0000 (17:16 -0700)]
mm: page cache: store only head pages in i_pages
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
Yue Hu [Tue, 14 May 2019 00:16:37 +0000 (17:16 -0700)]
mm/cma_debug.c: fix the break condition in cma_maxchunk_get()
If not find zero bit in find_next_zero_bit(), it will return the size
parameter passed in, so the start bit should be compared with bitmap_maxno
rather than cma->count. Although getting maxchunk is working fine due to
zero value of order_per_bit currently, the operation will be stuck if
order_per_bit is set as non-zero.
Yafang Shao [Tue, 14 May 2019 00:16:34 +0000 (17:16 -0700)]
include/trace/events/vmscan.h: drop zone id from kswapd tracepoints
It is not clear how the zone id is useful in kswapd tracepoints and the id
itself is not really easy to process because it depends on the
configuration (available zones). Let's drop the id for now. If somebody
really needs that information then the zone name should be used instead.
Qian Cai [Tue, 14 May 2019 00:16:31 +0000 (17:16 -0700)]
mm/slab.c: fix an infinite loop in leaks_show()
"cat /proc/slab_allocators" could hang forever on SMP machines with
kmemleak or object debugging enabled due to other CPUs running do_drain()
will keep making kmemleak_object or debug_objects_cache dirty and unable
to escape the first loop in leaks_show(),
do {
set_store_user_clean(cachep);
drain_cpu_caches(cachep);
...
One approach is to check cachep->name and skip both kmemleak_object and
debug_objects_cache in leaks_show(). The other is to set store_user_clean
after drain_cpu_caches() which leaves a small window between
drain_cpu_caches() and set_store_user_clean() where per-CPU caches could
be dirty again lead to slightly wrong information has been stored but
could also speed up things significantly which sounds like a good
compromise. For example,
Liu Xiang [Tue, 14 May 2019 00:16:22 +0000 (17:16 -0700)]
slub: remove useless kmem_cache_debug() before remove_full()
When CONFIG_SLUB_DEBUG is not enabled, remove_full() is empty.
While CONFIG_SLUB_DEBUG is enabled, remove_full() can check
s->flags by itself. So kmem_cache_debug() is useless and
can be removed.
Tobin C. Harding [Tue, 14 May 2019 00:16:15 +0000 (17:16 -0700)]
slab: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading the
lru list.
Use the slab_list instead of the lru list for maintaining lists of slabs.
Tobin C. Harding [Tue, 14 May 2019 00:16:12 +0000 (17:16 -0700)]
slub: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading the
lru list.
Use the slab_list instead of the lru list for maintaining lists of slabs.
Tobin C. Harding [Tue, 14 May 2019 00:16:09 +0000 (17:16 -0700)]
slub: add comments to endif pre-processor macros
SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The
pairing of these statements is at times hard to follow e.g. if the pair
are further than a screen apart or if there are nested pairs. We can
reduce cognitive load by adding a comment to the endif statement of form
#ifdef CONFIG_FOO
...
#endif /* CONFIG_FOO */
Add comments to endif pre-processor macros if ifdef/endif pair is not
immediately apparent.
Tobin C. Harding [Tue, 14 May 2019 00:16:06 +0000 (17:16 -0700)]
slob: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list_head in the page structure (slab_list) that can be used for
this purpose. Doing so makes the code cleaner since we are not
overloading the lru list.
The slab_list is part of a union within the page struct (included here
stripped down):
union {
struct { /* Page cache and anonymous pages */
struct list_head lru;
...
};
struct {
dma_addr_t dma_addr;
};
struct { /* slab, slob and slub */
union {
struct list_head slab_list;
struct { /* Partial pages */
struct page *next;
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
};
};
...
Here we see that slab_list and lru are the same bits. We can verify that
this change is safe to do by examining the object file produced from
slob.c before and after this patch is applied.
Steps taken to verify:
1. checkout current tip of Linus' tree
commit a667cb7a94d4 ("Merge branch 'akpm' (patches from Andrew)")
Tobin C. Harding [Tue, 14 May 2019 00:16:03 +0000 (17:16 -0700)]
slob: respect list_head abstraction layer
Currently we reach inside the list_head. This is a violation of the layer
of abstraction provided by the list_head. It makes the code fragile.
More importantly it makes the code wicked hard to understand.
The code reaches into the list_head structure to counteract the fact that
the list _may_ have been changed during slob_page_alloc(). Instead of
this we can add a return parameter to slob_page_alloc() to signal that the
list was modified (list_del() called with page->lru to remove page from
the freelist).
This code is concerned with an optimisation that counters the tendency for
first fit allocation algorithm to fragment memory into many small chunks
at the front of the memory pool. Since the page is only removed from the
list when an allocation uses _all_ the remaining memory in the page then
in this special case fragmentation does not occur and we therefore do not
need the optimisation.
Add a return parameter to slob_page_alloc() to signal that the allocation
used up the whole page and that the page was removed from the free list.
After calling slob_page_alloc() check the return value just added and only
attempt optimisation if the page is still on the list.
Use list_head API instead of reaching into the list_head structure to
check if sp is at the front of the list.
Tobin C. Harding [Tue, 14 May 2019 00:15:59 +0000 (17:15 -0700)]
list: add function list_rotate_to_front()
Patch series "mm: Use slab_list list_head instead of lru", v5.
Currently the slab allocators (ab)use the struct page 'lru' list_head. We
have a list head for slab allocators to use, 'slab_list'.
During v2 it was noted by Christoph that the SLOB allocator was reaching
into a list_head, this version adds 2 patches to the front of the set to
fix that.
Clean up all three allocators by using the 'slab_list' list_head instead
of overloading the 'lru' list_head.
This patch (of 7):
Currently if we wish to rotate a list until a specific item is at the
front of the list we can call list_move_tail(head, list). Note that the
arguments are the reverse way to the usual use of list_move_tail(list,
head). This is a hack, it depends on the developer knowing how the
list_head operates internally which violates the layer of abstraction
offered by the list_head. Also, it is not intuitive so the next developer
to come along must study list.h in order to fully understand what is meant
by the call, while this is 'good for' the developer it makes reading the
code harder. We should have an function appropriately named that does
this if there are users for it intree.
By grep'ing the tree for list_move_tail() and list_tail() and attempting
to guess the argument order from the names it seems there is only one
place currently in the tree that does this - the slob allocatator.
Add function list_rotate_to_front() to rotate a list until the specified
item is at the front of the list.
Shuning Zhang [Tue, 14 May 2019 00:15:56 +0000 (17:15 -0700)]
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
In some cases, ocfs2_iget() reads the data of inode, which has been
deleted for some reason. That will make the system panic. So We should
judge whether this inode has been deleted, and tell the caller that the
inode is a bad inode.
For example, the ocfs2 is used as the backed of nfs, and the client is
nfsv3. This issue can be reproduced by the following steps.
on the nfs server side,
..../patha/pathb
Step 1: The process A was scheduled before calling the function fh_verify.
Step 2: The process B is removing the 'pathb', and just completed the call
to function dput. Then the dentry of 'pathb' has been deleted from the
dcache, and all ancestors have been deleted also. The relationship of
dentry and inode was deleted through the function hlist_del_init. The
following is the call stack.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
At this time, the inode is still in the dcache.
Step 3: The process A call the function ocfs2_get_dentry, which get the
inode from dcache. Then the refcount of inode is 1. The following is the
call stack.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
Step 4: Dirty pages are flushed by bdi threads. So the inode of 'patha'
is evicted, and this directory was deleted. But the inode of 'pathb'
can't be evicted, because the refcount of the inode was 1.
Step 5: The process A keep running, and call the function
reconnect_path(in exportfs_decode_fh), which call function
ocfs2_get_parent of ocfs2. Get the block number of parent
directory(patha) by the name of ... Then read the data from disk by the
block number. But this inode has been deleted, so the system panic.
Process A Process B
1. in nfsd3_proc_getacl |
2. | dput
3. fh_to_dentry(ocfs2_get_dentry) |
4. bdi flush dirty cache |
5. ocfs2_iget |
[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set
[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error
Phillip Potter [Tue, 14 May 2019 00:15:53 +0000 (17:15 -0700)]
ocfs2: use common file type conversion
Deduplicate the ocfs2 file type conversion implementation and remove
OCFS2_FT_* definitions - file systems that use the same file types as
defined by POSIX do not need to define their own versions and can use the
common helper functions decared in fs_types.h and implemented in
fs_types.c
Common implementation can be found via bbe7449e2599 ("fs: common
implementation of file type").
Joseph Qi [Tue, 14 May 2019 00:15:49 +0000 (17:15 -0700)]
MAINTAINERS: add Joseph as ocfs2 co-maintainer
I have been contributing and reviewing to the ocfs2 filesystem for recent
years and I'm willing to continue doing so. Volunteer as a co-maintainer
for ocfs2 filesystem.
Cyrill Gorcunov [Tue, 14 May 2019 00:15:40 +0000 (17:15 -0700)]
kernel/sys.c: prctl: fix false positive in validate_prctl_map()
While validating new map we require the @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for that long). These members are set from executable
loaders such as elf handers, still it is pretty valid to have a loadable
data section with zero size in file, in such case the start_data is equal
to end_data once kernel loader finishes.
As a result when we're trying to restore such programs the procedure fails
and the kernel returns -EINVAL. From the image dump of a program:
The dtor returned by get_compound_page_dtor in __put_compound_page may be
the function of free_huge_page which will lock the hugetlb_lock, so don't
put_page in lock of hugetlb_lock.
Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
pmdp_set_access_flags(). That helper enforces a pmd aligned @address
argument via VM_BUG_ON() assertion.
Update the implementation to take a 'struct vm_fault' argument directly
and apply the address alignment fixup internally to fix crash signatures
like:
Linus Torvalds [Tue, 14 May 2019 16:02:14 +0000 (09:02 -0700)]
Merge tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
"Just bug fixes in this small update"
* tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: relax WARN_ON() for overlapping layers use case
ovl: check the capability before cred overridden
ovl: do not generate duplicate fsnotify events for "fake" path
ovl: support stacked SEEK_HOLE/SEEK_DATA
ovl: fix missing upper fs freeze protection on copy up for ioctl
Linus Torvalds [Tue, 14 May 2019 15:59:14 +0000 (08:59 -0700)]
Merge tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Add more caching controls for userspace filesystems to use, as well as
bug fixes and cleanups"
* tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up fuse_alloc_inode
fuse: Add ioctl flag for x32 compat ioctl
fuse: Convert fusectl to use the new mount API
fuse: fix changelog entry for protocol 7.9
fuse: fix changelog entry for protocol 7.12
fuse: document fuse_fsync_in.fsync_flags
fuse: Add FOPEN_STREAM to use stream_open()
fuse: require /dev/fuse reads to have enough buffer capacity
fuse: retrieve: cap requested size to negotiated max_write
fuse: allow filesystems to have precise control over data cache
fuse: convert printk -> pr_*
fuse: honor RLIMIT_FSIZE in fuse_file_fallocate
fuse: fix writepages on 32bit
Linus Torvalds [Tue, 14 May 2019 15:55:43 +0000 (08:55 -0700)]
Merge tag 'f2fs-for-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"Another round of various bug fixes came in. Damien improved SMR drive
support a bit, and Chao replaced BUG_ON() with reporting errors to
user since we've not hit from users but did hit from crafted images.
We've found a disk layout bug in large_nat_bits feature which supports
very large NAT entries enabled at mkfs. If the feature is enabled, it
will give a notice to run fsck to correct the on-disk layout.
Enhancements:
- reduce memory consumption for SMR drive
- better discard handling for multiple partitions
- tracepoints for f2fs_file_write_iter/f2fs_filemap_fault
- allow to change CP_CHKSUM_OFFSET
- detect wrong layout of large_nat_bitmap feature
- enhance checking valid data indices
Bug fixes:
- Multiple partition support for SMR drive
- deadlock problem in f2fs_balance_fs_bg
- add boundary checks to fix abnormal behaviors on fuzzed images
- inline_xattr space calculations
- replace f2fs_bug_on with errors
In addition, this series contains various memory boundary check and
sanity check of on-disk consistency"
* tag 'f2fs-for-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
f2fs: fix to avoid accessing xattr across the boundary
f2fs: fix to avoid potential race on sbi->unusable_block_count access/update
f2fs: add tracepoint for f2fs_filemap_fault()
f2fs: introduce DATA_GENERIC_ENHANCE
f2fs: fix to handle error in f2fs_disable_checkpoint()
f2fs: remove redundant check in f2fs_file_write_iter()
f2fs: fix to be aware of readonly device in write_checkpoint()
f2fs: fix to skip recovery on readonly device
f2fs: fix to consider multiple device for readonly check
f2fs: relocate chksum_offset for large_nat_bitmap feature
f2fs: allow unfixed f2fs_checkpoint.checksum_offset
f2fs: Replace spaces with tab
f2fs: insert space before the open parenthesis '('
f2fs: allow address pointer number of dnode aligning to specified size
f2fs: introduce f2fs_read_single_page() for cleanup
f2fs: mark is_extension_exist() inline
f2fs: fix to set FI_UPDATE_WRITE correctly
f2fs: fix to avoid panic in f2fs_inplace_write_data()
f2fs: fix to do sanity check on valid block count of segment
f2fs: fix to do sanity check on valid node/block count
...
Linus Torvalds [Tue, 14 May 2019 14:57:29 +0000 (07:57 -0700)]
Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MDS mitigations from Thomas Gleixner:
"Microarchitectural Data Sampling (MDS) is a hardware vulnerability
which allows unprivileged speculative access to data which is
available in various CPU internal buffers. This new set of misfeatures
has the following CVEs assigned:
CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory
MDS attacks target microarchitectural buffers which speculatively
forward data under certain conditions. Disclosure gadgets can expose
this data via cache side channels.
Contrary to other speculation based vulnerabilities the MDS
vulnerability does not allow the attacker to control the memory target
address. As a consequence the attacks are purely sampling based, but
as demonstrated with the TLBleed attack samples can be postprocessed
successfully.
The mitigation is to flush the microarchitectural buffers on return to
user space and before entering a VM. It's bolted on the VERW
instruction and requires a microcode update. As some of the attacks
exploit data structures shared between hyperthreads, full protection
requires to disable hyperthreading. The kernel does not do that by
default to avoid breaking unattended updates.
The mitigation set comes with documentation for administrators and a
deeper technical view"
* 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/speculation/mds: Fix documentation typo
Documentation: Correct the possible MDS sysfs values
x86/mds: Add MDSUM variant to the MDS documentation
x86/speculation/mds: Add 'mitigations=' support for MDS
x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
x86/speculation/mds: Fix comment
x86/speculation/mds: Add SMT warning message
x86/speculation: Move arch_smt_update() call to after mitigation decisions
x86/speculation/mds: Add mds=full,nosmt cmdline option
Documentation: Add MDS vulnerability documentation
Documentation: Move L1TF to separate directory
x86/speculation/mds: Add mitigation mode VMWERV
x86/speculation/mds: Add sysfs reporting for MDS
x86/speculation/mds: Add mitigation control for MDS
x86/speculation/mds: Conditionally clear CPU buffers on idle entry
x86/kvm/vmx: Add MDS protection when L1D Flush is not active
x86/speculation/mds: Clear CPU buffers on exit to user
x86/speculation/mds: Add mds_clear_cpu_buffers()
x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
x86/speculation/mds: Add BUG_MSBDS_ONLY
...
Brian Masney [Wed, 24 Apr 2019 09:25:05 +0000 (05:25 -0400)]
backlight: lm3630a: Add firmware node support
Add fwnode support to the lm3630a driver and optionally allow
configuring the label, default brightness level, and maximum brightness
level. The two outputs can be controlled by bank A and B independently
or bank A can control both outputs.
If the platform data was not configured, then the driver defaults to
enabling both banks. This patch changes the default value to disable
both banks before parsing the firmware node so that just a single bank
can be enabled if desired. There are no in-tree users of this driver.
Driver was tested on a LG Nexus 5 (hammerhead) phone.
Brian Masney [Wed, 24 Apr 2019 09:25:03 +0000 (05:25 -0400)]
backlight: lm3630a: Return 0 on success in update_status functions
lm3630a_bank_a_update_status() and lm3630a_bank_b_update_status()
both return the brightness value if the brightness was successfully
updated. Writing to these attributes via sysfs would cause a 'Bad
address' error to be returned. These functions should return 0 on
success, so let's change it to correct that error.
Support Touchpad MCU as a special of CrOS EC devices. The current
Touchpad MCU is used on Eve Chromebook and used the same protocol as
other CrOS EC devices.
When a MCU has touchpad support (aka EC_FEATURE_TOUCHPAD), it is
instantiated as a special CrOS EC device with device name 'cros_tp'. So
regardless of the probing order between the actual cros_ec and cros_tp,
the userspace and other kernel drivers should not confuse them.
Support Fingerprint MCU as a special of CrOS EC devices. The current FP
MCU uses the same EC SPI protocol v3 as other CrOS EC devices on a SPI
bus.
When a MCU has fingerprint support (aka EC_FEATURE_FINGERPRINT), it is
instantiated as a special CrOS EC device with device name 'cros_fp'. So
regardless of the probing order between the actual cros_ec and cros_fp,
the userspace and other kernel drivers should not confuse them.
Update the feature enum for the Chromebook Embedded Controller to the
latest version. Some of these enums are still not used in the kernel but
we might be also interested on have these enums up to date. Userspace
can use them to query the features to the EC via the cros-ec character
device.
While here, also fix a typo in one comment in the enum.
Charles Keepax [Wed, 1 May 2019 10:23:50 +0000 (11:23 +0100)]
mfd: lochnagar: Add links to binding docs for sound and hwmon
Lochnagar is an evaluation and development board for Cirrus
Logic Smart CODEC and Amp devices. It allows the connection of
most Cirrus Logic devices on mini-cards, as well as allowing
connection of various application processor systems to provide a
full evaluation platform. This driver supports the board
controller chip on the Lochnagar board.
Add links to the binding documents for the new sound and hardware
monitor parts of the driver.
Since there are more IOT2040 variants with identical hardware but
different asset tags, the asset tag matching should be adjusted to
support them.
For the board name "SIMATIC IOT2000", currently there are 2 types of
hardware, IOT2020 and IOT2040. Both are identical regarding the
intel_quark_i2c_gpio. In the future there will be no other devices with
the "SIMATIC IOT2000" DMI board name but different hardware. So remove
the asset tag matching from this driver.
Steve Twiss [Fri, 26 Apr 2019 13:33:35 +0000 (14:33 +0100)]
mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L
Mismatch between what is found in the Datasheets for DA9063 and DA9063L
provided by Dialog Semiconductor, and the register names provided in the
MFD registers file. The changes are for the OTP (one-time-programming)
control registers. The two naming errors are OPT instead of OTP, and
COUNT instead of CONT (i.e. control).
mfd: axp20x: Add USB power supply mfd cell to AXP803
The AXP803 has a VBUS power input. Its functionality is the same as the
one found in the AXP813. Now that the axp20x_usb_power driver supports
this variant, we can add an mfd cell for it to use it.
mfd: sun6i-prcm: Fix build warning for non-OF configurations
When CONFIG_OF is disabled, we get a harmless warning about an
unused variable:
drivers/mfd/sun6i-prcm.c: In function 'sun6i_prcm_probe':
drivers/mfd/sun6i-prcm.c:151:22: error: unused variable 'np' [-Werror=unused-variable]
Remove the variable and open-code the value in the only place
it is used, so it can get left out as well without CONFIG_OF.
Fixes: a05a2e7998ab ("mfd: sun6i-prcm: Allow to compile with COMPILE_TEST") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Maxime Ripard <[email protected]> Signed-off-by: Lee Jones <[email protected]>
Evan Green [Wed, 3 Apr 2019 21:34:28 +0000 (14:34 -0700)]
platform/chrome: Add support for v1 of host sleep event
Add support in code for the new forms of the host sleep event.
Detects the presence of this version of the command at runtime,
and use whichever form the EC supports. At this time, always
request the default timeout, and only report the failing response
via a WARN_ONCE(). Future versions could accept the sleep parameter
from outside the driver, and return the response information to
usermode or elsewhere.
Evan Green [Wed, 3 Apr 2019 21:34:27 +0000 (14:34 -0700)]
mfd: cros_ec: Add host_sleep_event_v1 command
Introduce the command and response structures for the second revision
of the host sleep event. These structures are part of a new EC change
that enables detection of failure to enter S0ix. The EC waits a
kernel-specified timeout (or a default amount of time) for the S0_SLP
pin to change, and wakes the system if that change does not occur in
time.
mfd: cros_ec: Instantiate the CrOS USB PD logger driver
Add the cros-usbpd-logger driver for logging event data for the USB PD
charger available in the Embedded Controller on ChromeOS systems. The
logging feature is logically separate functionality from charge manager,
hence is instantiated as a different driver.
Maxime Ripard [Mon, 18 Mar 2019 12:55:48 +0000 (13:55 +0100)]
mfd: axp20x: Allow the AXP223 to be probed by I2C
The AXP223 can be used both using the RSB proprietary bus, or a more
traditional I2C bus. The RSB is a faster bus and provides more features
(like some integrity checks on the messages), so it's usually preferrable
to use it, but since it's proprietary, when we want to use the PMIC in a
multi-master setup, the i2c might make sense as well.