Catalin Marinas [Thu, 1 Feb 2018 00:17:55 +0000 (16:17 -0800)]
arm64: provide pmdp_establish() helper
We need an atomic way to setup pmd page table entry, avoiding races with
CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.
Patch series "Do not lose dirty bit on THP pages", v4.
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if CPU sets them after pmdp dereference, but
before set_pmd_at().
The bug can lead to data loss, but the race window is tiny and I haven't
seen any reports that suggested that it happens in reality. So I don't
think it worth sending it to stable.
Unfortunately, there's no way to address the issue in a generic way. We
need to fix all architectures that support THP one-by-one.
All architectures that have THP supported have to provide atomic
pmdp_invalidate() that returns previous value.
If generic implementation of pmdp_invalidate() is used, architecture
needs to provide atomic pmdp_estabish().
pmdp_estabish() is not used out-side generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
This patch (of 12):
This is an implementation of pmdp_establish() that is only suitable for
an architecture that doesn't have hardware dirty/accessed bits. In this
case we can't race with CPU which sets these bits and non-atomic
approach is fine.
Matthew Wilcox [Thu, 1 Feb 2018 00:17:40 +0000 (16:17 -0800)]
mm: get 7% more pages in a pagevec
We don't have to use an entire 'long' for the number of elements in the
pagevec; we know it's a number between 0 and 14 (now 15). So we can
store it in a char, and then the bool packs next to it and we still have
two or six bytes of padding for more elements in the header. That gives
us space to cram in an extra page.
Matthew Wilcox [Thu, 1 Feb 2018 00:17:36 +0000 (16:17 -0800)]
mm: add unmap_mapping_pages()
Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes. Unfortuately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that. That's a sign of a bad interface.
Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.
Huang Ying [Thu, 1 Feb 2018 00:17:32 +0000 (16:17 -0800)]
mm, userfaultfd, THP: avoid waiting when PMD under THP migration
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,
do_page_fault()
__do_huge_pmd_anonymous_page()
handle_userfault()
userfault_msg()
/* a huge page is allocated and mapped at fault address */
/* the huge page is under migration, leaves migration entry
in page table */
userfaultfd_must_wait()
/* return true because !pmd_present() */
/* may wait in loop until fatal signal */
That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present(). In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.
This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().
This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked. mremap() will
write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support to copy THP
mapping. But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.
Oscar Salvador [Thu, 1 Feb 2018 00:17:25 +0000 (16:17 -0800)]
mm: memory_hotplug: remove second __nr_to_section in register_page_bootmem_info_section()
In register_page_bootmem_info_section() we call __nr_to_section() in
order to get the mem_section struct at the beginning of the function.
Since we already got it, there is no need for a second call to
__nr_to_section().
VmExe and VmLib documented as text segment and shared library code size.
Now their sum will be always equal to mm->exec_vm which sums size of
executable and not writable and not stack areas.
I've seen this for huge (>2Gb) statically linked binary which has whole
world inside. For it start_code .. end_code range also covers one of
rodata sections. Probably this is bug in customized linker, elf loader
or both.
Anyway CONFIG_CHECKPOINT_RESTORE allows to change these pointers, thus
we cannot trust them without validation.
Oscar Salvador [Thu, 1 Feb 2018 00:17:14 +0000 (16:17 -0800)]
mm/memory_hotplug.c: remove unnecesary check from register_page_bootmem_info_section()
When we call register_page_bootmem_info_section() having
CONFIG_SPARSEMEM_VMEMMAP enabled, we check if the pfn is valid.
This check is redundant as we already checked this in
register_page_bootmem_info_node() before calling
register_page_bootmem_info_section(), so let's get rid of it.
Michal Hocko [Thu, 1 Feb 2018 00:17:10 +0000 (16:17 -0800)]
mm, hugetlb: remove hugepages_treat_as_movable sysctl
hugepages_treat_as_movable has been introduced by 396faf0303d2 ("Allow
huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
allocations from ZONE_MOVABLE even when hugetlb pages were not
migrateable. The purpose of the movable zone was different at the time.
It aimed at reducing memory fragmentation and hugetlb pages being long
lived and large werre not contributing to the fragmentation so it was
acceptable to use the zone back then.
Things have changed though and the primary purpose of the zone became
migratability guarantee. If we allow non migrateable hugetlb pages to
be in ZONE_MOVABLE memory hotplug might fail to offline the memory.
Remove the knob and only rely on hugepage_migration_supported to allow
movable zones.
Mel said:
: Primarily it was aimed at allowing the hugetlb pool to safely shrink with
: the ability to grow it again. The use case was for batched jobs, some of
: which needed huge pages and others that did not but didn't want the memory
: useless pinned in the huge pages pool.
:
: I suspect that more users rely on THP than hugetlbfs for flexible use of
: huge pages with fallback options so I think that removing the option
: should be ok.
selftests/vm: move 128TB mmap boundary test to generic directory
Architectures like PPC64 support mmap hint address based large address
space selection. This test can be run on those architectures too. Move
the test from the x86 selftests to selftest/vm so that other
architectures can use it too.
We also add a few new test scenarios in this patch. We do test a few
boundary conditions before we do a high address mmap. PPC64 uses the
address limit to validate the address in the fault path. We had bugs in
this area w.r.t SLB fault handling before we updated the addess limit.
We also touch the allocated space to make sure we don't have any bugs in
the fault handling path.
Minchan Kim [Thu, 1 Feb 2018 00:16:55 +0000 (16:16 -0800)]
mm: do not stall register_shrinker()
Shakeel Butt reported he has observed in production systems that the job
loader gets stuck for 10s of seconds while doing a mount operation. It
turns out that it was stuck in register_shrinker() because some
unrelated job was under memory pressure and was spending time in
shrink_slab(). Machines have a lot of shrinkers registered and jobs
under memory pressure have to traverse all of those memcg-aware
shrinkers and affect unrelated jobs which want to register their own
shrinkers.
To solve the issue, this patch simply bails out slab shrinking if it is
found that someone wants to register a shrinker in parallel. A downside
is it could cause unfair shrinking between shrinkers. However, it
should be rare and we can add compilcated logic if we find it's not
enough.
Jiankang Chen [Thu, 1 Feb 2018 00:16:52 +0000 (16:16 -0800)]
mm/page_alloc.c: fix comment in __get_free_pages()
__get_free_pages() will return a virtual address, but it is not just a
32-bit address, for example on a 64-bit system. And this comment really
confuses new readers of mm.
Johannes Weiner [Thu, 1 Feb 2018 00:16:45 +0000 (16:16 -0800)]
mm: memcontrol: fix excessive complexity in memory.stat reporting
We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.
Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy. The complexity is this:
nr_cgroups * nr_stat_items * nr_possible_cpus
where the stat items are ~70 at this point. With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters. This doesn't scale.
Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.
This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.
Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error. That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is
true for the memory controller.
Use the same static batch size we already use for page_counter updates
during charging. The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.
Johannes Weiner [Thu, 1 Feb 2018 00:16:41 +0000 (16:16 -0800)]
mm: memcontrol: implement lruvec stat functions on top of each other
The implementation of the lruvec stat functions and their variants for
accounting through a page, or accounting from a preemptible context, are
mostly identical and needlessly repetitive.
Implement the lruvec_page functions by looking up the page's lruvec and
then using the lruvec function.
Implement the functions for preemptible contexts by disabling preemption
before calling the atomic context functions.
Yang Shi [Thu, 1 Feb 2018 00:16:34 +0000 (16:16 -0800)]
mm/filemap.c: remove include of hardirq.h
in_atomic() has been moved to include/linux/preempt.h, and the filemap.c
doesn't use in_atomic() directly at all, so it sounds unnecessary to
include hardirq.h.
Pavel Tatashin [Thu, 1 Feb 2018 00:16:30 +0000 (16:16 -0800)]
mm: split deferred_init_range into initializing and freeing parts
In deferred_init_range() we initialize struct pages, and also free them
to buddy allocator. We do it in separate loops, because buddy page is
computed ahead, so we do not want to access a struct page that has not
been initialized yet.
There is still, however, a corner case where it is potentially possible
to access uninitialized struct page: this is when buddy page is from the
next memblock range.
This patch fixes this problem by splitting deferred_init_range() into
two functions: one to initialize struct pages, and another to free them.
In addition, this patch brings the following improvements:
- Get rid of __def_free() helper function. And simplifies loop logic by
adding a new pfn validity check function: deferred_pfn_valid().
- Reduces number of variables that we track. So, there is a higher
chance that we will avoid using stack to store/load variables inside
hot loops.
- Enables future multi-threading of these functions: do initialization
in multiple threads, wait for all threads to finish, do freeing part
in multithreading.
Tested on x86 with 1T of memory to make sure no regressions are
introduced.
Josef Bacik [Thu, 1 Feb 2018 00:16:26 +0000 (16:16 -0800)]
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Roman Gushchin [Thu, 1 Feb 2018 00:16:22 +0000 (16:16 -0800)]
mm: show total hugetlb memory consumption in /proc/meminfo
Currently we display some hugepage statistics (total, free, etc) in
/proc/meminfo, but only for default hugepage size (e.g. 2Mb).
If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
/proc/meminfo output can be confusing, as non-default sized hugepages
are not reflected at all, and there are no signs that they are existing
and consuming system memory.
To solve this problem, let's display the total amount of memory,
consumed by hugetlb pages of all sized (both free and used). Let's call
it "Hugetlb", and display size in kB to match generic /proc/meminfo
style.
Michal Hocko [Thu, 1 Feb 2018 00:16:19 +0000 (16:16 -0800)]
mm: drop hotplug lock from lru_add_drain_all()
Pulling cpu hotplug locks inside the mm core function like
lru_add_drain_all just asks for problems and the recent lockdep splat
[1] just proves this. While the usage in that particular case might be
wrong we should avoid the locking as lru_add_drain_all() is used in many
places. It seems that this is not all that hard to achieve actually.
We have done the same thing for drain_all_pages which is analogous by
commit a459eeb7b852 ("mm, page_alloc: do not depend on cpu hotplug locks
inside the allocator"). All we have to care about is to handle
- the work item might be executed on a different cpu in worker from
unbound pool so it doesn't run on pinned on the cpu
- we have to make sure that we do not race with page_alloc_cpu_dead
calling lru_add_drain_cpu
the first part is already handled because the worker calls lru_add_drain
which disables preemption when calling lru_add_drain_cpu on the local
cpu it is draining. The later is true because page_alloc_cpu_dead is
called on the controlling CPU after the hotplugged CPU vanished
completely.
Yisheng Xie [Thu, 1 Feb 2018 00:16:15 +0000 (16:16 -0800)]
mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
As in manpage of migrate_pages, the errno should be set to EINVAL when
none of the node IDs specified by new_nodes are on-line and allowed by
the process's current cpuset context, or none of the specified nodes
contain memory. However, when test by following case:
The ret will be 0 and no errno is set. As the new_nodes is empty, we
should expect EINVAL as documented.
To fix the case like above, this patch check whether target nodes AND
current task_nodes is empty, and then check whether AND
node_states[N_MEMORY] is empty.
The new nodes specifies one or more node IDs that are greater than the
maximum supported node ID, however, the errno is not set to EINVAL as
expected.
As man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
mentioned, when nodemask specifies one or more node IDs that are greater
than the maximum supported node ID, the errno should set to EINVAL.
However, get_nodes only check whether the part of bits
[BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and
remain [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES)
unchecked.
This patch is to check the bits of [MAX_NUMNODES, maxnode) in get_nodes
to let migrate_pages set the errno to EINVAL when nodemask specifies one
or more node IDs that are greater than the maximum supported node ID,
which follows the manpage's guide.
Pavel Tatashin [Thu, 1 Feb 2018 00:16:02 +0000 (16:16 -0800)]
mm: relax deferred struct page requirements
There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT, as all
the page initialization code is in common code.
Also, there is no need to depend on MEMORY_HOTPLUG, as initialization
code does not really use hotplug memory functionality. So, we can
remove this requirement as well.
This patch allows to use deferred struct page initialization on all
platforms with memblock allocator.
Tested on x86, arm64, and sparc. Also, verified that code compiles on
PPC with CONFIG_MEMORY_HOTPLUG disabled.
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
Miles Chen [Thu, 1 Feb 2018 00:15:47 +0000 (16:15 -0800)]
slub: remove obsolete comments of put_cpu_partial()
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs
immediately") makes put_cpu_partial() run with preemption disabled and
interrupts disabled when calling unfreeze_partials().
The comment: "put_cpu_partial() is done without interrupts disabled and
without preemption disabled" looks obsolete, so remove it.
Oscar Salvador [Thu, 1 Feb 2018 00:15:39 +0000 (16:15 -0800)]
mm/slab.c: remove redundant assignments for slab_state
slab_state is being set to "UP" in create_kmalloc_caches(), and later on
we set it again in kmem_cache_init_late(), but slab_state does not
change in the meantime.
Remove the redundant assignment from kmem_cache_init_late().
And unless I overlooked anything, the same goes for "slab_state = FULL".
slab_state is set to "FULL" in kmem_cache_init_late(), but it is later
being set again in cpucache_init(), which gets called from
do_initcall_level(). So remove the assignment from cpucache_init() as
well.
Also I fixed a style problem reported by checkpatch.
WARNING: Missing a blank line after declarations
#53: FILE: mm/slab_common.c:286:
+ unsigned long ralign = cache_line_size();
+ while (size <= ralign / 2)
piaojun [Thu, 1 Feb 2018 00:15:32 +0000 (16:15 -0800)]
ocfs2: return error when we attempt to access a dirty bh in jbd2
We should not reuse the dirty bh in jbd2 directly due to the following
situation:
1. When removing extent rec, we will dirty the bhs of extent rec and
truncate log at the same time, and hand them over to jbd2.
2. The bhs are submitted to jbd2 area successfully.
3. The write-back thread of device help flush the bhs to disk but
encounter write error due to abnormal storage link.
4. After a while the storage link become normal. Truncate log flush
worker triggered by the next space reclaiming found the dirty bh of
truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
__ocfs2_journal_access():
ocfs2_truncate_log_worker
ocfs2_flush_truncate_log
__ocfs2_flush_truncate_log
ocfs2_replay_truncate_records
ocfs2_journal_access_di
__ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
extent rec is still in error state, and unfortunately nobody will
take care of it.
6. At last the space of extent rec was not reduced, but truncate log
flush worker have given it back to globalalloc. That will cause
duplicate cluster problem which could be identified by fsck.ocfs2.
Sadly we can hardly revert this but set fs read-only in case of ruining
atomicity and consistency of space reclaim.
Gang He [Thu, 1 Feb 2018 00:15:21 +0000 (16:15 -0800)]
ocfs2: add ocfs2_overwrite_io()
Add ocfs2_overwrite_io function, which is used to judge if overwrite
allocated blocks, otherwise, the write will bring extra block allocation
overhead.
Gang He [Thu, 1 Feb 2018 00:15:17 +0000 (16:15 -0800)]
ocfs2: add ocfs2_try_rw_lock() and ocfs2_try_inode_lock()
Patch series "ocfs2: add nowait aio support", v4.
VFS layer has introduced the non-blocking aio flag IOCB_NOWAIT, which
tells the kernel to bail out if an AIO request will block for reasons
such as file allocations, or writeback triggering, or would block while
allocating requests while performing direct I/O.
Subsequently, pwritev2/preadv2 also can leverage this part of kernel
code. So far, ext4/xfs/btrfs have supported this feature. Add the
related code for the ocfs2 file system.
This patch (of 3):
Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which will be
used in non-blocking IO scenarios.
Gang He [Thu, 1 Feb 2018 00:15:13 +0000 (16:15 -0800)]
ocfs2: add trimfs lock to avoid duplicated trims in cluster
ocfs2 supports trimming the underlying disk via the fstrim command. But
there is a problem, ocfs2 is a shared disk cluster file system, if the
user configures a scheduled fstrim job on each file system node, this
will trigger multiple nodes trimming a shared disk simultaneously, which
is very wasteful for CPU and IO consumption. This also might negatively
affect the lifetime of poor-quality SSD devices.
So we introduce a trimfs dlm lock to communicate with each other in this
case, which will make only one fstrim command to do the trimming on a
shared disk among the cluster. The fstrim commands from the other nodes
should wait for the first fstrim to finish and return success directly,
to avoid running the same trim on the shared disk again.
The BUG code told that extent tree wants to grow but no metadata was
reserved ahead of time. From my investigation into this issue, the root
cause it that although enough metadata is not reserved, there should be
enough for following use. Rightmost extent is merged into its left one
due to a certain times of marking extent written. Because during
marking extent written, we got many physically continuous extents. At
last, an empty extent showed up and the rightmost path is removed from
extent tree.
Add a new mechanism to reuse extent block cached in dealloc which were
just unlinked from extent tree to solve this crash issue.
Criteria is that during marking extents *written*, if extent rotation
and merging results in unlinking extent with growing extent tree later
without any metadata reserved ahead of time, try to reuse those extents
in dealloc in which deleted extents are cached.
Also, this patch addresses the issue John reported that ::dw_zero_count
is not calculated properly.
After applying this patch, the issue John reported was gone. Thanks for
the reproducer provided by John. And this patch has passed
ocfs2-test(29 cases) suite running by New H3C Group.
Changwei Ge [Thu, 1 Feb 2018 00:15:02 +0000 (16:15 -0800)]
ocfs2: make metadata estimation accurate and clear
Current code assume that ::w_unwritten_list always has only one item on.
This is not right and hard to get understood. So improve how to count
unwritten item.
Gang He [Thu, 1 Feb 2018 00:14:48 +0000 (16:14 -0800)]
ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
If we can't get inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is configured with CONFIG_PREEMPT is not set. The method is to
get a blocking lock and immediately unlock before returning, this can
avoid CPU resource waste due to lots of retries, and benefits fairness
in getting lock among multiple nodes, increase efficiency in case
modifying the same file frequently from multiple nodes.
The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:
About performance improvement, we can see the testing time is reduced,
and CPU utilization decreases, the detailed data is as follows. I ran
multi_mmap test case in ocfs2-test package in a three nodes cluster.
Before applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun
In this situation we need return -EROFS to 'mount.ocfs2', so that user
can fix it by fsck. And then mount again. In addition, 'mount.ocfs2'
should be updated correspondingly as it only return 1 for all errno.
And I will post a patch for 'mount.ocfs2' too.
Yang Zhang [Thu, 1 Feb 2018 00:14:33 +0000 (16:14 -0800)]
ocfs2/cluster: close a race that fence can't be triggered
When some nodes of cluster face with TCP connection fault, ocfs2 will
pick up a quorum to continue to work and other nodes will be fenced by
resetting host.
In order to decide which node should be fenced, ocfs2 leverages
o2quo_state::qs_holds. If that variable is reduced to zero, then a try
to decide if fence local node is performed. However, under a specific
scenario that local node is not disconnected from others at the same
time, above method has a problem to reduce ::qs_holds to zero.
Because, o2net 90s idle timer corresponding to different nodes is
triggered one after another.
node 2 node 3
90s idle timer elapses
clear ::qs_conn_bm
set hold
40s is passed
90 idle timer elapses
clear ::qs_conn_bm
set hold
still up timer elapses
clear hold (NOT to zero )
90s idle timer elapses AGAIN
still up timer elapses.
clear hold
still up timer elapses
To solve this issue, a node which has already be evicted from
::qs_conn_bm can't set hold again and again invoked from idle timer.
Gang He [Thu, 1 Feb 2018 00:14:29 +0000 (16:14 -0800)]
ocfs2: give an obvious tip for mismatched cluster names
Add an obvious error message, due to mismatched cluster names between
on-disk and in the current cluster. We can meet this case during OCFS2
cluster migration.
If we can give the user an obvious tip for why they can not mount the
file system after migration, they can quickly fix this mismatch problem.
Second, also move printing ocfs2_fill_super() errno to the front of
ocfs2_dismount_volume(), since ocfs2_dismount_volume() will also print
its own message.
I looked through all the code of OCFS2 (include o2cb); there is not any
place which returns this error. In fact, the function calling path
ocfs2_fill_super -> ocfs2_mount_volume -> ocfs2_dlm_init ->
dlm_new_lockspace is a very specific one. We can use this errno to give
the user a more clear tip, since this case is a little common during
cluster migration, but the customer can quickly get the failure cause if
there is a error printed. Also, I think it is not possible to add this
errno in the o2cb path during ocfs2_dlm_init(), since the o2cb code has
been stable for a long time.
We only print this error tip when the user uses pcmk stack, since using
the o2cb stack the user will not meet this error.
Sudip Mukherjee [Thu, 1 Feb 2018 00:14:17 +0000 (16:14 -0800)]
m32r: remove abort()
Commit 7c2c11b208be ("arch: define weak abort()") has introduced a weak
abort() which is common for all arch. And, so we will not need arch
specific abort which has the same code as the weak abort(). Remove the
abort() for m32r.
Andy Shevchenko [Thu, 1 Feb 2018 00:14:10 +0000 (16:14 -0800)]
scripts/decodecode: make it take multiline Code line
In case of running scripts/decodecode without any parameters in order to
give a copy'n'pasted Code line from, for example, email it would parse
only first line of it, while in emails it's split to few.
ie, when you have a file out of oops the Code line looks like
Code: hh hh ... <hh> ... hh\n
When copy'n'paste from, for example, email where sender or some middle
MTA split it, the line looks like:
The Code line followed by another oops line usually contains characters
out of hex digit + space + < + > set.
So add logic to join this split back if and only if the following lines
have hex digits, or spaces, or '<', or '>' characters. It will be quite
unlikely to have a broken input in well formed Oops or dmesg, thus a
simple regex is being used.
Marcin Niestroj [Fri, 26 Jan 2018 19:08:59 +0000 (11:08 -0800)]
Input: goodix - use generic touchscreen_properties
Use touchscreen_properties structure instead of implementing all
properties by our own. It allows us to reuse generic code for parsing
device-tree properties (which was implemented manually in the driver for
now). Additionally, it allows us to report events using generic
touchscreen_report_pos(), which automatically handles inverted and
swapped axes.
This fixes the issue with the custom code incorrectly handling case where
ts->inverted_x and ts->swapped_x_y were true, but ts->inverted_y was
false. Assuming we have 720x1280 touch panel, ts->abs_x_max == 1279 and
ts->abs_y_max == 719 (because we inverted that in goodix_read_config()).
Now let's assume that we received event from (0:0) position (in touch
panel original coordinates). In function goodix_ts_report_touch() we
calculate input_x as 1279, but after swapping input_y takes that value
(which is more that maximum 719 value reported during initialization).
Note that since touchscreen coordinates are 0-indexed, we now report
touchscreen range as (0:size-1).
Developed and tested on custom DT-based device with gt1151 touch
panel.
Signed-off-by: Marcin Niestroj <[email protected]>
[dtor: fix endianness annotation reported by sparse, handle errors when
initializing MT slots] Signed-off-by: Dmitry Torokhov <[email protected]>
19) Avoid locking in act_csum, from Davide Caratti.
20) devinet_ioctl() simplifications from Al viro.
21) More TCP bpf improvements from Lawrence Brakmo.
22) Add support for onlink ipv6 route flag, similar to ipv4, from David
Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
tls: Add support for encryption using async offload accelerator
ip6mr: fix stale iterator
net/sched: kconfig: Remove blank help texts
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
tcp_nv: fix potential integer overflow in tcpnv_acked
r8169: fix RTL8168EP take too long to complete driver initialization.
qmi_wwan: Add support for Quectel EP06
rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
ipmr: Fix ptrdiff_t print formatting
ibmvnic: Wait for device response when changing MAC
qlcnic: fix deadlock bug
tcp: release sk_frag.page in tcp_disconnect
ipv4: Get the address of interface correctly.
net_sched: gen_estimator: fix lockdep splat
net: macb: Handle HRESP error
net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
ipv6: addrconf: break critical section in addrconf_verify_rtnl()
ipv6: change route cache aging logic
i40e/i40evf: Update DESC_NEEDED value to reflect larger value
bnxt_en: cleanup DIM work on device shutdown
...
Linus Torvalds [Wed, 31 Jan 2018 22:22:45 +0000 (14:22 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Enforce the setting of keys for keyed aead/hash/skcipher
algorithms.
- Add multibuf speed tests in tcrypt.
Algorithms:
- Improve performance of sha3-generic.
- Add native sha512 support on arm64.
- Add v8.2 Crypto Extentions version of sha3/sm3 on arm64.
- Avoid hmac nesting by requiring underlying algorithm to be unkeyed.
- Add cryptd_max_cpu_qlen module parameter to cryptd.
Drivers:
- Add support for EIP97 engine in inside-secure.
- Add inline IPsec support to chelsio.
- Add RevB core support to crypto4xx.
- Fix AEAD ICV check in crypto4xx.
- Add stm32 crypto driver.
- Add support for BCM63xx platforms in bcm2835 and remove bcm63xx.
- Add Derived Key Protocol (DKP) support in caam.
- Add Samsung Exynos True RNG driver.
- Add support for Exynos5250+ SoCs in exynos PRNG driver"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (166 commits)
crypto: picoxcell - Fix error handling in spacc_probe()
crypto: arm64/sha512 - fix/improve new v8.2 Crypto Extensions code
crypto: arm64/sm3 - new v8.2 Crypto Extensions implementation
crypto: arm64/sha3 - new v8.2 Crypto Extensions implementation
crypto: testmgr - add new testcases for sha3
crypto: sha3-generic - export init/update/final routines
crypto: sha3-generic - simplify code
crypto: sha3-generic - rewrite KECCAK transform to help the compiler optimize
crypto: sha3-generic - fixes for alignment and big endian operation
crypto: aesni - handle zero length dst buffer
crypto: artpec6 - remove select on non-existing CRYPTO_SHA384
hwrng: bcm2835 - Remove redundant dev_err call in bcm2835_rng_probe()
crypto: stm32 - remove redundant dev_err call in stm32_cryp_probe()
crypto: axis - remove unnecessary platform_get_resource() error check
crypto: testmgr - test misuse of result in ahash
crypto: inside-secure - make function safexcel_try_push_requests static
crypto: aes-generic - fix aes-generic regression on powerpc
crypto: chelsio - Fix indentation warning
crypto: arm64/sha1-ce - get rid of literal pool
crypto: arm64/sha2-ce - move the round constant table to .rodata section
...
Linus Torvalds [Wed, 31 Jan 2018 22:16:13 +0000 (14:16 -0800)]
Merge tag 'selinux-pr-20180130' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"A small pull request this time, just three patches, and one of these
is just a comment update (swap the FSF physical address for a URL).
The other two patches are small bug fixes found by szybot/syzkaller;
they individual patch descriptions should tell you all you ever wanted
to know"
* tag 'selinux-pr-20180130' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: skip bounded transition processing if the policy isn't loaded
selinux: ensure the context is NUL terminated in security_context_to_sid_core()
security: replace FSF address with web source in license notices
Linus Torvalds [Wed, 31 Jan 2018 21:44:45 +0000 (13:44 -0800)]
Merge branch 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull seccomp updates from James Morris:
"Add support for retrieving seccomp metadata"
* 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ptrace, seccomp: add support for retrieving seccomp metadata
seccomp: hoist out filter resolving logic
Linus Torvalds [Wed, 31 Jan 2018 21:12:31 +0000 (13:12 -0800)]
Merge branch 'next-tpm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull tpm updates from James Morris:
- reduce polling delays in tpm_tis
- support retrieving TPM 2.0 Event Log through EFI before
ExitBootServices
- replace tpm-rng.c with a hwrng device managed by the driver for each
TPM device
- TPM resource manager synthesizes TPM_RC_COMMAND_CODE response instead
of returning -EINVAL for unknown TPM commands. This makes user space
more sound.
- CLKRUN fixes:
* Keep #CLKRUN disable through the entier TPM command/response flow
* Check whether #CLKRUN is enabled before disabling and enabling it
again because enabling it breaks PS/2 devices on a system where it
is disabled
* 'next-tpm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
tpm: remove unused variables
tpm: remove unused data fields from I2C and OF device ID tables
tpm: only attempt to disable the LPC CLKRUN if is already enabled
tpm: follow coding style for variable declaration in tpm_tis_core_init()
tpm: delete the TPM_TIS_CLK_ENABLE flag
tpm: Update MAINTAINERS for Jason Gunthorpe
tpm: Keep CLKRUN enabled throughout the duration of transmit_cmd()
tpm_tis: Move ilb_base_addr to tpm_tis_data
tpm2-cmd: allow more attempts for selftest execution
tpm: return a TPM_RC_COMMAND_CODE response if command is not implemented
tpm: Move Linux RNG connection to hwrng
tpm: use struct tpm_chip for tpm_chip_find_get()
tpm: parse TPM event logs based on EFI table
efi: call get_event_log before ExitBootServices
tpm: add event log format version
tpm: rename event log provider files
tpm: move tpm_eventlog.h outside of drivers folder
tpm: use tpm_msleep() value as max delay
tpm: reduce tpm polling delay in tpm_tis_core
tpm: move wait_for_tpm_stat() to respective driver files
Linus Torvalds [Wed, 31 Jan 2018 21:10:22 +0000 (13:10 -0800)]
Merge branch 'next-smack' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull smack updates from James Morris:
"Two minor fixes"
* 'next-smack' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
Smack: Privilege check on key operations
Smack: fix dereferenced before check
Linus Torvalds [Wed, 31 Jan 2018 21:07:35 +0000 (13:07 -0800)]
Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull integrity updates from James Morris:
"This contains a mixture of bug fixes, code cleanup, and new
functionality. Of note is the integrity cache locking fix, file change
detection, and support for a new EVM portable and immutable signature
type.
The re-introduction of the integrity cache lock (iint) fixes the
problem of attempting to take the i_rwsem shared a second time, when
it was previously taken exclusively. Defining atomic flags resolves
the original iint/i_rwsem circular locking - accessing the file data
vs. modifying the file metadata. Although it fixes the O_DIRECT
problem as well, a subsequent patch is needed to remove the explicit
O_DIRECT prevention.
For performance reasons, detecting when a file has changed and needs
to be re-measured, re-appraised, and/or re-audited, was limited to
after the last writer has closed, and only if the file data has
changed. Detecting file change is based on i_version. For filesystems
that do not support i_version, remote filesystems, or userspace
filesystems, the file was measured, appraised and/or audited once and
never re-evaluated. Now local filesystems, which do not support
i_version or are not mounted with the i_version option, assume the
file has changed and are required to re-evaluate the file. This change
does not address detecting file change on remote or userspace
filesystems.
Unlike file data signatures, which can be included and distributed in
software packages (eg. rpm, deb), the existing EVM signature, which
protects the file metadata, could not be included in software
packages, as it includes file system specific information (eg. i_ino,
possibly the UUID). This pull request defines a new EVM portable and
immutable file metadata signature format, which can be included in
software packages"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ima/policy: fix parsing of fsuuid
ima: Use i_version only when filesystem supports it
integrity: remove unneeded initializations in integrity_iint_cache entries
ima: log message to module appraisal error
ima: pass filename to ima_rdwr_violation_check()
ima: Fix line continuation format
ima: support new "hash" and "dont_hash" policy actions
ima: re-introduce own integrity cache lock
EVM: Add support for portable signature format
EVM: Allow userland to permit modification of EVM-protected metadata
ima: relax requiring a file signature for new files with zero length
Linus Torvalds [Wed, 31 Jan 2018 21:02:18 +0000 (13:02 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching updates from Jiri Kosina:
- handle 'infinitely'-long sleeping tasks, from Miroslav Benes
- remove 'immediate' feature, as it turns out it doesn't provide the
originally expected semantics, and brings more issues than value
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add locking to force and signal functions
livepatch: Remove immediate feature
livepatch: force transition to finish
livepatch: send a fake signal to all blocking tasks
Linus Torvalds [Wed, 31 Jan 2018 21:00:01 +0000 (13:00 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
- remove hid_have_special_driver[] entry hard requirement for any newly
supported VID/PID by a specific non-core hid driver, and general
related cleanup of HID matching core, from Benjamin Tissoires
- support for new Wacom devices and a few small fixups for already
supported ones in Wacom driver, from Aaron Armstrong Skomra and Jason
Gerecke
- sysfs interface fix for roccat driver from Dan Carpenter
- support for new Asus HW (T100TAF, T100HA, T200TA) from Hans de Goede
- improved support for Jabra devices, from Niels Skou Olsen
- other assorted small fixes and new device IDs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (30 commits)
HID: quirks: Fix keyboard + touchpad on Toshiba Click Mini not working
HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
HID: asus: Fix special function keys on T200TA
HID: asus: Add touchpad max x/y and resolution info for the T200TA
HID: wacom: Add support for One by Wacom (CTL-472 / CTL-672)
HID: wacom: Fix reporting of touch toggle (WACOM_HID_WD_MUTE_DEVICE) events
HID: intel-ish-hid: Enable Cannon Lake and Coffee Lake laptop/desktop
HID: elecom: rewrite report fixup for EX-G and future mice
HID: sony: Report DS4 version info through sysfs
HID: sony: Print reversed MAC address via %pMR
HID: wacom: EKR: ensure devres groups at higher indexes are released
HID: rmi: Support the Fujitsu R726 Pad dock using hid-rmi
HID: add quirk for another PIXART OEM mouse used by HP
HID: quirks: make array hid_quirks static
HID: hid-multitouch: support fine-grain orientation reporting
HID: asus: Add product-id for the T100TAF and T100HA keyboard docks
HID: elo: clear BTN_LEFT mapping
HID: multitouch: Combine all left-button events in a frame
HID: multitouch: Only look at non touch fields in first packet of a frame
HID: multitouch: Properly deal with Win8 PTP reports with 0 touches
...
Linus Torvalds [Wed, 31 Jan 2018 20:55:31 +0000 (12:55 -0800)]
Merge tag 'for-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply and reset updates from Sebastian Reichel:
- bq27xxx: add bq27521 support
- drop unused imx-snvs-poweroff driver
- improve axp288 driver
- misc fixes
* tag 'for-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (32 commits)
power: supply: max17042_battery: Always fall back to default platform-data
power: supply: max17042_battery: Check battery current for status when supplied
MAINTAINERS: Add AXP288 PMIC entry
power: supply: axp288_fuel_gauge: Do not register our psy on (some) HDMI sticks
power: supply: axp288_fuel_gauge: Optimize get_current()
power: supply: axp288_fuel_gauge: Rework get_status()
power: reset: account for const type of of_device_id.data
power: supply: account for const type of of_device_id.data
bq24190: Simplify code in property_is_writeable
power: supply: axp288_fuel_gauge: Get iio-channels once during boot
power: supply: axp288_charger: Properly stop work on probe-error / remove
power: supply: axp288_charger: Simplify extcon cable handling
power: supply: axp288_charger: Use the right property for the input current limit
power: supply: axp288_charger: Pick lower input current limit not higher
power: supply: axp288_charger: Do not cache input current limit value
power: supply: axp288_charger: Remove no longer needed locking
power: supply: axp288_charger: Use regmap_update_bits to set the input limits
power: supply: axp288_charger: Cleanup some double empty lines
power: supply: axp288_charger: Remove charger-enabled state tracking
power: supply: axp288_charger: Add missing newlines to some messages
...
Linus Torvalds [Wed, 31 Jan 2018 20:25:27 +0000 (12:25 -0800)]
Merge tag 'gpio-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"The is the bulk of GPIO changes for the v4.16 kernel cycle. It is
pretty calm this time around I think. I even got time to get to things
like starting to clean up header includes.
Core changes:
- Disallow open drain and open source flags to be set simultaneously.
This doesn't make electrical sense, and would the hardware actually
respond to this setting, the result would be short circuit.
- ACPI GPIO has a new core infrastructure for handling quirks. The
quirks are there to deal with broken ACPI tables centrally instead
of pushing the work to individual drivers. In the world of BIOS
writers, the ACPI tables are perfect. Until they find a mistake in
it. When such a mistake is found, we can patch it with a quirk. It
should never happen, the problem is that it happens. So we
accomodate for it.
- Several documentation updates.
- Revert the patch setting up initial direction state from reading
the device. This was causing bad things for drivers that can't read
status on all its pins. It is only affecting debugfs information
quality.
- Label descriptors with the device name if no explicit label is
passed in.
- Pave the ground for transitioning SPI and regulators to use GPIO
descriptors by implementing some quirks in the device tree GPIO
parsing code.
New drivers:
- New driver for the Access PCIe IDIO 24 family.
Other:
- Major refactorings and improvements to the GPIO mockup driver used
for test and verification.
- Moved the AXP209 driver over to pin control since it gained a pin
control back-end. These patches will appear (with the same hashes)
in the pin control pull request as well.
- Convert the onewire GPIO driver w1-gpio to use descriptors. This is
merged here since the W1 maintainers send very few pull requests
and he ACKed it.
- Start to clean up driver headers using <linux/gpio.h> to just use
<linux/gpio/driver.h> as appropriate"
* tag 'gpio-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (103 commits)
gpio: Timestamp events in hardirq handler
gpio: Fix kernel stack leak to userspace
gpio: Fix a documentation spelling mistake
gpio: Documentation update
gpiolib: remove redundant initialization of pointer desc
gpio: of: Fix NPE from OF flags
gpio: stmpe: Delete an unnecessary variable initialisation in stmpe_gpio_probe()
gpio: stmpe: Move an assignment in stmpe_gpio_probe()
gpio: stmpe: Improve a size determination in stmpe_gpio_probe()
gpio: stmpe: Use seq_putc() in stmpe_dbg_show()
gpio: No NULL owner
gpio: stmpe: i2c transfer are forbiden in atomic context
gpio: davinci: Include proper header
gpio: da905x: Include proper header
gpio: cs5535: Include proper header
gpio: crystalcove: Include proper header
gpio: bt8xx: Include proper header
gpio: bcm-kona: Include proper header
gpio: arizona: Include proper header
gpio: amd8111: Include proper header
...
Linus Torvalds [Wed, 31 Jan 2018 20:05:10 +0000 (12:05 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull RDMA subsystem updates from Jason Gunthorpe:
"Overall this cycle did not have any major excitement, and did not
require any shared branch with netdev.
Lots of driver updates, particularly of the scale-up and performance
variety. The largest body of core work was Parav's patches fixing and
restructing some of the core code to make way for future RDMA
containerization.
Summary:
- misc small driver fixups to
bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
- several major feature adds to bnxt_re driver: SRIOV VF RoCE
support, HugePages support, extended hardware stats support, and
SRQ support
- a notable number of fixes to the i40iw driver from debugging scale
up testing
- more work to enable the new hip08 chip in the hns driver
- misc small ULP fixups to srp/srpt//ipoib
- preparation for srp initiator and target to support the RDMA-CM
protocol for connections
- add RDMA-CM support to srp initiator, srp target is still a WIP
- fixes for a couple of places where ipoib could spam the dmesg log
- fix encode/decode of FDR/EDR data rates in the core
- many patches from Parav with ongoing work to clean up
inconsistencies and bugs in RoCE support around the rdma_cm
- mlx5 driver support for the userspace features 'thread domain',
'wallclock timestamps' and 'DV Direct Connected transport'. Support
for the firmware dual port rocee capability
- core support for more than 32 rdma devices in the char dev
allocation
- kernel doc updates from Randy Dunlap
- new netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
- one minor change to the kobject code acked by Greg KH"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (259 commits)
RDMA/nldev: Provide detailed QP information
RDMA/nldev: Provide global resource utilization
RDMA/core: Add resource tracking for create and destroy PDs
RDMA/core: Add resource tracking for create and destroy CQs
RDMA/core: Add resource tracking for create and destroy QPs
RDMA/restrack: Add general infrastructure to track RDMA resources
RDMA/core: Save kernel caller name when creating PD and CQ objects
RDMA/core: Use the MODNAME instead of the function name for pd callers
RDMA: Move enum ib_cq_creation_flags to uapi headers
IB/rxe: Change RDMA_RXE kconfig to use select
IB/qib: remove qib_keys.c
IB/mthca: remove mthca_user.h
RDMA/cm: Fix access to uninitialized variable
RDMA/cma: Use existing netif_is_bond_master function
IB/core: Avoid SGID attributes query while converting GID from OPA to IB
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
IB/umad: Fix use of unprotected device pointer
IB/iser: Combine substrings for three messages
IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
...
Linus Torvalds [Wed, 31 Jan 2018 19:52:20 +0000 (11:52 -0800)]
Merge tag 'dmaengine-4.16-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
"This time is smallish update with updates mainly to drivers:
- updates to xilinx and zynqmp dma controllers
- update reside calculation for rcar controller
- more RSTify fixes for documentation
- add support for race free transfer termination and updating for
users for that
- support for new rev of hidma with addition new APIs to get device
match data in ACPI/OF
- random updates to bunch of other drivers"
* tag 'dmaengine-4.16-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (47 commits)
dmaengine: dmatest: fix container_of member in dmatest_callback
dmaengine: stm32-dmamux: Remove unnecessary platform_get_resource() error check
dmaengine: sprd: statify 'sprd_dma_prep_dma_memcpy'
dmaengine: qcom_hidma: simplify DT resource parsing
dmaengine: xilinx_dma: Free BD consistent memory
dmaengine: xilinx_dma: Fix warning variable prev set but not used
dmaengine: xilinx_dma: properly configure the SG mode bit in the driver for cdma
dmaengine: doc: format struct fields using monospace
dmaengine: doc: fix bullet list formatting
dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63
dmaengine: cppi41: Fix channel queues array size check
dmaengine: imx-sdma: Add MODULE_FIRMWARE
dmaengine: xilinx_dma: Fix typos
dmaengine: xilinx_dma: Differentiate probe based on the ip type
dmaengine: xilinx_dma: fix style issues from checkpatch
dmaengine: xilinx_dma: Fix kernel doc warnings
dmaengine: xilinx_dma: Fix race condition in the driver for multiple descriptor scenario
dmaeninge: xilinx_dma: Fix bug in multiple frame stores scenario in vdma
dmaengine: xilinx_dma: Check for channel idle state before submitting dma descriptor
dmaengine: zynqmp_dma: Fix race condition in the probe
...
Linus Torvalds [Wed, 31 Jan 2018 19:32:27 +0000 (11:32 -0800)]
Merge tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping
Pull dma mapping updates from Christoph Hellwig:
"Except for a runtime warning fix from Christian this is all about
consolidation of the generic no-IOMMU code, a well as the glue code
for swiotlb.
All the code is based on the x86 implementation with hooks to allow
all architectures that aren't cache coherent to use it.
The x86 conversion itself has been deferred because the x86
maintainers were a little busy in the last months"
* tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping: (57 commits)
MAINTAINERS: add the iommu list for swiotlb and xen-swiotlb
arm64: use swiotlb_alloc and swiotlb_free
arm64: replace ZONE_DMA with ZONE_DMA32
mips: use swiotlb_{alloc,free}
mips/netlogic: remove swiotlb support
tile: use generic swiotlb_ops
tile: replace ZONE_DMA with ZONE_DMA32
unicore32: use generic swiotlb_ops
ia64: remove an ifdef around the content of pci-dma.c
ia64: clean up swiotlb support
ia64: use generic swiotlb_ops
ia64: replace ZONE_DMA with ZONE_DMA32
swiotlb: remove various exports
swiotlb: refactor coherent buffer allocation
swiotlb: refactor coherent buffer freeing
swiotlb: wire up ->dma_supported in swiotlb_dma_ops
swiotlb: add common swiotlb_map_ops
swiotlb: rename swiotlb_free to swiotlb_exit
x86: rename swiotlb_dma_ops
powerpc: rename swiotlb_dma_ops
...
Linus Torvalds [Wed, 31 Jan 2018 19:23:28 +0000 (11:23 -0800)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual driver suspects: arcmsr,
scsi_debug, mpt3sas, lpfc, cxlflash, qla2xxx, aacraid, megaraid_sas,
hisi_sas.
We also have a rework of the libsas hotplug handling to make it more
robust, a slew of 32 bit time conversions and fixes, and a host of the
usual minor updates and style changes. The biggest potential for
regressions is the libsas hotplug changes, but so far they seem stable
under testing"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (313 commits)
scsi: qla2xxx: Fix logo flag for qlt_free_session_done()
scsi: arcmsr: avoid do_gettimeofday
scsi: core: Add VENDOR_SPECIFIC sense code definitions
scsi: qedi: Drop cqe response during connection recovery
scsi: fas216: fix sense buffer initialization
scsi: ibmvfc: Remove unneeded semicolons
scsi: hisi_sas: fix a bug in hisi_sas_dev_gone()
scsi: hisi_sas: directly attached disk LED feature for v2 hw
scsi: hisi_sas: devicetree: bindings: add LED feature for v2 hw
scsi: megaraid_sas: NVMe passthrough command support
scsi: megaraid: use ktime_get_real for firmware time
scsi: fnic: use 64-bit timestamps
scsi: qedf: Fix error return code in __qedf_probe()
scsi: devinfo: fix format of the device list
scsi: qla2xxx: Update driver version to 10.00.00.05-k
scsi: qla2xxx: Add XCB counters to debugfs
scsi: qla2xxx: Fix queue ID for async abort with Multiqueue
scsi: qla2xxx: Fix warning for code intentation in __qla24xx_handle_gpdb_event()
scsi: qla2xxx: Fix warning during port_name debug print
scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
...
Linus Torvalds [Wed, 31 Jan 2018 19:05:47 +0000 (11:05 -0800)]
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core fixes to ensure that bio submission follows a depth-first
tree walk; this is critical to allow forward progress without the
need to use the bioset's BIOSET_NEED_RESCUER.
- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well leveraging per-bio-data more).
- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool
that stack directly on NVMe data device.
- DM multipath improvements to factor out legacy SCSI-only (e.g.
scsi_dh) code paths to allow for more optimized support for NVMe
multipath.
- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.
- Numerous DM raid target fixes and improvements.
- Add a new DM "unstriped" target that enables Intel to workaround
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target).
- Various Documentation fixes and improvements.
- Misc cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
dm cache: Documentation: update default migration_throttling value
dm mpath selector: more evenly distribute ties
dm unstripe: fix target length versus number of stripes size check
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
dm table: fix NVMe bio-based dm_table_determine_type() validation
dm: various cleanups to md->queue initialization code
dm mpath: delay the retry of a request if the target responded as busy
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
dm log writes: fix max length used for kstrndup
dm: backfill missing calls to mutex_destroy()
dm snapshot: use mutex instead of rw_semaphore
dm flakey: check for null arg_name in parse_features()
dm thin: extend thinpool status format string with omitted fields
dm thin: fixes in thin-provisioning.txt
dm thin: document representation of <highest mapped sector> when there is none
dm thin: fix documentation relative to low water mark threshold
dm cache: be consistent in specifying sectors and SI units in cache.txt
dm cache: delete obsoleted paragraph in cache.txt
dm cache: fix grammar in cache-policies.txt
...
- make raid5-PPL support disks with writeback cache enabled"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5-ppl: PPL support for disks with write-back cache enabled
md/r5cache: print more info of log recovery
md/raid1,raid10: silence warning about wait-within-wait
md: introduce new personality funciton start()
Linus Torvalds [Wed, 31 Jan 2018 18:18:00 +0000 (10:18 -0800)]
Merge tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"This merge cycle, we're again some substantive changes to XFS.
Metadata verifiers have been restructured to provide more detail about
which part of a metadata structure failed checks, and we've enhanced
the new online fsck feature to cross-reference extent allocation
information with the other metadata structures. With this pull, the
metadata verification part of online fsck is more or less finished,
though the feature is still experimental and still disabled by
default.
We're also preparing to remove the EXPERIMENTAL tag from a couple of
features this cycle. This week we're committing a bunch of space
accounting fixes for reflink and removing the EXPERIMENTAL tag from
reflink; I anticipate that we'll be ready to do the same for the
reverse mapping feature next week. (I don't have any pending fixes for
rmap; however I wish to remove the tags one at a time.)
This giant pile of patches has been run through a full xfstests run
over the weekend and through a quick xfstests run against this
morning's master, with no major failures reported. Let me know if
there's any merge problems -- git merge reported that one of our
patches touched the same function as the i_version series, but it
resolved things cleanly.
Summary:
- Log faulting code locations when verifiers fail, for improved
diagnosis of corrupt filesystems.
- Implement metadata verifiers for local format inode fork data.
- Online scrub now cross-references metadata records with other
metadata.
- Refactor the fs geometry ioctl generation functions.
- Harden various metadata verifiers.
- Fix various accounting problems.
- Fix uncancelled transactions leaking when xattr functions fail.
- Prevent the copy-on-write speculative preallocation garbage
collector from racing with writeback.
- Emit log reservation type information as trace data so that we can
compare against xfsprogs.
- Fix some erroneous asserts in the online scrub code.
- Clean up the transaction reservation calculations.
- Fix various minor bugs in online scrub.
- Log complaints about mixed dio/buffered writes once per day and
less noisily than before.
- Refactor buffer log item lists to use list_head.
- Break PNFS leases before reflinking blocks.
- Reduce lock contention on reflink source files.
- Fix some quota accounting problems with reflink.
- Fix a serious corruption problem in the direct cow write code where
we fed bad iomaps to the vfs iomap consumers.
- Various other refactorings.
- Remove EXPERIMENTAL tag from reflink!"
* tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (94 commits)
xfs: remove experimental tag for reflinks
xfs: don't screw up direct writes when freesp is fragmented
xfs: check reflink allocation mappings
iomap: warn on zero-length mappings
xfs: treat CoW fork operations as delalloc for quota accounting
xfs: only grab shared inode locks for source file during reflink
xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
xfs: reflink should break pnfs leases before sharing blocks
xfs: don't clobber inobt/finobt cursors when xref with rmap
xfs: skip CoW writes past EOF when writeback races with truncate
xfs: preserve i_rdev when recycling a reclaimable inode
xfs: refactor accounting updates out of xfs_bmap_btalloc
xfs: refactor inode verifier corruption error printing
xfs: make tracepoint inode number format consistent
xfs: always zero di_flags2 when we free the inode
xfs: call xfs_qm_dqattach before performing reflink operations
xfs: bmap code cleanup
Use list_head infra-structure for buffer's log items list
Split buffer's b_fspriv field
Get rid of xfs_buf_log_item_t typedef
...
Linus Torvalds [Wed, 31 Jan 2018 18:01:08 +0000 (10:01 -0800)]
Merge branch 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull get_user_pages_fast updates from Al Viro:
"A bit more get_user_pages work"
* 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kvm: switch get_user_page_nowait() to get_user_pages_unlocked()
__get_user_pages_locked(): get rid of notify_drop argument
get_user_pages_unlocked(): pass true to __get_user_pages_locked() notify_drop
cris: switch to get_user_pages_fast()
fold __get_user_pages_unlocked() into its sole remaining caller
Linus Torvalds [Wed, 31 Jan 2018 17:25:20 +0000 (09:25 -0800)]
Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"All kinds of misc stuff, without any unifying topic, from various
people.
Neil's d_anon patch, several bugfixes, introduction of kvmalloc
analogue of kmemdup_user(), extending bitfield.h to deal with
fixed-endians, assorted cleanups all over the place..."
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
alpha: osf_sys.c: use timespec64 where appropriate
alpha: osf_sys.c: fix put_tv32 regression
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
dcache: delete unused d_hash_mask
dcache: subtract d_hash_shift from 32 in advance
fs/buffer.c: fold init_buffer() into init_page_buffers()
fs: fold __inode_permission() into inode_permission()
fs: add RWF_APPEND
sctp: use vmemdup_user() rather than badly open-coding memdup_user()
snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
replace_user_tlv(): switch to vmemdup_user()
new primitive: vmemdup_user()
memdup_user(): switch to GFP_USER
eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
eventfd: fold eventfd_ctx_read() into eventfd_read()
eventfd: convert to use anon_inode_getfd()
nfs4file: get rid of pointless include of btrfs.h
uvc_v4l2: clean copyin/copyout up
vme_user: don't use __copy_..._user()
usx2y: don't bother with memdup_user() for 16-byte structure
...
Linus Torvalds [Wed, 31 Jan 2018 16:55:58 +0000 (08:55 -0800)]
Merge tag 'gfs2-4.16.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got 30 patches for this merge window. These generally fall into
five categories:
- code cleanups
- patches related to adding PUNCH_HOLE support to GFS2
- support for new fields in resource group headers
- a few bug fixes
- support for new fields in journal log headers. These new fields,
which were previously unused, are designed to make it easier to
track down file system corruption, and allow fsck.gfs2 to make more
intelligent decisions when finding and fixing file system
corruption.
Details:
- Two patches from Abhi Das, to trim the ordered writes list, which
used to grow uncontrollably until unmount.
- Several patches from Andreas Gruenbacher: remove an unused
parameter from function gfs2_write_jdata_pagevec, remove a
pointless BUG_ON, clean up an error patch in trunc_start, remove
some unused parameters from truncate, make gfs2_journaled_truncate
more efficient, clean up the support functions for truncate, fix
metadata read-ahead for truncate to make it faster, fix up the
non-recursive truncate code, rework and rename
gfs2_block_truncate_page, generalize the non-recursive truncate
code so it can take a range of values for punch_hole support,
introduce new PUNCH_HOLE support that take advantage of the
previous patches, add fallocate support with PUNCH_HOLE, fix some
typos in the comments, add the function gfs2_max_stuffed_size to
replace a piece of code that was needlessly repeated throughout
GFS2, a minor cleanup to function gfs2_page_add_databufs, get rid
of function gfs2_log_header_in in preparation for the new log
header fields, and also fix up some missing newlines in kernel
messages.
- Andy Price added a new field to resource groups to indicate where
the next one should be, to allow fsck.gfs2 to make better repairs.
He also added new rindex fields for consistency checking, and added
a crc field to resource group headers for consistency checking.
- I reduced redundancy in functions common to freeing dinodes, and
when writing log headers between the journalling code and journal
recovery code. Also added new fields to journal log headers based
on a prototype from Steve Whitehouse, and log the source of journal
log headers so we can better track down journal corruption. Minor
comment typo fix and a fix for a BUG in an unlink error path.
- Steve Whitehouse contributed a patch to fix an incorrect use of the
gfs2_blk2rgrpd function.
- Tetsuo Handa contributed a patch that fixes incorrect error
handling in function init_gfs2_fs"
* tag 'gfs2-4.16.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (30 commits)
gfs2: Add a few missing newlines in messages
gfs2: Remove inode from ordered write list in gfs2_write_inode()
GFS2: Don't try to end a non-existent transaction in unlink
GFS2: Fix minor comment typo
GFS2: Log the reason for log flushes in every log header
GFS2: Introduce new gfs2_log_header_v2
gfs2: Get rid of gfs2_log_header_in
gfs2: Minor gfs2_page_add_databufs cleanup
gfs2: Add gfs2_max_stuffed_size
gfs2: Typo fixes
gfs2: Implement fallocate(FALLOC_FL_PUNCH_HOLE)
gfs2: Turn trunc_dealloc into punch_hole
gfs2: Generalize truncate code
Turn gfs2_block_truncate_page into gfs2_block_zero_range
gfs2: Improve non-recursive delete algorithm
gfs2: Fix metadata read-ahead during truncate
gfs2: Clean up {lookup,fillup}_metapath
gfs2: Remove minor gfs2_journaled_truncate inefficiencies
gfs2: truncate: Remove unnecessary oldsize parameters
gfs2: Clean up trunc_start error path
...
Jeff Layton [Tue, 30 Jan 2018 20:32:21 +0000 (15:32 -0500)]
iversion: make inode_cmp_iversion{+raw} return bool instead of s64
As Linus points out:
The inode_cmp_iversion{+raw}() functions are pure and utter crap.
Why?
You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do
int cmp = inode_cmp_iversion(inode, old);
if (cmp < 0 ...
and as a result you get code that looks sane, but that doesn't
actually *WORK* right.
Since none of the callers actually care about the ternary value here,
convert the inode_cmp_iversion{+raw} functions to just return a boolean
value (false for matching, true for non-matching).
This matches the existing use of these functions just fine, and makes it
simple to convert them to return a ternary value in the future if we
grow callers that need it.
With this change we can also reimplement inode_cmp_iversion in a simpler
way using inode_peek_iversion.
Vakul Garg [Wed, 31 Jan 2018 16:04:37 +0000 (21:34 +0530)]
tls: Add support for encryption using async offload accelerator
Async crypto accelerators (e.g. drivers/crypto/caam) support offloading
GCM operation. If they are enabled, crypto_aead_encrypt() return error
code -EINPROGRESS. In this case tls_do_encryption() needs to wait on a
completion till the time the response for crypto offload request is
received.
When we dump the ip6mr mfc entries via proc, we initialize an iterator
with the table to dump but we don't clear the cache pointer which might
be initialized from a prior read on the same descriptor that ended. This
can result in lock imbalance (an unnecessary unlock) leading to other
crashes and hangs. Clear the cache pointer like ipmr does to fix the issue.
Thanks for the reliable reproducer.
Here's syzbot's trace:
WARNING: bad unlock balance detected!
4.15.0-rc3+ #128 Not tainted
syzkaller971460/3195 is trying to release lock (mrt_lock) at:
[<000000006898068d>] ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller971460/3195:
#0: (&p->lock){+.+.}, at: [<00000000744a6565>] seq_read+0xd5/0x13d0
fs/seq_file.c:165
Reported-by: syzbot <bot+eceb3204562c41a438fa1f2335e0fe4f6886d669@syzkaller.appspotmail.com> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ulf Magnusson [Wed, 31 Jan 2018 09:34:20 +0000 (10:34 +0100)]
net/sched: kconfig: Remove blank help texts
Blank help texts are probably either a typo, a Kconfig misunderstanding,
or some kind of half-committing to adding a help text (in which case a
TODO comment would be clearer, if the help text really can't be added
right away).
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
Add suffix LL to constant 1000 in order to give the compiler
complete information about the proper arithmetic to use. Notice
that this constant is used in a context that expects an expression
of type long long int (64 bits, signed).
The expression (band->burst_size + band->rate) * 1000 is currently
being evaluated using 32-bit arithmetic.
Addresses-Coverity-ID: 1461563 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: David S. Miller <[email protected]>
tcp_nv: fix potential integer overflow in tcpnv_acked
Add suffix ULL to constant 80000 in order to avoid a potential integer
overflow and give the compiler complete information about the proper
arithmetic to use. Notice that this constant is used in a context that
expects an expression of type u64.
The current cast to u64 effectively applies to the whole expression
as an argument of type u64 to be passed to div64_u64, but it does
not prevent it from being evaluated using 32-bit arithmetic instead
of 64-bit arithmetic.
Also, once the expression is properly evaluated using 64-bit arithmentic,
there is no need for the parentheses and the external cast to u64.
Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: David S. Miller <[email protected]>
- Backwards Compatibility:
If userspace wants to determine whether RTM_NEWLINK supports the
IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
does not include IFLA_IF_NETNSID userspace should assume that
IFLA_IF_NETNSID is not supported on this kernel.
If the reply does contain an IFLA_IF_NETNSID property userspace
can send an RTM_NEWLINK with a IFLA_IF_NETNSID property. If they receive
EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
with RTM_NEWLINK. Userpace should then fallback to other means.
- Security:
Callers must have CAP_NET_ADMIN in the owning user namespace of the
target network namespace.