Git Repo - linux.git/log

lib/genalloc.c: make the avail variable an atomic_long_t

If the amount of resources allocated to a gen_pool exceeds 2^32 then the
avail atomic overflows and this causes problems when clients try and
borrow resources from the pool. This is only expected to be an issue on
64 bit systems.

Add the <linux/atomic.h> header to pull in atomic_long* operations. So
that 32 bit systems continue to use atomic32_t but 64 bit systems can
use atomic64_t.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Bates <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Daniel Mentz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/int_sqrt: adjust comments

Our current int_sqrt() is not rough nor any approximation; it calculates
the exact value of: floor(sqrt()). Document this.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Anshul Garg <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/int_sqrt: optimize initial value compute

The initial value (@m) compute is:

m = 1UL << (BITS_PER_LONG - 2);
while (m > x)
m >>= 2;

Which is a linear search for the highest even bit smaller or equal to @x
We can implement this using a binary search using __fls() (or better when
its hardware implemented).

m = 1UL << (__fls(x) & ~1UL);

Especially for small values of @x; which are the more common arguments
when doing a CDF on idle times; the linear search is near to worst case,
while the binary search of __fls() is a constant 6 (or 5 on 32bit)
branches.

      cycles:                 branches:              branch-misses:

PRE:

hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219

SOFTWARE FLS:

hot:   29.576176 +- 0.028850  26.666730 +- 0.004511  0.019463 +- 0.000663
cold: 165.947136 +- 0.188406  26.666746 +- 0.004511  6.133897 +- 0.004386

HARDWARE FLS:

hot:   24.720922 +- 0.025161  20.666784 +- 0.004509  0.020836 +- 0.000677
cold: 132.777197 +- 0.127471  20.666776 +- 0.004509  5.080285 +- 0.003874

Averages computed over all values <128k using a LFSR to generate order.
Cold numbers have a LFSR based branch trace buffer 'confuser' ran between
each int_sqrt() invocation.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Suggested-by: Joe Perches <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Anshul Garg <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/int_sqrt: optimize small argument

The current int_sqrt() computation is sub-optimal for the case of small
@x.  Which is the interesting case when we're going to do cumulative
distribution functions on idle times, which we assume to be a random
variable, where the target residency of the deepest idle state gives an
upper bound on the variable (5e6ns on recent Intel chips).

In the case of small @x, the compute loop:

while (m != 0) {
b = y + m;
y >>= 1;

if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
}

can be reduced to:

while (m > x)
m >>= 2;

Because y==0, b==m and until x>=m y will remain 0.

And while this is computationally equivalent, it runs much faster
because there's less code, in particular less branches.

      cycles:                 branches:              branch-misses:

OLD:

hot:   45.109444 +- 0.044117  44.333392 +- 0.002254  0.018723 +- 0.000593
cold: 187.737379 +- 0.156678  44.333407 +- 0.002254  6.272844 +- 0.004305

PRE:

hot:   67.937492 +- 0.064124  66.999535 +- 0.000488  0.066720 +- 0.001113
cold: 232.004379 +- 0.332811  66.999527 +- 0.000488  6.914634 +- 0.006568

POST:

hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219

Averages computed over all values <128k using a LFSR to generate order.
Cold numbers have a LFSR based branch trace buffer 'confuser' ran between
each int_sqrt() invocation.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 30493cc9dddb ("lib/int_sqrt.c: optimize square root algorithm")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Suggested-by: Anshul Garg <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: David Miller <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Davidson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test: delete five error messages for failed memory allocations

Omit extra messages for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Markus Elfring <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib: add module support to string tests

Extract the string test code into its own source file, to allow
compiling it either to a loadable module, or built into the kernel.

Fixes: 03270c13c5ffaa6a ("lib/string.c: add testcases for memset16/32/64")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>

This include was added by commit 187f1882b5b0 ("BUG: headers with
BUG/BUG_ON etc. need linux/bug.h") because BUG_ON() was used in this
header at that time.

Some time later, commit 6d75f366b924 ("lib: radix-tree: check accounting
of existing slot replacement users") removed the use of BUG_ON() from
this header.

Since then, there is no reason to include <linux/bug.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Chris Mi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/bitfield.h: include <linux/build_bug.h> instead of <linux/bug.h>

Since commit bc6245e5efd7 ("bug: split BUILD_BUG stuff out into
<linux/build_bug.h>"), #include <linux/build_bug.h> is better to pull
minimal headers needed for BUILG_BUG() family.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Cc: Dinan Gunawardena <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Ian Abbott <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

get_maintainer: add more --self-test options

Add tests for duplicate section headers, missing section content, link and
scm reachability.

Miscellanea:

o Add --self-test=<foo> options
(a comma separated list of any of sections, patterns, links or scm)
where the default without options is all tests
o Rename check_maintainers_patterns to self_test
o Rename self_test_pattern_info to self_test_info

[[email protected]: improvements]
Link: http://lkml.kernel.org/r/13e3986c374902fcf08ae947e36c5c608bbe3b79.1510075301.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reviewed-by: Tom Saeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

get_maintainer: add --self-test for internal consistency tests

Add "--self-test" option to get_maintainer.pl to show potential
issues in MAINTAINERS file(s) content.

Pattern check warnings are shown for "F" and "X" patterns found in
MAINTAINERS file(s) which do not match any files known by git.

Link: http://lkml.kernel.org/r/64994f911b3510d0f4c8ac2e113501dfcec1f3c9.1509559540.git.tom.saeger@oracle.com
Signed-off-by: Tom Saeger <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dynamic_debug documentation: minor fixes

Fix minor typo.

Fix missing words in explaining parsing of last line number.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Jason Baron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dynamic-debug-howto: fix optional/omitted ending line number to be LARGE instead of 0

line-range is supposed to treat "1-" as "1-endoffile", so
handle the special case by setting last_lineno to UINT_MAX.

Fixes this error:

dynamic_debug:ddebug_parse_query: last-line:0 < 1st-line:1
dynamic_debug:ddebug_exec_query: query parse failed

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Jason Baron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/umh.c: optimize 'proc_cap_handler()'

If 'write' is 0, we can avoid a call to spin_lock/spin_unlock.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Luis R. Rodriguez <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/compiler-clang.h: handle randomizable anonymous structs

The GCC randomize layout plugin can randomize the member offsets of
sensitive kernel data structures.  To use this feature, certain
annotations and members are added to the structures which affect the
member offsets even if this plugin is not used.

All of these structures are completely randomized, except for task_struct
which leaves out some of its members.  All the other members are wrapped
within an anonymous struct with the __randomize_layout attribute.  This is
done using the randomized_struct_fields_start and
randomized_struct_fields_end defines.

When the plugin is disabled, the behaviour of this attribute can vary
based on the GCC version.  For GCC 5.1+, this attribute maps to
__designated_init otherwise it is just an empty define but the anonymous
structure is still present.  For other compilers, both
randomized_struct_fields_start and randomized_struct_fields_end default
to empty defines meaning the anonymous structure is not introduced at
all.

So, if a module compiled with Clang, such as a BPF program, needs to
access task_struct fields such as pid and comm, the offsets of these
members as recognized by Clang are different from those recognized by
modules compiled with GCC.  If GCC 4.6+ is used to build the kernel,
this can be solved by introducing appropriate defines for Clang so that
the anonymous structure is seen when determining the offsets for the
members.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sandipan Das <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: fix "cut here" location for __WARN_TAINT architectures

Prior to v4.11, x86 used warn_slowpath_fmt() for handling WARN()s.
After WARN() was moved to using UD0 on x86, the warning text started
appearing _before_ the "cut here" line.  This appears to have been a
long-standing bug on architectures that used __WARN_TAINT, but it didn't
get fixed.

v4.11 and earlier on x86:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 2956 at drivers/misc/lkdtm_bugs.c:65 lkdtm_WARNING+0x21/0x30
  This is a warning message
  Modules linked in:

v4.12 and later on x86:

  This is a warning message
  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 2982 at drivers/misc/lkdtm_bugs.c:68 lkdtm_WARNING+0x15/0x20
  Modules linked in:

With this fix:

  ------------[ cut here ]------------
  This is a warning message
  WARNING: CPU: 3 PID: 3009 at drivers/misc/lkdtm_bugs.c:67 lkdtm_WARNING+0x15/0x20

Since the __FILE__ reporting happens as part of the UD0 handler, it
isn't trivial to move the message to after the WARNING line, but at
least we can fix the position of the "cut here" line so all the various
logging tools will start including the actual runtime warning message
again, when they follow the instruction and "cut here".

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 9a93848fe787 ("x86/debug: Implement __WARN() using UD0")
Signed-off-by: Kees Cook <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: define the "cut here" string in a single place

The "cut here" string is used in a few paths. Define it in a single
place.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lkdtm: include WARN format string

In order to test the ordering of WARN format strings, actually include
one in LKDTM.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iopoll: avoid -Wint-in-bool-context warning

When we pass the result of a multiplication as the timeout or the delay,
we can get a warning from gcc-7:

  drivers/mmc/host/bcm2835.c:596:149: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
  drivers/mfd/arizona-core.c:247:195: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
  drivers/gpu/drm/sun4i/sun4i_hdmi_i2c.c:49:27: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]

The warning is a bit questionable inside of a macro, but this is
intentional on the side of the gcc developers.  It is also an indication
of another problem: we evaluate the timeout and sleep arguments multiple
times, which can have undesired side-effects when those are complex
expressions.

This changes the two iopoll variants to use local variables for storing
copies of the timeouts.  This adds some more type safety, and avoids
both the double-evaluation and the gcc warning.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81484
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

parse-maintainers: add ability to specify filenames

parse-maintainers.pl is convenient, but currently hard-codes the
filenames that are used.

Allow user-specified filenames to simplify the use of the script.

Link: http://lkml.kernel.org/r/48703c068b3235223ffa3b2eb268fa0a125b25e0.1502251549.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel debug: support resetting WARN_ONCE for all architectures

Some architectures store the WARN_ONCE state in the flags field of the
bug_entry. Clear that one too when resetting once state through
/sys/kernel/debug/clear_warn_once

Pointed out by Michael Ellerman

Improves the earlier patch that add clear_warn_once.

[[email protected]: add a missing ifdef CONFIG_MODULES]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix unused var warning]
[[email protected]: Use 0200 for clear_warn_once file, per mpe]
[[email protected]: clear BUGFLAG_DONE in clear_once_table(), per mpe]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andi Kleen <[email protected]>
Tested-by: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel debug: support resetting WARN*_ONCE

I like _ONCE warnings because it's guaranteed that they don't flood the
log.

During testing I find it useful to reset the state of the once warnings,
so that I can rerun tests and see if they trigger again, or can
guarantee that a test run always hits the same warnings.

This patch adds a debugfs interface to reset all the _ONCE warnings so
that they appear again:

echo 1 > /sys/kernel/debug/clear_warn_once

This is implemented by putting all the warning booleans into a special
section, and clearing it.

[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andi Kleen <[email protected]>
Tested-by: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

sh/boot: add static stack-protector to pre-kernel

The sh decompressor code triggers stack-protector code generation when
using CONFIG_CC_STACKPROTECTOR_STRONG.  As done for arm and mips, add a
simple static stack-protector canary.  As this wasn't protected before,
the risk of using a weak canary is minimized.  Once the kernel is
actually up, a better canary is chosen.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

spelling.txt: add "unnecessary" typo variants

Add unnecessary typos by copying the necessary typos.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: use do-while in name_to_int()

Gcc doesn't know that "len" is guaranteed to be >=1 by dcache and
generates standard while-loop prologue duplicating loop condition.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function old new delta
name_to_int 104 77 -27

Link: http://lkml.kernel.org/r/20170912195213.GB17730@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: : uninline name_to_int()

Save ~360 bytes.

add/remove: 1/0 grow/shrink: 0/4 up/down: 104/-463 (-359)
function                                     old     new   delta
name_to_int                                    -     104    +104
proc_pid_lookup                              217     126     -91
proc_lookupfd_common                         212     121     -91
proc_task_lookup                             289     194     -95
__proc_create                                588     402    -186

Link: http://lkml.kernel.org/r/20170912194850.GA17730@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc, coredump: add CoreDumping flag to /proc/pid/status

Right now there is no convenient way to check if a process is being
coredumped at the moment.

It might be necessary to recognize such state to prevent killing the
process and getting a broken coredump.  Writing a large core might take
significant time, and the process is unresponsive during it, so it might
be killed by timeout, if another process is monitoring and
killing/restarting hanging tasks.

We're getting a significant number of corrupted coredump files on
machines in our fleet, just because processes are being killed by
timeout in the middle of the core writing process.

We do have a process health check, and some agent is responsible for
restarting processes which are not responding for health check requests.
Writing a large coredump to the disk can easily exceed the reasonable
timeout (especially on an overloaded machine).

This flag will allow the agent to distinguish processes which are being
coredumped, extend the timeout for them, and let them produce a full
coredump file.

To provide an ability to detect if a process is in the state of being
coredumped, we can expose a boolean CoreDumping flag in
/proc/pid/status.

Example:
$ cat core.sh
  #!/bin/sh

  echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern
  sleep 1000 &
  PID=$!

  cat /proc/$PID/status | grep CoreDumping
  kill -ABRT $PID
  sleep 1
  cat /proc/$PID/status | grep CoreDumping

$ ./core.sh
  CoreDumping: 0
  CoreDumping: 1

[[email protected]: document CoreDumping flag in /proc/<pid>/status]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: remove unneeded pageblock_skip_persistent() checks

Commit f3c931633a59 ("mm, compaction: persistently skip hugetlbfs
pageblocks") has introduced pageblock_skip_persistent() checks into
migration and free scanners, to make sure pageblocks that should be
persistently skipped are marked as such, regardless of the
ignore_skip_hint flag.

Since the previous patch introduced a new no_set_skip_hint flag, the
ignore flag no longer prevents marking pageblocks as skipped.  Therefore
we can remove the special cases.  The relevant pageblocks will be marked
as skipped by the common logic which marks each pageblock where no page
could be isolated.  This makes the code simpler.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: split off flag for not updating skip hints

Pageblock skip hints were added as a heuristic for compaction, which
shares core code with CMA.  Since CMA reliability would suffer from the
heuristics, compact_control flag ignore_skip_hint was added for the CMA
use case.  Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint
in update_pageblock_skip") the flag also means that CMA won't *update*
the skip hints in addition to ignoring them.

Today, direct compaction can also ignore the skip hints in the last
resort attempt, but there's no reason not to set them when isolation
fails in such case.  Thus, this patch splits off a new no_set_skip_hint
flag to avoid the updating, which only CMA sets.  This should improve
the heuristics a bit, and allow us to simplify the persistent skip bit
handling as the next step.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: extend pageblock_skip_persistent() to all compound pages

pageblock_skip_persistent() checks for HugeTLB pages of pageblock order.
When clearing pageblock skip bits for compaction, the bits are not
cleared for such pageblocks, because they cannot contain base pages
suitable for migration, nor free pages to use as migration targets.

This optimization can be simply extended to all compound pages of order
equal or larger than pageblock order, because migrating such pages (if
they support it) cannot help sub-pageblock fragmentation. This includes
THP's and also gigantic HugeTLB pages, which the current implementation
doesn't persistently skip due to a strict pageblock_order equality check
and not recognizing tail pages.

While THP pages are generally less "persistent" than HugeTLB, we can
still expect that if a THP exists at the point of
__reset_isolation_suitable(), it will exist also during the subsequent
compaction run. The time difference here could be actually smaller than
between a compaction run that sets a (non-persistent) skip bit on a THP,
and the next compaction run that observes it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: persistently skip hugetlbfs pageblocks

It is pointless to migrate hugetlb memory as part of memory compaction
if the hugetlb size is equal to the pageblock order.  No defragmentation
is occurring in this condition.

It is also pointless to for the freeing scanner to scan a pageblock
where a hugetlb page is pinned.  Unconditionally skip these pageblocks,
and do so peristently so that they are not rescanned until it is
observed that these hugepages are no longer pinned.

It would also be possible to do this by involving the hugetlb subsystem
in marking pageblocks to no longer be skipped when they hugetlb pages
are freed.  This is a simple solution that doesn't involve any
additional subsystems in pageblock skip manipulation.

[[email protected]: fix build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Tested-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: kcompactd should not ignore pageblock skip

Kcompactd is needlessly ignoring pageblock skip information. It is
doing MIGRATE_SYNC_LIGHT compaction, which is no more powerful than
MIGRATE_SYNC compaction.

If compaction recently failed to isolate memory from a set of
pageblocks, there is nothing to indicate that kcompactd will be able to
do so, or that it is beneficial from attempting to isolate memory.

Use the pageblock skip hint to avoid rescanning pageblocks needlessly
until that information is reset.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: shmem: remove unused info variable

Fix the following warning by removing the unused variable:

mm/shmem.c:3205:27: warning: variable 'info' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Corentin Labbe <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/dma-debug.c: fix incorrect pfn calculation

dma-debug reports the following warning:

  WARNING: CPU: 3 PID: 298 at kernel-4.4/lib/dma-debug.c:604
  debug _dma_assert_idle+0x1a8/0x230()
  DMA-API: cpu touching an active dma mapped cacheline [cln=0x00000882300]
  CPU: 3 PID: 298 Comm: vold Tainted: G        W  O    4.4.22+ #1
  Hardware name: MT6739 (DT)
  Call trace:
    debug_dma_assert_idle+0x1a8/0x230
    wp_page_copy.isra.96+0x118/0x520
    do_wp_page+0x4fc/0x534
    handle_mm_fault+0xd4c/0x1310
    do_page_fault+0x1c8/0x394
    do_mem_abort+0x50/0xec

I found that debug_dma_alloc_coherent() and debug_dma_free_coherent()
assume that dma_alloc_coherent() always returns a linear address.

However it's possible that dma_alloc_coherent() returns a non-linear
address.  In this case, page_to_pfn(virt_to_page(virt)) will return an
incorrect pfn.  If the pfn is valid and mapped as a COW page, we will
hit the warning when doing wp_page_copy().

Fix this by calculating pfn for linear and non-linear addresses.

[[email protected]: v4]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miles Chen <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/z3fold.c: use kref to prevent page free/compact race

There is a race in the current z3fold implementation between
do_compact() called in a work queue context and the page release
procedure when page's kref goes to 0.

do_compact() may be waiting for page lock, which is released by
release_z3fold_page_locked right before putting the page onto the
"stale" list, and then the page may be freed as do_compact() modifies
its contents.

The mechanism currently implemented to handle that (checking the
PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref
counter to guarantee that the page is not released if its compaction is
scheduled. It then becomes compaction function's responsibility to
decrease the counter and quit immediately if the page was actually
freed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix nodemask printing

The cleanup caused build warnings for constant mask pointers:

mm/mempolicy.c: In function `mpol_to_str':
./include/linux/nodemask.h:108:11: warning: the comparison will always evaluate as `true' for the address of `nodes' will never be NULL [-Waddress]

An earlier workaround I suggested was incorporated in the version that
got merged, but that only solved the problem for gcc-7 and higher, while
gcc-4.6 through gcc-6.x still warn.

This changes the printing again to use inline functions that make it
clear to the compiler that the line that does the NULL check has no idea
whether the argument is a constant NULL.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 0205f75571e3 ("mm: simplify nodemask printing")
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Zhangshaokun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from

- allow module init functions to be traced

- clean up some unused or not used by config events (saves space)

- clean up of trace histogram code

- add support for preempt and interrupt enabled/disable events

- other various clean ups

* tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
  tracing, thermal: Hide cpu cooling trace events when not in use
  tracing, thermal: Hide devfreq trace events when not in use
  ftrace: Kill FTRACE_OPS_FL_PER_CPU
  perf/ftrace: Small cleanup
  perf/ftrace: Fix function trace events
  perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function")
  tracing, dma-buf: Remove unused trace event dma_fence_annotate_wait_on
  tracing, memcg, vmscan: Hide trace events when not in use
  tracing/xen: Hide events that are not used when X86_PAE is not defined
  tracing: mark trace_test_buffer as __maybe_unused
  printk: Remove superfluous memory barriers from printk_safe
  ftrace: Clear hashes of stale ips of init memory
  tracing: Add support for preempt and irq enable/disable events
  tracing: Prepare to add preempt and irq trace events
  ftrace/kallsyms: Have /proc/kallsyms show saved mod init functions
  ftrace: Add freeing algorithm to free ftrace_mod_maps
  ftrace: Save module init functions kallsyms symbols for tracing
  ftrace: Allow module init functions to be traced
  ftrace: Add a ftrace_free_mem() function for modules to use
  tracing: Reimplement log2
  ...

Merge tag 'linux-kselftest-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest updates from Shuah Khan:
"This update to Kselftest consists of cleanup patches, fixes, and a new
  test for ion buffer sharing.

  Fixes include changes to skip firmware tests on systems that aren't
  configured to support them, as opposed to failing them"

* tag 'linux-kselftest-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: firmware: skip unsupported custom firmware fallback tests
  selftests: firmware: skip unsupported async loading tests
  selftests: memfd_test.c: fix compilation warning.
  selftests/ftrace: Introduce exit_pass and exit_fail
  selftests: ftrace: add more config fragments
  android/ion: userspace test utility for ion buffer sharing
  selftests: remove obsolete kconfig fragment for cpu-hotplug
  selftests: vdso_test: support ARM64 targets
  selftests/ftrace: Do not use arch dependent do_IRQ as a target function
  selftests: breakpoints: fix compile error on breakpoint_test_arm64
  selftests: add missing test result status in memory-hotplug test
  selftests/exec: include cwd in long path calculation
  selftests: seccomp: update .gitignore with newly added tests
  selftests: vm: Update .gitignore with newly added tests
  selftests: timers: Update .gitignore with newly added tests

Merge tag 'acpi-fix-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"This fixes a possible memory leak in an error code path in one of the
utility routines (Xiongfeng Wang)"

* tag 'acpi-fix-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / utils: Fix memory leak in acpi_evaluate_reference() error path

Merge tag 'pm-fixes-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull two power management fixes from Rafael Wysocki:
"This is the change making /proc/cpuinfo on x86 report current CPU
  frequency in "cpu MHz" again in all cases and an additional one
  dealing with an overzealous check in one of the helper routines in the
  runtime PM framework"

* tag 'pm-fixes-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / runtime: Drop children check from __pm_runtime_set_status()
  x86 / CPU: Always show current CPU frequency in /proc/cpuinfo

Merge tag 'drm-for-v4.15-amd-dc' of git://people.freedesktop.org/~airlied/linux

Pull amdgpu DC display code for Vega from Dave Airlie:
"This is the pull request for the AMD DC (display code) layer which is
  a requirement to program the display engines on the new Vega and Raven
  based GPUs. It also contains support for all amdgpu supported GPUs
  (CIK, VI, Polaris), which has to be enabled. It is also a kms atomic
  modesetting compatible driver (unlike the current in-tree display
  code).

  I've kept it separate from drm-next because it may have some things
  that cause you to reject it.

  Background story:

  AMD have an internal team creating a shared OS codebase for display at
  hw bring up time using information from their hardware teams. This
  process doesn't lead to the most Linux friendly/looking code but we
  have worked together on cleaning a lot of it up and dealing with
  sparse/smatch/checkpatch, and having their team internally adhere to
  Linux coding standards.

  This tree is a complete history rebased since they started opening it,
  we decided not to squash it down as the history may have some value.
  Some of the commits therefore might not reach kernel standards, and we
  are steadily training people in AMD to better write commit msgs.

  There is a major bunch of generated bandwidth calculation and
  verification code that comes from their hardware team. On Vega and
  before this is float calculations, on Raven (DCN10) this is double
  based. They do the required things to do FP in the kernel, and I could
  understand this might raise some issues. Rewriting the bandwidth would
  be a major undertaken in reverification, it's non-trivial to work out
  if a display can handle the complete set of mode information thrown at
  it.

  Future story:

  There is a TODO list with this, and it address most of the remaining
  things that would be nice to refine/remove. The DCN10 code is still
  under development internally and they push out a lot of patches quite
  regularly and are supporting this code base with their display team. I
  think we've reached the point where keeping it out of tree is going to
  motivate distributions to start carrying the code, so I'd prefer we
  get it in tree. I think this code is slightly better than STAGING
  quality but not massively so, I'd really like to see that float/double
  magic gone and fixed point used, but AMD don't seem to think the
  accuracy and revalidation of the code is worth the effort"

* tag 'drm-for-v4.15-amd-dc' of git://people.freedesktop.org/~airlied/linux: (1110 commits)
  drm/amd/display: fix MST link training fail division by 0
  drm/amd/display: Fix formatting for null pointer dereference fix
  drm/amd/display: Remove dangling planes on dc commit state
  drm/amd/display: add flip_immediate to commit update for stream
  drm/amd/display: Miss register MST encoder cbs
  drm/amd/display: Fix warnings on S3 resume
  drm/amd/display: use num_timing_generator instead of pipe_count
  drm/amd/display: use configurable FBC option in dm
  drm/amd/display: fix AZ clock not enabled before program AZ endpoint
  amdgpu/dm: Don't use DRM_ERROR in amdgpu_dm_atomic_check
  amd/display: Fix potential null dereference in dce_calcs.c
  amdgpu/dm: Remove unused forward declaration
  drm/amdgpu: Remove unused dc_stream from amdgpu_crtc
  amdgpu/dc: Fix double unlock in amdgpu_dm_commit_planes
  amdgpu/dc: Fix missing null checks in amdgpu_dm.c
  amdgpu/dc: Fix potential null dereferences in amdgpu_dm.c
  amdgpu/dc: fix more indentation warnings
  amdgpu/dc: handle allocation failures in dc_commit_planes_to_stream.
  amdgpu/dc: fix indentation warning from smatch.
  amdgpu/dc: fix non-ansi function decls.
  ...

Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux

Pull thermal management updates from Zhang Rui:

- introduce brcmstb AVS TMON thermal driver (Brian Norris)

- add Rockchip RV1108 support in rockchip thermal driver (Rocky Hao)

- major rework on HISI driver plus additional support of hisi3660
   (Daniel Lezcano)

- add nvmem-cells binding on imx6sx (Leonard Crestez)

- fix a NULL pointer dereference on ti thermal driver unloading (Tony
   Lindgren)

- improve tmon tool to make it easier to cross-compile tmon (Markus
   Mayer)

- add Coffee Lake and Cannon Lake support for intel processor and pch
   thermal drivers (Srinivas Pandruvada)

- other small fixes and cleanups (Arvind Yadav, Colin Ian King, Allen
   Wild, Nicolin Chen, Baruch SiachNiklas Söderlund, Arnd Bergmann)

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (44 commits)
  thermal: pch: Add Cannon Lake support
  thermal: int340x: processor_thermal: Add Coffee Lake support
  thermal: int340x: processor_thermal: Add Cannon Lake support
  thermal: bxt: remove redundant variable trip
  thermal: cpu_cooling: pr_err() strings should end with newlines
  thermal: add brcmstb AVS TMON driver
  Documentation: devicetree: add binding for Broadcom STB AVS TMON
  thermal/drivers/hisi: Add support for hi3660 SoC
  thermal/drivers/hisi: Prepare to add support for other hisi platforms
  thermal/drivers/hisi: Add platform prefix to function name
  thermal/drivers/hisi: Put platform code together
  thermal/drivers/qcom-spmi: Use devm_iio_channel_get
  thermal/drivers/generic-iio-adc: Switch tz request to devm version
  thermal/drivers/step_wise: Fix temperature regulation misbehavior
  thermal/drivers/hisi: Use round up step value
  thermal/drivers/hisi: Move the clk setup in the corresponding functions
  thermal/drivers/hisi: Remove mutex_lock in the code
  thermal/drivers/hisi: Remove thermal data back pointer
  thermal/drivers/hisi: Convert long to int
  thermal/drivers/hisi: Rename and remove unused field
  ...

Merge branch 'parisc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc updates from Helge Deller:
"Highlights:

   - one important fix from Dave to prevent kernel crash when userspace
     hands over invalid values to our in-kernel CAS implementation.

   - added CPU topology support, including multi-core scheduler support
     on PA8900 CPUs

  Minor changes:

   - minor fixes for sparse (from Luc)

   - drop duplicates for CPU_BIG_ENDIAN from parisc and sparc top
     Kconfig files (from Babu)

   - reorganized parisc PDC (firmware-access) header files for usage
     from userspace. Required for upcoming qemu parisc emulator and
     SeaBIOS fork to support parisc"

* 'parisc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  arch: Fix duplicates in Kconfig for parisc and sparc
  parisc: Make some PDC structures accessible in uapi headers
  parisc: Pass endianness info to sparse
  parisc: Add CPU topology support
  parisc: Fix validity check of pointer size argument in new CAS implementation

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull second round of s390 updates from Martin Schwidefsky:

- rework of the vdso code to avoid the use of the access register mode

- use perf AUX buffers for the transport of diagnostic sample data

- add perf_regs and user stack dump support

- enable perf call graphs for user space programs

- add perf register support for floating-point registers

- all remaining s390 related timer_setup conversions

- bug fixes and cleanups

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (30 commits)
  s390: remove unused parameter from Makefile
  zfcp: purely mechanical update using timer API, plus blank lines
  s390/scsi: Convert timers to use timer_setup()
  s390/cpum_sf: correctly set the PID and TID in perf samples
  s390/cpum_sf: load program parameter at sampler enablement
  s390/perf: add perf register support for floating-point registers
  s390/perf: extend perf_regs support to include floating-point registers
  s390/perf: define common DWARF register string table
  s390/perf: add support for perf_regs and libdw
  s390/perf: add perf_regs support and user stack dump
  s390/cpum_sf: do not register PMU if no sampling mode is authorized
  s390/cpumf: remove raw event support in basic-only sampling mode
  s390/perf: add callback to perf to enable using AUX buffer
  s390/cpumf: enable using AUX buffer
  s390/cpumf: introduce AUX buffer for dump diagnostic sample data
  s390/disassembler: increase show_code buffer size
  s390: Remove CONFIG_HARDENED_USERCOPY
  s390: enable CPU alternatives unconditionally
  s390/nmi: remove unused code
  s390/mm: remove unused code
  ...

Merge tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
   - Revalidate "." and ".." correctly on open
   - Avoid RCU usage in tracepoints
   - Fix ugly referral attributes
   - Fix a typo in nomigration mount option
   - Revert "NFS: Move the flock open mode check into nfs_flock()"

  Features:
   - Implement a stronger send queue accounting system for NFS over RDMA
   - Switch some atomics to the new refcount_t type

  Other bugfixes and cleanups:
   - Clean up access mode bits
   - Remove special-case revalidations in nfs_opendir()
   - Improve invalidating NFS over RDMA memory for async operations that
     time out
   - Handle NFS over RDMA replies with a worqueue
   - Handle NFS over RDMA sends with a workqueue
   - Fix up replaying interrupted requests
   - Remove dead NFS over RDMA definitions
   - Update NFS over RDMA copyright information
   - Be more consistent with bool initialization and comparisons
   - Mark expected switch fall throughs
   - Various sunrpc tracepoint cleanups
   - Fix various OPEN races
   - Fix a typo in nfs_rename()
   - Use common error handling code in nfs_lock_and_join_request()
   - Check that some structures are properly cleaned up during
     net_exit()
   - Remove net pointer from dprintk()s"

* tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (62 commits)
  NFS: Revert "NFS: Move the flock open mode check into nfs_flock()"
  NFS: Fix typo in nomigration mount option
  nfs: Fix ugly referral attributes
  NFS: super: mark expected switch fall-throughs
  sunrpc: remove net pointer from messages
  nfs: remove net pointer from messages
  sunrpc: exit_net cleanup check added
  nfs client: exit_net cleanup check added
  nfs/write: Use common error handling code in nfs_lock_and_join_requests()
  NFSv4: Replace closed stateids with the "invalid special stateid"
  NFSv4: nfs_set_open_stateid must not trigger state recovery for closed state
  NFSv4: Check the open stateid when searching for expired state
  NFSv4: Clean up nfs4_delegreturn_done
  NFSv4: cleanup nfs4_close_done
  NFSv4: Retry NFS4ERR_OLD_STATEID errors in layoutreturn
  pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-close
  NFSv4: Don't try to CLOSE if the stateid 'other' field has changed
  NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.
  NFS: Fix a typo in nfs_rename()
  NFSv4: Fix open create exclusive when the server reboots
  ...

Merge tag 'ecryptfs-4.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs updates from Tyler Hicks:

- miscellaneous code cleanups and refactoring

- fix a possible use after free bug when unloading the module

* tag 'ecryptfs-4.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: constify attribute_group structures.
  ecryptfs: remove unnecessary i_version bump
  ecryptfs: use ARRAY_SIZE
  ecryptfs: Adjust four checks for null pointers
  ecryptfs: Return an error code only as a constant in ecryptfs_add_global_auth_tok()
  ecryptfs: Delete 21 error messages for a failed memory allocation
  eCryptfs: use after free in ecryptfs_release_messaging()
  ecryptfs: remove private bin2hex implementation
  ecryptfs: add missing \n to end of various error messages

Merge tag 'xfs-4.15-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
"A couple more patches to fix a locking bug and some inconsistent type
  usage in some of the new code:

   - Fix a forgotten rcu read unlock

   - Fix some inconsistent integer type usage"

* tag 'xfs-4.15-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix type usage
  xfs: fix forgotten rcu read unlock when skipping inode reclaim

NFS: Revert "NFS: Move the flock open mode check into nfs_flock()"

Commit e12937279c8b "NFS: Move the flock open mode check into nfs_flock()"
changed NFSv3 behavior for flock() such that the open mode must match the
lock type, however that requirement shouldn't be enforced for flock().

Signed-off-by: Benjamin Coddington <[email protected]>
Cc: [email protected] # v4.12
Signed-off-by: Anna Schumaker <[email protected]>

NFS: Fix typo in nomigration mount option

The option was incorrectly masking off all other options.

Signed-off-by: Joshua Watt <[email protected]>
Cc: [email protected] #3.7
Signed-off-by: Anna Schumaker <[email protected]>

nfs: Fix ugly referral attributes

Before traversing a referral and performing a mount, the mounted-on
directory looks strange:

dr-xr-xr-x. 2 4294967294 4294967294 0 Dec 31 1969 dir.0

nfs4_get_referral is wiping out any cached attributes with what was
returned via GETATTR(fs_locations), but the bit mask for that
operation does not request any file attributes.

Retrieve owner and timestamp information so that the memcpy in
nfs4_get_referral fills in more attributes.

Changes since v1:
- Don't request attributes that the client unconditionally replaces
- Request only MOUNTED_ON_FILEID or FILEID attribute, not both
- encode_fs_locations() doesn't use the third bitmask word

Fixes: 6b97fd3da1ea ("NFSv4: Follow a referral")
Suggested-by: Pradeep Thomas <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Cc: [email protected]
Signed-off-by: Anna Schumaker <[email protected]>

NFS: super: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 703509
Addresses-Coverity-ID: 703510
Addresses-Coverity-ID: 703511
Addresses-Coverity-ID: 703512
Addresses-Coverity-ID: 703513
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

sunrpc: remove net pointer from messages

Publishing of net pointer is not safe, use net->ns.inum as net ID
[  171.391947] RPC:       created new rpcb local clients
    (rpcb_local_clnt: ..., rpcb_local_clnt4: ...) for net f00001e7
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

nfs: remove net pointer from messages

Publishing of net pointer is not safe,
use net->ns.inum instead

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

sunrpc: exit_net cleanup check added

Be sure that all_clients list initialized in net_init hook was return
to initial state.

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

nfs client: exit_net cleanup check added

Be sure that nfs_client_list and nfs_volume_list lists initialized
in net_init hook were return to initial state in net_exit hook.

Signed-off-by: Vasily Averin <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

nfs/write: Use common error handling code in nfs_lock_and_join_requests()

Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Replace closed stateids with the "invalid special stateid"

When decoding a CLOSE, replace the stateid returned by the server
with the "invalid special stateid" described in RFC5661, Section 8.2.3.

In nfs_set_open_stateid_locked, ignore stateids from closed state.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: nfs_set_open_stateid must not trigger state recovery for closed state

In nfs_set_open_stateid_locked, we must ignore stateids from closed state.

Reported-by: Andrew W Elble <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Check the open stateid when searching for expired state

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Clean up nfs4_delegreturn_done

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: cleanup nfs4_close_done

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Retry NFS4ERR_OLD_STATEID errors in layoutreturn

If our layoutreturn returns an NFS4ERR_OLD_STATEID, then try to
update the stateid and retry.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-close

If our layoutreturn on close operation returns an NFS4ERR_OLD_STATEID,
then try to update the stateid and retry. We know that there should
be no further LAYOUTGET requests being launched.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Don't try to CLOSE if the stateid 'other' field has changed

If the stateid is no longer recognised on the server, either due to a
restart, or due to a competing CLOSE call, then we do not have to
retry. Any open contexts that triggered a reopen of the file, will
also act as triggers for any CLOSE for the updated stateids.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.

If we're racing with an OPEN, then retry the operation instead of
declaring it a success.

Signed-off-by: Trond Myklebust <[email protected]>
[Andrew W Elble: Fix a typo in nfs4_refresh_open_stateid]
Signed-off-by: Anna Schumaker <[email protected]>

NFS: Fix a typo in nfs_rename()

On successful rename, the "old_dentry" is retained and is attached to
the "new_dir", so we need to call nfs_set_verifier() accordingly.

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Fix open create exclusive when the server reboots

If the server that does not implement NFSv4.1 persistent session
semantics reboots while we are performing an exclusive create,
then the return value of NFS4ERR_DELAY when we replay the open
during the grace period causes us to lose the verifier.
When the grace period expires, and we present a new verifier,
the server will then correctly reply NFS4ERR_EXIST.

This commit ensures that we always present the same verifier when
replaying the OPEN.

Reported-by: Tigran Mkrtchyan <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Add a tracepoint to document open stateid updates

Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4: Fix OPEN / CLOSE race

Ben Coddington has noted the following race between OPEN and CLOSE
on a single client.

Process 1 Process 2 Server
========= ========= ======

1)  OPEN file
2) OPEN file
3) Process OPEN (1) seqid=1
4) Process OPEN (2) seqid=2
5) Reply OPEN (2)
6) Receive reply (2)
7) new stateid, seqid=2

8) CLOSE file, using
stateid w/ seqid=2
9) Reply OPEN (1)
10( Process CLOSE (8)
11) Reply CLOSE (8)
12) Forget stateid
file closed

13) Receive reply (7)
14) Forget stateid
file closed.

15) Receive reply (1).
16) New stateid seqid=1
    is really the same
    stateid that was
    closed.

IOW: the reply to the first OPEN is delayed. Since "Process 2" does
not wait before closing the file, and it does not cache the closed
stateid, then when the delayed reply is finally received, it is treated
as setting up a new stateid by the client.

The fix is to ensure that the client processes the OPEN and CLOSE calls
in the same order in which the server processed them.

This commit ensures that we examine the seqid of the stateid
returned by OPEN. If it is a new stateid, we assume the seqid
must be equal to the value 1, and that each state transition
increments the seqid value by 1 (See RFC7530, Section 9.1.4.2,
and RFC5661, Section 8.2.2).

If the tracker sees that an OPEN returns with a seqid that is greater
than the cached seqid + 1, then it bumps a flag to ensure that the
caller waits for the RPCs carrying the missing seqids to complete.

Note that there can still be pathologies where the server crashes before
it can even send us the missing seqids. Since the OPEN call is still
holding a slot when it waits here, that could cause the recovery to
stall forever. To avoid that, we time out after a 5 second wait.

Reported-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

sunrpc: Add rpc_request static trace point

Display information about the RPC procedure being requested in the
trace log. This sometimes critical information cannot always be
derived from other RPC trace entries.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

sunrpc: Fix rpc_task_begin trace point

The rpc_task_begin trace point always display a task ID of zero.
Move the trace point call site so that it picks up the new task ID.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

SUNRPC: Fix parsing failure in trace points with XIDs

mount.nf-11159   8....   905.248380: xprt_transmit:        [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049
mount.nf-11159   8....   905.248381: rpc_task_sleep:       task:6210@1 flags=0e80 state=0005 status=0 timeout=60000 queue=xprt_pending
kworker/-1591    1....   905.248419: xprt_lookup_rqst:     [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049
kworker/-1591    1....   905.248423: xprt_complete_rqst:   [FAILED TO PARSE] xid=351291440 status=24 addr=192.168.2.5 port=20049

Byte swapping is not available during trace-cmd report.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

net: sunrpc: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFS: Fix bool initialization/comparison

Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFS: Avoid RCU usage in tracepoints

There isn't an obvious way to acquire and release the RCU lock during a
tracepoint, so we can't use the rpc_peeraddr2str() function here.
Instead, rely on the client's cl_hostname, which should have similar
enough information without needing an rcu_dereference().

Reported-by: Dave Jones <[email protected]>
Cc: [email protected] # v3.12
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Update copyright notices

Credit work contributed by Oracle engineers since 2014.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Remove include for linux/prefetch.h

Clean up. This include should have been removed by
commit 23826c7aeac7 ("xprtrdma: Serialize credit accounting again").

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Devesh Sharma <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

rpcrdma: Remove C structure definitions of XDR data items

Clean up: C-structure style XDR encoding and decoding logic has
been replaced over the past several merge windows on both the
client and server. These data structures are no longer used.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Devesh Sharma <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode

Lift the Send and LocalInv completion handlers out of soft IRQ mode
to make room for other work. Also, move the Send CQ to a different
CPU than the CPU where the Receive CQ is running, for improved
scalability.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Devesh Sharma <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:

- Report constant st_ino values across copy-up even if underlying
   layers are on different filesystems, but using different st_dev
   values for each layer.

   Ideally we'd report the same st_dev across the overlay, and it's
   possible to do for filesystems that use only 32bits for st_ino by
   unifying the inum space. It would be nice if it wasn't a choice of 32
   or 64, rather filesystems could report their current maximum (that
   could change on resize, so it wouldn't be set in stone).

- miscellaneus fixes and a cleanup of ovl_fill_super(), that was long
   overdue.

- created a path_put_init() helper that clears out the pointers after
   putting the ref.

   I think this could be useful elsewhere, so added it to <linux/path.h>

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (30 commits)
  ovl: remove unneeded arg from ovl_verify_origin()
  ovl: Put upperdentry if ovl_check_origin() fails
  ovl: rename ufs to ofs
  ovl: clean up getting lower layers
  ovl: clean up workdir creation
  ovl: clean up getting upper layer
  ovl: move ovl_get_workdir() and ovl_get_lower_layers()
  ovl: reduce the number of arguments for ovl_workdir_create()
  ovl: change order of setup in ovl_fill_super()
  ovl: factor out ovl_free_fs() helper
  ovl: grab reference to workbasedir early
  ovl: split out ovl_get_indexdir() from ovl_fill_super()
  ovl: split out ovl_get_lower_layers() from ovl_fill_super()
  ovl: split out ovl_get_workdir() from ovl_fill_super()
  ovl: split out ovl_get_upper() from ovl_fill_super()
  ovl: split out ovl_get_lowerstack() from ovl_fill_super()
  ovl: split out ovl_get_workpath() from ovl_fill_super()
  ovl: split out ovl_get_upperpath() from ovl_fill_super()
  ovl: use path_put_init() in error paths for ovl_fill_super()
  vfs: add path_put_init()
  ...

Merge tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking update from Jeff Layton:
"A couple of fixes for a patch that went into v4.14, and the bug report
  just came in a few days ago.. It passes my (minimal) testing, and has
  been in linux-next for a few days now.

  I also would like to get my address changed in MAINTAINERS to clear
  that hurdle"

* tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
  fcntl: don't leak fd reference when fixup_compat_flock fails
  MAINTAINERS: s/[email protected]/[email protected]/

Merge branch 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull cramfs updates from Al Viro:
"Nicolas Pitre's cramfs work"

* 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cramfs: rehabilitate it
  cramfs: add mmap support
  cramfs: implement uncompressed and arbitrary data block positioning
  cramfs: direct memory access support

Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs updates from Al Viro:
"Assorted stuff, really no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: grab the lock instead of blocking in __fd_install during resizing
  vfs: stop clearing close on exec when closing a fd
  include/linux/fs.h: fix comment about struct address_space
  fs: make fiemap work from compat_ioctl
  coda: fix 'kernel memory exposure attempt' in fsync
  pstore: remove unneeded unlikely()
  vfs: remove unneeded unlikely()
  stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
  make vfs_ustat() static
  do_handle_open() should be static
  elf_fdpic: fix unused variable warning
  fold destroy_super() into __put_super()
  new helper: destroy_unused_super()
  fix address space warnings in ipc/
  acct.h: get rid of detritus

Merge branch 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull get_user_pages_fast() conversion from Al Viro:
"A bunch of places switched to get_user_pages_fast()"

* 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ceph: use get_user_pages_fast()
  pvr2fs: use get_user_pages_fast()
  atomisp: use get_user_pages_fast()
  st: use get_user_pages_fast()
  via_dmablit(): use get_user_pages_fast()
  fsl_hypervisor: switch to get_user_pages_fast()
  rapidio: switch to get_user_pages_fast()
  vchiq_2835_arm: switch to get_user_pages_fast()

Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter updates from Al Viro:

- bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
   same pile went into mainline (and stable) in late September.

- fs/iomap.c iov_iter-related fixes

- new primitive - iov_iter_for_each_range(), which applies a function
   to kernel-mapped segments of an iov_iter.

   Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
   the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
   latter is not hard to fix if the need ever appears, the former is by
   design.

   Another related primitive will have to wait for the next cycle - it
   passes page + offset + size instead of pointer + size, and that one
   will be usable for everything _except_ kvec. Unfortunately, that one
   didn't get exposure in -next yet, so...

- a bit more lustre iov_iter work, including a use case for
   iov_iter_for_each_range() (checksum calculation)

- vhost/scsi leak fix in failure exit

- misc cleanups and detritectomy...

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  iomap_dio_actor(): fix iov_iter bugs
  switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
  lustre: switch struct ksock_conn to iov_iter
  vhost/scsi: switch to iov_iter_get_pages()
  fix a page leak in vhost_scsi_iov_to_sgl() error recovery
  new primitive: iov_iter_for_each_range()
  lnet_return_rx_credits_locked: don't abuse list_entry
  xen: don't open-code iov_iter_kvec()
  orangefs: remove detritus from struct orangefs_kiocb_s
  kill iov_shorten()
  bio_alloc_map_data(): do bmd->iter setup right there
  bio_copy_user_iov(): saner bio size calculation
  bio_map_user_iov(): get rid of copying iov_iter
  bio_copy_from_iter(): get rid of copying iov_iter
  move more stuff down into bio_copy_user_iov()
  blk_rq_map_user_iov(): move iov_iter_advance() down
  bio_map_user_iov(): get rid of the iov_for_each()
  bio_map_user_iov(): move alignment check into the main loop
  don't rely upon subsequent bio_add_pc_page() calls failing
  ... and with iov_iter_get_pages_alloc() it becomes even simpler
  ...

Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull compat and uaccess updates from Al Viro:

- {get,put}_compat_sigset() series

- assorted compat ioctl stuff

- more set_fs() elimination

- a few more timespec64 conversions

- several removals of pointless access_ok() in places where it was
   followed only by non-__ variants of primitives

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
  coredump: call do_unlinkat directly instead of sys_unlink
  fs: expose do_unlinkat for built-in callers
  ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
  ipmi: get rid of pointless access_ok()
  pi433: sanitize ioctl
  cxlflash: get rid of pointless access_ok()
  mtdchar: get rid of pointless access_ok()
  r128: switch compat ioctls to drm_ioctl_kernel()
  selection: get rid of field-by-field copyin
  VT_RESIZEX: get rid of field-by-field copyin
  i2c compat ioctls: move to ->compat_ioctl()
  sched_rr_get_interval(): move compat to native, get rid of set_fs()
  mips: switch to {get,put}_compat_sigset()
  sparc: switch to {get,put}_compat_sigset()
  s390: switch to {get,put}_compat_sigset()
  ppc: switch to {get,put}_compat_sigset()
  parisc: switch to {get,put}_compat_sigset()
  get_compat_sigset()
  get rid of {get,put}_compat_itimerspec()
  io_getevents: Use timespec64 to represent timeouts
  ...

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull more block layer updates from Jens Axboe:
"A followup pull request, with some parts that either needed a bit more
  testing before going in, merge sync, or just later arriving fixes.
  This contains:

   - Timer related updates from Kees. These were purposefully delayed
     since I didn't want to pull in a later v4.14-rc tag to my block
     tree.

   - ide-cd prep sense buffer fix from Bart. Also delayed, as not to
     clash with the late fix we put into 4.14-rc.

   - Small BFQ updates series from Luca and Paolo.

   - Single nvmet fix from James, fixing a non-functional case there.

   - Bio fast clone fix from Michael, which made bcache return the wrong
     data for some cases.

   - Legacy IO path regression hang fix from Ming"

* 'for-linus' of git://git.kernel.dk/linux-block:
  bio: ensure __bio_clone_fast copies bi_partno
  nvmet_fc: fix better length checking
  block: wake up all tasks blocked in get_request()
  block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
  block, bfq: update blkio stats outside the scheduler lock
  block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
  doc, block, bfq: update max IOPS sustainable with BFQ
  ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
  md: Convert timers to use timer_setup()
  block: swim3: Convert timers to use timer_setup()
  block/aoe: Convert timers to use timer_setup()
  amifloppy: Convert timers to use timer_setup()
  block/floppy: Convert callback to pass timer_list

fs, nfs: convert nfs_client.cl_count from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs_client.cl_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert nfs_lock_context.count from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs_lock_context.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert nfs4_lock_state.ls_count from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
   increments aren't allowed
- counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs4_lock_state.ls_count  is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert nfs_cache_defer_req.count from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs_cache_defer_req.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert nfs4_ff_layout_mirror.ref from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs4_ff_layout_mirror.ref is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert pnfs_layout_hdr.plh_refcount from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable pnfs_layout_hdr.plh_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert pnfs_layout_segment.pls_refcount from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Hans Liljestrand <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: David Windsor <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

fs, nfs: convert nfs4_pnfs_ds.ds_count from atomic_t to refcount_t

atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs4_pnfs_ds.ds_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

NFSv4.1: Fix up replays of interrupted requests

If the previous request on a slot was interrupted before it was
processed by the server, then our slot sequence number may be out of whack,
and so we try the next operation using the old sequence number.

The problem with this, is that not all servers check to see that the
client is replaying the same operations as previously when they decide
to go to the replay cache, and so instead of the expected error of
NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could
(if the operations match up) be mistaken by the client for a new reply.

To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op
in order to resync our slot sequence number.

Cc: Olga Kornievskaia <[email protected]>
[[email protected]: fix an Oops]
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Remove atomic send completion counting

The sendctx circular queue now guarantees that xprtrdma cannot
overflow the Send Queue, so remove the remaining bits of the
original Send WQE counting mechanism.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: RPC completion should wait for Send completion

When an RPC Call includes a file data payload, that payload can come
from pages in the page cache, or a user buffer (for direct I/O).

If the payload can fit inline, xprtrdma includes it in the Send
using a scatter-gather technique. xprtrdma mustn't allow the RPC
consumer to re-use the memory where that payload resides before the
Send completes. Otherwise, the new contents of that memory would be
exposed by an HCA retransmit of the Send operation.

So, block RPC completion on Send completion, but only in the case
where a separate file data payload is part of the Send. This
prevents the reuse of that memory while it is still part of a Send
operation without an undue cost to other cases.

Waiting is avoided in the common case because typically the Send
will have completed long before the RPC Reply arrives.

These days, an RPC timeout will trigger a disconnect, which tears
down the QP. The disconnect flushes all waiting Sends. This bounds
the amount of time the reply handler has to wait for a Send
completion.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Refactor rpcrdma_deferred_completion

Invoke a common routine for releasing hardware resources (for
example, invalidating MRs). This needs to be done whether an
RPC Reply has arrived or the RPC was terminated early.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Add a field of bit flags to struct rpcrdma_req

We have one boolean flag in rpcrdma_req today. I'd like to add more
flags, so convert that boolean to a bit flag.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

xprtrdma: Add data structure to manage RDMA Send arguments

Problem statement:

Recently Sagi Grimberg <[email protected]> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.

This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.

Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), and the Send retransmission
exposes that data on the wire.

Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.

After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:

- Inline Send buffers are registered via the local DMA key, and
   are already left DMA mapped for the lifetime of a transport
   connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
   DMA unmap those pages after the Send completes. But like
   inline send buffers, they are registered via the local DMA key,
   and thus will not need to be invalidated

In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.

Design notes:

In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.

The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.

The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).

The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that the a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).

When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.

As a final touch, Jason Gunthorpe <[email protected]>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.

Reported-by: Sagi Grimberg <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>