Git Repo - linux.git/log

gfs2: Rework how rgrp buffer_heads are managed

Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.

When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first calls rgrp_go_sync and then
rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
gfs2_rgrp_brelse() that dropped the buffer_head reference count.
In most cases, the reference count went to zero, which is right.
However, there were other places where the buffers are handled
differently.

After rgrp_go_sync, do_xmote called rgrp_go_inval which called
gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
truncate_inode_pages_range would get rid of the pages in memory,
but only if the reference count drops to 0.

Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.

The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they were likely modified by the other node who
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.

For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.

A big part of the problem was that the rgrp buffer were released
in multiple places: The go_unlock function would release them when
the glock was released rather than when the glock is demoted,
which is clearly wrong because our intent was to cache them until
the glock is demoted from SH or EX.

This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:

1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
   We don't want to release the buffers or zero the pointers when
   syncing for the reasons stated above. It only makes sense to
   release them when the glock is actually invalidated (go_inval).
   And when we do, then we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
   eliminated, as we've talked about doing many times before.
   The go_unlock function was called too early in the glock dq
   process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
   That will now happen automatically when the rgrp glocks are
   demoted, and shouldn't happen any sooner or later than that.
   Instead, function gfs2_clear_rgrpd has been modified to demote
   the rgrp glocks, and therefore, free those pages, before the
   remaining glocks are culled by gfs2_gl_hash_clear. This
   prevents the gl_object from hanging around when the glocks are
   culled.

Signed-off-by: Bob Peterson <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>

gfs2: clear ail1 list when gfs2 withdraws

This patch fixes a bug in which function gfs2_log_flush can get into
an infinite loop when a gfs2 file system is withdrawn. The problem
is the infinite loop "for (;;)" in gfs2_log_flush which would never
finish because the io error and subsequent withdraw prevented the
items from being taken off the ail list.

This patch tries to clean up the mess by allowing withdraw situations
to move not-in-flight buffer_heads to the ail2 list, where they will
be dealt with later.

Signed-off-by: Bob Peterson <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>

gfs2: Introduce concept of a pending withdraw

File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.

This patch introduces a new concept of a pending withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.

Signed-off-by: Bob Peterson <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>

gfs2: Return bool from gfs2_assert functions

The gfs2_assert functions only print messages when the filesystem hasn't been
withdrawn yet, and they indicate whether or not they've printed something in
their return value. However, none of the callers use that information, so
simply return whether or not the assert has failed.

(The gfs2_assert functions are still backwards; they return false when an
assertion is true.)

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>

gfs2: Turn gfs2_consist into void functions

Change the various gfs2_consist functions to return void.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>

gfs2: Remove usused cluster_wide arguments of gfs2_consist functions

These arguments are always passed as 0, and they are never evaluated.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>

gfs2: Report errors before withdraw

In gfs2_rgrp_verify and compute_bitstructs, make sure to report errors before
withdrawing the filesystem: otherwise, when we withdraw first and withdraw is
configured to panic, we'll never get to the error reporting.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>

gfs2: Split gfs2_lm_withdraw into two functions

Split gfs2_lm_withdraw into a function that prints an error message and a
function that withdraws the filesystem.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>

gfs2: fix O_SYNC write handling

In gfs2_file_write_iter, for direct writes, the error checking in the buffered
write fallback case is incomplete. This can cause inode write errors to go
undetected. Fix and clean up gfs2_file_write_iter along the way.

Based on a proposed fix by Christoph Hellwig <[email protected]>.

Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support")
Cc: [email protected] # v4.19+
Signed-off-by: Andreas Gruenbacher <[email protected]>

gfs2: move setting current->backing_dev_info

Set current->backing_dev_info just around the buffered write calls to
prepare for the next fix.

Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support")
Cc: [email protected] # v4.19+
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>

gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0

When the first log header in a journal happens to have a sequence
number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit,
and return an uninitialized jhead with seq 0. This can cause failures
in the caller. For instance, a mount fails in one test case.

The correct behavior is for it to continue searching through the journal
to find the correct journal head with the highest sequence number.

Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks")
Cc: [email protected] # v5.2+
Signed-off-by: Abhi Das <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>

Merge tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

- Fix some corner cases on filesystems with a block size < page size.

- Fix a corner case that could expose incorrect access times over nfs.

- Revert an otherwise sensible revoke accounting cleanup that causes
   assertion failures. The revoke accounting is whacky and needs to be
   fixed properly before we can add back this cleanup.

- Various other minor cleanups.

In addition, please expect to see another pull request from Bob Peterson
about his gfs2 recovery patch queue shortly.

* tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: eliminate tr_num_revoke_rm"
  gfs2: remove unused LBIT macros
  fs/gfs2: remove unused IS_DINODE and IS_LEAF macros
  gfs2: Remove GFS2_MIN_LVB_SIZE define
  gfs2: Fix incorrect variable name
  gfs2: Avoid access time thrashing in gfs2_inode_lookup
  gfs2: minor cleanup: remove unneeded variable ret in gfs2_jdata_writepage
  gfs2: eliminate ssize parameter from gfs2_struct2blk
  gfs2: Another gfs2_find_jhead fix

Merge tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fix from Darrick Wong:
"A single patch fixing an off-by-one error when we're checking to see
how far we're gotten into an EOF page"

* tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: Fix page_mkwrite off-by-one errors

Merge branch 'akpm' (patches from Andrew)

Pull updates from Andrew Morton:
"Most of -mm and quite a number of other subsystems: hotfixes, scripts,
  ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

  MM is fairly quiet this time.  Holidays, I assume"

* emailed patches from Andrew Morton <[email protected]>: (118 commits)
  kcov: ignore fault-inject and stacktrace
  include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  execve: warn if process starts with executable stack
  reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  init/main.c: fix quoted value handling in unknown_bootoption
  init/main.c: remove unnecessary repair_env_string in do_initcall_level
  init/main.c: log arguments and environment passed to init
  fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  fs/binfmt_elf.c: coredump: delete duplicated overflow check
  fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  fs/binfmt_elf.c: make BAD_ADDR() unlikely
  fs/binfmt_elf.c: better codegen around current->mm
  fs/binfmt_elf.c: don't copy ELF header around
  fs/binfmt_elf.c: fix ->start_code calculation
  fs/binfmt_elf.c: smaller code generation around auxv vector fill
  lib/find_bit.c: uninline helper _find_next_bit()
  lib/find_bit.c: join _find_next_bit{_le}
  uapi: rename ext2_swab() to swab() and share globally in swab.h
  lib/scatterlist.c: adjust indentation in __sg_alloc_table
  ...

Merge tag 'modules-for-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
"Summary of modules changes for the 5.6 merge window:

   - Add "MS" (SHF_MERGE|SHF_STRINGS) section flags to __ksymtab_strings
     to indicate to the linker that it can perform string deduplication
     (i.e., duplicate strings are reduced to a single copy in the string
     table). This means any repeated namespace string would be merged to
     just one entry in __ksymtab_strings.

   - Various code cleanups and small fixes (fix small memleak in error
     path, improve moduleparam docs, silence rcu warnings, improve error
     logging)"

* tag 'modules-for-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module.h: Annotate mod_kallsyms with __rcu
  module: avoid setting info->name early in case we can fall back to info->mod->name
  modsign: print module name along with error message
  kernel/module: Fix memleak in module_add_modinfo_attrs()
  export.h: reduce __ksymtab_strings string duplication by using "MS" section flags
  moduleparam: fix kerneldoc
  modules: lockdep: Suppress suspicious RCU usage warning

Merge tag 'mips_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS changes from Paul Burton:
"Nothing too big or scary in here:

   - Support mremap() for the VDSO, primarily to allow CRIU to restore
     the VDSO to its checkpointed location.

   - Restore the MIPS32 cBPF JIT, after having reverted the enablement
     of the eBPF JIT for MIPS32 systems in the 5.5 cycle.

   - Improve cop0 counter synchronization behaviour whilst onlining CPUs
     by running with interrupts disabled.

   - Better match FPU behaviour when emulating multiply-accumulate
     instructions on pre-r6 systems that implement IEEE754-2008 style
     MACs.

   - Loongson64 kernels now build using the MIPS64r2 ISA, allowing them
     to take advantage of instructions introduced by r2.

   - Support for the Ingenic X1000 SoC & the really nice little CU Neo
     development board that's using it.

   - Support for WMAC on GARDENA Smart Gateway devices.

   - Lots of cleanup & refactoring of SGI IP27 (Origin 2*) support in
     preparation for introducing IP35 (Origin 3*) support.

   - Various Kconfig & Makefile cleanups"

* tag 'mips_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (60 commits)
  MIPS: PCI: Add detection of IOC3 on IO7, IO8, IO9 and Fuel
  MIPS: Loongson64: Disable exec hazard
  MIPS: Loongson64: Bump ISA level to MIPSR2
  MIPS: Make DIEI support as a config option
  MIPS: OCTEON: octeon-irq: fix spelling mistake "to" -> "too"
  MIPS: asm: local: add barriers for Loongson
  MIPS: Loongson64: Select mac2008 only feature
  MIPS: Add MAC2008 Support
  Revert "MIPS: Add custom serial.h with BASE_BAUD override for generic kernel"
  MIPS: sort MIPS and MIPS_GENERIC Kconfig selects alphabetically (again)
  MIPS: make CPU_HAS_LOAD_STORE_LR opt-out
  MIPS: generic: don't unconditionally select PINCTRL
  MIPS: don't explicitly select LIBFDT in Kconfig
  MIPS: sync-r4k: do slave counter synchronization with disabled HW interrupts
  MIPS: SGI-IP30: Check for valid pointer before using it
  MIPS: syscalls: fix indentation of the 'SYSNR' message
  MIPS: boot: fix typo in 'vmlinux.lzma.its' target
  MIPS: fix indentation of the 'RELOCS' message
  dt-bindings: Document loongson vendor-prefix
  MIPS: CU1000-Neo: Refresh defconfig to support HWMON and WiFi.
  ...

Merge tag 'arc-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC updates from Vineet Gupta:

- Wire up clone3 syscall

- ARCv2 FPU state save/restore across context switch

- AXS10x platform and misc fixes

* tag 'arc-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARCv2: fpu: preserve userspace fpu state
  ARC: fpu: declutter code, move bits out into fpu.h
  ARC: wireup clone3 syscall
  ARC: [plat-axs10x]: Add missing multicast filter number to GMAC node
  ARC: update feature support for jump-labels

Merge tag 'riscv-for-linus-5.6-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:
"This contains a handful of patches for this merge window:

   - Support for kasan

   - 32-bit physical addresses on rv32i-based systems

   - Support for CONFIG_DEBUG_VIRTUAL

   - DT entry for the FU540 GPIO controller, which has recently had a
     device driver merged

  These boot a buildroot-based system on QEMU's virt board for me"

* tag 'riscv-for-linus-5.6-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: dts: Add DT support for SiFive FU540 GPIO driver
  riscv: mm: add support for CONFIG_DEBUG_VIRTUAL
  riscv: keep 32-bit kernel to 32-bit phys_addr_t
  kasan: Add riscv to KASAN documentation.
  riscv: Add KASAN support
  kasan: No KASAN's memmove check if archs don't have it.

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"Misc fixes:

   - three fixes and a cleanup for the resctrl code

   - a HyperV fix

   - a fix to /proc/kcore contents in live debugging sessions

   - a fix for the x86 decoder opcode map"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/decoder: Add TEST opcode to Group3-2
  x86/resctrl: Clean up unused function parameter in mkdir path
  x86/resctrl: Fix a deadlock due to inaccurate reference
  x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
  x86/resctrl: Fix use-after-free when deleting resource groups
  x86/hyper-v: Add "polling" bit to hv_synic_sint
  x86/crash: Define arch_crash_save_vmcoreinfo() if CONFIG_CRASH_CORE=y

kcov: ignore fault-inject and stacktrace

Don't instrument 3 more files that contain debugging facilities and
produce large amounts of uninteresting coverage for every syscall.

The following snippets are sprinkled all over the place in kcov traces
in a debugging kernel.  We already try to disable instrumentation of
stack unwinding code and of most debug facilities.  I guess we did not
use fault-inject.c at the time, and stacktrace.c was somehow missed (or
something has changed in kernel/configs).  This change both speeds up
kcov (kernel doesn't need to store these PCs, user-space doesn't need to
process them) and frees trace buffer capacity for more useful coverage.

  should_fail
  lib/fault-inject.c:149
  fail_dump
  lib/fault-inject.c:45

  stack_trace_save
  kernel/stacktrace.c:124
  stack_trace_consume_entry
  kernel/stacktrace.c:86
  stack_trace_consume_entry
  kernel/stacktrace.c:89
  ... a hundred frames skipped ...
  stack_trace_consume_entry
  kernel/stacktrace.c:93
  stack_trace_consume_entry
  kernel/stacktrace.c:86

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()

Use PHYS_PFN() macro in io_mapping_map_atomic_wc() instead of open coded
variant.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

execve: warn if process starts with executable stack

There were few episodes of silent downgrade to an executable stack over
years:

1) linking innocent looking assembly file will silently add executable
   stack if proper linker options is not given as well:

$ cat f.S
.intel_syntax noprefix
.text
.globl f
f:
        ret

$ cat main.c
void f(void);
int main(void)
{
        f();
        return 0;
}

$ gcc main.c f.S
$ readelf -l ./a.out
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                         0x0000000000000000 0x0000000000000000  RWE    0x10
^^^

2) converting C99 nested function into a closure
   https://nullprogram.com/blog/2019/11/15/

void intsort2(int *base, size_t nmemb, _Bool invert)
{
    int cmp(const void *a, const void *b)
    {
        int r = *(int *)a - *(int *)b;
        return invert ? -r : r;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}

will silently require stack trampolines while non-closure version will
not.

Without doubt this behaviour is documented somewhere, add a warning so
that developers and users can at least notice.  After so many years of
x86_64 having proper executable stack support it should not cause too
many problems.

Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()

The variable inode may be NULL in reiserfs_insert_item(), but there is
no check before accessing the member of inode.

Fix this by adding NULL pointer check before calling reiserfs_debug().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yunfeng Ye <[email protected]>
Cc: zhengbin <[email protected]>
Cc: Hu Shiyuan <[email protected]>
Cc: Feilong Lin <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

init/main.c: fix misleading "This architecture does not have kernel memory protection" message

This message leads to thinking that memory protection is not implemented
for the said architecture, whereas absence of CONFIG_STRICT_KERNEL_RWX
only means that memory protection has not been selected at compile time.

Don't print this message when CONFIG_ARCH_HAS_STRICT_KERNEL_RWX is
selected by the architecture. Instead, print "Kernel memory protection
not selected by kernel config."

Link: http://lkml.kernel.org/r/62477e446d9685459d4f27d193af6ff1bd69d55f.1578557581.git.christophe.leroy@c-s.fr
Signed-off-by: Christophe Leroy <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

init/main.c: fix quoted value handling in unknown_bootoption

Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2.

unknown_bootoption passes unrecognized command line arguments to init as
either environment variables or arguments.  Some of the logic in the
function is broken for quoted command line arguments.

When an argument of the form param="value" is processed by parse_args
and passed to unknown_bootoption, the command line has

  param\0"value\0

with val pointing to the beginning of value.  The helper function
repair_env_string is then used to restore the '=' character that was
removed by parse_args, and strip the quotes off fully.  This results in

  param=value\0\0

and val ends up pointing to the 'a' instead of the 'v' in value.  This
bug was introduced when repair_env_string was refactored into a separate
function, and the decrement of val in repair_env_string became dead
code.

This causes two problems in unknown_bootoption in the two places where
the val pointer is used as a substitute for the length of param:

1. An argument of the form param=".value" is misinterpreted as a
   potential module parameter, with the result that it will not be
   placed in init's environment.

2. An argument of the form param="value" is checked to see if param is
   an existing environment variable that should be overwritten, but the
   comparison is off-by-one and compares 'param=v' instead of 'param='
   against the existing environment. So passing, for example,
   TERM="vt100" on the command line results in init being passed both
   TERM=linux and TERM=vt100 in its environment.

Patch 1 adds logging for the arguments and environment passed to init
and is independent of the rest: it can be dropped if this is
unnecessarily verbose.

Patch 2 removes repair_env_string from initcall parameter parsing in
do_initcall_level, as that uses a separate copy of the command line now
and the repairing is no longer necessary.

Patch 3 fixes the bug in unknown_bootoption by recording the length of
param explicitly instead of implying it from val-param.

This patch (of 3):

Commit a99cd1125189 ("init: fix bug where environment vars can't be
passed via boot args") introduced two minor bugs in unknown_bootoption
by factoring out the quoted value handling into a separate function.

When value is quoted, repair_env_string will move the value up 1 byte to
strip the quotes, so val in unknown_bootoption no longer points to the
actual location of the value.

The result is that an argument of the form param=".value" is mistakenly
treated as a potential module parameter and is not placed in init's
environment, and an argument of the form param="value" can result in a
duplicate environment variable: eg TERM="vt100" on the command line will
result in both TERM=linux and TERM=vt100 being placed into init's
environment.

Fix this by recording the length of the param before calling
repair_env_string instead of relying on val.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arvind Sankar <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Krzysztof Mazur <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

init/main.c: remove unnecessary repair_env_string in do_initcall_level

Since commit 08746a65c296 ("init: fix in-place parameter modification
regression"), parse_args in do_initcall_level is called on a copy of
saved_command_line. It is unnecessary to call repair_env_string during
this parsing, as this copy is not used for anything later.

Remove the now unnecessary arguments from repair_env_string as well.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arvind Sankar <[email protected]>
Cc: Krzysztof Mazur <[email protected]>
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

init/main.c: log arguments and environment passed to init

Extend logging in `run_init_process` to also show the arguments and
environment that we are passing to init.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arvind Sankar <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Krzysztof Mazur <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: coredump: allow process with empty address space to coredump

Unmapping whole address space at once with

munmap(0, (1ULL<<47) - 4096)

or equivalent will create empty coredump.

It is silly way to exit, however registers content may still be useful.

The right to coredump is fundamental right of a process!

Link: http://lkml.kernel.org/r/20191222150137.GA1277@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: coredump: delete duplicated overflow check

array_size() macro will do overflow check anyway.

Link: http://lkml.kernel.org/r/20191222144009.GB24341@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: coredump: allocate core ELF header on stack

Comment says ELF header is "too large to be on stack". 64 bytes on
64-bit is not large by any means.

Link: http://lkml.kernel.org/r/20191222143850.GA24341@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: make BAD_ADDR() unlikely

If some mapping goes past TASK_SIZE it will be rejected by kernel which
means no such userspace binaries exist.

Mark every such check as unlikely.

Link: http://lkml.kernel.org/r/20191215124355.GA21124@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: better codegen around current->mm

"current->mm" pointer is stable in general except few cases one of which
execve(2).  Compiler can't treat is as stable but it _is_ stable most of
the time.  During ELF loading process ->mm becomes stable right after
flush_old_exec().

Help compiler by caching current->mm, otherwise it continues to refetch
it.

add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-141 (-141)
Function                                     old     new   delta
elf_core_dump                               5062    5039     -23
load_elf_binary                             5426    5308    -118

Note: other cases are left as is because it is either pessimisation or
no change in binary size.

Link: http://lkml.kernel.org/r/20191215124755.GB21124@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: don't copy ELF header around

ELF header is read into bprm->buf[] by generic execve code.

Save a memcpy and allocate just one header for the interpreter instead
of two headers (64 bytes instead of 128 on 64-bit).

Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: fix ->start_code calculation

Only executable segments should be accounted to ->start_code just like
they do to ->end_code (correctly).

Link: http://lkml.kernel.org/r/20191208171410.GB19716@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/binfmt_elf.c: smaller code generation around auxv vector fill

Filling auxv vector as array with index (auxv[i++] = ...) generates
terrible code.  "saved_auxv" should be reworked because it is the worst
member of mm_struct by size/usefullness ratio but do it later.

Meanwhile help gcc a little with *auxv++ idiom.

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-127 (-127)
Function                                     old     new   delta
load_elf_binary                             5470    5343    -127

Link: http://lkml.kernel.org/r/20191208172301.GD19716@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/find_bit.c: uninline helper _find_next_bit()

It saves 25% of .text for arm64, and more for BE architectures.

Before:
  $ size lib/find_bit.o
     text    data     bss     dec     hex filename
     1012      56       0    1068     42c lib/find_bit.o

After:
  $ size lib/find_bit.o
     text    data     bss     dec     hex filename
      776      56       0     832     340 lib/find_bit.o

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Allison Randal <[email protected]>
Cc: William Breathitt Gray <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/find_bit.c: join _find_next_bit{_le}

_find_next_bit and _find_next_bit_le are very similar functions. It's
possible to join them by adding 1 parameter and a couple of simple
checks. It's simplify maintenance and make possible to shrink the size
of .text by un-inlining the unified function (in the following patch).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Allison Randal <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: William Breathitt Gray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uapi: rename ext2_swab() to swab() and share globally in swab.h

ext2_swab() is defined locally in lib/find_bit.c However it is not
specific to ext2, neither to bitmaps.

There are many potential users of it, so rename it to just swab() and
move to include/uapi/linux/swab.h

ABI guarantees that size of unsigned long corresponds to BITS_PER_LONG,
therefore drop unneeded cast.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Allison Randal <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: William Breathitt Gray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/scatterlist.c: adjust indentation in __sg_alloc_table

Clang warns:

  ../lib/scatterlist.c:314:5: warning: misleading indentation; statement
  is not part of the previous 'if' [-Wmisleading-indentation]
                          return -ENOMEM;
                          ^
  ../lib/scatterlist.c:311:4: note: previous statement is here
                          if (prv)
                          ^
  1 warning generated.

This warning occurs because there is a space before the tab on this
line.  Remove it so that the indentation is consistent with the Linux
kernel coding style and clang no longer warns.

Link: http://lkml.kernel.org/r/[email protected]
Link: https://github.com/ClangBuiltLinux/linux/issues/830
Fixes: edce6820a9fd ("scatterlist: prevent invalid free when alloc fails")
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

btrfs: use larger zlib buffer for s390 hardware compression

In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).

This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs. In case of memory pressure, fall back to a single
page buffer during workspace allocation.

The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add zlib_deflate_dfltcc_enabled() function

Add a new function to zlib.h checking if s390 Deflate-Conversion
facility is installed and enabled.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/boot: add dfltcc= kernel command line parameter

Add the new kernel command line parameter 'dfltcc=' to configure s390
zlib hardware support.

Format: { on | off | def_only | inf_only | always }
on:       s390 zlib hardware support for compression on
           level 1 and decompression (default)
off:      No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
           only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
           only (decompression)
always:   Same as 'on' but ignores the selected compression
           level always using hardware support (used for debugging)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add s390 hardware support for kernel zlib_inflate

Add decompression functions to zlib_dfltcc library. Update zlib_inflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware inflate
decompression.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Mikhail Zaslonko <[email protected]>
Co-developed-by: Ilya Leoshkevich <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390/boot: rename HEAP_SIZE due to name collision

Change the conflicting macro name in preparation for zlib_inflate
hardware support.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mikhail Zaslonko <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/zlib: add s390 hardware support for kernel zlib_deflate

Patch series "S390 hardware support for kernel zlib", v3.

With IBM z15 mainframe the new DFLTCC instruction is available.  It
implements deflate algorithm in hardware (Nest Acceleration Unit - NXU)
with estimated compression and decompression performance orders of
magnitude faster than the current zlib.

This patchset adds s390 hardware compression support to kernel zlib.
The code is based on the userspace zlib implementation:

https://github.com/madler/zlib/pull/410

The coding style is also preserved for future maintainability.  There is
only limited set of userspace zlib functions represented in kernel.
Apart from that, all the memory allocation should be performed in
advance.  Thus, the workarea structures are extended with the parameter
lists required for the DEFLATE CONVENTION CALL instruction.

Since kernel zlib itself does not support gzip headers, only Adler-32
checksum is processed (also can be produced by DFLTCC facility).  Like
it was implemented for userspace, kernel zlib will compress in hardware
on level 1, and in software on all other levels.  Decompression will
always happen in hardware (when enabled).

Two DFLTCC compression calls produce the same results only when they
both are made on machines of the same generation, and when the
respective buffers have the same offset relative to the start of the
page.  Therefore care should be taken when using hardware compression
when reproducible results are desired.  However it does always produce
the standard conform output which can be inflated anyway.

The new kernel command line parameter 'dfltcc' is introduced to
configure s390 zlib hardware support:

    Format: { on | off | def_only | inf_only | always }
     on:       s390 zlib hardware support for compression on
               level 1 and decompression (default)
     off:      No s390 zlib hardware support
     def_only: s390 zlib hardware support for deflate
               only (compression on level 1)
     inf_only: s390 zlib hardware support for inflate
               only (decompression)
     always:   Same as 'on' but ignores the selected compression
               level always using hardware support (used for debugging)

The main purpose of the integration of the NXU support into the kernel
zlib is the use of hardware deflate in btrfs filesystem with on-the-fly
compression enabled.  Apart from that, hardware support can also be used
during boot for decompressing the kernel or the ramdisk image

With the patch for btrfs expanding zlib buffer from 1 to 4 pages (patch
6) the following performance results have been achieved using the
ramdisk with btrfs.  These are relative numbers based on throughput rate
and compression ratio for zlib level 1:

  Input data              Deflate rate   Inflate rate   Compression ratio
                          NXU/Software   NXU/Software   NXU/Software
  stream of zeroes        1.46           1.02           1.00
  random ASCII data       10.44          3.00           0.96
  ASCII text (dickens)    6,21           3.33           0.94
  binary data (vmlinux)   8,37           3.90           1.02

This means that s390 hardware deflate can provide up to 10 times faster
compression (on level 1) and up to 4 times faster decompression (refers
to all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using DD
command-line utility on btrfs on a Fedora 30 based internal driver in
native LPAR on a z15 system.  Results may vary based on individual
workload, configuration and software levels.

This patch (of 9):

Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL
implementation and related compression functions.  Update zlib_deflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware deflate.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Mikhail Zaslonko <[email protected]>
Co-developed-by: Ilya Leoshkevich <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Eduard Shishkin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iio: adc: qcom-vadc-common: use <linux/units.h> helpers

This switches the qcom-vadc-common to use milli_kelvin_to_millicelsius()
in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Jonathan Cameron <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: armada: remove unused TO_MCELSIUS macro

This removes unused TO_MCELSIUS() macro.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iwlwifi: use <linux/units.h> helpers

This switches the iwlwifi driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Luca Coelho <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iwlegacy: use <linux/units.h> helpers

This switches the iwlegacy driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.

[[email protected]: fix build warnings with format string]
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Kalle Valo <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>

This removes the kelvin to/from Celsius conversion helper macros in
<linux/thermal.h> which were switched to the inline helper functions in
<linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nvme: hwmon: switch to use <linux/units.h> helpers

This switches the nvme driver to use kelvin_to_millicelsius() and
millicelsius_to_kelvin() in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: intel_pch: switch to use <linux/units.h> helpers

This switches the intel pch thermal driver to use
deci_kelvin_to_millicelsius() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thermal: int340x: switch to use <linux/units.h> helpers

This switches the int340x thermal zone driver to use
deci_kelvin_to_millicelsius() and millicelsius_to_deci_kelvin() in
<linux/units.h> instead of helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

platform/x86: intel_menlow: switch to use <linux/units.h> helpers

This switches the intel_menlow driver to use deci_kelvin_to_celsius()
and celsius_to_deci_kelvin() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

This also removes a trailing space, while we're at it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

platform/x86: asus-wmi: switch to use <linux/units.h> helpers

The asus-wmi driver doesn't implement the thermal device functionality
directly, so including <linux/thermal.h> just for
DECI_KELVIN_TO_CELSIUS() is a bit odd.

This switches the asus-wmi driver to use deci_kelvin_to_millicelsius()
in <linux/units.h>.

The format string is changed from %d to %ld due to function returned
type.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ACPI: thermal: switch to use <linux/units.h> helpers

This switches the ACPI thermal zone driver to use
celsius_to_deci_kelvin(), deci_kelvin_to_celsius(), and
deci_kelvin_to_millicelsius_with_offset() in <linux/units.h> instead of
helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius
conversion helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/units.h: add helpers for kelvin to/from Celsius conversion

Patch series "add header file for kelvin to/from Celsius conversion
helpers", v4.

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems, and switches all the users of
conversion helpers in <linux/thermal.h> to use <linux/units.h> helpers.

This patch (of 12):

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just
for the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems.  It is intended to replace the
helpers in <linux/thermal.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Akinobu Mita <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Sujith Thomas <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Amit Kucheria <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Emmanuel Grumbach <[email protected]>
Cc: Luca Coelho <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Hartmut Knaack <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Cc: Peter Meerwald-Stadler <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store

Currently when an error code -EIO or -ENOSPC in the for-loop of
writeback_store the error code is being overwritten by a ret = len
assignment at the end of the function and the error codes are being
lost. Fix this by assigning ret = len at the start of the function and
remove the assignment from the end, hence allowing ret to be preserved
when error codes are assigned to it.

Addresses Coverity ("Unused value")

Link: http://lkml.kernel.org/r/[email protected]
Fixes: a939888ec38b ("zram: support idle/huge page writeback")
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: try to avoid worst-case scenario on same element pages

The worst-case scenario on finding same element pages is that almost all
elements are same at the first glance but only last few elements are
different.

Since the same element tends to be grouped from the beginning of the
pages, if we check the first element with the last element before
looping through all elements, we might have some chances to quickly
detect non-same element pages.

1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with simple test script which counts the iteration
    number and measures the speed at off-line

Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes =
512.  The speed is based on the time to consume page_same_filled()
function only.  The result, on average, is listed as below:

                                     Num of Iter    Speed(MB/s)
  Looping-Forward (Orig)                 38            99265
  Looping-Backward                       36           102725
  Last-element-check (This Patch)        33           125072

The result shows that the average iteration count decreases by 13% and
the speed increases by 25% with this patch.  This patch does not
increase the overall time complexity, though.

I also ran simpler version which uses backward loop.  Just looping
backward also makes some improvement, but less than this patch.

[[email protected]: fix off-by-one]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Taejoon Song <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fix comments related to node reclaim

As zone reclaim has been replaced by node reclaim, this patch fixes
related comments.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Lee <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block

memory_block structure elements 'hw' and 'phys_callback' are not getting
used. This was originally added with commit 3947be1969a9 ("[PATCH]
memory hotplug: sysfs and add/remove functions") but never seem to have
been used. Just drop them now.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/mm.h: remove dead code totalram_pages_set()

totalram_pages_set() was introduced in commit ca79b0c211af ("mm: convert
totalram_pages and totalhigh_pages variables to atomic"), but no one
uses it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/mm.h: clean up obsolete check on space in page->flags

The check was intended to make sure we don't overrun page flags. But
it's obsolete because it doesn't include LAST_CPUPID_WIDTH nor
KASAN_TAG_WIDTH.

Just remove check since we already have it covered in
linux/page-flags-layout.h (near the end of the file).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: potential NULL dereference on error in init_zswap()

The "pool" pointer can be NULL at the end of the init_zswap(). (We
would allocate a new pool later in that situation)

So in the error handling then we need to make sure pool is a valid
pointer before calling "zswap_pool_destroy(pool);" because that function
dereferences the argument.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 93d4dfa9fbd0 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/zswap.c: add allocation hysteresis if pool limit is hit

zswap will always try to shrink pool when zswap is full.  If there is a
high pressure on zswap it will result in flipping pages in and out zswap
pool without any real benefit, and the overall system performance will
drop.  The previous discussion on this subject [1] ended up with a
suggestion to implement a sort of hysteresis to refuse taking pages into
zswap pool until it has sufficient space if the limit has been hit.
This is my take on this.

Hysteresis is controlled with a sysfs-configurable parameter (namely,
/sys/kernel/debug/zswap/accept_threhsold_percent).  It specifies the
threshold at which zswap would start accepting pages again after it
became full.  Setting this parameter to 100 disables the hysteresis and
sets the zswap behavior to pre-hysteresis state.

[1] https://lkml.org/lkml/2019/11/8/949

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Cc: Dan Streetman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_isolation: fix potential warning from user

It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but should avoid triggering it from
userspace, i.e, from is_mem_section_removable() because it could crash
the system by a non-root user if warn_on_panic is set.

While at it, simplify the code a bit by removing an unnecessary jump
label.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hotplug: silence a lockdep splat with printk()

It is not that hard to trigger lockdep splats by calling printk from
under zone->lock.  Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well).  There are some console drivers which do
allocate from the printk context as well and those should be fixed.  In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.

So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock.  Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.

Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug.  The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.

While at it, remove a similar but unnecessary debug-only printk() as
well.  A sample of one of those lockdep splats is,

  WARNING: possible circular locking dependency detected
  ------------------------------------------------------
  test.sh/8653 is trying to acquire lock:
  ffffffff865a4460 (console_owner){-.-.}, at:
  console_unlock+0x207/0x750

  but task is already holding lock:
  ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&(&zone->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock+0x2f/0x40
         rmqueue_bulk.constprop.21+0xb6/0x1160
         get_page_from_freelist+0x898/0x22c0
         __alloc_pages_nodemask+0x2f3/0x1cd0
         alloc_pages_current+0x9c/0x110
         allocate_slab+0x4c6/0x19c0
         new_slab+0x46/0x70
         ___slab_alloc+0x58b/0x960
         __slab_alloc+0x43/0x70
         __kmalloc+0x3ad/0x4b0
         __tty_buffer_request_room+0x100/0x250
         tty_insert_flip_string_fixed_flag+0x67/0x110
         pty_write+0xa2/0xf0
         n_tty_write+0x36b/0x7b0
         tty_write+0x284/0x4c0
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         redirected_tty_write+0x6a/0xc0
         do_iter_write+0x248/0x2a0
         vfs_writev+0x106/0x1e0
         do_writev+0xd4/0x180
         __x64_sys_writev+0x45/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #2 (&(&port->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         tty_port_tty_get+0x20/0x60
         tty_port_default_wakeup+0xf/0x30
         tty_port_tty_wakeup+0x39/0x40
         uart_write_wakeup+0x2a/0x40
         serial8250_tx_chars+0x22e/0x440
         serial8250_handle_irq.part.8+0x14a/0x170
         serial8250_default_handle_irq+0x5c/0x90
         serial8250_interrupt+0xa6/0x130
         __handle_irq_event_percpu+0x78/0x4f0
         handle_irq_event_percpu+0x70/0x100
         handle_irq_event+0x5a/0x8b
         handle_edge_irq+0x117/0x370
         do_IRQ+0x9e/0x1e0
         ret_from_intr+0x0/0x2a
         cpuidle_enter_state+0x156/0x8e0
         cpuidle_enter+0x41/0x70
         call_cpuidle+0x5e/0x90
         do_idle+0x333/0x370
         cpu_startup_entry+0x1d/0x1f
         start_secondary+0x290/0x330
         secondary_startup_64+0xb6/0xc0

  -> #1 (&port_lock_key){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         serial8250_console_write+0x3e4/0x450
         univ8250_console_write+0x4b/0x60
         console_unlock+0x501/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5

  -> #0 (console_owner){-.-.}:
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:

  Chain exists of:
    console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&(&zone->lock)->rlock);
                                 lock(&(&port->lock)->rlock);
                                 lock(&(&zone->lock)->rlock);
    lock(console_owner);

   *** DEADLOCK ***

  9 locks held by test.sh/8653:
   #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
  vfs_write+0x25f/0x290
   #1: ffff888277618880 (&of->mutex){+.+.}, at:
  kernfs_fop_write+0x128/0x240
   #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
  kernfs_fop_write+0x138/0x240
   #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
  lock_device_hotplug_sysfs+0x16/0x50
   #4: ffff8884374f4990 (&dev->mutex){....}, at:
  device_offline+0x70/0x110
   #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
  __offline_pages+0xbf/0xa10
   #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
  percpu_down_write+0x87/0x2f0
   #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0
   #8: ffffffff865a4920 (console_lock){+.+.}, at:
  vprintk_emit+0x100/0x340

  stack backtrace:
  Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
  BIOS U34 05/21/2019
  Call Trace:
   dump_stack+0x86/0xca
   print_circular_bug.cold.31+0x243/0x26e
   check_noncircular+0x29e/0x2e0
   check_prev_add+0x107/0xea0
   validate_chain+0x8fc/0x1200
   __lock_acquire+0x5b3/0xb40
   lock_acquire+0x126/0x280
   console_unlock+0x269/0x750
   vprintk_emit+0x10d/0x340
   vprintk_default+0x1f/0x30
   vprintk_func+0x44/0xd4
   printk+0x9f/0xc5
   __offline_isolated_pages.cold.52+0x2f/0x30a
   offline_isolated_pages_cb+0x17/0x30
   walk_system_ram_range+0xda/0x160
   __offline_pages+0x79c/0xa10
   offline_pages+0x11/0x20
   memory_subsys_offline+0x7e/0xc0
   device_offline+0xd5/0x110
   state_store+0xc6/0xe0
   dev_attr_store+0x3f/0x60
   sysfs_kf_write+0x89/0xb0
   kernfs_fop_write+0x188/0x240
   __vfs_write+0x50/0xa0
   vfs_write+0x105/0x290
   ksys_write+0xc6/0x160
   __x64_sys_write+0x43/0x50
   do_syscall_64+0xcc/0x76c
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: pass in nid to online_pages()

Patch series "mm/memory_hotplug: pass in nid to online_pages()".

Simplify onlining code and get rid of find_memory_block(). Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.

This patch (of 2):

No need to lookup the memory block, we can directly pass in the nid.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma()

The jump labels try_prev and none are not really needed in
find_mergeable_anon_vma(), eliminate them to improve readability.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: fix defrag setting if newline is not used

If thp defrag setting "defer" is used and a newline is *not* used when
writing to the sysfs file, this is interpreted as the "defer+madvise"
option.

This is because we do prefix matching and if five characters are written
without a newline, the current code ends up comparing to the first five
bytes of the "defer+madvise" option and using that instead.

Use the more appropriate sysfs_streq() that handles the trailing newline
for us.  Since this doubles as a nice cleanup, do it in enabled_store()
as well.

The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared.  With a newline, "defer\n" does not match
"defer+"madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching is only comparing the first five
bytes).  End result is that writing "defer" is broken unless it has an
additional trailing character.

This means that writing "madv" in the past would match and set
"madvise".  With strict checking, that no longer is the case but it is
unlikely anybody is currently doing this.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
Signed-off-by: David Rientjes <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: add stable check in migrate_vma_insert_page()

migrate_vma_insert_page() closely follows the code in:
  __handle_mm_fault()
    handle_pte_fault()
      do_anonymous_page()

Add a call to check_stable_address_space() after locking the page table
entry before inserting a ZONE_DEVICE private zero page mapping similar
to page faulting a new anonymous page.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: clean up some minor coding style

Fix some comment typos and coding style clean up in preparation for the
next patch. No functional changes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate: remove useless mask of start address

Addresses passed to walk_page_range() callback functions are already
page aligned and don't need to be masked with PAGE_MASK.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: reduce critical section protected by split_queue_lock

split_queue_lock protects data in struct deferred_split. We can release
the lock after delete the page from deferred_split_queue.

This patch moves the THP accounting out of the lock protection, which is
introduced in commit 65c453778aea ("mm, rmap: account shmem thp pages").

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: use head to emphasize the purpose of page

During split huge page, it checks the property of the page.  Currently
we do the check on page and head without emphasizing the check is on the
compound page.  In case the page passed to split_huge_page_to_list is a
tail page, audience would take some time to think about whether the
check is on compound page or tail page itself.

To make it explicit, use head instead of page for those checks.  After
this, audience would be more clear about the checks are on compound page
and the page is used to do the split and dump error message if failed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/huge_memory.c: use head to check huge zero page

The page could be a tail page, if this is the case, this BUG_ON will
never be triggered.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: dump stack of victim when reaping failed

When a process cannot be oom reaped, for whatever reason, currently the
list of locks that are held is currently dumped to the kernel log.

Much more interesting is the stack trace of the victim that cannot be
reaped. If the stack trace is dumped, we have the ability to find
related occurrences in the same kernel code and hopefully solve the
issue that is making it wedged.

Dump the stack trace when a process fails to be oom reaped.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memblock: Use __func__ in remaining memblock_dbg() call sites

Replace open function name strings with %s (__func__) in all remaining
memblock_dbg() call sites.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memblock: define memblock_physmem_add()

On the s390 platform memblock.physmem array is being built by directly
calling into memblock_add_range() which is a low level function not
intended to be used outside of memblock.  Hence lets conditionally add
helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is
enabled.  Also use MAX_NUMNODES instead of 0 as node ID similar to
memblock_add() and memblock_reserve().  Make memblock_add_range() a
static function as it is no longer getting used outside of memblock.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Collin Walling <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Philipp Rudo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tools/vm/slabinfo: fix sanity checks enabling

The sysfs file name for enabling sanity checking is called
'sanity_checks' and not 'sanity'.

The name of the file has never changed since the introduction of the
slub allocator. Obviously, most people turn the checks on via the
command line option and not during runtime using slabinfo.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE

Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone
reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use
them, so better to remove them.

[[email protected]: fix sanity checks enabling]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: renumber the bits for neatness]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan: remove prefetch_prev_lru_page

This macro was never used in git history. So better to remove.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmscan.c: remove unused return value of shrink_node

The return value of shrink_node is not used, so remove unnecessary
operations.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Song <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove "count" parameter from has_unmovable_pages()

Now that the memory isolate notifier is gone, the parameter is always 0.
Drop it and cleanup has_unmovable_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove the memory isolate notifier

Luckily, we have no users left, so we can get rid of it. Cleanup
set_migratetype_isolate() a little bit.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pingfan Liu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc: skip non present sections on zone initialization

memmap_init_zone() can be called on the ranges with holes during the
boot.  It will skip any non-valid PFNs one-by-one.  It works fine as
long as holes are not too big.

But huge holes in the memory map causes a problem.  It takes over 20
seconds to walk 32TiB hole.  x86-64 with 5-level paging allows for much
larger holes in the memory map which would practically hang the system.

Deferred struct page init doesn't help here.  It only works on the
present ranges.

Skipping non-present sections would fix the issue.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Jin, Zhi" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/early_ioremap.c: use %pa to print resource_size_t variables

%pa takes into consideration the special types such as resource_size_t.
Use this specifier %instead of explicit casting.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()

In case memory resources for _ptr2_ were allocated, release them before
return.

Notice that in case _ptr1_ happens to be NULL, krealloc() behaves
exactly like kmalloc().

Addresses-Coverity-ID: 1490594 ("Resource leak")
Link: http://lkml.kernel.org/r/20200123160115.GA4202@embeddedor
Fixes: 3f15801cdc23 ("lib: add kasan test module")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, tracing: print symbol name for kmem_alloc_node call_site events

Print the call_site ip of kmem_alloc_node using '%pS' to improve the
readability of raw slab trace points.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Junyong Sun <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page

When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
The current implementation apply comparison as

- normal 4K page: page_pfn <= pfn < page_pfn + 1
- hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
- THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR

in pfn_in_hpage.  For hugetlbfs page, it should be page_pfn == pfn

Now, change pfn_in_hpage to pfn_is_match to highlight that comparison is
not only for THP and explicitly compare for these cases.

No impact upon current behavior, just make the code clear.  I think it
is important to make the code clear - comparing hugetlbfs page in range
page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol.c: cleanup some useless code

Compound pages handling in mem_cgroup_migrate is more convoluted than
necessary. The state is duplicated in compound variable and the same
could be achieved by PageTransHuge check which is trivial and
hpage_nr_pages is already PageTransHuge aware.

It is much simpler to just use hpage_nr_pages for nr_pages and replace
the local variable by PageTransHuge check directly

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaitao Cheng <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swapfile.c: swap_next should increase position index

If seq_file .next fuction does not change position index, read after
some lseek can generate unexpected output.

In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface") "Some ->next functions
do not increment *pos when they return NULL...  Note that such ->next
functions are buggy and should be fixed.  A simple demonstration is

  dd if=/proc/swaps bs=1000 skip=1

Choose any block size larger than the size of /proc/swaps.  This will
always show the whole last line of /proc/swaps"

Described problem is still actual.  If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.

  $ dd if=/proc/swaps bs=1  # usual output
  Filename Type Size Used Priority
  /dev/dm-0                               partition 4194812 97536 -2
  104+0 records in
  104+0 records out
  104 bytes copied

  $ dd if=/proc/swaps bs=40 skip=1    # last line was generated twice
  dd: /proc/swaps: cannot skip to specified offset
  v/dm-0                               partition 4194812 97536 -2
  /dev/dm-0                               partition 4194812 97536 -2
  3+1 records in
  3+1 records out
  131 bytes copied

https://bugzilla.kernel.org/show_bug.cgi?id=206283

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, tree-wide: rename put_user_page*() to unpin_user_page*()

In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a
hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing it
once before issuing any of the get_user_pages*() calls. This makes it
harder to overlook, instead of having little "gup_flags & 1" phrases in
the function calls.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: book3s64: convert to pin_user_pages() and put_user_page()

1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change vfio from get_user_pages_remote(), to
   pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
   put_user_page(), also convert the put_page() call over to
   put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty().  This is probably
more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640 [email protected]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Tested-by: Alex Williamson <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

net/xdp: set FOLL_PIN via pin_user_pages()

Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already calling
put_user_page() instead of put_page(). Therefore, in order to convert
from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/io_uring: set FOLL_PIN via pin_user_pages()

Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drm/via: set FOLL_PIN via pin_user_pages_fast()

Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required is to
change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>