Git Repo - linux.git/log

checkpatch.pl: warn on invalid commit id

It can happen that a commit message refers to an invalid commit id,
because the referenced hash changed following a rebase, or simply by
mistake.  Add a check in checkpatch.pl which checks that an hash
referenced by a Fixes tag, or just cited in the commit message, is a valid
commit id.

    $ scripts/checkpatch.pl <<'EOF'
    Subject: [PATCH] test commit

    Sample test commit to test checkpatch.pl
    Commit 1da177e4c3f4 ("Linux-2.6.12-rc2") really exists,
    commit 0bba044c4ce7 ("tree") is valid but not a commit,
    while commit b4cc0b1c0cca ("unknown") is invalid.

Fixes: f0cacc14cade ("unknown")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    EOF
    WARNING: Unknown commit id '0bba044c4ce7', maybe rebased or not pulled?
    #8:
    commit 0bba044c4ce7 ("tree") is valid but not a commit,

    WARNING: Unknown commit id 'b4cc0b1c0cca', maybe rebased or not pulled?
    #9:
    while commit b4cc0b1c0cca ("unknown") is invalid.

    WARNING: Unknown commit id 'f0cacc14cade', maybe rebased or not pulled?
    #11:
Fixes: f0cacc14cade ("unknown")
    total: 0 errors, 3 warnings, 4 lines checked

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matteo Croce <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve SPDX license checking

Use perl's m@<match>@ match and not /<match>/ comparisons to avoid
an error using c90's // comment style.

Miscellanea:

o Use normal tab indentation and alignment

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/f08eb62458407a145cfedf959d1091af151cd665.1563575364.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: don't interpret stack dumps as commit IDs

Add more types of lines that appear to be stack dumps that also include
hex lines that might otherwise be interpreted as commit IDs.

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/f7dc9727795db3802809a24162abe0b67e14123b.1563575364.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/hexdump: make print_hex_dump_bytes() a nop on !DEBUG builds

I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
have DEBUG defined in my build.  The problem is that
print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
consider here

  1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
  2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
  3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out

Right now, that last case isn't detected and we still call
print_hex_dump() from the stub wrapper.

Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
it works properly in all cases.

Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
is a nop, changing behavior to what we want, i.e.  print nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/extable.c: add missing prototypes

When building with W=1, a number of warnings are issued:

  CC      lib/extable.o
lib/extable.c:63:6: warning: no previous prototype for 'sort_extable' [-Wmissing-prototypes]
   63 | void sort_extable(struct exception_table_entry *start,
      |      ^~~~~~~~~~~~
lib/extable.c:75:6: warning: no previous prototype for 'trim_init_extable' [-Wmissing-prototypes]
   75 | void trim_init_extable(struct module *m)
      |      ^~~~~~~~~~~~~~~~~
lib/extable.c:115:1: warning: no previous prototype for 'search_extable' [-Wmissing-prototypes]
  115 | search_extable(const struct exception_table_entry *base,
      | ^~~~~~~~~~~~~~

Add the missing #include for the prototypes.

Link: http://lkml.kernel.org/r/45574.1565235784@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/generic-radix-tree.c: make 2 functions static inline

When building with W=1, we get some warnings:

l  CC      lib/generic-radix-tree.o
lib/generic-radix-tree.c:39:10: warning: no previous prototype for 'genradix_root_to_depth' [-Wmissing-prototypes]
   39 | unsigned genradix_root_to_depth(struct genradix_root *r)
      |          ^~~~~~~~~~~~~~~~~~~~~~
lib/generic-radix-tree.c:44:23: warning: no previous prototype for 'genradix_root_to_node' [-Wmissing-prototypes]
   44 | struct genradix_node *genradix_root_to_node(struct genradix_root *r)
      |                       ^~~~~~~~~~~~~~~~~~~~~

They're not used anywhere else, so make them static inline.

Link: http://lkml.kernel.org/r/46923.1565236485@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

strscpy: reject buffer sizes larger than INT_MAX

As already done for snprintf(), add a check in strscpy() for giant (i.e.
likely negative and/or miscalculated) copy sizes, WARN, and error out.

Link: http://lkml.kernel.org/r/201907260928.23DE35406@keescook
Signed-off-by: Kees Cook <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Yann Droneaud <[email protected]>
Cc: David Laight <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/trace/events/writeback.h: fix -Wstringop-truncation warnings

There are many of those warnings.

In file included from ./arch/powerpc/include/asm/paca.h:15,
                 from ./arch/powerpc/include/asm/current.h:13,
                 from ./include/linux/thread_info.h:21,
                 from ./include/asm-generic/preempt.h:5,
                 from ./arch/powerpc/include/generated/asm/preempt.h:1,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:51,
                 from fs/fs-writeback.c:19:
In function 'strncpy',
    inlined from 'perf_trace_writeback_page_template' at
./include/trace/events/writeback.h:56:1:
./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
  return __builtin_strncpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix it by using the new strscpy_pad() which was introduced in "lib/string:
Add strscpy_pad() function" and will always be NUL-terminated instead of
strncpy().  Also, change strlcpy() to use strscpy_pad() in this file for
consistency.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 455b2864686d ("writeback: Initial tracing support")
Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Nitin Gote <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Stephen Kitt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel-doc: core-api: include string.h into core-api

core-api should show all the various string functions including the newly
added stracpy and stracpy_pad.

Miscellanea:

o Update the Returns: value for strscpy
o fix a defect with %NUL)

[[email protected]: correct return of -E2BIG descriptions]
Link: http://lkml.kernel.org/r/29f998b4c1a9d69fbeae70500ba0daa4b340c546.1563889130.git.joe@perches.com
Link: http://lkml.kernel.org/r/224a6ebf39955f4107c0c376d66155d970e46733.1563841972.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Nitin Gote <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: rework the RB_DECLARE_CALLBACKS macro definition

Change the definition of the RBCOMPUTE function.  The propagate callback
repeatedly calls RBCOMPUTE as it moves from leaf to root.  it wants to
stop recomputing once the augmented subtree information doesn't change.
This was previously checked using the == operator, but that only works
when the augmented subtree information is a scalar field.  This commit
modifies the RBCOMPUTE function so that it now sets the augmented subtree
information instead of returning it, and returns a boolean value
indicating if the propagate callback should stop.

The motivation for this change is that I want to introduce augmented
rbtree uses where the augmented data for the subtree is a struct instead
of a scalar.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro

Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern. This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.

[[email protected]: fix mm/vmalloc.c]
Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[[email protected]: re-add check to check_augmented()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

augmented rbtree: add comments for RB_DECLARE_CALLBACKS macro

Patch series "make RB_DECLARE_CALLBACKS more generic", v3.

These changes are intended to make the RB_DECLARE_CALLBACKS macro more
generic (allowing the aubmented subtree information to be a struct instead
of a scalar).

I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
check that they didn't change.  This held as expected for interval_tree.o;
mmap.o did have some changes which could be reverted by marking
__vma_link_rb as noinline.  I did not add such a change to the patchset; I
felt it was reasonable enough to leave the inlining decision up to the
compiler.

This patch (of 3):

Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
The arguments are also now capitalized.  This copies the style of the
INTERVAL_TREE_DEFINE macro.

No functional changes in this commit, only comments and capitalization.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rbtree: avoid generating code twice for the cached versions (tools copy)

As was already noted in rbtree.h, the logic to cache rb_first (or
rb_last) can easily be implemented externally to the core rbtree api.

This commit takes the changes applied to the include/linux/ and lib/
rbtree files in 9f973cb38088 ("lib/rbtree: avoid generating code twice
for the cached versions"), and applies these to the
tools/include/linux/ and tools/lib/ files as well to keep them
synchronized.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michel Lespinasse <[email protected]>
Cc: David Howells <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/elfcore.c: include proper prototypes

When building with W=1, gcc properly complains that there's no prototypes:

  CC      kernel/elfcore.o
kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
      |                 ^~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
   12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
   17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
   22 | size_t __weak elf_core_extra_data_size(void)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~

Provide the include file so gcc is happy, and we don't have potential code drift

Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
Signed-off-by: Valdis Kletnieks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/coff.h: add include guard

Add a header include guard just in case.

My motivation is to allow Kbuild to detect missing include guard:

https://patchwork.kernel.org/patch/11063011/

Before I enable this checker I want to fix as many headers as possible.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, kmem: do not fail __GFP_NOFAIL charges

Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
create_page_buffers+0x4d/0x60
__block_write_begin_int+0x8e/0x5a0
? ext4_inode_attach_jinode.part.82+0xb0/0xb0
? jbd2__journal_start+0xd7/0x1f0
ext4_da_write_begin+0x112/0x3d0
generic_perform_write+0xf1/0x1b0
? file_update_time+0x70/0x140
__generic_file_write_iter+0x141/0x1a0
ext4_file_write_iter+0xef/0x3b0
__vfs_write+0x17e/0x1e0
vfs_write+0xa5/0x1a0
ksys_write+0x57/0xd0
do_syscall_64+0x55/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
behavior because nofail allocations are not allowed to fail.  Normal
charge path simply forces the charge even if that means to cross the
limit.  Kmem accounting should be doing the same.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Thomas Lindroth <[email protected]>
Debugged-by: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Thomas Lindroth <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c updates from Wolfram Sang:

- new driver for ICY, an Amiga Zorro card :)

- axxia driver gained slave mode support, NXP driver gained ACPI

- the slave EEPROM backend gained 16 bit address support

- and lots of regular driver updates and reworks

* 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
  i2c: tegra: Move suspend handling to NOIRQ phase
  i2c: imx: ACPI support for NXP i2c controller
  i2c: uniphier(-f): remove all dev_dbg()
  i2c: uniphier(-f): use devm_platform_ioremap_resource()
  i2c: slave-eeprom: Add comment about address handling
  i2c: exynos5: Remove IRQF_ONESHOT
  i2c: stm32f7: Make structure stm32f7_i2c_algo constant
  i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe
  i2c-eeprom_slave: Add support for more eeprom models
  i2c: fsi: Add of_put_node() before break
  i2c: synquacer: Make synquacer_i2c_ops constant
  i2c: hix5hd2: Remove IRQF_ONESHOT
  i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond
  watchdog: iTCO: Add support for Cannon Lake PCH iTCO
  i2c: iproc: Make bcm_iproc_i2c_quirks constant
  i2c: iproc: Add full name of devicetree node to adapter name
  i2c: piix4: Add ACPI support
  i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h
  i2c: ocores: use request_any_context_irq() to register IRQ handler
  i2c: designware: Fix optional reset error handling
  ...

Merge tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A few small remaining wrap-up for this merge window.

  Most of patches are device-specific (HD-audio and USB-audio quirks,
  FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a
  simple fix (actually two commits) for ASoC core"

* tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Add DSD support for EVGA NU Audio
  ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
  ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies
  ASoC: pcm3168a: The codec does not support S32_LE
  ASoC: core: use list_del_init and move it back to soc_cleanup_component
  ALSA: hda/realtek - PCI quirk for Medion E4254
  ALSA: hda - Apply AMD controller workaround for Raven platform
  ASoC: rsnd: do error check after rsnd_channel_normalization()
  ASoC: atmel_ssc_dai: Remove wrong spinlock usage
  ASoC: core: delete component->card_list in soc_remove_component only
  ASoC: fsl_sai: Fix noise when using EDMA
  ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
  ALSA: hda/realtek - Fix alienware headset mic
  ALSA: dice: fix wrong packet parameter for Alesis iO26

Merge tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block

Pull more io_uring updates from Jens Axboe:
"A collection of later fixes and additions, that weren't quite ready
  for pushing out with the initial pull request.

  This contains:

   - Fix potential use-after-free of shadow requests (Jackie)

   - Fix potential OOM crash in request allocation (Jackie)

   - kmalloc+memcpy -> kmemdup cleanup (Jackie)

   - Fix poll crash regression (me)

   - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)

   - Add support for timeouts, making it easier to do epoll_wait()
     conversions, for instance (me)

   - Ensure io_uring works without f_ops->read_iter() and
     f_ops->write_iter() (me)"

* tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
  io_uring: correctly handle non ->{read,write}_iter() file_operations
  io_uring: IORING_OP_TIMEOUT support
  io_uring: use cond_resched() in sqthread
  io_uring: fix potential crash issue due to io_get_req failure
  io_uring: ensure poll commands clear ->sqe
  io_uring: fix use-after-free of shadow_req
  io_uring: use kmemdup instead of kmalloc and memcpy

Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
"Some later additions that weren't quite done for the first pull
  request, and also a few fixes that have arrived since.

  This contains:

   - Kill silly pktcdvd warning on attempting to register a non-scsi
     passthrough device (me)

   - Use symbolic constants for the block t10 protection types, and
     switch to handling it in core rather than in the drivers (Max)

   - libahci platform missing node put fix (Nishka)

   - Small series of fixes for BFQ (Paolo)

   - Fix possible nbd crash (Xiubo)"

* tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
  block: drop device references in bsg_queue_rq()
  block: t10-pi: fix -Wswitch warning
  pktcdvd: remove warning on attempting to register non-passthrough dev
  ata: libahci_platform: Add of_node_put() before loop exit
  nbd: fix possible page fault for nbd disk
  nbd: rename the runtime flags as NBD_RT_ prefixed
  block, bfq: push up injection only after setting service time
  block, bfq: increase update frequency of inject limit
  block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
  block, bfq: update inject limit only after injection occurred
  block: centralize PI remapping logic to the block layer
  block: use symbolic constants for t10_pi type

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

- a few hot fixes

- ocfs2 updates

- almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
   cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
   sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
   oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
   zsmalloc)

* emailed patches from Andrew Morton <[email protected]>: (132 commits)
  mm/zsmalloc.c: fix a -Wunused-function warning
  zswap: do not map same object twice
  zswap: use movable memory if zpool support allocate movable memory
  zpool: add malloc_support_movable to zpool_driver
  shmem: fix obsolete comment in shmem_getpage_gfp()
  mm/madvise: reduce code duplication in error handling paths
  mm: mmap: increase sockets maximum memory size pgoff for 32bits
  mm/mmap.c: refine find_vma_prev() with rb_last()
  riscv: make mmap allocation top-down by default
  mips: use generic mmap top-down layout and brk randomization
  mips: replace arch specific way to determine 32bit task with generic version
  mips: adjust brk randomization offset to fit generic version
  mips: use STACK_TOP when computing mmap base address
  mips: properly account for stack randomization and stack guard gap
  arm: use generic mmap top-down layout and brk randomization
  arm: use STACK_TOP when computing mmap base address
  arm: properly account for stack randomization and stack guard gap
  arm64, mm: make randomization selected by generic topdown mmap layout
  arm64, mm: move generic mmap layout functions to mm
  arm64: consider stack randomization for mmap base only when necessary
  ...

mm/zsmalloc.c: fix a -Wunused-function warning

set_zspage_inuse() was introduced in the commit 4f42047bbde0 ("zsmalloc:
use accessor") but all the users of it were removed later by the commits,

bdb0af7ca8f0 ("zsmalloc: factor page chain functionality out")
3783689a1aa8 ("zsmalloc: introduce zspage structure")

so the function can be safely removed now.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: do not map same object twice

zswap_writeback_entry() maps a handle to read swpentry first, and
then in the most common case it would map the same handle again.
This is ok when zbud is the backend since its mapping callback is
plain and simple, but it slows things down for z3fold.

Since there's hardly a point in unmapping a handle _that_ fast as
zswap_writeback_entry() does when it reads swpentry, the
suggestion is to keep the handle mapped till the end.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Reviewed-by: Dan Streetman <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: use movable memory if zpool support allocate movable memory

This is the third version that was updated according to the comments from
Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
https://lkml.org/lkml/2019/6/4/973

zswap compresses swap pages into a dynamically allocated RAM-based memory
pool.  The memory pool should be zbud, z3fold or zsmalloc.  All of them
will allocate unmovable pages.  It will increase the number of unmovable
page blocks that will bad for anti-fragment.

zsmalloc support page migration if request movable page:
        handle = zs_malloc(zram->mem_pool, comp_len,
                GFP_NOIO | __GFP_HIGHMEM |
                __GFP_MOVABLE);

And commit "zpool: Add malloc_support_movable to zpool_driver" add
zpool_malloc_support_movable check malloc_support_movable to make sure if
a zpool support allocate movable memory.

This commit let zswap allocate block with gfp
__GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.

Following part is test log in a pc that has 8G memory and 2G swap.

Without this commit:
~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4826062 usecs = 549973 KB/s
2717908992 bytes / 4864201 usecs = 545661 KB/s
2717908992 bytes / 4867015 usecs = 545346 KB/s
2717908992 bytes / 4915485 usecs = 539968 KB/s
397853 usecs to free memory
357820 usecs to free memory
421333 usecs to free memory
420454 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            1            7            0            0            0            0
Node 0, zone    DMA32            4         1652            0            0            0            0
Node 0, zone   Normal          931         1485           15            0            0            0

With this commit:
~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4689240 usecs = 566020 KB/s
2717908992 bytes / 4760605 usecs = 557535 KB/s
2717908992 bytes / 4803621 usecs = 552543 KB/s
2717908992 bytes / 5069828 usecs = 523530 KB/s
431546 usecs to free memory
383397 usecs to free memory
456454 usecs to free memory
224487 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            1            7            0            0            0            0
Node 0, zone    DMA32            4         1650            2            0            0            0
Node 0, zone   Normal           79         2326           26            0            0            0

You can see that the number of unmovable page blocks is decreased
when the kernel has this commit.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hui Zhu <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zpool: add malloc_support_movable to zpool_driver

As a zpool_driver, zsmalloc can allocate movable memory because it support
migate pages.  But zbud and z3fold cannot allocate movable memory.

Add malloc_support_movable to zpool_driver.  If a zpool_driver support
allocate movable memory, set it to true.  And add
zpool_malloc_support_movable check malloc_support_movable to make sure if
a zpool support allocate movable memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hui Zhu <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

shmem: fix obsolete comment in shmem_getpage_gfp()

Replace "fault_mm" with "vmf" in code comment because commit cfda05267f7b
("userfaultfd: shmem: add userfaultfd hook for shared memory faults") has
changed the prototpye of shmem_getpage_gfp() - pass vmf instead of
fault_mm to the function.

Before:
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);
After:
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct vm_area_struct *vma,
struct vm_fault *vmf, vm_fault_t *fault_type);

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miles Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/madvise: reduce code duplication in error handling paths

madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
identical code.

Move that code to a common error handling path.

No functional changes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Pankaj Gupta <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: mmap: increase sockets maximum memory size pgoff for 32bits

The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING
and XDP_UMEM_PGOFF_COMPLETION_RING offsets. These offsets are
established already and are part of the configuration interface.

But for 32-bit systems, using AF_XDP socket configuration, these values
are too large to pass the maximum allowed file size verification. The
offsets can be tuned off, but instead of changing the existing
interface, let's extend the max allowed file size for sockets.

No one has been using this until this patch with 32 bits as without
this fix af_xdp sockets can't be used at all, so it unblocks af_xdp
socket usage for 32bit systems.

All list of mmap cbs for sockets was verified for side effects and all
of them contain dummy cb - sock_no_mmap() at this moment, except the
following:

xsk_mmap() - it's what this fix is needed for.
tcp_mmap() - doesn't have obvious issues with pgoff - no any references on it.
packet_mmap() - return -EINVAL if it's even set.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ivan Khoronzhuk <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Magnus Karlsson <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: refine find_vma_prev() with rb_last()

When addr is out of range of the whole rb_tree, pprev will point to the
right-most node. rb_tree facility already provides a helper function,
rb_last(), to do this task. We can leverage this instead of
reimplementing it.

This patch refines find_vma_prev() with rb_last() to make it a little
nicer to read.

[[email protected]: little cleanup, per Vlastimil]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

riscv: make mmap allocation top-down by default

In order to avoid wasting user address space by using bottom-up mmap
allocation scheme, prefer top-down scheme when possible.

Before:
root@qemuriscv64:~# cat /proc/self/maps
00010000-00016000 r-xp 00000000 fe:00 6389       /bin/cat.coreutils
00016000-00017000 r--p 00005000 fe:00 6389       /bin/cat.coreutils
00017000-00018000 rw-p 00006000 fe:00 6389       /bin/cat.coreutils
00018000-00039000 rw-p 00000000 00:00 0          [heap]
1555556000-155556d000 r-xp 00000000 fe:00 7193   /lib/ld-2.28.so
155556d000-155556e000 r--p 00016000 fe:00 7193   /lib/ld-2.28.so
155556e000-155556f000 rw-p 00017000 fe:00 7193   /lib/ld-2.28.so
155556f000-1555570000 rw-p 00000000 00:00 0
1555570000-1555572000 r-xp 00000000 00:00 0      [vdso]
1555574000-1555576000 rw-p 00000000 00:00 0
1555576000-1555674000 r-xp 00000000 fe:00 7187   /lib/libc-2.28.so
1555674000-1555678000 r--p 000fd000 fe:00 7187   /lib/libc-2.28.so
1555678000-155567a000 rw-p 00101000 fe:00 7187   /lib/libc-2.28.so
155567a000-15556a0000 rw-p 00000000 00:00 0
3fffb90000-3fffbb1000 rw-p 00000000 00:00 0      [stack]

After:
root@qemuriscv64:~# cat /proc/self/maps
00010000-00016000 r-xp 00000000 fe:00 6389       /bin/cat.coreutils
00016000-00017000 r--p 00005000 fe:00 6389       /bin/cat.coreutils
00017000-00018000 rw-p 00006000 fe:00 6389       /bin/cat.coreutils
2de81000-2dea2000 rw-p 00000000 00:00 0          [heap]
3ff7eb6000-3ff7ed8000 rw-p 00000000 00:00 0
3ff7ed8000-3ff7fd6000 r-xp 00000000 fe:00 7187   /lib/libc-2.28.so
3ff7fd6000-3ff7fda000 r--p 000fd000 fe:00 7187   /lib/libc-2.28.so
3ff7fda000-3ff7fdc000 rw-p 00101000 fe:00 7187   /lib/libc-2.28.so
3ff7fdc000-3ff7fe2000 rw-p 00000000 00:00 0
3ff7fe4000-3ff7fe6000 r-xp 00000000 00:00 0      [vdso]
3ff7fe6000-3ff7ffd000 r-xp 00000000 fe:00 7193   /lib/ld-2.28.so
3ff7ffd000-3ff7ffe000 r--p 00016000 fe:00 7193   /lib/ld-2.28.so
3ff7ffe000-3ff7fff000 rw-p 00017000 fe:00 7193   /lib/ld-2.28.so
3ff7fff000-3ff8000000 rw-p 00000000 00:00 0
3fff888000-3fff8a9000 rw-p 00000000 00:00 0      [stack]

[[email protected]: v6]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Acked-by: Paul Walmsley <[email protected]> [arch/riscv]
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips: use generic mmap top-down layout and brk randomization

mips uses a top-down layout by default that exactly fits the generic
functions, so get rid of arch specific code and use the generic version by
selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
use the generic version of arch_randomize_brk since it also fits. Note
that this commit also removes the possibility for mips to have elf
randomization and no MMU: without MMU, the security added by randomization
is worth nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Paul Burton <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips: replace arch specific way to determine 32bit task with generic version

Mips uses TASK_IS_32BIT_ADDR to determine if a task is 32bit, but this
define is mips specific and other arches do not have it: instead, use
!IS_ENABLED(CONFIG_64BIT) || is_compat_task() condition.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Paul Burton <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips: adjust brk randomization offset to fit generic version

This commit simply bumps up to 32MB and 1GB the random offset of brk,
compared to 8MB and 256MB, for 32bit and 64bit respectively.

Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Paul Burton <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips: use STACK_TOP when computing mmap base address

mmap base address must be computed wrt stack top address, using TASK_SIZE
is wrong since STACK_TOP and TASK_SIZE are not equivalent.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: Paul Burton <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mips: properly account for stack randomization and stack guard gap

This commit takes care of stack randomization and stack guard gap when
computing mmap base address and checks if the task asked for
randomization. This fixes the problem uncovered and not fixed for arm
here: https://lkml.kernel.org/r/20170622200033 [email protected]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: Paul Burton <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm: use generic mmap top-down layout and brk randomization

arm uses a top-down mmap layout by default that exactly fits the generic
functions, so get rid of arch specific code and use the generic version by
selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
use the generic version of arch_randomize_brk since it also fits. Note
that this commit also removes the possibility for arm to have elf
randomization and no MMU: without MMU, the security added by randomization
is worth nothing.

Note that it is safe to remove STACK_RND_MASK since it matches the default
value.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm: use STACK_TOP when computing mmap base address

mmap base address must be computed wrt stack top address, using TASK_SIZE
is wrong since STACK_TOP and TASK_SIZE are not equivalent.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm: properly account for stack randomization and stack guard gap

This commit takes care of stack randomization and stack guard gap when
computing mmap base address and checks if the task asked for
randomization. This fixes the problem uncovered and not fixed for arm
here: https://lkml.kernel.org/r/20170622200033 [email protected]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64, mm: make randomization selected by generic topdown mmap layout

This commits selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the generic
topdown mmap layout functions so that this security feature is on by
default.

Note that this commit also removes the possibility for arm64 to have elf
randomization and no MMU: without MMU, the security added by randomization
is worth nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64, mm: move generic mmap layout functions to mm

arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm. It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions. Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: consider stack randomization for mmap base only when necessary

Do not offset mmap base address because of stack randomization if current
task does not want randomization. Note that x86 already implements this
behaviour.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: make use of is_compat_task instead of hardcoding this test

Each architecture has its own way to determine if a task is a compat task,
by using is_compat_task in arch_mmap_rnd, it allows more genericity and
then it prepares its moving to mm/.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, fs: move randomize_stack_top from fs to mm

Patch series "Provide generic top-down mmap layout functions", v6.

This series introduces generic functions to make top-down mmap layout
easily accessible to architectures, in particular riscv which was the
initial goal of this series.  The generic implementation was taken from
arm64 and used successively by arm, mips and finally riscv.

Note that in addition the series fixes 2 issues:

- stack randomization was taken into account even if not necessary.

- [1] fixed an issue with mmap base which did not take into account
  randomization but did not report it to arm and mips, so by moving arm64
  into a generic library, this problem is now fixed for both
  architectures.

This work is an effort to factorize architecture functions to avoid code
duplication and oversights as in [1].

[1]: https://www.mail-archive.com/[email protected]/msg1429066.html

This patch (of 14):

This preparatory commit moves this function so that further introduction
of generic topdown mmap layout is contained only in mm/util.c.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uprobe: collapse THP pmd after removing all uprobes

After all uprobes are removed from the huge page (with PTE pgtable), it is
possible to collapse the pmd and benefit from THP again. This patch does
the collapse by calling collapse_pte_mapped_thp().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

khugepaged: enable collapse pmd for pte-mapped THP

khugepaged needs exclusive mmap_sem to access page table.  When it fails
to lock mmap_sem, the page will fault in as pte-mapped THP.  As the page
is already a THP, khugepaged will not handle this pmd again.

This patch enables the khugepaged to retry collapse the page table.

struct mm_slot (in khugepaged.c) is extended with an array, containing
addresses of pte-mapped THPs.  We use array here for simplicity.  We can
easily replace it with more advanced data structures when needed.

In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
collapse the page table.

Since collapse may happen at an later time, some pages may already fault
in.  collapse_pte_mapped_thp() is added to properly handle these pages.
collapse_pte_mapped_thp() also double checks whether all ptes in this pmd
are mapping to the same THP.  This is necessary because some subpage of
the THP may be replaced, for example by uprobe.  In such cases, it is not
possible to collapse the pmd.

[[email protected]: add comments for retract_page_tables()]
Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT

Use the newly added FOLL_SPLIT_PMD in uprobe. This preserves the huge
page when the uprobe is enabled. When the uprobe is disabled, newer
instances of the same application could still benefit from huge page.

For the next step, we will enable khugepaged to regroup the pmd, so that
existing instances of the application could also benefit from huge page
after the uprobe is disabled.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, thp: introduce FOLL_SPLIT_PMD

Introduce a new foll_flag: FOLL_SPLIT_PMD.  As the name says
FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need to use regular pages, but
would switch back to huge page and huge pmd on.  One of such example is
uprobe.  The following patches use FOLL_SPLIT_PMD in uprobe.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uprobe: use original page when all uprobes are removed

Currently, uprobe swaps the target page with a anonymous page in both
install_breakpoint() and remove_breakpoint(). When all uprobes on a page
are removed, the given mm is still using an anonymous page (not the
original page).

This patch allows uprobe to use original page when possible (all uprobes
on the page are already removed, and the original page is in page cache
and uptodate).

As suggested by Oleg, we unmap the old_page and let the original page
fault in.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move memcmp_pages() and pages_identical()

Patch series "THP aware uprobe", v13.

This patchset makes uprobe aware of THPs.

Currently, when uprobe is attached to text on THP, the page is split by
FOLL_SPLIT.  As a result, uprobe eliminates the performance benefit of
THP.

This set makes uprobe THP-aware.  Instead of FOLL_SPLIT, we introduces
FOLL_SPLIT_PMD, which only split PMD for uprobe.

After all uprobes within the THP are removed, the PTE-mapped pages are
regrouped as huge PMD.

This set (plus a few THP patches) is also available at

   https://github.com/liu-song-6/linux/tree/uprobe-thp

This patch (of 6):

Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
can use them in other files.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: thp: make deferred split shrinker memcg aware

Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration.  For example the below test would
run into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue.  The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg.  When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

[[email protected]: simplify deferred split queue dereference per Kirill Tkhai]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: shrinker: make shrinker not depend on memcg kmem

Currently shrinker is just allocated and can work when memcg kmem is
enabled. But, THP deferred split shrinker is not slab shrinker, it
doesn't make too much sense to have such shrinker depend on memcg kmem.
It should be able to reclaim THP even though memcg kmem is disabled.

Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker.
When memcg kmem is disabled, just such shrinkers can be called in
shrinking memcg slab.

[[email protected]: add comment]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move mem_cgroup_uncharge out of __page_cache_release()

A later patch makes THP deferred split shrinker memcg aware, but it needs
page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.

So move mem_cgroup_uncharge() from __page_cache_release() to compound page
destructor, which is called by both THP and other compound pages except
HugeTLB. And call it in __put_single_page() for single order page.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Suggested-by: "Kirill A . Shutemov" <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: thp: extract split_queue_* into a struct

Patch series "Make deferred split shrinker memcg aware", v6.

Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration.  For example the below test would
run into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue.  The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg.  When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

Make deferred split shrinker not depend on memcg kmem since it is not
slab.  It doesn't make sense to not shrink THP even though memcg kmem is
disabled.

With the above change the test demonstrated above doesn't trigger OOM even
though with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Suggested-by: "Kirill A . Shutemov" <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,thp: avoid writes to file with THP in pagecache

In previous patch, an application could put part of its text section in
THP via madvise().  These THPs will be protected from writes when the
application is still running (TXTBSY).  However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write.  A new counter nr_thps is added to struct
address_space.  In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,thp: add read-only THP support for (non-shmem) FS

This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.

This patch enables an application to put part of its text sections to THP
via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

We tried to reuse the logic for THP on tmpfs.

Currently, write is not supported for non-shmem THP.  khugepaged will only
process vma with VM_DENYWRITE.  sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff).  The only way to create vma with VM_DENYWRITE is
execve().  This requirement limits non-shmem THP to text sections.

The next patch will handle writes, which would only happen when the all
the vmas with VM_DENYWRITE are unmapped.

An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.

[[email protected]: fix build without CONFIG_SHMEM]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix double unlock in collapse_file()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

khugepaged: rename collapse_shmem() and khugepaged_scan_shmem()

Next patch will add khugepaged support of non-shmem files.  This patch
renames these two functions to reflect the new functionality:

    collapse_shmem()        =>  collapse_file()
    khugepaged_scan_shmem() =>  khugepaged_scan_file()

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,thp: stats for file backed THP

In preparation for non-shmem THP, this patch adds a few stats and exposes
them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
/proc/<pid>/task/<tid>/smaps.

This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
https://lkml.kernel.org/r/20170126115819 [email protected]/

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

filemap: update offset check in filemap_fault()

With THP, current check of offset:

VM_BUG_ON_PAGE(page->index != offset, page);

is no longer accurate. Update it to:

VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

filemap: check compound_head(page)->mapping in pagecache_get_page()

Similar to previous patch, pagecache_get_page() avoids race condition with
truncate by checking page->mapping == mapping. This does not work for
compound pages. This patch let it check compound_head(page)->mapping
instead.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

filemap: check compound_head(page)->mapping in filemap_fault()

Patch series "Enable THP for text section of non-shmem files", v10;

This patchset follows up discussion at LSF/MM 2019.  The motivation is to
put text section of an application in THP, and thus reduces iTLB miss rate
and improves performance.  Both Facebook and Oracle showed strong
interests to this feature.

To make reviews easier, this set aims a mininal valid product.  Current
version of the work does not have any changes to file system specific
code.  This comes with some limitations (discussed later).

This set enables an application to "hugify" its text section by simply
running something like:

          madvise(0x600000, 0x80000, MADV_HUGEPAGE);

Before this call, the /proc/<pid>/maps looks like:

    00400000-074d0000 r-xp 00000000 00:27 2006927     app

After this call, part of the text section is split out and mapped to
THP:

    00400000-00425000 r-xp 00000000 00:27 2006927     app
    00600000-00e00000 r-xp 00200000 00:27 2006927     app   <<< on THP
    00e00000-074d0000 r-xp 00a00000 00:27 2006927     app

Limitations:

1. This only works for text section (vma with VM_DENYWRITE).
2. Original limitation #2 is removed in v3.

We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
Once we get better support on the write path, we can remove the config and
enable it by default.

Tested cases:
1. Tested with btrfs and ext4.
2. Tested with real work application (memcache like caching service).
3. Tested with "THP aware uprobe":
   https://patchwork.kernel.org/project/linux-mm/list/?series=131339

This patch (of 7):

Currently, filemap_fault() avoids race condition with truncate by checking
page->mapping == mapping.  This does not work for compound pages.  This
patch let it check compound_head(page)->mapping instead.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

thp: update split_huge_page_pmd() comment

According to 78ddc5347341 ("thp: rename split_huge_page_pmd() to
split_huge_pmd()"), update related comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: clean up useless code in migrate_vma_collect_pmd()

Remove unused 'pfn' variable.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pingfan Liu <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlbfs: don't retry when pool page allocations start to fail

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the
pages will be interleaved between all nodes of the system.  If nodes are
not equal, it is quite possible for one node to fill up before the others.
When this happens, the code still attempts to allocate pages from the
full node.  This results in calls to direct reclaim and compaction which
slow things down considerably.

When allocating pool pages, note the state of the previous allocation for
each node.  If previous allocation failed, do not use the aggressive retry
algorithm on successive attempts.  The allocation will still succeed if
there is memory available, but it will not try as hard to free up memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: raise compaction priority after it withdrawns

Mike Kravetz reports that "hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often then it
should.  Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made."

The problem is that the compaction_withdrawn() test in
should_compact_retry() includes compaction outcomes that are only possible
on low compaction priority, and results in a retry without increasing the
priority.  This may result in furter reclaim, and more incomplete
compaction attempts.

With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.

The COMPACT_SKIPPED result doesn't really fit together with the other
outcomes in compaction_withdrawn(), as that's a result caused by
insufficient order-0 pages, not due to low compaction priority.  With this
patch, it is moved to a new compaction_needs_reclaim() function, and for
that outcome we keep the current logic of retrying if it looks like
reclaim will be able to help.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Mike Kravetz <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, reclaim: cleanup should_continue_reclaim()

After commit "mm, reclaim: make should_continue_reclaim perform dryrun
detection", closer look at the function shows, that nr_reclaimed == 0
means the function will always return false. And since non-zero
nr_reclaimed implies non_zero nr_scanned, testing nr_scanned serves no
purpose, and so does the testing for __GFP_RETRY_MAYFAIL.

This patch thus cleans up the function to test only !nr_reclaimed upfront,
and remove the __GFP_RETRY_MAYFAIL test and nr_scanned parameter
completely. Comment is also updated, explaining that approximating "full
LRU list has been scanned" with nr_scanned == 0 didn't really work.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, reclaim: make should_continue_reclaim perform dryrun detection

Patch series "address hugetlb page allocation stalls", v2.

Allocation of hugetlb pages via sysctl or procfs can stall for minutes or
hours.  A simple example on a two node system with 8GB of memory is as
follows:

echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 > /proc/sys/vm/nr_hugepages

Obviously, both allocation attempts will fall short of their 8GB goal.
However, one or both of these commands may stall and not be interruptible.
The issues were initially discussed in mail thread [1] and RFC code at
[2].

This series addresses the issues causing the stalls.  There are two
distinct fixes, a cleanup, and an optimization.  The reclaim patch by
Hillf and compaction patch by Vlasitmil address corner cases in their
respective areas.  hugetlb page allocation could stall due to either of
these issues.  Vlasitmil added a cleanup patch after Hillf's
modifications.  The hugetlb patch by Mike is an optimization suggested
during the debug and development process.

[1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
[2] http://lkml.kernel.org/r/20190724175014 [email protected]

This patch (of 4):

Address the issue of should_continue_reclaim returning true too often for
__GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.  This was
observed during hugetlb page allocation causing stalls for minutes or
hours.

We can stop reclaiming pages if compaction reports it can make a progress.
There might be side-effects for other high-order allocations that would
potentially benefit from reclaiming more before compaction so that they
would be faster and less likely to stall.  However, the consequences of
premature/over-reclaim are considered worse.

We can also bail out of reclaiming pages if we know that there are not
enough inactive lru pages left to satisfy the costly allocation.

We can give up reclaiming pages too if we see dryrun occur, with the
certainty of plenty of inactive pages.  IOW with dryrun detected, we are
sure we have reclaimed as many pages as we could.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hillf Danton <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, kmem: deprecate kmem.limit_in_bytes

Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
which turned out to be really a bad idea because there are paths which
cannot shrink the kernel memory usage enough to get below the limit (e.g.
because the accounted memory is not reclaimable).  There are cases when
the failure is even not allowed (e.g.  __GFP_NOFAIL).  This means that the
kmem limit is in excess to the hard limit without any way to shrink and
thus completely useless.  OOM killer cannot be invoked to handle the
situation because that would lead to a premature oom killing.

As a result many places might see ENOMEM returning from kmalloc and result
in unexpected errors.  E.g.  a global OOM killer when there is a lot of
free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and
therefore pagefault_out_of_memory would result in OOM killer.

Please note that the kernel memory is still accounted to the overall limit
along with the user memory so removing the kmem specific limit should
still allow to contain kernel memory consumption.  Unlike the kmem one,
though, it invokes memory reclaim and targeted memcg oom killing if
necessary.

Start the deprecation process by crying to the kernel log.  Let's see
whether there are relevant usecases and simply return to EINVAL in the
second stage if nobody complains in few releases.

[[email protected]: tweak documentation text]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Thomas Lindroth <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol.c: fix a -Wunused-function warning

mem_cgroup_id_get() was introduced in commit 73f576c04b94 ("mm:memcontrol:
fix cgroup creation failure after many small jobs").

Later, it no longer has any user since the commits,

1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
58fa2a5512d9 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")

so safe to remove it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, oom: consider present pages for the node size

constrained_alloc() calculates the size of the oom domain by using
node_spanned_pages which is incorrect because this is the full range of
the physical memory range that the numa node occupies rather than the
memory that backs that range which is represented by node_present_pages.

Sparsely populated nodes (e.g. after memory hot remove or simply sparse
due to memory layout) can have really a large difference between the two.
This shouldn't really cause any real user observable problems because the
oom calculates a ratio against totalpages and used memory cannot exceed
present pages but it is confusing and wrong from code point of view.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/oom_kill.c: fix oom_cpuset_eligible() comment

Commit ac311a14c682 ("oom: decouple mems_allowed from
oom_unkillable_task") changed has_intersects_mems_allowed() to
oom_cpuset_eligible(), but didn't change the comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yi Wang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/oom: add oom_score_adj and pgtables to Killed process message

For an OOM event: print oom_score_adj value for the OOM Killed process to
document what the oom score adjust value was at the time the process was
OOM Killed.  The adjustment value can be set by user code and it affects
the resulting oom_score so it is used to influence kill process selection.

When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing
this value is the only documentation of the value for the process being
killed.  Having this value on the Killed process message is useful to
document if a miscconfiguration occurred or to confirm that the
oom_score_adj configuration applies as expected.

An example which illustates both misconfiguration and validation that the
oom_score_adj was applied as expected is:

Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
(systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
shmem-rss:0kB pgtables:22kB oom_score_adj:1000

The systemd-udevd is a critical system application that should have an
oom_score_adj of -1000.  It was miconfigured to have a adjustment of 1000
making it a highly favored OOM kill target process.  The output documents
both the misconfiguration and the fact that the process was correctly
targeted by OOM due to the miconfiguration.  This can be quite helpful for
triage and problem determination.

The addition of the pgtables_bytes shows page table usage by the process
and is a useful measure of the memory size of the process.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Edward Chron <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, oom: don't require __GFP_FS when invoking memcg OOM killer

Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path.  It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
move GFP_NOFS check to out_of_memory").

Allowing forced charge due to being unable to invoke memcg OOM killer will
lead to global OOM situation.  Also, just returning -ENOMEM will be risky
because OOM path is lost and some paths (e.g.  get_user_pages()) will leak
-ENOMEM.  Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be
the only choice we can choose for now.

Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when
GFP_KERNEL reclaim failed [1].  But since 29ef680ae7c21110, we need to
invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
pre-mature memcg OOM reports due to this patch.

[1]

leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
RIP: 0033:0x4009f0
Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
Task in /leaker killed as a result of limit of /leaker
memory: usage 524288kB, limit 524288kB, failcnt 158965
memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[2]

leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
RIP: 0033:0x4009f0
Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
Task in /leaker killed as a result of limit of /leaker
memory: usage 524288kB, limit 524288kB, failcnt 7221
memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[3]

leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
leaker cpuset=/ mems_allowed=0
CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
Call Trace:
  [<ffffffffaf364147>] dump_stack+0x19/0x1b
  [<ffffffffaf35eb6a>] dump_header+0x90/0x229
  [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
  [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
  [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
  [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
  [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
  [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
  [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
  [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
  [<ffffffffaf371925>] do_page_fault+0x35/0x90
  [<ffffffffaf36d768>] page_fault+0x28/0x30
Task in /leaker killed as a result of limit of /leaker
memory: usage 524288kB, limit 524288kB, failcnt 20628
memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB

Bisected by Masoud Sharbiani.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680ae7c21110]
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: Masoud Sharbiani <[email protected]>
Tested-by: Masoud Sharbiani <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: <[email protected]> [4.19+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/oom_kill.c: add task UID to info message on an oom kill

In the event of an oom kill, useful information about the killed process
is printed to dmesg. Users, especially system administrators, will find
it useful to immediately see the UID of the process.

We already print uid when dumping eligible tasks so it is not overly hard
to find that information in the oom report. However this information is
unavailable when dumping of eligible tasks is disabled.

In the following example, abuse_the_ram is the name of a program that
attempts to iteratively allocate all available memory until it is stopped
by force.

Current message:

Out of memory: Killed process 35389 (abuse_the_ram)
total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
shmem-rss:0kB

Patched message:

Out of memory: Killed process 2739 (abuse_the_ram),
total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
shmem-rss:0kB, UID:0

[[email protected]: s/UID %d/UID:%u/ in printk]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Savitz <[email protected]>
Suggested-by: David Rientjes <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempolicy.c: remove unnecessary nodemask check in kernel_migrate_pages()

1) task_nodes = cpuset_mems_allowed(current);
   -> cpuset_mems_allowed() guaranteed to return some non-empty
      subset of node_states[N_MEMORY].

2) nodes_and(*new, *new, task_nodes);
   -> after nodes_and(), the 'new' should be empty or appropriate
      nodemask(online node and with memory).

After 1) and 2), we could remove unnecessary check whether the 'new'
AND node_states[N_MEMORY] is empty.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction.c: remove unnecessary zone parameter in isolate_migratepages()

Like commit 40cacbcb3240 ("mm, compaction: remove unnecessary zone
parameter in some instances"), remove unnecessary zone parameter.

No functional change.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pengfei Li <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone

total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
COMPACTFREE_SCANNED in compact_zone(). We should clear them before
scanning a new zone. In the proc triggered compaction, we forgot clearing
them.

[[email protected]: introduce a helper compact_zone_counters_init()]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: expand compact_zone_counters_init() into its single callsite, per mhocko]
[[email protected]: squash compact_zone() list_head init as well]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 7f354a548d1c ("mm, compaction: add vmstats for kcompactd work")
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

z3fold: fix memory leak in kmem cache

Currently there is a leak in init_z3fold_page() -- it allocates handles
from kmem cache even for headless pages, but then they are never used and
never freed, so eventually kmem cache may get exhausted. This patch
provides a fix for that.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Reported-by: Markus Linnala <[email protected]>
Tested-by: Markus Linnala <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Henry Burns <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: silence -Woverride-init/initializer-overrides

When compiling a kernel with W=1, there are several of those warnings due
to arm64 overriding a field on purpose.  Just disable those warnings for
both GCC and Clang of this file, so it will help dig "gems" hidden in the
W=1 warnings by reducing some noises.

mm/init-mm.c:39:2: warning: initializer overrides prior initialization
of this subobject [-Winitializer-overrides]
        INIT_MM_CONTEXT(init_mm)
        ^~~~~~~~~~~~~~~~~~~~~~~~
./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro
'INIT_MM_CONTEXT'
        .pgd = init_pg_dir,
               ^~~~~~~~~~~
mm/init-mm.c:30:10: note: previous initialization is here
        .pgd            = swapper_pg_dir,
                          ^~~~~~~~~~~~~~

Note: there is a side project trying to support explicitly allowing
specific initializer overrides in Clang, but there is no guarantee it
will happen or not.

https://github.com/ClangBuiltLinux/linux/issues/639

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use CPU_BITS_NONE to initialize init_mm.cpu_bitmask

Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with
neat CPU_BITS_NONE macro.

And, since init_mm.cpu_bitmask is statically set to zero, there is no way
to clear it again in start_kernel().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc.c: move 'area->pages' after if statement

If !area->pages statement is true where memory allocation fails, area is
freed.

In this case 'area->pages = pages' should not executed. So move
'area->pages = pages' after if statement.

[[email protected]: give area->pages the same treatment]
Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
Signed-off-by: Austin Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Roman Penyaev <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc: modify struct vmap_area to reduce its size

Objective
---------

The current implementation of struct vmap_area wasted space.

After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.

Description
-----------

1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
because

A) "subtree_max_size" is only used when vmap_area is in "free" tree

B) "vm" is only used when vmap_area is in "busy" tree

C) "purge_list" is only used when vmap_area is in vmap_purge_list

2) Eliminate "flags".

;Since only one flag VM_VM_AREA is being used, and the same thing can be
done by judging whether "vm" is NULL, then the "flags" can be eliminated.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pengfei Li <[email protected]>
Suggested-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmalloc: do not keep unpurged areas in the busy tree

The busy tree can be quite big, even though the area is freed or unmapped
it still stays there until "purge" logic removes it.

1) Optimize and reduce the size of "busy" tree by removing a node from
   it right away as soon as user triggers free paths.  It is possible to
   do so, because the allocation is done using another augmented tree.

The vmalloc test driver shows the difference, for example the
"fix_size_alloc_test" is ~11% better comparing with default configuration:

sudo ./test_vmalloc.sh performance

<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>

<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>

2) Since the busy tree now contains allocated areas only and does not
   interfere with lazily free nodes, introduce the new function
   show_purge_info() that dumps "unpurged" areas that is propagated
   through "/proc/vmallocinfo".

3) Eliminate VM_LAZY_FREE flag.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Pengfei Li <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: remove NULL check in clear_hwpoisoned_pages()

There is no possibility for memmap to be NULL in the current codebase.

This check was added in commit 95a4774d055c ("memory-hotplug: update
mce_bad_pages when removing the memory") where memmap was originally
inited to NULL, and only conditionally given a value.

The code that could have passed a NULL has been removed by commit
ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
longer a possibility that memmap can be NULL.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alastair D'Silva <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: don't manually decrement num_poisoned_pages

Use the function written to do it instead.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alastair D'Silva <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: use __nr_to_section(section_nr) to get mem_section

__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)).

Since we already get section_nr, it is not necessary to get mem_section
from start_pfn. By doing so, we reduce one redundant operation.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Tested-by: Anshuman Khandual <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: fix ALIGN() without power of 2 in sparse_buffer_alloc()

The size argument passed into sparse_buffer_alloc() has already been
aligned with PAGE_SIZE or PMD_SIZE.

If the size after aligned is not power of 2 (e.g. 0x480000), the
PTR_ALIGN() will return wrong value. Use roundup to round sparsemap_buf
up to next multiple of size.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lecopzer Chen <[email protected]>
Signed-off-by: Mark-PK Tsai <[email protected]>
Cc: YJ Chiang <[email protected]>
Cc: Lecopzer Chen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/sparse.c: fix memory leak of sparsemap_buf in aligned memory

sparse_buffer_alloc(xsize) gets the size of memory from sparsemap_buf
after being aligned with the size.  However, the size is at least
PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
than PAGE_SIZE.

Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
sparsemap_buf_end, since sparsemap_buf may be changed by PTR_ALIGN()
first, the aligned space before sparsemap_buf is wasted and no one will
touch it.

In our ARM32 platform (without SPARSEMEM_VMEMMAP)
  Sparse_buffer_init
    Reserve d359c000 - d3e9c000 (9M)
  Sparse_buffer_alloc
    Alloc   d3a00000 - d3E80000 (4.5M)
  Sparse_buffer_fini
    Free    d3e80000 - d3e9c000 (~=100k)
The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.

In ARM64 platform (with SPARSEMEM_VMEMMAP)

  sparse_buffer_init
    Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
  Sparse_buffer_alloc
    Alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
  Sparse_buffer_fini
    Free    ffffffc07f600000 - ffffffc07f623000 (140K)
The reserved memory between ffffffc07d623000 - ffffffc07d800000
(~=1.9M) is unfreed.

Let's explicit free redundant aligned memory.

[[email protected]: mark sparse_buffer_free as __meminit]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lecopzer Chen <[email protected]>
Signed-off-by: Mark-PK Tsai <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: YJ Chiang <[email protected]>
Cc: Lecopzer Chen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: s/is/if

Correct typo in comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Souptick Joarder <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: online_pages cannot be 0 in online_pages()

walk_system_ram_range() will fail with -EINVAL in case
online_pages_range() was never called (== no resource applicable in the
range). Otherwise, we will always call online_pages_range() with nr_pages
> 0 and, therefore, have online_pages > 0.

Remove that special handling.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: make sure the pfn is aligned to the order when onlining

Commit a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as
higher order") assumed that any PFN we get via memory resources is aligned
to to MAX_ORDER - 1, I am not convinced that is always true. Let's play
safe, check the alignment and fallback to single pages.

akpm: warn in this situation so we get to find out if and why this ever
occurs.

[[email protected]: add WARN_ON_ONCE()]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: simplify online_pages_range()

online_pages always corresponds to nr_pages. Simplify the code, getting
rid of online_pages_blocks(). Add some comments.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: drop PageReserved() check in online_pages_range()

move_pfn_range_to_zone() will set all pages to PG_reserved via
memmap_init_zone(). The only way a page could no longer be reserved would
be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
done (the online_page callback is used for that purpose by e.g., Hyper-V
instead). walk_system_ram_range() will never call online_pages_range()
with duplicate PFNs, so drop the PageReserved() check.

This seems to be a leftover from ancient times where the memmap was
initialized when adding memory and we wanted to check for already onlined
memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()

Patch series "mm/memory_hotplug: online_pages() cleanups", v2.

Some cleanups (+ one fix for a special case) in the context of
online_pages().

This patch (of 5):

This makes it clearer that we will never call func() with duplicate PFNs
in case we have multiple sub-page memory resources. All unaligned parts
of PFNs are completely discarded.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: prevent memory leak when reusing pgdat

When offlining a node in try_offline_node(), pgdat is not released. So
that pgdat could be reused in hotadd_new_pgdat(). While we reallocate
pgdat->per_cpu_nodestats if this pgdat is reused.

This patch prevents the memory leak by just allocating per_cpu_nodestats
when it is a new pgdat.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/memory.c: don't store end_section_nr in memory blocks

Each memory block spans the same amount of sections/pages/bytes.  The size
is determined before the first memory block is created.  No need to store
what we can easily calculate - and the calculations even look simpler now.

Michal brought up the idea of variable-sized memory blocks.  However, if
we ever implement something like this, we will need an API compatibility
switch and reworks at various places (most code assumes a fixed memory
block size).  So let's cleanup what we have right now.

While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

driver/base/memory.c: validate memory block size early

Let's validate the memory block size early, when initializing the memory
device infrastructure. Fail hard in case the value is not suitable.

As nobody checks the return value of memory_dev_init(), turn it into a
void function and fail with a panic in all scenarios instead. Otherwise,
we'll crash later during boot when core/drivers expect that the memory
device infrastructure (including memory_block_size_bytes()) works as
expected.

I think long term, we should move the whole memory block size
configuration (set_memory_block_size_order() and
memory_block_size_bytes()) into drivers/base/memory.c.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/memory.c: fixup documentation of removable/phys_index/block_size_bytes

Let's rephrase to memory block terminology and add some further
clarifications.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/node.c: simplify unregister_memory_block_under_nodes()

We don't allow to offline memory block devices that belong to multiple
numa nodes.  Therefore, such devices can never get removed.  It is
sufficient to process a single node when removing the memory block.  No
need to iterate over each and every PFN.

We already have the nid stored for each memory block.  Make sure that the
nid always has a sane value.

Please note that checking for node_online(nid) is not required.  If we
would have a memory block belonging to a node that is no longer offline,
then we would have a BUG in the node offlining code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: remove move_pfn_range()

Let's remove this indirection. We need the zone in the caller either way,
so let's just detect it there. Add some documentation for
move_pfn_range_to_zone() instead.

[[email protected]: restore newline, per David]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: do not hash address in print_bad_pte()

Using %px to show the actual address in print_bad_pte()
to help us to debug issue.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>