Git Repo - linux.git/log

samples/bpf: fix a build issue

With latest net-next:

====
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include -I./arch/x86/include -I./arch/x86/include/generated/uapi -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
    -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
    -Wno-compare-distinct-pointer-types \
    -Wno-gnu-variable-sized-type-not-at-end \
    -Wno-address-of-packed-member -Wno-tautological-compare \
    -Wno-unknown-warning-option \
    -O2 -emit-llvm -c samples/bpf/tcp_synrto_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/tcp_synrto_kern.o
samples/bpf/tcp_synrto_kern.c:20:10: fatal error: 'bpf_endian.h' file not found
          ^~~~~~~~~~~~~~
1 error generated.
====

net has the same issue.

Add support for ntohl and htonl in tools/testing/selftests/bpf/bpf_endian.h.
Also move bpf_helpers.h from samples/bpf to selftests/bpf and change
compiler include logic so that programs in samples/bpf can access the headers
in selftests/bpf, but not the other way around.

Signed-off-by: Yonghong Song <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked-by: Lawrence Brakmo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bridge: mdb: fix leak on complete_info ptr on fail path

We currently get the following kmemleak report:
unreferenced object 0xffff8800039d9820 (size 32):
  comm "softirq", pid 0, jiffies 4295212383 (age 792.416s)
  hex dump (first 32 bytes):
    00 0c e0 03 00 88 ff ff ff 02 00 00 00 00 00 00  ................
    00 00 00 01 ff 11 00 02 86 dd 00 00 ff ff ff ff  ................
  backtrace:
    [<ffffffff8152b4aa>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff811d8ec8>] kmem_cache_alloc_trace+0xb8/0x1c0
    [<ffffffffa0389683>] __br_mdb_notify+0x2a3/0x300 [bridge]
    [<ffffffffa038a0ce>] br_mdb_notify+0x6e/0x70 [bridge]
    [<ffffffffa0386479>] br_multicast_add_group+0x109/0x150 [bridge]
    [<ffffffffa0386518>] br_ip6_multicast_add_group+0x58/0x60 [bridge]
    [<ffffffffa0387fb5>] br_multicast_rcv+0x1d5/0xdb0 [bridge]
    [<ffffffffa037d7cf>] br_handle_frame_finish+0xcf/0x510 [bridge]
    [<ffffffffa03a236b>] br_nf_hook_thresh.part.27+0xb/0x10 [br_netfilter]
    [<ffffffffa03a3738>] br_nf_hook_thresh+0x48/0xb0 [br_netfilter]
    [<ffffffffa03a3fb9>] br_nf_pre_routing_finish_ipv6+0x109/0x1d0 [br_netfilter]
    [<ffffffffa03a4400>] br_nf_pre_routing_ipv6+0xd0/0x14c [br_netfilter]
    [<ffffffffa03a3c27>] br_nf_pre_routing+0x197/0x3d0 [br_netfilter]
    [<ffffffff814a2952>] nf_iterate+0x52/0x60
    [<ffffffff814a29bc>] nf_hook_slow+0x5c/0xb0
    [<ffffffffa037ddf4>] br_handle_frame+0x1a4/0x2c0 [bridge]

This happens when switchdev_port_obj_add() fails. This patch
frees complete_info object in the fail path.

Reviewed-by: Vallish Vaidyeshwara <[email protected]>
Signed-off-by: Eduardo Valentin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
"This is a followup for block changes, that didn't make the initial
  pull request. It's a bit of a mixed bag, this contains:

   - A followup pull request from Sagi for NVMe. Outside of fixups for
     NVMe, it also includes a series for ensuring that we properly
     quiesce hardware queues when browsing live tags.

   - Set of integrity fixes from Dmitry (mostly), fixing various issues
     for folks using DIF/DIX.

   - Fix for a bug introduced in cciss, with the req init changes. From
     Christoph.

   - Fix for a bug in BFQ, from Paolo.

   - Two followup fixes for lightnvm/pblk from Javier.

   - Depth fix from Ming for blk-mq-sched.

   - Also from Ming, performance fix for mtip32xx that was introduced
     with the dynamic initialization of commands"

* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
  block: call bio_uninit in bio_endio
  nvmet: avoid unneeded assignment of submit_bio return value
  nvme-pci: add module parameter for io queue depth
  nvme-pci: compile warnings in nvme_alloc_host_mem()
  nvmet_fc: Accept variable pad lengths on Create Association LS
  nvme_fc/nvmet_fc: revise Create Association descriptor length
  lightnvm: pblk: remove unnecessary checks
  lightnvm: pblk: control I/O flow also on tear down
  cciss: initialize struct scsi_req
  null_blk: fix error flow for shared tags during module_init
  block: Fix __blkdev_issue_zeroout loop
  nvme-rdma: unconditionally recycle the request mr
  nvme: split nvme_uninit_ctrl into stop and uninit
  virtio_blk: quiesce/unquiesce live IO when entering PM states
  mtip32xx: quiesce request queues to make sure no submissions are inflight
  nbd: quiesce request queues to make sure no submissions are inflight
  nvme: kick requeue list when requeueing a request instead of when starting the queues
  nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
  ...

Merge tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes and sane default from Steve French:
"Upgrade default dialect to more secure SMB3 from older cifs dialect"

* tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Clean up unused variables in smb2pdu.c
  [SMB3] Improve security, move default dialect to SMB3 from old CIFS
  [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
  CIFS: Reconnect expired SMB sessions
  CIFS: Display SMB2 error codes in the hex format
  cifs: Use smb 2 - 3 and cifsacl mount options setacl function
  cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options

tap: convert a mutex to a spinlock

We are not allowed to block on the RCU reader side, so can't
just hold the mutex as before. As a quick fix, convert it to
a spinlock.

Fixes: d9f1f61c0801 ("tap: Extending tap device create/destroy APIs")
Reported-by: Christian Borntraeger <[email protected]>
Tested-by: Christian Borntraeger <[email protected]>
Cc: Sainath Grandhi <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: fix BUG() on interrupt deallocating path of ULD

Since the introduction of ULD (Upper-Layer Drivers), the MSI-X
deallocating path changed in cxgb4: the driver frees the interrupts
of ULD when unregistering it or on shutdown PCI handler.

Problem is that if a MSI-X is not freed before deallocated in the PCI
layer, it will trigger a BUG() due to still "alive" interrupt being
tentatively quiesced.

The below trace was observed when doing a simple unbind of Chelsio's
adapter PCI function, like:
  "echo 001e:80:00.4 > /sys/bus/pci/drivers/cxgb4/unbind"

Trace:

  kernel BUG at drivers/pci/msi.c:352!
  Oops: Exception in kernel mode, sig: 5 [#1]
  ...
  NIP [c0000000005a5e60] free_msi_irqs+0xa0/0x250
  LR [c0000000005a5e50] free_msi_irqs+0x90/0x250
  Call Trace:
  [c0000000005a5e50] free_msi_irqs+0x90/0x250 (unreliable)
  [c0000000005a72c4] pci_disable_msix+0x124/0x180
  [d000000011e06708] disable_msi+0x88/0xb0 [cxgb4]
  [d000000011e06948] free_some_resources+0xa8/0x160 [cxgb4]
  [d000000011e06d60] remove_one+0x170/0x3c0 [cxgb4]
  [c00000000058a910] pci_device_remove+0x70/0x110
  [c00000000064ef04] device_release_driver_internal+0x1f4/0x2c0
  ...

This patch fixes the issue by refactoring the shutdown path of ULD on
cxgb4 driver, by properly freeing and disabling interrupts on PCI
remove handler too.

Fixes: 0fbc81b3ad51 ("Allocate resources dynamically for all cxgb4 ULD's")
Reported-by: Harsha Thyagaraja <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: Fix printk option passed when printing ipv6 addresses

The option "h" (host order ) exists for ipv4 only.
Remove the h when printing ipv6 addresses.

Lead to the following smatch warning:

drivers/net/ethernet/qlogic/qed/qed_iwarp.c:585 qed_iwarp_print_tcp_ramrod()
warn: '%pI6' can only be followed by c
drivers/net/ethernet/qlogic/qed/qed_iwarp.c:1521 qed_iwarp_print_cm_info()
warn: '%pI6' can only be followed by c

Fixes commit 456a584947d5 ("qed: iWARP CM add passive side connect")

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: Yuval Mintz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: Fix minor code bug in timestamping.txt

Passing (void*)val instead of &val would make a pointer out of an integer
and cause sock_setsockopt to -EFAULT.

See tools/testing/selftests/networking/timestamping/timestamping.c
for a working example.

Cc: David S. Miller <[email protected]>
Cc: [email protected]
Signed-off-by: Ahmad Fatoum <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'stmmac-dma-resources-fixes'

Christophe JAILLET says:

====================
net: stmmac: Fixes and cleanups in 'alloc_dma_[rt]x_desc_resources()'

These patchs are all related to 'alloc_dma_[rt]x_desc_resources()' functions.

The 2 first fix an error path where some resources are leaking. I've
separated them into 2 patches because the issues have been introduced by
2 deferent commits.

The 3rd patch is just a clean-up.
====================

Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Make 'alloc_dma_[rt]x_desc_resources()' look even closer

'alloc_dma_[rt]x_desc_resources()' functions look very close.
Remove a useless initialization and use the same label name for error
handling path in order to get them even closer.

Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Giuseppe Cavallaro <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix error handling path in 'alloc_dma_tx_desc_resources()'

If the first 'kmalloc_array' within the loop fails, we should free what
as already been allocated, as done in all other error handling path.

Fixes: ce736788e8a9 ("net: stmmac: adding multiple buffers for TX")
Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Giuseppe Cavallaro <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix error handling path in 'alloc_dma_rx_desc_resources()'

If the first 'kmalloc_array' within the loop fails, we should free what
as already been allocated, as done in all other error handling path.

Fixes: 54139cf3bb33 ("net: stmmac: adding multiple buffers for rx")
Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Giuseppe Cavallaro <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
"The main item here is support for v12.y.z ("Luminous") clusters:
  RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
  feature bits, and various other changes in the RADOS client protocol.

  On top of that we have a new fsc mount option to allow supplying
  fscache uniquifier (similar to NFS) and the usual pile of filesystem
  fixes from Zheng"

* tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
  libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
  libceph: osd_state is 32 bits wide in luminous
  crush: remove an obsolete comment
  crush: crush_init_workspace starts with struct crush_work
  libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
  crush: implement weight and id overrides for straw2
  libceph: apply_upmap()
  libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
  libceph: pg_upmap[_items] infrastructure
  libceph: ceph_decode_skip_* helpers
  libceph: kill __{insert,lookup,remove}_pg_mapping()
  libceph: introduce and switch to decode_pg_mapping()
  libceph: don't pass pgid by value
  libceph: respect RADOS_BACKOFF backoffs
  libceph: make DEFINE_RB_* helpers more general
  libceph: avoid unnecessary pi lookups in calc_target()
  libceph: use target pi for calc_target() calculations
  libceph: always populate t->target_{oid,oloc} in calc_target()
  libceph: make sure need_resend targets reflect latest map
  libceph: delete from need_resend_linger before check_linger_pool_dne()
  ...

cisco: enic: Fic an error handling path in 'vnic_dev_init_devcmd2()'

if 'ioread32()' returns 0xFFFFFFF, we have to go through the error
handling path as done everywhere else in this function.

Move the 'err_free_wq' label to better match its name and its location
and add a new label 'err_disable_wq'.
Update the code accordingly.

Fixes: 373fb0873d43 ("enic: add devcmd2")
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bnxt_en-Bug-fixes'

Michael Chan says:

====================
bnxt_en: Bug fixes.

3 bug fixes in this series.  Fix a crash in bnxt_get_stats64() that can
happen if the device is closing and freeing the statistics block at the
same time.  The 2nd one fixes ethtool -L failing when changing from
combined to non-combined mode or vice versa.  The last one fixes SRIOV
failure on big-endian systems because we were setting a bitmap wrong in
a firmware message.
====================

Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix SRIOV on big-endian architecture.

The PF driver sets up a list of firmware commands from the VF driver that
needs to be forwarded to the PF for approval.  This list is a 256-bit
bitmap.  The code that sets up the bitmap falls apart on big-endian
architecture.  __set_bit() does not work because it operates on long types
whereas the firmware interface is defined in u32 types, causing bits in
the wrong 32-bit word to be set.

Fix it by setting the proper bits on an array of u32.

Fixes: de68f5de5651 ("bnxt_en: Fix bitmap declaration to work on 32-bit arches.")
Reported-by: Shannon Nelson <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix bug in ethtool -L.

When changing channels from combined to rx/tx or vice versa, the code
uses the wrong "sh" parameter to determine if we are reserving rings
for shared or non-shared mode. It should be using the ethtool requested
"sh" parameter instead of the current "sh" parameter.

Fix it by passing the "sh" parameter to bnxt_reserve_rings(). For
ethtool, we will pass in the requested "sh" parameter.

Fixes: 391be5c27364 ("bnxt_en: Implement new scheme to reserve tx rings.")
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix race conditions in .ndo_get_stats64().

.ndo_get_stats64() may not be protected by RTNL and can race with
.ndo_stop() or other ethtool operations that can free the statistics
memory. Fix it by setting a new flag BNXT_STATE_READ_STATS and then
proceeding to read statistics memory only if the state is OPEN. The
close path that frees the memory clears the OPEN state and then waits
for the BNXT_STATE_READ_STATS to clear before proceeding to free the
statistics memory.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://www.linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

- Add Renesas RZ/A WDT Watchdog driver

- STM32 Independent WatchDoG (IWDG) support

- UniPhier watchdog support

- Add F71868 support

- Add support for NCT6793D and NCT6795D

- dw_wdt: add reset lines support

- core: add option to avoid early handling of watchdog

- core: introduce watchdog_worker_should_ping helper

- Cleanups and improvements for sama5d4, intel-mid_wdt, s3c2410_wdt,
   orion_wdt, gpio_wdt, it87_wdt, meson_wdt, davinci_wdt, bcm47xx_wdt,
   zx2967_wdt, cadence_wdt

* git://www.linux-watchdog.org/linux-watchdog: (32 commits)
  watchdog: introduce watchdog_worker_should_ping helper
  watchdog: uniphier: add UniPhier watchdog driver
  dt-bindings: watchdog: add description for UniPhier WDT controller
  watchdog: cadence_wdt: make of_device_ids const.
  watchdog: zx2967: constify zx2967_wdt_ops.
  watchdog: bcm47xx_wdt: constify bcm47xx_wdt_hard_ops and bcm47xx_wdt_soft_ops
  watchdog: davinci: Add missing clk_disable_unprepare().
  watchdog: davinci: Handle return value of clk_prepare_enable
  watchdog: meson: Handle return value of clk_prepare_enable
  watchdog: it87: Add support for various Super-IO chips
  watchdog: it87: Use infrastructure to stop watchdog on reboot
  watchdog: it87: Drop support for resetting watchdog though CIR and Game port
  watchdog: it87: Convert to use watchdog core infrastructure
  watchdog: it87: Drop FSF mailing address
  watchdog: dw_wdt: get reset lines from dt
  watchdog: bindings: dw_wdt: add reset lines
  watchdog: w83627hf: Add support for NCT6793D and NCT6795D
  watchdog: core: add option to avoid early handling of watchdog
  watchdog: f71808e_wdt: Add F71868 support
  watchdog: Add STM32 IWDG driver
  ...

Merge tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform

Pull chrome platform updates from Benson Leung:
"Changes in this pull request are around catching up cros_ec with the
  internal chromeos-kernel versions of cros_ec, cros_ec_lpc, and
  cros_ec_lightbar.

  Also, switching maintainership from olof to bleung"

* tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform:
  platform/chrome : Add myself as Maintainer
  platform/chrome: cros_ec_lightbar - hide unused PM functions
  cros_ec: Don't signal wake event for non-wake host events
  cros_ec: Fix deadlock when EC is not responsive at probe
  cros_ec: Don't return error when checking command version
  platform/chrome: cros_ec_lightbar - Avoid I2C xfer to EC during suspend
  platform/chrome: cros_ec_lightbar - Add userspace lightbar control bit to EC
  platform/chrome: cros_ec_lightbar - Control of suspend/resume lightbar sequence
  platform/chrome: cros_ec_lightbar - Add lightbar program feature to sysfs
  platform/chrome: cros_ec_lpc: Add MKBP events support over ACPI
  platform/chrome: cros_ec_lpc: Add power management ops
  platform/chrome: cros_ec_lpc: Add support for GOOG004 ACPI device
  platform/chrome: cros_ec_lpc: Add support for mec1322 EC
  platform/chrome: cros_ec_lpc: Add R/W helpers to LPC protocol variants
  mfd: cros_ec: Add support for dumping panic information
  cros_ec_debugfs: Pass proper struct sizes to cros_ec_cmd_xfer()
  mfd: cros_ec: add debugfs, console log file
  mfd: cros_ec: Add EC console read structures definitions
  mfd: cros_ec: Add helper for event notifier.

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu

Pull x86nommu update from Greg Ungerer:
"Only a single change, to remove old Kconfig options from defconfigs"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: defconfig: Cleanup from old Kconfig options

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:

- most of the rest of MM

- KASAN updates

- lib/ updates

- checkpatch updates

- some binfmt_elf changes

- various misc bits

* emailed patches from Andrew Morton <[email protected]>: (115 commits)
  kernel/exit.c: avoid undefined behaviour when calling wait4()
  kernel/signal.c: avoid undefined behaviour in kill_something_info
  binfmt_elf: safely increment argv pointers
  s390: reduce ELF_ET_DYN_BASE
  powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm: move ELF_ET_DYN_BASE to 4MB
  binfmt_elf: use ELF_ET_DYN_BASE only for PIE
  fs, epoll: short circuit fetching events if thread has been killed
  checkpatch: improve multi-line alignment test
  checkpatch: improve macro reuse test
  checkpatch: change format of --color argument to --color[=WHEN]
  checkpatch: silence perl 5.26.0 unescaped left brace warnings
  checkpatch: improve tests for multiple line function definitions
  checkpatch: remove false warning for commit reference
  checkpatch: fix stepping through statements with $stat and ctx_statement_block
  checkpatch: [HLP]LIST_HEAD is also declaration
  checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
  checkpatch: improve the unnecessary OOM message test
  lib/bsearch.c: micro-optimize pivot position calculation
  ...

kernel/exit.c: avoid undefined behaviour when calling wait4()

wait4(-2147483648, 0x20, 0, 0xdd0000) triggers:
UBSAN: Undefined behaviour in kernel/exit.c:1651:9

The related calltrace is as follows:

  negation of -2147483648 cannot be represented in type 'int':
  CPU: 9 PID: 16482 Comm: zj Tainted: G    B          ---- -------   3.10.0-327.53.58.71.x86_64+ #66
  Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285          /BC11BTSA              , BIOS CTSAV036 04/27/2011
  Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SyS_wait4+0x1cb/0x1e0
    system_call_fastpath+0x16/0x1b

Exclude the overflow to avoid the UBSAN warning.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhongjiang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Xishi Qiu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/signal.c: avoid undefined behaviour in kill_something_info

When running kill(72057458746458112, 0) in userspace I hit the following
issue.

  UBSAN: Undefined behaviour in kernel/signal.c:1462:11
  negation of -2147483648 cannot be represented in type 'int':
  CPU: 226 PID: 9849 Comm: test Tainted: G    B          ---- -------   3.10.0-327.53.58.70.x86_64_ubsan+ #116
  Hardware name: Huawei Technologies Co., Ltd. RH8100 V3/BC61PBIA, BIOS BLHSV028 11/11/2014
  Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SYSC_kill+0x43e/0x4d0
    SyS_kill+0xe/0x10
    system_call_fastpath+0x16/0x1b

Add code to avoid the UBSAN detection.

[[email protected]: tweak comment]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhongjiang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xishi Qiu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

binfmt_elf: safely increment argv pointers

When building the argv/envp pointers, the envp is needlessly
pre-incremented instead of just continuing after the argv pointers are
finished. In some (likely impossible) race where the strings could be
changed from userspace between copy_strings() and here, it might be
possible to confuse the envp position. Instead, just use sp like
everything else.

Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
Signed-off-by: Kees Cook <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Qualys Security Advisory <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Grzegorz Andrejczuk <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

s390: reduce ELF_ET_DYN_BASE

Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.

For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
address space for 32-bit pointers. On 32-bit use 4MB, which is the
traditional x86 minimum load location, likely to avoid historically
requiring a 4MB page table entry when only a portion of the first 4MB
would be used (since the NULL address is avoided). For s390 the
position could be 0x10000, but that is needlessly close to the NULL
address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Pratyush Anand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB

Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.

For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
address space for 32-bit pointers. On 32-bit use 4MB, which is the
traditional x86 minimum load location, likely to avoid historically
requiring a 4MB page table entry when only a portion of the first 4MB
would be used (since the NULL address is avoided).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Tested-by: Michael Ellerman <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Pratyush Anand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: move ELF_ET_DYN_BASE to 4GB / 4MB

Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.

For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
address space for 32-bit pointers. On 32-bit use 4MB, to match ARM.
This could be 0x8000, the standard ET_EXEC load address, but that is
needlessly close to the NULL address, and anyone running arm compat PIE
will have an MMU, so the tight mapping is not needed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm: move ELF_ET_DYN_BASE to 4MB

Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.

4MB is chosen here mainly to have parity with x86, where this is the
traditional minimum load location, likely to avoid historically
requiring a 4MB page table entry when only a portion of the first 4MB
would be used (since the NULL address is avoided).

For ARM the position could be 0x8000, the standard ET_EXEC load address,
but that is needlessly close to the NULL address, and anyone running PIE
on 32-bit ARM will have an MMU, so the tight mapping is not needed.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Pratyush Anand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Grzegorz Andrejczuk <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Qualys Security Advisory <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

binfmt_elf: use ELF_ET_DYN_BASE only for PIE

The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Qualys Security Advisory <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Grzegorz Andrejczuk <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Pratyush Anand <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs, epoll: short circuit fetching events if thread has been killed

We've encountered zombies that are waiting for a thread to exit that are
looping in ep_poll() almost endlessly although there is a pending
SIGKILL as a result of a group exit.

This happens because we always find ep_events_available() and fetch more
events and never are able to check for signal_pending() that would break
from the loop and return -EINTR.

Special case fatal signals and break immediately to guarantee that we
loop to fetch more events and delay making a timely exit.

It would also be possible to simply move the check for signal_pending()
higher than checking for ep_events_available(), but there have been no
reports of delayed signal handling other than SIGKILL preventing zombies
from exiting that would be fixed by this.

It fixes an issue for us where we have witnessed zombies sticking around
for at least O(minutes), but considering the code has been like this
forever and nobody else has complained that I have found, I would simply
queue it up for 4.12.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Davide Libenzi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve multi-line alignment test

The current test fails to warn about improper alignment with code like

foo->bar = func(arg1,
arg2);

because foo->bar is not a single identifier.

Convert the $Ident to $Lval which allows for multiple dereferences.

Link: http://lkml.kernel.org/r/01c35b9b6a12a415e57746d45d589bfaad39952a.1498841563.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve macro reuse test

checkpatch reports a false positive when using token pasting argument
multiple times in a macro.

Fix it.

Miscellanea:

o Make the $tmp variable name used in the macro argument tests
a bit more descriptive

Link: http://lkml.kernel.org/r/cf434ae7602838388c7cb49d42bca93ab88527e7.1498483044.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Reported-by: Johannes Berg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: change format of --color argument to --color[=WHEN]

The boolean --color argument did not offer the ability to force
colourized output even if stdout is not a terminal. Change the format
of the argument to the familiar --color[=WHEN] construct as seen in
common Linux utilities such as git, ls and dmesg, which allows the user
to specify whether to colourize output "always", "never", or "auto" when
the output is a terminal. The default is "auto".

The old command-line uses of --color and --no-color are unchanged.

Link: http://lkml.kernel.org/r/efe43bdbad400f39ba691ae663044462493b0773.1496799721.git.joe@perches.com
Signed-off-by: John Brooks <[email protected]>
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: silence perl 5.26.0 unescaped left brace warnings

As of perl 5, version 26, subversion 0 (v5.26.0) some new warnings have
occurred when running checkpatch.

Unescaped left brace in regex is deprecated here (and will be fatal in
Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
<-- HERE \s*/ at scripts/checkpatch.pl line 3544.

Unescaped left brace in regex is deprecated here (and will be fatal in
Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
<-- HERE \s*/ at scripts/checkpatch.pl line 3885.

Unescaped left brace in regex is deprecated here (and will be fatal in
Perl 5.30), passed through in regex; marked by <-- HERE in
m/^(\+.*(?:do|\))){ <-- HERE / at scripts/checkpatch.pl line 4374.

It seems perfectly reasonable to do as the warning suggests and simply
escape the left brace in these three locations.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Cyril Bur <[email protected]>
Acked-by: Joe Perches <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve tests for multiple line function definitions

Add a block that identifies multiple line function definitions.

Save the function name into $context_function to improve the embedded
function name test.

Look for misplaced open brace on the function definition.
Emit an OPEN_BRACE error when the function definition is similar to

void foo(int arg1,
int arg2) {

Miscellanea:

o Remove the $realfile test in function declaration w/o named arguments test
o Comment the function declaration w/o named arguments test

Link: http://lkml.kernel.org/r/de620ed6ebab75fdfa323741ada2134a0f545892.1496835238.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Tested-by: David Kershner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: remove false warning for commit reference

Checkpatch warns of an incorrect commit reference style for any
hexadecimal number of 12 digits and more.

Numbers of 12 digits are not necessarily commit ids.

For an example provoking the problem see
https://patchwork.kernel.org/patch/9170897/

Checkpatch should only warn if the number refers to an existing commit.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Heinrich Schuchardt <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: fix stepping through statements with $stat and ctx_statement_block

Fix the off-by-one in the suppression of lines in a statement block.

This means that for multiple line statements like

foo(bar,
    baz,
    qux);

$stat has been inspected first correctly for the entire statement,
and subsequently incorrectly just for

    qux);

This fix will help make tracking appropriate indentation a little easier.

Link: http://lkml.kernel.org/r/71b25979c90412133c717084036c9851cd2b7bcb.1496862585.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: [HLP]LIST_HEAD is also declaration

Fixes the following false warning among others for LLIST_HEAD and
PLIST_HEAD:

    WARNING: Missing a blank line after declarations
    #71: FILE: drivers/s390/scsi/zfcp_fsf.c:422:
    + struct hlist_node *tmp;
    + HLIST_HEAD(remove_queue);

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Steffen Maier <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t

For consistency, MAINTAINERS entries should be an upper case letter,
then a colon, then a tab, then the value.

Warn when an entry doesn't have this form. --fix it too.

Link: http://lkml.kernel.org/r/9aaaf03ec10adf3888b5e98dd2176b7fe9b5fad8.1496343345.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

checkpatch: improve the unnecessary OOM message test

Use the context around a patch to avoid missing some candidates.

Link: http://lkml.kernel.org/r/865e874fbae5decc331a849bd8d71c325db6bc80.1496343345.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/bsearch.c: micro-optimize pivot position calculation

There is a slightly faster way (in terms of the number of instructions
being used) to calculate the position of a middle element, preserving
integer overflow safeness.

./scripts/bloat-o-meter lib/bsearch.o.old lib/bsearch.o.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24)
function                                     old     new   delta
bsearch                                      122      98     -24

TEST

INT array of size 100001, elements [0..100000]. gcc 7.1, Os, x86_64.

a) bsearch() of existing key "100001 - 2":

BASE
====

$ perf stat ./a.out

Performance counter stats for './a.out':

        619.445196      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               133      page-faults:u             #    0.215 K/sec
     1,949,517,279      cycles:u                  #    3.147 GHz                      (83.06%)
       181,017,938      stalled-cycles-frontend:u #    9.29% frontend cycles idle     (83.05%)
        82,959,265      stalled-cycles-backend:u  #    4.26% backend cycles idle      (67.02%)
     4,355,706,383      instructions:u            #    2.23  insn per cycle
                                                  #    0.04  stalled cycles per insn  (83.54%)
     1,051,539,242      branches:u                # 1697.550 M/sec                    (83.54%)
        15,263,381      branch-misses:u           #    1.45% of all branches          (83.43%)

       0.620082548 seconds time elapsed

PATCHED
=======

$ perf stat ./a.out

Performance counter stats for './a.out':

        475.097316      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               135      page-faults:u             #    0.284 K/sec
     1,487,467,717      cycles:u                  #    3.131 GHz                      (82.95%)
       186,537,162      stalled-cycles-frontend:u #   12.54% frontend cycles idle     (82.93%)
        28,797,869      stalled-cycles-backend:u  #    1.94% backend cycles idle      (67.10%)
     3,807,564,203      instructions:u            #    2.56  insn per cycle
                                                  #    0.05  stalled cycles per insn  (83.57%)
     1,049,344,291      branches:u                # 2208.693 M/sec                    (83.60%)
             5,485      branch-misses:u           #    0.00% of all branches          (83.58%)

       0.475760235 seconds time elapsed

b) bsearch() of un-existing key "100001 + 2":

BASE
====

$ perf stat ./a.out

Performance counter stats for './a.out':

        499.244480      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               132      page-faults:u             #    0.264 K/sec
     1,571,194,855      cycles:u                  #    3.147 GHz                      (83.18%)
        13,450,980      stalled-cycles-frontend:u #    0.86% frontend cycles idle     (83.18%)
        21,256,072      stalled-cycles-backend:u  #    1.35% backend cycles idle      (66.78%)
     4,171,197,909      instructions:u            #    2.65  insn per cycle
                                                  #    0.01  stalled cycles per insn  (83.68%)
     1,009,175,281      branches:u                # 2021.405 M/sec                    (83.79%)
             3,122      branch-misses:u           #    0.00% of all branches          (83.37%)

       0.499871144 seconds time elapsed

PATCHED
=======

$ perf stat ./a.out

Performance counter stats for './a.out':

        399.023499      task-clock:u (msec)       #    0.998 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               134      page-faults:u             #    0.336 K/sec
     1,245,793,991      cycles:u                  #    3.122 GHz                      (83.39%)
        11,529,273      stalled-cycles-frontend:u #    0.93% frontend cycles idle     (83.46%)
        12,116,311      stalled-cycles-backend:u  #    0.97% backend cycles idle      (66.92%)
     3,679,710,005      instructions:u            #    2.95  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.47%)
     1,009,792,625      branches:u                # 2530.660 M/sec                    (83.46%)
             2,590      branch-misses:u           #    0.00% of all branches          (83.12%)

       0.399733539 seconds time elapsed

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/extable.c: use bsearch() library function in search_extable()

[[email protected]: v3: fix arch specific implementations]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Meyer <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/rhashtable.c: use kvzalloc() in bucket_table_alloc() when possible

bucket_table_alloc() can be currently called with GFP_KERNEL or
GFP_ATOMIC. For the former we basically have an open coded kvzalloc()
while the later only uses kzalloc(). Let's simplify the code a bit by
the dropping the open coded path and replace it with kvzalloc().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Thomas Graf <[email protected]>
Cc: Herbert Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/interval_tree_test.c: allow full tree search

... such that a user can specify visiting all the nodes in the tree
(intersects with the world). This is a nice opposite from the very
basic default query which is a single point.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/interval_tree_test.c: allow users to limit scope of endpoint

Add a 'max_endpoint' parameter such that users may easily limit the size
of the intervals that are randomly generated.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/interval_tree_test.c: make test options module parameters

Allows for more flexible debugging.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/interval_tree_test.c: allow the module to be compiled-in

Patch series "lib/interval_tree_test: some debugging improvements".

Here are some patches that update the interval_tree_test module allowing
users to pass finer grained options to run the actual test.

This patch (of 4):

It is a tristate after all, and also serves well for quick debugging.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/kstrtox.c: use "unsigned int" more

gcc does generates stupid code sign extending data back and forth.  Help
by using "unsigned int".

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-61 (-61)
function                                     old     new   delta
_parse_integer                               128     123      -5

It _still_ does generate useless MOVSX but I don't know how to delete it:

0000000000000070 <_parse_integer>:
...
  a0:   89 c2                   mov    edx,eax
  a2:   83 e8 30                sub    eax,0x30
  a5:   83 f8 09                cmp    eax,0x9
  a8:   76 11                   jbe    bb <_parse_integer+0x4b>
  aa:   83 ca 20                or     edx,0x20
  ad:   0f be c2      ===>      movsx  eax,dl         <===
useless
  b0:   8d 50 9f                lea    edx,[rax-0x61]
  b3:   83 fa 05                cmp    edx,0x5

Patch also helps on embedded archs which generally only like "int".  On
arm "and 0xff" is generated which is waste because all values used in
comparisons are positive.

Link: http://lkml.kernel.org/r/20170514194720.GB32563@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/kstrtox.c: delete end-of-string test

Standard "while (*s)" test is unnecessary because NUL won't pass
valid-digit test anyway. Save one branch per parsed character.

Link: http://lkml.kernel.org/r/20170514193756.GA32563@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitmap: use memcmp optimisation in more situations

Commit 7dd968163f7c ("bitmap: bitmap_equal memcmp optimization") was
rather more restrictive than necessary; we can use memcmp() to implement
bitmap_equal() as long as the number of bits can be proved to be a
multiple of 8. And architectures other than s390 may be able to make
good use of this optimisation.

[[email protected]: fix build: add a memcmp() declaration]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/bitmap.h: turn bitmap_set and bitmap_clear into memset when possible

Several callers have constant 'start' and an 'nbits' that is a multiple
of 8, so we can turn them into calls to memset. We don't need the
entirety of 'start' and 'nbits' to be constant, we just need to know
whether they're divisible by 8.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Rasmus Villemoes <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bitmap: optimise bitmap_set and bitmap_clear of a single bit

We have eight users calling bitmap_clear for a single bit and seventeen
calling bitmap_set for a single bit. Rather than fix all of them to
call __clear_bit or __set_bit, turn bitmap_clear and bitmap_set into
inline functions and make this special case efficient.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Rasmus Villemoes <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/test_bitmap.c: add optimisation tests

Patch series "Bitmap optimisations", v2.

These three bitmap patches use more efficient specialisations when the
compiler can figure out that it's safe to do so. Thanks to Rasmus's
eagle eyes, a nasty bug in v1 was avoided, and I've added a test case
which would have caught it.

This patch (of 4):

This version of the test is actually a no-op; the next patch will enable
it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: give proc sysctl some maintainer love

We poke at proc sysctl enough that really we should declare it
maintained. We'll just be Cc'd and sending updates / ACK'ing changes
through akpm's tree.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis R. Rodriguez <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/kallsyms.c: replace all_var with IS_ENABLED(CONFIG_KALLSYMS_ALL)

'all_var' looks like a variable, but is actually a macro. Use
IS_ENABLED(CONFIG_KALLSYMS_ALL) for clarification.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/groups.c: use sort library function

setgroups is not exactly a hot path, so we might as well use the library
function instead of open-coding the sorting. Saves ~150 bytes.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/ksysfs.c: constify attribute_group structures.

attribute_groups are not supposed to change at runtime.  All functions
working with attribute_groups provided by <linux/sysfs.h> work with
const attribute_group.  So mark the non-const structs as const.

File size before:
   text    data     bss     dec     hex filename
   1120     544      16    1680     690 kernel/ksysfs.o

File size After adding 'const':
   text    data     bss     dec     hex filename
   1160     480      16    1656     678 kernel/ksysfs.o

Link: http://lkml.kernel.org/r/aa224b3cc923fdbb3edd0c41b2c639c85408c9e8.1498737347.git.arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Russell King <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Petr Tesarik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ARM: fix rd_size declaration

The global variable 'rd_size' is declared as 'int' in source file
arch/arm/kernel/atags_parse.c and as 'unsigned long' in
drivers/block/brd.c. Fix this inconsistency.

Additionally, remove the declarations of rd_image_start, rd_prompt and
rd_doload from parse_tag_ramdisk() since these duplicate existing
declarations in <linux/initrd.h>.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Bart Van Assche <[email protected]>
Acked-by: Russell King <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Yan <[email protected]>
Cc: Zhaohongjiang <[email protected]>
Cc: Miao Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: split BUILD_BUG stuff out into <linux/build_bug.h>

Including <linux/bug.h> pulls in a lot of bloat from <asm/bug.h> and
<asm-generic/bug.h> that is not needed to call the BUILD_BUG() family of
macros. Split them out into their own header, <linux/build_bug.h>.

Also correct some checkpatch.pl errors for the BUILD_BUG_ON_ZERO() and
BUILD_BUG_ON_NULL() macros by adding parentheses around the bitfield
widths that begin with a minus sign.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/bug.h: correct "space required before that '-'"

Correct these checkpatch.pl errors:

|ERROR: space required before that '-' (ctx:OxO)
|#37: FILE: include/linux/bug.h:37:
|+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

|ERROR: space required before that '-' (ctx:OxO)
|#38: FILE: include/linux/bug.h:38:
|+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); }))

I decided to wrap the bitfield expressions that begin with minus signs
in parentheses rather than insert spaces before the minus signs.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/bug.h: correct "(foo*)" should be "(foo *)"

Correct this checkpatch.pl error:

|ERROR: "(foo*)" should be "(foo *)"
|#19: FILE: include/linux/bug.h:19:
|+#define BUILD_BUG_ON_NULL(e) ((void*)0)

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/bug.h: correct formatting of block comment

Correct these checkpatch.pl warnings:

|WARNING: Block comments use * on subsequent lines
|#34: FILE: include/linux/bug.h:34:
|+/* Force a compilation error if condition is true, but also produce a
|+ result (of value 0 and type size_t), so the expression can be used

|WARNING: Block comments use a trailing */ on a separate line
|#36: FILE: include/linux/bug.h:36:
|+ aren't permitted). */

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm-generic/bug.h: declare struct pt_regs; before function prototype

This series of patches splits BUILD_BUG related macros out of
"include/linux/bug.h" into new file "include/linux/build_bug.h" (patch
5), and changes the pointer type checking in the `container_of()` macro
to deal with pointers of array type better (patch 6).  Patches 1 to 4
are prerequisites.

Patches 2, 3, 4, and 5 have been inserted since the previous version of
this patch series.  Patch 6 here corresponds to v3 and v4's patch 2.

Patch 1 was a prerequisite in v3 of this series to avoid a lot of
warnings when <linux/bug.h> was included by <linux/kernel.h>.  That is
no longer relevant for v5 of the series, but I left it in because it was
acked by a Arnd Bergmann and Michal Nazarewicz.

Patches 2, 3, and 4 are some checkpatch clean-ups on
"include/linux/bug.h" before splitting out the BUILD_BUG stuff in patch
5.

Patch 5 splits the BUILD_BUG related macros out of "include/linux/bug.h"
into new file "include/linux/build_bug.h" because including
<linux/bug.h> in "include/linux/kernel.h" would result in build failures
due to circular dependencies.

Patch 6 changes the pointer type checking by `container_of()` to avoid
some incompatible pointer warnings when the dereferenced pointer has
array type.

1) asm-generic/bug.h: declare struct pt_regs; before function prototype
2) linux/bug.h: correct formatting of block comment
3) linux/bug.h: correct "(foo*)" should be "(foo *)"
4) linux/bug.h: correct "space required before that '-'"
5) bug: split BUILD_BUG stuff out into <linux/build_bug.h>
6) kernel.h: handle pointers to arrays better in container_of()

This patch (of 6):

The declaration of `__warn()` has `struct pt_regs *regs` as one of its
parameters.  This can result in compiler warnings if `struct regs` is not
already declared.  Add an empty declaration of `struct pt_regs` to avoid
the warnings.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/proc/generic.c: switch to ida_simple_get/remove

The code can be much simplified by switching to ida_simple_get/remove.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Heiner Kallweit <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: cmpxchg: implement cmpxchg64()

FRV supports 64-bit cmpxchg, which is provided by the arch code as
__cmpxchg_64 and subsequently used to implement atomic64_cmpxchg.

This patch hooks up the generic cmpxchg64 API using the same function,
which also provides default definitions of the relaxed, acquire and
release variants. This fixes the build when COMPILE_TEST=y and
IOMMU_IO_PGTABLE_LPAE=y.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: use generic fb.h

The arch uses a verbatim copy of the asm-generic version and does not
add any own implementations to the header, so use asm-generic/fb.h
instead of duplicating code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tobias Klauser <[email protected]>
Reviewed-by: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

frv: remove wrapper header for asm/device.h

frv's asm/device.h is merely including asm-generic/device.h. Thus, the
arch specific header can be omitted and the generic header can be used
directly.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tobias Klauser <[email protected]>
Reviewed-by: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: make get_wild_bug_type() static

The helper function get_wild_bug_type() does not need to be in global
scope, so make it static.

Cleans up sparse warning:

"symbol 'get_wild_bug_type' was not declared. Should it be static?"

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kasan/kasan.c: rename XXX_is_zero to XXX_is_nonzero

They return positive value, that is, true, if non-zero value is found.
Rename them to reduce confusion.

Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop
Signed-off-by: Joonsoo Kim <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kasan: add support for memory hotplug

KASAN doesn't happen work with memory hotplug because hotplugged memory
doesn't have any shadow memory. So any access to hotplugged memory
would cause a crash on shadow check.

Use memory hotplug notifier to allocate and map shadow memory when the
hotplugged memory is going online and free shadow after the memory
offlined.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64/kasan: don't allocate extra shadow memory

We used to read several bytes of the shadow memory in advance.
Therefore additional shadow memory mapped to prevent crash if
speculative load would happen near the end of the mapped shadow memory.

Now we don't have such speculative loads, so we no longer need to map
additional shadow memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/kasan: don't allocate extra shadow memory

We used to read several bytes of the shadow memory in advance.
Therefore additional shadow memory mapped to prevent crash if
speculative load would happen near the end of the mapped shadow memory.

Now we don't have such speculative loads, so we no longer need to map
additional shadow memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kasan: get rid of speculative shadow checks

For some unaligned memory accesses we have to check additional byte of
the shadow memory.  Currently we load that byte speculatively to have
only single load + branch on the optimistic fast path.

However, this approach has some downsides:

- It's unaligned access, so this prevents porting KASAN on
   architectures which doesn't support unaligned accesses.

- We have to map additional shadow page to prevent crash if speculative
   load happens near the end of the mapped memory. This would
   significantly complicate upcoming memory hotplug support.

I wasn't able to notice any performance degradation with this patch.  So
these speculative loads is just a pain with no gain, let's remove them.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/kasan/kasan_init.c: use kasan_zero_pud for p4d table

There is missing optimization in zero_p4d_populate() that can save some
memory when mapping zero shadow. Implement it like as others.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/zsmalloc: simplify zs_max_alloc_size handling

Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
zsmalloc to allocate an object of the maximal size (ZS_MAX_ALLOC_SIZE).
I think however the fix is unneededly complicated.

This patch replaces the dynamic calculation of zs_size_classes at init
time by a compile time calculation that uses the DIV_ROUND_UP() macro
already used in get_size_class_index().

[[email protected]: use min_t]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jerome Marchand <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Mahendran Ganesh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zram: constify attribute_group structures.

attribute_groups are not supposed to change at runtime.  All functions
working with attribute_groups provided by <linux/sysfs.h> work with
const attribute_group.  So mark the non-const structs as const.

File size before:
   text    data     bss     dec     hex filename
   8293     841       4    9138    23b2 drivers/block/zram/zram_drv.o

File size After adding 'const':
   text    data     bss     dec     hex filename
   8357     777       4    9138    23b2 drivers/block/zram/zram_drv.o

Link: http://lkml.kernel.org/r/65680c1c4d85818f7094cbfa31c91bf28185ba1b.1499061182.git.arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: disallow early_pfn_to_nid on configurations which do not implement it

early_pfn_to_nid will return node 0 if both HAVE_ARCH_EARLY_PFN_TO_NID
and HAVE_MEMBLOCK_NODE_MAP are disabled.  It seems we are safe now
because all architectures which support NUMA define one of them (with an
exception of alpha which however has CONFIG_NUMA marked as broken) so
this works as expected.  It can get silently and subtly broken too
easily, though.  Make sure we fail the compilation if NUMA is enabled
and there is no proper implementation for this function.  If that ever
happens we know that either the specific configuration is invalid and
the fix should either disable NUMA or enable one of the above configs.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory-hotplug: switch locking to a percpu rwsem

Andrey reported a potential deadlock with the memory hotplug lock and
the cpu hotplug lock.

The reason is that memory hotplug takes the memory hotplug lock and then
calls stop_machine() which calls get_online_cpus().  That's the reverse
lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c

The problem has been there forever.  The reason why this was never
reported is that the cpu hotplug locking had this homebrewn recursive
reader writer semaphore construct which due to the recursion evaded the
full lock dep coverage.  The memory hotplug code copied that construct
verbatim and therefor has similar issues.

Three steps to fix this:

1) Convert the memory hotplug locking to a per cpu rwsem so the
   potential issues get reported proper by lockdep.

2) Lock the online cpus in mem_hotplug_begin() before taking the memory
   hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
   code to avoid recursive locking.

3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu
   hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
   by invoking lru_add_drain_all_cpuslocked() instead.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: swap: provide lru_add_drain_all_cpuslocked()

The rework of the cpu hotplug locking unearthed potential deadlocks with
the memory hotplug locking code.

The solution for these is to rework the memory hotplug locking code as
well and take the cpu hotplug lock before the memory hotplug lock in
mem_hotplug_begin(), but this will cause a recursive locking of the cpu
hotplug lock when the memory hotplug code calls lru_add_drain_all().

Split out the inner workings of lru_add_drain_all() into
lru_add_drain_all_cpuslocked() so this function can be invoked from the
memory hotplug code with the cpu hotplug lock held.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reported-by: Andrey Ryabinin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use dedicated helper to access rlimit value

Use rlimit() helper instead of manually writing whole chain from current
task to rlim_cur.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Krzysztof Opasiak <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/dcache.c: fix spin lockup issue on nlru->lock

__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list.  As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time.  So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:

  spin_bug+0x90
  do_raw_spin_lock+0xfc
  _raw_spin_lock+0x28
  list_lru_add+0x28
  dput+0x1c8
  path_put+0x20
  terminate_walk+0x3c
  path_lookupat+0x100
  filename_lookup+0x6c
  user_path_at_empty+0x54
  SyS_faccessat+0xd0
  el0_svc_naked+0x24

This nlru->lock is acquired by another CPU in this path -

  d_lru_shrink_move+0x34
  dentry_lru_isolate_shrink+0x48
  __list_lru_walk_one.isra.10+0x94
  list_lru_walk_node+0x40
  shrink_dcache_sb+0x60
  do_remount_sb+0xbc
  do_emergency_remount+0xb0
  process_one_work+0x228
  worker_thread+0x2e0
  kthread+0xf4
  ret_from_fork+0x10

Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once.  Also, add cond_resched() before
processing the lru list again.

Link: http://marc.info/?t=149722864900001&r=1&w=2
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sahitya Tummala <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Suggested-by: Vladimir Davydov <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Alexander Polakov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/list_lru.c: fix list_lru_count_node() to be race free

list_lru_count_node() iterates over all memcgs to get the total number of
entries on the node but it can race with memcg_drain_all_list_lrus(),
which migrates the entries from a dead cgroup to another. This can return
incorrect number of entries from list_lru_count_node().

Fix this by keeping track of entries per node and simply return it in
list_lru_count_node().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sahitya Tummala <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Alexander Polakov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: expand_downwards: don't require the gap if !vm_prev

expand_stack(vma) fails if address < stack_guard_gap even if there is no
vma->vm_prev. I don't think this makes sense, and we didn't do this
before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
between vmas").

We do not need a gap in this case, any address is fine as long as
security_mmap_addr() doesn't object.

This also simplifies the code, we know that address >= prev->vm_end and
thus underflow is not possible.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Larry Woodman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack

Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page.  They are punching a new
MAP_FIXED mapping inside the existing stack Vma.

This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety.  It is a real regression on
the other hand.

Let's work around the problem by considering PROT_NONE mapping as a part
of the stack.  This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.

Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Debugged-by: Vlastimil Babka <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/balloon_compaction.c: enqueue zero page to balloon device

presently pages in the balloon device have random value, and these pages
will be scanned by ksmd on the host. They usually cannot be merged.
Enqueue zero pages will resolve this problem.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhenwei.pi <[email protected]>
Cc: Gioh Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Rafael Aquini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cma: fix calculation of aligned offset

The align_offset parameter is used by bitmap_find_next_zero_area_off()
to represent the offset of map's base from the previous alignment
boundary; the function ensures that the returned index, plus the
align_offset, honors the specified align_mask.

The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical
address, not CMA region position") has the cma driver calculate the
offset to the *next* alignment boundary.  In most cases, the base
alignment is greater than that specified when making allocations,
resulting in a zero offset whether we align up or down.  In the example
given with the commit, the base alignment (8MB) was half the requested
alignment (16MB) so the math also happened to work since the offset is
8MB in both directions.  However, when requesting allocations with an
alignment greater than twice that of the base, the returned index would
not be correctly aligned.

Also, the align_order arguments of cma_bitmap_aligned_mask() and
cma_bitmap_aligned_offset() should not be negative so the argument type
was made unsigned.

Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Angus Clark <[email protected]>
Signed-off-by: Doug Berger <[email protected]>
Acked-by: Gregory Fong <[email protected]>
Cc: Doug Berger <[email protected]>
Cc: Angus Clark <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Shiraz Hashim <[email protected]>
Cc: Jaewon Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()

__remove_zone() sets up up zone_type, but never uses it for anything.
This does not cause a warning, due to the (necessary) use of
-Wno-unused-but-set-variable. However, it's noise, so just delete it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: document highmem_is_dirtyable sysctl

It seems that there are still people using 32b kernels which a lot of
memory and the IO tend to suck a lot for them by default.  Mostly
because writers are throttled too when the lowmem is used.  We have
highmem_is_dirtyable to work around that issue but it seems we never
bothered to document it.  Let's do it now, finally.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Alkis Georgopoulos <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/backing-dev.h: simplify wb_stat_sum

wb_stat_sum() disables interrupts and calls __wb_stat_sum() which
eventually calls __percpu_counter_sum(). However, the percpu routine is
already irq-safe. Simplify the code a bit by making wb_stat_sum()
directly call percpu_counter_sum_positive() and not disable interrupts.

Also remove the now-uneeded __wb_stat_sum() which was just a wrapper
over percpu_counter_sum_positive().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nikolay Borisov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/mmzone.h: remove ancient/ambiguous comment

Currently pg_data_t is just a struct which describes a NUMA node memory
layout. Let's keep the comment simple and remove ambiguity.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nikolay Borisov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swap_slots.c: don't disable preemption while taking the per-CPU cache

get_cpu_var() disables preemption and returns the per-CPU version of the
variable.  Disabling preemption is useful to ensure atomic access to the
variable within the critical section.

In this case however, after the per-CPU version of the variable is
obtained the ->free_lock is acquired.  For that reason it seems the raw
accessor could be used.  It only seems that ->slots_ret should be
retested (because with disabled preemption this variable can not be set
to NULL otherwise).

This popped up during PREEMPT-RT testing because it tries to take
spinlocks in a preempt disabled section.  In RT, spinlocks can sleep.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback

Since current_order starts as MAX_ORDER-1 and is then only decremented,
the second half of the loop condition seems superfluous. However, if
order is 0, we may decrement current_order past 0, making it UINT_MAX.
This is obviously too subtle ([1], [2]).

Since we need to add some comment anyway, change the two variables to
signed, making the counting-down for loop look more familiar, and
apparently also making gcc generate slightly smaller code.

[1] https://lkml.org/lkml/2016/6/20/493
[2] https://lkml.org/lkml/2017/6/19/345

[[email protected]: fix up reject fixupping]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rasmus Villemoes <[email protected]>
Reported-by: Hao Lee <[email protected]>
Acked-by: Wei Yang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/proc/task_mmu.c: remove obsolete comment in show_map_vma()

After commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
we do not hide stack guard page in /proc/<pid>/maps

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: drop useless local parameters of __register_one_node()

__register_one_node() initializes local parameters "p_node" & "parent"
for register_node().

But, register_node() does not use them.

Remove the related code of "parent" node, cleanup __register_one_node()
and register_node().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: avoid taking zone lock in pagetypeinfo_showmixed()

pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (On a
2.4GHz,8Gb RAM,arm64 cpu).

Avoid taking the zone lock similar to what is done by read_page_owner,
which means possibility of inaccurate results.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vinayak Menon <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: zhongjiang <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Sudip Mukherjee <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration

new_page is yet another duplication of the migration callback which has
to handle hugetlb migration specially. We can safely use the generic
new_page_nodemask for the same purpose.

Please note that gigantic hugetlb pages do not need any special handling
because alloc_huge_page_nodemask will make sure to check pages in all
per node pools. The reason this was done previously was that
alloc_huge_page_node treated NO_NUMA_NODE and a specific node
differently and so alloc_huge_page_node(nid) would check on this
specific node.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: add support for preferred node to alloc_huge_page_nodemask

alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask starting from lower numa nodes.  This might lead to
filling up those low NUMA nodes while others are not used.  We can
reduce this risk by introducing a concept of the preferred node similar
to what we have in the regular page allocator.  We will start allocating
from the preferred nid and then iterate over all allowed nodes in the
zonelist order until we try them all.

This is mimicing the page allocator logic except it operates on per-node
mempools.  dequeue_huge_page_vma already does this so distill the
zonelist logic into a more generic dequeue_huge_page_nodemask and use it
in alloc_huge_page_nodemask.

This will allow us to use proper per numa distance fallback also for
alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
can get rid of alloc_huge_page_node helper which doesn't have any user
anymore.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, hugetlb: unclutter hugetlb allocation layers

Patch series "mm, hugetlb: allow proper node fallback dequeue".

While working on a hugetlb migration issue addressed in a separate
patchset[1] I have noticed that the hugetlb allocations from the
preallocated pool are quite subotimal.

[1] //lkml.kernel.org/r/20170608074553 [email protected]

There is no fallback mechanism implemented and no notion of preferred
node.  I have tried to work around it but Vlastimil was right to push
back for a more robust solution.  It seems that such a solution is to
reuse zonelist approach we use for the page alloctor.

This series has 3 patches.  The first one tries to make hugetlb
allocation layers more clear.  The second one implements the zonelist
hugetlb pool allocation and introduces a preferred node semantic which
is used by the migration callbacks.  The last patch is a clean up.

This patch (of 3):

Hugetlb allocation path for fresh huge pages is unnecessarily complex
and it mixes different interfaces between layers.

__alloc_buddy_huge_page is the central place to perform a new
allocation.  It checks for the hugetlb overcommit and then relies on
__hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
all good except that __alloc_buddy_huge_page pushes vma and address down
the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
two different allocation modes - one for memory policy and other node
specific (or to make it more obscure node non-specific) requests.

This just screams for a reorganization.

This patch pulls out all the vma specific handling up to
__alloc_buddy_huge_page_with_mpol where it belongs.
__alloc_buddy_huge_page will get nodemask argument and
__hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
page allocator.

In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
  __alloc_buddy_huge_page - overcommit handling and accounting
    __hugetlb_alloc_buddy_huge_page - page allocator layer

Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
is not really needed because the page allocator already handles the
cpusets update.

Finally __hugetlb_alloc_buddy_huge_page had a special case for node
specific allocations (when no policy is applied and there is a node
given).  This has relied on __GFP_THISNODE to not fallback to a different
node.  alloc_huge_page_node is the only caller which relies on this
behavior so move the __GFP_THISNODE there.

Not only does this remove quite some code it also should make those
layers easier to follow and clear wrt responsibilities.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/oom_kill.c: add tracepoints for oom reaper-related events

During the debugging of the problem described in
https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
output is not really useful to understand issues related to the oom
reaper.

So, I assume, that adding some tracepoints might help with debugging of
similar issues.

Trace the following events:
1) a process is marked as an oom victim,
2) a process is added to the oom reaper list,
3) the oom reaper starts reaping process's mm,
4) the oom reaper finished reaping,
5) the oom reaper skips reaping.

How it works in practice? Below is an example which show how the problem
mentioned above can be found: one process is added twice to the
oom_reaper list:

  $ cd /sys/kernel/debug/tracing
  $ echo "oom:mark_victim" > set_event
  $ echo "oom:wake_reaper" >> set_event
  $ echo "oom:skip_task_reaping" >> set_event
  $ echo "oom:start_task_reaping" >> set_event
  $ echo "oom:finish_task_reaping" >> set_event
  $ cat trace_pipe
          allocate-502   [001] ....    91.836405: mark_victim: pid=502
          allocate-502   [001] .N..    91.837356: wake_reaper: pid=502
          allocate-502   [000] .N..    91.871149: wake_reaper: pid=502
        oom_reaper-23    [000] ....    91.871177: start_task_reaping: pid=502
        oom_reaper-23    [000] .N..    91.879511: finish_task_reaping: pid=502
        oom_reaper-23    [000] ....    91.879580: skip_task_reaping: pid=502

Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>