Git Repo - linux.git/log

Merge tag '5.13-rc-smb3-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
"Ten CIFS/SMB3 changes - including two marked for stable - including
  some important multichannel fixes, as well as support for handle
  leases (deferred close) and shutdown support:

   - some important multichannel fixes

   - support for handle leases (deferred close)

   - shutdown support (which is also helpful since it enables multiple
     xfstests)

   - enable negotiating stronger encryption by default (GCM256)

   - improve wireshark debugging by allowing more options for root to
     dump decryption keys

  SambaXP and the SMB3 Plugfest test event are going on now so I am
  expecting more patches over the next few days due to extra testing
  (including more multichannel fixes)"

* tag '5.13-rc-smb3-part2' of git://git.samba.org/sfrench/cifs-2.6:
  fs/cifs: Fix resource leak
  Cifs: Fix kernel oops caused by deferred close for files.
  cifs: fix regression when mounting shares with prefix paths
  cifs: use echo_interval even when connection not ready.
  cifs: detect dead connections only when echoes are enabled.
  smb3.1.1: allow dumping keys for multiuser mounts
  smb3.1.1: allow dumping GCM256 keys to improve debugging of encrypted shares
  cifs: add shutdown support
  cifs: Deferred close for files
  smb3.1.1: enable negotiating stronger encryption by default

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
"A bunch of new drivers including vdpa support for block and
  virtio-vdpa.

  Beginning of vq kick (aka doorbell) mapping support.

  Misc fixes"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (40 commits)
  virtio_pci_modern: correct sparse tags for notify
  virtio_pci_modern: __force cast the notify mapping
  vDPA/ifcvf: get_config_size should return dev specific config size
  vDPA/ifcvf: enable Intel C5000X-PL virtio-block for vDPA
  vDPA/ifcvf: deduce VIRTIO device ID when probe
  vdpa_sim_blk: add support for vdpa management tool
  vdpa_sim_blk: handle VIRTIO_BLK_T_GET_ID
  vdpa_sim_blk: implement ramdisk behaviour
  vdpa: add vdpa simulator for block device
  vhost/vdpa: Remove the restriction that only supports virtio-net devices
  vhost/vdpa: use get_config_size callback in vhost_vdpa_config_validate()
  vdpa: add get_config_size callback in vdpa_config_ops
  vdpa_sim: cleanup kiovs in vdpasim_free()
  vringh: add vringh_kiov_length() helper
  vringh: implement vringh_kiov_advance()
  vringh: explain more about cleaning riov and wiov
  vringh: reset kiov 'consumed' field in __vringh_iov()
  vringh: add 'iotlb_lock' to synchronize iotlb accesses
  vdpa_sim: use iova module to allocate IOVA addresses
  vDPA/ifcvf: deduce VIRTIO device ID from pdev ids
  ...

netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check

Do not assume that the tcph->doff field is correct when parsing for TCP
options, skb_header_pointer() might fail to fetch these bits.

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Signed-off-by: Pablo Neira Ayuso <[email protected]>

Merge tag 'pci-v5.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci updates from Bjorn Helgaas:
"Enumeration:
   - Release OF node when pci_scan_device() fails (Dmitry Baryshkov)
   - Add pci_disable_parity() (Bjorn Helgaas)
   - Disable Mellanox Tavor parity reporting (Heiner Kallweit)
   - Disable N2100 r8169 parity reporting (Heiner Kallweit)
   - Fix RCiEP device to RCEC association (Qiuxu Zhuo)
   - Convert sysfs "config", "rom", "reset", "label", "index",
     "acpi_index" to static attributes to help fix races in device
     enumeration (Krzysztof Wilczyński)
   - Convert sysfs "vpd" to static attribute (Heiner Kallweit, Krzysztof
     Wilczyński)
   - Use sysfs_emit() in "show" functions (Krzysztof Wilczyński)
   - Remove unused alloc_pci_root_info() return value (Krzysztof
     Wilczyński)

  PCI device hotplug:
   - Fix acpiphp reference count leak (Feilong Lin)

  Power management:
   - Fix acpi_pci_set_power_state() debug message (Rafael J. Wysocki)
   - Fix runtime PM imbalance (Dinghao Liu)

  Virtualization:
   - Increase delay after FLR to work around Intel DC P4510 NVMe erratum
     (Raphael Norwitz)

  MSI:
   - Convert rcar, tegra, xilinx to MSI domains (Marc Zyngier)
   - For rcar, xilinx, use controller address as MSI doorbell (Marc
     Zyngier)
   - Remove unused hv msi_controller struct (Marc Zyngier)
   - Remove unused PCI core msi_controller support (Marc Zyngier)
   - Remove struct msi_controller altogether (Marc Zyngier)
   - Remove unused default_teardown_msi_irqs() (Marc Zyngier)
   - Let host bridges declare their reliance on MSI domains (Marc
     Zyngier)
   - Make pci_host_common_probe() declare its reliance on MSI domains
     (Marc Zyngier)
   - Advertise mediatek lack of built-in MSI handling (Thomas Gleixner)
   - Document ways of ending up with NO_MSI (Marc Zyngier)
   - Refactor HT advertising of NO_MSI flag (Marc Zyngier)

  VPD:
   - Remove obsolete Broadcom NIC VPD length-limiting quirk (Heiner
     Kallweit)
   - Remove sysfs VPD size checking dead code (Heiner Kallweit)
   - Convert VPF sysfs file to static attribute (Heiner Kallweit)
   - Remove unnecessary pci_set_vpd_size() (Heiner Kallweit)
   - Tone down "missing VPD" message (Heiner Kallweit)

  Endpoint framework:
   - Fix NULL pointer dereference when epc_features not implemented
     (Shradha Todi)
   - Add missing destroy_workqueue() in endpoint test (Yang Yingliang)

  Amazon Annapurna Labs PCIe controller driver:
   - Fix compile testing without CONFIG_PCI_ECAM (Arnd Bergmann)
   - Fix "no symbols" warnings when compile testing with
     CONFIG_TRIM_UNUSED_KSYMS (Arnd Bergmann)

  APM X-Gene PCIe controller driver:
   - Fix cfg resource mapping regression (Dejin Zheng)

  Broadcom iProc PCIe controller driver:
   - Return zero for success of iproc_msi_irq_domain_alloc() (Pali
     Rohár)

  Broadcom STB PCIe controller driver:
   - Add reset_control_rearm() stub for !CONFIG_RESET_CONTROLLER (Jim
     Quinlan)
   - Fix use of BCM7216 reset controller (Jim Quinlan)
   - Use reset/rearm for Broadcom STB pulse reset instead of
     deassert/assert (Jim Quinlan)
   - Fix brcm_pcie_probe() error return for unsupported revision (Wei
     Yongjun)

  Cavium ThunderX PCIe controller driver:
   - Fix compile testing (Arnd Bergmann)
   - Fix "no symbols" warnings when compile testing with
     CONFIG_TRIM_UNUSED_KSYMS (Arnd Bergmann)

  Freescale Layerscape PCIe controller driver:
   - Fix ls_pcie_ep_probe() syntax error (comma for semicolon)
     (Krzysztof Wilczyński)
   - Remove layerscape-gen4 dependencies on OF and ARM64, add dependency
     on ARCH_LAYERSCAPE (Geert Uytterhoeven)

  HiSilicon HIP PCIe controller driver:
   - Remove obsolete HiSilicon PCIe DT description (Dongdong Liu)

  Intel Gateway PCIe controller driver:
   - Remove unused pcie_app_rd() (Jiapeng Chong)

  Intel VMD host bridge driver:
   - Program IRTE with Requester ID of VMD endpoint, not child device
     (Jon Derrick)
   - Disable VMD MSI-X remapping when possible so children can use more
     MSI-X vectors (Jon Derrick)

  MediaTek PCIe controller driver:
   - Configure FC and FTS for functions other than 0 (Ryder Lee)
   - Add YAML schema for MediaTek (Jianjun Wang)
   - Export pci_pio_to_address() for module use (Jianjun Wang)
   - Add MediaTek MT8192 PCIe controller driver (Jianjun Wang)
   - Add MediaTek MT8192 INTx support (Jianjun Wang)
   - Add MediaTek MT8192 MSI support (Jianjun Wang)
   - Add MediaTek MT8192 system power management support (Jianjun Wang)
   - Add missing MODULE_DEVICE_TABLE (Qiheng Lin)

  Microchip PolarFlare PCIe controller driver:
   - Make several symbols static (Wei Yongjun)

  NVIDIA Tegra PCIe controller driver:
   - Add MCFG quirks for Tegra194 ECAM errata (Vidya Sagar)
   - Make several symbols const (Rikard Falkeborn)
   - Fix Kconfig host/endpoint typo (Wesley Sheng)

  SiFive FU740 PCIe controller driver:
   - Add pcie_aux clock to prci driver (Greentime Hu)
   - Use reset-simple in prci driver for PCIe (Greentime Hu)
   - Add SiFive FU740 PCIe host controller driver and DT binding (Paul
     Walmsley, Greentime Hu)

  Synopsys DesignWare PCIe controller driver:
   - Move MSI Receiver init to dw_pcie_host_init() so it is
     re-initialized along with the RC in resume (Jisheng Zhang)
   - Move iATU detection earlier to fix regression (Hou Zhiqiang)

  TI J721E PCIe driver:
   - Add DT binding and TI j721e support for refclk to PCIe connector
     (Kishon Vijay Abraham I)
   - Add host mode and endpoint mode DT bindings for TI AM64 SoC (Kishon
     Vijay Abraham I)

  TI Keystone PCIe controller driver:
   - Use generic config accessors for TI AM65x (K3) to fix regression
     (Kishon Vijay Abraham I)

  Xilinx NWL PCIe controller driver:
   - Add support for coherent PCIe DMA traffic using CCI (Bharat Kumar
     Gogada)
   - Add optional "dma-coherent" DT property (Bharat Kumar Gogada)

  Miscellaneous:
   - Fix kernel-doc warnings (Krzysztof Wilczyński)
   - Remove unused MicroGate SyncLink device IDs (Jiri Slaby)
   - Remove redundant dev_err() for devm_ioremap_resource() failure
     (Chen Hui)
   - Remove redundant initialization (Colin Ian King)
   - Drop redundant dev_err() for platform_get_irq() errors (Krzysztof
     Wilczyński)"

* tag 'pci-v5.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (98 commits)
  riscv: dts: Add PCIe support for the SiFive FU740-C000 SoC
  PCI: fu740: Add SiFive FU740 PCIe host controller driver
  dt-bindings: PCI: Add SiFive FU740 PCIe host controller
  MAINTAINERS: Add maintainers for SiFive FU740 PCIe driver
  clk: sifive: Use reset-simple in prci driver for PCIe driver
  clk: sifive: Add pcie_aux clock in prci driver for PCIe driver
  PCI: brcmstb: Use reset/rearm instead of deassert/assert
  ata: ahci_brcm: Fix use of BCM7216 reset controller
  reset: add missing empty function reset_control_rearm()
  PCI: Allow VPD access for QLogic ISP2722
  PCI/VPD: Add helper pci_get_func0_dev()
  PCI/VPD: Remove pci_vpd_find_tag() SRDT handling
  PCI/VPD: Remove pci_vpd_find_tag() 'offset' argument
  PCI/VPD: Change pci_vpd_init() return type to void
  PCI/VPD: Make missing VPD message less alarming
  PCI/VPD: Remove pci_set_vpd_size()
  x86/PCI: Remove unused alloc_pci_root_info() return value
  MAINTAINERS: Add Jianjun Wang as MediaTek PCI co-maintainer
  PCI: mediatek-gen3: Add system PM support
  PCI: mediatek-gen3: Add MSI support
  ...

Merge tag 'pwm/for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm updates from Thierry Reding:
"This adds support for the PWM controller found on Toshiba Visconti
  SoCs and converts a couple of drivers to the atomic API.

  There's also a bunch of cleanups and minor fixes across the board"

* tag 'pwm/for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (35 commits)
  pwm: Reword docs about pwm_apply_state()
  pwm: atmel: Improve duty cycle calculation in .apply()
  pwm: atmel: Fix duty cycle calculation in .get_state()
  pwm: visconti: Add Toshiba Visconti SoC PWM support
  dt-bindings: pwm: Add bindings for Toshiba Visconti PWM Controller
  arm64: dts: rockchip: Remove clock-names from PWM nodes
  ARM: dts: rockchip: Remove clock-names from PWM nodes
  dt-bindings: pwm: rockchip: Add more compatible strings
  dt-bindings: pwm: Convert pwm-rockchip.txt to YAML
  pwm: mediatek: Remove unused function
  pwm: pca9685: Improve runtime PM behavior
  pwm: pca9685: Support hardware readout
  pwm: pca9685: Switch to atomic API
  pwm: lpss: Don't modify HW state in .remove callback
  pwm: sti: Free resources only after pwmchip_remove()
  pwm: sti: Don't modify HW state in .remove callback
  pwm: lpc3200: Don't modify HW state in .remove callback
  pwm: lpc18xx-sct: Free resources only after pwmchip_remove()
  pwm: bcm-kona: Don't modify HW state in .remove callback
  pwm: bcm2835: Free resources only after pwmchip_remove()
  ...

smc: disallow TCP_ULP in smc_setsockopt()

syzbot is able to setup kTLS on an SMC socket which coincidentally
uses sk_user_data too. Later, kTLS treats it as psock so triggers a
refcnt warning. The root cause is that smc_setsockopt() simply calls
TCP setsockopt() which includes TCP_ULP. I do not think it makes
sense to setup kTLS on top of SMC sockets, so we should just disallow
this setup.

It is hard to find a commit to blame, but we can apply this patch
since the beginning of TCP_ULP.

Reported-and-tested-by: [email protected]
Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
Cc: John Fastabend <[email protected]>
Signed-off-by: Karsten Graul <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers

Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the
latter has only a single user and is slightly misleading since the only
in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported

Initialize MSR_TSC_AUX with CPU node information if RDTSCP or RDPID is
supported. This fixes a bug where vdso_read_cpunode() will read garbage
via RDPID if RDPID is supported but RDTSCP is not. While no known CPU
supports RDPID but not RDTSCP, both Intel's SDM and AMD's APM allow for
RDPID to exist without RDTSCP, e.g. it's technically a legal CPU model
for a virtual machine.

Note, technically MSR_TSC_AUX could be initialized if and only if RDPID
is supported since RDTSCP is currently not used to retrieve the CPU node.
But, the cost of the superfluous WRMSR is negigible, whereas leaving
MSR_TSC_AUX uninitialized is just asking for future breakage if someone
decides to utilize RDTSCP.

Fixes: a582c540ac1b ("x86/vdso: Use RDPID in preference to LSL when available")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

x86/resctrl: Fix init const confusion

const variable must be initconst, not initdata.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86: Delete UD0, UD1 traces

Both instructions aren't used by kernel.

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/smpboot: Remove duplicate includes

Signed-off-by: Wan Jiabing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

net: fix nla_strcmp to handle more then one trailing null character

Android userspace has been using TCA_KIND with a char[IFNAMESIZ]
many-null-terminated buffer containing the string 'bpf'.

This works on 4.19 and ceases to work on 5.10.

I'm not entirely sure what fixes tag to use, but I think the issue
was likely introduced in the below mentioned 5.4 commit.

Reported-by: Nucca Chen <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: David Ahern <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Cc: Jiri Pirko <[email protected]>
Cc: Jiri Pirko <[email protected]>
Fixes: 62794fc4fbf5 ("net_sched: add max len check for TCA_KIND")
Change-Id: I66dc281f165a2858fc29a44869a270a2d698a82b
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux

Pull thermal updates from Daniel Lezcano:

- Remove duplicate error message for the amlogic driver (Tang Bin)

- Fix spellos in comments for the tegra and sun8i (Bhaskar Chowdhury)

- Add the missing fifth node on the rcar_gen3 sensor (Niklas Söderlund)

- Remove duplicate include in ti-bandgap (Zhang Yunkai)

- Assign error code in the error path in the function
   thermal_of_populate_bind_params() (Jia-Ju Bai)

- Fix spelling mistake in a comment 'disabed' -> 'disabled' (Colin Ian
   King)

- Use the device name instead of auto-numbering for a better
   identification of the cooling device (Daniel Lezcano)

- Improve a bit the division accuracy in the power allocator governor
   (Jeson Gao)

- Enable the missing third sensor on msm8976 (Konrad Dybcio)

- Add QCom tsens driver co-maintainer (Thara Gopinath)

- Fix memory leak and use after free errors in the core code (Daniel
   Lezcano)

- Add the MDM9607 compatible bindings (Konrad Dybcio)

- Fix trivial spello in the copyright name for Hisilicon (Hao Fang)

- Fix negative index array access when converting the frequency to
   power in the energy model (Brian-sy Yang)

- Add support for Gen2 new PMIC support for Qcom SPMI (David Collins)

- Update maintainer file for CPU cooling device section (Lukasz Luba)

- Fix missing put_device on error in the Qcom tsens driver (Guangqing
   Zhu)

- Add compatible DT binding for sm8350 (Robert Foss)

- Add support for the MDM9607's tsens driver (Konrad Dybcio)

- Remove duplicate error messages in thermal_mmio and the bcm2835
   driver (Ruiqi Gong)

- Add the Thermal Temperature Cooling driver (Zhang Rui)

- Remove duplicate error messages in the Hisilicon sensor driver (Ye
   Bin)

- Use the devm_platform_ioremap_resource_byname() function instead of a
   couple of corresponding calls (dingsenjie)

- Sort the headers alphabetically in the ti-bandgap driver (Zhen Lei)

- Add missing property in the DT thermal sensor binding (Rafał Miłecki)

- Remove dead code in the ti-bandgap sensor driver (Lin Ruizhe)

- Convert the BRCM DT bindings to the yaml schema (Rafał Miłecki)

- Replace the thermal_notify_framework() call by a call to the
   thermal_zone_device_update() function. Remove the function as well as
   the corresponding documentation (Thara Gopinath)

- Add support for the ipq8064-tsens sensor along with a set of cleanups
   and code preparation (Ansuel Smith)

- Add a lockless __thermal_cdev_update() function to improve the
   locking scheme in the core code and governors (Lukasz Luba)

- Fix multiple cooling device notification changes (Lukasz Luba)

- Remove unneeded variable initialization (Colin Ian King)

* tag 'thermal-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (55 commits)
  thermal/drivers/mtk_thermal: Remove redundant initializations of several variables
  thermal/core/power allocator: Use the lockless __thermal_cdev_update() function
  thermal/core/fair share: Use the lockless __thermal_cdev_update() function
  thermal/core/fair share: Lock the thermal zone while looping over instances
  thermal/core/power_allocator: Update once cooling devices when temp is low
  thermal/core/power_allocator: Maintain the device statistics from going stale
  thermal/core: Create a helper __thermal_cdev_update() without a lock
  dt-bindings: thermal: tsens: Document ipq8064 bindings
  thermal/drivers/tsens: Add support for ipq8064-tsens
  thermal/drivers/tsens: Drop unused define for msm8960
  thermal/drivers/tsens: Replace custom 8960 apis with generic apis
  thermal/drivers/tsens: Fix bug in sensor enable for msm8960
  thermal/drivers/tsens: Use init_common for msm8960
  thermal/drivers/tsens: Add VER_0 tsens version
  thermal/drivers/tsens: Convert msm8960 to reg_field
  thermal/drivers/tsens: Don't hardcode sensor slope
  Documentation: driver-api: thermal: Remove thermal_notify_framework from documentation
  thermal/core: Remove thermal_notify_framework
  iwlwifi: mvm: tt: Replace thermal_notify_framework
  dt-bindings: thermal: brcm,ns-thermal: Convert to the json-schema
  ...

net:CXGB4: fix leak if sk_buff is not used

An sk_buff is allocated to send a flow control message, but it's not
sent in all cases: in case the state is not appropiate to send it or if
it can't be enqueued.

In the first of these 2 cases, the sk_buff was discarded but not freed,
producing a memory leak.

Signed-off-by: Íñigo Huguet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Fix spelling error from "eleminate" to "eliminate"

Spelling error "eleminate" amended to "eliminate".

Signed-off-by: Sean Gloumeau <[email protected]>
Reviewed-by: Kieran Bingham <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: fix missing NLM_F_MULTI flag when dumping

When dumping the ethtool information from all the interfaces, the
netlink reply should contain the NLM_F_MULTI flag. This flag allows
userspace tools to identify that multiple messages are expected.

Link: https://bugzilla.redhat.com/1953847
Fixes: 365f9ae4ee36 ("ethtool: fix genlmsg_put() failure handling in ethnl_default_dumpit()")
Signed-off-by: Fernando Fernandez Mancera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'gpio-updates-for-v5.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio updates from Bartosz Golaszewski:

- new driver for the Realtek Otto GPIO controller

- ACPI support for gpio-mpc8xxx

- edge event support for gpio-sch (+ Kconfig fixes)

- Kconfig improvements in gpio-ich

- fixes to older issues in gpio-mockup

- ACPI quirk for ignoring EC wakeups on Dell Venue 10 Pro 5055

- improve the GPIO aggregator code by using more generic interfaces
   instead of reimplementing them in the driver

- convert the DT bindings for gpio-74x164 to yaml

- documentation improvements

- a slew of other minor fixes and improvements to GPIO drivers

* tag 'gpio-updates-for-v5.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (34 commits)
  dt-bindings: gpio: add YAML description for rockchip,gpio-bank
  gpio: mxs: remove useless function
  dt-bindings: gpio: fairchild,74hc595: Convert to json-schema
  gpio: it87: remove unused code
  gpio: 104-dio-48e: Fix coding style issues
  gpio: mpc8xxx: Add ACPI support
  gpio: ich: Switch to be dependent on LPC_ICH
  gpio: sch: Drop MFD_CORE selection
  gpio: sch: depends on LPC_SCH
  gpiolib: acpi: Add quirk to ignore EC wakeups on Dell Venue 10 Pro 5055
  gpio: sch: Hook into ACPI GPE handler to catch GPIO edge events
  gpio: sch: Add edge event support
  gpio: aggregator: Replace custom get_arg() with a generic next_arg()
  lib/cmdline: Export next_arg() for being used in modules
  gpio: omap: Use device_get_match_data() helper
  gpio: Add Realtek Otto GPIO support
  dt-bindings: gpio: Binding for Realtek Otto GPIO
  docs: kernel-parameters: Add gpio_mockup_named_lines
  docs: kernel-parameters: Move gpio-mockup for alphabetic order
  lib: bitmap: provide devm_bitmap_alloc() and devm_bitmap_zalloc()
  ...

Merge tag 'char-misc-5.13-rc1-round2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
"Here are two char/misc fixes for 5.13-rc1 to resolve reported issues.

  The first is a bugfix for the nitro_enclaves driver that fixed some
  important problems. The second was a dyndbg bugfix that resolved some
  reported problems in dynamic debugging control.

  Both have been in linux-next for a while with no reported issues"

* tag 'char-misc-5.13-rc1-round2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  dyndbg: fix parsing file query without a line-range suffix
  nitro_enclaves: Fix stale file descriptors on failed usercopy

Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

Pull turbostat updates from Len Brown:
"Bug fixes and a smattering of features"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (21 commits)
  tools/power turbostat: version 2021.05.04
  tools/power turbostat: Support "turbostat --hide idle"
  tools/power turbostat: elevate priority of interval mode
  tools/power turbostat: formatting
  tools/power turbostat: rename tcc variables
  tools/power turbostat: add TCC Offset support
  tools/power turbostat: save original CPU model
  tools/power turbostat: Fix Core C6 residency on Atom CPUs
  tools/power turbostat: Print the C-state Pre-wake settings
  tools/power turbostat: Enable tsc_tweak for Elkhart Lake and Jasper Lake
  tools/power turbostat: unmark non-kernel-doc comment
  tools/power/turbostat: Remove Package C6 Retention on Ice Lake Server
  tools/power turbostat: Fix offset overflow issue in index converting
  tools/power/turbostat: Fix turbostat for AMD Zen CPUs
  tools/power turbostat: update version number
  tools/power turbostat: Fix DRAM Energy Unit on SKX
  Revert "tools/power turbostat: adjust for temperature offset"
  tools/power turbostat: Support Ice Lake D
  tools/power turbostat: Support Alder Lake Mobile
  tools/power turbostat: print microcode patch level
  ...

Merge tag 'ktest-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest

Pull ktest updates from Steven Rostedt:

- Added a KTEST section in the MAINTAINERS file

- Included John Hawley as a co-maintainer

- Add an example config that would work with VMware workstation guests

- Cleanups to the code

* tag 'ktest-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
  ktest: Add KTEST section to MAINTAINERS file
  ktest: Re-arrange the code blocks for better discoverability
  ktest: Further consistency cleanups
  ktest: Fixing indentation to match expected pattern
  ktest: Adding editor hints to improve consistency
  ktest: Add example config for using VMware VMs
  ktest: Minor cleanup with uninitialized variable $build_options

Merge tag 'safesetid-5.13' of git://github.com/micah-morton/linux

Pull SafeSetID update from Micah Morton:
"Simple code cleanup

  This just has a single three-line code cleanup to eliminate some
  unnecessary 'break' statements"

* tag 'safesetid-5.13' of git://github.com/micah-morton/linux:
  LSM: SafeSetID: Fix code specification by scripts/checkpatch.pl

kfence: use power-efficient work queue to run delayed work

Use the power-efficient work queue, to avoid the pathological case where
we keep pinning ourselves on the same possibly idle CPU on systems that
want to be power-efficient (https://lwn.net/Articles/731052/).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kfence: maximize allocation wait timeout duration

The allocation wait timeout was initially added because of warnings due to
CONFIG_DETECT_HUNG_TASK=y [1]. While the 1 sec timeout is sufficient to
resolve the warnings (given the hung task timeout must be 1 sec or larger)
it may cause unnecessary wake-ups if the system is idle:

https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com

Fix it by computing the timeout duration in terms of the current
sysctl_hung_task_timeout_secs value.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kfence: await for allocation using wait_event

Patch series "kfence: optimize timer scheduling", v2.

We have observed that mostly-idle systems with KFENCE enabled wake up
otherwise idle CPUs, preventing such to enter a lower power state.
Debugging revealed that KFENCE spends too much active time in
toggle_allocation_gate().

While the first version of KFENCE was using all the right bits to be
scheduling optimal, and thus power efficient, by simply using wait_event()
+ wake_up(), that code was unfortunately removed.

As KFENCE was exposed to various different configs and tests, the
scheduling optimal code slowly disappeared.  First because of hung task
warnings, and finally because of deadlocks when an allocation is made by
timer code with debug objects enabled.  Clearly, the "fixes" were not too
friendly for devices that want to be power efficient.

Therefore, let's try a little harder to fix the hung task and deadlock
problems that we have with wait_event() + wake_up(), while remaining as
scheduling friendly and power efficient as possible.

Crucially, we need to defer the wake_up() to an irq_work, avoiding any
potential for deadlock.

The result with this series is that on the devices where we observed a
power regression, power usage returns back to baseline levels.

This patch (of 3):

On mostly-idle systems, we have observed that toggle_allocation_gate() is
a cause of frequent wake-ups, preventing an otherwise idle CPU to go into
a lower power state.

A late change in KFENCE's development, due to a potential deadlock [1],
required changing the scheduling-friendly wait_event_timeout() and
wake_up() to an open-coded wait-loop using schedule_timeout().  [1]
https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com

To avoid unnecessary wake-ups, switch to using wait_event_timeout().

Unfortunately, we still cannot use a version with direct wake_up() in
__kfence_alloc() due to the same potential for deadlock as in [1].
Instead, add a level of indirection via an irq_work that is scheduled if
we determine that the kfence_timer requires a wake_up().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kfence: zero guard page after out-of-bounds access

After an out-of-bounds accesses, zero the guard page before re-protecting
in kfence_guarded_free(). On one hand this helps make the failure mode of
subsequent out-of-bounds accesses more deterministic, but could also
prevent certain information leaks.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Acked-by: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/process_vm_access.c: remove duplicate include

'linux/compat.h' included in 'process_vm_access.c' is duplicated.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhang Yunkai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mempool: minor coding style tweaks

Various coding style tweaks to various files under mm/

[[email protected]: mm/swapfile: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/sparse: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/vmscan: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/compaction: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/oom_kill: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/shmem: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/page_alloc: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/filemap: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/mlock: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/frontswap: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/vmalloc: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/memory_hotplug: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/mempolicy: minor coding style tweaks]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhiyuan Dai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/highmem.c: fix coding style issue

Delete/add some blank lines and some blank spaces

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: songqiang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

btrfs: use memzero_page() instead of open coded kmap pattern

There are many places where kmap/memset/kunmap patterns occur.

Use the newly lifted memzero_page() to eliminate direct uses of kmap and
leverage the new core functions use of kmap_local_page().

The development of this patch was aided by the following coccinelle
script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memset/kunmap pattern and replace with memset*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate. Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// Then the memset pattern
//
@ memset_rule1 @
expression page, V, L, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memset(ptr, 0, L);
+memzero_page(page, 0, L);
|
-memset(ptr + Off, 0, L);
+memzero_page(page, Off, L);
|
-memset(ptr, V, L);
+memset_page(page, V, 0, L);
|
-memset(ptr + Off, V, L);
+memset_page(page, V, Off, L);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule1
@
identifier memset_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

//
// Catch all
//
@ memset_rule2 @
expression page;
identifier ptr;
expression GenTo, GenSize, GenValue;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memset/memcpy
// The follow are catch alls which need to be evaluated by hand.
//
-memset(GenTo, 0, GenSize);
+memzero_pageExtra(page, GenTo, GenSize);
|
-memset(GenTo, GenValue, GenSize);
+memset_pageExtra(page, GenValue, GenTo, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule2
@
identifier memset_rule2.ptr;
type VP, VP1;
@@

-VP ptr;
... when != ptr;
? VP1 ptr;

// </smpl>

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Chaitanya Kulkarni <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iov_iter: lift memzero_page() to highmem.h

Patch series "btrfs: Convert kmap/memset/kunmap to memzero_user()".

Lifting memzero_user(), convert it to kmap_local_page() and then use it
in btrfs.

This patch (of 3):

memzero_page() can replace the kmap/memset/kunmap pattern in other
places in the code. While zero_user() has the same interface it is not
the same call and its use should be limited and some of those calls may
be better converted from zero_user() to memzero_page().[1] But that is
not addressed in this series.

Lift memzero_page() to highmem.

[1] https://lore.kernel.org/lkml/CAHk-=wijdojzo56FzYqE5TOYw2Vws7ik3LEMGj9SPQaJJ+Z73Q@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.

It can be optimized at compile time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: zhouchuangao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/zswap.c: switch from strlcpy to strscpy

strlcpy is marked as deprecated in Documentation/process/deprecated.rst,
and there is no functional difference when the caller expects truncation
(when not checking the return value). strscpy is relatively better as
it also avoids scanning the whole source string.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhiyuan Dai <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,memory_hotplug: add kernel boot option to enable memmap_on_memory

Self stored memmap leads to a sparse memory situation which is
unsuitable for workloads that requires large contiguous memory chunks,
so make this an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own memory space, as
suggested by David, so we can add memory_hotplug.memmap_on_memory
parameter.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported

Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory(). MHP_MEMMAP_ON_MEMORY can only
be set in case ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the
architecture supports altmap, and the range to be added spans a single
memory block.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,memory_hotplug: allocate memmap from the added memory range

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section.  Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
a) an existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
    This can even lead to extreme cases where system goes OOM because
    the physically hotplugged memory depletes the available memory before
    it is onlined.
b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
c) It might be there are no PMD_ALIGNED chunks so memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemap page tables can map arbitrary memory.  That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables.  This implementation uses the beginning of the hotplugged memory
for that purpose.

There are some non-obviously things to consider though.

Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed.  This means that the reserved physical range is not
online although it is used.  The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns.  The current design
expects that this should be OK as the hotplugged memory is considered a
garbage until it is onlined.  For example hibernation wouldn't save the
content of those vmmemmaps into the image so it wouldn't be restored on
resume but this should be OK as there no real content to recover anyway
while metadata is reachable from other data structures (e.g.  vmemmap
page tables).

The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.

As per above, the functions that are introduced are:

- mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
   kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
   fully span.

- mhp_deinit_memmap_on_memory:
   Offlines as many sections as vmemmap pages fully span, removes the
   range from zhe zone by remove_pfn_range_from_zone(), and calls
   kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages().  Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
present_pages is done at the end once we know that online_pages()
succedeed.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages().  This is necessary because
offline_pages() tears down some structures based on the fact whether the
node or the zone become empty.  If offline_pages() fails, we account back
vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove:

We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.

If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()

Let's have a single place (inspired by adjust_managed_page_count())
where we adjust present pages.

In contrast to adjust_managed_page_count(), only memory onlining or
offlining is allowed to modify the number of present pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,memory_hotplug: relax fully spanned sections check

We want {online,offline}_pages to operate on whole memblocks, but
memmap_on_memory will poke pageblock_nr_pages aligned holes in the
beginning, which is a special case we want to allow. Relax the check to
account for that case.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/base/memory: introduce memory_block_{online,offline}

Patch series "Allocate memmap from hotadded memory (per device)", v10.

The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).  The
current way we use to populate memmap (struct page array) has two main
drawbacks:

a) it consumes an additional memory until the hotadded memory itself is
   onlined and

b) memmap might end up on a different numa node which is especially
   true for movable_node configuration.

c) due to fragmentation we might end up populating memmap with base
   pages

One way to mitigate all these issues is to simply allocate memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hot-added memory itself.  SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range so the memory doesn't need to be online to be
usable for the array.  See patch 4 for more details.  This feature is
only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation wise we reuse vmem_altmap infrastructure to override the
default allocator used by vmemap_populate.  memory_block structure gains a
new field called nr_vmemmap_pages, which accounts for the number of
vmemmap pages used by that memory_block.  E.g: On x86_64, that is 512
vmemmap pages on small memory bloks and 4096 on large memory blocks (1GB)

We also introduce new two functions: memory_block_{online,offline}.  These
functions take care of initializing/unitializing vmemmap pages prior to
calling {online,offline}_pages, so the latter functions can remain totally
untouched.

More details can be found in the respective changelogs.

This patch (of 8):

This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().

For now, these functions will only call online_pages() and offline_pages()
respectively, but they will be later in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.

Since memory_block struct contains all the information, pass this struct
down the chain till the end functions.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove

zone_pcp_reset allegedly protects against a race with drain_pages using
local_irq_save but this is bogus.  local_irq_save only operates on the
local CPU.  If memory hotplug is running on CPU A and drain_pages is
running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch deletes IRQ disable/enable on the grounds that IRQs protect
nothing and assumes the existing hotplug paths guarantees the PCP cannot
be used after zone_pcp_enable().  That should be the case already
because all the pages have been freed and there is no page to put on the
PCP lists.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages

When pages are pinned they can be faulted in userland and migrated, and
they can be faulted right in kernel without migration.

In either case, the pinned pages must end-up being pinnable (not
movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct.  Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned.  This is checked
   automatically for you.

2) Verify that gup/pup performance is reasonable.  This requires
   comparing benchmarks between doing gup/pup on pages that have been
   pre-faulted in from user space, vs.  doing gup/pup on pages that are
   not faulted in until gup/pup time (via FOLL_TOUCH).  This decision is
   controlled with the new -z command line option.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

selftests/vm: gup_test: fix test flag

In gup_test both gup_flags and test_flags use the same flags field.
This is broken.

Farther, in the actual gup_test.c all the passed gup_flags are erased
and unconditionally replaced with FOLL_WRITE.

Which means that test_flags are ignored, and code like this always
performs pin dump test:

155   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156   nr = pin_user_pages(addr, nr, gup->flags,
157       pages + i, NULL);
158   else
159   nr = get_user_pages(addr, nr, gup->flags,
160       pages + i, NULL);
161   break;

Add a new test_flags field, to allow raw gup_flags to work.  Add a new
subcommand for DUMP_USER_PAGES_TEST to specify that pin test should be
performed.

Remove unconditional overwriting of gup_flags via FOLL_WRITE.  But,
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

Rename flags with gup_flags.

With the fix, dump works like this:

  root@virtme:/# gup_test  -c
  ---- page #0, starting from user virt addr: 0x7f8acb9e4000
  page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
  index:0x0 pfn:0x100bcf
  anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
  raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
  raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
  page dumped because: gup_test: dump_pages() test
  DUMP_USER_PAGES_TEST: done

  root@virtme:/# gup_test  -c -p
  ---- page #0, starting from user virt addr: 0x7fd19701b000
  page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
  index:0x0 pfn:0x108008
  anon flags: 0x300000000080014(uptodate|lru|swapbacked)
  raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
  raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
  page dumped because: gup_test: dump_pages() test
  DUMP_USER_PAGES_TEST: done

Refcount shows the difference between pin vs no-pin case.
Also change type of nr from int to long, as it counts number of pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: longterm pin migration cleanup

When pages are longterm pinned, we must migrated them out of movable zone.
The function that migrates them has a hidden loop with goto. The loop is
to retry on isolation failures, and after successful migration.

Make this code better by moving this loop to the caller.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: change index type to long as it counts pages

In __get_user_pages_locked() i counts number of pages which should be
long, as long is used in all other places to contain number of pages, and
32-bit becomes increasingly small for handling page count proportional
values.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning

Document the special handling of page pinning when ZONE_MOVABLE present.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: migrate pinned pages out of movable zone

We should not pin pages in ZONE_MOVABLE.  Currently, we do not pin only
movable CMA pages.  Generalize the function that migrates CMA pages to
migrate all movable pages.  Use is_pinnable_page() to check which pages
need to be migrated

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: do not migrate zero page

On some platforms ZERO_PAGE(0) might end-up in a movable zone. Do not
migrate zero page in gup during longterm pinning as migration of zero page
is not allowed.

For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
see the following:

Boot#1: zero_pfn 0x48a8d zero_pfn zone: ZONE_DMA32
Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE

On x86, empty_zero_page is declared in .bss and depending on the loader
may end up in different physical locations during boots.

Also, move is_zero_pfn() my_zero_pfn() functions under CONFIG_MMU, because
zero_pfn that they are using is declared in memory.c which is compiled
with CONFIG_MMU.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: honor PF_MEMALLOC_PIN for all movable pages

PF_MEMALLOC_PIN is only honored for CMA pages, extend this flag to work
for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from
gfp_mask when this flag is passed in the current context.

Add is_pinnable_page() to return true if page is in a pinnable page. A
pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: apply per-task gfp constraints in fast path

Function current_gfp_context() is called after fast path.  However, soon
we will add more constraints which will also limit zones based on
context.  Move this call into fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on value returned by
current_gfp_context() because it soon will modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during fast path,
but we already load current->flags in fast-path:

__alloc_pages()
prepare_alloc_pages()
  current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constrain logic to current_gfp_context() we
will be able to remove current->flags load from current_alloc_flags, and
therefore return fast-path to the current performance level.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not
return pages that might belong to CMA region.  This is currently used
for long term gup to make sure that such pins are not going to be done
on any CMA pages.

When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it
is focusing on CMA pages too much and that there is larger class of
pages that need the same treatment.  MOVABLE zone cannot contain any
long term pins as well so it makes sense to reuse and redefine this flag
for that usecase as well.  Rename the flag to PF_MEMALLOC_PIN which
defines an allocation context which can only get pages suitable for
long-term pins.

Also rename: memalloc_nocma_save()/memalloc_nocma_restore to
memalloc_pin_save()/memalloc_pin_restore() and make the new functions
common.

[[email protected]: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: check for isolation errors

It is still possible that we pin movable CMA pages if there are
isolation errors and cma_page_list stays empty when we check again.

Check for isolation errors, and return success only when there are no
isolation errors, and cma_page_list is empty after checking.

Because isolation errors are transient, we retry indefinitely.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: return an error on migration failure

When migration failure occurs, we still pin pages, which means that we
may pin CMA movable pages which should never be the case.

Instead return an error without pinning pages when migration failure
happens.

No need to retry migrating, because migrate_pages() already retries 10
times.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: check every subpage of a compound page during isolation

When pages are isolated in check_and_migrate_movable_pages() we skip
compound number of pages at a time. However, as Jason noted, it is not
necessary correct that pages[i] corresponds to the pages that we
skipped. This is because it is possible that the addresses in this
range had split_huge_pmd()/split_huge_pud(), and these functions do not
update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end-up having pages which are not compound
size page aligned.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
Signed-off-by: Pavel Tatashin <[email protected]>
Reported-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup: don't pin migrated cma pages in movable zone

Patch series "prohibit pinning pages in ZONE_MOVABLE", v11.

When page is pinned it cannot be moved and its physical address stays
the same until pages is unpinned.

This is useful functionality to allows userland to implementation DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.

It uses the same scheme logic that is currently used by CMA, and extends
the functionality for all allocations.

For more information read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com

This patch (of 14):

In order not to fragment CMA the pinned pages are migrated. However, they
are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning is
allowed.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/util.c: fix typo

s/condtion/condition/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/util.c: reduce mem_dump_obj() object size

Simplify the code by using a temporary and reduce the object size by
using a single call to pr_cont().  Reverse a test and unindent a block
too.

$ size mm/util.o* (defconfig x86-64)
   text    data     bss     dec     hex filename
   7419     372      40    7831    1e97 mm/util.o.new
   7477     372      40    7889    1ed1 mm/util.o.old

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: drop redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE

HAVE_ARCH_TRANSPARENT_HUGEPAGE has duplicate definitions on platforms
that subscribe it. Drop these reduntant definitions and instead just
select it on applicable platforms.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Vineet Gupta <[email protected]> [arc]
Cc: Russell King <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: drop redundant ARCH_ENABLE_SPLIT_PMD_PTLOCK

ARCH_ENABLE_SPLIT_PMD_PTLOCKS has duplicate definitions on platforms
that subscribe it. Drop these redundant definitions and instead just
select it on applicable platforms.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Heiko Carstens <[email protected]> [s390]
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION

ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on
platforms that subscribe them. Drop these reduntant definitions and
instead just select them appropriately.

[[email protected]: s/x86_64/X86_64/, per Oscar]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: generalize ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE]

ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE] configs have duplicate
definitions on platforms that subscribe them. Instead, just make them
generic options which can be selected on applicable platforms.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Heiko Carstens <[email protected]> [s390]
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: generalize SYS_SUPPORTS_HUGETLBFS (rename as ARCH_SUPPORTS_HUGETLBFS)

SYS_SUPPORTS_HUGETLBFS config has duplicate definitions on platforms
that subscribe it. Instead, just make it a generic option which can be
selected on applicable platforms.

Also rename it as ARCH_SUPPORTS_HUGETLBFS instead. This reduces code
duplication and makes it cleaner.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Palmer Dabbelt <[email protected]> [riscv]
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: generalize ARCH_HAS_CACHE_LINE_SIZE

Patch series "mm: some config cleanups", v2.

This series contains config cleanup patches which reduces code
duplication across platforms and also improves maintainability.  There
is no functional change intended with this series.

This patch (of 6):

ARCH_HAS_CACHE_LINE_SIZE config has duplicate definitions on platforms
that subscribe it.  Instead, just make it a generic option which can be
selected on applicable platforms.  This change reduces code duplication
and makes it cleaner.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Vineet Gupta <[email protected]> [arc]
Cc: Will Deacon <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: don't unlock VMAs in remap_file_pages()

Since this call uses MAP_FIXED, do_mmap() will munlock the necessary
range. There is also an error in the loop test expression which will
evaluate as false and the loop body has never execute.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/mm: track linear mapping split events

To help with debugging the sluggishness caused by TLB miss/reload, we
introduce monotonic hugepage [direct mapped] split event counts since
system state: SYSTEM_RUNNING to be displayed as part of /proc/vmstat in
x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
  ....
  swap_ra 0
  swap_ra_hit 0
  direct_map_level2_splits 94
  direct_map_level3_splits 4
  nr_unstable 0
  ....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same physical
addresses that have been already mapped in the kernel's direct mapping
range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify
attributes/permissions of the text segment hugepages that are direct
mapped causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] =
count(); }'
cat /proc/vmstat | grep -i
direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Saravanan D <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: /proc/sys/vm/stat_refresh stop checking monotonic numa stats

All of the VM NUMA stats are event counts, incremented never
decremented: it is not very useful for vmstat_refresh() to check them
throughout their first aeon, then warn on them throughout their next.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: /proc/sys/vm/stat_refresh skip checking known negative stats

vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
I am not yet certain why (but see Roman's note below).  Those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Roman Gushchin points out:
"For performance reasons, vmstat counters are incremented and
  decremented using per-cpu batches. vmstat_refresh() flushes the
  per-cpu batches on all CPUs, to get values as accurate as possible;
  but this method is not atomic, so the resulting value is not always
  precise.

  As a consequence, for those counters whose actual value is close to 0,
  a small negative value may occasionally be reported. If the value is
  small and the state is transient, it is not an indication of an error"

Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reported-by: Roman Gushchin <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: no more EINVAL from /proc/sys/vm/stat_refresh

EINVAL was good for drawing the refresher's attention to a warning in
dmesg, but became very tiresome when running test suites scripted with
"set -e": an underflow from a bug in one feature would cause unrelated
tests much later to fail, just because their /proc/sys/vm/stat_refresh
touch failed with that error. Stop doing that.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: restore node stat checking in /proc/sys/vm/stat_refresh

In v4.7 commit 52b6f46bc163 ("mm: /proc/sys/vm/stat_refresh to force
vmstat update") introduced vmstat_refresh(), with its vmstat underflow
checking; then in v4.8 commit 75ef71840539 ("mm, vmstat: add
infrastructure for per-node vmstats") split NR_VM_NODE_STAT_ITEMS out of
NR_VM_ZONE_STAT_ITEMS without updating vmstat_refresh(): so it has been
missing out much of the vmstat underflow checking ever since.

Reinstate it.

Thanks to Roman Gushchin <[email protected]> for tangentially pointing this out.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/ksm: remove unused parameter from remove_trailing_rmap_items()

Since commit 6514d511dbe5 ("ksm: singly-linked rmap_list") was merged,
remove_trailing_rmap_items() doesn't use the 'mm_slot' parameter. So
remove it, and update caller accordingly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengyang Fan <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ksm: fix potential missing rmap_item for stable_node

When removing rmap_item from stable tree, STABLE_FLAG of rmap_item is
cleared with head reserved.  So the following scenario might happen: For
ksm page with rmap_item1:

cmp_and_merge_page
  stable_node->head = &migrate_nodes;
  remove_rmap_item_from_tree, but head still equal to stable_node;
  try_to_merge_with_ksm_page failed;
  return;

For the same ksm page with rmap_item2, stable node migration succeed this
time.  The stable_node->head does not equal to migrate_nodes now.  For ksm
page with rmap_item1 again:

cmp_and_merge_page
stable_node->head != &migrate_nodes && rmap_item->head == stable_node
return;

We would miss the rmap_item for stable_node and might result in failed
rmap_walk_ksm().  Fix this by set rmap_item->head to NULL when rmap_item
is removed from stable tree.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4146d2d673e8 ("ksm: make !merge_across_nodes migration safe")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ksm: remove dedicated macro KSM_FLAG_MASK

The macro KSM_FLAG_MASK is used in rmap_walk_ksm() only. So we can
replace ~KSM_FLAG_MASK with PAGE_MASK to remove this dedicated macro and
make code more consistent because PAGE_MASK is used elsewhere in this
file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ksm: use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()

It's unnecessary to lock the page when get ksm page if we're going to
remove the rmap item as page migration is irrelevant in this case. Use
GET_KSM_PAGE_NOLOCK instead to save some page lock cycles.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ksm: remove redundant VM_BUG_ON_PAGE() on stable_tree_search()

Patch series "Cleanup and fixup for ksm".

This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE and
dedicated macro KSM_FLAG_MASK. Also this fixes potential missing
rmap_item for stable_node which would result in failed rmap_walk_ksm().
More details can be found in the respective changelogs.

This patch (of 4):

The same VM_BUG_ON_PAGE() check is already done in the callee. Remove
these extra caller one to simplify code slightly.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use proper type for cma_[alloc|release]

size_t in cma_alloc is confusing since it makes people think it's byte
count, not pages. Change it to unsigned long[1].

The unsigned int in cma_release is also not right so change it. Since we
have unsigned long in cma_release, free_contig_range should also respect
it.

[1] 67a2e213e7e9, mm: cma: fix incorrect type conversion for size during dma allocation

Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cma: add the CMA instance name to cma trace events

There were missing places to add cma instance name. To identify each CMA
instance, let's add the name for every cma trace. This patch also changes
the existing cma_trace_alloc to cma_trace_finish since we have
cma_alloc_start[1].

[1] https://lore.kernel.org/linux-mm/20210324160740 [email protected]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Cc: Liam Mark <[email protected]>
Cc: Georgi Djakov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cma: support sysfs

Since CMA is getting used more widely, it's more important to keep
monitoring CMA statistics for system health since it's directly related to
user experience.

This patch introduces sysfs statistics for CMA, in order to provide some
basic monitoring of the CMA allocator.

* the number of CMA page successful allocations
* the number of CMA page allocation failures

These two values allow the user to calcuate the allocation
failure rate for each CMA area.

e.g.)
  /sys/kernel/mm/cma/WIFI/alloc_pages_[success|fail]
  /sys/kernel/mm/cma/SENSOR/alloc_pages_[success|fail]
  /sys/kernel/mm/cma/BLUETOOTH/alloc_pages_[success|fail]

The cma_stat was intentionally allocated by dynamic allocation
to harmonize with kobject lifetime management.
https://lore.kernel.org/linux-mm/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]>
Reviewed-by: Dmitry Osipenko <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Tested-by: Anders Roxell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: John Dias <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Colin Ian King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cma: add trace events for CMA alloc perf testing

Add cma and migrate trace events to enable CMA allocation performance to
be measured via ftrace.

[[email protected]: add the CMA instance name to the cma_alloc_start trace event]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam Mark <[email protected]>
Signed-off-by: Georgi Djakov <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: cma: use pr_err_ratelimited for CMA warning

If we did not reserve extra CMA memory, the log buffer can be easily
filled up by CMA failure warning when the devices calling
dmam_alloc_coherent() to alloc DMA memory. Thus we can use
pr_err_ratelimited() instead to reduce the duplicate CMA warning.

Link: https://lkml.kernel.org/r/ce2251ef49e1727a9a40531d1996660b05462bd2.1615279825.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmstat: add cma statistics

Since CMA is used more widely, it's worth to have CMA allocation
statistics into vmstat. With it, we could know how agressively system
uses cma allocation and how often it fails.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: John Dias <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Revert "mm: migrate: skip shared exec THP for NUMA balancing"

This reverts commit c77c5cbafe549eb330e8909861a3e16cbda2c848.

Since commit c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA
balancing"), the NUMA balancing would skip shared exec transhuge page.
But this enhancement is not suitable for transhuge page.  Because it's
required that page_mapcount() must be 1 due to no migration pte dance is
done here.  On the other hand, the shared exec transhuge page will leave
the migrate_misplaced_page() with pte entry untouched and page locked.
Thus pagefault for NUMA will be triggered again and deadlock occurs when
we start waiting for the page lock held by ourselves.

Yang Shi said:

"Thanks for catching this. By relooking the code I think the other
  important reason for removing this is
  migrate_misplaced_transhuge_page() actually can't see shared exec
  file THP at all since page_lock_anon_vma_read() is called before
  and if page is not anonymous page it will just restore the PMD
  without migrating anything.
  The pages for private mapped file vma may be anonymous pages due to
  COW but they can't be THP so it won't trigger THP numa fault at all. I
  think this is why no bug was reported. I overlooked this in the first
  place."

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Rafael Aquini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: use helper migrate_vma_collect_skip() in migrate_vma_collect_hole()

It's more recommended to use helper function migrate_vma_collect_skip() to
skip the unexpected case and it also helps remove some duplicated codes.
Move migrate_vma_collect_skip() above migrate_vma_collect_hole() to avoid
compiler warning.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()

If the zone device page does not belong to un-addressable device memory,
the variable entry will be uninitialized and lead to indeterminate pte
entry ultimately. Fix this unexpected case and warn about it.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case

It's guaranteed that in the 'else' case of the rc == MIGRATEPAGE_SUCCESS
check, rc does not equal to MIGRATEPAGE_SUCCESS. Remove this unnecessary
check.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Rafael Aquini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/migrate.c: make putback_movable_page() static

Patch series "Cleanup and fixup for mm/migrate.c", v3.

This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE and rc
!= MIGRATEPAGE_SUCCESS check.  Also use helper function to remove some
duplicated codes.  What's more, this fixes potential deadlock in NUMA
balancing shared exec THP case and so on.  More details can be found in
the respective changelogs.

This patch (of 5):

The putback_movable_page() is just called by putback_movable_pages() and
we know the page is locked and both PageMovable() and PageIsolated() is
checked right before calling putback_movable_page().  So we make it static
and remove all the 3 VM_BUG_ON_PAGE().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Alistair Popple <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: fs: invalidate BH LRU during page migration

Pages containing buffer_heads that are in one of the per-CPU buffer_head
LRU caches will be pinned and thus cannot be migrated. This can prevent
CMA allocations from succeeding, which are often used on platforms with
co-processors (such as a DSP) that can only use physically contiguous
memory. It can also prevent memory hot-unplugging from succeeding,
which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory,
which ranges from 8 MiB to 1 GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration starts
and stop any buffer_head from being cached in the LRU caches, until
migration has finished.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Chris Goldsworthy <[email protected]>
Reported-by: Laura Abbott <[email protected]>
Tested-by: Oliver Sang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: John Dias <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]

Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable]. There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

note: migrate_prep_local in compaction.c changed into lru_add_drain to
avoid CPU schedule cost with involving many other CPUs to keep old
behavior.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Chris Goldsworthy <[email protected]>
Cc: John Dias <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oliver Sang <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: disable LRU pagevec during the migration temporarily

LRU pagevec holds refcount of pages until the pagevec are drained.  It
could prevent migration since the refcount of the page is greater than
the expection in migration logic.  To mitigate the issue, callers of
migrate_pages drains LRU pagevec via migrate_prep or lru_add_drain_all
before migrate_pages call.

However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration.  Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.

To close the race, this patch disables lru caches(i.e, pagevec) during
ongoing migration until migrate is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a sync
migration) with below debug code.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

  if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
         printk(KERN_ERR, "pfn 0x%lx reason %d", page_to_pfn(page), rc);
         dump_page(page, "fail to migrate");
  }

The test was repeating android apps launching with cma allocation in
background every five seconds.  Total cma allocation count was about 500
during the testing.  With this patch, the dump_page count was reduced
from 400 to 30.

The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure.  This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once.  This is also in
line with pcp allocator cache which are disabled for the offlining as
well.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Chris Goldsworthy <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: John Dias <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Oliver Sang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: compaction: update the COMPACT[STALL|FAIL] events properly

By definition, COMPACT[STALL|FAIL] events needs to be counted when there
is 'At least in one zone compaction wasn't deferred or skipped from the
direct compaction'. And when compaction is skipped or deferred,
COMPACT_SKIPPED will be returned but it will still go and update these
compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
is counted without even trying the compaction.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
the unnecessary try into the get_page_from_freelist() when compaction is
not even tried.

There is a corner case where compaction is skipped but still count
COMPACTSTALL event, which is that IRQ came and freed the page and the
same is captured in capture_control.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Charan Teja Reddy <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/compaction: remove unused variable sysctl_compact_memory

The sysctl_compact_memory is mostly unused in mm/compaction.c It just
acts as a place holder for sysctl to store .data.

But the .data itself is not needed here.

So we can get ride of this variable completely and make .data as NULL.
This will also eliminate the extern declaration from header file. No
functionality is broken or changed this way.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pintu Kumar <[email protected]>
Signed-off-by: Pintu Agarwal <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: shrink deferred objects proportional to priority

The number of deferred objects might get windup to an absurd number, and
it results in clamp of slab objects. It is undesirable for sustaining
workingset.

So shrink deferred objects proportional to priority and cap nr_deferred
to twice of cache items.

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618 [email protected]/

Tested with kernel build and vfs metadata heavy workload in our
production environment, no regression is spotted so far.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: reparent nr_deferred when memcg offline

Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add
to parent's corresponding nr_deferred when memcg offline.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

Now nr_deferred is available on per memcg level for memcg aware
shrinkers, so don't need allocate shrinker->nr_deferred for such
shrinkers anymore.

The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or
memcg is disabled by kernel command line, then shrinker's
SHRINKER_MEMCG_AWARE flag would be cleared. This makes the
implementation of this patch simpler.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: use per memcg nr_deferred of shrinker

Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's
nr_deferred will be used in the following cases:

    1. Non memcg aware shrinkers
    2. !CONFIG_MEMCG
    3. memcg is disabled by boot parameter

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: add per memcg shrinker nr_deferred

Currently the number of deferred objects are per shrinker, but some
slabs, for example, vfs inode/dentry cache are per memcg, this would
result in poor isolation among memcgs.

The deferred objects typically are generated by __GFP_NOFS allocations,
one memcg with excessive __GFP_NOFS allocations may blow up deferred
objects, then other innocent memcgs may suffer from over shrink,
excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively,
workload in B is vfs heavy workload.  Workload in A generates excessive
deferred objects, then B's vfs cache might be hit heavily (drop half of
caches) by B's limit reclaim or global reclaim.

We observed this hit in our production environment which was running vfs
heavy workload shown as the below tracing log:

  <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
  nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
  cache items 246404277 delta 31345 total_scan 123202138
  <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
  nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
  last shrinker return val 123186855

The vfs cache and page cache ratio was 10:1 on this machine, and half of
caches were dropped.  This also resulted in significant amount of page
caches were dropped due to inodes eviction.

Make nr_deferred per memcg for memcg aware shrinkers would solve the
unfairness and bring better isolation.

The following patch will add nr_deferred to parent memcg when memcg
offline.  To preserve nr_deferred when reparenting memcgs to root, root
memcg needs shrinker_info allocated too.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the
shrinker's nr_deferred would be used.  And non memcg aware shrinkers use
shrinker's nr_deferred all the time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: use a new flag to indicate shrinker is registered

Currently registered shrinker is indicated by non-NULL
shrinker->nr_deferred. This approach is fine with nr_deferred at the
shrinker level, but the following patches will move MEMCG_AWARE
shrinkers' nr_deferred to memcg level, so their shrinker->nr_deferred
would always be NULL. This would prevent the shrinkers from
unregistering correctly.

Remove SHRINKER_REGISTERING since we could check if shrinker is
registered successfully by the new flag.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: add shrinker_info_protected() helper

The shrinker_info is dereferenced in a couple of places via
rcu_dereference_protected with different calling conventions, for
example, using mem_cgroup_nodeinfo helper or dereferencing
memcg->nodeinfo[nid]->shrinker_info. And the later patch will add more
dereference places.

So extract the dereference into a helper to make the code more readable.
No functional change.

[[email protected]: retain rcu_dereference_protected() in free_shrinker_info(), per Hugh]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: rename shrinker_map to shrinker_info

The following patch is going to add nr_deferred into shrinker_map, the
change will make shrinker_map not only include map anymore, so rename it
to "memcg_shrinker_info". And this should make the patch adding
nr_deferred cleaner and readable and make review easier. Also remove the
"memcg_" prefix.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: use kvfree_rcu instead of call_rcu

Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
We don't have to define a dedicated callback for call_rcu() anymore.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vmscan: remove memcg_shrinker_map_size

Both memcg_shrinker_map_size and shrinker_nr_max is maintained, but
actually the map size can be calculated via shrinker_nr_max, so it seems
unnecessary to keep both. Remove memcg_shrinker_map_size since
shrinker_nr_max is also used by iterating the bit map.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>