Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CONFIG_PREEMPT_RT stub config from Thomas Gleixner:
"The real-time preemption patch set exists for almost 15 years now and
while the vast majority of infrastructure and enhancements have found
their way into the mainline kernel, the final integration of RT is
still missing.
Over the course of the last few years, we have worked on reducing the
intrusivenness of the RT patches by refactoring kernel infrastructure
to be more real-time friendly. Almost all of these changes were
benefitial to the mainline kernel on their own, so there was no
objection to integrate them.
Though except for the still ongoing printk refactoring, the remaining
changes which are required to make RT a first class mainline citizen
are not longer arguable as immediately beneficial for the mainline
kernel. Most of them are either reordering code flows or adding RT
specific functionality.
But this now has hit a wall and turned into a classic hen and egg
problem:
Maintainers are rightfully wary vs. these changes as they make only
sense if the final integration of RT into the mainline kernel takes
place.
Adding CONFIG_PREEMPT_RT aims to solve this as a clear sign that RT
will be fully integrated into the mainline kernel. The final
integration of the missing bits and pieces will be of course done with
the same careful approach as we have used in the past.
While I'm aware that you are not entirely enthusiastic about that, I
think that RT should receive the same treatment as any other widely
used out of tree functionality, which we have accepted into mainline
over the years.
RT has become the de-facto standard real-time enhancement and is
shipped by enterprise, embedded and community distros. It's in use
throughout a wide range of industries: telecommunications, industrial
automation, professional audio, medical devices, data acquisition,
automotive - just to name a few major use cases.
RT development is backed by a Linuxfoundation project which is
supported by major stakeholders of this technology. The funding will
continue over the actual inclusion into mainline to make sure that the
functionality is neither introducing regressions, regressing itself,
nor becomes subject to bitrot. There is also a lifely user community
around RT as well, so contrary to the grim situation 5 years ago, it's
a healthy project.
As RT is still a good vehicle to exercise rarely used code paths and
to detect hard to trigger issues, you could at least view it as a QA
tool if nothing else"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is the final round of mostly small fixes in our initial submit.
It's mostly minor fixes and driver updates. The only change of note is
adding a virt_boundary_mask to the SCSI host and host template to
parametrise this for NVMe devices instead of having them do a call in
slave_alloc. It's a fairly straightforward conversion except in the
two NVMe handling drivers that didn't set it who now have a virtual
infinity parameter added"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (24 commits)
scsi: megaraid_sas: set an unlimited max_segment_size
scsi: mpt3sas: set an unlimited max_segment_size for SAS 3.0 HBAs
scsi: IB/srp: set virt_boundary_mask in the scsi host
scsi: IB/iser: set virt_boundary_mask in the scsi host
scsi: storvsc: set virt_boundary_mask in the scsi host template
scsi: ufshcd: set max_segment_size in the scsi host template
scsi: core: take the DMA max mapping size into account
scsi: core: add a host / host template field for the virt boundary
scsi: core: Fix race on creating sense cache
scsi: sd_zbc: Fix compilation warning
scsi: libfc: fix null pointer dereference on a null lport
scsi: zfcp: fix GCC compiler warning emitted with -Wmaybe-uninitialized
scsi: zfcp: fix request object use-after-free in send path causing wrong traces
scsi: zfcp: fix request object use-after-free in send path causing seqno errors
scsi: megaraid_sas: Update driver version to 07.710.50.00
scsi: megaraid_sas: Add module parameter for FW Async event logging
scsi: megaraid_sas: Enable msix_load_balance for Invader and later controllers
scsi: megaraid_sas: Fix calculation of target ID
scsi: lpfc: reduce stack size with CONFIG_GCC_PLUGIN_STRUCTLEAK_VERBOSE
scsi: devinfo: BLIST_TRY_VPD_PAGES for SanDisk Cruzer Blade
...
Merge tag 'kbuild-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- match the directory structure of the linux-libc-dev package to that
of Debian-based distributions
- fix incorrect include/config/auto.conf generation when Kconfig
creates it along with the .config file
- remove misleading $(AS) from documents
- clean up precious tag files by distclean instead of mrproper
- add a new coccinelle patch for devm_platform_ioremap_resource
migration
- refactor module-related scripts to read modules.order instead of
$(MODVERDIR)/*.mod files to get the list of created modules
- remove MODVERDIR
- update list of header compile-test
- add -fcf-protection=none flag to avoid conflict with the retpoline
flags when CONFIG_RETPOLINE=y
- misc cleanups
* tag 'kbuild-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
kbuild: add -fcf-protection=none when using retpoline flags
kbuild: update compile-test header list for v5.3-rc1
kbuild: split out *.mod out of {single,multi}-used-m rules
kbuild: remove 'prepare1' target
kbuild: remove the first line of *.mod files
kbuild: create *.mod with full directory path and remove MODVERDIR
kbuild: export_report: read modules.order instead of .tmp_versions/*.mod
kbuild: modpost: read modules.order instead of $(MODVERDIR)/*.mod
kbuild: modsign: read modules.order instead of $(MODVERDIR)/*.mod
kbuild: modinst: read modules.order instead of $(MODVERDIR)/*.mod
scsi: remove pointless $(MODVERDIR)/$(obj)/53c700.ver
kbuild: remove duplication from modules.order in sub-directories
kbuild: get rid of kernel/ prefix from in-tree modules.{order,builtin}
kbuild: do not create empty modules.order in the prepare stage
coccinelle: api: add devm_platform_ioremap_resource script
kbuild: compile-test headers listed in header-test-m as well
kbuild: remove unused hostcc-option
kbuild: remove tag files by distclean instead of mrproper
kbuild: add --hash-style= and --build-id unconditionally
kbuild: get rid of misleading $(AS) from documents
...
Merge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache and mountpoint updates from Al Viro:
"Saner handling of refcounts to mountpoints.
Transfer the counting reference from struct mount ->mnt_mountpoint
over to struct mountpoint ->m_dentry. That allows us to get rid of the
convoluted games with ordering of mount shutdowns.
The cost is in teaching shrink_dcache_{parent,for_umount} to cope with
mixed-filesystem shrink lists, which we'll also need for the Slab
Movable Objects patchset"
* 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch the remnants of releasing the mountpoint away from fs_pin
get rid of detach_mnt()
make struct mountpoint bear the dentry reference to mountpoint, not struct mount
Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists
fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt()
__detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore
nfs: dget_parent() never returns NULL
ceph: don't open-code the check for dead lockref
Peter Zijlstra [Thu, 18 Jul 2019 09:20:09 +0000 (11:20 +0200)]
smp: Warn on function calls from softirq context
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.
A single function call can be invoked from softirq context only via
smp_call_function_single_async().
The only legit context is task context, so add a warning to that effect.
Paolo Bonzini [Fri, 19 Jul 2019 16:41:10 +0000 (18:41 +0200)]
KVM: nVMX: do not use dangling shadow VMCS after guest reset
If a KVM guest is reset while running a nested guest, free_nested will
disable the shadow VMCS execution control in the vmcs01. However,
on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
the VMCS12 to the shadow VMCS which has since been freed.
This causes a vmptrld of a NULL pointer on my machime, but Jan reports
the host to hang altogether. Let's see how much this trivial patch fixes.
Like Xu [Thu, 18 Jul 2019 05:35:14 +0000 (13:35 +0800)]
KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
If a perf_event creation fails due to any reason of the host perf
subsystem, it has no chance to log the corresponding event for guest
which may cause abnormal sampling data in guest result. In debug mode,
this message helps to understand the state of vPMC and we may not
limit the number of occurrences but not in a spamming style.
Wanpeng Li [Thu, 18 Jul 2019 11:39:06 +0000 (19:39 +0800)]
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
When CPU raise #NPF on guest data access and guest CR4.SMAP=1, it is
possible that CPU microcode implementing DecodeAssist will fail
to read bytes of instruction which caused #NPF. This is AMD errata
1096 and it happens because CPU microcode reading instruction bytes
incorrectly attempts to read code as implicit supervisor-mode data
accesses (that is, just like it would read e.g. a TSS), which are
susceptible to SMAP faults. The microcode reads CS:RIP and if it is
a user-mode address according to the page tables, the processor
gives up and returns no instruction bytes. In this case,
GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
return 0 instead of the correct guest instruction bytes.
Current KVM code attemps to detect and workaround this errata, but it
has multiple issues:
1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
which is required for encountering a SMAP fault.
2) It assumes SMAP faults can only occur when guest CPL==3.
However, in case guest CR4.SMEP=0, the guest can execute an instruction
which reside in a user-accessible page with CPL<3 priviledge. If this
instruction raise a #NPF on it's data access, then CPU DecodeAssist
microcode will still encounter a SMAP violation. Even though no sane
OS will do so (as it's an obvious priviledge escalation vulnerability),
we still need to handle this semanticly correct in KVM side.
Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easy
triggerable condition and guests usually enable SMAP together with SMEP.
If the vCPU has CR4.SMEP=1, the errata could indeed be encountered onlt
at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
instead of #NPF. We keep this condition to avoid false positives in
the detection of the errata.
In addition, to avoid future confusion and improve code readbility,
include details of the errata in code and not just in commit message.
Wanpeng Li [Sat, 6 Jul 2019 01:26:51 +0000 (09:26 +0800)]
KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside. There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits. This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.
In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.
The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially. Without patch
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% )
While with patch:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% )
kbuild: add -fcf-protection=none when using retpoline flags
The gcc -fcf-protection=branch option is not compatible with
-mindirect-branch=thunk-extern. The latter is used when
CONFIG_RETPOLINE is selected, and this will fail to build with
a gcc which has -fcf-protection=branch enabled by default. Adding
-fcf-protection=none when building with retpoline enabled
prevents such build failures.
Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC defconfig updates from Olof Johansson:
"We keep this in a separate branch to avoid cross-branch conflicts, but
most of the material here is fairly boring -- some new drivers turned
on for hardware since they were merged, and some refreshed files due
to time having moved a lot of entries around"
* tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (47 commits)
ARM: configs: multi_v5: Remove duplicate ASPEED options
arm64: defconfig: Enable CONFIG_KEYBOARD_SNVS_PWRKEY as module
ARM: imx_v6_v7_defconfig: Enable CONFIG_ARM_IMX_CPUFREQ_DT
defconfig: arm64: enable i.MX8 SCU octop driver
arm64: defconfig: Add i.MX SCU SoC info driver
arm64: defconfig: Enable CONFIG_QORIQ_THERMAL
ARM: imx_v6_v7_defconfig: Select CONFIG_NVMEM_SNVS_LPGPR
arm64: defconfig: ARM_IMX_CPUFREQ_DT=m
ARM: imx_v6_v7_defconfig: Add TPM PWM support by default
ARM: imx_v6_v7_defconfig: Enable the OV2680 camera driver
ARM: imx_v6_v7_defconfig: Enable CONFIG_THERMAL_STATISTICS
arm64: defconfig: NVMEM_IMX_OCOTP=y for imx8m
ARM: multi_v7_defconfig: enable STMFX pinctrl support
arm64 defconfig: enable LVM support
ARM: configs: multi_v5: Add more ASPEED devices
arm64: defconfig: Add Tegra194 PCIe driver
ARM: configs: aspeed: Add new drivers
ARM: exynos_defconfig: Enable Panfrost and Lima drivers
ARM: multi_v7_defconfig: Enable Panfrost and Lima drivers
arm64 defconfig: enable Mellanox cards
...
Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM Devicetree updates from Olof Johansson:
"We continue to see a lot of new material. I've highlighted some of it
below, but there's been more beyond that as well.
One of the sweeping changes is that many boards have seen their ARM
Mali GPU devices added to device trees, since the DRM drivers have now
been merged.
So, with the caveat that I have surely missed several great
contributions, here's a collection of the material this time around:
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC-related driver updates from Olof Johansson:
"Various driver updates for platforms and a couple of the small driver
subsystems we merge through our tree:
- A driver for SCU (system control) on NXP i.MX8QXP
- Support for a newer version of DRAM PHY driver for Broadcom (DPFE)
- Reset controller support for Bitmain BM1880
- TI SCI (System Control Interface) support for CPU control on AM654
processors
- More TI sysc refactoring and rework"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (84 commits)
reset: remove redundant null check on pointer dev
soc: rockchip: work around clang warning
dt-bindings: reset: imx7: Fix the spelling of 'indices'
soc: imx: Add i.MX8MN SoC driver support
soc: aspeed: lpc-ctrl: Fix probe error handling
soc: qcom: geni: Add support for ACPI
firmware: ti_sci: Fix gcc unused-but-set-variable warning
firmware: ti_sci: Use the correct style for SPDX License Identifier
soc: imx8: Use existing of_root directly
soc: imx8: Fix potential kernel dump in error path
firmware/psci: psci_checker: Park kthreads before stopping them
memory: move jedec_ddr.h from include/memory to drivers/memory/
memory: move jedec_ddr_data.c from lib/ to drivers/memory/
MAINTAINERS: Remove myself as qcom maintainer
soc: aspeed: lpc-ctrl: make parameter optional
soc: qcom: apr: Don't use reg for domain id
soc: qcom: fix QCOM_AOSS_QMP dependency and build errors
memory: tegra: Fix -Wunused-const-variable
firmware: tegra: Early resume BPMP
soc/tegra: Select pinctrl for Tegra194
...
Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC platform updates from Olof Johansson:
"SoC platform changes. Main theme this merge window:
- The Netx platform (Netx 100/500) platform is removed by Linus
Walleij-- the SoC doesn't have active maintainers with hardware,
and in discussions with the vendor the agreement was that it's OK
to remove.
- Russell King has a series of patches that cleans up and refactors
SA1101 and RiscPC support"
* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (47 commits)
ARM: stm32: use "depends on" instead of "if" after prompt
ARM: sa1100: convert to common clock framework
ARM: exynos: Cleanup cppcheck shifting warning
ARM: pxa/lubbock: remove lubbock_set_misc_wr() from global view
ARM: exynos: Only build MCPM support if used
arm: add missing include platform-data/atmel.h
ARM: davinci: Use GPIO lookup table for DA850 LEDs
ARM: OMAP2: drop explicit assembler architecture
ARM: use arch_extension directive instead of arch argument
ARM: imx: Switch imx7d to imx-cpufreq-dt for speed-grading
ARM: bcm: Enable PINCTRL for ARCH_BRCMSTB
ARM: bcm: Enable ARCH_HAS_RESET_CONTROLLER for ARCH_BRCMSTB
ARM: riscpc: enable chained scatterlist support
ARM: riscpc: reduce IRQ handling code
ARM: riscpc: move RiscPC assembly files from arch/arm/lib to mach-rpc
ARM: riscpc: parse video information from tagged list
ARM: riscpc: add ecard quirk for Atomwide 3port serial card
MAINTAINERS: mvebu: Add git entry
soc: ti: pm33xx: Add a print while entering RTC only mode with DDR in self-refresh
ARM: OMAP2+: Make some variables static
...
Merge tag 'drm-next-2019-07-19' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
"Dave is back in shape, but now family got it so I'm doing the pull.
Two things worthy of note:
- nouveau feature pull was way too late, Dave&me decided to not take
that, so Ben spun up a pull with just the fixes.
- after some chatting with the arm display maintainers we decided to
change a bit how that's maintained, for more oversight/review and
cross vendor collab.
amdgpu:
- large pile of fixes for new hw support this release (navi, vega20)
- audio hotplug fix
- bunch of corner cases and small fixes all over for amdgpu/kfd
komeda:
- back out some new properties (from this merge window) that needs
more pondering.
bochs:
- fb pitch setup
core:
- a new panel quirk
- misc fixes"
* tag 'drm-next-2019-07-19' of git://anongit.freedesktop.org/drm/drm: (73 commits)
drm/nouveau/secboot/gp102-: remove WAR for SEC2 RTOS start bug
drm/nouveau/flcn/gp102-: improve implementation of bind_context() on SEC2/GSP
drm/nouveau: fix memory leak in nouveau_conn_reset()
drm/nouveau/dmem: missing mutex_lock in error path
drm/nouveau/hwmon: return EINVAL if the GPU is powered down for sensors reads
drm/nouveau: fix bogus GPL-2 license header
drm/nouveau: fix bogus GPL-2 license header
drm/nouveau/i2c: Enable i2c pads & busses during preinit
drm/nouveau/disp/tu102-: wire up scdc parameter setter
drm/nouveau/core: recognise TU116 chipset
drm/nouveau/kms: disallow dual-link harder if hdmi connection detected
drm/nouveau/disp/nv50-: fix center/aspect-corrected scaling
drm/nouveau/disp/nv50-: force scaler for any non-default LVDS/eDP modes
drm/nouveau/mcp89/mmu: Use mcp77_mmu_new instead of g84_mmu_new on MCP89.
drm/amd/display: init res_pool dccg_ref, dchub_ref with xtalin_freq
drm/amdgpu/pm: remove check for pp funcs in freq sysfs handlers
drm/amd/display: Force uclk to max for every state
drm/amdkfd: Remove GWS from process during uninit
drm/amd/amdgpu: Fix offset for vmid selection in debugfs interface
drm/amd/powerplay: update vega20 driver if to fit latest SMU firmware
...
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
- Fix missed wake-up race in padata
- Use crypto_memneq in ccp
- Fix version check in ccp
- Fix fuzz test failure in ccp
- Fix potential double free in crypto4xx
- Fix compile warning in stm32
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
crypto: ccp - Fix SEV_VERSION_GREATER_OR_EQUAL
crypto: ccp/gcm - use const time tag comparison.
crypto: ccp - memset structure fields to zero before reuse
crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
crypto: stm32/hash - Fix incorrect printk modifier for size_t
Merge tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Eiichi Tsukata found a small bug from the fixup of the stack code
Removing ULONG_MAX as the marker for the user stack trace end, made
the tracing code not know where the end is. The end is now marked with
a zero (NULL) pointer. Eiichi fixed this in the tracing code"
* tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix user stack trace "??" output
Merge tag 'csky-for-linus-5.3-rc1' of git://github.com/c-sky/csky-linux
Pull arch/csky pupdates from Guo Ren:
"This round of csky subsystem gives two features (ASID algorithm
update, Perf pmu record support) and some fixups.
ASID updates:
- Revert mmu ASID mechanism
- Add new asid lib code from arm
- Use generic asid algorithm to implement switch_mm
- Improve tlb operation with help of asid
Perf pmu record support:
- Init pmu as a device
- Add count-width property for csky pmu
- Add pmu interrupt support
- Fix perf record in kernel/user space
- dt-bindings: Add csky PMU bindings
Fixes:
- Fixup no panic in kernel for some traps
- Fixup some error count in 810 & 860.
- Fixup abiv1 memset error"
* tag 'csky-for-linus-5.3-rc1' of git://github.com/c-sky/csky-linux:
csky: Fixup abiv1 memset error
csky: Improve tlb operation with help of asid
csky: Use generic asid algorithm to implement switch_mm
csky: Add new asid lib code from arm
csky: Revert mmu ASID mechanism
dt-bindings: csky: Add csky PMU bindings
dt-bindings: interrupt-controller: Update csky mpintc
csky: Fixup some error count in 810 & 860.
csky: Fix perf record in kernel/user space
csky: Add pmu interrupt support
csky: Add count-width property for csky pmu
csky: Init pmu as a device
csky: Fixup no panic in kernel for some traps
csky: Select intc & timer drivers
Merge tag 'for-linus-5.3a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Fixes and features:
- A series to introduce a common command line parameter for disabling
paravirtual extensions when running as a guest in virtualized
environment
- A fix for int3 handling in Xen pv guests
- Removal of the Xen-specific tmem driver as support of tmem in Xen
has been dropped (and it was experimental only)
- A security fix for running as Xen dom0 (XSA-300)
- A fix for IRQ handling when offlining cpus in Xen guests
- Some small cleanups"
* tag 'for-linus-5.3a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: let alloc_xenballooned_pages() fail if not enough memory free
xen/pv: Fix a boot up hang revealed by int3 self test
x86/xen: Add "nopv" support for HVM guest
x86/paravirt: Remove const mark from x86_hyper_xen_hvm variable
xen: Map "xen_nopv" parameter to "nopv" and mark it obsolete
x86: Add "nopv" parameter to disable PV extensions
x86/xen: Mark xen_hvm_need_lapic() and xen_x2apic_para_available() as __init
xen: remove tmem driver
Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
xen/events: fix binding user event channels to cpus
Merge tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap split/cleanup from Darrick Wong:
"As promised, here's the second part of the iomap merge for 5.3, in
which we break up iomap.c into smaller files grouped by functional
area so that it'll be easier in the long run to maintain cohesiveness
of code units and to review incoming patches. There are no functional
changes and fs/iomap.c split cleanly.
Summary:
- Regroup the fs/iomap.c code by major functional area so that we can
start development for 5.4 from a more stable base"
* tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: move internal declarations into fs/iomap/
iomap: move the main iteration code into a separate file
iomap: move the buffered IO code into a separate file
iomap: move the direct IO code into a separate file
iomap: move the SEEK_HOLE code into a separate file
iomap: move the file mapping reporting code into a separate file
iomap: move the swapfile code into a separate file
iomap: start moving code to fs/iomap/
Merge branch 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull adfs updates from Al Viro:
"More ADFS patches from Russell King"
* 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/adfs: add time stamp and file type helpers
fs/adfs: super: limit idlen according to directory type
fs/adfs: super: fix use-after-free bug
fs/adfs: super: safely update options on remount
fs/adfs: super: correct superblock flags
fs/adfs: clean up indirect disc addresses and fragment IDs
fs/adfs: clean up error message printing
fs/adfs: use %pV for error messages
fs/adfs: use format_version from disc_record
fs/adfs: add helper to get filesystem size
fs/adfs: add helper to get discrecord from map
fs/adfs: correct disc record structure
Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.
3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.
4) Fix handling of neigh timers wrt. entries added by userspace, from
Lorenzo Bianconi.
5) Various cases of missing of_node_put(), from Nishka Dasgupta.
6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.
7) Various RDS layer fixes, from Gerd Rausch.
8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
Cong Wang.
9) Fix FIB source validation checks over loopback, also from Cong Wang.
10) Use promisc for unsupported number of filters, from Justin Chen.
11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
tcp: fix tcp_set_congestion_control() use from bpf hook
ag71xx: fix return value check in ag71xx_probe()
ag71xx: fix error return code in ag71xx_probe()
usb: qmi_wwan: add D-Link DWM-222 A2 device ID
bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
net: dsa: sja1105: Fix missing unlock on error in sk_buff()
gve: replace kfree with kvfree
selftests/bpf: fix test_xdp_noinline on s390
selftests/bpf: fix "valid read map access into a read-only array 1" on s390
net/mlx5: Replace kfree with kvfree
MAINTAINERS: update netsec driver
ipv6: Unlink sibling route in case of failure
liquidio: Replace vmalloc + memset with vzalloc
udp: Fix typo in net/ipv4/udp.c
net: bcmgenet: use promisc for unsupported filters
ipv6: rt6_check should return NULL if 'from' is NULL
tipc: initialize 'validated' field of received packets
selftests: add a test case for rp_filter
fib: relax source validation check for loopback packets
mlxsw: spectrum: Do not process learned records with a dummy FID
...
Merge yet more updates from Andrew Morton:
"The rest of MM and a kernel-wide procfs cleanup.
Summary of the more significant patches:
- Patch series "mm/memory_hotplug: Factor out memory block
devicehandling", v3. David Hildenbrand.
Some spring-cleaning of the memory hotplug code, notably in
drivers/base/memory.c
- "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
Shi.
Fix /proc/pid/smaps output for THP pages used in shmem.
- "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.
Bugfix and speedup for kernel/resource.c
- Patch series "mm: Further memory block device cleanups", David
Hildenbrand.
More spring-cleaning of the memory hotplug code.
- Patch series "mm: Sub-section memory hotplug support". Dan
Williams.
Generalise the memory hotplug code so that pmem can use it more
completely. Then remove the hacks from the libnvdimm code which
were there to work around the memory-hotplug code's constraints.
- "proc/sysctl: add shared variables for range check", Matteo Croce.
We have about 250 instances of
int zero;
...
.extra1 = &zero,
in the tree. This is a tree-wide sweep to make all those private
"zero"s and "one"s use global variables.
Alas, it isn't practical to make those two global integers const"
* emailed patches from Andrew Morton <[email protected]>: (38 commits)
proc/sysctl: add shared variables for range check
mm: migrate: remove unused mode argument
mm/sparsemem: cleanup 'section number' data types
libnvdimm/pfn: stop padding pmem namespaces to section alignment
libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
mm/devm_memremap_pages: enable sub-section remap
mm: document ZONE_DEVICE memory-model implications
mm/sparsemem: support sub-section hotplug
mm/sparsemem: prepare for sub-section ranges
mm: kill is_dev_zone() helper
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
mm/sparsemem: add helpers track active portions of a section at boot
mm/sparsemem: introduce a SECTION_IS_EARLY flag
mm/sparsemem: introduce struct mem_section_usage
drivers/base/memory.c: get rid of find_memory_block_hinted()
mm/memory_hotplug: move and simplify walk_memory_blocks()
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
mm: make register_mem_sect_under_node() static
...
Eiichi Tsukata [Sun, 30 Jun 2019 08:54:38 +0000 (17:54 +0900)]
tracing: Fix user stack trace "??" output
Commit c5c27a0a5838 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removes ULONG_MAX marker from user stack trace entries but
trace_user_stack_print() still uses the marker and it outputs unnecessary
"??".
The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries it can terminate
the print loop if a zero entry is detected.
Dave Airlie [Fri, 19 Jul 2019 07:21:48 +0000 (17:21 +1000)]
Merge tag 'drm-next-5.3-2019-07-18' of git://people.freedesktop.org/~agd5f/linux into drm-next
drm-next-5.3-2019-07-18:
amdgpu:
- Navi DC fix for secondary adapters
- Fix Navi flickering with high res panels
- Navi SMU fixes
- Vega20 SMU fixes
- Fixes for audio hotplug on HG systems
- Fix for potential integer overflows on large buffer
migrations
- debugfs fixes for umr
- Various other small fixes
amdkfd:
- Apply noretry setting consistently
- Fix hang in eviction
- Properly clean up GWS on uninit
Yongxin Liu [Mon, 1 Jul 2019 01:46:22 +0000 (09:46 +0800)]
drm/nouveau: fix memory leak in nouveau_conn_reset()
In nouveau_conn_reset(), if connector->state is true,
__drm_atomic_helper_connector_destroy_state() will be called,
but the memory pointed by asyc isn't freed. Memory leak happens
in the following function __drm_atomic_helper_connector_reset(),
where newly allocated asyc->state will be assigned to connector->state.
So using nouveau_conn_atomic_destroy_state() instead of
__drm_atomic_helper_connector_destroy_state to free the "old" asyc.
Ralph Campbell [Fri, 14 Jun 2019 20:20:03 +0000 (13:20 -0700)]
drm/nouveau/dmem: missing mutex_lock in error path
In nouveau_dmem_pages_alloc(), the drm->dmem->mutex is unlocked before
calling nouveau_dmem_chunk_alloc() as shown when CONFIG_PROVE_LOCKING
is enabled:
Ben Skeggs [Fri, 5 Jul 2019 05:11:42 +0000 (15:11 +1000)]
drm/nouveau: fix bogus GPL-2 license header
The bulk SPDX addition made all these files into GPL-2.0 licensed files.
However the remainder of the project is MIT-licensed, these files
were simply missing the boiler plate and got caught up in the global update.
Ilia Mirkin [Thu, 20 Jun 2019 00:13:43 +0000 (20:13 -0400)]
drm/nouveau: fix bogus GPL-2 license header
The bulk SPDX addition made all these files into GPL-2.0 licensed files.
However the remainder of the project is MIT-licensed, these files
(primarily header files) were simply missing the boiler plate and got
caught up in the global update.
Fixes: b24413180f5 (License cleanup: add SPDX GPL-2.0 license identifier to files with no license) Signed-off-by: Ilia Mirkin <[email protected]> Acked-by: Emil Velikov <[email protected]> Acked-by: Karol Herbst <[email protected]> Signed-off-by: Ben Skeggs <[email protected]>
Lyude Paul [Wed, 26 Jun 2019 18:10:27 +0000 (14:10 -0400)]
drm/nouveau/i2c: Enable i2c pads & busses during preinit
It turns out that while disabling i2c bus access from software when the
GPU is suspended was a step in the right direction with:
commit 342406e4fbba ("drm/nouveau/i2c: Disable i2c bus access after
->fini()")
We also ended up accidentally breaking the vbios init scripts on some
older Tesla GPUs, as apparently said scripts can actually use the i2c
bus. Since these scripts are executed before initializing any
subdevices, we end up failing to acquire access to the i2c bus which has
left a number of cards with their fan controllers uninitialized. Luckily
this doesn't break hardware - it just means the fan gets stuck at 100%.
This also means that we've always been using our i2c busses before
initializing them during the init scripts for older GPUs, we just didn't
notice it until we started preventing them from being used until init.
It's pretty impressive this never caused us any issues before!
So, fix this by initializing our i2c pad and busses during subdev
pre-init. We skip initializing aux busses during pre-init, as those are
guaranteed to only ever be used by nouveau for DP aux transactions.
Ben Skeggs [Mon, 17 Jun 2019 02:52:48 +0000 (12:52 +1000)]
drm/nouveau/core: recognise TU116 chipset
Modesetting only, still waiting on ACR/GR firmware from NVIDIA for Turing
graphics/compute bring-up.
Each subsystem was compared with traces, along with various tests to check
that things generally work as they should, and appears compatible enough
with the current TU117 code to enable support.
Ben Skeggs [Tue, 28 May 2019 23:58:18 +0000 (09:58 +1000)]
drm/nouveau/kms: disallow dual-link harder if hdmi connection detected
The fallthrough cases (pre-Fermi) would accidentally allow dual-link pixel
clocks even where they shouldn't be. This leads to a high resolution HDMI
displays, connected via a DVI->HDMI adapter, to fail on the original NV50.
Previously center scaling would get scaling applied to it (when it was
only supposed to center the image), and aspect-corrected scaling did not
always correctly pick whether to reduce width or height for a particular
combination of inputs/outputs.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110660 Signed-off-by: Ilia Mirkin <[email protected]> Signed-off-by: Ben Skeggs <[email protected]>
Ilia Mirkin [Sat, 25 May 2019 22:41:48 +0000 (18:41 -0400)]
drm/nouveau/disp/nv50-: force scaler for any non-default LVDS/eDP modes
Higher layers tend to add a lot of modes not actually in the EDID, such
as the standard DMT modes. Changing this would be extremely intrusive to
everyone, so just force the scaler more often. There are no practical
cases we're aware of where a LVDS/eDP panel has multiple resolutions
exposed, and i915 already does it this way.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110660 Signed-off-by: Ilia Mirkin <[email protected]> Signed-off-by: Ben Skeggs <[email protected]>
Guo Ren [Fri, 28 Jun 2019 12:39:46 +0000 (20:39 +0800)]
csky: Fixup abiv1 memset error
Current memset implementation in abiv1 is wrong and it'll cause unalign
access. Just remove it and use the generic one. This patch will cause
performance degradation and we will improve it with a new design in next
patchset.
Guo Ren [Tue, 18 Jun 2019 12:34:35 +0000 (20:34 +0800)]
csky: Improve tlb operation with help of asid
There are two generations of tlb operation instruction for C-SKY.
First generation is use mcr register and it need software do more
things, second generation is use specific instructions, eg:
tlbi.va, tlbi.vas, tlbi.alls
We implemented the following functions:
- flush_tlb_range (a range of entries)
- flush_tlb_page (one entry)
Above functions use asid from vma->mm to invalid tlb entries and
we could use tlbi.vas instruction for newest generation csky cpu.
- flush_tlb_kernel_range
- flush_tlb_one
Above functions don't care asid and it invalid the tlb entries only
with vpn and we could use tlbi.vaas instruction for newest generat-
ion csky cpu.
Guo Ren [Tue, 18 Jun 2019 12:33:32 +0000 (20:33 +0800)]
csky: Use generic asid algorithm to implement switch_mm
Use linux generic asid/vmid algorithm to implement csky
switch_mm function. The algorithm is from arm and it could
work with SMP system. It'll help reduce tlb flush for
switch_mm in task/vm switch.
Guo Ren [Tue, 18 Jun 2019 12:06:52 +0000 (20:06 +0800)]
csky: Add new asid lib code from arm
This patch only contains asid help code from arm for next patch to
use.
The asid allocator use five level check to reduce the cost of
switch_mm.
1. Check if the asid version is the same (it's general)
2. Check reserved_asid which is set in rollover flush_context()
and key point is to keep the same bit position with the current
asid version instead of input version.
3. Check if the position of bitmap is free then it could be set &
used directly.
4. find_next_zero_bit() (a little performance cost)
5. flush_context (this is the worst cost with increase current asid
version)
Check is level by level and cost is also higher with the next level.
The reserved_asid and bitmap mechanism prevent unnecessary
find_next_zero_bit().
The atomic 64 bit asid is also suitable for 32-bit system and it
won't cost a lot in 1th 2th 3th level check.
The operation of set/clear mm_cpumask was removed in arm64 compared to
arm32. It seems no side effect on current arm64 system, but from
software meaning it's wrong. Although csky also needn't it, we add it
back for csky.
The asid_per_ctxt is no use for csky and it reserves the lowest bits for
other use, maybe: trust zone ? Ok, just keep it in csky copy.
Seems it also could be used by other archs and it's worth to move asid
code to generic in future.
Guo Ren [Tue, 18 Jun 2019 09:20:10 +0000 (17:20 +0800)]
csky: Revert mmu ASID mechanism
Current C-SKY ASID mechanism is from mips and it doesn't work well
with multi-cores. ASID per core mechanism is not suitable for C-SKY
SMP tlb maintain operations, eg: tlbi.vas need share the same asid
in all processors and it'll invalid the tlb entry in all cores with
the same asid.
Add trigger type setting for csky,mpintc. The driver also could
support #interrupt-cells <1> and it wouldn't invalidate existing
DTs. Here we only show the complete format.
Guo Ren [Tue, 4 Jun 2019 10:54:48 +0000 (18:54 +0800)]
csky: Fixup some error count in 810 & 860.
CK810 pmu only support event with index 0-8 and 0xd; CK860 only
support event 1~4, 0xa~0x1b. So do not register unsupport event
to hardware cache event, which may leader to unknown behavior.
Mao Han [Tue, 4 Jun 2019 10:54:49 +0000 (18:54 +0800)]
csky: Fix perf record in kernel/user space
csky_pmu_event_init is called several times during the perf record
initialzation. After configure the event counter in either kernel
space or user space, csky_pmu_event_init is called twice with no
attr specified. Configuration will be overwritten with sampling in
both kernel space and user space. --all-kernel/--all-user is
useless without this patch applied.
Mao Han [Tue, 4 Jun 2019 10:54:45 +0000 (18:54 +0800)]
csky: Add count-width property for csky pmu
The csky pmu counter may have different io width. When the counter is
smaller then 64 bits and counter value is smaller than the old value, it
will result to a extremely large delta value. So the sampled value should
be extend to 64 bits to avoid this, the extension bits base on the
count-width property from dts.
Mao Han [Tue, 4 Jun 2019 10:54:44 +0000 (18:54 +0800)]
csky: Init pmu as a device
This patch change the csky pmu initialization from arch init to
device init. The pmu can be configued with information from
device tree(pmu device name, irq number and etc.).
Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.
As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.
The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.
In case of error, the function of_get_mac_address() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
Dan Williams [Thu, 18 Jul 2019 22:58:43 +0000 (15:58 -0700)]
mm/sparsemem: cleanup 'section number' data types
David points out that there is a mixture of 'int' and 'unsigned long'
usage for section number data types. Update the memory hotplug path to
use 'unsigned long' consistently for section numbers.
Dan Williams [Thu, 18 Jul 2019 22:58:40 +0000 (15:58 -0700)]
libnvdimm/pfn: stop padding pmem namespaces to section alignment
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.
At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like, 'flags' and the
'padding' it potentially means that future implementations can not rely on
those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate
it is safe to assume the 'padding' and 'flags' are zero. Otherwise,
this corruption is expected to benign since all other critical fields
are explicitly initialized.
Note The cc: stable is about spreading this new policy to as many
kernels as possible not fixing an issue in those kernels. It is not
until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
section alignment" where this improper initialization becomes a problem.
So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
namespaces to section alignment" (which is not tagged for stable), make
sure this pre-requisite is flagged.
Dan Williams [Thu, 18 Jul 2019 22:58:33 +0000 (15:58 -0700)]
mm/devm_memremap_pages: enable sub-section remap
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).
Dan Williams [Thu, 18 Jul 2019 22:58:26 +0000 (15:58 -0700)]
mm/sparsemem: support sub-section hotplug
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.
For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.
A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.
Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.
Dan Williams [Thu, 18 Jul 2019 22:58:22 +0000 (15:58 -0700)]
mm/sparsemem: prepare for sub-section ranges
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().
This is simply plumbing, small cleanups, and some identifier renames.
No intended functional changes.
Dan Williams [Thu, 18 Jul 2019 22:58:15 +0000 (15:58 -0700)]
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".
Dan Williams [Thu, 18 Jul 2019 22:58:11 +0000 (15:58 -0700)]
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explict pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmmemap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.
Dan Williams [Thu, 18 Jul 2019 22:58:07 +0000 (15:58 -0700)]
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.
Dan Williams [Thu, 18 Jul 2019 22:58:04 +0000 (15:58 -0700)]
mm/sparsemem: add helpers track active portions of a section at boot
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.
The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.
The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.
Dan Williams [Thu, 18 Jul 2019 22:58:00 +0000 (15:58 -0700)]
mm/sparsemem: introduce a SECTION_IS_EARLY flag
In preparation for sub-section hotplug, track whether a given section
was created during early memory initialization, or later via memory
hotplug. This distinction is needed to maintain the coarse expectation
that pfn_valid() returns true for any pfn within a given section even if
that section has pages that are reserved from the page allocator.
For example one of the of goals of subsection hotplug is to support
cases where the system physical memory layout collides System RAM and
PMEM within a section. Several pfn_valid() users expect to just check
if a section is valid, but they are not careful to check if the given
pfn is within a "System RAM" boundary and instead expect pgdat
information to further validate the pfn.
Rather than unwind those paths to make their pfn_valid() queries more
precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the
traditional expectation that pfn_valid() returns true for all early
sections.
Dan Williams [Thu, 18 Jul 2019 22:57:57 +0000 (15:57 -0700)]
mm/sparsemem: introduce struct mem_section_usage
Patch series "mm: Sub-section memory hotplug support", v10.
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem. However, it does not use the
'bottom half' of memory hotplug, i.e. never marks pmem pages online and
never exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on
x86) individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also
check the active-sub-section mask. This indication is in the same
cacheline as the valid_section() so the performance impact is
expected to be negligible. So far the lkp robot has not reported any
regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that
deal with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not
touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3]. In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.
These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].
Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.
A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap. The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
SUBSECTION_SHIFT is defined as global constant instead of per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users. Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.
The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.
Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
walk_memory_range() was once used to iterate over sections. Now, it
iterates over memory blocks. Rename the function, fixup the
documentation.
Also, pass start+size instead of PFNs, which is what most callers
already have at hand. (we'll rework link_mem_sections() most probably
soon)
Follow-up patches will rework, simplify, and move walk_memory_blocks()
to drivers/base/memory.c.
Note: walk_memory_blocks() only works correctly right now if the
start_pfn is aligned to a section start. This is the case right now,
but we'll generalize the function in a follow up patch so the semantics
match the documentation.
Patch series "mm: Further memory block device cleanups", v1.
Some further cleanups around memory block devices. Especially, clean up
and simplify walk_memory_range(). Including some other minor cleanups.
This patch (of 6):
We are using a mixture of "int" and "unsigned long". Let's make this
consistent by using "unsigned long" everywhere. We'll do the same with
memory block ids next.
While at it, turn the "unsigned long i" in removable_show() into an int
- sections_per_block is an int.
Nadav Amit [Thu, 18 Jul 2019 22:57:34 +0000 (15:57 -0700)]
resource: avoid unnecessary lookups in find_next_iomem_res()
find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.
Improve performance by not considering children of the tree if the top
level does not match. Since the range of the parents should include the
range of the children such check is redundant.
Running sysbench on dax (pmem emulation, with write_cache disabled):
sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
Nadav Amit [Thu, 18 Jul 2019 22:57:31 +0000 (15:57 -0700)]
resource: fix locking in find_next_iomem_res()
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.
Keep holding the lock while the data is copied. While at it, change the
return value to a more informative value. It is disregarded by the
callers.
Yang Shi [Thu, 18 Jul 2019 22:57:27 +0000 (15:57 -0700)]
mm: thp: fix false negative of shmem vma's THP eligibility
Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:
And, /proc/meminfo does show THP allocated and PMD mapped too:
ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB
This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.
Also check if vma is suitable for THP by calling
transhuge_vma_suitable().
And minor fix to smaps output format and documentation.
Yang Shi [Thu, 18 Jul 2019 22:57:24 +0000 (15:57 -0700)]
mm: thp: make transhuge_vma_suitable available for anonymous THP
transhuge_vma_suitable() was only available for shmem THP, but anonymous
THP has the same check except pgoff check. And, it will be used for THP
eligible check in the later patch, so make it available for all kind of
THPs. This also helps reduce code duplication slightly.
Since anonymous THP doesn't have to check pgoff, so make pgoff check
shmem vma only.
And regroup some functions in include/linux/mm.h to solve compile issue
since transhuge_vma_suitable() needs call vma_is_anonymous() which was
defined after huge_mm.h is included.
Wei Yang [Thu, 18 Jul 2019 22:57:21 +0000 (15:57 -0700)]
mm/sparse.c: set section nid for hot-add memory
In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
section_to_node_table[]. While for hot-add memory, this is missed.
Without this information, page_to_nid() may not give the right node id.
BTW, current online_pages works because it leverages nid in
memory_block. But the granularity of node id should be mem_section
wide.
mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section
The parameter is unused, so let's drop it. Memory removal paths should
never care about zones. This is the job of memory offlining and will
require more refactorings.
mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail
We really don't want anything during memory hotunplug to fail. We
always pass a valid memory block device, that check can go. Avoid
allocating memory and eventually failing. As we are always called under
lock, we can use a static piece of memory. This avoids having to put
the structure onto the stack, having to guess about the stack size of
callers.
Patch inspired by a patch from Oscar Salvador.
In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove. Memory block devices
with mixed nodes (added during boot) should properly fenced off and
never removed.
mm/memory_hotplug: remove memory block devices before arch_remove_memory()
Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices. Remove the devices before calling
arch_remove_memory().
This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().
mm/memory_hotplug: create memory block devices after arch_add_memory()
Only memory to be added to the buddy and to be onlined/offlined by user
space using /sys/devices/system/memory/... needs (and should have!)
memory block devices.
Factor out creation of memory block devices. Create all devices after
arch_add_memory() succeeded. We can later drop the want_memblock
parameter, because it is now effectively stale.
Only after memory block devices have been added, memory can be onlined
by user space. This implies, that memory is not visible to user space
at all before arch_add_memory() succeeded.
While at it
- use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
- introduce find_memory_block_by_id() to search via block id
- Use find_memory_block_by_id() in init_memory_block() to catch
duplicates