Git Repo - linux.git/log

include/linux/syscalls.h: add sys_renameat2() prototype

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ALSA: ice1712: Fix boundary checks in PCM pointer ops

PCM pointer callbacks in the ice1712 driver check the buffer size
boundary with mixed units, comparing bytes against frames.  This leads
to PCM core warnings like:
   snd_pcm_update_hw_ptr0: 105 callbacks suppressed
   ALSA pcm_lib.c:352 BUG: pcmC3D0c:0, pos = 5461, buffer size = 5461, period size = 2730

This patch fixes these checks to be placed after the proper unit
conversions.
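
As a minimal sketch of the unit mismatch described above (not the
driver's actual code; the helper names and the 4-bytes-per-frame layout
used in the usage note are hypothetical), the wrap-around check is
applied only after the hardware byte offset has been converted to frames:

```c
#include <stddef.h>

/* Hypothetical helper: convert a byte offset to a frame count. */
static size_t bytes_to_frames_sketch(size_t bytes, size_t bytes_per_frame)
{
	return bytes / bytes_per_frame;
}

/*
 * Return the current position in frames. The boundary check compares
 * frames with frames, so a position equal to the buffer size wraps to 0
 * instead of being reported to the PCM core as an out-of-range value.
 */
static size_t pointer_in_frames(size_t hw_pos_bytes, size_t bytes_per_frame,
				size_t buffer_size_frames)
{
	size_t pos = bytes_to_frames_sketch(hw_pos_bytes, bytes_per_frame);

	if (pos >= buffer_size_frames)
		pos = 0;
	return pos;
}
```

With a 5461-frame buffer and an assumed 4 bytes per frame, a hardware
offset of 5461 * 4 bytes wraps to frame 0 rather than yielding
pos == buffer size as in the warning quoted above.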

Cc: <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ARM: add missing system_misc.h include to process.c

arm_pm_restart(), arm_pm_idle() and soft_restart() are all declared in
system_misc.h, but this file is not included in process.c. Add this
missing include. Found via sparse:

arch/arm/kernel/process.c:98:6: warning: symbol 'soft_restart' was not declared. Should it be static?
arch/arm/kernel/process.c:127:6: warning: symbol 'arm_pm_restart' was not declared. Should it be static?
arch/arm/kernel/process.c:134:6: warning: symbol 'arm_pm_idle' was not declared. Should it be static?

Signed-off-by: Russell King <[email protected]>

backlight: lm3639: Use devm_backlight_device_register()

Change to use devm_backlight_device_register() for simple cleanup.

Signed-off-by: Daniel Jeong <[email protected]>
Acked-by: Jingoo Han <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

backlight: gpio-backlight: Add DT support

Signed-off-by: Denis Carikli <[email protected]>
Acked-by: Jingoo Han <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

backlight: core: Replace kfree with put_device

As per the comments on device_register, we shouldn't call kfree()
right after a device_register() failure. Instead call put_device(),
which in turn will call bl_device_release resulting in a kfree to the
full structure.
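
The ownership rule above can be illustrated with a user-space toy (all
toy_* names are hypothetical stand-ins, not the kernel's driver-model
API): once registration has been attempted, the structure is freed only
through the release callback that runs when the last reference is dropped.

```c
#include <stdlib.h>

/* Toy stand-in for a reference-counted device. */
struct toy_device {
	int refcount;
	void (*release)(struct toy_device *dev);
};

struct toy_backlight {
	struct toy_device dev;	/* embedded as the first member */
	int brightness;
};

static int toy_release_calls;

/* Release callback: frees the full containing structure. */
static void toy_bl_release(struct toy_device *dev)
{
	struct toy_backlight *bl = (struct toy_backlight *)dev;

	toy_release_calls++;
	free(bl);
}

/* Drop one reference; the last put invokes the release callback. */
static void toy_put_device(struct toy_device *dev)
{
	if (--dev->refcount == 0)
		dev->release(dev);
}

/* Registration that always fails, to exercise the error path. */
static int toy_device_register_failing(struct toy_device *dev)
{
	dev->refcount = 1;	/* the reference exists even on failure */
	return -1;
}

/* Returns 0 on success, -1 on failure; never calls free() directly. */
static int toy_backlight_register(void)
{
	struct toy_backlight *bl = calloc(1, sizeof(*bl));

	if (!bl)
		return -1;
	bl->dev.release = toy_bl_release;
	if (toy_device_register_failing(&bl->dev)) {
		toy_put_device(&bl->dev);	/* not free(bl) */
		return -1;
	}
	return 0;
}
```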

Signed-off-by: Levente Kurusa <[email protected]>
Acked-by: Jingoo Han <[email protected]>
Signed-off-by: Lee Jones <[email protected]>

ASoC: davinci-mcasp: Fix bit clock polarity settings

IB_NF, NB_IF and IB_IF configured the bc polarity incorrectly. The receive
polarity was set to the same edge as the TX in these cases.

Signed-off-by: Peter Ujfalusi <[email protected]>
Signed-off-by: Mark Brown <[email protected]>

ASoC: samsung: Fix build on multiplatform

PCM and S/PDIF drivers referenced mach headers for a trivial
data structure. This caused build errors on multiplatform builds
as machine headers are not accessible from driver files. Move the data
structure definition to the driver header and remove the dependency.
While at it rename the structure to avoid multiple definition errors
as the same structure is also used by the platform code.

Signed-off-by: Sachin Kamat <[email protected]>
Signed-off-by: Mark Brown <[email protected]>

ASoC: fsl_sai: Fix Bit Clock Polarity configurations

The BCP bit in the TCR4/RCR4 registers is defined as follows:
  0 Bit clock is active high with drive outputs on rising edge
    and sample inputs on falling edge.
  1 Bit clock is active low with drive outputs on falling edge
    and sample inputs on rising edge.

All formats currently supported by the fsl_sai driver send data on the
falling edge and sample on the rising edge.

However, the driver clears the BCP bit for all of them, which results in
click noise when working with SGTL5000 and loud noise with WM8962.

Thus this patch corrects the BCP settings for all the formats to fix
the noise issue.

Signed-off-by: Nicolin Chen <[email protected]>
Acked-by: Xiubo Li <[email protected]>
Signed-off-by: Mark Brown <[email protected]>

Merge branches 'pm-wakeup' and 'pm-domains'

* pm-wakeup:
  PM / wakeup: Correct presence vs. emptiness of wakeup_* attributes

* pm-domains:
  PM / domains: Add pd_ignore_unused to keep power domains enabled

Merge branches 'acpi-cleanup', 'acpi-thermal', 'acpi-video' and 'acpi-dock'

* acpi-cleanup:
  ACPI: Clean up memory allocations

* acpi-thermal:
  ACPI / thermal: Fix wrong variable usage in debug statement

* acpi-video:
  ACPI / video: Favor native backlight interface for ThinkPad Helix

* acpi-dock:
  ACPI / dock: Drop dock_device_ids[] table

Merge branch 'pm-cpufreq'

* pm-cpufreq:
  cpufreq: ppc: Remove duplicate inclusion of fsl_soc.h
  cpufreq: create another field .flags in cpufreq_frequency_table
  cpufreq: use kzalloc() to allocate memory for cpufreq_frequency_table
  cpufreq: don't print value of .driver_data from core
  cpufreq: ia64: don't set .driver_data to index
  cpufreq: powernv: Select CPUFreq related Kconfig options for powernv
  cpufreq: powernv: Use cpufreq_frequency_table.driver_data to store pstate ids
  cpufreq: powernv: cpufreq driver for powernv platform
  cpufreq: at32ap: don't declare local variable as static
  cpufreq: loongson2_cpufreq: don't declare local variable as static
  cpufreq: unicore32: fix typo issue for 'clk'
  cpufreq: exynos: Disable on multiplatform build

Merge branch 'pm-cpuidle'

* pm-cpuidle:
  cpuidle: sysfs: Export target residency information
  intel_idle: fine-tune IVT residency targets
  tools/power turbostat: Run on Broadwell
  tools/power turbostat: simplify output, add Avg_MHz
  intel_idle: Add CPU model 54 (Atom N2000 series)
  intel_idle: support Bay Trail
  intel_idle: allow sparse sub-state numbering, for Bay Trail
  ACPI idle: permit sparse C-state sub-state numbers

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux into pm-cpuidle

Pull intel_idle and turbostat material for v3.15-rc1 from Len Brown.

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  intel_idle: fine-tune IVT residency targets
  tools/power turbostat: Run on Broadwell
  tools/power turbostat: simplify output, add Avg_MHz
  intel_idle: Add CPU model 54 (Atom N2000 series)
  intel_idle: support Bay Trail
  intel_idle: allow sparse sub-state numbering, for Bay Trail
  ACPI idle: permit sparse C-state sub-state numbers

Merge branch 'cpu-hotplug'

* cpu-hotplug:
  arm, kvm: fix double lock on cpu_add_remove_lock

arm, kvm: fix double lock on cpu_add_remove_lock

Commit 8146875de7d4 (arm, kvm: Fix CPU hotplug callback registration)
holds the lock before calling the two functions:

kvm_vgic_hyp_init()
kvm_timer_hyp_init()

Both functions call register_cpu_notifier() to register a CPU
notifier, which causes a double lock on cpu_add_remove_lock.

Since both functions are only called from kvm_arch_init() with
cpu_add_remove_lock held, simply use __register_cpu_notifier() to fix
the problem.
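
A user-space toy of the locking pattern (the toy_* names are
hypothetical, and the toy lock reports an error where a real
non-recursive lock would deadlock) shows why the unlocked variant is the
right call when the caller already holds the lock:

```c
#include <stdbool.h>

/* Toy non-recursive lock standing in for cpu_add_remove_lock. */
static bool toy_lock_held;
static int toy_notifier_count;

/* Returns false instead of deadlocking when the lock is already held. */
static bool toy_lock(void)
{
	if (toy_lock_held)
		return false;
	toy_lock_held = true;
	return true;
}

static void toy_unlock(void)
{
	toy_lock_held = false;
}

/* Unlocked variant: the caller must already hold the lock. */
static int toy_register_notifier_nolock(void)
{
	toy_notifier_count++;
	return 0;
}

/* Locking variant: takes the lock itself, so it fails under a held lock. */
static int toy_register_notifier(void)
{
	int ret;

	if (!toy_lock())
		return -1;	/* a real lock would deadlock here */
	ret = toy_register_notifier_nolock();
	toy_unlock();
	return ret;
}

/* Init path that holds the lock across both registrations. */
static int toy_arch_init(bool use_nolock_variant)
{
	int ret;

	if (!toy_lock())
		return -1;
	ret = use_nolock_variant ? toy_register_notifier_nolock()
				 : toy_register_notifier();
	if (!ret)
		ret = use_nolock_variant ? toy_register_notifier_nolock()
					 : toy_register_notifier();
	toy_unlock();
	return ret;
}
```

Calling toy_arch_init(false) fails on the first registration, while
toy_arch_init(true) registers both notifiers under the single held lock.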

Fixes: 8146875de7d4 (arm, kvm: Fix CPU hotplug callback registration)
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

spi: qup: Depend on ARCH_QCOM

Commit 8fc1b0f87d9f ("ARM: qcom: Split Qualcomm support into legacy and
multiplatform") removed Kconfig symbol ARCH_MSM_DT. But that commit
left one (optional) dependency on ARCH_MSM_DT untouched.

Three Kconfig symbols used to depend on ARCH_MSM_DT: ARCH_MSM8X60,
ARCH_MSM8960, and ARCH_MSM8974. These three symbols now depend on
ARCH_QCOM. So it appears this driver needs to depend on ARCH_QCOM too.

Signed-off-by: Paul Bolle <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Signed-off-by: Mark Brown <[email protected]>

arm64: Fix DMA range invalidation for cache line unaligned buffers

If the buffer needing cache invalidation for inbound DMA does not start or
end on a cache line aligned address, we need to use the non-destructive
clean&invalidate operation. This issue was introduced by commit
7363590d2c46 (arm64: Implement coherent DMA API based on swiotlb).
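
A partially covered cache line may also hold unrelated dirty data, which
a plain invalidate would discard. The following sketch (hypothetical
names, not the arm64 assembly the patch touches) only classifies which
edges of a buffer fall into that case:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which edges need clean+invalidate. */
struct inv_plan {
	bool head_clean_inv;	/* first line is shared with other data */
	bool tail_clean_inv;	/* last line is shared with other data */
};

/* line_size must be a power of two; end is one past the last byte. */
static struct inv_plan plan_invalidate(uintptr_t start, uintptr_t end,
				       uintptr_t line_size)
{
	uintptr_t mask = line_size - 1;
	struct inv_plan plan;

	plan.head_clean_inv = (start & mask) != 0;
	plan.tail_clean_inv = (end & mask) != 0;
	return plan;
}
```

Lines fully inside the buffer can still be invalidated directly; only
the flagged edge lines need the non-destructive operation.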

Signed-off-by: Catalin Marinas <[email protected]>
Reported-by: Jon Medhurst (Tixy) <[email protected]>

cpuidle: sysfs: Export target residency information

From user space, there is no way to know the target residency for each idle
state. If we want to write tools to measure the accuracy of the idle state
selection from the governor, we need this info.

As the exit latency is exported through sysfs, exporting the target residency
in the same place makes sense.

Signed-off-by: Daniel Lezcano <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

cpufreq: ppc: Remove duplicate inclusion of fsl_soc.h

fsl_soc.h was included twice.

Signed-off-by: Sachin Kamat <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

ALSA: hda - Do not assign streams in reverse order

Currently stream numbers are assigned in reverse order.

Unfortunately commit 7546abfb8e1f9933b5 ("ALSA: hda - Increment
default stream numbers for AMD HDMI controllers") assumed this was not
the case (specifically, it had the "old cards had single device only"
=> "extra unused stream numbers do not matter" assumption), causing
non-working audio regressions for AMD Radeon HDMI users.

Change the stream numbers to be assigned in forward order.

The benefit is that regular audio playback will still work even if the
assumed stream count is too high, downside is that a too high stream
count may remain hidden.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77002
Reported-by: Christian Güdel <[email protected]>
Signed-off-by: Anssi Hannula <[email protected]>
Tested-by: Christian Güdel <[email protected]> # 3.14
Cc: Alex Deucher <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/realtek - Add eapd shutup to ALC283

Add an EAPD shutup function to alc283_shutup.
This can avoid pop noise from the speaker.

Signed-off-by: Kailang Yang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/realtek - Change model name alias for ChromeOS

Chrome OS uses the model name alc283-dac-wcaps to load the model by default.
Change the model name to match the one Chrome OS uses, for future support.

Signed-off-by: Kailang Yang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

mmc: sdhci-acpi: Intel SDIO has broken card detect

Intel SDIO has broken card detect so add a quirk to reflect that.

Signed-off-by: Adrian Hunter <[email protected]>
Acked-by: Ulf Hansson <[email protected]>
Signed-off-by: Chris Ball <[email protected]>

thermal: rcar-thermal: update thermal zone only when temperature changes

Avoid updating the thermal zone in case an IRQ was triggered but the
temperature didn't effectively change.
Note this is not a driver issue.
Below is a captured debug trace illustrating the purpose of this patch:
out of 8 thermal zone updates, only 2 are actually necessary.

[   41.120000] rcar_thermal_work(): cctemp=25000
[   41.120000] rcar_thermal_work(): nctemp=30000
[   41.120000] rcar_thermal_work(): temp is now 30000C, update thermal zone
[   58.990000] rcar_thermal_work(): cctemp=30000
[   58.990000] rcar_thermal_work(): nctemp=30000
[   58.990000] rcar_thermal_work(): same temp, do not update thermal zone
[   59.290000] rcar_thermal_work(): cctemp=30000
[   59.290000] rcar_thermal_work(): nctemp=30000
[   59.290000] rcar_thermal_work(): same temp, do not update thermal zone
[   59.590000] rcar_thermal_work(): cctemp=30000
[   59.590000] rcar_thermal_work(): nctemp=30000
[   59.590000] rcar_thermal_work(): same temp, do not update thermal zone
[   59.890000] rcar_thermal_work(): cctemp=30000
[   59.890000] rcar_thermal_work(): nctemp=30000
[   59.890000] rcar_thermal_work(): same temp, do not update thermal zone
[   60.190000] rcar_thermal_work(): cctemp=30000
[   60.190000] rcar_thermal_work(): nctemp=30000
[   60.190000] rcar_thermal_work(): same temp, do not update thermal zone
[   60.490000] rcar_thermal_work(): cctemp=30000
[   60.490000] rcar_thermal_work(): nctemp=30000
[   60.490000] rcar_thermal_work(): same temp, do not update thermal zone
[   60.790000] rcar_thermal_work(): cctemp=30000
[   60.790000] rcar_thermal_work(): nctemp=35000
[   60.790000] rcar_thermal_work(): temp is now 35000C, update thermal zone

I suspect this may be due to sensor sampling accuracy / fluctuation,
but no formal proof.
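
The behaviour shown in the trace can be sketched as follows (the struct
and function names are hypothetical, and the counter stands in for the
real thermal zone update): the zone is only touched when the new reading
differs from the cached one.

```c
#include <stdbool.h>

/* Hypothetical per-sensor state; values as printed in the trace. */
struct toy_thermal {
	int cached_temp;
	int zone_updates;
};

/* Returns true if the thermal zone was updated. */
static bool toy_thermal_work(struct toy_thermal *t, int new_temp)
{
	if (new_temp == t->cached_temp)
		return false;	/* same temp, do not update thermal zone */
	t->cached_temp = new_temp;
	t->zone_updates++;	/* stand-in for the thermal zone update */
	return true;
}
```

Replaying the eight readings from the trace (30000 seven times, then
35000) against an initial cached value of 25000 results in two updates.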

Signed-off-by: Patrick Titiano <[email protected]>
Acked-by: Kuninori Morimoto <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>

thermal: rcar-thermal: fix same mask applied twice

Mask is already applied preceding the if statement.
Remove the second mask.

Signed-off-by: Patrick Titiano <[email protected]>
Acked-by: Kuninori Morimoto <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>

thermal: ti-soc-thermal: Use SIMPLE_DEV_PM_OPS macro

Use SIMPLE_DEV_PM_OPS macro in order to make the code simpler.

Signed-off-by: Jingoo Han <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>

thermal: imx: update formula for thermal sensor

Thermal sensor used to need two calibration points which are
in fuse map to get a slope for converting thermal sensor's raw
data to real temperature in degree C. Due to the chip calibration
limitation, the hardware team provides a universal formula to get
real temperature from internal thermal sensor raw data:

Slope = 0.4297157 - (0.0015976 * 25C fuse);

Update the formula, as there will be no hot point calibration
data in fuse map from now on.
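
As a worked illustration of the slope formula only (the function name is
hypothetical, and floating point is used here purely for readability;
this is not how the driver computes it), the slope follows directly from
the 25C calibration value read from the fuse:

```c
/* Slope = 0.4297157 - (0.0015976 * 25C fuse), as given above. */
static double imx_slope_sketch(double fuse_25c)
{
	return 0.4297157 - (0.0015976 * fuse_25c);
}
```

For an assumed fuse value of 100, this gives
0.4297157 - 0.15976 = 0.2699557.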

Signed-off-by: Anson Huang <[email protected]>
Acked-by: Shawn Guo <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext3 improvements, cleanups, reiserfs fix from Jan Kara:
"various cleanups for ext2, ext3, udf, isofs, a documentation update
  for quota, and a fix of a race in reiserfs readdir implementation"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: fix race in readdir
  ext2: acl: remove unneeded include of linux/capability.h
  ext3: explicitly remove inode from orphan list after failed direct io
  fs/isofs/inode.c add __init to init_inodecache()
  ext3: Speedup WB_SYNC_ALL pass
  fs/quota/Kconfig: Update filesystems
  ext3: Update outdated comment before ext3_ordered_writepage()
  ext3: Update PF_MEMALLOC handling in ext3_write_inode()
  ext2/3: use prandom_u32() instead of get_random_bytes()
  ext3: remove an unneeded check in ext3_new_blocks()
  ext3: remove unneeded check in ext3_ordered_writepage()
  fs: Mark function as static in ext3/xattr_security.c
  fs: Mark function as static in ext3/dir.c
  fs: Mark function as static in ext2/xattr_security.c
  ext3: Add __init macro to init_inodecache
  ext2: Add __init macro to init_inodecache
  udf: Add __init macro to init_inodecache
  fs: udf: parse_options: blocksize check

Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild

Pull kbuild changes from Michal Marek:
- cleanups in the main Makefiles and Documentation/DocBook/Makefile
- make O=...  directory is automatically created if needed
- mrproper/distclean removes the old include/linux/version.h to make
   life easier when bisecting across the commit that moved the version.h
   file

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kbuild: docbook: fix the include error when executing "make help"
  kbuild: create a build directory automatically for out-of-tree build
  kbuild: remove redundant '.*.cmd' pattern from make distclean
  kbuild: move "quote" to Kbuild.include to be consistent
  kbuild: docbook: use $(obj) and $(src) rather than specific path
  kbuild: unconditionally clobber include/linux/version.h on distclean
  kbuild: docbook: specify KERNELDOC dependency correctly
  kbuild: docbook: include cmd files more simply
  kbuild: specify build_docproc as a phony target

Merge tag 'arc-v3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC changes from Vineet Gupta:
- Support for external initrd from Noam
- Fix broken serial console in nsimosci Virtual Platform
- Reuse of ENTRY/END assembler macros across hand asm code
- Other minor fixes here and there

* tag 'arc-v3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: [nsimosci] Unbork console
  ARC: [nsimosci] Change .dts to use generic 8250 UART
  ARC: [SMP] General Fixes
  ARC: Remove unused DT template file
  ARC: [clockevent] simplify timer ISR
  ARC: [clockevent] can't be SoC specific
  ARC: Remove ARC_HAS_COH_RTSC
  ARC: switch to generic ENTRY/END assembler annotations
  ARC: support external initrd
  ARC: add uImage to .gitignore
  ARC: [arcfpga] Fix __initconst data const-correctness

DRM: armada: fix corruption while loading cursors

Loading cursors to the LCD controller's SRAM can be corrupted when the
configured pixel clock is relatively slow. This seems to be caused
when we write back-to-back to the SRAM registers.

There doesn't appear to be any status register we can read to check
when an access has completed.

Inserting a dummy read between the writes appears to fix the problem.

Cc: <[email protected]> # 3.13
Signed-off-by: Russell King <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>

Merge tag 'stable/for-linus-3.15-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen build fix from David Vrabel:
"Fix arm build of drivers/xen/events/

The merge of irq-core-for-linus branch broke it"

* tag 'stable/for-linus-3.15-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Xen: do hv callback accounting only on x86

Merge branch 'akpm' (incoming from Andrew)

Merge second patch-bomb from Andrew Morton:
- the rest of MM
- zram updates
- zswap updates
- exit
- procfs
- exec
- wait
- crash dump
- lib/idr
- rapidio
- adfs, affs, bfs, ufs
- cris
- Kconfig things
- initramfs
- small amount of IPC material
- percpu enhancements
- early ioremap support
- various other misc things

* emailed patches from Andrew Morton <[email protected]>: (156 commits)
  MAINTAINERS: update Intel C600 SAS driver maintainers
  fs/ufs: remove unused ufs_super_block_third pointer
  fs/ufs: remove unused ufs_super_block_second pointer
  fs/ufs: remove unused ufs_super_block_first pointer
  fs/ufs/super.c: add __init to init_inodecache()
  doc/kernel-parameters.txt: add early_ioremap_debug
  arm64: add early_ioremap support
  arm64: initialize pgprot info earlier in boot
  x86: use generic early_ioremap
  mm: create generic early_ioremap() support
  x86/mm: sparse warning fix for early_memremap
  lglock: map to spinlock when !CONFIG_SMP
  percpu: add preemption checks to __this_cpu ops
  vmstat: use raw_cpu_ops to avoid false positives on preemption checks
  slub: use raw_cpu_inc for incrementing statistics
  net: replace __this_cpu_inc in route.c with raw_cpu_inc
  modules: use raw_cpu_write for initialization of per cpu refcount.
  mm: use raw_cpu ops for determining current NUMA node
  percpu: add raw_cpu_ops
  slub: fix leak of 'name' in sysfs_slab_add
  ...

MAINTAINERS: update Intel C600 SAS driver maintainers

Signed-off-by: Lukasz Dorau <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Maciej Patelczyk <[email protected]>
Cc: Artur Paszkiewicz <[email protected]>
Cc: James Bottomley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ufs: remove unused ufs_super_block_third pointer

Pointer 'usb3' to struct ufs_super_block_third acquired via
ubh_get_usb_third() is never used in function
ufs_read_cylinder_structures(). Thus remove it.

Detected by Coverity: CID 139939.

Signed-off-by: Christian Engelmayer <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ufs: remove unused ufs_super_block_second pointer

Pointer 'usb2' to struct ufs_super_block_second acquired via
ubh_get_usb_second() is never used in function ufs_statfs(). Thus
remove it.

Detected by Coverity: CID 139940.

Signed-off-by: Christian Engelmayer <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ufs: remove unused ufs_super_block_first pointer

Remove occurrences of unused pointers to struct ufs_super_block_first
that were acquired via ubh_get_usb_first().

Detected by Coverity: CID 139929 - CID 139936, CID 139940.

Signed-off-by: Christian Engelmayer <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/ufs/super.c: add __init to init_inodecache()

init_inodecache is only called by __init init_ufs_fs.

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

doc/kernel-parameters.txt: add early_ioremap_debug

Add description of early_ioremap_debug kernel parameter.

Signed-off-by: Mark Salter <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Young <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: add early_ioremap support

Add support for early IO or memory mappings which are needed before the
normal ioremap() is usable. This also adds fixmap support for permanent
fixed mappings such as that used by the earlyprintk device register
region.

Signed-off-by: Mark Salter <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Young <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: initialize pgprot info earlier in boot

Presently, paging_init() calls init_mem_pgprot() to initialize pgprot
values used by macros such as PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.

The new fixmap and early_ioremap support also needs to use these macros
before paging_init() is called. This patch moves the init_mem_pgprot()
call out of paging_init() and into setup_arch() so that pgprot_default
gets initialized in time for fixmap and early_ioremap.

Signed-off-by: Mark Salter <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Young <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86: use generic early_ioremap

Move x86 over to the generic early ioremap implementation.

Signed-off-by: Mark Salter <[email protected]>
Acked-by: H. Peter Anvin <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: create generic early_ioremap() support

This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation. early_ioremap() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.

Some architectures have optional MMU. In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.
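
The no-MMU behaviour described above amounts to an identity mapping. A
minimal sketch, with hypothetical toy_* names rather than the actual
generic implementation:

```c
#include <stdint.h>

/* Hypothetical stand-in for a physical address type. */
typedef uintptr_t toy_phys_addr_t;

/* No-MMU case: the "mapping" is the physical address itself. */
static void *toy_early_ioremap_nommu(toy_phys_addr_t phys_addr,
				     unsigned long size)
{
	(void)size;
	return (void *)phys_addr;
}

/* No-MMU case: nothing was mapped, so there is nothing to undo. */
static void toy_early_iounmap_nommu(void *addr, unsigned long size)
{
	(void)addr;
	(void)size;
}
```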

Signed-off-by: Mark Salter <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: H. Peter Anvin <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/mm: sparse warning fix for early_memremap

This patch series takes the common bits from the x86 early ioremap
implementation and creates a generic implementation which may be used by
other architectures.  The early ioremap interfaces are intended for
situations where boot code needs to make temporary virtual mappings
before the normal ioremap interfaces are available.  Typically, this
means before paging_init() has run.

This patch (of 6):

There are a lot of sparse warnings for code like the following:

  void *a = early_memremap(phys_addr, size);

early_memremap() is intended to map kernel memory with the ioremap
facility, so the returned pointer should be a kernel RAM pointer instead
of an iomem one.

To make the function clearer and suppress the sparse warnings, this patch
does the following two things:
1. cast to (__force void *) for the return value of early_memremap
2. add early_memunmap function and pass (__force void __iomem *) to iounmap

From Boris:
  "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
   FWIW, all callers of early_memremap use the memory they get remapped
   as normal memory so we should be safe"

Signed-off-by: Dave Young <[email protected]>
Signed-off-by: Mark Salter <[email protected]>
Acked-by: H. Peter Anvin <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lglock: map to spinlock when !CONFIG_SMP

When the system has only one CPU, lglock is effectively a spinlock; map
it directly to spinlock to eliminate the indirection and duplicate code.

In addition to removing overhead, this drops 1.6k of code with a
defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.

Signed-off-by: Josh Triplett <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

percpu: add preemption checks to __this_cpu ops

We define a check function in order to avoid trouble with the include
files. Then the higher level __this_cpu macros are modified to invoke
the preemption check.
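
A user-space toy of this layering (every toy_* name is hypothetical, and
a plain counter stands in for the kernel's preemption state) could look
like this: the check lives in an ordinary function, and the checked
macro calls it before falling through to the unchecked operation.

```c
/* Toy state standing in for the preemption counter. */
static int toy_preempt_count;
static int toy_check_failures;
static int toy_counter;	/* stand-in for one CPU's per-cpu variable */

/* Out-of-line check function, so the macros need no extra includes. */
static void toy_this_cpu_preempt_check(void)
{
	if (toy_preempt_count == 0)
		toy_check_failures++;	/* the kernel would warn here */
}

/* Raw op: no check at all. */
#define toy_raw_cpu_inc(var)	((var)++)

/* Checked op: invoke the check, then perform the raw op. */
#define toy_checked_cpu_inc(var)			\
	do {						\
		toy_this_cpu_preempt_check();		\
		toy_raw_cpu_inc(var);			\
	} while (0)
```

Callers that tolerate races, such as statistics counters, can use the
raw form to skip the check.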

[[email protected]: coding-style fixes]
Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Tested-by: Grygorii Strashko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vmstat: use raw_cpu_ops to avoid false positives on preemption checks

vm counters are allowed to be racy. Use raw_cpu_ops to avoid the
local_irq_disable overhead and to avoid preemption checks which will be
added to the __this_cpu operations.

[[email protected]: Add comment. Again.]
Signed-off-by: Christoph Lameter <[email protected]>
Reported-by: Sergey Senozhatsky <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

slub: use raw_cpu_inc for incrementing statistics

Statistics are not critical to the operation of the allocation but
should also not cause too much overhead.

When __this_cpu_inc is altered to check if preemption is disabled this
triggers. Use raw_cpu_inc to avoid the checks. Using this_cpu_ops may
cause interrupt disable/enable sequences on various arches which may
significantly impact allocator performance.

[[email protected]: add comment]
Signed-off-by: Christoph Lameter <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

net: replace __this_cpu_inc in route.c with raw_cpu_inc

The RT_CACHE_STAT_INC macro triggers the new preemption checks
for __this_cpu ops.

I do not see any other synchronization that would allow the use of a
__this_cpu operation here. However, in commit dbd2915ce87e ("[IPV4]:
RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
raw_smp_processor_id() here because "we do not care" about races.  In
the past we agreed that the price of disabling interrupts here to get
consistent counters would be too high.  These counters may be inaccurate
due to race conditions.

The use of __this_cpu op improves the situation already from what commit
dbd2915ce87e did since the single instruction emitted on x86 does not
allow the race to occur anymore.  However, non x86 platforms could still
experience a race here.

Trace:

  __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
  caller is __this_cpu_preempt_check+0x38/0x60
  CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF            3.12.0-rc4+ #187
  Call Trace:
    check_preemption_disabled+0xec/0x110
    __this_cpu_preempt_check+0x38/0x60
    __ip_route_output_key+0x575/0x8c0
    ip_route_output_flow+0x27/0x70
    udp_sendmsg+0x825/0xa20
    inet_sendmsg+0x85/0xc0
    sock_sendmsg+0x9c/0xd0
    ___sys_sendmsg+0x37c/0x390
    __sys_sendmsg+0x49/0x90
    SyS_sendmsg+0x12/0x20
    tracesys+0xe1/0xe6

Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

modules: use raw_cpu_write for initialization of per cpu refcount.

The initialization of a structure is not subject to synchronization.
The use of __this_cpu would trigger a false positive with the additional
preemption checks for __this_cpu ops.

So simply disable the check through the use of raw_cpu ops.

Trace:

  __this_cpu_write operation in preemptible [00000000] code: modprobe/286
  caller is __this_cpu_preempt_check+0x38/0x60
  CPU: 3 PID: 286 Comm: modprobe Tainted: GF            3.12.0-rc4+ #187
  Call Trace:
    dump_stack+0x4e/0x82
    check_preemption_disabled+0xec/0x110
    __this_cpu_preempt_check+0x38/0x60
    load_module+0xcfd/0x2650
    SyS_init_module+0xa6/0xd0
    tracesys+0xe1/0xe6

Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Rusty Russell <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: use raw_cpu ops for determining current NUMA node

With the preempt checking logic for __this_cpu_ops we will get false
positives from locations in the code that use numa_node_id.

Before the __this_cpu ops were introduced there were no checks for
preemption present either.  raw_smp_processor_id() was used.  See

  http://www.spinics.net/lists/linux-numa/msg00641.html

Therefore we need to use raw_cpu_read here to avoid false positives.

Note that this issue has been discussed in prior years.  If the process
changes nodes after retrieving the current numa node then that is
acceptable since most uses of numa_node etc are for optimization and not
for correctness.

There were suggestions to implement a raw_numa_node_id in order to do
preempt checks for numa_node_id as well.  But I think we better defer
that to another patch since that would mean investigating how
numa_node_id() is used throughout the kernel which would increase the
scope of this patchset significantly.  After all preemption was never
checked before when numa_node_id() was used.

Some sample traces:

__this_cpu_read operation in preemptible [00000000] code: login/1456
caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  get_task_policy+0x1d/0x49
  get_vma_policy+0x14/0x76
  alloc_pages_vma+0x35/0xff
  handle_mm_fault+0x290/0x73b
  __do_page_fault+0x3fe/0x44d
  do_page_fault+0x9/0xc
  page_fault+0x22/0x30
  generic_file_aio_read+0x38e/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23

caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  alloc_pages_current+0x8f/0xbc
  __page_cache_alloc+0xb/0xd
  __do_page_cache_readahead+0xf4/0x219
  ra_submit+0x1c/0x20
  ondemand_readahead+0x28c/0x2b4
  page_cache_sync_readahead+0x38/0x3a
  generic_file_aio_read+0x261/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23

Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

percpu: add raw_cpu_ops

The kernel has never been audited to ensure that this_cpu operations are
consistently used throughout the kernel.  The code generated in many
places can be improved through the use of this_cpu operations (which
use a segment register for relocation of per cpu offsets instead of
performing address calculations).

The patch set also addresses various consistency issues in general with
the per cpu macros.

A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
   because checks are skipped. This is typically shown through a raw_
   prefix. So this patch set changes the places where __this_cpu_ptr()
   is used to raw_cpu_ptr().

B. There has been the long term wish by some that __this_cpu operations
   would check for preemption. However, there are cases where preemption
   checks need to be skipped. This patch set adds raw_cpu operations that
   do not check for preemption and then adds preemption checks to the
   __this_cpu operations.

C. The use of __get_cpu_var is always a reference to a percpu variable
   that can also be handled via a this_cpu operation. This patch set
   replaces all uses of __get_cpu_var with this_cpu operations.

D. We can then use this_cpu RMW operations in various places replacing
   sequences of instructions by a single one.

E. The use of this_cpu operations throughout will allow other arches than
   x86 to implement optimized references and RMW operations to work with
   per cpu local data.

F. The use of this_cpu operations opens up the possibility to
   further optimize code that relies on synchronization through
   per cpu data.

The patch set works in a couple of stages:

I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
    Also converts the existing __this_cpu_xx_# primitive in the x86
    code to raw_cpu_xx_#.

II. Patches 2-4 use the raw_cpu operations in places that would give
     us false positives once they are enabled.

III. Patch 5 adds preemption checks to __this_cpu operations to allow
    checking if preemption is properly disabled when these functions
    are used.

IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
   with this_cpu_ptr. They do not depend on any changes to the percpu
   code. No preemption tests are skipped if they are applied.

V. Patches 21-46 are conversion patches that use this_cpu operations
   in various kernel subsystems/drivers or arch code.

VI.  Patches 47/48 (not included in this series) remove no longer used
    functions (__this_cpu_ptr and __get_cpu_var).  These should only be
    applied after all the conversion patches have made it and after we
    have done additional passes through the kernel to ensure that none of
    the uses of these functions remain.

This patch (of 46):

The patches following this one will add preemption checks to __this_cpu
ops so we need to have an alternative way to use this_cpu operations
without preemption checks.

raw_cpu_ops will be the basis for all other ops since these will be the
operations that do not implement any checks.

Primitive operations are renamed by this patch from __this_cpu_xxx to
raw_cpu_xxx.

Also change the uses of the x86 percpu primitives in preempt.h.
These depend directly on asm/percpu.h (header #include nesting issue).
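
The layering described above can be sketched in userspace C. This is only
a model under stated assumptions: every model_* name is hypothetical, the
"per cpu" variable is a plain global, and the preemption state is a plain
counter rather than the kernel's preempt count.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
int model_preempt_count;  /* > 0 means preemption is disabled */
int model_warnings;       /* how many preemption checks have fired */
int model_percpu_slot;    /* stands in for one per cpu variable */

/* Base layer: performs the access and nothing else. */
void model_raw_cpu_write(int *var, int val)
{
    *var = val;
}

/* Models the check that a later patch adds to the __this_cpu ops. */
void model_preempt_check(const char *op)
{
    if (model_preempt_count == 0) {
        model_warnings++;
        fprintf(stderr, "%s operation in preemptible code\n", op);
    }
}

/* Checked layer: run the check, then fall through to the raw op. */
void model_this_cpu_write(int *var, int val)
{
    model_preempt_check("__this_cpu_write");
    model_raw_cpu_write(var, val);
}

/* Walks through the three cases: raw, checked and preemptible, checked
 * with preemption disabled. */
void model_demo(void)
{
    model_preempt_count = 0;
    model_warnings = 0;

    model_raw_cpu_write(&model_percpu_slot, 1);   /* unchecked */
    assert(model_warnings == 0);

    model_this_cpu_write(&model_percpu_slot, 2);  /* preemptible: warns */
    assert(model_warnings == 1);

    model_preempt_count = 1;                      /* models preempt_disable() */
    model_this_cpu_write(&model_percpu_slot, 3);  /* no new warning */
    assert(model_warnings == 1);
    assert(model_percpu_slot == 3);
}
```

Building the checked operation on top of the raw one keeps a single place
where the access itself is implemented, which is the role the text assigns
to raw_cpu_ops.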

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Bryan Wu <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: David Daney <[email protected]>
Cc: David Miller <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Dipankar Sarma <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Haavard Skinnemoen <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: Hedi Berriche <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Nicolas Pitre <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Robert Richter <[email protected]>
Cc: Russell King <[email protected]>
Cc: Russell King <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Wim Van Sebroeck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

slub: fix leak of 'name' in sysfs_slab_add

The failure paths of sysfs_slab_add don't release the allocation of
'name' made by create_unique_id() a few lines above the context of the
diff below. Create a common exit path to make it more obvious what
needs freeing.
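
A common exit path of this kind might look like the following userspace
sketch. It is not the actual sysfs_slab_add() code: the model_* helpers,
the fail_step parameter and the error codes chosen are assumptions made
for illustration, and malloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

int model_live_names;  /* outstanding name allocations, for checking leaks */

/* Stands in for create_unique_id(): returns a freshly allocated name. */
static char *model_create_unique_id(void)
{
    char *name = malloc(8);

    if (name) {
        strcpy(name, ":t-0032");  /* 7 characters plus the terminator */
        model_live_names++;
    }
    return name;
}

static void model_free_name(char *name)
{
    assert(model_live_names > 0);
    free(name);
    model_live_names--;
}

/*
 * fail_step selects which of two hypothetical setup steps fails
 * (0 = none).  Every path after the allocation leaves through "out",
 * so the name is released exactly once, and only when it was allocated.
 */
int model_slab_add(int unmergeable, int fail_step)
{
    char *id = NULL;
    const char *name;
    int err = 0;

    if (unmergeable) {
        name = "borrowed-name";   /* not allocated here, must not be freed */
    } else {
        id = model_create_unique_id();
        if (!id)
            return -ENOMEM;
        name = id;
    }

    if (fail_step == 1 || strlen(name) == 0) {
        err = -EEXIST;
        goto out;
    }
    if (fail_step == 2) {
        err = -ENOMEM;
        goto out;
    }
out:
    if (!unmergeable)
        model_free_name(id);
    return err;
}
```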

[[email protected]: free the name only if !unmergeable]
Signed-off-by: Dave Jones <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

slub: rework sysfs layout for memcg caches

Currently, we try to arrange sysfs entries for memcg caches in the same
manner as for global caches.  Apart from turning /sys/kernel/slab into a
mess when there are a lot of kmem-active memcgs created, it actually
does not work properly - we won't create more than one link to a memcg
cache in case its parent is merged with another cache.  For instance, if
A is a root cache merged with another root cache B, we will have the
following sysfs setup:

  X
  A -> X
  B -> X

where X is some unique id (see create_unique_id()).  Now if memcgs M and
N start to allocate from cache A (or B, which is the same), we will get:

  X
  X:M
  X:N
  A -> X
  B -> X
  A:M -> X:M
  A:N -> X:N

Since B is an alias for A, we won't get entries B:M and B:N, which is
confusing.

It is more logical to have entries for memcg caches under the
corresponding root cache's sysfs directory.  This would allow us to keep
the sysfs layout clean, and avoid inconsistencies like the one described
above.

This patch does the trick.  It creates a "cgroup" kset in each root
cache kobject to keep its children caches there.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

slub: adjust memcg caches when creating cache alias

Otherwise, kzalloc() called from a memcg won't clear the whole object.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, slab: do not destroy children caches if parent has aliases

Currently we destroy children caches at the very beginning of
kmem_cache_destroy().  This is wrong, because the root cache will not
necessarily be destroyed in the end - if it has aliases (refcount > 0),
kmem_cache_destroy() will simply decrement its refcount and return.  In
this case, at best we will get a bunch of warnings in dmesg, like this
one:

  kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
  CPU: 1 PID: 7139 Comm: modprobe Tainted: G    B   W    3.13.0+ #117
  Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

At worst - if kmem_cache_destroy() races with an allocation from a
memcg cache - the kernel will panic.

This patch fixes this by moving the destruction of children caches after
the check for whether the cache has aliases.  Plus, it forbids destroying
a root cache if it still has children caches, because each child cache
keeps a reference to its parent.
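
The ordering described above can be modelled in a few lines of userspace
C. This is a sketch only: struct model_cache and its fields are
hypothetical, and restoring the refcount when children survive is an
assumption of the model, not something stated in this message.

```c
#include <assert.h>
#include <errno.h>

struct model_cache {
    int refcount;          /* 1 plus the number of aliases */
    int nr_children;       /* memcg caches created from this root */
    int nr_busy_children;  /* children that still hold objects */
    int destroyed;
};

/* Tears down every idle child; returns how many children are left. */
static int model_destroy_children(struct model_cache *root)
{
    root->nr_children = root->nr_busy_children;
    return root->nr_children;
}

int model_cache_destroy(struct model_cache *root)
{
    assert(root->refcount > 0);

    /* Alias check first: if another user remains, touch nothing else. */
    if (--root->refcount > 0)
        return 0;

    /* Only now try the children; refuse to go on if any survive. */
    if (model_destroy_children(root) != 0) {
        root->refcount++;  /* model assumption: the root stays usable */
        return -EBUSY;
    }

    root->destroyed = 1;
    return 0;
}
```

With the old ordering, the first call on an aliased cache would already
have torn down the children even though the root itself survives.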

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, slab: unregister cache from memcg before starting to destroy it

Currently, memcg_unregister_cache(), which deletes the cache being
destroyed from the memcg_slab_caches list, is called after
__kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
destroy the cache.

As a result, one can access a partially destroyed cache while traversing
a memcg_slab_caches list, which can have deadly consequences (for
instance, cache_show() called for each cache on a memcg_slab_caches list
from mem_cgroup_slabinfo_read() will dereference pointers to already
freed data).

To fix this, let's move memcg_unregister_cache() before the start of the
cache destruction process, issuing memcg_register_cache() on failure.
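
A minimal userspace model of that ordering follows. The struct and the
model_* functions are hypothetical; a single flag stands in for
membership of the memcg_slab_caches list.

```c
#include <assert.h>
#include <errno.h>

struct model_cache {
    int listed;     /* reachable by list walkers */
    int objects;    /* live objects; shutdown fails while > 0 */
    int torn_down;  /* internal data already freed */
};

/* Stands in for the shutdown step: fails if objects remain. */
static int model_shutdown(struct model_cache *c)
{
    if (c->objects > 0)
        return -EBUSY;
    c->torn_down = 1;
    return 0;
}

int model_destroy(struct model_cache *c)
{
    int err;

    assert(c->listed);

    c->listed = 0;          /* unregister first: walkers cannot reach it */
    err = model_shutdown(c);
    if (err)
        c->listed = 1;      /* still a valid cache: register it again */
    return err;
}
```

The property the reordering buys is that a cache is never both listed and
torn down at the same time.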

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, slab: separate memcg vs root cache creation paths

Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
memcg-only and except-for-memcg calls. To clean this up, let's move the
code responsible for memcg cache creation to a separate function.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, slab: cleanup memcg cache creation

This patch cleans up the memcg cache creation path as follows:

- Move memcg cache name creation to a separate function to be called
  from kmem_cache_create_memcg().  This allows us to get rid of the mutex
  protecting the temporary buffer used for the name formatting, because
  the whole cache creation path is protected by the slab_mutex.

- Get rid of memcg_create_kmem_cache().  This function serves as a proxy
  to kmem_cache_create_memcg().  After separating the cache name creation
  path, it would be reduced to a function call, so let's inline it.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

memcg, slab: never try to merge memcg caches

When a kmem cache is created (kmem_cache_create_memcg()), we first try to
find a compatible cache that already exists and can handle requests from
the new cache, i.e.  has the same object size, alignment, ctor, etc.  If
there is such a cache, we do not create any new caches, instead we simply
increment the refcount of the cache found and return it.

Currently we do this procedure not only when creating root caches, but
also for memcg caches.  However, there is no point in that, because, as
every memcg cache has exactly the same parameters as its parent and cache
merging cannot be turned off at runtime (only at boot by passing
"slub_nomerge"), the root caches of any two potentially mergeable memcg
caches should be merged already, i.e.  it must be the same root cache, and
therefore we couldn't even get to the memcg cache creation, because it
already exists.

The only exception is boot caches - they are explicitly forbidden to be
merged by setting their refcount to -1.  There are currently only two of
them - kmem_cache and kmem_cache_node, which are used in slab internals (I
do not count kmalloc caches as their refcount is set to 1 immediately
after creation).  Since they are prevented from merging in the first
place, I guess we should avoid merging their children too.

So let's remove the useless code responsible for merging memcg caches.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Glauber Costa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm/system.h: um: arch_align_stack() moved to asm/exec.h

arch_align_stack() moved to asm/exec.h, so change the comment referring to
asm/system.h which no longer exists.

Signed-off-by: David Howells <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

asm/system.h: clean asm/system.h from docs

Clean asm/system.h from docs as nothing should refer to that header anymore.

Signed-off-by: David Howells <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel: use macros from compiler.h instead of __attribute__((...))

To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes
with the right macro in the kernel subsystem.
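
As an illustration of the kind of substitution involved, here is a small
userspace sketch. The example_* macros are local, hypothetical copies
defined only for this example (the kernel's own names are __weak and
friends in <linux/compiler.h>), and a GCC-compatible compiler is assumed.

```c
#include <assert.h>

/* Local stand-ins for convenience macros of the compiler.h kind. */
#define example_weak    __attribute__((weak))
#define example_packed  __attribute__((packed))

/* Before: the attribute is spelled out at the definition. */
int __attribute__((weak)) example_old_hook(void)
{
    return 1;
}

/* After: the same attribute, expressed through the macro. */
int example_weak example_new_hook(void)
{
    return 1;
}

/* The same idea for a type attribute. */
struct example_packed example_hdr {
    char tag;
    int len;
};

void example_demo(void)
{
    assert(example_old_hook() == example_new_hook());
    /* Packed: no padding between the two members. */
    assert(sizeof(struct example_hdr) == sizeof(char) + sizeof(int));
}
```

Keeping the attribute behind a macro means a port to another compiler
only has to change the macro definition, not every use.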

Signed-off-by: Gideon Israel Dsouza <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Kconfig: rename HAS_IOPORT to HAS_IOPORT_MAP

If the renamed symbol is defined, lib/iomap.c implements ioport_map and
ioport_unmap, and currently (nearly) all platforms define the port
accessor functions outb/inb and friends unconditionally. So
HAS_IOPORT_MAP is the better name for this.

Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.

The motivation for this change is to reintroduce a symbol HAS_IOPORT
that signals if outb/inb et al are available. I will address that at
least one merge window later though to keep surprises to a minimum and
catch new introductions of (HAS|NO)_IOPORT.

The changes in this commit were done using:

$ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'

Signed-off-by: Uwe Kleine-König <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc: use device_initcall

... since __initcall is now deprecated.

Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc/compat.c: remove sc_semopm macro

This macro appears to have been introduced back in the 2.5 era for
semtimedop32 backward compatibility on ia32:

https://lkml.org/lkml/2003/4/28/78

Nowadays, this syscall in compat just defaults back to the code found in
sem.c, so it is no longer used and can thus be removed:

long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
                           unsigned nsops, const struct compat_timespec __user *timeout)
{
        struct timespec __user *ts64;
        if (compat_convert_timespec(&ts64, timeout))
                return -EFAULT;
        return sys_semtimedop(semid, tsems, nsops, ts64);
}

Furthermore, there are no users in compat.c. After this change, the kernel
builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC.

Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

initramfs: debug detected compression method

This can greatly aid in narrowing down the real source of initramfs
problems such as failures related to the compression of the in-kernel
initramfs when an external initramfs is in use as well. Existing errors
are ambiguous as to which initramfs is a problem and why.
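
Detection by leading magic bytes, with a debug line naming the result,
can be sketched as below. This is a userspace model: the table is limited
to three well-known signatures, the message text and all model_* names
are assumptions, and fprintf() stands in for pr_debug().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct model_method {
    unsigned char magic[2];
    const char *name;
};

/* First two bytes of each stream format. */
static const struct model_method model_methods[] = {
    { { 0x1f, 0x8b }, "gzip"  },
    { { 0x42, 0x5a }, "bzip2" },  /* "BZ" */
    { { 0xfd, 0x37 }, "xz"    },  /* 0xfd followed by '7' */
};

/* Returns the method name, or NULL if the data is not recognised. */
const char *model_detect(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    if (len < 2)
        return NULL;

    for (i = 0; i < sizeof(model_methods) / sizeof(model_methods[0]); i++) {
        if (memcmp(buf, model_methods[i].magic, 2) == 0) {
            fprintf(stderr, "detected %s compressed data\n",
                    model_methods[i].name);
            return model_methods[i].name;
        }
    }
    return NULL;
}
```

Logging the detected method per image is what lets a later failure be
attributed to the in-kernel or the external initramfs.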

[[email protected]: use pr_debug()]
Signed-off-by: Daniel M. Weeks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fault-injection: set bounds on what /proc/self/make-it-fail accepts.

/proc/self/make-it-fail is a boolean, but accepts any number, including
negative ones. Change variable to unsigned, and cap upper bound at 1.
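
A bounds check of this kind could look like the following userspace
sketch. The function name is hypothetical and strtol() stands in for the
kernel's parsing helper; the value is kept signed and range-checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parses user input for a boolean knob.  Only 0 and 1 are accepted;
 * *out is left untouched on error.
 */
int model_parse_make_it_fail(const char *buf, int *out)
{
    char *end;
    long val;

    assert(buf != NULL && out != NULL);

    errno = 0;
    val = strtol(buf, &end, 0);
    if (end == buf || errno == ERANGE)
        return -EINVAL;      /* no digits at all, or out of range */
    if (val < 0 || val > 1)
        return -EINVAL;      /* a number, but not a boolean */

    *out = (int)val;
    return 0;
}
```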

[[email protected]: don't make make_it_fail unsigned]
Signed-off-by: Dave Jones <[email protected]>
Reviewed-by: Akinobu Mita <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86: always define BUG() and HAVE_ARCH_BUG, even with !CONFIG_BUG

This ensures that BUG() always has a definition that causes a trap (via
an undefined instruction), and that the compiler still recognizes the
code following BUG() as unreachable, avoiding warnings that would
otherwise appear (such as on non-void functions that don't return a
value after BUG()).

In addition to saving a few bytes over the generic infinite-loop
implementation, this implementation traps rather than looping, which
potentially allows for better error-recovery behavior (such as by
rebooting).

Signed-off-by: Josh Triplett <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: Make BUG() always stop the machine

When !CONFIG_BUG and !HAVE_ARCH_BUG, define the generic BUG() as an
infinite loop rather than a no-op.  This avoids undefined behavior if
execution ever actually reaches BUG(), and avoids warnings about code
after BUG() (such as on non-void functions calling BUG() and then not
returning).
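
The difference between the two stub variants can be seen in a small
userspace sketch. The MODEL_* macros are hypothetical stand-ins, not the
kernel definitions.

```c
#include <assert.h>

/* Two possible stubs for a build without bug reporting. */
#define MODEL_BUG_NOOP()  do { } while (0)   /* execution continues */
#define MODEL_BUG_LOOP()  do { } while (1)   /* execution never continues */

enum model_kind { MODEL_SMALL, MODEL_LARGE };

int model_size(enum model_kind kind)
{
    switch (kind) {
    case MODEL_SMALL:
        return 16;
    case MODEL_LARGE:
        return 64;
    }
    /*
     * With the no-op stub, control would fall off the end of a non-void
     * function here: a compiler warning, and undefined behavior if the
     * caller uses the result.  The looping stub does not return, so the
     * compiler can treat the end of the function as unreachable.
     */
    MODEL_BUG_LOOP();
}

void model_demo(void)
{
    assert(model_size(MODEL_SMALL) == 16);
    assert(model_size(MODEL_LARGE) == 64);
}
```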

bloat-o-meter results:

  add/remove: 0/0 grow/shrink: 43/10 up/down: 235/-98 (137)
  function                             old     new   delta
  umount_collect                       119     138     +19
  notify_change                        306     324     +18
  xstate_enable_boot_cpu               252     269     +17
  kunmap                                54      70     +16
  balloon_page_dequeue                 112     126     +14
  mm_take_all_locks                    223     233     +10
  list_lru_walk_node                   143     152      +9
  vma_adjust                          1059    1067      +8
  pcpu_setup_first_chunk              1130    1138      +8
  mm_drop_all_locks                    143     151      +8
  ns_capable                            55      62      +7
  anon_transport_class_unregister        8      15      +7
  srcu_init_notifier_head               35      41      +6
  shrink_dcache_for_umount             174     180      +6
  kunmap_high                           99     105      +6
  end_page_writeback                    43      49      +6
  do_exit                             1339    1345      +6
  __kfifo_dma_out_prepare_r             86      92      +6
  __kfifo_dma_in_prepare_r              90      96      +6
  fixup_user_fault                     120     125      +5
  repair_env_string                     73      77      +4
  read_cache_pages_invalidate_page      56      60      +4
  isolate_lru_pages.isra               142     146      +4
  do_notify_parent_cldstop             255     259      +4
  cpu_init                             370     374      +4
  utimes_common                        270     272      +2
  tasklet_hi_action                     91      93      +2
  tasklet_action                        91      93      +2
  set_pte_vaddr                         46      48      +2
  find_get_pages_tag                   202     204      +2
  early_iounmap                        185     187      +2
  __native_set_fixmap                   36      38      +2
  __get_user_pages                     822     824      +2
  __early_ioremap                      299     301      +2
  yield_task_stop                        1       2      +1
  tick_resume                           37      38      +1
  switched_to_stop                       1       2      +1
  switched_to_idle                       1       2      +1
  prio_changed_stop                      1       2      +1
  prio_changed_idle                      1       2      +1
  pm_qos_power_read                    111     112      +1
  arch_cpu_idle_dead                     1       2      +1
  __insert_vmap_area                   140     141      +1
  sys_renameat                         614     612      -2
  mm_fault_error                       297     295      -2
  SyS_renameat                         614     612      -2
  sys_linkat                           416     413      -3
  SyS_linkat                           416     413      -3
  chmod_common                         129     122      -7
  proc_cap_handler                     240     225     -15
  __schedule                           849     831     -18
  sys_madvise                         1077    1054     -23
  SyS_madvise                         1077    1054     -23

Signed-off-by: Josh Triplett <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: when !CONFIG_BUG, make WARN call no_printk to check format and args

The stub version of WARN for !CONFIG_BUG completely ignored its format
string and subsequent arguments; make it check them instead, using
no_printk.
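
The idea of checking a format string without ever printing can be
sketched in userspace as follows. The macros are hypothetical models
built on printf(); they are not the kernel's no_printk or WARN
definitions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Never prints: the call sits behind a constant-false condition.  The
 * compiler still sees the call, so it can check the format string
 * against the arguments.
 */
#define model_no_printk(...)  ((void)(0 && printf(__VA_ARGS__)))

/* Stub WARN: check the format, then yield the truth of the condition. */
#define MODEL_WARN(cond, ...) (model_no_printk(__VA_ARGS__), !!(cond))

void model_demo(void)
{
    int calls = 0;

    assert(MODEL_WARN(1, "value %d\n", 5) == 1);
    assert(MODEL_WARN(0, "nothing to report\n") == 0);

    /* Arguments are type-checked but never evaluated. */
    (void)MODEL_WARN(1, "%d\n", calls++);
    assert(calls == 0);
}
```

A mismatched call such as MODEL_WARN(1, "%s\n", 5) would now draw a
format warning from a compiler with format checking enabled, where a stub
that discards its arguments would stay silent.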

Signed-off-by: Josh Triplett <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/asm-generic/bug.h: style fix: s/while(0)/while (0)/

Signed-off-by: Josh Triplett <[email protected]>
Reported-by: Randy Dunlap <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bug: when !CONFIG_BUG, simplify WARN_ON_ONCE and family

When !CONFIG_BUG, WARN_ON and family become simple passthroughs of their
condition argument; however, WARN_ON_ONCE and family still have conditions
and a boolean to detect one-time invocation, even though the warning
they'd emit doesn't exist. Make the existing definitions conditional on
CONFIG_BUG, and add definitions for !CONFIG_BUG that map to the
passthrough versions of WARN and WARN_ON.
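
The two shapes of a once-only warning macro can be modelled like this.
MODEL_CONFIG_BUG and the MODEL_* macros are hypothetical; the full
variant uses a GNU C statement expression, so a GCC-compatible compiler
is assumed for that branch.

```c
#include <assert.h>
#include <stdio.h>

int model_emitted;  /* counts warnings actually printed */

#ifdef MODEL_CONFIG_BUG
/* Full version: a per-call-site flag so the message is printed once. */
#define MODEL_WARN_ON_ONCE(cond) ({                                 \
    static int warned_;                                             \
    int ret_ = !!(cond);                                            \
    if (ret_ && !warned_) {                                         \
        warned_ = 1;                                                \
        model_emitted++;                                            \
        fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__);  \
    }                                                               \
    ret_;                                                           \
})
#else
/* Nothing to print, so no flag and no extra branch: just the condition. */
#define MODEL_WARN_ON_ONCE(cond) (!!(cond))
#endif

/* Counts negative entries; each one is a "should not happen" case. */
int model_count_negative(const int *v, int n)
{
    int i, hits = 0;

    for (i = 0; i < n; i++)
        if (MODEL_WARN_ON_ONCE(v[i] < 0))
            hits++;
    return hits;
}

void model_demo(void)
{
    int v[] = { -1, 2, -3 };

    /* The result of the condition is the same in both variants. */
    assert(model_count_negative(v, 3) == 2);
}
```

The passthrough variant drops one static flag and one branch per call
site, which is where a size saving of the kind reported comes from.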

This saves 4.4k on a minimized configuration (smaller than allnoconfig),
and 20.6k with defconfig plus CONFIG_BUG=n.

Signed-off-by: Josh Triplett <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT

"make allnoconfig" exists to ease testing of minimal configurations.
Documentation/SubmitChecklist includes a note to test with allnoconfig.
This helps catch missing dependencies on common-but-not-required
functionality, which might otherwise go unnoticed.

However, allnoconfig still leaves many symbols enabled, because they're
hidden behind CONFIG_EMBEDDED or CONFIG_EXPERT.  For instance, allnoconfig
still has CONFIG_PRINTK and CONFIG_BLOCK enabled, so drivers don't
typically get build-tested with those disabled.

To address this, introduce a new Kconfig option "allnoconfig_y", used on
symbols which only exist to hide other symbols.  Set it on CONFIG_EMBEDDED
(which then selects CONFIG_EXPERT).  allnoconfig will then disable all the
symbols hidden behind those.

Signed-off-by: Josh Triplett <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: Michal Marek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ppc: make PPC_BOOK3S_64 select IRQ_WORK

Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
disable options behind EMBEDDED and EXPERT".

arch/powerpc/kernel/mce.c, compiled in for PPC_BOOK3S_64, calls
functions only built when IRQ_WORK, so select it.  Fixes the following
build error:

  arch/powerpc/kernel/built-in.o: In function `.machine_check_queue_event':
  (.text+0x11260): undefined reference to `.irq_work_queue'

Signed-off-by: Josh Triplett <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Acked-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ia64: select CONFIG_TTY for use of tty_write_message in unaligned

Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
disable options behind EMBEDDED and EXPERT".

arch/ia64/kernel/unaligned.c uses tty_write_message to print an
unaligned access exception to the TTY of the current user process.
Enable TTY to prevent a build error.

Minimal fix, on the basis that few people on ia64 will care deeply about
kernel size enough to turn off TTY. Ideally, I'd instead suggest
dropping the tty_write_message entirely, and just leaving the printk.
Bonus: no need to sprintf first.

Signed-off-by: Josh Triplett <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cris: cpuinfo_op should depend on CONFIG_PROC_FS

Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
disable options behind EMBEDDED and EXPERT".

Now allnoconfig started disabling CONFIG_PROC_FS:

arch/cris/kernel/built-in.o:(.rodata+0xc): undefined reference to `show_cpuinfo'
make: *** [vmlinux] Error 1

Signed-off-by: Geert Uytterhoeven <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cris: make ETRAX_ARCH_V10 select TTY for use in debugport

Fix breakage which will be exposed by the patch "kconfig: make allnoconfig
disable options behind EMBEDDED and EXPERT".

arch/cris/arch-v10/kernel/debugport.c, compiled in unconditionally with
ETRAX_ARCH_V10, requires TTY, so select TTY to avoid a build failure.

Signed-off-by: Josh Triplett <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/misc/sgi-gru/grukdump.c: cleanup gru_dump_context() a little

"ret" is zero here so we can remove the "!ret" part of the condition.
"uhdr" is already a __user pointer so we can remove the cast.

Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Dimitri Sivanich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/panic.c: display reason at end + pr_emerg

Currently, booting without an initrd specified on an 80x25 screen gives a
call trace followed by atkbd : Spurious ACK. The original message ("VFS:
Unable to mount root fs") is no longer visible on screen. Of course this
could happen in other situations...

This patch displays the panic reason after the call trace, which could
help a lot of people even if it's not the very last line on screen.

Also, convert all panic.c printk(KERN_EMERG to pr_emerg(

[[email protected]: missed a couple of pr_ conversions]
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/bfs/inode.c: add __init to init_inodecache()

init_inodecache is only called by __init init_bfs_fs

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

affs: add mount option to avoid filename truncates

Normal behavior for filenames exceeding specific filesystem limits is to
refuse operation.

The standard AFFS name length is only 30 characters, against 255 for
usual Linux filesystems, so the original implementation truncates
filenames by default.  A define, AFFS_NO_TRUNCATE, can be enabled to turn
this off, but that requires recompiling the module.

This patch adds a 'nofilenametruncate' mount option so that the user can
easily activate that feature and avoid a lot of problems (e.g. overwriting
files ...)
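
The two behaviours can be modelled in userspace as below. The function
and constant names are hypothetical, and returning -ENAMETOOLONG for the
refusal case is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_AFFS_NAME_MAX 30  /* the 30-character limit named above */

/*
 * Returns the length that would be stored on disk, or -ENAMETOOLONG when
 * the name is too long and truncation has been disabled.
 */
int model_affs_check_name(const char *name, int notruncate)
{
    size_t len;

    assert(name != NULL);

    len = strlen(name);
    if (len > MODEL_AFFS_NAME_MAX) {
        if (notruncate)
            return -ENAMETOOLONG;   /* refuse, like other filesystems */
        len = MODEL_AFFS_NAME_MAX;  /* default: silently truncate */
    }
    return (int)len;
}
```

With truncation enabled, two different names that share their first 30
characters map to the same stored name, which is how one file can end up
overwriting another.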

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/affs/dir.c: unlock/brelse dir on failure + code clean-up

Commit 0edf977d2ae3 ("[readdir] convert affs") returns -EIO directly,
without unlocking the dir inode and releasing the dir bh, when the second
affs_bread sequence fails. This patch restores the initial behaviour. It
also reformats pr_debug and affs_error to fit in 80 columns and removes the
reference to filldir (replaced by dir_emit in the commit above).

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

affs: add __init to init_inodecache ()

init_inodecache is only called by __init init_affs_fs

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fs/adfs/super.c: add __init to init_inodecache()

init_inodecache is only called by __init init_adfs_fs.

Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hung_task: check the value of "sysctl_hung_task_timeout_sec"

As sysctl_hung_task_timeout_sec is unsigned long, when this value is
larger than LONG_MAX/HZ, the function schedule_timeout_interruptible in
watchdog will return immediately without sleeping and print:

schedule_timeout: wrong timeout value ffffffffffffff83

and then the function watchdog will call schedule_timeout_interruptible
again and again. The screen will be filled with

"schedule_timeout: wrong timeout value ffffffffffffff83"

This patch adds a check and correction in sysctl, so that the function
schedule_timeout_interruptible always gets a valid parameter.
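
The arithmetic behind the LONG_MAX/HZ limit can be sketched in plain C.
This is only an illustration, not the sysctl handler itself: SKETCH_HZ is
an assumed tick rate, and the sketch_* helpers are hypothetical names. How
the real handler treats an out-of-range value (reject or clamp) is not
stated above; clamping is shown here as one possible correction.

```c
#include <limits.h>
#include <stdbool.h>

/* Assumed tick rate for this sketch; the kernel's HZ is a build setting. */
#define SKETCH_HZ 1000UL

/* Largest timeout in seconds whose tick count still fits in a signed long. */
unsigned long sketch_timeout_max(void)
{
	return (unsigned long)LONG_MAX / SKETCH_HZ;
}

/* True if secs * SKETCH_HZ can be passed on as a non-negative long. */
bool sketch_timeout_valid(unsigned long secs)
{
	return secs <= sketch_timeout_max();
}

/* One possible correction: clamp an out-of-range request to the maximum. */
unsigned long sketch_timeout_clamp(unsigned long secs)
{
	return sketch_timeout_valid(secs) ? secs : sketch_timeout_max();
}
```

Any value above the limit would turn into a negative tick count once
multiplied and converted to long, which matches the "wrong timeout value"
message quoted above.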

Signed-off-by: Liu Hua <[email protected]>
Tested-by: Satoru Takeuchi <[email protected]>
Cc: <[email protected]> [3.4+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

rapidio: rework device hierarchy and introduce mport class of devices

This patch removes an artificial RapidIO bus root device and establishes
the actual device hierarchy by providing references to real parent devices.
It also introduces a device class for RapidIO controller devices (on-chip
or an external bridge, known as "mport").

The existing implementation was sufficient for SoC-based platforms that
have a single RapidIO controller. With the introduction of devices using
multiple RapidIO controllers and PCIe-to-RapidIO bridges, the old scheme
is very limiting or does not work at all. The implemented changes make it
possible to properly reference the platform's local RapidIO mport devices
and provide the device details needed by upper layers.

This change to RapidIO device hierarchy does not break any known
existing kernel or user space interfaces.

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Li Yang <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Andre van Herk <[email protected]>
Cc: Stef van Os <[email protected]>
Cc: Jerry Jacobs <[email protected]>
Cc: Arno Tiemersma <[email protected]>
Cc: Rob Landley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drivers/rapidio/devices/tsi721_dma.c: optimize use of BDMA descriptors

Combine SG entries describing a single contiguous memory block into one
Tsi721 BDMA descriptor. This reduces the number of hardware descriptors
required for large data transfers and improves performance on the PCIe
side by reducing the number of descriptor fetch requests.
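
The general idea of merging adjacent scatter-gather entries can be sketched
as follows. This is not the tsi721 driver code: struct sketch_seg,
sketch_coalesce() and the max_len limit are hypothetical, and the sketch
assumes that a descriptor covers one contiguous (address, length) range up
to some maximum byte count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one SG entry or one hardware descriptor. */
struct sketch_seg {
	uint64_t addr;	/* start address of the block */
	uint32_t len;	/* length of the block in bytes */
};

/*
 * Copy "n" input segments to "out", merging each segment into the
 * previous output entry when it starts exactly where that entry ends
 * and the merged length stays within max_len. "out" must have room for
 * "n" entries. Returns the number of output entries used.
 */
size_t sketch_coalesce(const struct sketch_seg *in, size_t n,
		       struct sketch_seg *out, uint32_t max_len)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		struct sketch_seg *last = used ? &out[used - 1] : NULL;

		if (last && last->addr + last->len == in[i].addr &&
		    (uint64_t)last->len + in[i].len <= max_len) {
			last->len += in[i].len;	/* contiguous: extend */
		} else {
			out[used++] = in[i];	/* gap or limit: new entry */
		}
	}
	return used;
}
```

For three input entries where the first two are back to back, the sketch
produces two output entries instead of three.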

Signed-off-by: Alexandre Bounine <[email protected]>
Cc: Matt Porter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/idr.c: use RCU_INIT_POINTER(x, NULL)

Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure. And in the
case of the NULL pointer, there is no structure to initialize.

So, rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL)
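
The reasoning can be pictured in userspace with C11 atomics. This is an
analogy only, not the kernel macros: the release store stands in for the
ordering that rcu_assign_pointer() provides, the relaxed store stands in
for RCU_INIT_POINTER(), and all sketch_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sketch_node {
	int value;
};

/* Shared slot that readers load the pointer from. */
static _Atomic(struct sketch_node *) sketch_slot;

/*
 * Publishing a real structure: the initialization of *n must be visible
 * before the pointer is, hence the release store.
 */
void sketch_publish(struct sketch_node *n, int value)
{
	n->value = value;
	atomic_store_explicit(&sketch_slot, n, memory_order_release);
}

/*
 * Storing NULL: there is no structure whose initialization needs to be
 * ordered before the store, so a relaxed store is enough.
 */
void sketch_clear(void)
{
	atomic_store_explicit(&sketch_slot, NULL, memory_order_relaxed);
}

/* Reader side; acquire is used here as a simple, conservative choice. */
struct sketch_node *sketch_read(void)
{
	return atomic_load_explicit(&sketch_slot, memory_order_acquire);
}
```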

Signed-off-by: Monam Agarwal <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

idr: remove dead code

Remove no longer used deprecated code, and make local functions
static.

Signed-off-by: Stephen Hemminger <[email protected]>
Acked-by: Jean Delvare <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vmcore: continue vmcore initialization if PT_NOTE is found empty

Currently, when an empty PT_NOTE is detected, vmcore initialization
fails.  That is too harsh, because a PT_NOTE can legitimately be empty:
for example, if one offlined a cpu but never restarted the kdump service,
then after a crash the PT_NOTE program header is there but contains no
data.  It's better to warn about the empty PT_NOTE and continue to
initialise vmcore.

Ultimately the multiple PT_NOTEs are merged into a single one, and all
empty PT_NOTEs are discarded naturally during the merge.  So an empty
PT_NOTE is not visible to user space and vmcore is as good as expected.
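
The "warn and continue" behaviour can be sketched as below. This is a
simplified userspace illustration, not the vmcore code: struct sketch_note
and sketch_merged_size() are hypothetical, and a note is reduced to just
its data size.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one PT_NOTE program header. */
struct sketch_note {
	size_t size;	/* bytes of note data; 0 means the note is empty */
};

/*
 * Sum the sizes of all non-empty notes, as a merge into a single PT_NOTE
 * would. Empty notes are warned about and skipped instead of being
 * treated as a fatal error. The number of skipped notes is stored in
 * *skipped.
 */
size_t sketch_merged_size(const struct sketch_note *notes, size_t count,
			  size_t *skipped)
{
	size_t total = 0;

	*skipped = 0;
	for (size_t i = 0; i < count; i++) {
		if (notes[i].size == 0) {
			fprintf(stderr, "warning: empty PT_NOTE %zu skipped\n", i);
			(*skipped)++;
			continue;
		}
		total += notes[i].size;
	}
	return total;
}
```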

Signed-off-by: WANG Chao <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: Greg Pearson <[email protected]>
Cc: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

include/linux/crash_dump.h: add vmcore_cleanup() prototype

Eliminate the following warning in proc/vmcore.c:

fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]

[[email protected]: clean up powerpc, remove unneeded EXPORT_SYMBOL]
Signed-off-by: Rashika Kheria <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process

Even if the main thread is dead the process still can stop/continue.
However, if the leader is ptraced wait_consider_task(ptrace => false)
always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
WCONTINUED can never work for the natural parent in this case.

Move the "A zombie ptracee is only visible to its ptracer" check into the
"if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
"fall through" code below.

This depends on the previous change, wait_task_stopped/continued must be
avoided if !delay_group_leader() and the tracer is ->real_parent.
Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
is already dead (single-threaded or not).  If it is traced by another task
then the "stopped" state is fine until the debugger detaches and reveals a
zombie state.

Stupid test-case:

#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void *tfunc(void *arg)
{
	sleep(1);	// wait for zombie leader
	raise(SIGSTOP);
	exit(0x13);
	return NULL;
}

int run_child(void)
{
	pthread_t thread;

	if (!fork()) {
		int tracee = getppid();

		assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
		do
			ptrace(PTRACE_CONT, tracee, 0,0);
		while (wait(NULL) > 0);

		return 0;
	}

	sleep(1);	// wait for PTRACE_ATTACH
	assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
	pthread_exit(NULL);
}

int main(void)
{
	int child, stat;

	child = fork();
	if (!child)
		return run_child();

	assert(child == waitpid(-1, &stat, WSTOPPED));
	assert(stat == 0x137f);

	kill(child, SIGCONT);

	assert(child == waitpid(-1, &stat, WCONTINUED));
	assert(stat == 0xffff);

	assert(child == waitpid(-1, &stat, 0));
	assert(stat == 0x1300);

	return 0;
}

Without this patch it hangs in waitpid(WSTOPPED) because
wait_task_stopped() is never called.

Note: this doesn't fix all problems with a zombie delay_group_leader(),
the WCONTINUED | WEXITED check is not exactly right.  A debugger can't
assume it will be notified if another thread reaps the whole thread group.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Michal Schmidt <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent

"A zombie is only visible to its ptracer" logic in wait_consider_task()
is very wrong. Trivial test-case:

#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>

int main(void)
{
	int child = fork();

	if (!child) {
		assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
		return 0x23;
	}

	assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
	assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
	return 0;
}

It hangs in waitpid(WSTOPPED) despite the fact that it has a single zombie
child.  This is because wait_consider_task(ptrace => 0) sees p->ptrace and
clears ->notask_error, assuming that the debugger should detach and notify
us.

Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
child is traced by us.  This really simplifies the logic and allows us to
do more fixes, see the next changes.  This also hides the unwanted group
stop state automatically, we can remove another ptrace_reparented() check.

Unfortunately, this adds the following behavioural changes:

1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
   a natural child if it is traced by the caller's sub-thread.

   Hopefully nobody will ever notice this change, and I think
   that nobody should rely on this behaviour anyway.

2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
   it is real parent.

   While this change comes as a side effect, I think it is good
   by itself. The group continued state can not be consumed by
   another process in this case, it doesn't depend on ptrace,
   it doesn't make sense to hide it from real parent.

   Perhaps we should add the thread_group_leader() check before
   wait_task_continued()? Maybe, but this shouldn't depend on
   ptrace_reparented().

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Michal Schmidt <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space

get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change. Note that this was
possible even before EXIT_TRACE was introduced.

This is not really bad but imho it makes sense to hide EXIT_TRACE from
user-space completely. So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
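
The effect of the swap can be illustrated with a small sketch. It assumes,
as described above, that the reported state is picked from the most
significant set bit and that EXIT_TRACE is the OR of the two exit bits.
The bit values 16 and 32 are illustrative, and the sketch_* helpers are
hypothetical, not the kernel's get_task_state().

```c
/* 1-based index of the highest set bit, or 0 for an empty mask. */
int sketch_highest_bit(unsigned int state)
{
	int pos = 0;

	while (state) {
		pos++;
		state >>= 1;
	}
	return pos;
}

/*
 * Report 'Z' or 'X' depending on which exit bit is the most significant
 * bit set in "state"; '?' if neither matches.
 */
char sketch_report(unsigned int state, unsigned int zombie_bit,
		   unsigned int dead_bit)
{
	int top = sketch_highest_bit(state);

	if (top && top == sketch_highest_bit(zombie_bit))
		return 'Z';
	if (top && top == sketch_highest_bit(dead_bit))
		return 'X';
	return '?';
}
```

With the zombie bit below the dead bit, the combined state reports 'X';
once the two values are swapped, the same combination reports 'Z'.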

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: Michal Schmidt <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: completely ignore the EXIT_DEAD tasks

Now that EXIT_DEAD is the terminal state it doesn't make sense to call
eligible_child() or security_task_wait() if the task is really dead.

Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Michal Schmidt <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: use EXIT_TRACE only if thread_group_leader(zombie)

wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
ptrace_reparented(). This is suboptimal and a bit confusing: we do not
need do_notify_parent(p) if !thread_group_leader(p) and in this case we
also do not need ptrace_unlink(), we can rely on ptrace_release_task().

Change wait_task_zombie() to check thread_group_leader() along with
ptrace_reparented() and simplify the final p->exit_state transition.

Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Michal Schmidt <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition

wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
drops tasklist_lock.  If this task is not the natural child and it is
traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

The last transition is racy, this is even documented in 50b8d257486a
"ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
race".  wait_consider_task() tries to detect this transition and clear
->notask_error but we can't rely on ptrace_reparented(), debugger can
exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

And there is another problem which was missed before: this transition
can also race with reparent_leader() which doesn't reset ->exit_signal if
EXIT_DEAD, assuming that this task must be reaped by someone else.  So
the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
/sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
the previous commit, but that was a temporary hack.

1. Add the new exit_state, EXIT_TRACE. It means that the task is the
   traced zombie, debugger is going to detach and notify its natural
   parent.

   This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
   can avoid the changes in proc/kgdb code, get_task_state() still
   reports "X (dead)" in this case.

   Note: with or without this change userspace can see Z -> X -> Z
   transition. Not really bad, but probably makes sense to fix.

2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
   if we need to notify the ->real_parent.

3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
   is always the final state we can safely ignore such a task.

4. Change wait_consider_task() to check EXIT_TRACE separately and kill
   the racy and no longer needed ptrace_reparented() case.

   If ptrace == T an EXIT_TRACE thread should be simply ignored, the
   owner of this state is going to ptrace_unlink() this task. We can
   pretend that it was already removed from ->ptraced list.

   Otherwise we should skip this thread too but clear ->notask_error,
   we must be the natural parent and debugger is going to untrace and
   notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
   even if the task was already untraced.
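
The two cases in point 4 can be summarised in a small decision sketch. It
is not the wait_consider_task() code: struct sketch_verdict and
sketch_consider_exit_trace() are hypothetical, and the sketch only encodes
what the text above says about an EXIT_TRACE task.

```c
#include <stdbool.h>

/* What a waiter does with one EXIT_TRACE task in this sketch. */
struct sketch_verdict {
	bool consider;			/* try to reap or report this task? */
	bool clear_notask_error;	/* keep waiting instead of failing? */
};

/*
 * "ptrace" mirrors the ptrace argument in the text: true when the task
 * is examined as a tracee of the caller, false when it is examined as a
 * natural child.
 */
struct sketch_verdict sketch_consider_exit_trace(bool ptrace)
{
	struct sketch_verdict v = {
		.consider = false,		/* skipped in both cases */
		.clear_notask_error = false,
	};

	/*
	 * Natural parent: the debugger is going to untrace the task and
	 * notify us, so the wait must not fail as if there were no child.
	 */
	if (!ptrace)
		v.clear_notask_error = true;

	return v;
}
```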

Signed-off-by: Oleg Nesterov <[email protected]>
Reported-by: Jan Kratochvil <[email protected]>
Reported-by: Michal Schmidt <[email protected]>
Tested-by: Michal Schmidt <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race

wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
drops tasklist_lock.  If this task is not the natural child and it is
traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

The last transition is racy, this is even documented in 50b8d257486a
"ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
race".  wait_consider_task() tries to detect this transition and clear
->notask_error but we can't rely on ptrace_reparented(), debugger can
exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

And there is another problem which was missed before: this transition
can also race with reparent_leader() which doesn't reset ->exit_signal if
EXIT_DEAD, assuming that this task must be reaped by someone else.  So
the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
/sbin/init doesn't use __WALL it becomes unreapable.

Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
Note: this is a simple temporary hack for -stable, it doesn't try to
solve all problems, it will be reverted by the next changes.

Signed-off-by: Oleg Nesterov <[email protected]>
Reported-by: Jan Kratochvil <[email protected]>
Reported-by: Michal Schmidt <[email protected]>
Tested-by: Michal Schmidt <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>