Git Repo - linux.git/log

Merge branch 'devel-stable' into devel

Merge branch 'for-rmk' of git://git.pengutronix.de/git/imx/linux-2.6 into devel-stable

x86, mm: Hold mm->page_table_lock while doing vmalloc_sync

Take mm->page_table_lock while syncing the vmalloc region. This prevents
a race with the Xen pagetable pin/unpin code, which expects that the
page_table_lock is already held. If this race occurs, then Xen can see
an inconsistent page type (a page can either be read/write or a pagetable
page, and pin/unpin converts it between them), which will cause either
the pin or the set_p[gm]d to fail; either will crash the kernel.

vmalloc_sync_all() should be called rarely, so this extra use of
page_table_lock should not interfere with its normal users.

The mm pointer is stashed in the pgd page's index field, as that won't
be otherwise used for pgds.

Reported-by: Ian Campbell <[email protected]>
Originally-by: Jan Beulich <[email protected]>
LKML-Reference: <4CB88A4C.1080305@goop.org>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>

x86, mm: Fix bogus whitespace in sync_global_pgds()

Whitespace cleanup only.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>

Merge branch 'for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into devel-stable

Conflicts:
arch/arm/mach-at91/include/mach/system.h
arch/arm/mach-imx/mach-cpuimx27.c

AT91 conflict resolution:
Acked-by: Anders Larsen <[email protected]>
IMX conflict resolution confirmed by Uwe Kleine-König.

Merge branch 'msm-core' of git://codeaurora.org/quic/kernel/dwalker/linux-msm into devel-stable

Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core

MIPS: O32 compat/N32: Fix to use compat syscall wrappers for AIO syscalls.

[Ralf: Michel's original patch only fixed N32; I replicated the same fix
for O32.]

Signed-off-by: Michel Thebeau <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ralf Baechle <[email protected]>

MAINTAINERS: Change list for ioc_serial to linux-serial.

IOC3 is also being used on SGI MIPS systems but this particular driver is
only being used on IA64 systems so linux-mips made no sense as a list. Pat
also thinks [email protected] is the better list.

Signed-off-by: Ralf Baechle <[email protected]>

SERIAL: ioc3_serial: Return -ENOMEM on memory allocation failure

In this code, 0 is returned on memory allocation failure, even though other
failures return -ENOMEM or other similar values.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression ret;
expression x,e1,e2,e3;
@@

ret = 0
... when != ret = e1
*x = $kmalloc\|kcalloc\|kzalloc$(...)
... when != ret = e2
if (x == NULL) { ... when != ret = e3
return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <[email protected]>
To: Pat Gefre <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1704/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: jz4740: Fix Kbuild Platform file.

The platform specific files should be included via the platform-y
variable.

Signed-off-by: David Daney <[email protected]>
Cc: Lars-Peter Clausen <[email protected]>
Patchwork: https://patchwork.linux-mips.org/patch/1719/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: Repair Kbuild make clean breakage.

When running make clean, Kbuild doesn't process the .config file, so nothing
generates a platform-y variable. We can get it to descend into the platform
directories by setting $(obj-).

The dec Platform file was unconditionally setting platform-, obliterating
its previous contents and preventing some directories from being cleaned.
This is change to an append operation '+=' to allow cavium-octeon to be
cleaned.

Signed-off-by: David Daney <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Patchwork: https://patchwork.linux-mips.org/patch/1718/
Signed-off-by: Ralf Baechle <[email protected]>

Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon/kms: avivo cursor workaround applies to evergreen as well

eukrea_mbimxsd-baseboard: Pass the correct GPIO to gpio_free

Pass the correct GPIO to gpio_free

Signed-off-by: Fabio Estevam <[email protected]>
Acked-by: Eric Bénard <[email protected]>
Signed-off-by: Sascha Hauer <[email protected]>

cpuimx27: fix compile when ULPI is selected

without this patch we get :
arch/arm/mach-imx/built-in.o: In function `eukrea_cpuimx27_init':
eukrea_mbimx27-baseboard.c:(.init.text+0x44c): undefined reference to `mxc_ulpi_access_ops'

Signed-off-by: Eric Bénard <[email protected]>
Signed-off-by: Uwe Kleine-König <[email protected]>

mach-pcm037_eet: fix compile errors

this patch fix the following errors :
arch/arm/mach-mx3/mach-pcm037_eet.c:62: error: implicit declaration of function 'MXC_SPI_CS'
arch/arm/mach-mx3/mach-pcm037_eet.c:185: error: implicit declaration of function 'imx35_add_spi_imx0'

from the Kconfig pcm037 is i.MX31 based and not i.MX35 so replace
imx35_add_spi_imx0 by imx31_add_spi_imx0

Signed-off-by: Eric Bénard <[email protected]>
[ukl: remove unneeded #include <mach/spi.h>]
Signed-off-by: Uwe Kleine-König <[email protected]>

Fixing ethernet driver compilation error for i.MX31 ADS board

This is only a partial revert of "ARM: mx3/mx31ads: fold board
header in its only user"
[commit ccfa7c269843001077df02d98918c6c9bde91395)]

As some of the the board defines are also used in the cs89x0
ethernet driver by the i.MX31 ADS.

Signed-off-by: Ian Lartey <[email protected]>
Signed-off-by: Sascha Hauer <[email protected]>

cpuimx51: update board support

add NAND, SDHC

Signed-off-by: Eric Bénard <[email protected]>

mx5: add cpuimx51sd module and its baseboard

Signed-off-by: Eric Bénard <[email protected]>

iomux-mx51: fix GPIO_1_xx 's IOMUX configuration

this patch really configure the GPIO in GPIO mode.

Signed-off-by: Eric Bénard <[email protected]>

imx-esdhc: update devices registration

Tested on i.MX25 and i.MX35 and i.MX51

Signed-off-by: Eric Bénard <[email protected]>

mx51: add resources for SD/MMC on i.MX51

the attached patch allows SD to work on i.MX51 with Wolfram's drivers
Tested on i.MX51.

Based on original patch from: Richard Zhu <[email protected]>
Signed-off-by: Eric Bénard <[email protected]>

iomux-mx51: fix SD1 and SD2's iomux configuration

Based on original patch from: Richard Zhu <[email protected]>
Signed-off-by: Eric Bénard <[email protected]>

clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability

Signed-off-by: Eric Bénard <[email protected]>

clock-mx51: factorize clk_set_parent and clk_get_rate

Signed-off-by: Eric Bénard <[email protected]>

eukrea_mbimxsd: add support for DVI displays

Signed-off-by: Eric Bénard <[email protected]>

cpuimx25 & cpuimx35: fix OTG port registration in host mode

the PHY is UTMI so don't create an ULPI viewpoint.

Signed-off-by: Eric Bénard <[email protected]>

i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472

Without this exiting WFI can result in cache corruption.
Code taken from Freescale's 2.6.27 BSP and tested on i.MX35

Signed-off-by: Eric Bénard <[email protected]>

mx25: fix compile error in platform-imx-dma.c

this patch fix the following errors :
arch/arm/plat-mxc/devices/platform-imx-dma.c:44:
error: ‘MX25_SDMA_BASE_ADDR’ undeclared here (not in a function)
arch/arm/plat-mxc/devices/platform-imx-dma.c:44:
error: ‘MX25_INT_SDMA’ undeclared here (not in a function)

Signed-off-by: Eric Bénard <[email protected]>
Acked-by: Uwe Kleine-König <[email protected]>

mx25: fix clock's calculation

* get_rate_arm : when 400MHz clock is selected (cctl & 1<<14),
ARM clock is 400MHz (MPLL * 3 / 4) and not 800MHz
* get_rate_per : peripherals's clock is derived from AHB and not
from IPG (ref manual : figure 5-1)
* can2_clk : use the correct ID

* without this patch, peripherals getting their clock from PER
clocks work fine because of the 2 errors which fix themselves
(ARM clock x 2 and per clock actually based on IPG which is AHB/2)
but flexcan can't work as it gets its clock from IPG and thus
calculates its bitrate using a reference value which is twice
what it really is.

Signed-off-by: Eric Bénard <[email protected]>

ARM: imx: add lost 3rd imx-i2c device for mx35

During the reorganisation of the imx-i2c devices
(in 64de5ec168d9743903e6ec482c3e9f37af49f9c1) the 3rd imx-i2c device
for the mx35 got lost. This patch adds the missing device.

Signed-off-by: Marc Kleine-Budde <[email protected]>
Acked-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Sascha Hauer <[email protected]>

ARM: imx: Add iram allocator functions

Add IRAM(Internal RAM) allocation functions using GENERIC_ALLOCATOR.
The allocation size is 4KB multiples to guarantee alignment. The
idea for these functions is for i.MX platforms to use them
to dynamically allocate IRAM usage.

Applies on 2.6.36-rc7

Signed-off-by: Dinh Nguyen <[email protected]>
Reviewed-by: Amit Kucheria <[email protected]>
Acked-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Sascha Hauer <[email protected]>

KVM: Fix fs/gs reload oops with invalid ldt

kvm reloads the host's fs and gs blindly, however the underlying segment
descriptors may be invalid due to the user modifying the ldt after loading
them.

Fix by using the safe accessors (loadsegment() and load_gs_index()) instead
of home grown unsafe versions.

This is CVE-2010-3698.

KVM-Stable-Tag.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

tracing: Fix compile issue for trace_sched_wakeup.c

The function start_func_tracer() was incorrectly added in the
#ifdef CONFIG_FUNCTION_TRACER condition, but is still used even
when function tracing is not enabled.

The calls to register_ftrace_function() and register_ftrace_graph()
become nops (and their arguments are even ignored), thus there is
no reason to hide start_func_tracer() when function tracing is
not enabled.

Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

[S390] hardirq: remove pointless header file includes

Remove a couple of pointless header file includes.
Fixes a compile bug caused by header file include dependencies with
"irq: Add tracepoint to softirq_raise" within linux-next.

Reported-by: Sachin Sant <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
[ cherry-picked from the s390 tree to fix "2bf2160: irq: Add tracepoint to softirq_raise" ]
Signed-off-by: Ingo Molnar <[email protected]>

[IA64] Move local_softirq_pending() definition

Ugly #include dependencies. We need to have local_softirq_pending()
defined before it gets used in <linux/interrupt.h>. But <asm/hardirq.h>
provides the definition *after* this #include chain:
  <linux/irq.h>
    <asm/irq.h>
      <asm/hw_irq.h>
        <linux/interrupt.h>

Signed-off-by: Tony Luck <[email protected]>
[ cherry-picked from the ia64 tree to fix "2bf2160: irq: Add tracepoint to softirq_raise" ]
Signed-off-by: Ingo Molnar <[email protected]>

futex: Fix errors in nested key ref-counting

futex_wait() is leaking key references due to futex_wait_setup()
acquiring an additional reference via the queue_lock() routine. The
nested key ref-counting has been masking bugs and complicating code
analysis. queue_lock() is only called with a previously ref-counted
key, so remove the additional ref-counting from the queue_(un)lock()
functions.

Also futex_wait_requeue_pi() drops one key reference too many in
unqueue_me_pi(). Remove the key reference handling from
unqueue_me_pi(). This was paired with a queue_lock() in
futex_lock_pi(), so the count remains unchanged.

Document remaining nested key ref-counting sites.

Signed-off-by: Darren Hart <[email protected]>
Reported-and-tested-by: Matthieu Fertré<[email protected]>
Reported-by: Louis Rilling<[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Rusty Russell <[email protected]>
LKML-Reference: <4CBB17A8 [email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]

dabusb: remove the BKL

The dabusb device driver is sufficiently serialized using
its own mutex, no need for the big kernel lock here
in addition.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>

sunrpc: remove the big kernel lock

The sunrpc cache_ioctl function does not need the big kernel lock
because it uses its own queue_lock already.

rpc_pipe_ioctl apparently should be using i_lock like the other
operations on the pipe file descriptor do.

Signed-off-by: Arnd Bergmann <[email protected]>

init/main.c: remove BKL notations

According to commit 5e3d20a68f63fc5a310687d81956c3b96e488b84
(init: Remove the BKL from startup code) these sparse notations
should be removed also.

Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>

blktrace: remove the big kernel lock

According to Jens, this code does not need the BKL at all,
it is sufficiently serialized by bd_mutex.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Steven Rostedt <[email protected]>

rtmutex-tester: make it build without BKL

The big kernel lock is going away, so make sure
that if it is disabled by Kconfig, we do not
try to validate it, which would result in
compile errors.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Andrew Morton <[email protected]>

dvb-core: kill the big kernel lock

The dvb core only uses the big kernel lock in the open
and ioctl functions, which means it can be replaced with
a dvb specific mutex. Fortunately, all the ioctl functions
go through dvb_usercopy, so we can move the serialization
in there.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: [email protected]

dvb/bt8xx: kill the big kernel lock

The bt8xx driver only uses the big kernel lock in its dst_ca_ioctl
function and never to serialize against other code, so we can
trivially replace it with a private mutex.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: Mauro Carvalho Chehab <[email protected]>

tlclk: remove big kernel lock

This driver already has a global mutex, so let's just
use that in the open function instead of the BKL.
It may not even be needed there, but this patch should
have the smallest impact.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Mark Gross <[email protected]>

fix rawctl compat ioctls breakage on amd64 and itanic

RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways.

1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by
HANDLE_IOCTL(RAW_SETBIND, raw_ioctl).  The latter is ignored.

2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64
and layouts on i386 and amd64 are _not_ the same.  raw_ioctl() would
work there, but it's never called due to (1).  As it is, i386 /sbin/raw
definitely doesn't work on amd64 boxen.

3) switching to raw_ioctl() as is would *not* work on e.g. sparc64 and ppc64,
which would be rather sad, seeing that normal userland there is 32bit.
The thing is, slapping __packed on the struct in question does not DTRT -
it eliminates *all* padding.  The real solution is to use compat_u64.

4) of course, all that stuff has no business being outside of raw.c in the
first place - there should be ->compat_ioctl() for /dev/rawctl instead of
messing with compat_ioctl.c.

[[email protected]: coding-style fixes]
[[email protected]: port to 2.6.36]
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>

uml: kill big kernel lock

Three uml device drivers still use the big kernel lock,
but all of them can be safely converted to using
a per-driver mutex instead. Most likely this is not
even necessary, so after further review these can
and should be removed as well.

The exec system call no longer requires the BKL either,
so remove it from there, too.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: [email protected]

x86: ioapic: Call free_irte only if interrupt remapping enabled

On a system that support intr-rempping when booting with "intremap=off"

[  177.895501] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
[  177.913316] IP: [<ffffffff8145fc18>] free_irte+0x47/0xc0
...
[  178.173326] Call Trace:
[  178.173574]  [<ffffffff810515b4>] destroy_irq+0x3a/0x75
[  178.192934]  [<ffffffff81051834>] arch_teardown_msi_irq+0xe/0x10
[  178.193418]  [<ffffffff81458dc3>] arch_teardown_msi_irqs+0x56/0x7f
[  178.213021]  [<ffffffff81458e79>] free_msi_irqs+0x8d/0xeb

Call free_irte only when interrupt remapping is enabled.

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <4CBCB274.7010108@kernel.org>
Signed-off-by: Thomas Gleixner <[email protected]>

perf, powerpc: Fix power_pmu_event_init to not use event->ctx

Commit c3f00c70 ("perf: Separate find_get_context() from event
initialization") changed the generic perf_event code to call
perf_event_alloc, which calls the arch-specific event_init code,
before looking up the context for the new event.  Unfortunately,
power_pmu_event_init uses event->ctx->task to see whether the
new event is a per-task event or a system-wide event, and thus
crashes since event->ctx is NULL at the point where
power_pmu_event_init gets called.

(The reason it needs to know whether it is a per-task event is
because there are some hardware events on Power systems which
only count when the processor is not idle, and there are some
fixed-function counters which count such events.  For example,
the "run cycles" event counts cycles when the processor is not
idle.  If the user asks to count cycles, we can use "run cycles"
if this is a per-task event, since the processor is running when
the task is running, by definition.  We can't use "run cycles"
if the user asks for "cycles" on a system-wide counter.)

Fortunately the information we need is in the
event->attach_state field, so we just use that instead.

Signed-off-by: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
LKML-Reference: <20101019055535.GA10398@drongo>
Signed-off-by: Ingo Molnar <[email protected]>
Reported-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'tip/perf/recordmcount-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core

ARM: S5PV310: Fix build error on GPIO map

This patch fixes build error about GPIO address due to
conflict of commit 4d914705 and 19a2c065.

- commit 4d914705: Fix on GPIO base addresses
- commit 19a2c065: Moves initial map for merging S5P64X0

Signed-off-by: Kukjin Kim <[email protected]>

Merge branch 'hotplug' into devel

Conflicts:
arch/arm/kernel/head-common.S

Merge branches 'at91', 'dcache', 'ftrace', 'hwbpt', 'misc', 'mmci', 's3c', 'st-ux' and 'unwind' into devel

ftrace: Remove recursion between recordmcount and scripts/mod/empty

When DYNAMIC_FTRACE is enabled and we use the C version of recordmcount,
all objects are run through the recordmcount program to create a
separate section that stores all the callers of mcount.

The build process has a special file: scripts/mod/empty.o. This is
built from empty.c which is literally an empty file (except for a
single comment). This file is used to find information about the target
elf format, like endianness and word size.

The problem comes up when we need to build recordmcount. The
build process requires that empty.o is built first. The build rules
for empty.o will try to execute recordmcount on the empty.o file.
We get an error that recordmcount does not exist.

To avoid this recursion, the build file will skip running recordmcount
if the file that it is building is script/mod/empty.o.

[ extra comment Suggested-by: Sam Ravnborg <[email protected]> ]

Reported-by: Ingo Molnar <[email protected]>
Tested-by: Ingo Molnar <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: [email protected]
Signed-off-by: Steven Rostedt <[email protected]>

ARM: 6441/1: ux500: The platform is not just based on early drop silicon version.

Update Kconfig text accordingly.

Signed-off-by: srinidhi kasagar <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Signed-off-by: Russell King <[email protected]>

Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus

* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus:
  MIPS: Enable ISA_DMA_API config to fix build failure
  MIPS: 32-bit: Fix build failure in asm/fcntl.h
  MIPS: Remove all generated vmlinuz* files on "make clean"
  MIPS: do_sigaltstack() expects userland pointers
  MIPS: Fix error values in case of bad_stack
  MIPS: Sanitize restart logics
  MIPS: secure_computing, syscall audit: syscall number should in r2, not r0.
  MIPS: Don't block signals if we'd failed to setup a sigframe

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: evdev - fix EVIOCSABS regression
Input: evdev - fix Ooops in EVIOCGABS/EVIOCSABS

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
firewire: ohci: fix TI TSB82AA2 regression since 2.6.35

mxc_nand: do not depend on disabling the irq in the interrupt handler

This patch reverts the driver to enabling/disabling the NFC interrupt
mask rather than enabling/disabling the system interrupt. This cleans
up the driver so that it doesn't rely on interrupts being disabled
within the interrupt handler.

For i.MX21 we keep the current behaviour, that is calling
enable_irq/disable_irq_nosync to enable/disable interrupts. This patch
is based on earlier work by John Ogness.

Signed-off-by: Sascha Hauer <[email protected]>
Acked-by: John Ogness <[email protected]>
Tested-by: John Ogness <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux

* 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux:
i2c-imx: do not allow interruptions when waiting for I2C to complete
i2c-davinci: Fix TX setup for more SoCs

Merge branch 'fixes'

* fixes:
v4l1: fix 32-bit compat microcode loading translation
De-pessimize rds_page_copy_user

sched: Export account_system_vtime()

KVM uses it for example:

ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!

Cc: Venkatesh Pallipadi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Call tick_check_idle before __irq_enter

When CPU is idle and on first interrupt, irq_enter calls tick_check_idle()
to notify interruption from idle. But, there is a problem if this call
is done after __irq_enter, as all routines in __irq_enter may find
stale time due to yet to be done tick_check_idle.

Specifically, trace calls in __irq_enter when they use global clock and also
account_system_vtime change in this patch as it wants to use sched_clock_cpu()
to do proper irq timing.

But, tick_check_idle was moved after __irq_enter intentionally to
prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a:

irq: call __irq_enter() before calling the tick_idle_check
Impact: avoid spurious ksoftirqd wakeups

Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_enable/disable would solve both the problems.

Fixed-by: Yong Zhang <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Remove irq time from available CPU power

The idea was suggested by Peter Zijlstra here:

http://marc.info/?l=linux-kernel&m=127476934517534&w=2

irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power piggybacking on
sched_rt_avg_update().

Tested this by keeping CPU X busy with a network intensive task having 75%
oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
cycle soakers on the system. Without this change, there will be two tasks on
each CPU. With this change, there is a single task on irq busy CPU X and
remaining 7 tasks are spread around among other 3 CPUs.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Do not account irq time to current task

Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.

Change sched task accounting to acoount only actual task time from
currently running task. Now update_curr(), modifies the delta_exec to
depend on rq->clock_task.

Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing. CPU 3 spending around 35%
time running nc task.

Now, if I run another CPU intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

x86: Add IRQ_TIME_ACCOUNTING

This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it
when TSC is enabled.

This change just enables fine grained irq time accounting, isn't used yet.
Following patches use it for different purposes.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.

This patch just adds the core logic and does not enable this logic yet.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Add a PF flag for ksoftirqd identification

To account softirq time cleanly in scheduler, we need to identify whether
softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
Add PF_KSOFTIRQD for that purpose.

As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
with some other state fields.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Consolidate account_system_vtime extern declaration

Just a minor cleanup patch that makes things easier to the following patches.
No functionality change in this patch.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix softirq time accounting

Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:

http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between whether softirq is
being processed or is it just that bh has been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well. As account_system_time() in normal tick based accouting also uses
softirq_count, which will be set even when not in softirq with bh disabled.

Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286237003 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Drop group_capacity to 1 only if local group has extra capacity

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already under-utilized
(possible with a large weight task outweighs the tasks on the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
  nice0 task (never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, ie. nr_running < group_capacity. This patch preserves the
original intent of the prefer_siblings check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.

It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
  group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
  likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
  capacity, we do not pick the group with the niced task as the busiest group.
  This prevents failed balances, active migration and the under-utilization
  described above.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1287173550 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Force balancing on newidle balance if local group has capacity

This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.

Under certain situations, such as a niced down task (i.e. nice = -15) in the
presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.

Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_NEWIDLE_BALANCE.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1287173550 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Set group_imb only a task can be pulled from the busiest cpu

When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence do not consider this
group to be imbalanced.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286996978 [email protected]>
[ small code readability edits ]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Do not consider SCHED_IDLE tasks to be cache hot

This patch adds a check in task_hot to return if the task has SCHED_IDLE
policy. SCHED_IDLE tasks have very low weight, and when run with regular
workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1287173550 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

jump_label: Add COND_STMT(), reducer wrappery

The use of the JUMP_LABEL() construct ends up creating endless silly
wrappers, create a higher level construct to reduce this clutter.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Paul Mackerras <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>

perf: Optimize sw events

Acked-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>

perf: Use jump_labels to optimize the scheduler hooks

Trades a call + conditional + ret for an unconditional jmp.

Acked-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <20101014203625.501657727@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

jump_label: Add atomic_t interface

Add an interface to allow usage of jump_labels with atomic counters.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
LKML-Reference: <20101014203625.501657727@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

jump_label: Use more consistent naming

Now that there's still only a few users around, rename things to make
them more consistent.

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <20101014203625.448565169@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

perf, hw_breakpoint: Fix crash in hw_breakpoint creation

hw_breakpoint creation needs to account stuff per-task to ensure there
is always sufficient hardware resources to back these things due to
ptrace.

With the perf per pmu context changes the event initialization no
longer has access to the event context, for the simple reason that we
need to first find the pmu (result of initialization) before we can
find the context.

This makes hw_breakpoints unhappy, because it can no longer do per
task accounting, cure this by frobbing a task pointer in the event::hw
bits for now...

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
LKML-Reference: <20101014203625.391543667@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

perf: Find task before event alloc

So that we can pass the task pointer to the event allocation, so that
we can use task associated data during event initialization.

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <20101014203625.340789919@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

perf: Fix task refcount bugs

Currently it looks like find_lively_task_by_vpid() takes a task ref
and relies on find_get_context() to drop it.

The problem is that perf_event_create_kernel_counter() shouldn't be
dropping task refs.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Acked-by: Matt Helsley <[email protected]>
LKML-Reference: <20101014203625.278436085@chello.nl>
Signed-off-by: Ingo Molnar <[email protected]>

perf: Fix group moving

Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take
the move_group path in perf_event_open().

Since we cannot de-construct the group (we rely on it to move the events), we
have to simply ignore the double attach. The group state is context invariant
and doesn't need changing.

Reported-by: Matt Fleming <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1287135757.29097.1368.camel@twins>
Signed-off-by: Ingo Molnar <[email protected]>

irq_work: Add generic hardirq context callbacks

Provide a mechanism that allows running code in IRQ context. It is
most useful for NMI code that needs to interact with the rest of the
system -- like wakeup a task to drain buffers.

Perf currently has such a mechanism, so extract that and provide it as
a generic feature, independent of perf so that others may also
benefit.

The IRQ context callback is generated through self-IPIs where
possible, or on architectures like powerpc the decrementer (the
built-in timer facility) is set to generate an interrupt immediately.

Architectures that don't have anything like this get to do with a
callback from the timer tick. These architectures can call
irq_work_run() at the tail of any IRQ handlers that might enqueue such
work (like the perf IRQ handler) to avoid undue latencies in
processing the work.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Kyle McMartin <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
[ various fixes ]
Signed-off-by: Huang Ying <[email protected]>
LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
Signed-off-by: Ingo Molnar <[email protected]>

perf_events: Fix transaction recovery in group_sched_in()

The group_sched_in() function uses a transactional approach to schedule
a group of events. In a group, either all events can be scheduled or
none are. To schedule each event in, the function calls event_sched_in().
In case of error, event_sched_out() is called on each event in the group.

The problem is that event_sched_out() does not completely cancel the
effects of event_sched_in(). Furthermore event_sched_out() changes the
state of the event as if it had run which is not true is this particular
case.

Those inconsistencies impact time tracking fields and may lead to events
in a group not all reporting the same time_enabled and time_running values.
This is demonstrated with the example below:

$ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
  11423 baclears (32.85% scaling, ena=829181, run=556827)
   7671 baclears (0.00% scaling, ena=556827, run=556827)

2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
  11705 baclears (57.83% scaling, ena=962822, run=405995)
  11705 baclears (57.83% scaling, ena=962822, run=405995)

Notice that in the first group, the last baclears event does not
report the same timings as its siblings.

This issue comes from the fact that tstamp_stopped is updated
by event_sched_out() as if the event had actually run.

To solve the issue, we must ensure that, in case of error, there is
no change in the event state whatsoever. That means timings must
remain as they were when entering group_sched_in().

To do this we defer updating tstamp_running until we know the
transaction succeeded. Therefore, we have split event_sched_in()
in two parts separating the update to tstamp_running.

Similarly, in case of error, we do not want to update tstamp_stopped.
Therefore, we have split event_sched_out() in two parts separating
the update to tstamp_stopped.

With this patch, we now get the following output:

$ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
  11243 baclears (71.75% scaling, ena=1093330, run=308841)
  11243 baclears (71.75% scaling, ena=1093330, run=308841)

1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
   9253 baclears (0.00% scaling, ena=784489, run=784489)
   9253 baclears (0.00% scaling, ena=784489, run=784489)

Note that the uneven timing between groups is a side effect of
the process spending most of its time sleeping, i.e., not enough
event rotations (but that's a separate issue).

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <4cb86b4c.41e9d80a [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

perf_events: Fix bogus AMD64 generic TLB events

PERF_COUNT_HW_CACHE_DTLB:READ:MISS had a bogus umask value of 0 which
counts nothing. Needed to be 0x7 (to count all possibilities).

PERF_COUNT_HW_CACHE_ITLB:READ:MISS had a bogus umask value of 0 which
counts nothing. Needed to be 0x3 (to count all possibilities).

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Robert Richter <[email protected]>
Cc: <[email protected]> # as far back as it applies
LKML-Reference: <4cb85478.41e9d80a [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

perf_events: Fix bogus context time tracking

You can only call update_context_time() when the context
is active, i.e., the thread it is attached to is still running.

However, perf_event_read() can be called even when the context
is inactive, e.g., user read() the counters. The call to
update_context_time() must be conditioned on the status of
the context, otherwise, bogus time_enabled, time_running may
be returned. Here is an example on AMD64. The task program
is an example from libpfm4. The -p prints deltas every 1s.

$ task -p -e cpu_clk_unhalted sleep 5
    2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)

Whereas if you don't read deltas, e.g., no call to perf_event_read() until
the process terminates:

$ task -e cpu_clk_unhalted sleep 5
    2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)

Notice that time_enable, time_running are bogus in the first example
causing bogus scaling.

This patch fixes the problem, by conditionally calling update_context_time()
in perf_event_read().

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
LKML-Reference: <4cb856dc.51edd80a [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

lockdep: Check the depth of subclass

Current look_up_lock_class() doesn't check the parameter "subclass".
This rarely rises problems because the main caller of this function,
register_lock_class(), checks it.

But register_lock_class() is not the only function which calls
look_up_lock_class(). lock_set_class() and its callees also call it.
And lock_set_class() doesn't check this parameter.

This will rise problems when the the value of subclass is larger than
MAX_LOCKDEP_SUBCLASSES. Because the address (used as the key of class)
caliculated with too large subclass has a probability to point
another key in different lock_class_key.

Of course this problem depends on the memory layout and
occurs with really low probability.

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Vojtech Pavlik <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286958626 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

lockdep: Add improved subclass caching

Current lockdep_map only caches one class with subclass == 0,
and looks up hash table of classes when subclass != 0.

It seems that this has no problem because the case of
subclass != 0 is rare. But locks of struct rq are
acquired with subclass == 1 when task migration is executed.
Task migration is high frequent event, so I modified lockdep
to cache subclasses.

I measured the score of perf bench sched messaging.
This patch has slightly but certain (order of milli seconds
or 10 milli seconds) effect when lots of tasks are running.
I'll show the result in the tail of this description.

NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
cached in the instances of lockdep_map.
I discussed with Peter Zijlstra in LinuxCon Japan about
this approach and he taught me that caching every subclasses(8)
is cleary waste of memory. So number of cached classes
should be configurable.

=== Score comparison of benchmarks ===
# "min" means best score, and "max" means worst score

for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done

before: min: 0.565000, max: 0.583000, avg: 0.572500
after: min: 0.559000, max: 0.568000, avg: 0.563300

# with more processes
for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done

before: min: 2.274000, max: 2.298000, avg: 2.286300
after: min: 2.242000, max: 2.270000, avg: 2.259700

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286269311 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'linus' into core/locking

Merge reason: Update to almost-final-.36

Signed-off-by: Ingo Molnar <[email protected]>

sched: Drop all load weight manipulation for RT tasks

Load weights are for the CFS, they do not belong in the RT task. This makes all
RT scheduling classes leave the CFS weights alone.

This fixes a real bug as well: I noticed the following phonomena: a process
elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
inserted by set_load_weight() remains at 0, giving the task insignificat
priority.

With this fix, the weight is reset to what the task had before being elevated
to SCHED_RR/SCHED_FIFO.

Cc: Lennart Poettering <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Walleij <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1286807811 [email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Create special class for stop/migrate work

In order to separate the stop/migrate work thread from the SCHED_FIFO
implementation, create a special class for it that is of higher priority than
SCHED_FIFO itself.

This currently solves a problem where cpu-hotplug consumes so much cpu-time
that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
timer pending on the now dead cpu.

It is also required for when we add the planned deadline scheduling class above
SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks.

Tested-by: Heiko Carstens <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1285165776.2275.1022.camel@laptop>
Signed-off-by: Ingo Molnar <[email protected]>

sched: Unindent labels

Labels should be on column 0.

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>

MIPS: Enable ISA_DMA_API config to fix build failure

Add ISA_DMA_API config item and select it when GENERIC_ISA_DMA enabled.
This fixes build failure on allmodconfig like following:

CC sound/isa/es18xx.o
sound/isa/es18xx.c: In function 'snd_es18xx_playback1_prepare':
sound/isa/es18xx.c:501:9: error: implicit declaration of function 'snd_dma_program'
sound/isa/es18xx.c: In function 'snd_es18xx_playback_pointer':
sound/isa/es18xx.c:818:3: error: implicit declaration of function 'snd_dma_pointer'
make[3]: *** [sound/isa/es18xx.o] Error 1
make[2]: *** [sound/isa/es18xx.o] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

Signed-off-by: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1717/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: 32-bit: Fix build failure in asm/fcntl.h

CC security/integrity/ima/ima_fs.o
In file included from linux/include/linux/fcntl.h:4:0,
from linux/security/integrity/ima/ima_fs.c:18:
linux/arch/mips/include/asm/fcntl.h:63:2: error: expected specifier-qualifier-list before 'off_t'
make[3]: *** [security/integrity/ima/ima_fs.o] Error 1
make[2]: *** [security/integrity/ima/ima_fs.o] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

Signed-off-by: Namhyung Kim <[email protected]>
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1715/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: Remove all generated vmlinuz* files on "make clean"

[Ralf: I changed the patch to explicitly list all files to be deleted out
of paranoia.]

Signed-off-by: Wu Zhangjin <[email protected]>
Patchwork: http://patchwork.linux-mips.org/patch/1590/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: do_sigaltstack() expects userland pointers

o32 compat does the right thing, native and n32 compat do not...

Signed-off-by: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Patchwork: http://patchwork.linux-mips.org/patch/1700/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: Fix error values in case of bad_stack

We want EFAULT, not -<syscall number>

Signed-off-by: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1699/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: Sanitize restart logics

Put the original syscall number into ->regs[0] when we leave syscall
with error. Use it in restart logics. Everything else will have
it 0 since we pass through SAVE_SOME on all the ways in. Note that
in places like bad_stack and inllegal_syscall we leave it 0 - it's not
restartable.

Signed-off-by: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1698/
Signed-off-by: Ralf Baechle <[email protected]>

MIPS: secure_computing, syscall audit: syscall number should in r2, not r0.

As it is, audit_syscall_entry() and secure_computing() get the
bogus value (0, in fact)

Signed-off-by: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/1697/
Signed-off-by: Ralf Baechle <[email protected]>