Git Repo - linux.git/log

KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU

Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/vPMU: introduce pmu.h header

This will be used for private function used by AMD- and Intel-specific
PMU implementations.

Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/vPMU: rename a few PMU functions

Before introducing a pmu.h header for them, make the naming more
consistent.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: do not map huge page for non-consistent range

Based on Intel's SDM, mapping huge page which do not have consistent
memory cache for each 4k page will cause undefined behavior

In order to avoiding this kind of undefined behavior, we force to use
4k pages under this case

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type

mtrr_for_each_mem_type() is ready now, use it to simplify
kvm_mtrr_get_guest_memory_type()

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: introduce mtrr_for_each_mem_type

It walks all MTRRs and gets all the memory cache type setting for the
specified range also it checks if the range is fully covered by MTRRs

Signed-off-by: Xiao Guangrong <[email protected]>
[Adjust for range_size->range_shift change. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: introduce fixed_mtrr_addr_* functions

Two functions are introduced:
- fixed_mtrr_addr_to_seg() translates the address to the fixed
MTRR segment

- fixed_mtrr_addr_seg_to_range_index() translates the address to
the index of kvm_mtrr.fixed_ranges[]

They will be used in the later patch

Signed-off-by: Xiao Guangrong <[email protected]>
[Adjust for range_size->range_shift change. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: sort variable MTRRs

Sort all valid variable MTRRs based on its base address, it will help us to
check a range to see if it's fully contained in variable MTRRs

Signed-off-by: Xiao Guangrong <[email protected]>
[Fix list insertion sort, simplify var_mtrr_range_is_valid to just
test the V bit. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: introduce var_mtrr_range

It gets the range for the specified variable MTRR

Signed-off-by: Xiao Guangrong <[email protected]>
[Simplify boolean operations. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: introduce fixed_mtrr_segment table

This table summarizes the information of fixed MTRRs and introduce some APIs
to abstract its operation which helps us to clean up the code and will be
used in later patches

Signed-off-by: Xiao Guangrong <[email protected]>
[Change range_size to range_shift, in order to avoid udivdi3 errors.
- Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: improve kvm_mtrr_get_guest_memory_type

- kvm_mtrr_get_guest_memory_type() only checks one page in MTRRs so
   that it's unnecessary to check to see if the range is partially
   covered in MTRR

- optimize the check of overlap memory type and add some comments
   to explain the precedence

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: do not split 64 bits MSR content

Variable MTRR MSRs are 64 bits which are directly accessed with full length,
no reason to split them to two 32 bits

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: clean up mtrr default type

Drop kvm_mtrr->enable, omit the decode/code workload and get rid of
all the hard code

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: exactly define the size of variable MTRRs

Only KVM_NR_VAR_MTRR variable MTRRs are available in KVM guest

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: remove mtrr_state.have_fixed

vMTRR does not depend on any host MTRR feature and fixed MTRRs have always
been implemented, so drop this field

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: MTRR: handle MSR_MTRRcap in kvm_mtrr_get_msr

MSR_MTRRcap is a MTRR msr so move the handler to the common place, also
add some comments to make the hard code more readable

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: move MTRR related code to a separate file

MTRR code locates in x86.c and mmu.c so that move them to a separate file to
make the organization more clearer and it will be the place where we fully
implement vMTRR

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: fix CR0.CD virtualization

Currently, CR0.CD is not checked when we virtualize memory cache type for
noncoherent_dma guests, this patch fixes it by :

- setting UC for all memory if CR0.CD = 1
- zapping all the last sptes in MMU if CR0.CD is changed

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nSVM: Check for NRIPS support before updating control field

If hardware doesn't support DecodeAssist - a feature that provides
more information about the intercept in the VMCB, KVM decodes the
instruction and then updates the next_rip vmcb control field.
However, NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS].
Since skip_emulated_instruction() doesn't verify nrip support
before accepting control.next_rip as valid, avoid writing this
field if support isn't present.

Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: fix checkpatch.pl errors in kvm/coalesced_mmio.h

Tabs rather than spaces

Signed-off-by: Kevin Mulvey <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: fix checkpatch.pl errors in kvm/async_pf.h

fix brace spacing

Signed-off-by: Kevin Mulvey <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: irqchip: Break up high order allocations of kvm_irq_routing_table

The allocation size of the kvm_irq_routing_table depends on
the number of irq routing entries because they are all
allocated with one kzalloc call.

When the irq routing table gets bigger this requires high
order allocations which fail from time to time:

qemu-kvm: page allocation failure: order:4, mode:0xd0

This patch fixes this issue by breaking up the allocation of
the table and its entries into individual kzalloc calls.
These could all be satisfied with order-0 allocations, which
are less likely to fail.

The downside of this change is the lower performance, because
of more calls to kzalloc. But given how often kvm_set_irq_routing
is called in the lifetime of a guest, it doesn't really
matter much.

Signed-off-by: Joerg Roedel <[email protected]>
[Avoid sparse warning through rcu_access_pointer. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvm-arm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM changes for v4.2:

- Proper guest time accounting
- FP access fix for 32bit
- The usual pile of GIC fixes
- PSCI fixes
- Random cleanups

Merge branch 'ccf/atmel-fixes-for-4.1' of https://github.com/bbrezillon/linux-at91 into clk-fixes

crypto: marvell/cesa - add DT bindings documentation

Add DT bindings documentation for the new marvell-cesa driver.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add support for Kirkwood and Dove SoCs

Add the Kirkwood and Dove SoC descriptions, and control the allhwsupport
module parameter to avoid probing the CESA IP when the old CESA driver is
enabled (unless it is explicitly requested to do so).

Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add support for Orion SoCs

Add the Orion SoC description, and select this implementation by default
to support non-DT probing: Orion is the only platform where non-DT boards
are declaring the CESA block.

Control the allhwsupport module parameter to avoid probing the CESA IP when
the old CESA driver is enabled (unless it is explicitly requested to do
so).

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add allhwsupport module parameter

The old and new marvell CESA drivers both support Orion and Kirkwood SoCs.
Add a module parameter to choose whether these SoCs should be attached to
the new or the old driver.

The default policy is to keep attaching those IPs to the old driver if it
is enabled, until we decide the new CESA driver is stable/secure enough.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add support for all armada SoCs

Add CESA IP description for all the missing armada SoCs (XP, 375 and 38x).

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add SHA256 support

Add support for SHA256 operations.

Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add MD5 support

Add support for MD5 operations.

Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add Triple-DES support

Add support for Triple-DES operations.

Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add DES support

Add support for DES operations.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add TDMA support

The CESA IP supports CPU offload through a dedicated DMA engine (TDMA)
which can control the crypto block.
When you use this mode, all the required data (operation metadata and
payload data) are transferred using DMA, and the results are retrieved
through DMA when possible (hash results are not retrieved through DMA yet),
thus reducing the involvement of the CPU and providing better performances
in most cases (for small requests, the cost of DMA preparation might
exceed the performance gain).

Note that some CESA IPs do not embed this dedicated DMA, hence the
activation of this feature on a per platform basis.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: marvell/cesa - add a new driver for Marvell's CESA

The existing mv_cesa driver supports some features of the CESA IP but is
quite limited, and reworking it to support new features (like involving the
TDMA engine to offload the CPU) is almost impossible.
This driver has been rewritten from scratch to take those new features into
account.

This commit introduce the base infrastructure allowing us to add support
for DMA optimization.
It also includes support for one hash (SHA1) and one cipher (AES)
algorithm, and enable those features on the Armada 370 SoC.

Other algorithms and platforms will be added later on.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Arnaud Ebalard <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: mv_cesa - explicitly define kirkwood and dove compatible strings

We are about to add a new driver to support new features like using the
TDMA engine to offload the CPU.
Orion, Dove and Kirkwood platforms are already using the mv_cesa driver,
but Orion SoCs do not embed the TDMA engine, which means we will have to
differentiate them if we want to get TDMA support on Dove and Kirkwood.
In the other hand, the migration from the old driver to the new one is not
something all people are willing to do without first auditing the new
driver.
Hence we have to support the new compatible in the mv_cesa driver so that
new platforms with updated DTs can still attach their crypto engine device
to this driver.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: mv_cesa - use gen_pool to reserve the SRAM memory region

The mv_cesa driver currently expects the SRAM memory region to be passed
as a platform device resource.

This approach implies two drawbacks:
- the DT representation is wrong
- the only one that can access the SRAM is the crypto engine

The last point is particularly annoying in some cases: for example on
armada 370, a small region of the crypto SRAM is used to implement the
cpuidle, which means you would not be able to enable both cpuidle and the
CESA driver.

To address that problem, we explicitly define the SRAM device in the DT
and then reference the sram node from the crypto engine node.

Also note that the old way of retrieving the SRAM memory region is still
supported, or in other words, backward compatibility is preserved.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: mv_cesa - document the clocks property

On Dove platforms, the crypto engine requires a clock. Document this
clocks property in the mv_cesa bindings doc.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

Merge branch 'mvebu/drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Merge the mvebu/drivers branch of the arm-soc tree which contains
just a single patch bfa1ce5f38938cc9e6c7f2d1011f88eba2b9e2b2 ("bus:
mvebu-mbus: add mv_mbus_dram_info_nooverlap()") that happens to be
a prerequisite of the new marvell/cesa crypto driver.

x86/boot: Fix overflow warning with 32-bit binutils

When building the kernel with 32-bit binutils built with support
only for the i386 target, we get the following warning:

arch/x86/kernel/head_32.S:66: Warning: shift count out of range (32 is not between 0 and 31)

The problem is that in that case, binutils' internal type
representation is 32-bit wide and the shift range overflows.

In order to fix this, manipulate the shift expression which
creates the 4GiB constant to not overflow the shift count.

Suggested-by: Michael Matz <[email protected]>
Reported-and-tested-by: Enrico Mioso <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

arm64: vdso: work-around broken ELF toolchains in Makefile

When building the kernel with a bare-metal (ELF) toolchain, the -shared
option may not be passed down to collect2, resulting in silent corruption
of the vDSO image (in particular, the DYNAMIC section is omitted).

The effect of this corruption is that the dynamic linker fails to find
the vDSO symbols and libc is instead used for the syscalls that we
intended to optimise (e.g. gettimeofday). Functionally, there is no
issue as the sigreturn trampoline is still intact and located by the
kernel.

This patch fixes the problem by explicitly passing -shared to the linker
when building the vDSO.

Cc: <[email protected]>
Reported-by: Szabolcs Nagy <[email protected]>
Reported-by: James Greenlaigh <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>

clk: at91: fix h32mx prototype inclusion in pmc header

Trivial fix that prevents to compile this pmc clock driver if h32mx clock is
present but smd clock isn't.

Signed-off-by: Nicolas Ferre <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Fixes: bcc5fd49a0fd ("clk: at91: add a driver for the h32mx clock")
Cc: <[email protected]> # 3.18+

clk: at91: trivial: typo in peripheral clock description

Signed-off-by: Nicolas Ferre <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>

arm64: kernel: rename __cpu_suspend to keep it aligned with arm

This patch renames __cpu_suspend to cpu_suspend so that it's aligned
with ARM32. It also removes the redundant wrapper created.

This is in preparation to implement generic PSCI system suspend using
the cpu_{suspend,resume} which now has the same interface on both ARM
and ARM64.

Cc: Mark Rutland <[email protected]>
Reviewed-by: Lorenzo Pieralisi <[email protected]>
Reviewed-by: Ashwin Chaugule <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>

timer: Minimize nohz off overhead

If nohz is disabled on the kernel command line the [hr]timer code
still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
avoid the overhead.

Before:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.73%  hog       [.] main
  15.36%  [kernel]  [k] _raw_spin_lock_irqsave
   9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.61%  [kernel]  [k] lock_timer_base.isra.38
   6.42%  [kernel]  [k] mod_timer
   3.90%  [kernel]  [k] detach_if_pending
   3.76%  [kernel]  [k] del_timer
   2.41%  [kernel]  [k] internal_add_timer
   1.39%  [kernel]  [k] __internal_add_timer
   0.76%  [kernel]  [k] timerfn

We probably should have a cached value for nohz full in the per cpu
bases as well to avoid the cpumask check. The base cache line is hot
already, the cpumask not necessarily.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timer: Reduce timer migration overhead if disabled

Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timer: Stats: Simplify the flags handling

Simplify the handling of the flag storage for the timer statistics. No
intermediate storage anymore. Just hand over the flags field.

I left the printout of 'deferrable' for now because changing this
would be an ABI update and I have no idea how strong people feel about
that. OTOH, I wonder whether we should kill the whole timer stats
stuff because all of that information can be retrieved via ftrace/perf
as well.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timer: Replace timer base by a cpu index

Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.

Aside of that this allows to get rid of following nuisances:

- boot_tvec_base

   That statically allocated 4k bss data is just kept around so the
   timer has a home when it gets statically initialized. It serves no
   other purpose.

   With the CPU index we assign the timer to CPU0 at static
   initialization time and therefor can avoid the whole boot_tvec_base
   dance.  That also simplifies the init code, which just can use the
   per cpu base.

   Before:
     text    data     bss     dec     hex filename
    17491    9201    4160   30852    7884 ../build/kernel/time/timer.o
   After:
     text    data     bss     dec     hex filename
    17440    9193       0   26633    6809 ../build/kernel/time/timer.o

- Overloading the base pointer with various flags

   The CPU index has enough space to hold the flags (deferrable,
   irqsafe) so we can get rid of the extra masking and bit fiddling
   with the base pointer.

As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.

This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.

We might have used bitfields for this, but that would change the
static initializers and the init function for no value to accomodate
big endian bitfields.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Badhri Jagan Sridharan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timer: Use hlist for the timer wheel hash buckets

This reduces the size of struct tvec_base by 50% and results in
slightly smaller code as well.

Before:
   struct tvec_base: size: 8256, cachelines: 129

   text    data     bss     dec     hex filename
  17698   13297    8256   39251    9953 ../build/kernel/time/timer.o

After:
  struct tvec_base: 4160, cachelines: 65

   text    data     bss     dec     hex filename
  17491    9201    4160   30852    7884 ../build/kernel/time/timer.o

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timer: Remove FIFO "guarantee"

The FIFO guarantee is only there if two timers are queued into the
same bucket at the same jiffie on the same cpu:

- The slack value depends on the delta between expiry and enqueue
   time, so the resulting expiry time can be different for timers
   which are queued in different jiffies.

- Timers which are queued into the secondary array end up after a
   later queued timer which was queued into the primary array due to
   cascading.

- Timers can end up on different cpus due to the NOHZ target moving
   around. Obviously there is no guarantee of expiry ordering between
   cpus.

So anything which relies on FIFO behaviour of the timer wheel is
broken already.

This is a preparatory patch for converting the timer wheel to hlist
which reduces the memory foot print of the wheel by 50%.

It's a seperate patch so any (unlikely to happen) regression caused by
this can be identified clearly.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Cc: George Spelvin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

timers: Sanitize catchup_timer_jiffies() usage

catchup_timer_jiffies() has been applied blindly to several functions
without looking for possible better ways to do it.

1) internal_add_timer()

   Move the update to base->all_timers before we actually insert the
   timer into the wheel.

2) detach_if_pending()

   Again the update to base->all_timers allows us to explicitely do
   the timer_jiffies update in place, if this was the last timer which
   got removed.

3) __run_timers()

   We only check on entry, which is silly, because base->timer_jiffies
   can be behind - especially on NOHZ kernels - and if there is a
   single deferrable timer somewhere between base->timer_jiffies and
   jiffies we expire it and then loop until base->timer_jiffies ==
   jiffies.

   Move it into the loop.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

clk: at91: fix PERIPHERAL_MAX_SHIFT definition

Fix the PERIPHERAL_MAX_SHIFT definition (3 instead of 4) and adapt the
round_rate and set_rate logic accordingly.

Signed-off-by: Boris Brezillon <[email protected]>
Reported-by: "Wu, Songjun" <[email protected]>

clk: at91: pll: fix input range validity check

The PLL impose a certain input range to work correctly, but it appears that
this input range does not apply on the input clock (or parent clock) but
on the input clock after it has passed the PLL divisor.
Fix the implementation accordingly.

Cc: <[email protected]> # v3.14+
Signed-off-by: Boris Brezillon <[email protected]>
Reported-by: Jonas Andersson <[email protected]>

power_supply: Correct kerneldoc copy paste errors

Signed-off-by: Bjorn Andersson <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Sebastian Reichel <[email protected]>

sched/deadline: Remove needless parameter in dl_runtime_exceeded()

Sine commit 269ad8015a6b ("sched/deadline: Avoid double-accounting in
case of missed deadlines), parameter 'rq' is no longer used, so
remove it.

Signed-off-by: Zhiqiang Zhang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Remove superfluous resetting of the p->dl_throttled flag

Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going
to be boosted) is superfluous, as the natural place to do so is in
replenish_dl_entity().

If the task was on the runqueue and it is boosted by a DL task, it will be enqueued
back with ENQUEUE_REPLENISH flag set, which can guarantee that dl_throttled is
reset in replenish_dl_entity().

This patch drops the resetting of throttled status in function rt_mutex_setprio().

Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/deadline: Drop duplicate init_sched_dl_class() declaration

There are two init_sched_dl_class() declarations, this patch drops
the duplicate.

Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds a check that prevents futile attempts to move DL tasks
to a CPU with active tasks of equal or earlier deadline. The same
behavior as commit 80e3d87b2c55 ("sched/rt: Reduce rq lock contention
by eliminating locking of non-feasible target") for rt class.

Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/deadline: Make init_sched_dl_class() __init

It's a bootstrap function, make init_sched_dl_class() __init.

Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/deadline: Optimize pull_dl_task()

pull_dl_task() uses pick_next_earliest_dl_task() to select a migration
candidate; this is sub-optimal since the next earliest task -- as per
the regular runqueue -- might not be migratable at all. This could
result in iterating the entire runqueue looking for a task.

Instead iterate the pushable queue -- this queue only contains tasks
that have at least 2 cpus set in their cpus_allowed mask.

Signed-off-by: Wanpeng Li <[email protected]>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/preempt: Add static_key() to preempt_notifiers

Avoid touching the curr->preempt_notifier cacheline when not needed.

Provides a small improvement on pipe-bench:

  taskset 01 perf stat --repeat 10 -- perf bench sched pipe

before:

Performance counter stats for 'perf bench sched pipe' (10 runs):

      12385.016204      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.34% )
         2,000,023      context-switches          #    0.161 M/sec                    ( +-  0.00% )
                 0      cpu-migrations            #    0.000 K/sec
               175      page-faults               #    0.014 K/sec                    ( +-  0.26% )
    41,376,162,250      cycles                    #    3.341 GHz                      ( +-  0.11% )
    17,389,139,321      stalled-cycles-frontend   #   42.03% frontend cycles idle     ( +-  0.25% )
   <not supported>      stalled-cycles-backend
    68,788,588,003      instructions              #    1.66  insns per cycle
                                                  #    0.25  stalled cycles per insn  ( +-  0.02% )
    13,449,387,620      branches                  # 1085.940 M/sec                    ( +-  0.02% )
        20,880,690      branch-misses             #    0.16% of all branches          ( +-  0.98% )

      12.372646094 seconds time elapsed                                          ( +-  0.34% )

after:

Performance counter stats for 'perf bench sched pipe' (10 runs):

      12180.936528      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.33% )
         2,000,077      context-switches          #    0.164 M/sec                    ( +-  0.00% )
                 0      cpu-migrations            #    0.000 K/sec
               174      page-faults               #    0.014 K/sec                    ( +-  0.27% )
    40,691,545,577      cycles                    #    3.341 GHz                      ( +-  0.06% )
    16,446,333,371      stalled-cycles-frontend   #   40.42% frontend cycles idle     ( +-  0.18% )
   <not supported>      stalled-cycles-backend
    68,570,100,387      instructions              #    1.69  insns per cycle
                                                  #    0.24  stalled cycles per insn  ( +-  0.01% )
    13,389,740,014      branches                  # 1099.237 M/sec                    ( +-  0.01% )
        20,175,440      branch-misses             #    0.15% of all branches          ( +-  0.52% )

      12.169253010 seconds time elapsed                                          ( +-  0.33% )

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration

preempt_notifier_unregister() documents:

"This is safe to call from within a preemption notifier."

However, both fire_sched_in_preempt_notifiers() and
fire_sched_out_preempt_notifiers() are using hlist_for_each_entry(),
which is not safe against entry removal during iteration.

Inspection of the KVM code does not reveal any use of
preempt_notifier_unregister() within the preempt notifiers.

Therefore, fix the comment.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/stop_machine: Fix deadlock between multiple stop_two_cpus()

Jiri reported a machine stuck in multi_cpu_stop() with
migrate_swap_stop() as function and with the following src,dst cpu
pairs: {11,  4} {13, 11} { 4, 13}

                        4       11      13

cpuM: queue(4 ,13)
                        *Ma
cpuN: queue(13,11)
                                *N      Na
                        *M              Mb
cpuO: queue(11, 4)
                        *O      Oa
                                *Nb
                        *Ob

Where *X denotes the cpu running the queueing of cpu-X and X[ab] denotes
the first/second queued work.

You'll observe the top of the workqueue for each cpu: 4,11,13 to be work
from cpus: M, O, N resp. IOW. deadlock.

Do away with the queueing trickery and introduce lg_double_lock() to
lock both CPUs and fully serialize the stop_two_cpus() callers instead
of the partial (and buggy) serialization we have now.

Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched

When CONFIG_SCHEDSTATS is enabled, /proc/<pid>/sched prints almost all
sched statistics except sum_sleep_runtime. Since sum_sleep_runtime is
a good info to collect, add this it to /proc/<pid>/sched.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/debug: Replace vruntime with wait_sum in /proc/sched_debug

Within runnable tasks in /proc/sched_debug, vruntime is printed twice,
once as tree-key and again as exec-runtime.

Since exec-runtime isnt populated in !CONFIG_SCHEDSTATS, use this field
to print wait_sum.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched/debug: Properly format runnable tasks in /proc/sched_debug

With !CONFIG_SCHEDSTATS, runnable tasks in /proc/sched_debug has too
many columns than required. Fix this by printing appropriate columns.

While at this, print sum_exec_runtime, since this information is
available even in !CONFIG_SCHEDSTATS case.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

locking/lockdep: Remove hard coded array size dependency

An apparent oversight left a hardcoded '4' in place when
LOCKSTAT_POINTS was introduced.

The contention_point[] and contending_point[] arrays in the
structs lock_class and lock_class_stats need to be the same
size for the loops in lock_stats() to be correct.

This patch allows LOCKSTAT_POINTS to be changed without
affecting the correctness of the code.

Signed-off-by: George Beshers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>

locking/qrwlock: Don't contend with readers when setting _QW_WAITING

The current cmpxchg() loop in setting the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers
causing possibly extra cmpxchg() operations that are wasteful. This
patch changes the code to do a byte cmpxchg() to eliminate contention
with new readers.

A multithreaded microbenchmark running 5M read_lock/write_lock loop
on a 8-socket 80-core Westmere-EX machine running 4.0 based kernel
with the qspinlock patch have the following execution times (in ms)
with and without the patch:

With R:W ratio = 5:1

Threads    w/o patch with patch % change
-------    --------- ---------- --------
   2      990     895   -9.6%
   3     2136    1912 -10.5%
   4     3166    2830 -10.6%
   5     3953    3629   -8.2%
   6     4628    4405   -4.8%
   7     5344    5197   -2.8%
   8     6065    6004   -1.0%
   9     6826    6811   -0.2%
  10     7599    7599    0.0%
  15     9757    9766   +0.1%
  20    13767   13817   +0.4%

With small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

perf/x86: Honor the architectural performance monitoring version

Architectural performance monitoring, version 1, doesn't support fixed counters.

Currently, even if a hypervisor advertises support for architectural
performance monitoring version 1, perf may still try to use the fixed
counters, as the constraints are set up based on the CPU model.

This patch ensures that perf honors the architectural performance monitoring
version returned by CPUID, and it only uses the fixed counters for version 2
and above.

(Some of the ideas in this patch came from Peter Zijlstra.)

Signed-off-by: Imre Palik <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anthony Liguori <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

perf/x86/intel: Fix PMI handling for Intel PT

Intel PT is a separate PMU and it is not using any of the x86_pmu
code paths, which means in particular that the active_events counter
remains intact when new PT events are created.

However, PT uses the generic x86_pmu PMI handler for its PMI handling needs.

The problem here is that the latter checks active_events and in case of it
being zero, exits without calling the actual x86_pmu.handle_nmi(), which
results in unknown NMI errors and massive data loss for PT.

The effect is not visible if there are other perf events in the system
at the same time that keep active_events counter non-zero, for instance
if the NMI watchdog is running, so one needs to disable it to reproduce
the problem.

At the same time, the active_events counter besides doing what the name
suggests also implicitly serves as a PMC hardware and DS area reference
counter.

This patch adds a separate reference counter for the PMC hardware, leaving
active_events for actually counting the events and makes sure it also
counts PT and BTS events.

Signed-off-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

perf/x86/intel/bts: Fix DS area sharing with x86_pmu events

Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu
code in its event_init() path, which is a bug: creating a BTS event while
no x86_pmu events are present results in a NULL pointer dereference.

The same DS area is also used by PEBS sampling, which makes it quite a bit
trickier to have a separate one for intel_bts' purposes.

This patch makes intel_bts driver use the same DS allocation and reference
counting code as x86_pmu to make sure it is always present when either
intel_bts or x86_pmu need it.

Signed-off-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <[email protected]>

perf/x86: Add more Broadwell model numbers

This patch adds additional model numbers for Broadwell to perf.
Support for Broadwell with Iris Pro (Intel Core i7-57xxC)
and support for Broadwell Server Xeon.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

perf: Fix ring_buffer_attach() RCU sync, again

While looking for other users of get_state/cond_sync. I Found
ring_buffer_attach() and it looks obviously buggy?

Don't we need to ensure that we have "synchronize" _between_
list_del() and list_add() ?

IOW. Suppose that ring_buffer_attach() preempts right_after
get_state_synchronize_rcu() and gp completes before spin_lock().

In this case cond_synchronize_rcu() does nothing and we reuse
->rb_entry without waiting for gp in between?

It also moves the ->rcu_pending check under "if (rb)", to make it
more readable imo.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next

Freescale updates from Scott:

"Highlights include more 8xx optimizations, an e6500 hugetlb optimization,
QMan device tree nodes, t1024/t1023 support, and various fixes and
cleanup."

cxl: Fix typo in debug print

Fix typo in debug print. p1_base() should be p2_base(). No change other
than to the debug output.

Signed-off-by: Michael Neuling <[email protected]>
Acked-by: Ian Munsie <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

cxl: Add CXL_KERNEL_API config option

Add CXL_KERNEL_API config option so drivers which depend on this new
functionality won't be enabled until this is visible.

This is useful for merging the cxlflash driver which comes in via the SCSI
tree. The cxlflash driver can depend on CXL_KERNEL_API, hence it won't be
enabled in the SCSI tree until this new config option is merged via the powerpc
tree. Hence all trees will be bisectable at all times.

Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()

When pnv_pci_ioda_fixup() is called during PHB fixup time, each PE in
the sorted list of PEs (phb::pe_dma_list) is iterated to setup the PE's
DMA32 space by pnv_ioda_setup_bus_dma() if the PE's DMA32 weight is bigger
than zero. The function also assigns all the subordinate PCI devices of
the PE's primary bus with the PE's DMA32 IOMMU table. It causes the PCI
devicess in the child PEs, which don't have DMA weight, receives wrong
IOMMU table and then IOMMU group.

The patch fixes above issue by more check on the PE's coverage and don't
assign IOMMU table to those PCI devices, which belong to the child PEs.
The problem was found on Firestone platform initially.

Suggested-by: Gavin Shan <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/mm: Change the swap encoding in pte.

Current swap encoding in pte can't support large pfns
above 4TB. Change the swap encoding such that we put
the swap type in the PTE bits. Also add build checks
to make sure we don't overlap with HPTEFLAGS.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/mm: PTE_RPN_MAX is not used, remove the same

Remove the unused #define

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

powerpc/tm: Abort syscalls in active transactions

This patch changes the syscall handler to doom (tabort) active
transactions when a syscall is made and return very early without
performing the syscall and keeping side effects to a minimum (no CPU
accounting or system call tracing is performed). Also included is a
new HWCAP2 bit, PPC_FEATURE2_HTM_NOSC, to indicate this
behaviour to userspace.

Currently, the system call instruction automatically suspends an
active transaction which causes side effects to persist when an active
transaction fails.

This does change the kernel's behaviour, but in a way that was
documented as unsupported. It doesn't reduce functionality as
syscalls will still be performed after tsuspend; it just requires that
the transaction be explicitly suspended. It also provides a
consistent interface and makes the behaviour of user code
substantially the same across powerpc and platforms that do not
support suspended transactions (e.g. x86 and s390).

Performance measurements using
http://ozlabs.org/~anton/junkcode/null_syscall.c indicate the cost of
a normal (non-aborted) system call increases by about 0.25%.

Signed-off-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>

MAINTAINERS: clarify drivers/crypto/nx/ file ownership

Update the "IBM Power in-Nest Crypto Acceleration" and
"IBM Power 842 compression accelerator" sections to specify the correct
files.

The "IBM Power in-Nest Crypto Acceleration" was originally the only
NX driver, and so its section listed all drivers/crypto/nx/ files,
but now there is also the 842 driver which has its own section. This
lists explicitly what files are owned by the Crypto driver and which
files are owned by the 842 compression driver.

Signed-off-by: Dan Streetman <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: nx - add LE support to pSeries platform driver

Add support to the nx-842-pseries.c driver for running in little endian
mode.

The pSeries platform NX 842 driver currently only works as big endian.
This adds cpu_to_be*() and be*_to_cpu() in the appropriate places to
work in LE mode also.

Signed-off-by: Dan Streetman <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

crypto: caam - Set last bit on src SG list

The new aead_edesc_alloc left out the bit indicating the last
entry on the source SG list. This patch fixes it.

Signed-off-by: Herbert Xu <[email protected]>

crypto: caam - Reintroduce DESC_MAX_USED_BYTES

I incorrectly removed DESC_MAX_USED_BYTES when enlarging the size
of the shared descriptor buffers, thus making it four times larger
than what is necessary. This patch restores the division by four
calculation.

Signed-off-by: Herbert Xu <[email protected]>

crypto: aead - Fix aead_instance struct size

The struct aead_instance is meant to extend struct crypto_instance
by incorporating the extra members of struct aead_alg. However,
the current layout which is copied from shash/ahash does not specify
the struct fully. In particular only aead_alg is present.

For shash/ahash this works because users there add extra headroom
to sizeof(struct crypto_instance) when allocating the instance.
Unfortunately for aead, this bit was lost when the new aead_instance
was added.

Rather than fixing it like shash/ahash, this patch simply expands
struct aead_instance to contain what is supposed to be there, i.e.,
adding struct crypto_instance.

In order to not break existing AEAD users, this is done through an
anonymous union.

Signed-off-by: Herbert Xu <[email protected]>

crypto: api - Add CRYPTO_MINALIGN_ATTR to struct crypto_alg

The struct crypto_alg is embedded into various type-specific structs
such as aead_alg.  This is then used as part of instances such as
struct aead_instance.  It is also embedded into the generic struct
crypto_instance.  In order to ensure that struct aead_instance can
be converted to struct crypto_instance when necessary, we need to
ensure that crypto_alg is aligned properly.

This patch adds an alignment attribute to struct crypto_alg to
ensure this.

Signed-off-by: Herbert Xu <[email protected]>

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c documentation fix from Wolfram Sang:
"Here is a small documentation fix for I2C.

  We already had a user who unsuccessfully tried to get the new slave
  framework running with the currently broken example.  So, before this
  happens again, I'd like to have this how-to-use section fixed for 4.1
  already.  So that no more hacking time is wasted"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: slave: fix the example how to instantiate from userspace

revert "cpumask: don't perform while loop in cpumask_next_and()"

Revert commit 534b483a86e6 ("cpumask: don't perform while loop in
cpumask_next_and()").

This was a minor optimization, but it puts a `struct cpumask' on the
stack, which consumes too much stack space.

Sergey Senozhatsky <[email protected]>
Reported-by: Peter Zijlstra <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Amir Vadai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'drm-intel-fixes-2015-06-18' of git://anongit.freedesktop.org/drm-intel into drm-fixes

one fix, one revert
* tag 'drm-intel-fixes-2015-06-18' of git://anongit.freedesktop.org/drm-intel:
Revert "drm/i915: Don't skip request retirement if the active list is empty"
drm/i915: Always reset vma->ggtt_view.pages cache on unbinding

Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~deathsimple/linux into drm-fixes

two radeon fixes
one MST fix,
one query addition, destined for stable, and to fix a regression
* 'drm-fixes-4.1' of git://people.freedesktop.org/~deathsimple/linux:
drm/radeon: don't probe MST on hw we don't support it on
drm/radeon: Add RADEON_INFO_VA_UNMAP_WORKING query

Merge branches 'pnp' and 'pm-tools'

* pnp:
  PNP / ACPI: use unsigned int in pnpacpi_encode_resources()
  PNP / ACPI: use u8 instead of int in acpi_resource_extended_irq context

* pm-tools:
  cpupower: mperf monitor: fix output in MAX_FREQ_SYSFS mode

Merge branches 'pm-clk', 'pm-domains' and 'powercap'

* pm-clk:
  PM / clk: Print acquired clock name in addition to con_id
  PM / clk: Fix clock error check in __pm_clk_add()
  drivers: sh: remove boilerplate code and use USE_PM_CLK_RUNTIME_OPS
  arm: davinci: remove boilerplate code and use USE_PM_CLK_RUNTIME_OPS
  arm: omap1: remove boilerplate code and use USE_PM_CLK_RUNTIME_OPS
  arm: keystone: remove boilerplate code and use USE_PM_CLK_RUNTIME_OPS
  PM / clock_ops: Provide default runtime ops to users

* pm-domains:
  PM / Domains: Skip timings during syscore suspend/resume

* powercap:
  powercap / RAPL: Support Knights Landing
  powercap / RAPL: Floor frequency setting in Atom SoC

Merge branch 'pm-wakeirq'

* pm-wakeirq:
PM / wakeirq: Fix typo in prototype for dev_pm_set_dedicated_wake_irq
PM / Wakeirq: Add automated device wake IRQ handling

Merge branches 'pm-sleep' and 'pm-runtime'

* pm-sleep:
  PM / sleep: trace_device_pm_callback coverage in dpm_prepare/complete
  PM / wakeup: add a dummy wakeup_source to record statistics
  PM / sleep: Make suspend-to-idle-specific code depend on CONFIG_SUSPEND
  PM / sleep: Return -EBUSY from suspend_enter() on wakeup detection
  PM / tick: Add tracepoints for suspend-to-idle diagnostics
  PM / sleep: Fix symbol name in a comment in kernel/power/main.c
  leds / PM: fix hibernation on arm when gpio-led used with CPU led trigger
  ARM: omap-device: use SET_NOIRQ_SYSTEM_SLEEP_PM_OPS
  bus: omap_l3_noc: add missed callbacks for suspend-to-disk
  PM / sleep: Add macro to define common noirq system PM callbacks
  PM / sleep: Refine diagnostic messages in enter_state()
  PM / wakeup: validate wakeup source before activating it.

* pm-runtime:
  PM / Runtime: Update last_busy in rpm_resume
  PM / runtime: add note about re-calling in during device probe()

Merge branch 'pm-cpufreq'

* pm-cpufreq: (37 commits)
  cpufreq: dt: allow driver to boot automatically
  intel_pstate: Fix overflow in busy_scaled due to long delay
  cpufreq: qoriq: optimize the CPU frequency switching time
  cpufreq: gx-suspmod: Fix two typos in two comments
  cpufreq: nforce2: Fix typo in comment to function nforce2_init()
  cpufreq: governor: Serialize governor callbacks
  cpufreq: governor: split cpufreq_governor_dbs()
  cpufreq: governor: register notifier from cs_init()
  cpufreq: Remove cpufreq_update_policy()
  cpufreq: Restart governor as soon as possible
  cpufreq: Call cpufreq_policy_put_kobj() from cpufreq_policy_free()
  cpufreq: Initialize policy->kobj while allocating policy
  cpufreq: Stop migrating sysfs files on hotplug
  cpufreq: Don't allow updating inactive policies from sysfs
  intel_pstate: Force setting target pstate when required
  intel_pstate: change some inconsistent debug information
  cpufreq: Track cpu managing sysfs kobjects separately
  cpufreq: Fix for typos in two comments
  cpufreq: Mark policy->governor = NULL for inactive policies
  cpufreq: Manage governor usage history with 'policy->last_governor'
  ...

Merge branch 'pm-cpuidle'

* pm-cpuidle:
  cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
  cpuidle: Select a different state on tick_broadcast_enter() failures
  sched / idle: Call default_idle_call() from cpuidle_enter_state()
  sched / idle: Call idle_set_state() from cpuidle_enter_state()
  cpuidle: Fix the kerneldoc comment for cpuidle_enter_state()
  sched / idle: Eliminate the "reflect" check from cpuidle_idle_call()
  cpuidle: Check the sign of index in cpuidle_reflect()
  sched / idle: Move the default idle call code to a separate function

Merge branch 'acpi-cca'

* acpi-cca:
  ufs: fix TRUE and FALSE re-define build error
  megaraid_sas: fix TRUE and FALSE re-define build error
  amd-xgbe: Unify coherency checking logic with device_dma_is_coherent()
  crypto: ccp - Unify coherency checking logic with device_dma_is_coherent()
  device property: Introduces device_dma_is_coherent()
  arm64 : Introduce support for ACPI _CCA object
  ACPI / scan: Parse _CCA and setup device coherency

Merge branch 'acpi-video'

* acpi-video: (38 commits)
  ACPI / video: Make acpi_video_unregister_backlight() private
  acpi-video-detect: Remove old API
  toshiba-acpi: Port to new backlight interface selection API
  thinkpad-acpi: Port to new backlight interface selection API
  sony-laptop: Port to new backlight interface selection API
  samsung-laptop: Port to new backlight interface selection API
  msi-wmi: Port to new backlight interface selection API
  msi-laptop: Port to new backlight interface selection API
  intel-oaktrail: Port to new backlight interface selection API
  ideapad-laptop: Port to new backlight interface selection API
  fujitsu-laptop: Port to new backlight interface selection API
  eeepc-laptop: Port to new backlight interface selection API
  dell-wmi: Port to new backlight interface selection API
  dell-laptop: Port to new backlight interface selection API
  compal-laptop: Port to new backlight interface selection API
  asus-wmi: Port to new backlight interface selection API
  asus-laptop: Port to new backlight interface selection API
  apple-gmux: Port to new backlight interface selection API
  acer-wmi: Port to new backlight interface selection API
  ACPI / video: Fix acpi_video _register vs _unregister_backlight race
  ...

Merge branches 'acpi-battery' and 'acpi-processor'

* acpi-battery:
  ACPI / battery: mark DMI table as __initconst
  ACPI / battery: minor tweaks to acpi_battery_units()
  ACPI / battery: constify the offset tables
  ACPI / battery: ensure acpi_battery_init() has finish
  ACPI / battery: drop useless return statements
  ACPI / battery: abort initialization earlier if acpi_disabled

* acpi-processor:
  ACPI / processor: constify DMI system id table
  ACPI / processor: Introduce invalid_phys_cpuid()
  ACPI / processor: return specific error instead of -1
  ACPI / processor: remove phys_id in acpi_processor_get_info()
  ACPI / processor: remove cpu_index in acpi_processor_get_info()
  Xen / ACPI / processor: Remove unneeded NULL check
  Xen / ACPI / processor: use invalid_logical_cpuid()
  ACPI / processor: Introduce invalid_logical_cpuid()

Merge branches 'acpi-ac', 'acpi-soc' and 'acpi-assorted'

* acpi-ac:
  ACPI / AC: constify DMI system id table

* acpi-soc:
  ACPI / LPSS: constify device descriptors

* acpi-assorted:
  ACPI / HED: constify ACPI device ids