Git Repo - linux.git/log

printk: make reading the kernel log flush pending lines

That will mean that any possible subsequent continuation will now be
broken up onto a line of its own (since reading the log has finalized
the beginning og the line), but if user space has activated system
logging (or if there's a kernel message dump going on) that is the right
thing to do.

And now that we actually get the continuation flags _right_ for this
all, the user space logger that is reading the kernel messages can
actually see the continuation marker. Not that anybody seems to really
bother with it (or care), but in theory user space can do its own
message stitching.

Signed-off-by: Linus Torvalds <[email protected]>

printk: re-organize log_output() to be more legible

Avoid some duplicate logic now that we can return early, and update the
comments for the new LOG_CONT world order.

This also stops the continuation flushing from just using random record
flags for the flushing action, instead taking the flags from the proper
original line and updating them as we add continuations to it.

Signed-off-by: Linus Torvalds <[email protected]>

printk: split out core logging code into helper function

The code that actually decides how to log the message (whether to put it
directly into the record log, whether to append it to an existing
buffered log, or whether to start a new buffered log) is fairly
non-obvious code in the middle of the vprintk_emit() function.

Splitting that code up into a helper function makes it easier to
understand, but perhaps more importantly also allows for the code to
just return early out of the helper function once it has made the
decision about where the new log content goes.

Signed-off-by: Linus Torvalds <[email protected]>

printk: reinstate KERN_CONT for printing continuation lines

Long long ago the kernel log buffer was a buffered stream of bytes, very
much like stdio in user space.  It supported log levels by scanning the
stream and noticing the log level markers at the beginning of each line,
but if you wanted to print a partial line in multiple chunks, you just
did multiple printk() calls, and it just automatically worked.

Except when it didn't, and you had very confusing output when different
lines got all mixed up with each other.  Then you got fragment lines
mixing with each other, or with non-fragment lines, because it was
traditionally impossible to tell whether a printk() call was a
continuation or not.

To at least help clarify the issue of continuation lines, we added a
KERN_CONT marker back in 2007 to mark continuation lines:

  474925277671 ("printk: add KERN_CONT annotation").

That continuation marker was initially an empty string, and didn't
actuall make any semantic difference.  But it at least made it possible
to annotate the source code, and have check-patch notice that a printk()
didn't need or want a log level marker, because it was a continuation of
a previous line.

To avoid the ambiguity between a continuation line that had that
KERN_CONT marker, and a printk with no level information at all, we then
in 2009 made KERN_CONT be a real log level marker which meant that we
could now reliably tell the difference between the two cases.

  5fd29d6ccbc9 ("printk: clean up handling of log-levels and newlines")

and we could take advantage of that to make sure we didn't mix up
continuation lines with lines that just didn't have any loglevel at all.

Then, in 2012, the kernel log buffer was changed to be a "record" based
log, where each line was a record that has a loglevel and a timestamp.

You can see the beginning of that conversion in commits

  e11fea92e13f ("kmsg: export printk records to the /dev/kmsg interface")
  7ff9554bb578 ("printk: convert byte-buffer to variable-length record buffer")

with a number of follow-up commits to fix some painful fallout from that
conversion.  Over all, it took a couple of months to sort out most of
it.  But the upside was that you could have concurrent readers (and
writers) of the kernel log and not have lines with mixed output in them.

And one particular pain-point for the record-based kernel logging was
exactly the fragmentary lines that are generated in smaller chunks.  In
order to still log them as one recrod, the continuation lines need to be
attached to the previous record properly.

However the explicit continuation record marker that is actually useful
for this exact case was actually removed in aroundm the same time by commit

  61e99ab8e35a ("printk: remove the now unnecessary "C" annotation for KERN_CONT")

due to the incorrect belief that KERN_CONT wasn't meaningful.  The
ambiguity between "is this a continuation line" or "is this a plain
printk with no log level information" was reintroduced, and in fact
became an even bigger pain point because there was now the whole
record-level merging of kernel messages going on.

This patch reinstates the KERN_CONT as a real non-empty string marker,
so that the ambiguity is fixed once again.

But it's not a plain revert of that original removal: in the four years
since we made KERN_CONT an empty string again, not only has the format
of the log level markers changed, we've also had some usage changes in
this area.

For example, some ACPI code seems to use KERN_CONT _together_ with a log
level, and now uses both the KERN_CONT marker and (for example) a
KERN_INFO marker to show that it's an informational continuation of a
line.

Which is actually not a bad idea - if the continuation line cannot be
attached to its predecessor, without the log level information we don't
know what log level to assign to it (and we traditionally just assigned
it the default loglevel).  So having both a log level and the KERN_CONT
marker is not necessarily a bad idea, but it does mean that we need to
actually iterate over potentially multiple markers, rather than just a
single one.

Also, since KERN_CONT was still conceptually needed, and encouraged, but
didn't actually _do_ anything, we've also had the reverse problem:
rather than having too many annotations it has too few, and there is bit
rot with code that no longer marks the continuation lines with the
KERN_CONT marker.

So this patch not only re-instates the non-empty KERN_CONT marker, it
also fixes up the cases of bit-rot I noticed in my own logs.

There are probably other cases where KERN_CONT will be needed to be
added, either because it is new code that never dealt with the need for
KERN_CONT, or old code that has bitrotted without anybody noticing.

That said, we should strive to avoid the need for KERN_CONT.  It does
result in real problems for logging, and should generally not be seen as
a good feature.  If we some day can get rid of the feature entirely,
because nobody does any fragmented printk calls, that would be lovely.

But until that point, let's at mark the code that relies on the hacky
multi-fragment kernel printk's.  Not only does it avoid the ambiguity,
it also annotates code as "maybe this would be good to fix some day".

(That said, particularly during single-threaded bootup, the downsides of
KERN_CONT are very limited.  Things get much hairier when you have
multiple threads going on and user level reading and writing logs too).

Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'be2net-fixes'

Sriharsha Basavapatna says:

====================
be2net: patch-set

The following patch set contains a few bug fixes.
Please consider applying this to the net-next tree.
Thanks.

Patch-1 Obtains proper PF number for BEx chips
Patch-2 Fixes a FW update issue seen with BEx chips
Patch-3 Updates copyright string
Patch-4 Fixes TX stats for TSO packets
Patch-5 Enables VF link state setting for BE3
====================

Signed-off-by: David S. Miller <[email protected]>

be2net: Enable VF link state setting for BE3

The VF link state setting feature now works on BE3 chips too from
FW ver 11.1.192.0 onwards.

Signed-off-by: Suresh Reddy <[email protected]>
Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: Fix TX stats for TSO packets

TX stats update does not take into account headers which get duplicated
when the TSO packet is split into segments by HW. Fix this for both
tunneled (vxlan) and non-tunneled TSO packets.

Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: Update Copyright string in be_hw.h

This patch updates the year and company name in the copyright string
in be_hw.h.

Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: NCSI FW section should be properly updated with ethtool for BE3

The driver has a check to ensure that NCSI FW section is updated only
if the current FW version in the card supports it. This FW version check
is done using memcmp() which obviously fails in some cases. Fix this by
breaking up the version string into integer version components and
comparing them.

Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

be2net: Provide an alternate way to read pf_num for BEx chips

The driver gets the pf_num for Skyhawk and Lancer using
GET_FUNC_CONFIG FW command. But since that command is not
supported in BEx, we need to get it from some other command.
Otherwise TPE recovery would fail since all NIC PFs would
end up with a func num of 0. There's a pci function number
field in the response of GET_CNTL_ATTRIBUTES command that
can be read to get the same info for BEx adapters.

Signed-off-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

parisc: Move exception table into read-only section

Since BUILDTIME_EXTABLE_SORT is enabled, the exception table can move
into the read-only section.

Signed-off-by: Helge Deller <[email protected]>

parisc: Fix kernel memory layout regarding position of __gp

Architecturally we need to keep __gp below 0x1000000.

But because of ftrace and tracepoint support, the RO_DATA_SECTION now gets much
bigger than it was before. By moving the linkage tables before RO_DATA_SECTION
we can avoid that __gp gets positioned at a too high address.

Cc: [email protected] # 4.4+
Signed-off-by: Helge Deller <[email protected]>

parisc: Increase initial kernel mapping size

Increase the initial kernel default page mapping size for 64-bit kernels to
64 MB and for 32-bit kernels to 32 MB.

Due to the additional support of ftrace, tracepoint and huge pages the kernel
size can exceed the sizes we used up to now.

Cc: [email protected] # 4.4+
Signed-off-by: Helge Deller <[email protected]>

Merge tag '4.9/mtd-pairing-scheme' of github.com:linux-nand/linux

Introduction of the MTD pairing scheme concept.

Merge remote-tracking branch 'jk/vfs' into work.misc

Merge remote-tracking branch 'ovl/misc' into work.misc

Merge branch 'work.const-qstr' into work.misc

Merge branch 'work.iget' into work.misc

parisc: Migrate exception table users off module.h and onto extable.h

This file was only including module.h for exception table related
functions. We've now separated that content out into its own file
"extable.h" so now move over to that and avoid all the extra header
content in module.h that we don't really need to compile this file.

Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Signed-off-by: Paul Gortmaker <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Johan Hedberg says:

====================
pull request: bluetooth 2016-10-08

Here are a couple of Bluetooth fixes for the 4.9 kernel:

- Firmware download fix for Atheros controllers
- Fixes to the content of LE scan response
- New USB ID for a Marvell chipset

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <[email protected]>

x86/pkeys: Make protection keys an "eager" feature

Our XSAVE features are divided into two categories: those that
generate FPU exceptions, and those that do not.  MPX and pkeys do
not generate FPU exceptions and thus can not be used lazily.  We
disable them when lazy mode is forced on.

We have a pair of masks to collect these two sets of features, but
XFEATURE_MASK_PKRU was added to the wrong mask: XFEATURE_MASK_LAZY.
Fix it by moving the feature to XFEATURE_MASK_EAGER.

Note: this only causes problem if you boot with lazy FPU mode
(eagerfpu=off) which is *not* the default.  It also only affects
hardware which is not currently publicly available.  It looks like
eager mode is going away, but we still need this patch applied
to any kernel that has protection keys and lazy mode, which is 4.6
through 4.8 at this point, and 4.9 if the lazy removal isn't sent
to Linus for 4.9.

Fixes: c8df40098451 ("x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures")
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

x86/apic: Prevent pointless warning messages

Markus reported that he sees new warnings:

APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 4/0x84 ignored.
APIC: NR_CPUS/possible_cpus limit of 4 reached. Processor 5/0x85 ignored.

This comes from the recent persistant cpuid - nodeid changes. The code
which emits the warning has been called prior to these changes only for
enabled processors. Now it's called for disabled processors as well to get
the possible cpu accounting correct. So if the kernel is compiled for the
number of actual available/enabled CPUs and the BIOS reports disabled CPUs
as well then the above warnings are printed.

That's a pointless exercise as it only makes sense if there are more CPUs
enabled than the kernel supports.

Nake the warning conditional on enabled processors so we are back to the
state before these changes.

Fixes: 8f54969dc8d6 ("x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping")
Reported-and-tested-by: Markus Trippelsdorf <[email protected]>
Cc: One Thousand Gnomes <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: [email protected]
Cc: Gu Zheng <[email protected]>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610071549330.19804@nanos
Signed-off-by: Thomas Gleixner <[email protected]>

x86/acpi: Prevent LAPIC id 0xff from being accounted

Yinghai reported that the recent changes to make the cpuid - nodeid
relationship permanent causes a cpuid ordering regression on a system which
has 2apic enabled..

The reason is that the ACPI local APIC parser has no sanity check for
apicid 0xff, which is an invalid id. So a CPU id for this invalid local
APIC id is allocated and therefor breaks the cpuid ordering.

Add a sanity check to acpi_parse_lapic() which ignores the invalid id.

Fixes: 8f54969dc8d6 ("x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping")
Reported-by: Yinghai Lu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Gu Zheng <[email protected]>,
Cc: Tang Chen <[email protected]>
Cc: [email protected],
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lv Zheng <[email protected]>,
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/CAE9FiQVQx6FRXT-RdR7Crz4dg5LeUWHcUSy1KacjR+JgU_vGJg@mail.gmail.com

watchdog: imx2_wdt: add pretimeout function support

The change adds watchdog pretimeout notification handling to imx2_wdt
driver, if device data contains information about a valid interrupt.

It is unlikely but still possible (e.g. through a software limitation)
that only a subset of watchdogs on SoC has interrupt lines, hence
functionally the devices from these two groups have different
capabilities, and this is reflected in different watchdog_info
structs assigned to the devices.

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: softdog: implement pretimeout support

Give devices which do not have hardware support for pretimeout at least a
software version of it.

Signed-off-by: Wolfram Sang <[email protected]>
Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: pretimeout: add pretimeout_available_governors attribute

The change adds an option to a user with CONFIG_WATCHDOG_SYSFS and
CONFIG_WATCHDOG_PRETIMEOUT_GOV enabled to get information about all
registered watchdog pretimeout governors by reading watchdog device
attribute named "pretimeout_available_governors".

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: pretimeout: add option to select a pretimeout governor in runtime

The change converts watchdog device attribute "pretimeout_governor" from
read-only to read-write type to allow users to select a desirable
watchdog pretimeout governor in runtime, e.g.

% echo -n panic > /sys/..../watchdog/watchdog0/pretimeout

To get this working a list of registered pretimeout governors is created
and a new helper function watchdog_pretimeout_governor_set() is exported
to watchdog_dev.c.

If a selected governor is gone, a watchdog device pretimeout notification
is delegated to a default built-in pretimeout governor.

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: pretimeout: add panic pretimeout governor

The change adds panic watchdog pretimeout governor, on watchdog
pretimeout event the kernel shall panic. In general watchdog
pretimeout event means that something essentially bad is going on the
system, for example a process scheduler stalls or watchdog feeder is
killed due to OOM, so printing out information attendant to panic and
before likely unavoidable reboot caused by a watchdog may help to
determine a root cause of the issue.

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: pretimeout: add noop pretimeout governor

The change adds noop watchdog pretimeout governor, only an
informational message is printed to the kernel log buffer when a
watchdog triggers a pretimeout event.

While introducing the first pretimeout governor the selected design
assumes that the default pretimeout governor is selected by its name
and it is always built-in, thus the default pretimeout governor can
not be unregistered and the correspondent check can be removed from
the watchdog_unregister_governor() function.

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

watchdog: add watchdog pretimeout governor framework

The change adds a simple watchdog pretimeout framework infrastructure,
its purpose is to allow users to select a desired handling of watchdog
pretimeout events, which may be generated by some watchdog devices.

A user selects a default watchdog pretimeout governor during
compilation stage.

Watchdogs with WDIOF_PRETIMEOUT capability now have one more device
attribute in sysfs, pretimeout_governor attribute is intended to display
the selected watchdog pretimeout governor.

The framework has no impact at runtime on watchdog devices with no
WDIOF_PRETIMEOUT capability set.

Signed-off-by: Vladimir Zapolskiy <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

- fsnotify updates

- ocfs2 updates

- all of MM

* emailed patches from Andrew Morton <[email protected]>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...

Merge tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC late DT updates from Arnd Bergmann:
"These updates have been kept in a separate branch mostly because they
  rely on updates to the respective clk drivers to keep the shared
  header files in sync.

   - The Renesas r8a7796 (R-Car M3-W) platform gets added, this is an
     automotive SoC similar to the ⅹ8a7795 chip we already support, but
     the dts changes rely on a clock driver change that has been merged
     for v4.9 through the clk tree.

   - The Amlogic meson-gxbb (S905) platform gains support for a few
     drivers merged through our tree, in particular the network and usb
     driver changes are required and included here, and also the clk
     tree changes.

   - The Allwinner platforms have seen a large-scale change to their clk
     drivers and the dts file updates must come after that. This
     includes the newly added Nextthing GR8 platform, which is derived
     from sun5i/A13.

   - Some integrator (arm32) changes rely on clk driver changes.

   - A single patch for lpc32xx has no such dependency but wasn't added
     until just before the merge window"

* tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (99 commits)
  ARM: dts: lpc32xx: add device node for IRAM on-chip memory
  ARM: dts: sun8i: Add accelerometer to polaroid-mid2407pxe03
  ARM: dts: sun8i: enable UART1 for iNet D978 Rev2 board
  ARM: dts: sun8i: add pinmux for UART1 at PG
  dts: sun8i-h3: add I2C0-2 peripherals to H3 SOC
  dts: sun8i-h3: add pinmux definitions for I2C0-2
  dts: sun8i-h3: associate exposed UARTs on Orange Pi Boards
  dts: sun8i-h3: split off RTS/CTS for UART1 in seperate pinmux
  dts: sun8i-h3: add pinmux definitions for UART2-3
  ARM: dts: sun9i: a80-optimus: Disable EHCI1
  ARM: dts: sun9i: cubieboard4: Add AXP806 PMIC device node and regulators
  ARM: dts: sun9i: a80-optimus: Add AXP806 PMIC device node and regulators
  ARM: dts: sun9i: cubieboard4: Declare AXP809 SW regulator as unused
  ARM: dts: sun9i: a80-optimus: Declare AXP809 SW regulator as unused
  ARM: dts: sun8i: Add touchscreen node for sun8i-a33-ga10h
  ARM: dts: sun8i: Add touchscreen node for sun8i-a23-polaroid-mid2809pxe04
  ARM: dts: sun8i: Add touchscreen node for sun8i-a23-polaroid-mid2407pxe03
  ARM: dts: sun8i: Add touchscreen node for sun8i-a23-inet86dz
  ARM: dts: sun8i: Add touchscreen node for sun8i-a23-gt90h
  ARM64: dts: meson-gxbb-vega-s95: Enable USB Nodes
  ...

Merge tag 'armsoc-dt64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM 64-bit DT updates from Arnd Bergmann:
"The 64-bit DT changes are surprisingly small this time, we only add
  two SoC platforms: the ZTE ZX296718 Set-top-box SoC and the SocioNext
  UniPhier LD11 TV SoC, each with their reference boards.

  There are three new machines added for existing SoC platforms:

   - The Marvell Armada 8040 development board is an impressive
     quad-core Cortex-A72 machine with three 10gbit ethernet interfaces

   - Qualcomms DragonBoard 820c single-board computer is their current
     high-end phone platform in the 96boards form factor

   - Rockchip: Tronsmart Orion r86 set-top-box is a popular mid-range
     Android box based on the 8-core rk3368 SoC"

* tag 'armsoc-dt64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (91 commits)
  arm64: dts: berlin4ct: Add L2 cache topology
  arm64: dts: berlin4ct: enable all wdt nodes unconditionally
  arm64: dts: berlin4ct: switch to Cortex-A53 specific pmu nodes
  arm64: dts: Add ZTE ZX296718 SoC dts and Makefile
  arm64: dts: apm: Add DT node for APM X-Gene 2 CPU clocks
  arm64: dts: apm: Add X-Gene SoC hwmon to device tree
  arm64: dts: apm: Fix interrupt polarity for X-Gene PCIe legacy interrupts
  arm64: dts: apm: Add APM X-Gene v2 SoC PMU DTS entries
  arm64: dts: apm: Add APM X-Gene SoC PMU DTS entries
  arm64: dts: marvell: enable MSI for PCIe on Armada 7K/8K
  arm64: dts: ls2080a: Add 'dma-coherent' for ls2080a PCI nodes
  arm64: dts: rockchip: add Type-C phy for RK3399
  arm64: dts: rockchip: enable the gmac for rk3399 evb board
  arm64: dts: rockchip: add the gmac needed node for rk3399
  arm64: dts: rockchip: support the pmu node for rk3399
  arm64: dts: rockchip: change all interrupts cells to 4 on rk3399 SoCs
  arm64: dts: rockchip: add the tcpc for rk3399 power domain
  arm64: dts: rockchip: add efuse0 device node for rk3399
  arm64: dts: rockchip: configure PCIe support for rk3399-evb
  arm64: dts: rockchip: add the PCIe controller support for RK3399
  ...

Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM DT updates from Arnd Bergmann:
"These are as usual a very large number of mostly boring updates to
  enable devices in existing machines, or to fix minor bugs. Notably, an
  ongoing treewide effort to fix warnings caused by an update to the
  device tree compiler. These are enabled with "make W=1" at the moment
  but can hopefully become the default once all issues have been
  addressed.

  No new SoC platform is added this time around (Armada 395 and Orion
  mv88f5181 are slight variations of existing ones), but a significant
  number of new dts files are added, which I list by platform:

   - Allwinner: Empire Electronix M712 and iNet d978 Rev2 tablets,
     Orange Pi PC Plus, Orange Pi 2, Orange Pi Plus 2E, Orange Pi Lite,
     Olimex A33-Olinuxino, and Nano Pi Neo single-board computers

   - ARM Realview: all supported machines (ported from board files)

   - Broadcom: BCM958525er, BCM958522er, BCM988312hr, BCM958623hr and
     BCM958622hr reference boards for Northstar platform, Raspberry Pi
     Zero single-board computer

   - Marvell EBU: Netgear WNR854T router (ported from board file),
     Armada 395 SoC platform and GP board Armada 390 DB development
     board

   - NXP i.MX: imx7s Warp7 reference board, Gateworks Ventana GW553x
     single-board computer, Technologic Systems TS-4900 and Engicam
     IMX6UL GEA M6UL computer-on-module, Inverse Path USB armory board

   - Qualcomm: LG Nexus 5 Phone

   - Renesas: r8a7792/wheat and r7s72100/rskrza1 development boards

   - Rockchip: Rockchip RK3288 Fennec reference board, Firefly RK3288
     Reload platform

   - ST Microelectronics STi: B2260 (96boards) single-board computer

   - TI Davinci: OMAP-L138 LCDK Development kit

   - TI OMAP: beagleboard-x15 rev B1 single-board computer"

* tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (390 commits)
  ARM: dts: sony-nsz-gs7: add missing unit name to /memory node
  ARM: dts: chromecast: add missing unit name to /memory node
  ARM: dts: berlin2q-marvell-dmp: add missing unit name to /memory node
  ARM: dts: berlin2: Add missing unit name to /soc node
  ARM: dts: berlin2cd: Add missing unit name to /soc node
  ARM: dts: berlin2q: Add missing unit name to /soc node
  ARM: dts: berlin2: Remove skeleton.dtsi inclusion
  ARM: dts: berlin2cd: Remove skeleton.dtsi inclusion
  ARM: dts: berlin2q: Remove skeleton.dtsi inclusion
  arm: dts: berlin2q: enable all wdt nodes unconditionally
  arm: dts: berlin2: enable all wdt nodes unconditionally
  ARM: dts: omap5-igep0050.dts: Use tabs for indentation
  ARM: dts: Fix igepv5 power button GPIO direction
  ARM: dts: am335x-evmsk: Add blue-and-red-wiring -property to lcdc node
  ARM: dts: am335x-evmsk: Whitespace cleanup of lcdc related nodes
  ARM: dts: am335x-evm: Add blue-and-red-wiring -property to lcdc node
  ARM: dts: s3c64xx: Use macros for pinctrl configuration
  ARM: dts: s3c2416: Use macros for pinctrl configuration
  ARM: dts: s5pv210: Use macros for pinctrl configuration
  ARM: dts: s3c64xx: Use common macros for pinctrl configuration
  ...

Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC driver updates from Arnd Bergmann:
"Driver updates for ARM SoCs, including a couple of newly added
  drivers:

   - The Qualcomm external bus interface 2 (EBI2), used in some of their
     mobile phone chips for connecting flash memory, LCD displays or
     other peripherals

   - Secure monitor firmware for Amlogic SoCs, and an NVMEM driver for
     the EFUSE based on that firmware interface.

   - Perf support for the AppliedMicro X-Gene performance monitor unit

   - Reset driver for STMicroelectronics STM32

   - Reset driver for SocioNext UniPhier SoCs

  Aside from these, there are minor updates to SoC-specific bus,
  clocksource, firmware, pinctrl, reset, rtc and pmic drivers"

* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (50 commits)
  bus: qcom-ebi2: depend on HAS_IOMEM
  pinctrl: mvebu: orion5x: Generalise mv88f5181l support for 88f5181
  clk: mvebu: Add clk support for the orion5x SoC mv88f5181
  dt-bindings: EXYNOS: Add Exynos5433 PMU compatible
  clocksource: exynos_mct: Add the support for ARM64
  perf: xgene: Add APM X-Gene SoC Performance Monitoring Unit driver
  Documentation: Add documentation for APM X-Gene SoC PMU DTS binding
  MAINTAINERS: Add entry for APM X-Gene SoC PMU driver
  bus: qcom: add EBI2 driver
  bus: qcom: add EBI2 device tree bindings
  rtc: rtc-pm8xxx: Add support for pm8018 rtc
  nvmem: amlogic: Add Amlogic Meson EFUSE driver
  firmware: Amlogic: Add secure monitor driver
  soc: qcom: smd: Reset rx tail rather than tx
  memory: atmel-sdramc: fix a possible NULL dereference
  reset: hi6220: allow to compile test driver on other architectures
  reset: zynq: add driver Kconfig option
  reset: sunxi: add driver Kconfig option
  reset: stm32: add driver Kconfig option
  reset: socfpga: add driver Kconfig option
  ...

Merge tag 'armsoc-arm64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC 64-bit updates from Arnd Bergmann:
"Changes to platform code for 64-bit ARM platforms.

  Nearly all of these are defconfig updates to enable new drivers or old
  drivers still used on these 64-bit platforms.

  Aside from that, we gain initial support for two set-top-box
  platforms, both of which already have 32-bit support in arch/arm:

   - Broadcom adds abstract support for the bcm7xxx/brcmstb platform,
     presumably the respective dts files and more information will
     follow at a later point.

   - The ZTE ZX296718 SoC for set-top-boxes, a relative of the 32-bit
     ZX296702 SoC that we already support"

* tag 'armsoc-arm64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: add ZTE ZX SoC family
  arm64: defconfig: enable ZTE ZX related config
  arm64: defconfig: enable common modules for power management
  arm64: defconfig: enable meson I2C
  arm64: defconfig: enable meson SPI as module
  arm64: defconfig: enable meson WDT as modules
  arm64: defconfig: enable HW random as module
  arm64: defconfig: Enable SDHI and GPIO_REGULATOR
  arm64: configs: enable PCIe driver for Aardvark
  Kconfig: ARCH_HISI: Add PINCTRL to HISI platform
  arm64: defconfig: enable bluetooth supports as modules
  arm64: defconfig: enable CONFIG_INPUT_HISI_POWERKEY for HiKey
  arm64: defconfig: Enable HiSilicon kirin drm, adv7533 for HiKey
  arm64: defconfig: Enable Hisi SAS and HNS
  arm64: defconfig: Enable QDF2432 config options
  arm64: sunxi: Kconfig: add essential pinctrl driver
  arm64: defconfig: Add Renesas R-Car HSUSB driver support as module
  arm64: Add Broadcom Set Top Box Kconfig entry point
  arm64: defconfig: enable xhci-platform

Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC defconfig updates from Arnd Bergmann:
"Defconfig additions, removals, etc. Most of these are small changes
  adding the options for newly upstreamed drivers, or drivers needed for
  new board support. Nothing specifically sticks out this time"

* tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (25 commits)
  ARM: multi_v7_defconfig: enable CONFIG_EFI
  ARM: multi_v7_defconfig: Build Atmel maXTouch driver as a module
  ARM: defconfig: update the Integrator defconfig
  ARM: keystone: defconfig: Fix USB configuration
  ARM: imx_v6_v7_defconfig: Select the wm8960 codec driver
  ARM: omap2plus_defconfig: switch to the IIO BMP085 driver
  ARM: mvebu_v5_defconfig: use MV88E6XXX
  ARM: davinci_all_defconfig: Enable some UBI modules
  ARM: davinci_all_defconfig: Enable AEMIF as a module
  ARM: multi_v7_defconfig: Enable SECCOMP
  ARM: exynos_defconfig: Enable SECCOMP
  ARM: imx_v6_v7_defconfig: Add CONFIG_MPL3115
  ARM: imx_v6_v7_defconfig: Enable GPU support
  ARM: s3c2410_defconfig: Remove CONFIG_IPV6_PRIVACY
  ARM: exynos_defconfig: Enable PM_DEBUG
  ARM: exynos_defconfig: Enable bus frequency scaling with devfreq
  ARM: imx_v6_v7_defconfig: enable more USB configurations
  ARM: davinci_all_defconfig: enable SMSC ethernet PHY
  ARM: davinci_all_defconfig: enable RTC driver as module
  ARM: multi_v7_defconfig: Enable ARM_IMX6Q_CPUFREQ
  ...

Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC platform updates from Arnd Bergmann:
"These are updates for platform specific code on 32-bit ARM machines,
  essentially anything that can not (yet) be expressed using DT files.

  Noteworthy changes include:

   - We get support for running in big-endian mode on two platforms:
     sunxi (Allwinner) and s3c24xx (old Samsung).

   - The recently added Uniphier platform now uses standard PSCI methods
     for SMP booting and we remove support for old bootloader versions
     that did not support it yet.

   - In sunxi, we gain support for the "Nextthing GR8" SoC, which is a
     close relative of the Allwinner A13 and R8 chips.

   - PXA completes its move over to the generic dmaengine framework and
     removes its old private API

   - mach-bcm gains support for BCM47189/BCM53573, their first ARM SoC
     with integrated 802.11ac wireless networking"

* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (54 commits)
  ARM: imx legacy: pca100: move peripheral initialization to .init_late
  ARM: imx legacy: mx27ads: move peripheral initialization to .init_late
  ARM: imx legacy: mx21ads: move peripheral initialization to .init_late
  ARM: imx legacy: pcm043: move peripheral initialization to .init_late
  ARM: imx legacy: mx35-3ds: move peripheral initialization to .init_late
  ARM: imx legacy: mx27-3ds: move peripheral initialization to .init_late
  ARM: imx legacy: imx27-visstrim-m10: move peripheral initialization to .init_late
  ARM: imx legacy: vpr200: move peripheral initialization to .init_late
  ARM: imx legacy: mx31moboard: move peripheral initialization to .init_late
  ARM: imx legacy: armadillo5x0: move peripheral initialization to .init_late
  ARM: imx legacy: qong: move peripheral initialization to .init_late
  ARM: imx legacy: mx31-3ds: move peripheral initialization to .init_late
  ARM: imx legacy: pcm037: move peripheral initialization to .init_late
  ARM: imx legacy: mx31lilly: move peripheral initialization to .init_late
  ARM: imx legacy: mx31ads: move peripheral initialization to .init_late
  ARM: imx legacy: mx31lite: move peripheral initialization to .init_late
  ARM: imx legacy: kzm: move peripheral initialization to .init_late
  MAINTAINERS: update list of Oxnas maintainers
  ARM: orion5x: remove extraneous NO_IRQ
  ARM: orion: simplify orion_ge00_switch_init
  ...

Merge tag 'armsoc-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC cleanups from Arnd Bergmann:
"The cleanups for v4.9 are a little larger that usual, but thankfully
  that is almost exclusively due to removing a significant number of
  files that have become obsolete after the still ongoing conversion of
  old board files to devicetree.

   - for mach-omap2, which is still the largest platform in arch/arm/,
     the conversion to DT is finally complete after the Nokia N900 is
     now fully supported there, along with the omap3 LDP, and we can
     remove those two board files. If no regressions are found, another
     large cleanup for the platform will happen as a follow-up, removing
     dead code and restructuring the platform based on being DT-only.

   - In mach-imx, similar work is ongoing, but has not come that far.
     This time, we remove the obsolete board file for the i.MX1
     generation, which like i.MX25, i.MX5, i.MX6, and i.MX7 is now
     DT-only. The remaining board files are for i.MX2 and i.MX3 machines
     based on old ARM926 or ARM1136 cores that should work with DT in
     principle.

   - realview has just been converted from board files to DT, and a lot
     of code gets removed in the process. This is the last
     ARM/Keil/Versatile derived platform that was still using board
     files, the other ones being integrator, versatile and vexpress. We
     can probably merge the remaining code into a single directory in
     the near future.

   - clps711x had completed the conversion in v4.8, but we accidentally
     left the files in place that should have been deleted then"

* tag 'armsoc-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (21 commits)
  ARM: select PCI_DOMAINS config from ARCH_MULTIPLATFORM
  ARM: stop *MIGHT_HAVE_PCI* config from being selected redundantly
  ARM: imx: (trivial) fix typo and grammar
  ARM: clps711x: remove extraneous files
  ARM: imx: use IS_ENABLED() instead of checking for built-in or module
  ARM: OMAP2+: use IS_ENABLED() instead of checking for built-in or module
  ARM: OMAP1: use IS_ENABLED() instead of checking for built-in or module
  ARM: imx: remove platform-mxc_rnga
  ARM: realview: imply device tree boot
  ARM: realview: no need to select SMP_ON_UP explicitly
  ARM: realview: delete the RealView board files
  ARM: imx: no need to select SMP_ON_UP explicitly
  ARM: i.MX: Move SOC_IMX1 into 'Device tree only'
  ARM: i.MX: Remove i.MX1 non-DT support
  ARM: i.MX: Remove i.MX1 Synertronixx SCB9328 board support
  ARM: i.MX: Remove i.MX1 Armadeus APF9328 board support
  ARM: mxs: remove obsolete startup code for TX28
  ARM: i.MX31 iomux: remove duplicates with alternate name
  ARM: i.MX31 iomux: remove plain duplicates
  ARM: OMAP2+: Drop legacy board file for LDP
  ...

wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()

Size used with 'dma_alloc_coherent()' and 'dma_free_coherent()' should be
consistent.
Here, the size of a pointer is used in dma_alloc... and the size of the
pointed structure is used in dma_free...

This has been spotted with coccinelle, using the following script:
////////////////////
@r@
expression x0, x1, y0, y1, z0, z1, t0, t1, ret;
@@

*   ret = dma_alloc_coherent(x0, y0, z0, t0);
    ...
*   dma_free_coherent(x1, y1, ret, t1);

@script:python@
y0 << r.y0;
y1 << r.y1;

@@
if y1.find(y0) == -1:
print "WARNING: sizes look different:  '%s'   vs   '%s'" % (y0, y1)
////////////////////

Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: macb: NULL out phydev after removing mdio bus

To ensure the dev->phydev pointer is not used after becoming invalid in
mdiobus_unregister, set it to NULL. This happens when removing the macb
driver without first taking its interface down, since unregister_netdev
will end up calling macb_close.

Signed-off-by: Xander Huff <[email protected]>
Signed-off-by: Nathan Sullivan <[email protected]>
Signed-off-by: Brad Mouring <[email protected]>
Reviewed-by: Moritz Fischer <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

xen-netback: make sure that hashes are not send to unaware frontends

In the case when a frontend only negotiates a single queue with xen-
netback it is possible for a skbuff with a s/w hash to result in a
hash extra_info segment being sent to the frontend even when no hash
algorithm has been configured. (The ndo_select_queue() entry point makes
sure the hash is not set if no algorithm is configured, but this entry
point is not called when there is only a single queue). This can result
in a frontend that is unable to handle extra_info segments being given
such a segment, causing it to crash.

This patch fixes the problem by clearing the hash in ndo_start_xmit()
instead, which is clearly guaranteed to be called irrespective of the
number of queues.

Signed-off-by: Paul Durrant <[email protected]>
Cc: Wei Liu <[email protected]>
Acked-by: Wei Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion

Roundrobin runner of team driver uses 'unsigned int' variable to count
the number of sent_packets. Later it is passed to a subroutine
team_num_to_port_index(struct team *team, int num) as 'num' and when
we reach MAXINT (2**31-1), 'num' becomes negative.

This leads to using incorrect hash-bucket for port lookup
and as a result, packets are dropped. The fix consists of changing
'int num' to 'unsigned int num'. Testing of a fixed kernel shows that
there is no packet drop anymore.

Signed-off-by: Alex Sidorenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc updates from Helge Deller:
"Changes include:

   - Fix boot of 32bit SMP kernel (initial kernel mapping was too small)

   - Added hardened usercopy checks

   - Drop bootmem and switch to memblock and NO_BOOTMEM implementation

   - Drop the BROKEN_RODATA config option (and thus remove the relevant
     code from the generic headers and files because parisc was the last
     architecture which used this config option)

   - Improve segfault reporting by printing human readable error strings

   - Various smaller changes, e.g. dwarf debug support for assembly
     code, update comments regarding copy_user_page_asm, switch to
     kmalloc_array()"

* 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels
  parisc: Drop bootmem and switch to memblock
  parisc: Add hardened usercopy feature
  parisc: Add cfi_startproc and cfi_endproc to assembly code
  parisc: Move hpmc stack into page aligned bss section
  parisc: Fix self-detected CPU stall warnings on Mako machines
  parisc: Report trap type as human readable string
  parisc: Update comment regarding implementation of copy_user_page_asm
  parisc: Use kmalloc_array() in add_system_map_addresses()
  parisc: Check return value of smp_boot_one_cpu()
  parisc: Drop BROKEN_RODATA config option

MAINTAINERS: add myself as a maintainer of xen-netback

Signed-off-by: Paul Durrant <[email protected]>
Cc: Wei Liu <[email protected]>
Acked-by: Wei Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32

Pull avr32 update from Hans-Christian Noren Egtvedt.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
avr32: migrate exception table users off module.h and onto extable.h

ipv6 addrconf: disallow rtr_solicits < -1

This disallows setting /proc/sys/net/ipv6/conf/*/router_solicitations
to values below -1.

-1 continues to mean an unlimited number of retransmits.

Note: this depends on 'ipv6 addrconf: remove addrconf_sysctl_hop_limit()'

Signed-off-by: Maciej Żenczykowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
"Highlights:
   - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin)
   - Use gas sections for arranging exception vectors et. al.
   - Large set of TM cleanups and selftests (Cyril Bur)
   - Enable transactional memory (TM) lazily for userspace (Cyril Bur)
   - Support for XZ compression in the zImage wrapper (Oliver
     O'Halloran)
   - Add support for bpf constant blinding (Naveen N. Rao)
   - Beginnings of upstream support for PA Semi Nemo motherboards
     (Darren Stevens)

  Fixes:
   - Ensure .mem(init|exit).text are within _stext/_etext (Michael
     Ellerman)
   - xmon: Don't use ld on 32-bit (Michael Ellerman)
   - vdso64: Use double word compare on pointers (Anton Blanchard)
   - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui)
   - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy)
   - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K
     (Aneesh Kumar K.V)
   - Fix memory leak in queue_hotplug_event() error path (Andrew
     Donnellan)
   - Replay hypervisor maintenance interrupt first (Nicholas Piggin)

  Various performance optimisations (Anton Blanchard):
   - Align hot loops of memset() and backwards_memcpy()
   - During context switch, check before setting mm_cpumask
   - Remove static branch prediction in atomic{, 64}_add_unless
   - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little
     endian
   - Set default CPU type to POWER8 for little endian builds

  Cleanups & features:
   - Sparse fixes/cleanups (Daniel Axtens)
   - Preserve CFAR value on SLB miss caused by access to bogus address
     (Paul Mackerras)
   - Radix MMU fixups for POWER9 (Aneesh Kumar K.V)
   - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU)
     (Simon Guo)
   - Optimise syscall entry for virtual, relocatable case (Nicholas
     Piggin)
   - Optimise MSR handling in exception handling (Nicholas Piggin)
   - Support for kexec with Radix MMU (Benjamin Herrenschmidt)
   - powernv EEH fixes (Russell Currey)
   - Suprise PCI hotplug support for powernv (Gavin Shan)
   - Endian/sparse fixes for powernv PCI (Gavin Shan)
   - Defconfig updates (Anton Blanchard)
   - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh)
   - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat)
   - cxl: replace loop with for_each_child_of_node(), remove unneeded
     of_node_put() (Andrew Donnellan)
   - Fix HV facility unavailable to use correct handler (Nicholas
     Piggin)
   - Remove unnecessary syscall trampoline (Nicholas Piggin)
   - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael
     Ellerman)
   - Quieten EEH message when no adapters are found (Anton Blanchard)
   - powernv: Add PHB register dump debugfs handle (Russell Currey)
   - Use kprobe blacklist for exception handlers & asm functions
     (Nicholas Piggin)
   - Document the syscall ABI (Nicholas Piggin)
   - MAINTAINERS: Update cxl maintainers (Michael Neuling)
   - powerpc: Remove all usages of NO_IRQ (Michael Ellerman)

  Minor cleanups:
   - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur,
     Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng,
     Simon Guo"

* tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits)
  powerpc/bpf: Add support for bpf constant blinding
  powerpc/bpf: Implement support for tail calls
  powerpc/bpf: Introduce accessors for using the tmp local stack space
  powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n
  powerpc: tm: Enable transactional memory (TM) lazily for userspace
  powerpc/tm: Add TM Unavailable Exception
  powerpc: Remove do_load_up_transact_{fpu,altivec}
  powerpc: tm: Rename transct_(*) to ck(\1)_state
  powerpc: tm: Always use fp_state and vr_state to store live registers
  selftests/powerpc: Add checks for transactional VSXs in signal contexts
  selftests/powerpc: Add checks for transactional VMXs in signal contexts
  selftests/powerpc: Add checks for transactional FPUs in signal contexts
  selftests/powerpc: Add checks for transactional GPRs in signal contexts
  selftests/powerpc: Check that signals always get delivered
  selftests/powerpc: Add TM tcheck helpers in C
  selftests/powerpc: Allow tests to extend their kill timeout
  selftests/powerpc: Introduce GPR asm helper header file
  selftests/powerpc: Move VMX stack frame macros to header file
  selftests/powerpc: Rework FPU stack placement macros and move to header file
  selftests/powerpc: Check for VSX preservation across userspace preemption
  ...

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Al Viro <[email protected]>

console: don't prefer first registered if DT specifies stdout-path

If a device tree specifies a preferred device for kernel console output
via the stdout-path or linux,stdout-path chosen node properties or the
stdout alias then the kernel ought to honor it & output the kernel
console to that device.  As it stands, this isn't the case.  Whilst we
parse the stdout-path properties & set an of_stdout variable from
of_alias_scan(), and use that from of_console_check() to determine
whether to add a console device as a preferred console whilst
registering it, we also prefer the first registered console if no other
has been selected at the time of its registration.

This means that if a console other than the one the device tree selects
via stdout-path is registered first, we will switch to using it & when
the stdout-path console is later registered the call to
add_preferred_console() via of_console_check() is too late to do
anything useful.  In practice this seems to mean that we switch to the
dummy console device fairly early & see no further console output:

    Console: colour dummy device 80x25
    console [tty0] enabled
    bootconsole [ns16550a0] disabled

Fix this by not automatically preferring the first registered console if
one is specified by the device tree.  This allows consoles to be
registered but not enabled, and once the driver for the console selected
by stdout-path calls of_console_check() the driver will be added to the
list of preferred consoles before any other console has been enabled.
When that console is then registered via register_console() it will be
enabled as expected.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Ivan Delalande <[email protected]>
Cc: Thierry Reding <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Frank Rowand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cred: simpler, 1D supplementary groups

Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

- "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

- fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

- aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Vasily Kulikov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

CREDITS: update Pavel's information, add GPG key, remove snail mail address

Link: http://lkml.kernel.org/r/20161003082312.GA20634@amd
Signed-off-by: Pavel Machek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mailmap: add Johan Hovold

Add two entries to map to my primary address.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

.gitattributes: set git diff driver for C source code files

Git can be told to apply language-specific rules when generating diffs.
Enable this for C source code files (*.c and *.h) so that function names
are printed right. Specifically, doing so prevents "git diff" from
mistakenly considering unindented goto labels as function names.

Link: http://lkml.kernel.org/r/20160907143403.1449324f@endymion
Signed-off-by: Jean Delvare <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

uprobes: remove function declarations from arch/{mips,s390}

The declarations of arch-specific functions have been moved to a common
header in commit 3820b4d2789f ('uprobes: Move function declarations out
of arch'), but MIPS and S390 has added them to their own trees later.
Remove the unnecessary duplicates.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marcin Nowakowski <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Ralf Baechle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

spelling.txt: "modeled" is spelt correctly

No need to correct the correct.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nmi_backtrace: generate one-line reports for idle cpus

When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.

This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Daniel Thompson <[email protected]> [arm]
Tested-by: Petr Mladek <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arch/tile: adopt the new nmi_backtrace framework

Previously tile was rolling its own method of capturing backtrace data
in the NMI handlers, but it was relying on running printk() from the NMI
handler, which is not always safe. So adopt the nmi_backtrace model
(with the new cpumask extension) instead.

So we can call the nmi_backtrace code directly from the nmi handler,
move the nmi_enter()/exit() into the top-level tile NMI handler.

The semantics of the routine change slightly since it is now synchronous
with the remote cores completing the backtraces. Previously it was
asynchronous, but with protection to avoid starting a new remote
backtrace if the old one was still in progress.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
Cc: Daniel Thompson <[email protected]> [arm]
Cc: Petr Mladek <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nmi_backtrace: do a local dump_stack() instead of a self-NMI

Currently on arm there is code that checks whether it should call
dump_stack() explicitly, to avoid trying to raise an NMI when the
current context is not preemptible by the backtrace IPI.  Similarly, the
forthcoming arch/tile support uses an IPI mechanism that does not
support generating an NMI to self.

Accordingly, move the code that guards this case into the generic
mechanism, and invoke it unconditionally whenever we want a backtrace of
the current cpu.  It seems plausible that in all cases, dump_stack()
will generate better information than generating a stack from the NMI
handler.  The register state will be missing, but that state is likely
not particularly helpful in any case.

Or, if we think it is helpful, we should be capturing and emitting the
current register state in all cases when regs == NULL is passed to
nmi_cpu_backtrace().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
Tested-by: Daniel Thompson <[email protected]> [arm]
Reviewed-by: Petr Mladek <[email protected]>
Acked-by: Aaron Tomlin <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nmi_backtrace: add more trigger_*_cpu_backtrace() methods

Patch series "improvements to the nmi_backtrace code" v9.

This patch series modifies the trigger_xxx_backtrace() NMI-based remote
backtracing code to make it more flexible, and makes a few small
improvements along the way.

The motivation comes from the task isolation code, where there are
scenarios where we want to be able to diagnose a case where some cpu is
about to interrupt a task-isolated cpu.  It can be helpful to see both
where the interrupting cpu is, and also an approximation of where the
cpu that is being interrupted is.  The nmi_backtrace framework allows us
to discover the stack of the interrupted cpu.

I've tested that the change works as desired on tile, and build-tested
x86, arm, mips, and sparc64.  For x86 I confirmed that the generic
cpuidle stuff as well as the architecture-specific routines are in the
new cpuidle section.  For arm, mips, and sparc I just build-tested it
and made sure the generic cpuidle routines were in the new cpuidle
section, but I didn't attempt to figure out which the platform-specific
idle routines might be.  That might be more usefully done by someone
with platform experience in follow-up patches.

This patch (of 4):

Currently you can only request a backtrace of either all cpus, or all
cpus but yourself.  It can also be helpful to request a remote backtrace
of a single cpu, and since we want that, the logical extension is to
support a cpumask as the underlying primitive.

This change modifies the existing lib/nmi_backtrace.c code to take a
cpumask as its basic primitive, and modifies the linux/nmi.h code to use
the new "cpumask" method instead.

The existing clients of nmi_backtrace (arm and x86) are converted to
using the new cpumask approach in this change.

The other users of the backtracing API (sparc64 and mips) are converted
to use the cpumask approach rather than the all/allbutself approach.
The mips code ignored the "include_self" boolean but with this change it
will now also dump a local backtrace if requested.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Metcalf <[email protected]>
Tested-by: Daniel Thompson <[email protected]> [arm]
Reviewed-by: Aaron Tomlin <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

min/max: remove sparse warnings when they're nested

Currently, when min/max are nested within themselves, sparse will warn:

    warning: symbol '_min1' shadows an earlier one
    originally declared here
    warning: symbol '_min1' shadows an earlier one
    originally declared here
    warning: symbol '_min2' shadows an earlier one
    originally declared here

This also immediately happens when min3() or max3() are used.

Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to
generate unique variable names, avoiding this.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Documentation/filesystems/proc.txt: add more description for maps/smaps

Add some more description on the limitations for smaps/maps readings, as
well as some guaruntees we can make.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Robert Ho <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Cc: Robert Hu <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Stefan Hajnoczi <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, proc: fix region lost in /proc/self/smaps

Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:

   https://bugzilla.redhat.com/show_bug.cgi?id=1365721

Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems.  This simple test case abstracted from nvml can easily
reproduce this bug in common environment:

-------------------------- testcase.c -----------------------------

int
is_pmem_proc(const void *addr, size_t len)
{
        const char *caddr = addr;

        FILE *fp;
        if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
                printf("!/proc/self/smaps");
                return 0;
        }

        int retval = 0;         /* assume false until proven otherwise */
        char line[PROCMAXLEN];  /* for fgets() */
        char *lo = NULL;        /* beginning of current range in smaps file */
        char *hi = NULL;        /* end of current range in smaps file */
        int needmm = 0;         /* looking for mm flag for current range */
        while (fgets(line, PROCMAXLEN, fp) != NULL) {
                static const char vmflags[] = "VmFlags:";
                static const char mm[] = " wr";

                /* check for range line */
                if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                        if (needmm) {
                                /* last range matched, but no mm flag found */
                                printf("never found mm flag.\n");
                                break;
                        } else if (caddr < lo) {
                                /* never found the range for caddr */
                                printf("#######no match for addr %p.\n", caddr);
                                break;
                        } else if (caddr < hi) {
                                /* start address is in this range */
                                size_t rangelen = (size_t)(hi - caddr);

                                /* remember that matching has started */
                                needmm = 1;

                                /* calculate remaining range to search for */
                                if (len > rangelen) {
                                        len -= rangelen;
                                        caddr += rangelen;
                                        printf("matched %zu bytes in range "
                                                "%p-%p, %zu left over.\n",
                                                        rangelen, lo, hi, len);
                                } else {
                                        len = 0;
                                        printf("matched all bytes in range "
                                                        "%p-%p.\n", lo, hi);
                                }
                        }
                } else if (needmm && strncmp(line, vmflags,
                                        sizeof(vmflags) - 1) == 0) {
                        if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                                printf("mm flag found.\n");
                                if (len == 0) {
                                        /* entire range matched */
                                        retval = 1;
                                        break;
                                }
                                needmm = 0;     /* saw what was needed */
                        } else {
                                /* mm flag not set for some or all of range */
                                printf("range has no mm flag.\n");
                                break;
                        }
                }
        }

        fclose(fp);

        printf("returning %d.\n", retval);
        return retval;
}

void *Addr;
size_t Size;

/*
* worker -- the work each thread performs
*/
static void *
worker(void *arg)
{
        int *ret = (int *)arg;
        *ret =  is_pmem_proc(Addr, Size);
        return NULL;
}

int main(int argc, char *argv[])
{
        if (argc <  2 || argc > 3) {
                printf("usage: %s file [env].\n", argv[0]);
                return -1;
        }

        int fd = open(argv[1], O_RDWR);

        struct stat stbuf;
        fstat(fd, &stbuf);

        Size = stbuf.st_size;
        Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

        close(fd);

        pthread_t threads[NTHREAD];
        int ret[NTHREAD];

        /* kick off NTHREAD threads */
        for (int i = 0; i < NTHREAD; i++)
                pthread_create(&threads[i], NULL, worker, &ret[i]);

        /* wait for all the threads to complete */
        for (int i = 0; i < NTHREAD; i++)
                pthread_join(threads[i], NULL);

        /* verify that all the threads return the same value */
        for (int i = 1; i < NTHREAD; i++) {
                if (ret[0] != ret[i]) {
                        printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                                ret[0], ret[i]);
                }
        }

        printf("%d", ret[0]);
        return 0;
}

It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process

It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:

        if (last_addr) {
                vma = find_vma(mm, last_addr);
                if (vma && (vma = m_next_vma(priv, vma)))
                        return vma;
        }

However, VMA will be lost if the last VMA is gone, e.g:

The process VMA list is A->B->C->D

CPU 0                                  CPU 1
read() system call
   handle VMA B
   version = B
return to userspace

                                   unmap VMA B

issue read() again to continue to get
the region info
   find_vma(version) will get VMA C
   m_next_vma(C) will get VMA D
   handle D
   !!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA.  m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma.  This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk.  This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.

[[email protected]: changelog fixes]
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Robert Hu <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Stefan Hajnoczi <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self

In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
== current, but the CAP_SYS_NICE doesn't.

Thus while the previous commit was intended to loosen the needed
privileges to modify a processes timerslack, it needlessly restricted a
task modifying its own timerslack via the proc/<tid>/timerslack_ns
(which is permitted also via the PR_SET_TIMERSLACK method).

This patch corrects this by checking if p == current before checking the
CAP_SYS_NICE value.

This patch applies on top of my two previous patches currently in -mm

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Stultz <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oren Laadan <[email protected]>
Cc: Ruchi Kandoi <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Nick Kralevich <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Elliott Hughes <[email protected]>
Cc: Android Kernel Team <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: add LSM hook checks to /proc/<tid>/timerslack_ns

As requested, this patch checks the existing LSM hooks
task_getscheduler/task_setscheduler when reading or modifying the task's
timerslack value.

Previous versions added new get/settimerslack LSM hooks, but since they
checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
suggested we just use the existing ones.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Stultz <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oren Laadan <[email protected]>
Cc: Ruchi Kandoi <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Nick Kralevich <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Elliott Hughes <[email protected]>
Cc: James Morris <[email protected]>
Cc: Android Kernel Team <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: relax /proc/<tid>/timerslack_ns capability requirements

When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.

So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory. This is considered too high
a privilege level for only needing to change the timerslack.

After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.

So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.

Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Stultz <[email protected]>
Reviewed-by: Nick Kralevich <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oren Laadan <[email protected]>
Cc: Ruchi Kandoi <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Nick Kralevich <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Elliott Hughes <[email protected]>
Cc: Android Kernel Team <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

meminfo: break apart a very long seq_printf with #ifdefs

Use a specific routine to emit most lines so that the code is easier to
read and maintain.

akpm:
   text    data     bss     dec     hex filename
   2976       8       0    2984     ba8 fs/proc/meminfo.o before
   2669       8       0    2677     a75 fs/proc/meminfo.o after

Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char

Allow some seq_puts removals by taking a string instead of a single
char.

[[email protected]: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: faster /proc/*/status

top(1) opens the following files for every PID:

/proc/*/stat
/proc/*/statm
/proc/*/status

This patch switches /proc/*/status away from seq_printf().
The result is 13.5% speedup.

Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

      10748.474301      task-clock (msec)         #    0.954 CPUs utilized            ( +-  0.91% )
                12      context-switches          #    0.001 K/sec                    ( +-  1.09% )
                 1      cpu-migrations            #    0.000 K/sec
               104      page-faults               #    0.010 K/sec                    ( +-  0.45% )
    37,424,127,876      cycles                    #    3.482 GHz                      ( +-  0.04% )
     8,453,010,029      stalled-cycles-frontend   #   22.59% frontend cycles idle     ( +-  0.12% )
     3,747,609,427      stalled-cycles-backend    #  10.01% backend cycles idle       ( +-  0.68% )
    65,632,764,147      instructions              #    1.75  insn per cycle
                                                  #    0.13  stalled cycles per insn  ( +-  0.00% )
    13,981,324,775      branches                  # 1300.773 M/sec                    ( +-  0.00% )
       138,967,110      branch-misses             #    0.99% of all branches          ( +-  0.18% )

      11.263885428 seconds time elapsed                                          ( +-  0.04% )
      ^^^^^^^^^^^^

AFTER
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

       9010.521776      task-clock (msec)         #    0.925 CPUs utilized            ( +-  1.54% )
                11      context-switches          #    0.001 K/sec                    ( +-  1.54% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
               103      page-faults               #    0.011 K/sec                    ( +-  0.60% )
    32,352,310,603      cycles                    #    3.591 GHz                      ( +-  0.07% )
     7,849,199,578      stalled-cycles-frontend   #   24.26% frontend cycles idle     ( +-  0.27% )
     3,269,738,842      stalled-cycles-backend    #  10.11% backend cycles idle       ( +-  0.73% )
    56,012,163,567      instructions              #    1.73  insn per cycle
                                                  #    0.14  stalled cycles per insn  ( +-  0.00% )
    11,735,778,795      branches                  # 1302.453 M/sec                    ( +-  0.00% )
        98,084,459      branch-misses             #    0.84% of all branches          ( +-  0.28% )

       9.741247736 seconds time elapsed                                          ( +-  0.07% )
       ^^^^^^^^^^^

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

proc: much faster /proc/vmstat

Every current KDE system has process named ksysguardd polling files
below once in several seconds:

$ strace -e trace=open -p $(pidof ksysguardd)
Process 1812 attached
open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
open("/proc/net/dev", O_RDONLY)         = 8
open("/proc/net/wireless", O_RDONLY)    = -1 ENOENT (No such file or directory)
open("/proc/stat", O_RDONLY)            = 8
open("/proc/vmstat", O_RDONLY)          = 8

Hell knows what it is doing but speed up reading /proc/vmstat by 33%!

Benchmark is open+read+close 1.000.000 times.

BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-vmstat

Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

      13146.768464      task-clock (msec)         #    0.960 CPUs utilized            ( +-  0.60% )
                15      context-switches          #    0.001 K/sec                    ( +-  1.41% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
               104      page-faults               #    0.008 K/sec                    ( +-  0.57% )
    45,489,799,349      cycles                    #    3.460 GHz                      ( +-  0.03% )
     9,970,175,743      stalled-cycles-frontend   #   21.92% frontend cycles idle     ( +-  0.10% )
     2,800,298,015      stalled-cycles-backend    #   6.16% backend cycles idle       ( +-  0.32% )
    79,241,190,850      instructions              #    1.74  insn per cycle
                                                  #    0.13  stalled cycles per insn  ( +-  0.00% )
    17,616,096,146      branches                  # 1339.956 M/sec                    ( +-  0.00% )
       176,106,232      branch-misses             #    1.00% of all branches          ( +-  0.18% )

      13.691078109 seconds time elapsed                                          ( +-  0.03% )
      ^^^^^^^^^^^^

AFTER
$ perf stat -r 10 taskset -c 3 ./proc-vmstat

Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

       8688.353749      task-clock (msec)         #    0.950 CPUs utilized            ( +-  1.25% )
                10      context-switches          #    0.001 K/sec                    ( +-  2.13% )
                 1      cpu-migrations            #    0.000 K/sec
               104      page-faults               #    0.012 K/sec                    ( +-  0.56% )
    30,384,010,730      cycles                    #    3.497 GHz                      ( +-  0.07% )
    12,296,259,407      stalled-cycles-frontend   #   40.47% frontend cycles idle     ( +-  0.13% )
     3,370,668,651      stalled-cycles-backend    #  11.09% backend cycles idle       ( +-  0.69% )
    28,969,052,879      instructions              #    0.95  insn per cycle
                                                  #    0.42  stalled cycles per insn  ( +-  0.01% )
     6,308,245,891      branches                  #  726.058 M/sec                    ( +-  0.00% )
       214,685,502      branch-misses             #    3.40% of all branches          ( +-  0.26% )

       9.146081052 seconds time elapsed                                          ( +-  0.07% )
       ^^^^^^^^^^^

vsnprintf() is slow because:

1. format_decode() is busy looking for format specifier: 2 branches
   per character (not in this case, but in others)

2. approximately million branches while parsing format mini language
   and everywhere

3.  just look at what string() does /proc/vmstat is good case because
   most of its content are strings

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

atomic64: no need for CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE

This came to light when implementing native 64-bit atomics for ARCv2.

The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
to check whether atomic64_dec_if_positive() is available.  It seems it
was needed when not every arch defined it.  However as of current code
the Kconfig option seems needless

- for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a
   generic definition of API is present lib/atomic64.c
- arches with native 64-bit atomics select it in arch/*/Kconfig and
   define the API in their headers

So I see no point in keeping the Kconfig option

Compile tested for:
- blackfin (CONFIG_GENERIC_ATOMIC64)
- x86 (!CONFIG_GENERIC_ATOMIC64)
- ia64

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vineet Gupta <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Zhaoxiu Zeng <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ming Lin <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ia64: implement atomic64_dec_if_positive

This is based on s390 version and needed to get rid of
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vineet Gupta <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

linux/mm.h: canonicalize macro PAGE_ALIGNED() definition

The macro PAGE_ALIGNED() is prone to cause error because it doesn't
follow convention to parenthesize parameter @addr within macro body, for
example unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16); for
the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is
desired but the actual one is (unsigned long)ptr + 16.

It is fixed by simply canonicalizing macro PAGE_ALIGNED() definition.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zijun_hu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: remove unnecessary condition in remove_inode_hugepages

When the huge page is added to the page cahce (huge_add_to_page_cache),
the page private flag will be cleared. since this code
(remove_inode_hugepages) will only be called for pages in the page
cahce, PagePrivate(page) will always be false.

The patch remove the code without any functional change.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhong jiang <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: warn about allocations which stall for too long

Currently we do warn only about allocation failures but small
allocations are basically nofail and they might loop in the page
allocator for a long time.  Especially when the reclaim cannot make any
progress - e.g.  GFP_NOFS cannot invoke the oom killer and rely on a
different context to make a forward progress in case there is a lot
memory used by filesystems.

Give us at least a clue when something like this happens and warn about
allocations which take more than 10s.  Print the basic allocation
context information along with the cumulative time spent in the
allocation as well as the allocation stack.  Repeat the warning after
every 10 seconds so that we know that the problem is permanent rather
than ephemeral.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: consolidate warn_alloc_failed users

warn_alloc_failed is currently used from the page and vmalloc
allocators.  This is a good reuse of the code except that vmalloc would
appreciate a slightly different warning message.  This is already
handled by the fmt parameter except that

  "%s: page allocation failure: order:%u, mode:%#x(%pGg)"

is printed anyway.  This might be quite misleading because it might be a
vmalloc failure which leads to the warning while the page allocator is
not the culprit here.  Fix this by always using the fmt string and only
print the context that makes sense for the particular context (e.g.
order makes only very little sense for the vmalloc context).

Rename the function to not miss any user and also because a later patch
will reuse it also for !failure cases.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

vfs,mm: fix a dead loop in truncate_inode_pages_range()

We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:

...
fd = open();
write(fd, buf, 4096);
preadv64(fd, &iovec, 1, 0xffffffff000);
ftruncate(fd, 0);
...

Then ftruncate() will not return forever.

The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.

When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping.  Then
this page can be found in ->mapping with pagevec_lookup().  After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:

- find a page with index=0xffffffff, since index>=end, this page won't
   be truncated

- index++, and index become 0

- the page with index=0xffffffff will be found again

The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.

Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:

  INFO: task truncate_test:3364 blocked for more than 120 seconds.
  ...
     call_rwsem_down_write_failed+0x17/0x30
     generic_file_write_iter+0x32/0x1c0
     ubifs_write_iter+0xcc/0x170
     __vfs_write+0xc4/0x120
     vfs_write+0xb2/0x1b0
     SyS_write+0x46/0xa0

The page with index=0xffffffff added to ->mapping is useless.  Fix this
by checking the read position before allocating pages.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Fang <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64 Kconfig: select gigantic page

Arm64 supports gigantic pages after commit 084bd29810a5 ("ARM64: mm:
HugeTLB support.") however, it can only be allocated at boottime and
can't be freed.

This patch selects ARCH_HAS_GIGANTIC_PAGE to make gigantic pages can be
allocated and freed at runtime for arch arm64.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yisheng Xie <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE

Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
page. No functional change with this patch.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yisheng Xie <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

oom: print nodemask in the oom report

We have received a hard to explain oom report from a customer.  The oom
triggered regardless there is a lot of free memory:

  PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
  PoolThread cpuset=/ mems_allowed=0-7
  Pid: 30055, comm: PoolThread Tainted: G           E X 3.0.101-80-default #1
  Call Trace:
    dump_trace+0x75/0x300
    dump_stack+0x69/0x6f
    dump_header+0x8e/0x110
    oom_kill_process+0xa6/0x350
    out_of_memory+0x2b7/0x310
    __alloc_pages_slowpath+0x7dd/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_anonymous_page+0x13e/0x300
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30
  [...]
  active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
   active_file:13093 inactive_file:222506 isolated_file:0
   unevictable:262144 dirty:2 writeback:0 unstable:0
   free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
   mapped:261139 shmem:166297 pagetables:2228282 bounce:0
  [...]
  Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2892 775542 775542
  Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 772650 772650
  Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0

So all nodes but Node 0 have a lot of free memory which should suggest
that there is an available memory especially when mems_allowed=0-7.  One
could speculate that a massive process has managed to terminate and free
up a lot of memory while racing with the above allocation request.
Although this is highly unlikely it cannot be ruled out.

A further debugging, however shown that the faulting process had
mempolicy (not cpuset) to bind to Node 0.  We cannot see that
information from the report though.  mems_allowed turned out to be more
confusing than really helpful.

Fix this by always priting the nodemask.  It is either mempolicy mask
(and non-null) or the one defined by the cpusets.  The new output for
the above oom report would be

  PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

This patch doesn't touch show_mem and the node filtering based on the
cpuset node mask because mempolicy is always a subset of cpusets and
seeing the full cpuset oom context might be helpful for tunning more
specific mempolicies inside cpusets (e.g.  when they turn out to be too
restrictive).  To prevent from ugly ifdefs the mask is printed even for
!NUMA configurations but this should be OK (a single node will be
printed).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Sellami Abdelkader <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Sellami Abdelkader <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: clarify why we avoid page_mapcount() for slab pages in dump_page()

Let's add comment on why we skip page_mapcount() for sl[aou]b pages.

Link: http://lkml.kernel.org/r/20160922105532.GB24593@node
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb

The old code was always doing:

   vma->vm_end = next->vm_end
   vma_rb_erase(next) // in __vma_unlink
   vma->vm_next = next->vm_next // in __vma_unlink
   next = vma->vm_next
   vma_gap_update(next)

The new code still does the above for remove_next == 1 and 2, but for
remove_next == 3 it has been changed and it does:

   next->vm_start = vma->vm_start
   vma_rb_erase(vma) // in __vma_unlink
   vma_gap_update(next)

In the latter case, while unlinking "vma", validate_mm_rb() is told to
ignore "vma" that is being removed, but next->vm_start was reduced
instead. So for the new case, to avoid the false positive from
validate_mm_rb, it should be "next" that is ignored when "vma" is
being unlinked.

"vma" and "next" in the above comment, considered pre-swap().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Tested-by: Shaun Tancheff <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vma_adjust: minor comment correction

The cases are three not two.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vma_adjust: remove superfluous check for next not NULL

If next would be NULL we couldn't reach such code path.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk

The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations).  So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.

The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.

The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.

  migrate                   mprotect
  ------------ -------------
  migrating in "next" vma
vma_merge() # removes "next" vma and
             # extends "current" vma
    # current vma is not with
    # vm_page_prot updated
  remove_migration_ptes
  read vm_page_prot of current "vma"
  establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated

This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.

Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.

This fix removes the oddness factor from case 8 and it converts it from:

      AAAA
  PPPPNNNNXXXX -> PPPPNNNNNNNN

to:

      AAAA
  PPPPNNNNXXXX -> PPPPXXXXXXXX

XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully.  It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Aditya Mandaleeka <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case

mm->highest_vm_end doesn't need any update.

After finally removing the oddness from vma_merge case 8 that was
causing:

1) constant risk of trouble whenever anybody would check vma fields
   from rmap_walks, like it happened when page migration was
   introduced and it read the vma->vm_page_prot from a rmap_walk

2) the callers of vma_merge to re-initialize any value different from
   the current vma, instead of vma_merge() more reliably returning a
   vma that already matches all fields passed as parameter

.. it is also worth to take the opportunity of cleaning up superfluous
code in vma_adjust(), that if not removed adds up to the hard
readability of the function.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE

vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
concurrently and this prevents the risk of reading intermediate values.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Vorlicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()

According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
in rare cases cause a hung task warning.

At present, if alloc_stable_node() allocation fails, two break_cows may
want to allocate a couple of pages, and the issue will come up when free
memory is under pressure.

We fix it by adding __GFP_HIGH to GFP, to grant access to memory
reserves, increasing the likelihood of allocation success.

[[email protected]: tweak comment]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhong jiang <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_isolation: fix typo: "paes" -> "pages"

Fix typo in comment.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yisheng Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: improve locking in dissolve_free_huge_pages()

For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
call dissolve_free_huge_page() which takes the hugetlb spinlock, even if
the page is not huge at all or a hugepage that is in-use.

Improve this by doing the PageHuge() and page_count() checks already in
dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
dissolve_free_huge_page(), when holding the spinlock, those checks need
to be revalidated.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gerald Schaefer <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Rui Teng <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: check for reserved hugepages during memory offline

In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.

Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gerald Schaefer <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Rui Teng <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: fix memory offline with hugepage size > memory block size

Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

This addresses several issues with hugepages and memory offline.  While
the first patch fixes a panic, and is therefore rather important, the
last patch is just a performance optimization.

The second patch fixes a theoretical issue with reserved hugepages,
while still leaving some ugly usability issue, see description.

This patch (of 3):

dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
list corruption and addressing exception when trying to set a memory
block offline that is part (but not the first part) of a "gigantic"
hugetlb page with a size > memory block size.

When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
will trigger directly.  In the other case we will run into an addressing
exception later, because dissolve_free_huge_page() will not work on the
head page of the compound hugetlb page which will result in a NULL
hstate from page_hstate().

To fix this, first remove the VM_BUG_ON() because it is wrong, and then
use the compound head page in dissolve_free_huge_page().  This means
that an unused pre-allocated gigantic page that has any part of itself
inside the memory block that is going offline will be dissolved
completely.  Losing an unused gigantic hugepage is preferable to failing
the memory offline, for example in the situation where a (possibly
faulty) memory DIMM needs to go offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gerald Schaefer <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Rui Teng <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: nobootmem: move the comment of free_all_bootmem

Commit b4def3509d18 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
removed the unnecessary nodeid argument, after that, this comment
becomes more confused. We should move it to the right place.

Fixes: b4def3509d18c1db9 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wanlong Gao <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/shmem.c: constify anon_ops

Every other dentry_operations instance is const, and this one might as
well be.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: consolidate cgroup socket tracking

The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: move phys_mem_access_prot_allowed() declaration to pgtable.h

We get 1 warning when building kernel with W=1:

drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
int __weak phys_mem_access_prot_allowed(struct file *file,

In fact, its declaration is spreading to several header files in
different architecture, but need to be declare in common header file.

So this patch moves phys_mem_access_prot_allowed() to pgtable.h.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoyou Xie <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Ralf Baechle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE()

So they are CONFIG_DEBUG_VM-only and more informative.

Cc: Al Viro <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Santosh Shilimkar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: don't emit warning from pagefault_out_of_memory()

Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
to warn when we raced with disabling the OOM killer.

Now, patch "oom, suspend: fix oom_killer_disable vs.  pm suspend
properly" introduced a timeout for oom_killer_disable().  Even if we
raced with disabling the OOM killer and the system is OOM livelocked,
the OOM killer will be enabled eventually (in 20 seconds by default) and
the OOM livelock will be solved.  Therefore, we no longer need to warn
when we raced with disabling the OOM killer.

Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: restrict fragindex to costly orders

Fragmentation index and the vm.extfrag_threshold sysctl is meant as a
heuristic to prevent excessive compaction for costly orders (i.e.  THP).
It's unlikely to make any difference for non-costly orders, especially
with the default threshold.  But we cannot afford any uncertainty for
the non-costly orders where the only alternative to successful
reclaim/compaction is OOM.  After the recent patches we are guaranteed
maximum effort without heuristics from compaction before deciding OOM,
and fragindex is the last remaining heuristic.  Therefore skip fragindex
altogether for non-costly orders.

Suggested-by: Michal Hocko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, compaction: ignore fragindex from compaction_zonelist_suitable()

The compaction_zonelist_suitable() function tries to determine if
compaction will be able to proceed after sufficient reclaim, i.e.
whether there are enough reclaimable pages to provide enough order-0
freepages for compaction.

This addition of reclaimable pages to the free pages works well for the
order-0 watermark check, but in the fragmentation index check we only
consider truly free pages. Thus we can get fragindex value close to 0
which indicates failure do to lack of memory, and wrongly decide that
compaction won't be suitable even after reclaim.

Instead of trying to somehow adjust fragindex for reclaimable pages,
let's just skip it from compaction_zonelist_suitable().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>