Willem de Bruijn [Wed, 26 Oct 2016 15:23:07 +0000 (11:23 -0400)]
packet: on direct_xmit, limit tso and csum to supported devices
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.
Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.
Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.
Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.
When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.
Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c
Ganesh Goudar [Wed, 26 Oct 2016 07:56:38 +0000 (13:26 +0530)]
cxgb4: Fix error handling in alloc_uld_rxqs().
Fix to release resources properly in error handling path of
alloc_uld_rxqs(), This patch also removes unwanted arguments
and avoids calling the same function twice.
Fixes: 94cdb8bb993a (cxgb4: Add support for dynamic allocation
of resources for ULD Signed-off-by: Ganesh Goudar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Arnd Bergmann [Tue, 25 Oct 2016 16:16:20 +0000 (18:16 +0200)]
IB/mlx4: avoid a -Wmaybe-uninitialize warning
There is an old warning about mlx4_SW2HW_EQ_wrapper on x86:
ethernet/mellanox/mlx4/resource_tracker.c: In function ‘mlx4_SW2HW_EQ_wrapper’:
ethernet/mellanox/mlx4/resource_tracker.c:3071:10: error: ‘eq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
The problem here is that gcc won't track the state of the variable
across a spin_unlock. Moving the assignment out of the lock is
safe here and avoids the warning.
Eli Cooper [Wed, 26 Oct 2016 02:11:09 +0000 (10:11 +0800)]
ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()
This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an
IPv6 header is installed to a socket buffer.
This is not a cosmetic change. Without updating this value, GSO packets
transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and
skb_mac_gso_segment() will attempt to call gso_segment() for IPv4,
which results in the packets being dropped.
Fixes: b8921ca83eed ("ip4ip6: Support for GSO/GRO") Signed-off-by: Eli Cooper <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Borkmann [Tue, 25 Oct 2016 22:37:53 +0000 (00:37 +0200)]
bpf: fix samples to add fake KBUILD_MODNAME
Some of the sample files are causing issues when they are loaded with tc
and cls_bpf, meaning tc bails out while trying to parse the resulting ELF
file as program/map/etc sections are not present, which can be easily
spotted with readelf(1).
Currently, BPF samples are including some of the kernel headers and mid
term we should change them to refrain from this, really. When dynamic
debugging is enabled, we bail out due to undeclared KBUILD_MODNAME, which
is easily overlooked in the build as clang spills this along with other
noisy warnings from various header includes, and llc still generates an
ELF file with mentioned characteristics. For just playing around with BPF
examples, this can be a bit of a hurdle to take.
Just add a fake KBUILD_MODNAME as a band-aid to fix the issue, same is
done in xdp*_kern samples already.
Fixes: 65d472fb007d ("samples/bpf: add 'pointer to packet' tests") Fixes: 6afb1e28b859 ("samples/bpf: Add tunnel set/get tests.") Fixes: a3f74617340b ("cgroup: bpf: Add an example to do cgroup checking in BPF") Reported-by: Chandrasekar Kannan <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Sat, 29 Oct 2016 18:19:02 +0000 (11:19 -0700)]
Merge tag 'char-misc-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a few small char/misc driver fixes for reported issues.
The "biggest" are two binder fixes for reported issues that have been
shipping in Android phones for a while now, the others are various
fixes for reported problems.
And there's a MAINTAINERS update for good measure.
All have been in linux-next with no reported issues"
* tag 'char-misc-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
MAINTAINERS: Add entry for genwqe driver
VMCI: Doorbell create and destroy fixes
GenWQE: Fix bad page access during abort of resource allocation
vme: vme_get_size potentially returning incorrect value on failure
extcon: qcom-spmi-misc: Sync the extcon state on interrupt
hv: do not lose pending heartbeat vmbus packets
mei: txe: don't clean an unprocessed interrupt cause.
ANDROID: binder: Clear binder and cookie when setting handle in flat binder struct
ANDROID: binder: Add strong ref checks
Olof Johansson [Sat, 29 Oct 2016 18:09:37 +0000 (11:09 -0700)]
Merge tag 'v4.9-rockchip-dts64-fixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into fixes
Correct regulator handling on Rockchip arm64 boards to make
bind/unbind calls work correctly and remove a sdio-only
property from non-sdio mmc hosts, that accidentially was
added there.
* tag 'v4.9-rockchip-dts64-fixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
arm64: dts: rockchip: remove the abuse of keep-power-in-suspend
arm64: dts: rockchip: remove always-on and boot-on from vcc_sd
Olof Johansson [Sat, 29 Oct 2016 18:09:11 +0000 (11:09 -0700)]
Merge tag 'arm-soc/for-4.9/devicetree-arm64-fixes' of http://github.com/Broadcom/stblinux into fixes
This pull request contains a single fix for Broadcom ARM64-based SoCs:
- Ray adds the required bus width and OOB sector size properties to the
Northstar 2 SVK reference board in order for the NAND controller to work
properly
* tag 'arm-soc/for-4.9/devicetree-arm64-fixes' of http://github.com/Broadcom/stblinux:
arm64: dts: Updated NAND DT properties for NS2 SVK
Olof Johansson [Sat, 29 Oct 2016 18:08:50 +0000 (11:08 -0700)]
Merge tag 'imx-fixes-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixes
The i.MX fixes for 4.9:
- A couple of patches from Fabio to fix the GPC power domain regression
which is caused by PM Domain core change 0159ec670763dd
("PM / Domains: Verify the PM domain is present when adding a
provider"), and a related kernel crash seen with multi_v7_defconfig
build.
- Correct the PHY ID mask for AR8031 to match phy driver code.
- Apply new added timer erratum A008585 for LS1043A and LS2080A SoC.
- Correct vf610 global timer IRQ flag to avoid warning from gic driver
after commit 992345a58e0c ("irqchip/gic: WARN if setting the
interrupt type for a PPI fails").
* tag 'imx-fixes-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
ARM: imx: mach-imx6q: Fix the PHY ID mask for AR8031
ARM: dts: vf610: fix IRQ flag of global timer
ARM: imx: gpc: Fix the imx_gpc_genpd_init() error path
ARM: imx: gpc: Initialize all power domains
arm64: dts: Add timer erratum property for LS2080A and LS1043A
Linus Torvalds [Sat, 29 Oct 2016 17:57:40 +0000 (10:57 -0700)]
Merge tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two small driver core / kernfs fixes for 4.9-rc3.
One makes the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more
explicit that this is a crazy thing to enable for a distro kernel
(thanks for trying Fedora!), the other resolves an issue with vim
opening kernfs files (sysfs, configfs, etc.)
Both have been in linux-next with no reported issues"
* tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Make Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger
kernfs: Add noop_fsync to supported kernfs_file_fops
Linus Torvalds [Sat, 29 Oct 2016 17:20:59 +0000 (10:20 -0700)]
Merge tag 'staging-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some small staging and iio driver fixes for reported issues
for 4.9-rc3. Nothing major, the "largest" being a lustre fix for a
sysfs file that was obviously wrong, and had never been tested, so it
was moved to debugfs as that is where it belongs. The others are small
bug fixes for reported issues with various staging or iio drivers.
All have been in linux-next for a while with no reported issues"
* tag 'staging-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
greybus: fix a leak on error in gb_module_create()
greybus: es2: fix error return code in ap_probe()
greybus: arche-platform: Add missing of_node_put() in arche_platform_change_state()
staging: android: ion: Fix error handling in ion_query_heaps()
iio: accel: sca3000_core: avoid potentially uninitialized variable
iio:chemical:atlas-ph-sensor: Fix use of 32 bit int to hold 16 bit big endian value
staging/lustre/llite: Move unstable_stats from sysfs to debugfs
Staging: wilc1000: Fix kernel Oops on opening the device
staging: android/ion: testing the wrong variable
Staging: greybus: uart: Use gbphy_dev->dev instead of bundle->dev
Staging: greybus: gpio: Use gbphy_dev->dev instead of bundle->dev
iio: adc: ti-adc081c: Select IIO_TRIGGERED_BUFFER to prevent build errors
iio: maxim_thermocouple: Align 16 bit big endian value of raw reads
Linus Torvalds [Sat, 29 Oct 2016 17:17:52 +0000 (10:17 -0700)]
Merge tag 'tty-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a number of small tty and serial driver fixes for reported
issues for 4.9-rc3. Nothing major, but they do resolve a bunch of
problems with the tty core changes that are in 4.9-rc1, and finally
the atmel serial driver is back working properly.
All have been in linux-next with no reported issues"
* tag 'tty-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: serial_core: fix NULL struct tty pointer access in uart_write_wakeup
tty: serial_core: Fix serial console crash on port shutdown
tty/serial: at91: fix hardware handshake on Atmel platforms
vt: clear selection before resizing
sc16is7xx: always write state when configuring GPIO as an output
sh-sci: document R8A7743/5 support
tty: serial: 8250: 8250_core: NXP SC16C2552 workaround
tty: limit terminal size to 4M chars
tty: serial: fsl_lpuart: Fix Tx DMA edge case
serial: 8250_lpss: enable MSI for sure
serial: core: fix console problems on uart_close
serial: 8250_uniphier: fix clearing divisor latch access bit
serial: 8250_uniphier: fix more unterminated string
serial: pch_uart: add terminate entry for dmi_system_id tables
devicetree: bindings: uart: Add new compatible string for ZynqMP
serial: xuartps: Add new compatible string for ZynqMP
serial: SERIAL_STM32 should depend on HAS_DMA
serial: stm32: Fix comparisons with undefined register
tty: vt, fix bogus division in csi_J
Craig Gallek [Tue, 25 Oct 2016 22:08:49 +0000 (18:08 -0400)]
inet: Fix missing return value in inet6_hash
As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f5862 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6. However, the latter does
not respect the former's convention.
This properly propagates the hash errors in the IPv6 case.
Fixes: 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function") Reported-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Craig Gallek <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This series contains some bug fixes for the mlx5 core and mlx5e driver.
From Daniel:
- Cache line size determination at runtime, instead of using
L1_CACHE_BYTES hard coded value, use cache_line_size()
- Always Query HCA caps after setting them even on reset flow
From Mohamad:
- Reorder netdev cleanup to uregister netdev before detaching it
for the kernel to not complain about open resources such as vlans
- Change the acl enable prototype to return status, for better error
resiliency
- Clear health sick bit when starting health poll after reset flow
- Fix race between PCI error handlers and health work
- PCI error recovery health care simulation, in case when the kernel
PCI error handlers are not triggered for some internal firmware errors
From Noa:
- Avoid passing dma address 0 to firmware when mapping system pages
to the firmware
From Paul: Some straight forward flow steering fixes
- Keep autogroups list ordered
- Fix autogroups groups num not decreasing
- Correctly initialize last use of flow counters
From Saeed:
- Choose the nearest LRO timeout to the wanted one
instead of blindly choosing "dev_cap.lro_timeout[2]"
This series has no conflict with the for-next pull request posted
earlier today ("Mellanox mlx5 core driver updates 2016-10-25").
====================
net/mlx5: PCI error recovery health care simulation
In case that the kernel PCI error handlers are not called, we will
trigger our own recovery flow.
The health work will give priority to the kernel pci error handlers to
recover the PCI by waiting for a small period, if the pci error handlers
are not triggered the manual recovery flow will be executed.
We don't save pci state in case of manual recovery because it will ruin the
pci configuration space and we will lose dma sync.
Fixes: 89d44f0a6c73 ('net/mlx5_core: Add pci error handlers to mlx5_core driver') Signed-off-by: Mohamad Haj Yahia <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net/mlx5: Fix race between PCI error handlers and health work
Currently there is a race between the health care work and the kernel
pci error handlers because both of them detect the error, the first one
to be called will do the error handling.
There is a chance that health care will disable the pci after resuming
pci slot.
Also create a separate WQ because now we will have two types of health
works, one for the error detection and one for the recovery.
Fixes: 89d44f0a6c73 ('net/mlx5_core: Add pci error handlers to mlx5_core driver') Signed-off-by: Mohamad Haj Yahia <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net/mlx5: Clear health sick bit when starting health poll
The health sick status should be cleared when we start the health poll.
This is crucial for driver reload (unload + load) in order to behave
right in case of health issue.
Fixes: fd76ee4da55a ('net/mlx5_core: Fix internal error detection conditions') Signed-off-by: Mohamad Haj Yahia <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Detaching the netdev before unregistering it cause some netdev cleanup
ndos to fail because they check presence of the netdev, so we need to
unregister the netdev first.
Fixes: 26e59d8077a3 ('net/mlx5e: Implement mlx5e interface attach/detach callbacks') Signed-off-by: Mohamad Haj Yahia <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Paul Blakey [Tue, 25 Oct 2016 15:36:28 +0000 (18:36 +0300)]
net/mlx5: Correctly initialize last use of flow counters
Currently, last use timestamp is initialized to zero.
This is not the expected value by higher layers such as
when we do TC action offloading. To fix that, set it to
the current time, e.g when the counter/rule is offloaded.
This is the same behaviour of non-offloaded TC actions.
Fixes: 43a335e055bb ('mlx5_core: Flow counters infrastructure') Signed-off-by: Paul Blakey <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Paul Blakey [Tue, 25 Oct 2016 15:36:26 +0000 (18:36 +0300)]
net/mlx5: Keep autogroups list ordered
Finding a new autogroup range is done by going over a group list
sorted by each group start index. The search is stopped after finding
the first free range. Adding the newly created group to the list is
wrongly added to the end of the list regardless of its start index as
the parameter of where to insert it is ignored.
This commit makes sure to use that unused parameter to insert
it where requested.
Fixes: f0d22d187473 ('net/mlx5_core: Introduce flow steering autogrouped flow table') Signed-off-by: Paul Blakey <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Jurgens [Tue, 25 Oct 2016 15:36:25 +0000 (18:36 +0300)]
net/mlx5: Always Query HCA caps after setting them
Always query the HCA caps after setting them to update the capablities
data structures. Not doing so results in incorrect capabilities being
reported including max_dc, max_qp and several others.
Fixes: 59211bd3b632 ("net/mlx5: Split the load/unload flow into hardware
and software flows") Signed-off-by: Daniel Jurgens <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Daniel Jurgens [Tue, 25 Oct 2016 15:36:24 +0000 (18:36 +0300)]
{net, ib}/mlx5: Make cache line size determination at runtime.
ARM 64B cache line systems have L1_CACHE_BYTES set to 128.
cache_line_size() will return the correct size.
Fixes: cf50b5efa2fe('net/mlx5_core/ib: New device capabilities
handling.') Signed-off-by: Daniel Jurgens <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.
The fix is to just move the check upwards so it's also validated for the
1st chunk.
Thomas Gleixner [Sat, 29 Oct 2016 11:42:42 +0000 (13:42 +0200)]
x86/smpboot: Init apic mapping before usage
The recent changes, which forced the registration of the boot cpu on UP
systems, which do not have ACPI tables, have been fixed for systems w/o
local APIC, but left a wreckage for systems which have neither ACPI nor
mptables, but the CPU has an APIC, e.g. virtualbox.
The boot process crashes in prefill_possible_map() as it wants to register
the boot cpu, which needs to access the local apic, but the local APIC is
not yet mapped.
There is no reason why init_apic_mapping() can't be invoked before
prefill_possible_map(). So instead of playing another silly early mapping
game, as the ACPI/mptables code does, we just move init_apic_mapping()
before the call to prefill_possible_map().
In hindsight, I should have noticed that combination earlier.
Linus Torvalds [Sat, 29 Oct 2016 01:34:19 +0000 (18:34 -0700)]
Merge tag 'acpi-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix recent ACPICA regressions, an older PCI IRQ management
regression, and an incorrect return value of a function in the APEI
code.
Specifics:
- Fix three ACPICA issues related to the interpreter locking and
introduced by recent changes in that area (Lv Zheng).
- Fix a PCI IRQ management regression introduced during the 4.7 cycle
and related to the configuration of shared IRQs on systems with an
ISA bus (Sinan Kaya).
- Fix up a return value of one function in the APEI code (Punit
Agrawal)"
* tag 'acpi-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Dispatcher: Fix interpreter locking around acpi_ev_initialize_region()
ACPICA: Dispatcher: Fix an unbalanced lock exit path in acpi_ds_auto_serialize_method()
ACPICA: Dispatcher: Fix order issue of method termination
ACPI / APEI: Fix incorrect return value of ghes_proc()
ACPI/PCI: pci_link: Include PIRQ_PENALTY_PCI_USING for ISA IRQs
ACPI/PCI: pci_link: penalize SCI correctly
ACPI/PCI/IRQ: assign ISA IRQ directly during early boot stages
Linus Torvalds [Sat, 29 Oct 2016 01:29:13 +0000 (18:29 -0700)]
Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two intel_pstate issues related to the way it works when the
scaling_governor sysfs attribute is set to "performance" and fix up
messages in the system suspend core code.
Specifics:
- Fix a missing KERN_CONT in a system suspend message by converting
the affected code to using pr_info() and pr_cont() instead of the
"raw" printk() (Jon Hunter).
- Make intel_pstate set the CPU P-state from its .set_policy()
callback when the scaling_governor sysfs attribute is set to
"performance" so that it interacts with NOHZ_FULL more predictably
which was the case before 4.7 (Rafael Wysocki).
- Make intel_pstate always request the maximum allowed P-state when
the scaling_governor sysfs attribute is set to "performance" to
prevent it from effectively ingoring that setting is some
situations (Rafael Wysocki)"
* tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Always set max P-state in performance mode
PM / suspend: Fix missing KERN_CONT for suspend message
cpufreq: intel_pstate: Set P-state upfront in performance mode
Linus Torvalds [Sat, 29 Oct 2016 00:02:58 +0000 (17:02 -0700)]
Merge tag 'arc-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC updates from Vineet Gupta:
- support IDU intc for UP builds
- support gz, lzma compressed uImage [Daniel Mentz]
- adjust /proc/cpuinfo for non-continuous cpu ids [Noam Camus]
- syscall for userspace cmpxchg assist for configs lacking hardware atomics
- rework of boot log printing mainly for identifying older arc700 cores
- retiring some old code, build toggles
* tag 'arc-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: module: print pretty section names
ARC: module: elide loop to save reference to .eh_frame
ARC: mm: retire ARC_DBG_TLB_MISS_COUNT...
ARC: build: retire old toggles
ARC: boot log: refactor cpu name/release printing
ARC: boot log: remove awkward space comma from MMU line
ARC: boot log: don't assume SWAPE instruction support
ARC: boot log: refactor printing abt features not captured in BCRs
ARCv2: boot log: print IOC exists as well as enabled status
ARCv2: IOC: use @ioc_enable not @ioc_exist where intended
ARC: syscall for userspace cmpxchg assist
ARC: fix build warning in elf.h
ARC: Adjust cpuinfo for non-continuous cpu ids
ARC: [build] Support gz, lzma compressed uImage
ARCv2: intc: untangle SMP, MCIP and IDU
Merge branches 'acpica-fixes', 'acpi-pci-fixes' and 'acpi-apei-fixes'
* acpica-fixes:
ACPICA: Dispatcher: Fix interpreter locking around acpi_ev_initialize_region()
ACPICA: Dispatcher: Fix an unbalanced lock exit path in acpi_ds_auto_serialize_method()
ACPICA: Dispatcher: Fix order issue of method termination
* acpi-pci-fixes:
ACPI/PCI: pci_link: Include PIRQ_PENALTY_PCI_USING for ISA IRQs
ACPI/PCI: pci_link: penalize SCI correctly
ACPI/PCI/IRQ: assign ISA IRQ directly during early boot stages
* acpi-apei-fixes:
ACPI / APEI: Fix incorrect return value of ghes_proc()
Lv Zheng [Wed, 26 Oct 2016 07:42:01 +0000 (15:42 +0800)]
ACPICA: Dispatcher: Fix interpreter locking around acpi_ev_initialize_region()
In the code path of acpi_ev_initialize_region(), there is namespace
modification code unlocked. This patch tunes the code to make sure
such modification are always locked.
Fixes: 74f51b80a0c4 (ACPICA: Namespace: Fix dynamic table loading issues) Tested-by: Imre Deak <[email protected]> Signed-off-by: Lv Zheng <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Lv Zheng [Wed, 26 Oct 2016 07:40:20 +0000 (15:40 +0800)]
ACPICA: Dispatcher: Fix an unbalanced lock exit path in acpi_ds_auto_serialize_method()
There is a lock unbalanced exit path in acpi_ds_initialize_method(),
this patch corrects it.
Fixes: 441ad11d078f (ACPICA: Dispatcher: Fix a mutex issue for method auto serialization) Tested-by: Imre Deak <[email protected]> Signed-off-by: Lv Zheng <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Lv Zheng [Wed, 26 Oct 2016 07:40:12 +0000 (15:40 +0800)]
ACPICA: Dispatcher: Fix order issue of method termination
The last step of the method termination should be the end of the method
serialization. Otherwise, the steps happening after it will face the race
issues that cannot be protected by the method serialization mechanism.
This patch fixes this issue by moving the per-method-object deletion code
prior than the end of the method serialization. Otherwise, the possible
race issues may result in AE_ALREADY_EXISTS error in a parallel
environment.
Fixes: 74f51b80a0c4 (ACPICA: Namespace: Fix dynamic table loading issues) Reported-and-tested-by: Imre Deak <[email protected]> Signed-off-by: Lv Zheng <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Linus Torvalds [Fri, 28 Oct 2016 23:52:28 +0000 (16:52 -0700)]
Merge tag 'powerpc-4.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fixes marked for stable:
- Convert cmp to cmpd in idle enter sequence (Segher Boessenkool)
- cxl: Fix leaking pid refs in some error paths (Vaibhav Jain)
- Re-fix race condition between going idle and entering guest (Paul Mackerras)
- Fix race condition in setting lock bit in idle/wakeup code (Paul Mackerras)
- radix: Use tlbiel only if we ever ran on the current cpu (Aneesh Kumar K.V)
- relocation, register save fixes for system reset interrupt (Nicholas Piggin)
Fixes for code merged this cycle:
- Fix CONFIG_ALIVEC typo in restore_tm_state() (Valentin Rothberg)
- KVM: PPC: Book3S HV: Fix build error when SMP=n (Michael Ellerman)"
* tag 'powerpc-4.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: relocation, register save fixes for system reset interrupt
powerpc/mm/radix: Use tlbiel only if we ever ran on the current cpu
powerpc/process: Fix CONFIG_ALIVEC typo in restore_tm_state()
powerpc/64: Fix race condition in setting lock bit in idle/wakeup code
powerpc/64: Re-fix race condition between going idle and entering guest
cxl: Fix leaking pid refs in some error paths
powerpc: Convert cmp to cmpd in idle enter sequence
KVM: PPC: Book3S HV: Fix build error when SMP=n
Linus Torvalds [Fri, 28 Oct 2016 23:27:16 +0000 (16:27 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc kernel fixes: a virtualization environment related fix, an uncore
PMU driver removal handling fix, a PowerPC fix and new events for
Knights Landing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf/powerpc: Don't call perf_event_disable() from atomic context
perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
perf/x86/intel/cstate: Add C-state residency events for Knights Landing
Linus Torvalds [Fri, 28 Oct 2016 18:47:45 +0000 (11:47 -0700)]
Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A build fix, a NULL de-reference found by static analysis, a misuse of
the percpu_ref_exit() (tagged for -stable), and notification of failed
attempts to clear media errors.
These patches have received a build success notification from the
0day- kbuild-robot and appeared in next-20161028"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: fix percpu_ref_exit ordering
nvdimm: make CONFIG_NVDIMM_DAX 'bool'
pmem: report error on clear poison failure
libnvdimm, namespace: potential NULL deref on allocation error
Linus Torvalds [Fri, 28 Oct 2016 18:31:06 +0000 (11:31 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Three arm64 fixes for -rc3. They're all pretty straightforward: a
couple of NUMA issues from the Huawei folks and a thinko in
__page_to_voff that seems to be benign, but is certainly better off
fixed.
Summary:
- couple of NUMA fixes
- thinko in __page_to_voff"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mm: fix __page_to_voff definition
arm64/numa: fix incorrect log for memory-less node
arm64/numa: fix pcpu_cpu_distance() to get correct CPU proximity
Linus Torvalds [Fri, 28 Oct 2016 18:26:01 +0000 (11:26 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Fix four timer locking races: two were noticed by Linus while
reviewing the code while chasing for a corruption bug, and two
from fixing spurious USB timeouts"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Prevent base clock corruption when forwarding
timers: Prevent base clock rewind when forwarding clock
timers: Lock base for same bucket optimization
timers: Plug locking race vs. timer migration
Jiri Pirko [Tue, 25 Oct 2016 09:25:56 +0000 (11:25 +0200)]
mlxsw: spectrum_router: Save requested prefix bitlist when creating tree
Currently, the prefix bitlist is not saved for LPM trees, causing the
compare to always fail which causes the tree to be destroyed and created
for every inserted and removed FIB entry. So fix this by saving
the bitlist as it should have been done from the very beginning.
Fixes: 53342023eed9 ("mlxsw: spectrum_router: Implement LPM trees management") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Fri, 28 Oct 2016 17:12:27 +0000 (10:12 -0700)]
Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool, irq and scheduler fixes from Ingo Molnar:
"One more objtool fixlet for GCC6 code generation patterns, an irq
DocBook fix and an unused variable warning fix in the scheduler"
Vineet Gupta [Tue, 25 Oct 2016 17:43:20 +0000 (10:43 -0700)]
ARC: module: elide loop to save reference to .eh_frame
The loop was really needed in .debug_frame regime where wanted make it
as SH_ALLOC so that apply_relocate_add() would process it. That's not
needed for .eh_frame, so we check this in apply_relocate_add() which
gets called for each section.
Note that we need to save reference to "section name strings" section in
module_frob_arch_sections() since apply_relocate_add() doesn't get that
Vineet Gupta [Thu, 27 Oct 2016 21:33:19 +0000 (14:33 -0700)]
ARC: boot log: refactor cpu name/release printing
The motivation is to identify ARC750 vs. ARC770 (we currently print
generic "ARC700").
A given ARC700 release could be 750 or 770, with same ARCNUM (or family
identifier which is unfortunate). The existing arc_cpu_tbl[] kept a single
concatenated string for core name and release which thus doesn't work
for 750 vs. 770 identification.
So split this into 2 tables, one with core names and other with release.
And while we are at it, get rid of the range checking for family numbers.
We just document the known to exist cores running Linux and ditch
others.
With this in place, we add detection of ARC750 which is
- cores 0x33 and before
- cores 0x34 and later with MMUv2
Vineet Gupta [Fri, 21 Oct 2016 01:08:10 +0000 (18:08 -0700)]
ARC: boot log: don't assume SWAPE instruction support
This came to light when helping a customer with oldish ARC750 core who
were getting instruction errors because of lack of SWAPE but boot log
was incorrectly printing it as being present
Vineet Gupta [Fri, 21 Oct 2016 00:49:15 +0000 (17:49 -0700)]
ARC: boot log: refactor printing abt features not captured in BCRs
On older arc700 cores, some of the features configured were not present
in Build config registers. To print about them at boot, we just use the
Kconfig option i.e. whether linux is built to use them or not.
So yes this seems bogus, but what else can be done. Moreover if linux is
booting with these enabled, then the Kconfig info is a good indicator
anyways.
Over time these "hacks" accumulated in read_arc_build_cfg_regs() as well
as arc_cpu_mumbojumbo(). so refactor and move all of those in a single
place: read_arc_build_cfg_regs(). This causes some code redcution too:
Linus Torvalds [Fri, 28 Oct 2016 17:07:35 +0000 (10:07 -0700)]
Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"My patch fixes the btrfs list_head abuse that we tracked down during
Dave Jones' memory corruption investigation. With both Jens and my
patches in place, I'm no longer able to trigger problems.
Filipe is fixing a difficult old bug between snapshots, balance and
send. Dave is cooking a few more for the next rc, but these are tested
and ready"
* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix races on root_log_ctx lists
btrfs: fix incremental send failure caused by balance
Linus Torvalds [Fri, 28 Oct 2016 17:00:44 +0000 (10:00 -0700)]
Merge tag 'sound-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This contains the usual stuff -- the fixups and quirks for HD-audio
and USB-audio, in addition to a bad regression fix in ALSA sequencer
timer since 4.8, and a trivial fix for asihpi PCI driver"
* tag 'sound-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add quirk for Syntek STK1160
ALSA: seq: Fix time account regression
ALSA: hda - Fix surround output pins for ASRock B150M mobo
ALSA: hda - Fix headset mic detection problem for two Dell laptops
ALSA: asihpi: fix kernel memory disclosure
ALSA: hda - Adding a new group of pin cfg into ALC295 pin quirk table
ALSA: hda - allow 40 bit DMA mask for NVidia devices
Linus Torvalds [Fri, 28 Oct 2016 16:36:07 +0000 (09:36 -0700)]
Merge tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux
Pull drm x86/pat regression fixes from Dave Airlie:
"This is a standalone pull request for the fix for a regression
introduced in -rc1 by a change to vm_insert_mixed to start using the
PAT range tracking to validate page protections. With this fix in
place, all the VRAM mappings for GPU drivers ended up at UC instead of
WC.
There are probably better ways to fix this long term, but nothing I'd
considered for -fixes that wouldn't need more settling in time. So
I've just created a new arch API that the drivers can reserve all
their VRAM aperture ranges as WC"
* tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux:
drm/drivers: add support for using the arch wc mapping API.
x86/io: add interface to reserve io memtype for a resource range. (v1.1)
Linus Torvalds [Fri, 28 Oct 2016 16:27:58 +0000 (09:27 -0700)]
Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a couple DM raid and DM mirror fixes
- a couple .request_fn request-based DM NULL pointer fixes
- a fix for a DM target reference count leak, on target load error,
that prevented associated DM target kernel module(s) from being
removed
* tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: fix missing dm_put_target_type() in dm_table_add_target()
dm rq: clear kworker_task if kthread_run() returned an error
dm: free io_barrier after blk_cleanup_queue call
dm raid: fix activation of existing raid4/10 devices
dm mirror: use all available legs on multiple failures
dm mirror: fix read error on recovery after default leg failure
dm raid: fix compat_features validation
Linus Torvalds [Fri, 28 Oct 2016 16:23:59 +0000 (09:23 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull key fixes from James Morris:
- fix a buffer overflow when displaying /proc/keys [CVE-2016-7042].
- fix broken initialisation in the big_key implementation that can
result in an oops.
- make big_key depend on having a random number generator available in
Kconfig.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security/keys: make BIG_KEYS dependent on stdrng.
KEYS: Sort out big_key initialisation
KEYS: Fix short sprintf buffer in /proc/keys show function
Boris Brezillon [Fri, 28 Oct 2016 09:08:44 +0000 (11:08 +0200)]
ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap()
Commit e96a8a3bb671 ("UBI: Fastmap: Do not add vol if it already
exists") introduced a bug by changing the possible error codes returned
by add_vol():
- this function no longer returns NULL in case of allocation failure
but return ERR_PTR(-ENOMEM)
- when a duplicate entry in the volume RB tree is found it returns
ERR_PTR(-EEXIST) instead of ERR_PTR(-EINVAL)
Fix the tests done on add_vol() return val to match this new behavior.
Jorgen Hansen [Thu, 6 Oct 2016 11:43:08 +0000 (04:43 -0700)]
VMCI: Doorbell create and destroy fixes
This change consists of two changes:
1) If vmci_doorbell_create is called when neither guest nor
host personality as been initialized, vmci_get_context_id
will return VMCI_INVALID_ID. In that case, we should fail
the create call.
2) In doorbell destroy, we assume that vmci_guest_code_active()
has the same return value on create and destroy. That may not
be the case, so we may end up with the wrong refcount.
Instead, destroy should check explicitly whether the doorbell
is in the index table as an indicator of whether the guest
code was active at create time.
Gerald Schaefer [Wed, 19 Oct 2016 10:29:41 +0000 (12:29 +0200)]
GenWQE: Fix bad page access during abort of resource allocation
When interrupting an application which was allocating DMAable
memory, it was possible, that the DMA memory was deallocated
twice, leading to the error symptoms below.
Thanks to Gerald, who analyzed the problem and provided this
patch.
I agree with his analysis of the problem: ddcb_cmd_fixups() ->
genwqe_alloc_sync_sgl() (fails in f/lpage, but sgl->sgl != NULL
and f/lpage maybe also != NULL) -> ddcb_cmd_cleanup() ->
genwqe_free_sync_sgl() (double free, because sgl->sgl != NULL and
f/lpage maybe also != NULL)
In this scenario we would have exactly the kind of double free that
would explain the WARNING / Bad page state, and as expected it is
caused by broken error handling (cleanup).
Using the Ubuntu git source, tag Ubuntu-4.4.0-33.52, he was able to reproduce
the "Bad page state" issue, and with the patch on top he could not reproduce
it any more.
Rob Herring [Fri, 28 Oct 2016 12:07:48 +0000 (07:07 -0500)]
tty: serial_core: fix NULL struct tty pointer access in uart_write_wakeup
Since commit 761ed4a94582ab29 ("tty: serial_core: convert uart_close to
use tty_port_close"), the serial console is broken on various systems
and typing "reboot" splats the following on the serial console:
The problem is for console ports, the serial port is not shutdown and
interrupts may fire after the struct tty is gone. Simply calling the
tty_port helper tty_port_tty_wakeup instead of tty_wakeup directly will
ensure there is a valid struct tty.
tty: serial_core: Fix serial console crash on port shutdown
The port->console flag is always false, as uart_console() is called
before the serial console has been registered.
Hence for a serial port used as the console, uart_tty_port_shutdown()
will still be called when userspace closes the port, powering it down.
This may lead to a system lock up when the serial console driver writes
to the serial port's registers.
To fix this, move the setting of port->console after the call to
uart_configure_port(), which registers the serial console.
Richard Genoud [Thu, 27 Oct 2016 16:04:06 +0000 (18:04 +0200)]
tty/serial: at91: fix hardware handshake on Atmel platforms
After commit 1cf6e8fc8341 ("tty/serial: at91: fix RTS line management
when hardware handshake is enabled"), the hardware handshake wasn't
functional anymore on Atmel platforms (beside SAMA5D2).
To understand why, one has to understand the flag ATMEL_US_USMODE_HWHS
first:
Before commit 1cf6e8fc8341 ("tty/serial: at91: fix RTS line management
when hardware handshake is enabled"), this flag was never set.
Thus, the CTS/RTS where only handled by serial_core (and everything
worked just fine).
This commit introduced the use of the ATMEL_US_USMODE_HWHS flag,
enabling it for all boards when the user space enables flow control.
When the ATMEL_US_USMODE_HWHS is set, the Atmel USART controller
handles a part of the flow control job:
- disable the transmitter when the CTS pin gets high.
- drive the RTS pin high when the DMA buffer transfer is completed or
PDC RX buffer full or RX FIFO is beyond threshold. (depending on the
controller version).
NB: This feature is *not* mandatory for the flow control to work.
(Nevertheless, it's very useful if low latencies are needed.)
Now, the specifics of the ATMEL_US_USMODE_HWHS flag:
- For platforms with DMAC and no FIFOs (sam9x25, sam9x35, sama5D3,
sama5D4, sam9g15, sam9g25, sam9g35)* this feature simply doesn't work.
( source: https://lkml.org/lkml/2016/9/7/598 )
Tested it on sam9g35, the RTS pins always stays up, even when RXEN=1
or a new DMA transfer descriptor is set.
=> ATMEL_US_USMODE_HWHS must not be used for those platforms
- For platforms with a PDC (sam926{0,1,3}, sam9g10, sam9g20, sam9g45,
sam9g46)*, there's another kind of problem. Once the flag
ATMEL_US_USMODE_HWHS is set, the RTS pin can't be driven anymore via
RTSEN/RTSDIS in USART Control Register. The RTS pin can only be driven
by enabling/disabling the receiver or setting RCR=RNCR=0 in the PDC
(Receive (Next) Counter Register).
=> Doing this is beyond the scope of this patch and could add other
bugs, so the original (and working) behaviour should be set for those
platforms (meaning ATMEL_US_USMODE_HWHS flag should be unset).
- For platforms with a FIFO (sama5d2)*, the RTS pin is driven according
to the RX FIFO thresholds, and can be also driven by RTSEN/RTSDIS in
USART Control Register. No problem here.
(This was the use case of commit 1cf6e8fc8341 ("tty/serial: at91: fix
RTS line management when hardware handshake is enabled"))
NB: If the CTS pin declared as a GPIO in the DTS, (for instance
cts-gpios = <&pioA PIN_PB31 GPIO_ACTIVE_LOW>), the transmitter will be
disabled.
=> ATMEL_US_USMODE_HWHS flag can be set for this platform ONLY IF the
CTS pin is not a GPIO.
So, the only case when ATMEL_US_USMODE_HWHS can be enabled is when
(atmel_use_fifo(port) &&
!mctrl_gpio_to_gpiod(atmel_port->gpios, UART_GPIO_CTS))
Tested on all Atmel USART controller flavours:
AT91SAM9G35-CM (DMAC flavour), AT91SAM9G20-EK (PDC flavour),
SAMA5D2xplained (FIFO flavour).
Imre Palik [Fri, 21 Oct 2016 08:18:59 +0000 (01:18 -0700)]
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf doesn't seem to honour the number of fixed counters specified by CPUID
leaf 0xa. It always assumes that Intel CPUs have at least 3 fixed counters.
So if some of the fixed counters are masked out by the hypervisor, it still
tries to check/set them.
This patch makes perf behave nicer when the kernel is running under a
hypervisor that doesn't expose all the counters.
While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.
The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.
But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.
The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.
Borislav Petkov [Thu, 27 Oct 2016 12:36:23 +0000 (14:36 +0200)]
x86/microcode/AMD: Fix more fallout from CONFIG_RANDOMIZE_MEMORY=y
We needed the physical address of the container in order to compute the
offset within the relocated ramdisk. And we did this by doing __pa() on
the virtual address.
However, __pa() does checks whether the physical address is within
PAGE_OFFSET and __START_KERNEL_map - see __phys_addr() - which fail
if we have CONFIG_RANDOMIZE_MEMORY enabled: we feed a virtual address
which *doesn't* have the randomization offset into a function which uses
PAGE_OFFSET which *does* have that offset.
Linus Torvalds [Fri, 28 Oct 2016 02:58:39 +0000 (19:58 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"20 fixes"
* emailed patches from Andrew Morton <[email protected]>:
drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
cris/arch-v32: cryptocop: print a hex number after a 0x prefix
ipack: print a hex number after a 0x prefix
block: DAC960: print a hex number after a 0x prefix
fs: exofs: print a hex number after a 0x prefix
lib/genalloc.c: start search from start of chunk
mm: memcontrol: do not recurse in direct reclaim
CREDITS: update credit information for Martin Kepplinger
proc: fix NULL dereference when reading /proc/<pid>/auxv
mm: kmemleak: ensure that the task stack is not freed during scanning
lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
latent_entropy: raise CONFIG_FRAME_WARN by default
kconfig.h: remove config_enabled() macro
ipc: account for kmem usage on mqueue and msg
mm/slab: improve performance of gathering slabinfo stats
mm: page_alloc: use KERN_CONT where appropriate
mm/list_lru.c: avoid error-path NULL pointer deref
h8300: fix syscall restarting
kcov: properly check if we are in an interrupt
mm/slab: fix kmemcg cache creation delayed issue
Daniel Mentz [Fri, 28 Oct 2016 00:46:59 +0000 (17:46 -0700)]
lib/genalloc.c: start search from start of chunk
gen_pool_alloc_algo() iterates over the chunks of a pool trying to find
a contiguous block of memory that satisfies the allocation request.
The shortcut
if (size > atomic_read(&chunk->avail))
continue;
makes the loop skip over chunks that do not have enough bytes left to
fulfill the request. There are two situations, though, where an
allocation might still fail:
(1) The available memory is not contiguous, i.e. the request cannot
be fulfilled due to external fragmentation.
(2) A race condition. Another thread runs the same code concurrently
and is quicker to grab the available memory.
In those situations, the loop calls pool->algo() to search the entire
chunk, and pool->algo() returns some value that is >= end_bit to
indicate that the search failed. This return value is then assigned to
start_bit. The variables start_bit and end_bit describe the range that
should be searched, and this range should be reset for every chunk that
is searched. Today, the code fails to reset start_bit to 0. As a
result, prefixes of subsequent chunks are ignored. Memory allocations
might fail even though there is plenty of room left in these prefixes of
those other chunks.
Johannes Weiner [Fri, 28 Oct 2016 00:46:56 +0000 (17:46 -0700)]
mm: memcontrol: do not recurse in direct reclaim
On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:
On later kernels, kmem charging is opt-in rather than opt-out, and that
particular kmem allocation in btrfs_releasepage() is no longer being
charged and won't recurse and overrun the stack anymore.
But it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash many
times before we even got a useful stack trace out of it.
Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
Leon Yu [Fri, 28 Oct 2016 00:46:50 +0000 (17:46 -0700)]
proc: fix NULL dereference when reading /proc/<pid>/auxv
Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL. Fix that by checking for NULL mm
and bailing out early. This is also the original behavior changed by
recent commit c5317167854e ("proc: switch auxv to use of __mem_open()").
# cat /proc/2/auxv
Unable to handle kernel NULL pointer dereference at virtual address 000000a8
Internal error: Oops: 17 [#1] PREEMPT SMP ARM
CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
Hardware name: BCM2709
task: ea3b0b00 task.stack: e99b2000
PC is at auxv_read+0x24/0x4c
LR is at do_readv_writev+0x2fc/0x37c
Process cat (pid: 113, stack limit = 0xe99b2210)
Call chain:
auxv_read
do_readv_writev
vfs_readv
default_file_splice_read
splice_direct_to_actor
do_splice_direct
do_sendfile
SyS_sendfile64
ret_fast_syscall
Catalin Marinas [Fri, 28 Oct 2016 00:46:47 +0000 (17:46 -0700)]
mm: kmemleak: ensure that the task stack is not freed during scanning
Commit 68f24b08ee89 ("sched/core: Free the stack early if
CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
during kmemleak_scan() execution, leading to either a NULL pointer fault
(if task->stack is NULL) or kmemleak accessing already freed memory.
This patch uses the new try_get_task_stack() API to ensure that the task
stack is not freed during kmemleak stack scanning.
Dmitry Vyukov [Fri, 28 Oct 2016 00:46:44 +0000 (17:46 -0700)]
lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
KASAN uses stackdepot to memorize stacks for all kmalloc/kfree calls.
Current stackdepot capacity is 16MB (1024 top level entries x 4 pages on
second level). Size of each stack is (num_frames + 3) * sizeof(long).
Which gives us ~84K stacks. This capacity was chosen empirically and it
is enough to run kernel normally.
However, when lots of configs are enabled and a fuzzer tries to maximize
code coverage, it easily hits the limit within tens of minutes. I've
tested for long a time with number of top level entries bumped 4x
(4096). And I think I've seen overflow only once. But I don't have all
configs enabled and code coverage has not reached maximum yet. So bump
it 8x to 8192.
Since we have two-level table, memory cost of this is very moderate --
currently the top-level table is 8KB, with this patch it is 64KB, which
is negligible under KASAN.
Here is some approx math.
128MB allows us to memorize ~670K stacks (assuming stack is ~200b).
I've grepped kernel for kmalloc|kfree|kmem_cache_alloc|kmem_cache_free|
kzalloc|kstrdup|kstrndup|kmemdup and it gives ~60K matches. Most of
alloc/free call sites are reachable with only one stack. But some
utility functions can have large fanout. Assuming average fanout is 5x,
total number of alloc/free stacks is ~300K.
Kees Cook [Fri, 28 Oct 2016 00:46:41 +0000 (17:46 -0700)]
latent_entropy: raise CONFIG_FRAME_WARN by default
When building with the latent_entropy plugin, set the default
CONFIG_FRAME_WARN to 2048, since some __init functions have many basic
blocks that, when instrumented by the latent_entropy plugin, grow beyond
1024 byte stack size on 32-bit builds.
Masahiro Yamada [Fri, 28 Oct 2016 00:46:38 +0000 (17:46 -0700)]
kconfig.h: remove config_enabled() macro
The use of config_enabled() is ambiguous. For config options,
IS_ENABLED(), IS_REACHABLE(), etc. will make intention clearer.
Sometimes config_enabled() has been used for non-config options because
it is useful to check whether the given symbol is defined or not.
I have been tackling on deprecating config_enabled(), and now is the
time to finish this work.
Some new users have appeared for v4.9-rc1, but it is trivial to replace
them:
- arch/x86/mm/kaslr.c
replace config_enabled() with IS_ENABLED() because
CONFIG_X86_ESPFIX64 and CONFIG_EFI are boolean.
- include/asm-generic/export.h
replace config_enabled() with __is_defined().
Then, config_enabled() can be removed now.
Going forward, please use IS_ENABLED(), IS_REACHABLE(), etc. for config
options, and __is_defined() for non-config symbols.
Aristeu Rozanski [Fri, 28 Oct 2016 00:46:35 +0000 (17:46 -0700)]
ipc: account for kmem usage on mqueue and msg
When kmem accounting switched from account by default to only account if
flagged by __GFP_ACCOUNT, IPC mqueue and messages was left out.
The production use case at hand is that mqueues should be customizable
via sysctls in Docker containers in a Kubernetes cluster. This can only
be safely allowed to the users of the cluster (without the risk that
they can cause resource shortage on a node, influencing other users'
containers) if all resources they control are bounded, i.e. accounted
for.
mm/slab: improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
seconds. During this time, interrupts are disabled while walking the
slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
and this sometimes causes timeouts in other drivers (for instance,
Infiniband).
This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large,
it results in a dramatic performance improvement. Getting slabinfo
statistics now only requires walking the slabs_free and slabs_partial
lists, and those lists are usually much smaller than slabs_full.
We tested this after growing the dentry cache to 70GB, and the
performance improved from 2s to 5ms.
As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:
After some analysis it seems to be that the problem is in alloc_super().
In case list_lru_init_memcg() fails it goes into destroy_super(), which
calls list_lru_destroy().
And in list_lru_init() we see that in case memcg_init_list_lru() fails,
lru->node is freed, but not set NULL, which then leads list_lru_destroy()
to believe it is initialized and call memcg_destroy_list_lru().
memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
which is NULL.
Mark Rutland [Fri, 28 Oct 2016 00:46:24 +0000 (17:46 -0700)]
h8300: fix syscall restarting
Back in commit f56141e3e2d9 ("all arches, signal: move restart_block to
struct task_struct"), all architectures and core code were changed to
use task_struct::restart_block. However, when h8300 support was
subsequently restored in v4.2, it was not updated to account for this,
and maintains thread_info::restart_block, which is not kept in sync.
This patch drops the redundant restart_block from thread_info, and moves
h8300 to the common one in task_struct, ensuring that syscall restarting
always works as expected.
Andrey Konovalov [Fri, 28 Oct 2016 00:46:21 +0000 (17:46 -0700)]
kcov: properly check if we are in an interrupt
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable(). Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().
As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.
This issue is caused by kmemcg feature that try to create new set of
kmem_caches for each memcg. Recently, kmem_cache creation is slowed by
synchronize_sched() and futher kmem_cache creation is also delayed since
kmem_cache creation is synchronized by a global slab_mutex lock. So,
the number of kworker that try to create kmem_cache increases quietly.
synchronize_sched() is for lockless access to node's shared array but
it's not needed when a new kmem_cache is created. So, this patch rules
out that case.
Dan Williams [Fri, 28 Oct 2016 00:04:05 +0000 (17:04 -0700)]
device-dax: fix percpu_ref_exit ordering
We need to wait until the percpu_ref is released before exit. Otherwise,
we sometimes lose the race and trigger this new warning that was added
in v4.9 (commit a67823c1ed10 "percpu-refcount: init ->confirm_switch
member properly"):
Linus Torvalds [Thu, 27 Oct 2016 23:23:01 +0000 (16:23 -0700)]
Allow KASAN and HOTPLUG_MEMORY to co-exist when doing build testing
No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
but for build testing there is no reason not to allow them together.
This hopefully means better build coverage and fewer embarrasing silly
problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
variable in memory hotplug") in the future.
Arnd Bergmann [Tue, 25 Oct 2016 15:52:04 +0000 (17:52 +0200)]
nvdimm: make CONFIG_NVDIMM_DAX 'bool'
A bugfix just tried to address a randconfig build problem and introduced
a variant of the same problem: with CONFIG_LIBNVDIMM=y and
CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link:
drivers/nvdimm/built-in.o: In function `to_nd_device_type':
bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2':
region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax'
region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_probe':
region.c:(.text+0x70f3): undefined reference to `nd_dax_create'
drivers/nvdimm/built-in.o: In function `mode_show':
namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa55f): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa56e): undefined reference to `to_nd_dax'
This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again
as it should be (it gets linked into the libnvdimm module). To fix
the original problem, I'm adding a dependency on LIBNVDIMM to
DEV_DAX_PMEM, which ensures we can't have that one built-in if the
rest is a module.
Fixes: 4e65e9381c7a ("/dev/dax: fix Kconfig dependency build breakage") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Signed-off-by: Dan Williams <[email protected]>