Erez Shitrit [Sun, 28 Aug 2016 07:58:31 +0000 (10:58 +0300)]
IB/ipoib: Fix memory corruption in ipoib cm mode connect flow
When a new CM connection is being requested, ipoib driver copies data
from the path pointer in the CM/tx object, the path object might be
invalid at the point and memory corruption will happened later when now
the CM driver will try using that data.
The next scenario demonstrates it:
neigh_add_path --> ipoib_cm_create_tx -->
queue_work (pointer to path is in the cm/tx struct)
#while the work is still in the queue,
#the port goes down and causes the ipoib_flush_paths:
ipoib_flush_paths --> path_free --> kfree(path)
#at this point the work scheduled starts.
ipoib_cm_tx_start --> copy from the (invalid)path pointer:
(memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
-> memory corruption.
To fix that the driver now starts the CM/tx connection only if that
specific path exists in the general paths database.
This check is protected with the relevant locks, and uses the gid from
the neigh member in the CM/tx object which is valid according to the ref
count that was taken by the CM/tx.
Erez Shitrit [Sun, 28 Aug 2016 07:58:30 +0000 (10:58 +0300)]
IB/core: Fix use after free in send_leave function
The function send_leave sets the member: group->query_id
(group->query_id = ret) after calling the sa_query, but leave_handler
can be executed before the setting and it might delete the group object,
and will get a memory corruption.
Additionally, this patch gets rid of group->query_id variable which is
not used.
Baoyou Xie [Sun, 28 Aug 2016 14:57:11 +0000 (22:57 +0800)]
IB/cxgb4: Make _free_qp static to silence build warning
We get 1 warning when build kernel with W=1:
drivers/infiniband/hw/cxgb4/qp.c:686:6: warning: no previous prototype for '_free_qp' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is declared
and don't need a declaration, but can be made static.
so this patch marks it 'static'.
Raju Rangoju [Mon, 29 Aug 2016 11:45:49 +0000 (17:15 +0530)]
IB/isert: Properly release resources on DEVICE_REMOVAL
When the low level driver exercises the hot unplug they would call
rdma_cm cma_remove_one which would fire DEVICE_REMOVAL event to all cma
consumers. Now, if consumer doesn't make sure they destroy all IB
objects created on that IB device instance prior to finalizing all
processing of DEVICE_REMOVAL callback, rdma_cm will let the lld to
de-register with IB core and destroy the IB device instance. And if the
consumer calls (say) ib_dereg_mr(), it will crash since that dev object
is NULL.
In the current implementation, iser-target just initiates the cleanup
and returns from DEVICE_REMOVAL callback. This deferred work creates a
race between iser-target cleaning IB objects(say MR) and lld destroying
IB device instance.
This patch includes the following fixes
-> make sure that consumer frees all IB objects associated with device
instance
-> return non-zero from the callback to destroy the rdma_cm id
The 2nd parameter of 'find_first_bit' is the number of bits to search.
In this case, we are passing 'sizeof(tmp)' which is likely to be 4 or 8
because 'tmp' is an 'unsigned long'.
It is likely that the number of bits of 'tmp' was expected here. So use
BITS_PER_LONG instead.
It has been spotted by the following coccinelle script:
@@
expression ret, x;
@@
* ret = \(find_first_bit \| find_first_zero_bit\) (x, sizeof(...));
Steven Rostedt [Wed, 25 May 2016 17:47:26 +0000 (13:47 -0400)]
x86/paravirt: Do not trace _paravirt_ident_*() functions
Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:
_paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
It was found that this is only an issue when noreplace-paravirt is added
to the kernel command line.
This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.
In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue. But both _paravirt_ident_{32,64}() causes the
following splat when they are traced:
Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Most of this is regression fixes for posix acl behavior introduced in
4.8-rc1 (these were caught by the pjd-fstest suite). The are also
miscellaneous fixes marked as stable material and cleanups.
Other than overlayfs code, it touches <linux/fs.h> to add a constant
with which to disable posix acl caching. No changes needed to the
actual caching code, it automatically does the right thing, although
later we may want to optimize this case.
I'm now testing overlayfs with the following test suites to catch
regressions:
- unionmount-testsuite
- xfstests
- pjd-fstest"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update doc
ovl: listxattr: use strnlen()
ovl: Switch to generic_getxattr
ovl: copyattr after setting POSIX ACL
ovl: Switch to generic_removexattr
ovl: Get rid of ovl_xattr_noacl_handlers array
ovl: Fix OVL_XATTR_PREFIX
ovl: fix spelling mistake: "directries" -> "directories"
ovl: don't cache acl on overlay layer
ovl: use cached acl on underlying layer
ovl: proper cleanup of workdir
ovl: remove posix_acl_default from workdir
ovl: handle umask and posix_acl_default correctly on creation
ovl: don't copy up opaqueness
James Morse [Fri, 26 Aug 2016 15:03:42 +0000 (16:03 +0100)]
arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1
Changes to make the resume from cpu_suspend() code behave more like
secondary boot caused debug exceptions to be unmasked early by
__cpu_setup(). We then go on to restore mdscr_el1 in cpu_do_resume(),
potentially taking break or watch points based on uninitialised registers.
Mask debug exceptions in cpu_do_resume(), which is specific to resume
from cpu_suspend(). Debug exceptions will be restored to their original
state by local_dbg_restore() in cpu_suspend(), which runs after
hw_breakpoint_restore() has re-initialised the other registers.
Stefan Wahren [Sat, 27 Aug 2016 16:19:50 +0000 (16:19 +0000)]
drivers/perf: arm_pmu: Fix NULL pointer dereference during probe
Patch 7f1d642fbb5c ("drivers/perf: arm-pmu: Fix handling of SPI lacking
interrupt-affinity property") unintended also fixes perf_event support
for bcm2835 which doesn't have PMU interrupts. Unfortunately this change
introduce a NULL pointer dereference on bcm2835, because irq_is_percpu
always expected to be called with a valid IRQ. So fix this regression
by validating the IRQ before.
Tested-by: Kevin Hilman <[email protected]> Signed-off-by: Stefan Wahren <[email protected]> Fixes: 7f1d642fbb5c ("drivers/perf: arm-pmu: Fix handling of SPI lacking "interrupt-affinity" property") Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
Merge tag 'dmaengine-fix-4.8-rc5' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"The fixes this time are all in drivers:
- possible NULL dereference in img-mdc
- correct device identity for free_irq in at_xdmac
- missing of_node_put() in fsl probe
- fix debug log and hotchain corner case for pxa-dma
- fix checking hardware bits in isr in usb dmac"
* tag 'dmaengine-fix-4.8-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: img-mdc: fix a possible NULL dereference
dmaengine: at_xdmac: fix to pass correct device identity to free_irq()
dmaengine: fsl_raid: add missing of_node_put() in fsl_re_probe()
dmaengine: pxa_dma: fix debug message
dmaengine: pxa_dma: fix hotchain corner case
dmaengine: usb-dmac: check CHCR.DE bit in usb_dmac_isr_channel()
Merge tag 'drm-fixes-for-4.8-rc5' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Contains fixes for imx, amdgpu, vc4, msm and one nouveau ACPI fix"
* tag 'drm-fixes-for-4.8-rc5' of git://people.freedesktop.org/~airlied/linux:
drm/amdgpu: record error code when ring test failed
drm/amd/amdgpu: compute ring test fail during S4 on CI
drm/amd/amdgpu: sdma resume fail during S4 on CI
drm/nouveau/acpi: use DSM if bridge does not support D3cold
drm/imx: fix crtc vblank state regression
drm/imx: Add active plane reconfiguration support
drm/msm: protect against faults from copy_from_user() in submit ioctl
drm/msm: fix use of copy_from_user() while holding spinlock
drm/vc4: Fix oops when userspace hands in a bad BO.
drm/vc4: Fix overflow mem unreferencing when the binner runs dry.
drm/vc4: Free hang state before destroying BO cache.
drm/vc4: Fix handling of a pm_runtime_get_sync() success case.
drm/vc4: Use drm_malloc_ab to fix large rendering jobs.
drm/vc4: Use drm_free_large() on handles to match its allocation.
Merge tag 'ccn/fixes-for-4.8-v2' of git://git.linaro.org/people/pawel.moll/linux into fixes
Merge "bus: ARM CCN PMU driver updates" from Paweł Moll:
- Fixes and improvements for XP watchpoint and events handling
- Added missing condition checks for KVM-related exclusions
- Improved interrupt affinity handling
- Fix for hrtimer use in polling mode
- Event grouping implementation improvement
* tag 'ccn/fixes-for-4.8-v2' of git://git.linaro.org/people/pawel.moll/linux:
bus: arm-ccn: make event groups reliable
bus: arm-ccn: fix hrtimer registration
bus: arm-ccn: fix PMU interrupt flags
bus: arm-ccn: Add missing event attribute exclusions for host/guest
bus: arm-ccn: Correct required arguments for XP PMU events
bus: arm-ccn: Fix XP watchpoint settings bitmask
bus: arm-ccn: Do not attempt to configure XPs for cycle counter
bus: arm-ccn: Fix PMU handling of MN
- ioctl(SNDRV_TIMER_IOCTL_SELECT), which potentially sets
tu->queue/tu->tqueue to NULL on memory allocation failure, so read()
would get a NULL pointer dereference like the above splat
- the same ioctl() can free tu->queue/to->tqueue which means read()
could potentially see (and dereference) the freed pointer
We can fix both by taking the ioctl_lock mutex when dereferencing
->queue/->tqueue, since that's always held over all the ioctl() code.
Just looking at the code I find it likely that there are more problems
here such as tu->qhead pointing outside the buffer if the size is
changed concurrently using SNDRV_TIMER_IOCTL_PARAMS.
Wanpeng Li [Fri, 2 Sep 2016 06:38:23 +0000 (14:38 +0800)]
tick/nohz: Fix softlockup on scheduler stalls in kvm guest
tick_nohz_start_idle() is prevented to be called if the idle tick can't
be stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle
enter"). As a result, after suspend/resume the host machine, full dynticks
kvm guest will softlockup:
The idle and iowait states are weird 0 for cpu0(housekeeping).
The bug is present in both guest and host kernels, and they both have
cpu0's idle and iowait states issue, however, host kernel's suspend/resume
path etc will touch watchdog to avoid the softlockup.
- The watchdog will not be touched in tick_nohz_stop_idle path (need be
touched since the scheduler stall is expected) if idle_active flags are
not detected.
- The idle and iowait states will not be accounted when exit idle loop
(resched or interrupt) if idle start time and idle_active flags are
not set.
This patch fixes it by reverting commit 1f3b0f8243cb934 since can't stop
idle tick doesn't mean can't be idle.
Wolfram Sang [Tue, 30 Aug 2016 19:50:22 +0000 (21:50 +0200)]
ARM: shmobile: fix regulator quirk for Gen2
The current implementation only works if the da9xxx devices are added
before their drivers are registered. Only then it can apply the fixes to
both devices. Otherwise, the driver for the first device gets probed
before the fix for the second device can be applied. This is what
fails when using the IP core switcher or when having the i2c master
driver as a module.
So, we need to disable both da9xxx once we detected one of them. We now
use i2c_transfer with hardcoded i2c_messages and device addresses, so we
don't need the da9xxx client devices to be instantiated. Because the
fixup is used on specific boards only, the addresses are not going to
change.
Eli Cooper [Fri, 26 Aug 2016 15:52:29 +0000 (23:52 +0800)]
ipv6: Don't unset flowi6_proto in ipxip6_tnl_xmit()
Commit 8eb30be0352d0916 ("ipv6: Create ip6_tnl_xmit") unsets
flowi6_proto in ip4ip6_tnl_xmit() and ip6ip6_tnl_xmit().
Since xfrm_selector_match() relies on this info, IPv6 packets
sent by an ip6tunnel cannot be properly selected by their
protocols after removing it. This patch puts flowi6_proto back.
Dave Airlie [Fri, 2 Sep 2016 05:55:15 +0000 (15:55 +1000)]
Merge tag 'drm-vc4-fixes-2016-08-29' of https://github.com/anholt/linux into drm-fixes
This pull request brings in fixes for VC4 3D in 4.8, most of which are
covered by testcases.
* tag 'drm-vc4-fixes-2016-08-29' of https://github.com/anholt/linux:
drm/vc4: Fix oops when userspace hands in a bad BO.
drm/vc4: Fix overflow mem unreferencing when the binner runs dry.
drm/vc4: Free hang state before destroying BO cache.
drm/vc4: Fix handling of a pm_runtime_get_sync() success case.
drm/vc4: Use drm_malloc_ab to fix large rendering jobs.
drm/vc4: Use drm_free_large() on handles to match its allocation.
bnx2x: don't reset chip on cleanup if PCI function is offline
When PCI error is detected, in some architectures (like PowerPC) a slot
reset is performed - the driver's error handlers are in charge of "disable"
device before the reset, and re-enable it after a successful slot reset.
There are two cases though that another path is taken on the code: if the
slot reset is not successful or if too many errors already happened in the
specific adapter (meaning that possibly the device is experiencing a HW
failure that slot reset is not able to solve), the core PCI error mechanism
(called EEH in PowerPC) will remove the adapter from the system, since it
will consider this as a permanent failure on device. In this case, a path
is taken that leads to bnx2x_chip_cleanup() calling bnx2x_reset_hw(), which
then tries to perform a HW reset on chip. This reset won't succeed since
the HW is in a fault state, which can be seen by multiple messages on
kernel log like below:
bnx2x: [bnx2x_issue_dmae_with_comp:552(eth1)]DMAE timeout!
bnx2x: [bnx2x_write_dmae:600(eth1)]DMAE returned failure -1
After some time, the PCI error mechanism gives up on waiting the driver's
correct removal procedure and forcibly remove the adapter from the system.
We can see soft lockup while core PCI error mechanism is waiting for driver
to accomplish the right removal process.
This patch adds a verification to avoid a chip reset whenever the function
is in PCI error state - since this case is only reached when we have a
device being removed because of a permanent failure, the HW chip reset is
not expected to work fine neither is necessary.
Also, as a minor improvement in error path, we avoid the MCP information dump
in case of non-recoverable PCI error (when adapter is about to be removed),
since it will certainly fail.
Dave Airlie [Fri, 2 Sep 2016 05:48:38 +0000 (15:48 +1000)]
Merge tag 'imx-drm-fixes-2016-08-30' of git://git.pengutronix.de/git/pza/linux into drm-fixes
imx-drm atomic modeset regression fixes
- add active plane reconfiguration support
- add back crtc vblank state reporting
* tag 'imx-drm-fixes-2016-08-30' of git://git.pengutronix.de/git/pza/linux:
drm/imx: fix crtc vblank state regression
drm/imx: Add active plane reconfiguration support
Gao Feng [Wed, 31 Aug 2016 06:15:05 +0000 (14:15 +0800)]
rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
The original codes depend on that the function parameters are evaluated from
left to right. But the parameter's evaluation order is not defined in C
standard actually.
When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys,
hashrnd) with some compilers or environment, the keys passed to
flow_keys_have_l4 is not initialized.
Fixes: 6db61d79c1e1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash") Acked-by: Eric Dumazet <[email protected]> Signed-off-by: Gao Feng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A collection of small fixes for various SoC vendor clk drivers"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: rockchip: mark aclk_emmc_noc as a critical clock on rk3399
clk: tegra: remove TEGRA_PLL_USE_LOCK for PLLD/PLLD2
clk: rockchip: fix incorrect GATE bits for {c, g}pll_aclk_perihp_src on rk3399
clk: rockchip: fix incorrect aclk_emmc source gate bits on rk3399
clk: renesas: r8a7795: Fix SD clocks
clk: rockchip: fix rk3399 aclk_vio gate bit
clk: sunxi-ng: Fix inverted test condition in ccu_helper_wait_for_lock
* emailed patches from Andrew Morton <[email protected]>:
rapidio/tsi721: fix incorrect detection of address translation condition
rapidio/documentation/mport_cdev: add missing parameter description
kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
MAINTAINERS: Vladimir has moved
mm, mempolicy: task->mempolicy must be NULL before dropping final reference
printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
lib/test_hash.c: fix warning in preprocessor symbol evaluation
lib/test_hash.c: fix warning in two-dimensional array init
kconfig: tinyconfig: provide whole choice blocks to avoid warnings
kexec: fix double-free when failing to relocate the purgatory
mm, oom: prevent premature OOM killer invocation for high order request
Michal Hocko [Thu, 1 Sep 2016 23:15:13 +0000 (16:15 -0700)]
kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted. Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).
The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):
: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls. This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping. This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.
The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for. It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.
David Rientjes [Thu, 1 Sep 2016 23:15:07 +0000 (16:15 -0700)]
mm, mempolicy: task->mempolicy must be NULL before dropping final reference
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions. It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:
BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current <-- use after free
depot_save_stack
save_stack
kasan_slab_free
kmem_cache_free
__mpol_put <-- free
do_exit
This patch sets current->mempolicy to NULL before dropping the final
reference.
printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
__printk_nmi_flush() can be called from nmi_panic(), therefore it has to
test whether it's executed in NMI context and thus must route the
messages through deferred printk() or via direct printk().
This is to avoid potential deadlocks, as described in commit cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic").
However there remain two places where __printk_nmi_flush() does
unconditional direct printk() calls:
Factor out print_nmi_seq_line() parts into a new printk_nmi_flush_line()
function, which takes care of in_nmi(), and use it in
__printk_nmi_flush() for printing and error-reporting.
mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported
that multiple nodes may have no memory managed by the buddy allocator
but still return true for populated_zone().
Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") was reported to cause kswapd to spin at 100% CPU usage when
fadump was enabled. The old code happened to deal with the situation of
a populated node with zero free pages by co-incidence but the current
code tries to reclaim populated zones without realising that is
impossible.
We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.
lib/test_hash.c: fix warning in preprocessor symbol evaluation
Some versions of gcc don't like tests for the value of an undefined
preprocessor symbol, even in the #else branch of an #ifndef:
lib/test_hash.c:224:7: warning: "HAVE_ARCH__HASH_32" is not defined [-Wundef]
#elif HAVE_ARCH__HASH_32 != 1
^
lib/test_hash.c:229:7: warning: "HAVE_ARCH_HASH_32" is not defined [-Wundef]
#elif HAVE_ARCH_HASH_32 != 1
^
lib/test_hash.c:234:7: warning: "HAVE_ARCH_HASH_64" is not defined [-Wundef]
#elif HAVE_ARCH_HASH_64 != 1
^
Seen with gcc 4.9, not seen with 4.1.2.
Change the logic to only check the value inside an #ifdef to fix this.
kconfig: tinyconfig: provide whole choice blocks to avoid warnings
Using "make tinyconfig" produces a couple of annoying warnings that show
up for build test machines all the time:
.config:966:warning: override: NOHIGHMEM changes choice state
.config:965:warning: override: SLOB changes choice state
.config:963:warning: override: KERNEL_XZ changes choice state
.config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
.config:933:warning: override: SLOB changes choice state
.config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
.config:870:warning: override: SLOB changes choice state
.config:868:warning: override: KERNEL_XZ changes choice state
.config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
I've made a previous attempt at fixing them and we discussed a number of
alternatives.
I tried changing the Makefile to use "merge_config.sh -n
$(fragment-list)" but couldn't get that to work properly.
This is yet another approach, based on the observation that we do want
to see a warning for conflicting 'choice' options, and that we can
simply make them non-conflicting by listing all other options as
disabled. This is a trivial patch that we can apply independent of
plans for other changes.
kexec: fix double-free when failing to relocate the purgatory
If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs
and pi->purgatory_buf. This is redundant, because in case of error
kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which
will also free those buffers.
This causes two warnings like the following, one for pi->sechdrs and the
other for pi->purgatory_buf:
kexec-bzImage64: Loading purgatory failed
------------[ cut here ]------------
WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
Trying to vfree() nonexistent vm area (ffffc90000e91000)
Modules linked in:
CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x4d/0x65
__warn+0xcb/0xf0
warn_slowpath_fmt+0x4f/0x60
? find_vmap_area+0x19/0x70
? kimage_file_post_load_cleanup+0x47/0xb0
__vunmap+0xc1/0xd0
vfree+0x2e/0x70
kimage_file_post_load_cleanup+0x5e/0xb0
SyS_kexec_file_load+0x448/0x680
? putname+0x54/0x60
? do_sys_open+0x190/0x1f0
entry_SYSCALL_64_fastpath+0x13/0x8f
---[ end trace 158bb74f5950ca2b ]---
Fix by setting pi->sechdrs an pi->purgatory_buf to NULL, since vfree
won't try to free a NULL pointer.
Michal Hocko [Thu, 1 Sep 2016 23:14:41 +0000 (16:14 -0700)]
mm, oom: prevent premature OOM killer invocation for high order request
There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation request (for the kernel stack)
invoked OOM killer even during basic workloads (light IO or even kernel
compile on some filesystems). In all reported cases the memory is
fragmented and there are no order-2+ pages available. There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction. Multiple reporters have confirmed
that the current linux-next which includes [1] and [2] helped and OOMs
are not reproducible anymore.
A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress and
we are not getting OOM for order-0 pages. We already do that for
CONFING_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.
Neal Cardwell [Tue, 30 Aug 2016 15:55:23 +0000 (11:55 -0400)]
tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Yuchung noticed that on the first TFO server data packet sent after
the (TFO) handshake, the server echoed the TCP timestamp value in the
SYN/data instead of the timestamp value in the final ACK of the
handshake. This problem did not happen on regular opens.
The tcp_replace_ts_recent() logic that decides whether to remember an
incoming TS value needs tp->rcv_wup to hold the latest receive
sequence number that we have ACKed (latest tp->rcv_nxt we have
ACKed). This commit fixes this issue by ensuring that a TFO server
properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends
a SYN/ACK for the SYN/data.
net: bridge: don't increment tx_dropped in br_do_proxy_arp
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.
Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
"Two small patches to fix some bugs with the audit-by-executable
functionality we introduced back in v4.3 (both patches are marked
for the stable folks)"
* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
audit: fix exe_file access in audit_exe_compare
mm: introduce get_task_exe_file
Merge tag 'xfs-iomap-for-linus-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs and iomap fixes from Dave Chinner:
"Most of these changes are small regression fixes that address problems
introduced in the 4.8-rc1 window. The two fixes that aren't (IO
completion fix and superblock inprogress check) are fixes for problems
introduced some time ago and need to be pushed back to stable kernels.
* tag 'xfs-iomap-for-linus-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track log done items directly in the deferred pending work item
iomap: don't set FIEMAP_EXTENT_MERGED for extent based filesystems
xfs: prevent dropping ioend completions during buftarg wait
xfs: fix superblock inprogress check
xfs: simple btree query range should look right if LE lookup fails
xfs: fix some key handling problems in _btree_simple_query_range
xfs: don't log the entire end of the AGF
xfs: disallow mounting of realtime + rmap filesystems
xfs: don't perform lookups on zero-height btrees
Nicolas Dichtel [Tue, 30 Aug 2016 08:09:21 +0000 (10:09 +0200)]
ipv6: add missing netconf notif when 'all' is updated
The 'default' value was not advertised.
Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Sunil Goutham [Tue, 30 Aug 2016 06:06:27 +0000 (11:36 +0530)]
net: thunderx: Fix for issues with multiple CQEs posted for a TSO packet
On ThunderX 88xx pass 2.x chips when TSO is offloaded to HW,
HW posts a CQE for every TSO segment transmitted. Current code
does handles this, but is prone to issues when segment sizes are
small resulting in SW processing too many CQEs and also at times
frees a SKB which is not yet transmitted.
This patch handles the errata in a different way and eliminates issues
with earlier approach, TSO packet is submitted to HW with post_cqe=0,
so that no CQE is posted upon completion of transmission of TSO packet
but a additional HDR + IMMEDIATE descriptors are added to SQ due to
which a CQE is posted and will have required info to be used while
cleanup in napi. This way only one CQE is posted for a TSO packet.
Sunil Goutham [Tue, 30 Aug 2016 06:06:26 +0000 (11:36 +0530)]
net: thunderx: Fix for HW issue while padding TSO packet
There is a issue in HW where-in while sending GSO sized pkts
as part of TSO, if pkt len falls below configured min packet
size i.e 60, NIC will zero PAD packet and also updates IP total length.
Hence set this value to lessthan min pkt size of MAC + IP + TCP
headers, BGX will anyway do the padding to transmit 64 byte pkt
including FCS.
Dave Ertman [Tue, 30 Aug 2016 00:38:26 +0000 (17:38 -0700)]
i40e: Fix kernel panic on enable/disable LLDP
If DCB is configured on the link partner switch with an
unsupported traffic class configuration (e.g. non-contiguous TCs),
the driver is flagging DCB as disabled. But, for future DCB
LLDPDUs, the driver was checking if the interface was DCB capable
instead of enabled. This was causing a kernel panic when LLDP
was enabled/disabled on the link partner switch.
This patch corrects the situation by having the LLDP event handler
check the correct flag in the pf structure. It also cleans up the
setting and clearing of the enabled flag for other checks.
tipc: fix random link resets while adding a second bearer
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.
When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.
When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).
Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.
In this commit, we adjust the size of name distributor message so that
they can be tunnelled.
Ivan Vecera [Thu, 1 Sep 2016 09:28:59 +0000 (11:28 +0200)]
tg3: Fix for disallow tx coalescing time to be 0
The recent commit 087d7a8c9174 "tg3: Fix for diasllow rx coalescing
time to be 0" disallow to set Rx coalescing time to be 0 as this stops
generating interrupts for the incoming packets. I found the zero
Tx coalescing time stops generating interrupts for outgoing packets
as well and fires Tx watchdog later. To avoid this, don't allow to set
Tx coalescing time to 0 and also remove subsequent checks that become
senseless.
drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1230:13-20: WARNING: kzalloc should be used for dcbx_info, instead of kmalloc/memset
drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1192:13-20: WARNING: kzalloc should be used for dcbx_info, instead of kmalloc/memset
Use kzalloc rather than kmalloc followed by memset with 0
This considers some simple cases that are common and easy to validate
Note in particular that there are no ...s in the rule, so all of the
matched code has to be contiguous
mlxsw: spectrum: Use existing flood setup when adding VLANs
When a VLAN is added on a bridge port we should use the existing unicast
flood configuration of the port instead of assuming it's enabled.
Fixes: 0293038e0c36 ("mlxsw: spectrum: Add support for flood control") Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
mlxsw: spectrum: Don't take multiple references on a FID
In commit 14d39461b3f4 ("mlxsw: spectrum: Use per-FID struct for the
VLAN-aware bridge") I added a per-FID struct, which member ports can
take a reference on upon VLAN membership configuration.
However, sometimes only the VLAN flags (e.g. egress untagged) are
toggled without changing the VLAN membership. In these cases we
shouldn't take another reference on the FID.
Fixes: 14d39461b3f4 ("mlxsw: spectrum: Use per-FID struct for the VLAN-aware bridge") Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
mlxsw: spectrum: Fix error path in mlxsw_sp_module_init
Add forgotten notifier unregister.
Fixes: 99724c18fc66 ("mlxsw: spectrum: Introduce support for router interfaces") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Originally, I expected that there would be needed to call update
operation in case RALUE record action is changed. However, that is not
needed since write operation takes care of that nicely. Remove prepared
construct and always call the write operation.
Fixes: 61c503f976b5 ("mlxsw: spectrum_router: Implement fib4 add/del switchdev obj ops") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
mlxsw: spectrum_router: Fix failure caused by double fib removal from HW
In mlxsw we squash tables 254 and 255 together into HW. Kernel adds/dels
/32 ip to/from both 254 and 255. On del path, that causes the same
prefix being removed twice. Fix this by introducing reference counting
for private mlxsw fib entries. That required a bit of code reshuffle.
Also put dev into fib entry key so the same prefix could be represented
once per every router interface.
Fixes: 61c503f976b5 ("mlxsw: spectrum_router: Implement fib4 add/del switchdev obj ops") Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Wang Xiaoguang [Wed, 31 Aug 2016 11:46:16 +0000 (19:46 +0800)]
btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
If can_overcommit() in btrfs_calc_reclaim_metadata_size() returns true,
btrfs_async_reclaim_metadata_space() will not reclaim metadata space, just
return directly and also forget to wake up process which are waiting for
their tickets, so these processes will wait endlessly.
Fstests case generic/172 with mount option "-o compress=lzo" have revealed
this bug in my test machine. Here if we have tickets to handle, we must
handle them first.
Liu Bo [Wed, 31 Aug 2016 23:43:33 +0000 (16:43 -0700)]
Btrfs: fix endless loop in balancing block groups
Qgroup function may overwrite the saved error 'err' with 0
in case quota is not enabled, and this ends up with a
endless loop in balance because we keep going back to balance
the same block group.
Josef Bacik [Wed, 24 Aug 2016 15:57:52 +0000 (11:57 -0400)]
Btrfs: kill invalid ASSERT() in process_all_refs()
Suppose you have the following tree in snap1 on a file system mounted with -o
inode_cache so that inode numbers are recycled
└── [ 258] a
└── [ 257] b
and then you remove b, rename a to c, and then re-create b in c so you have the
following tree
└── [ 258] c
└── [ 257] b
and then you try to do an incremental send you will hit
ASSERT(pending_move == 0);
in process_all_refs(). This is because we assume that any recycling of inodes
will not have a pending change in our path, which isn't the case. This is the
case for the DELETE side, since we want to remove the old file using the old
path, but on the create side we could have a pending move and need to do the
normal pending rename dance. So remove this ASSERT() and put a comment about
why we ignore pending_move. Thanks,
Kalle Valo [Thu, 1 Sep 2016 14:11:42 +0000 (17:11 +0300)]
Merge tag 'iwlwifi-for-kalle-2016-08-29' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes
* Fix P2P dump trigger
* Prevent a potential null dereference in iwlmvm
* Prevent an uninitialized value from being returned in iwlmvm
* Advertise support for channel width change in AP mode
PCI: Mark Haswell Power Control Unit as having non-compliant BARs
The Haswell Power Control Unit has a non-PCI register (CONFIG_TDP_NOMINAL)
where BAR 0 is supposed to be. This is erratum HSE43 in the spec update
referenced below:
The PCIe* Base Specification indicates that Configuration Space Headers
have a base address register at offset 0x10. Due to this erratum, the
Power Control Unit's CONFIG_TDP_NOMINAL CSR (Bus 1; Device 30; Function
3; Offset 0x10) is located where a base register is expected.
Mark the PCU as having non-compliant BARs so we don't try to probe any of
them. There are no other BARs on this device.
Be defensive about what underlying fs provides us in the returned xattr
list buffer. If it's not properly null terminated, bail out with a warning
insead of BUG.
Commit d837a49bd57f ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well. As far as permission
checking goes, the same rules should apply in either case.
While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.
Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.
Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
in ovl_xattr_handlers, like the other filesystems do. Flag the code
that is now only used conditionally with __maybe_unused.
Some operations (setxattr/chmod) can make the cached acl stale. We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether. Preventing caching has the following
advantages:
- no double caching, less memory used
- overlay cache doesn't go stale when fs clears it's own cache
Possible disadvantage is performance loss. If that becomes a problem
get_acl() can be optimized for overlayfs.
This patch disables caching by pre setting i_*acl to a value that
- has bit 0 set, so is_uncached_acl() will return true
- is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it
When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.
Previously the mount code removed this directory if it already existed and
created a new one. If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.
While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.
In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.
This patch implements cleaning out any files left in workdir. It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".
ovl: handle umask and posix_acl_default correctly on creation
Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.
Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).
Fix these problems by:
a) If upper fs does not have MS_POSIXACL, then mask mode with umask.
b) If creating over a whiteout, call posix_acl_create() to get the
inherited acls. After creation (but before moving to the final
destination) set these acls on the created file. posix_acl_create() also
updates the file creation mode as appropriate.
This patch takes care of clearing the uninitialized buffer before using it.
1. pfc pri-enable bitmap need to be cleared before setting the requested
enable bits. Without this, the un-touched values will be merged with
requested values and sent to MFW.
2. The data in app-entry field need to be cleared before using it.
3. Clear the output data buffer used in qed_dcbx_query_params().
qed: Set selection-field while configuring the app entry in ieee mode.
Management firmware requires the selection-field (SF) to be set for
configuring the application/protocol entry in IEEE mode. Without this
setting, the app entry will be configured incorrectly in MFW.
This is because on the error path, after we install
the new socket file, we call sock_release() to clean
up the socket, which leaves the fd pointing to a freed
socket. Fix this by calling sys_close() on that fd
directly.
David S. Miller [Thu, 1 Sep 2016 03:53:49 +0000 (20:53 -0700)]
Merge branch 'mediatek-fixes'
Sean Wang says:
====================
net: ethernet: mediatek: a couple of fixes
a couple of fixes come out from integrating with linux-4.8 rc1
they all are verified and workable on linux-4.8 rc1
Changes since v1:
- usage of loops to work out if all required clock are ready instead
of tedious coding
- remove redundant pinctrl setup that is already done by core driver
thanks for careful and patient reviewing by Andrew Lunn
- splitting distinct changes into the separate patches
- change variable naming from err to ret for readable coding
Changes since v2:
- restore to original clock disabling sequence that is changed
accidentally in the last version
- refine the commit log that would cause misunderstanding what has
been done in the changes
- refine the commit log that would cause footnote losing due to
improper delimiter use
Changes since v3:
- fix git rejects caused by mixing a change from net-next, so
remake the patch set based on the current net branch again.
====================
Sean Wang [Thu, 1 Sep 2016 02:47:34 +0000 (10:47 +0800)]
net: ethernet: mediatek: use devm_mdiobus_alloc instead of mdiobus_alloc inside mtk_mdio_init
a lot of parts in the driver uses devm_* APIs to gain benefits from the
device resource management, so devm_mdiobus_alloc is also used instead
of mdiobus_alloc to have more elegant code flow.
Using common code provided by the devm_* helps to
1) have simplified the code flow as [1] says
2) decrease the risk of incorrect error handling by human
3) only a few drivers used it since it was proposed on linux 3.16,
so just hope to promote for this.
Sean Wang [Thu, 1 Sep 2016 02:47:30 +0000 (10:47 +0800)]
net: ethernet: mediatek: remove redundant free_irq for devm_request_irq allocated irq
these irqs are not used for shared irq and disabled during ethernet stops.
irq requested by devm_request_irq is safe to be freed automatically on
driver detach.
Sean Wang [Thu, 1 Sep 2016 02:47:28 +0000 (10:47 +0800)]
net: ethernet: mediatek: fix incorrect return value of devm_clk_get with EPROBE_DEFER
1) If the return value of devm_clk_get is EPROBE_DEFER, we should
defer probing the driver. The change is verified and works based
on 4.8-rc1 staying with the latest clk-next code for MT7623.
2) Changing with the usage of loops to work out if all clocks
required are fine
Sean Wang [Thu, 1 Sep 2016 02:47:27 +0000 (10:47 +0800)]
net: ethernet: mediatek: fix fails from TX housekeeping due to incorrect port setup
which net device the SKB is complete for depends on the forward port
on txd4 on the corresponding TX descriptor, but the information isn't
set up well in case of SKB fragments that would lead to watchdog timeout
from the upper layer, so fix it up.
Dave Airlie [Wed, 31 Aug 2016 20:34:09 +0000 (06:34 +1000)]
Merge branch 'msm-fixes-4.8' of git://people.freedesktop.org/~robclark/linux into drm-fixes
copy from user fixes.
* 'msm-fixes-4.8' of git://people.freedesktop.org/~robclark/linux:
drm/msm: protect against faults from copy_from_user() in submit ioctl
drm/msm: fix use of copy_from_user() while holding spinlock