IB/mlx4: Do not attempt to report HCA clock offset on VFs
mlx4 VFs can provide CQE raw time-stamping services, but they
don't have the hca core clock mapped to their PCI bars.
As such, we should not attempt to query and report the clock offset
to user space for VFs. Doing so causes query_device over VFs to fail
with -ENOSUPP.
Fixes: 4b664c4355b2 ('IB/mlx4: Add support for CQ time-stamping')
Signed-off-by: Matan Barak <[email protected]>
Signed-off-by: Or Gerlitz <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Erez Shitrit [Thu, 25 Jun 2015 14:13:22 +0000 (17:13 +0300)]
IB/cm: Do not queue work to a device that's going away
Whenever ib_cm gets remove_one call, like when there is a hot-unplug
event, the driver should mark itself as going_down and confirm that no
new work is going to be queued for that device.
So, the order of the actions is:
1. mark the going_down bit.
2. flush the wq.
3. [make sure no new work is queued for that device.]
4. unregister the mad agent.
Otherwise, work items that are already queued can be scheduled after the mad
agent has been freed.
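A rough sketch of that teardown order; the names here (cm_dev->going_down, the
global cm.lock and cm.wq) are illustrative of the description above, not
necessarily the exact code:
	/* sketch: mark the device as going down before flushing queued work */
	spin_lock_irq(&cm.lock);
	cm_dev->going_down = 1;
	spin_unlock_irq(&cm.lock);

	/* nothing queues new work once going_down is observed, so after the
	 * flush no work item can touch the MAD agent anymore */
	flush_workqueue(cm.wq);
	ib_unregister_mad_agent(port->mad_agent);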
Vaishali Thakkar [Wed, 24 Jun 2015 04:42:13 +0000 (10:12 +0530)]
IB/srpt: Convert use of __constant_cpu_to_beXX to cpu_to_beXX
In little endian cases, the macros cpu_to_be{16,32,64} unfold to
__swab{16,32,64}, which provide a special case for constants. In
big endian cases, __constant_cpu_to_be{16,32,64} and
cpu_to_be{16,32,64} expand directly to the same expression. So,
replace __constant_cpu_to_be{16,32,64} with cpu_to_be{16,32,64},
with the goal of getting rid of the definitions of
__constant_cpu_to_be{16,32,64} completely.
The transformation was performed with a Coccinelle semantic patch.
Hal Rosenstock [Mon, 29 Jun 2015 13:57:00 +0000 (09:57 -0400)]
IB: Add rdma_cap_ib_switch helper and use where appropriate
Pursuant to Liran's comments on node_type on linux-rdma
mailing list:
In an effort to reform the RDMA core and ULPs to minimize use of
node_type in struct ib_device, an additional bit is added to
struct ib_device for is_switch (IB switch). This is needed
to be initialized by any IB switch device driver. This is a
NEW requirement on such device drivers which are all
"out of tree".
In addition, an ib_switch helper was added to ib_verbs.h
based on the is_switch device bit rather than node_type
(although those should be consistent).
The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
as well as (IPoIB and SRP) ULPs are updated where
appropriate to use this new helper. In some cases,
the helper is now used under the covers of using
rdma_[start end]_port rather than the open coding
previously used.
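The helper itself is tiny; roughly, as added to ib_verbs.h:
	static inline bool rdma_cap_ib_switch(const struct ib_device *device)
	{
		return device->is_switch;
	}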
Filipe Manana [Tue, 14 Jul 2015 15:09:39 +0000 (16:09 +0100)]
Btrfs: fix file corruption after cloning inline extents
Using the clone ioctl (or the extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing an inline extent to be copied from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equal to 0.
For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test files. File foo has the same 2K of data at offset 4K
# as file bar has at its offset 0.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
-c "pwrite -S 0xbb 4k 2K" \
-c "pwrite -S 0xcc 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# File bar consists of a single inline extent (2K size).
$XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
$SCRATCH_MNT/bar | _filter_xfs_io
# Now call the clone ioctl to clone the extent of file bar into file
# foo at its offset 4K. This made file foo have an inline extent at
# offset 4K, something which the btrfs code can not deal with in future
# IO operations because all inline extents are supposed to start at an
# offset of 0, resulting in all sorts of chaos.
# So here we validate that clone ioctl returns an EOPNOTSUPP, which is
# what it returns for other cases dealing with inlined extents.
$CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/foo
# Because of the inline extent at offset 4K, the following write made
# the kernel crash with a BUG_ON().
$XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
status=0
exit
The stack trace of the BUG_ON() triggered by the last write is:
Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:
00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split")
3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents")
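Conceptually the fix is an early bail-out in the clone path; a hedged sketch
(variable names are illustrative, the real check in the btrfs clone code
covers a few more conditions):
	/* illustrative only: inline extents may only live at file offset 0 */
	if (type == BTRFS_FILE_EXTENT_INLINE && destoff > 0) {
		ret = -EOPNOTSUPP;
		goto out;
	}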
Paul Burton [Fri, 10 Jul 2015 15:00:24 +0000 (16:00 +0100)]
MIPS: Require O32 FP64 support for MIPS64 with O32 compat
MIPS32r6 code requires FP64 (ie. FR=1) support. Building a kernel with
support for MIPS32r6 binaries but without support for O32 with FP64 is
therefore a problem which can lead to incorrectly executed userland.
CONFIG_MIPS_O32_FP64_SUPPORT is already selected when the kernel is
configured for MIPS32r6, but not when the kernel is configured for
MIPS64r6 with O32 compat support. Select CONFIG_MIPS_O32_FP64_SUPPORT in
such configurations to prevent building kernels which execute MIPS32r6
userland incorrectly.
ALSA: line6: Fix -EBUSY error during active monitoring
When a monitor stream is active, the next PCM stream access results in
an -EBUSY error because of the check in line6_stream_start(). Fix this by
simply skipping the submission of pending URBs when the stream is
already running.
It breaks existing old userspace which doesn't handle UNKNOWN
swizzling correctly. Yes, UNKNOWN was a thing back in 2009 and probably
still is on some other platforms, but it still pretty clearly broke
the tester's machine. If we want this we need to extend the ioctl with
new parameters that only new userspace looks at.
Olof Johansson [Tue, 14 Jul 2015 09:16:55 +0000 (11:16 +0200)]
Merge tag 'socfpga_fixes_for_v4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux into fixes
Merge "SoCFPGA fixes for v4.2-rc1" from Dinh Nguyen:
SoCFPGA fixes against v4.2-rc1
- Update the "adxl345x" compatible string
- Alphabetize the DTS nodes for the C5 sockit board file
* tag 'socfpga_fixes_for_v4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
ARM: socfpga: dts: Fix entries order
ARM: socfpga: dts: Fix adxl34x formating and compatible string
Ux500 is regressing due to commit a21763a0b1e5a5ab8310f581886d04beadc16616
"pinctrl: nomadik: activate strict mux mode" which disallows
Nomadik GPIO 5 to be muxed in as a level shifter voltage select
pin, as it is currently described as being used for RX on UART1.
The behaviour is correct; instead, the hardware config has been
incorrectly specified: UART1 is indeed unused on HREFv60plus and
Snowball and that is why HREFv60plus can use the pins it would
normally occupy as the voltage select line for the MMC/SD
levelshifter (Snowball uses it for I2C4).
The reason UART1 was enabled anyway on these platforms was
probably to secure the port enumeration to userspace. This
can be solved by using aliases (done in a separate patch) so
we can now deactivate UART1 and let MMC/SD use it properly
on HREFv60plus. We explicitly activate it only for the
older HREFprev60 board.
To complete, we set up the pin configuration for these pins
properly in the sdi0 node.
This enumerates the PL011 serial ports on the Ux500. This is
necessary to do if we want to remove one of the serial ports,
since userspace depends on the console being present on ttyAMA2
and we must not break userspace.
drm/i915: Forward all core DRM ioctls to core compat handling
Previously only core DRM ioctls under the DRM_COMMAND_BASE were being
forwarded, but the drm.h header suggests (and reality confirms) ones
after (and including) DRM_COMMAND_END should be forwarded as well.
We need this to correctly forward the compat ioctl for the botched-up
addfb2.1 extension.
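A hedged sketch of what that forwarding looks like in the compat entry point
(the exact function and surrounding code may differ):
	unsigned int nr = DRM_IOCTL_NR(cmd);

	/* sketch: core ioctls, including nr >= DRM_COMMAND_END, go to the
	 * DRM core compat handler instead of being rejected */
	if (nr < DRM_COMMAND_BASE || nr >= DRM_COMMAND_END)
		return drm_compat_ioctl(filp, cmd, arg);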
Arun Bharadwaj [Thu, 4 Jun 2015 18:08:19 +0000 (11:08 -0700)]
ARM: dts: Fix frequency scaling on Gumstix Pepper
The device tree for Gumstix Pepper has DCDC2 and
DCDC3 correctly labelled but the upper limit values
are wrong. The confusion is due to the hardware
quirk where the DCDC2 and DCDC3 wires are flipped
in Pepper.
Adam YH Lee [Thu, 9 Jul 2015 22:37:57 +0000 (15:37 -0700)]
ARM: dts: configure regulators for Gumstix Pepper
The boot process halts midway because some of the necessary voltage
regulators are deemed unused and subsequently powered off, leading to
a completely unresponsive system.
Most of the device nodes had correct voltage regulator attachments.
Yet these nodes had to set stricter enforcement on them through
'regulator-boot-on' and 'regulator-always-on' to function correctly.
The consumers of the regulators this commit affects are the following:
DCDC1: vdd_1v8 system supply, USB-PHY, and ADC
DCDC2: Core domain
DCDC3: MPU core domain
LDO1: RTC
LDO2: 3v3 IO domain
LDO3: USB-PHY; not a boot-time requirement
LDO4: LCD [16:23]
All but LDO3 need to be always-on for the system to be functional.
Additionally regulator-name properties have been added for the kernel to
display the name from the schematic. This will improve diagnostics.
Adam YH Lee [Fri, 12 Jun 2015 20:37:22 +0000 (13:37 -0700)]
ARM: dts: omap3: overo: Update LCD panel names
For Gumstix Overo COMs, the u-boot bootloader typically passes an
argument specifying the default display via the omapdss.def_disp
parameter. When a default display is specified, DSS2 tries to match
this name with either the device tree label (e.g. label=dvi) or,
failing this, the device tree alias (e.g. label=display0). Update the
panel names for the 'lcd43' and 'lcd35' displays in the device tree
such that they match the names passed by u-boot.
Jason Gunthorpe [Tue, 30 Jun 2015 19:15:31 +0000 (13:15 -0600)]
tpm: Fix initialization of the cdev
When a cdev is contained in a dynamic structure the cdev parent kobj
should be set to the kobj that controls the lifetime of the enclosing
structure. In TPM's case this is the embedded struct device.
Also, cdev_init() zeroes the whole structure, so all field assignments must
come after it, not before. This fixes module ref counting and the cdev setup.
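A hedged sketch of the resulting order in the TPM chip setup (tpm_fops and the
owner assignment are placeholders for the real ones):
	cdev_init(&chip->cdev, &tpm_fops);		/* zeroes the whole cdev... */
	chip->cdev.owner = THIS_MODULE;			/* ...so assignments go after it */
	chip->cdev.kobj.parent = &chip->dev.kobj;	/* lifetime follows the device */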
Thomas Gleixner [Mon, 13 Jul 2015 21:22:44 +0000 (23:22 +0200)]
gpio/davinci: Fix race in installing chained irq handler
Fix a race where a pending interrupt could be received and the handler
called before the handler's data has been setup, by converting to
irq_set_chained_handler_and_data().
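The conversion boils down to a single call that installs the handler and its
data atomically; a sketch with illustrative names:
	/* instead of irq_set_handler_data() followed by
	 * irq_set_chained_handler(), which leaves a window where the handler
	 * can run before its data is installed */
	irq_set_chained_handler_and_data(bank_irq, gpio_irq_handler, irqdata);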
Merge tag 'iio-fixes-for-4.2b' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
Second set of IIO fixes for the 4.2 cycle. Note these depend (mostly) on
material in the recent merge window, hence their separation from set (a)
as the fixes-togreg branch predated the merge window. I am running rather
later with these than I would have liked, hence the large set.
* stk3310 fixes from Hartmut's review that came in post merge
- fix direction of proximity in line with the recent documentation
clarification.
- fix missing REGMAP_I2C dependency
- rework the error handling for raw readings to fix a failure to power
down in the event of a raw reading failing.
- fix a bug in the compensation code which was toggling an extra bit in the
register.
* mmc35240 - reported sampling frequencies were wrong.
* ltr501 fixes
- fix a case of returning the return value of a regmap_read instead of
the value read.
- fix missing regmap dependency
* sx9500 - fix missing default values for ret in a couple of places to handle
the case of no enabled channels.
* tmp006 - check that writes to info_mask elements are actually to writable
ones. Otherwise, writing to any of them will change the sampling frequency.
Merge tag 'iio-fixes-for-4.2a' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
First set of IIO fixes for the 4.2 cycle.
* Fix a regression in hid sensors suspend time as a result of adding runtime
pm. The normal flow of waking up devices in order to go into suspend
(given the devices are normally suspended when not reading) led to a regression
in suspend time on some laptops (reports of an additional 8 seconds).
Fix this by checking to see if a user action resulted in the wake up, and
make it a null operation if it didn't. Note that for hid sensors, there is
nothing useful to be done when moving into a full suspend from a runtime
suspend so they might as well be left alone.
* rockchip_saradc: fix some missing MODULE_* data including the licence so that
the driver does not taint the kernel incorrectly and can build as a module.
* twl4030 - mark irq as oneshot as it always should have been.
* inv-mpu - write formats for attributes not specified, leading to
misinterpretation of the gyro scale channel when written.
* Proximity ABI clarification. This had snuck through as a mess. Some
drivers thought proximity went in one direction, some the other. We went
with the most common option, documented it and fixed up the drivers going
the other way. Fix for sx9500 included in this set.
* ad5624r - fix a wrong shift in the output data.
* at91_adc - remove a false limit on the value of the STARTUP register
applied by too small a type for the device tree parameter.
* cm3323 - clear the bits when setting the integration time (otherwise
we can only ever set more bits in the relevant field).
* bmc150-accel - multiple triggers are registered, but on error were not being
unwound in the opposite order leading to removal of triggers that had not
yet successfully been registered (count down instead of up when unwinding).
* tcs3414 - ensure right part of val / val2 pair read so that the integration
time is not always 0.
* cc10001_adc - bug in kconfig dependency. Use of OR when AND was intended.
Daniel Vetter [Mon, 13 Jul 2015 06:22:22 +0000 (08:22 +0200)]
drm/i915: fix oops in primary_check_plane
On Sun, Jul 12, 2015 at 09:52:51AM -0700, Linus Torvalds wrote:
> On Sun, Jul 12, 2015 at 1:03 AM, Jörg Otte <[email protected]> wrote:
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000009
> > IP: [<ffffffffbd3447bb>] 0xffffffffbd3447bb
>
> Ugh. Please enable KALLSYMS to get sane symbols.
>
> But yes, "crtc_state->base.active" is at offset 9 from "crtc_state",
> so it's pretty clearly just that change frm
>
> - if (intel_crtc->active) {
> + if (crtc_state->base.active) {
>
> and "crtc_state" is NULL.
>
> And the code very much knows that crtc_state can be NULL, since it's
> initialized with
>
> crtc_state = state->base.state ?
> intel_atomic_get_crtc_state(state->base.state,
> intel_crtc) : NULL;
>
> Tssk. Daniel? Should I just revert that commit dec4f799d0a4
> ("drm/i915: Use crtc_state->active in primary check_plane func") for
> now, or is there a better fix? Like just checking crtc_state for NULL?
Indeed embarrassing. I've missed that we still have 1 caller left that's
using the transitional helpers, and those don't fill out
plane_state->state backpointers to the global atomic update since there is
no global atomic update for transitional helpers. Below diff should fix
this - we need to preferentially check crtc_state->active and if that's
not set intel_crtc->active should yield the right result for the one
remaining caller (it's in the crtc_disable paths).
Imre Deak [Wed, 8 Jul 2015 16:18:59 +0000 (19:18 +0300)]
drm/i915: remove unused has_dma_mapping flag
After the previous patch this flag will always be clear when checked, as it's
never set for shmem backed and userptr objects, so we can remove it.
Signed-off-by: Imre Deak <[email protected]>
Reviewed-by: Chris Wilson <[email protected]>
[danvet: Yeah this isn't really fixes but it's a nice cleanup to
clarify the code but not really worth the hassle of backmerging. So
just add to -fixes, we're still early in -rc.]
Signed-off-by: Daniel Vetter <[email protected]>
Imre Deak [Thu, 9 Jul 2015 09:59:05 +0000 (12:59 +0300)]
drm/i915: avoid leaking DMA mappings
We have 3 types of DMA mappings for GEM objects:
1. physically contiguous for stolen and for objects needing contiguous
memory
2. DMA-buf mappings imported via a DMA-buf attach operation
3. SG DMA mappings for shmem backed and userptr objects
For 1. and 2. the lifetime of the DMA mapping matches the lifetime of the
corresponding backing pages and so in practice we create/release the
mapping in the object's get_pages/put_pages callback.
For 3. the lifetime of the mapping matches that of any existing GPU binding
of the object, so we'll create the mapping when the object is bound to
the first vma and release the mapping when the object is unbound from its
last vma.
Since the object can be bound to multiple vmas, we can end up creating a
new DMA mapping in the 3. case even if the object already had one. This
is not allowed by the DMA API and can lead to leaked mapping data and
IOMMU memory space starvation in certain cases. For example HW IOMMU
drivers (intel_iommu) allocate a new range from their memory space
whenever a mapping is created, silently overriding a pre-existing
mapping.
Fix this by moving the creation/removal of DMA mappings to the object's
get_pages/put_pages callbacks. These callbacks already check for and do
an early return in case of any nested calls. This way objects of the 3.
case also become more like the other object types.
I noticed this issue by enabling DMA debugging, which got disabled after
a while due to its internal mapping tables getting full. It also reported
errors in connection to random other drivers that did a DMA mapping for
an address that was previously mapped by i915 but was never released.
Besides these diagnostic messages and the memory space starvation
problem for IOMMUs, I'm not aware of this causing a real issue.
The fix is based on a patch from Chris.
v2:
- move the DMA mapping create/remove calls to the get_pages/put_pages
callbacks instead of adding new callbacks for these (Chris)
v3:
- also fix the get_page cache logic on the userptr async path (Chris)
Tomas Elf [Thu, 9 Jul 2015 14:30:57 +0000 (15:30 +0100)]
drm/i915: Snapshot seqno of most recently submitted request.
The hang checker needs to inspect whether or not the ring request list is empty
as well as if the given engine has reached or passed the most recently
submitted request. The problem with this is that the hang checker cannot grab
the struct_mutex, which is required in order to safely inspect requests since
requests might be deallocated during inspection. In the past we've had kernel
panics due to this very unsynchronized access in the hang checker.
One solution to this problem is to not inspect the requests directly since
we're only interested in the seqno of the most recently submitted request - not
the request itself. Instead the seqno of the most recently submitted request is
stored separately, which the hang checker then inspects, circumventing the
issue of synchronization from the hang checker entirely.
drm/i915: Convert 'ring_idle()' to use requests not seqnos
v2 (Chris Wilson):
- Pass current engine seqno to ring_idle() from i915_hangcheck_elapsed() rather
than compute it over again.
- Remove extra whitespace.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=90112#c23
Signed-off-by: Chris Wilson <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
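A hedged sketch of that idea: stamp the engine with the seqno of the last
submitted request at submission time so the hang checker can compare seqnos
without struct_mutex (field name illustrative):
	/* at request submission, under the engine locks */
	ring->last_submitted_seqno = request->seqno;

	/* hang checker: no request list walk, no struct_mutex needed */
	static bool ring_idle(struct intel_engine_cs *ring, u32 seqno)
	{
		return list_empty(&ring->request_list) ||
		       i915_seqno_passed(seqno, ring->last_submitted_seqno);
	}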
perf report: Merge al->filtered with hist_entry->filtered
We stopped dropping samples for things filtered via the --comms, --dsos,
--symbols, etc, i.e. things marked as filtered in the symbol resolution
routines (thread__find_addr_map(), perf_event__preprocess_sample(),
etc).
perf top/tui: Update nr_entries properly after a filter is applied
We don't take into account entries that were filtered in
perf_event__preprocess_sample() and friends, which leads to
inconsistency in the browser seek routines, which expect the number of
hist_entry->filtered entries to match what they think is the number of
unfiltered, browsable entries.
So, for instance, when we do:
perf top --symbols ___non_existent_symbol___
the hist_browser__nr_entries() routine thinks there are no filters in
place, uses the hists->nr_entries but all entries are filtered, leading
to a segfault.
Tested with:
perf top --symbols malloc,free --percentage=relative
Freezing, by pressing 'f', at any time and doing the math on the
percentages ends up with 100%, ditto for:
perf top --dsos libpthread-2.20.so,libxul.so --percentage=relative
Both were segfaulting, all fixed now.
More work is needed to do away with checking if filters are in place; we
should just use the nr_non_filtered_samples counter, no need to
conditionally use it or hists.nr_filter, as what the browser does is
just show unfiltered stuff. An audit of how it is being accounted is
needed, this is the minimal fix.
1) Missing list head init in bluetooth hidp session creation, from Tedd
Ho-Jeong An.
2) Don't leak SKB in bridge netfilter error paths, from Florian
Westphal.
3) ipv6 netdevice private leak in netfilter bridging, fixed by Julien
Grall.
4) Fix regression in IP over hamradio bpq encapsulation, from Ralf
Baechle.
5) Fix race between rhashtable resize events and table walks, from Phil
Sutter.
6) Missing validation of IFLA_VF_INFO netlink attributes, fix from
Daniel Borkmann.
7) Missing security layer socket state initialization in tipc code,
from Stephen Smalley.
8) Fix shared IRQ handling in boomerang 3c59x interrupt handler, from
Denys Vlasenko.
9) Missing minor_idr destroy on module unload on macvtap driver, from
Johannes Thumshirn.
10) Various pktgen kernel thread races, from Oleg Nesterov.
11) Fix races that can cause packets to be processed in the backlog even
after a device attached to that SKB has been fully unregistered.
From Julian Anastasov.
12) bcmgenet driver doesn't account packet drops vs. errors properly,
fix from Petri Gynther.
13) Array index validation and off by one fix in DSA layer from Florian
Fainelli
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (66 commits)
can: replace timestamp as unique skb attribute
ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
can: c_can: Fix default pinmux glitch at init
can: rcar_can: unify error messages
can: rcar_can: print request_irq() error code
can: rcar_can: fix typo in error message
can: rcar_can: print signed IRQ #
can: rcar_can: fix IRQ check
net: dsa: Fix off-by-one in switch address parsing
net: dsa: Test array index before use
net: switchdev: don't abort unsupported operations
net: bcmgenet: fix accounting of packet drops vs errors
cdc_ncm: update specs URL
Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
bridge: mdb: allow the user to delete mdb entry if there's a querier
net: call rcu_read_lock early in process_backlog
net: do not process device backlog during unregistration
bridge: fix potential crash in __netdev_pick_tx()
net: axienet: Fix devm_ioremap_resource return value check
...
Pull crypto fixes from Herbert Xu:
"This fixes a duplicate dma_unmap_sg call in omap-des and reentrancy
bugs in the powerpc nx driver which may cause bogus output or worse
memory corruption"
dm: fix use after free crash due to incorrect cleanup sequence
Linux 4.2-rc1
Commit 0f20972f7bf6 ("dm: factor out a common
cleanup_mapped_device()") moved a common cleanup code to a separate
function. Unfortunately, that commit incorrectly changed the order of
cleanup, so that it destroys the mapped_device's srcu structure
'io_barrier' before destroying its workqueue.
The function that is executed on the workqueue (dm_wq_work) uses the srcu
structure, thus it may use it after being freed. That results in a
crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.
Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 0f20972f7bf6 ("dm: factor out a common cleanup_mapped_device()")
Signed-off-by: Mike Snitzer <[email protected]>
When setting up the symbols library we set up several filter lists,
for dsos, comms, symbols, etc, and there is code that, if there are
filters, do certain operations, like recalculate the number of non
filtered histogram entries in the top/report TUI.
But they were considering just the "Zoom" filters, when they also need to
take into account the above-mentioned filters (perf top --comms,
--dsos, etc).
So store in symbol_conf.has_filter true if any of those filters is in
place.
ASoC: zx: spdif: Fix devm_ioremap_resource return value check
Value returned by devm_ioremap_resource() was checked for non-NULL but
devm_ioremap_resource() returns IOMEM_ERR_PTR, not NULL. In case of
error this could lead to dereference of ERR_PTR.
ASoC: zx: i2s: Fix devm_ioremap_resource return value check
Value returned by devm_ioremap_resource() was checked for non-NULL but
devm_ioremap_resource() returns IOMEM_ERR_PTR, not NULL. In case of
error this could lead to dereference of ERR_PTR.
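For reference, the idiom both fixes switch to looks roughly like this
(variable names illustrative):
	base = devm_ioremap_resource(&pdev->dev, res);
	if (IS_ERR(base))		/* ERR_PTR-encoded error, never NULL */
		return PTR_ERR(base);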
Jeff Layton [Sat, 11 Jul 2015 10:43:03 +0000 (06:43 -0400)]
nfs4: have do_vfs_lock take an inode pointer
Now that we have file locking helpers that can deal with an inode
instead of a filp, we can change the NFSv4 locking code to use that
instead.
This should fix the case where we have a filp that is closed while flock
or OFD locks are set on it, and the task is signaled so that it doesn't
wait for the LOCKU reply to come in before the filp is freed. At that
point we can end up with a use-after-free with the current code, which
relies on dereferencing the fl_file in the lock request.
Jeff Layton [Sat, 11 Jul 2015 10:43:02 +0000 (06:43 -0400)]
locks: have flock_lock_file take an inode pointer instead of a filp
...and rename it to better describe how it works.
In order to fix a use-after-free in NFS, we need to be able to remove
locks from an inode after the filp associated with them may have already
been freed. flock_lock_file already only dereferences the filp to get to
the inode, so just change it so the callers do that.
All of the callers already pass in a lock request that has the fl_file
set properly, so we don't need to pass it in individually. With that
change it now only dereferences the filp to get to the inode, so just
push that out to the callers.
William reported that he was seeing instability with this patch, which
is likely due to the fact that it can cause the kernel to take a new
reference to a filp after the last reference has already been put.
Revert this patch for now, as we'll need to fix this in another way.
If a machine check happens, the machine has the vector facility installed
and the extended save area exists, the cpu will save vector register
contents into the extended save area. This is regardless of control
register 0 contents, which enables and disables the vector facility during
runtime.
On each machine check we should validate the vector registers. The current
code however tries to validate the registers only if the running task is
using vector registers in user space.
However even the current code is broken and causes vector register
corruption on machine checks, if user space uses them:
the prefix area contains a pointer (absolute address) to the machine check
extended save area. In order to save some space the save area was put into
an unused area of the second prefix page.
When validating vector register contents the code uses the absolute address
of the extended save area, which is wrong. Due to prefixing the vector
instructions will then access contents using absolute addresses instead
of real addresses, where the machine stored the contents.
Even if the above worked, there is still the problem that register validation
would only happen if user space uses vector registers. If kernel space uses
them also, this may also lead to vector register content corruption:
if the kernel makes use of vector instructions, but the current running
user space context does not, the machine check handler will validate
floating point registers instead of vector registers.
Given the fact that writing to a floating point register may change the
upper half of the corresponding vector register, we also experience vector
register corruption in this case.
Fix all of these issues, and always validate vector registers on each
machine check, if the machine has the vector facility installed and the
extended save area is defined.
The sfpc inline assembly within execve_tail() may incorrectly set bits
28-31 of the sfpc instruction to a value which is not zero.
These bits however are currently unused and therefore should be zero
so we won't get surprised if these bits will be used in the future.
Therefore remove the second operand from the inline assembly.
Stefan Haberland [Fri, 10 Jul 2015 08:47:09 +0000 (10:47 +0200)]
s390/dasd: fix kernel panic when alias is set offline
The dasd device driver selects which (alias or base) device is used
for a given request when the request is built. If the chosen alias
device is set offline before the request gets queued to the device
queue the starting function may use device structures that are
already freed. This might lead to a hanging offline process or a
kernel panic.
Add a check to the starting function that returns the request to the
upper layer if the device is already in offline processing.
In addition, prevent an alias device that is already in offline processing
from being chosen as the start device.
ARC: make sure instruction_pointer() returns unsigned value
Currently instruction_pointer() returns pt_regs->ret and so return value
is of type "long", which implicitly stands for "signed long".
While that's perfectly fine when dealing with 32-bit values if return
value of instruction_pointer() gets assigned to 64-bit variable sign
extension may happen.
And at least in one real use-case it happens already.
In perf_prepare_sample() return value of perf_instruction_pointer()
(which is an alias to instruction_pointer() in case of ARC) is assigned
to (struct perf_sample_data)->ip (which type is "u64").
And what we see: if the instruction pointer points to a user-space application,
which in case of ARC lies below 0x8000_0000, "ip" gets set properly with
32 leading zeros. But if the instruction pointer points to kernel address
space, which starts at 0x8000_0000, then "ip" is set with 32 leading
"f"-s. I.e. if instruction_pointer() returns 0x8100_0000, "ip" will be
assigned 0xffff_ffff__8100_0000, which is obviously wrong.
In particular, that issue broke the output of perf, because perf was unable
to associate addresses like 0xffff_ffff__8100_0000 with anything from
/proc/kallsyms.
That's what we used to see:
----------->8----------
6.27% ls [unknown] [k] 0xffffffff8046c5cc
2.96% ls libuClibc-0.9.34-git.so [.] memcpy
2.25% ls libuClibc-0.9.34-git.so [.] memset
1.66% ls [unknown] [k] 0xffffffff80666536
1.54% ls libuClibc-0.9.34-git.so [.] 0x000224d6
1.18% ls libuClibc-0.9.34-git.so [.] 0x00022472
----------->8----------
With that change perf output looks much better now:
----------->8----------
8.21% ls [kernel.kallsyms] [k] memset
3.52% ls libuClibc-0.9.34-git.so [.] memcpy
2.11% ls libuClibc-0.9.34-git.so [.] malloc
1.88% ls libuClibc-0.9.34-git.so [.] memset
1.64% ls [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.41% ls [kernel.kallsyms] [k] __d_lookup_rcu
----------->8----------
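A hedged sketch of the fix: make the accessor return an unsigned type so that
widening to u64 zero-extends instead of sign-extends (the real macro lives in
the ARC ptrace headers):
/* before, (regs)->ret had a signed type, so assigning it to a u64
 * sign-extended kernel addresses >= 0x8000_0000 */
#define instruction_pointer(regs)	((unsigned long)(regs)->ret)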
yao mark [Thu, 2 Jul 2015 07:07:31 +0000 (15:07 +0800)]
drm/rockchip: vop: remove hardware cursor window
The hardware cursor window only supports a few fixed sizes and does not
support a virtual width; when the cursor window is moved beyond the left
edge, the display output becomes wrong, so this window can't be used for
the cursor for now.
Also, tagging the hardware cursor window as an overlay is wrong and will
cause incorrect userspace behaviour.
Daniel Kurtz [Tue, 7 Jul 2015 09:03:36 +0000 (17:03 +0800)]
drm/rockchip: use drm_gem_mmap helpers
Rather than (incompletely [0]) re-implementing drm_gem_mmap() and
drm_gem_mmap_obj() helpers, call them directly from the rockchip mmap
routines.
Once the core functions return successfully, the rockchip mmap routines
can still use dma_mmap_attrs() to simply mmap the entire buffer.
[0] Previously, we were performing the mmap() without first taking a
reference on the underlying gem buffer. This could leak ptes if the gem
object is destroyed while userspace is still holding the mapping.
Heiko Stübner [Tue, 2 Jun 2015 14:41:45 +0000 (16:41 +0200)]
drm/rockchip: only call drm_fb_helper_hotplug_event if fb_helper present
Add a check for the presence of fb_helper to rockchip_drm_output_poll_changed()
to only call drm_fb_helper_hotplug_event if there is actually a fb_helper
available. Without this check I see NULL pointer dereferences when the
hdmi hotplug irq fires before the fb_helper got initialized.
Tomasz Figa [Mon, 11 May 2015 10:55:39 +0000 (19:55 +0900)]
drm/rockchip: Add BGR formats to VOP
VOP can support BGR formats in all windows thanks to red/blue swap option
provided in WINx_CTRL0 registers. This patch enables support for
ABGR8888, XBGR8888, BGR888 and BGR565 formats by using this feature.
David S. Miller [Mon, 13 Jul 2015 05:24:01 +0000 (22:24 -0700)]
Merge tag 'linux-can-fixes-for-4.2-20150712' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2015-07-12
this is a pull request of 8 patches for net/master.
Sergei Shtylyov contributes 5 patches for the rcar_can driver, fixing the IRQ
check and several info and error messages. There are two patches by J.D.
Schroeder and Roger Quadros for the c_can driver and dra7x-evm device tree,
which prevent a glitch in the DCAN1 pinmux. Oliver Hartkopp provides a better
approach to make the CAN skbs unique: the timestamp is replaced by a counter.
====================
The inb/outb/... family of IO methods end up being multiply defined when
building PCI support for the ColdFire. Compiling gives this:
CC init/main.o
In file included from ./arch/m68k/include/asm/io.h:4:0,
from include/linux/bio.h:30,
from include/linux/blkdev.h:18,
from init/main.c:75:
./arch/m68k/include/asm/io_mm.h:420:0: warning: "inb" redefined
./arch/m68k/include/asm/io_mm.h:108:0: note: this is the location of the previous definition
...
The ColdFire/PCI case defines its own IO access methods, so no others
should be defined or used in this case. Conditionally disable other
definitions that clash with it.
It would be nice if we could support multiple ColdFire SoC types in a
single binary - but currently the code simply does not support it.
Change the SoC selection config options to be a choice instead of
individual selectable entries.
This fixes problems with building allnoconfig, and means that a sane
linux kernel is generated for a single ColdFire SoC type.
m68knommu: improve the clock configuration defaults
Create some intelligent default settings for each ColdFire SoC type
in the configuration entry for CONFIG_CLOCK_FREQ.
The ColdFire clock frequency is configurable at build time. There is a
lot of variation in the frequency of operation on specific ColdFire based
boards. But we can choose a default that matches the maximum frequency
of clock operation for a particular ColdFire part. That is typically
the most common clock setting.
m68knommu: force setting of CONFIG_CLOCK_FREQ for ColdFire
It is possible to disable the clock selection at configuration time,
but for ColdFire targets we always expect a clock frequency to be
selected. This results in the following compile time error:
CC arch/m68k/kernel/asm-offsets.s
In file included from ./arch/m68k/include/asm/timex.h:14:0,
from include/linux/timex.h:65,
from include/linux/sched.h:19,
from arch/m68k/kernel/asm-offsets.c:14:
./arch/m68k/include/asm/coldfire.h:25:2: error: #error "Don't know what your ColdFire CPU clock frequency is??"
Remove CONFIG_CLOCK_SELECT completely and always enable CONFIG_CLOCK_FREQ
for ColdFire.
So the change to test 'crtc_state->base.active' cannot possibly be
correct as-is.
There may be some other minimal fix (like just checking crtc_state for
NULL), but I'm just reverting it now for the rc2 release, and people
like Daniel Vetter who actually know this code will figure out what the
right solution is in the longer term.
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"Fixes for this cycle regression in overlayfs and a couple of
long-standing (== all the way back to 2.6.12, at least) bugs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
freeing unlinked file indefinitely delayed
fix a braino in ovl_d_select_inode()
9p: don't leave a half-initialized inode sitting around
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"A fair number of 4.2 fixes also because Markos opened the flood gates.
- Patch up the math used to calculate the location for the page bitmap.
- The FDC (Not what you think, FDC stands for Fast Debug Channel) IRQ
workaround was causing issues on non-Malta platforms, so move the code
to a Malta specific location.
- A spelling fix replicated through several files.
- Fix to the emulation of an R2 instruction for R6 cores.
- Fix the JR emulation for R6.
- Further patching of mindless 64 bit issues.
- Ensure the kernel won't crash on CPUs with L2 caches with >= 8
ways.
- Use compat_sys_getsockopt for O32 ABI on 64 bit kernels.
- Fix cache flushing for multithreaded cores.
- A build fix"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: O32: Use compat_sys_getsockopt.
MIPS: c-r4k: Extend way_string array
MIPS: Pistachio: Support CDMM & Fast Debug Channel
MIPS: Malta: Make GIC FDC IRQ workaround Malta specific
MIPS: c-r4k: Fix cache flushing for MT cores
Revert "MIPS: Kconfig: Disable SMP/CPS for 64-bit"
MIPS: cps-vec: Use macros for various arithmetics and memory operations
MIPS: kernel: cps-vec: Replace KSEG0 with CKSEG0
MIPS: kernel: cps-vec: Use ta0-ta3 pseudo-registers for 64-bit
MIPS: kernel: cps-vec: Replace mips32r2 ISA level with mips64r2
MIPS: kernel: cps-vec: Replace 'la' macro with PTR_LA
MIPS: kernel: smp-cps: Fix 64-bit compatibility errors due to pointer casting
MIPS: Fix erroneous JR emulation for MIPS R6
MIPS: Fix branch emulation for BLTC and BGEC instructions
MIPS: kernel: traps: Fix broken indentation
MIPS: bootmem: Don't use memory holes for page bitmap
MIPS: O32: Do not handle require 32 bytes from the stack to be readable.
MIPS, CPUFREQ: Fix spelling of Institute.
MIPS: Lemote 2F: Fix build caused by recent mass rename.
Oliver Hartkopp [Fri, 26 Jun 2015 09:58:19 +0000 (11:58 +0200)]
can: replace timestamp as unique skb attribute
Commit 514ac99c64b "can: fix multiple delivery of a single CAN frame for
overlapping CAN filters" requires the skb->tstamp to be set to check for
identical CAN skbs.
Without timestamping being required by user space applications this timestamp
was not generated, which led to commit 36c01245eb8 "can: fix loss of CAN frames
in raw_rcv" - which forces the timestamp to be set in all CAN related skbuffs
by introducing several __net_timestamp() calls.
This forces e.g. out of tree drivers which are not using alloc_can{,fd}_skb()
to add __net_timestamp() after skbuff creation to prevent the frame loss fixed
in mainline Linux.
This patch removes the timestamp dependency and uses an atomic counter to
create a unique identifier together with the skbuff pointer.
Btw: the new skbcnt element introduced in struct can_skb_priv has to be
initialized with zero in out-of-tree drivers which are not using
alloc_can{,fd}_skb() too.
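A hedged sketch of the counter-based identifier; skbcnt is the new element
mentioned above, the counter name is illustrative:
	static atomic_t skbcounter = ATOMIC_INIT(0);

	/* stamp each frame once with a non-zero id; together with the skb
	 * pointer this replaces the tstamp comparison */
	if (!can_skb_prv(skb)->skbcnt)
		can_skb_prv(skb)->skbcnt = atomic_inc_return(&skbcounter);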
Roger Quadros [Tue, 7 Jul 2015 14:27:57 +0000 (17:27 +0300)]
ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
Driver core sets "default" pinmux on on probe and CAN driver
sets "sleep" pinmux during register. This causes a small window
where the CAN pins are in "default" state with the DCAN module
being disabled.
Change the "default" state to be like sleep so this glitch is
avoided. Add a new "active" state that is used by the driver
when CAN is actually active.
The previous change 3973c526ae9c (net: can: c_can: Disable pins when CAN
interface is down) causes a slight glitch on the pinctrl settings when used.
Since commit ab78029 (drivers/pinctrl: grab default handles from device core),
the device core will automatically set the default pins. This causes the pins
to be momentarily set to the default and then to the sleep state in
register_c_can_dev(). By adding an optional "enable" state, boards can set the
default pin state to be disabled and avoid the glitch when the switch from
default to sleep first occurs. If the "enable" state is not available
c_can_pinctrl_select_state() falls back to using the "default" pinctrl state.
[Roger Q] - Forward port to v4.2 and use pinctrl_get_select().
Sergei Shtylyov [Sat, 20 Jun 2015 00:34:55 +0000 (03:34 +0300)]
can: rcar_can: fix typo in error message
Fix typo in the first error message printed by rcar_can_open().
Based on the original patch by Vladimir Barinov.
Fixes: 862e2b6af941 ("can: rcar_can: support all input clocks")
Reported-by: Vladimir Barinov <[email protected]>
Signed-off-by: Sergei Shtylyov <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
Sergei Shtylyov [Sat, 20 Jun 2015 00:32:46 +0000 (03:32 +0300)]
can: rcar_can: fix IRQ check
rcar_can_probe() regards 0 as a wrong IRQ #, despite the fact that
platform_get_irq(), which it calls, returns a negative error code in that
case. This leads to the following
being printed to the console when attempting to open the device:
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- the high latency PIT detection fix, which slipped through the cracks
for rc1
- a regression fix for the early printk mechanism
- the x86 part to plug irq/vector related hotplug races
- move the allocation of the espfix pages on cpu hotplug to non atomic
context. The current code triggers a might_sleep() warning.
- a series of KASAN fixes addressing boot crashes and usability
- a trivial typo fix for Kconfig help text
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kconfig: Fix typo in the CONFIG_CMDLINE_BOOL help text
x86/irq: Retrieve irq data after locking irq_desc
x86/irq: Use proper locking in check_irq_vectors_for_cpu_disable()
x86/irq: Plug irq vector hotplug race
x86/earlyprintk: Allow early_printk() to use console style parameters like '115200n8'
x86/espfix: Init espfix on the boot CPU side
x86/espfix: Add 'cpu' parameter to init_espfix_ap()
x86/kasan: Move KASAN_SHADOW_OFFSET to the arch Kconfig
x86/kasan: Add message about KASAN being initialized
x86/kasan: Fix boot crash on AMD processors
x86/kasan: Flush TLBs after switching CR3
x86/kasan: Fix KASAN shadow region page tables
x86/init: Clear 'init_level4_pgt' earlier
x86/tsc: Let high latency PIT fail fast in quick_pit_calibrate()
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"This update from the timer departement contains:
- A series of patches which address a shortcoming in the tick
broadcast code.
If the broadcast device is not available, or is an hrtimer-emulated
broadcast device, some of the original assumptions lead to boot
failures. I rather plugged all of the corner cases instead of only
addressing the issue reported, so the change got a little larger.
Has been extensively tested on x86 and arm.
- Get rid of the last holdouts using do_posix_clock_monotonic_gettime()
- A regression fix for the imx clocksource driver
- An update to the new state callbacks mechanism for clockevents.
This is required to simplify the conversion, which will take place
in 4.3"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Prevent NULL pointer dereference
time: Get rid of do_posix_clock_monotonic_gettime
cris: Replace do_posix_clock_monotonic_gettime()
tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
tick/broadcast: Handle spurious interrupts gracefully
tick/broadcast: Check for hrtimer broadcast active early
tick/broadcast: Return busy when IPI is pending
tick/broadcast: Return busy if periodic mode and hrtimer broadcast
tick/broadcast: Move the check for periodic mode inside state handling
tick/broadcast: Prevent deep idle if no broadcast device available
tick/broadcast: Make idle check independent from mode and config
tick/broadcast: Sanity check the shutdown of the local clock_event
tick/broadcast: Prevent hrtimer recursion
clockevents: Allow set-state callbacks to be optional
clocksource/imx: Define clocksource for mx27
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for a cpu hotplug race vs. interrupt descriptors:
Prevent irq setup/teardown across the cpu starting/dying parts of cpu
hotplug so that the starting/dying cpu has a stable view of the
descriptor space. This has been an issue for all architectures in the
cpu dying phase, where interrupts are migrated away from the dying
cpu. In the starting phase it's mostly an x86 issue vs the vector space
update"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hotplug: Prevent alloc/free of irq descriptors during cpu up/down
Al Viro [Wed, 8 Jul 2015 01:42:38 +0000 (02:42 +0100)]
freeing unlinked file indefinitely delayed
Normally opening a file, unlinking it and then closing will have
the inode freed upon close() (provided that it's not otherwise busy and
has no remaining links, of course). However, there's one case where that
does *not* happen. Namely, if you open it by fhandle with cold dcache,
then unlink() and close().
In the normal case you get d_delete() in unlink(2) notice that the dentry
is busy and unhash it; on the final dput() it will be forcibly evicted from
dcache, triggering iput() and inode removal. In this case, though, we end
up with *two* dentries - disconnected (created by open-by-fhandle) and
regular one (used by unlink()). The latter will have its reference to inode
dropped just fine, but the former will not - it's considered hashed (it
is on the ->s_anon list), so it will stay around until memory pressure
finally does it in. As a result, we have the final iput() delayed
indefinitely. It's trivial to reproduce -
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

#ifndef MAX_HANDLE_SZ
#define MAX_HANDLE_SZ 128	/* kernel value; not exported to userspace headers */
#endif

static char buf[20 * 1024 * 1024];	/* the 20Mb written below */

/* stand-in for evicting clean dentries; the commit text triggers this via
 * a remount, dropping caches should work as well */
static void flush_dcache(void)
{
	system("echo 2 > /proc/sys/vm/drop_caches");
}

int main(void)
{
	int fd;
	union {
		struct file_handle f;
		char buf[MAX_HANDLE_SZ];
	} x;
	int m;

	x.f.handle_bytes = sizeof(x);
	chdir("/root");
	mkdir("foo", 0700);
	fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
	close(fd);
	name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
	flush_dcache();
	fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
	unlink("foo/bar");
	write(fd, buf, sizeof(buf));
	system("df .");		/* 20Mb eaten */
	close(fd);
	system("df .");		/* should've freed those 20Mb */
	flush_dcache();
	system("df .");		/* should be the same as #2 */
	return 0;
}
will spit out something like
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 303843 1131 100% /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 303843 1131 100% /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 283282 21692 93% /
- inode gets freed only when dentry is finally evicted (here we trigger
that by remount; normally it would've happened in response to memory
pressure hell knows when).
David S. Miller [Sun, 12 Jul 2015 06:25:16 +0000 (23:25 -0700)]
Merge branch 'dsa-of-parsing-fixes'
Florian Fainelli says:
====================
net: dsa: OF parsing fixes
This patch series fixes two small parsing issues, the first one was
reported by Dan, the second came after looking more closely at the
code.
====================
net: dsa: Fix off-by-one in switch address parsing
cd->sw_addr is used as a MDIO bus address, which cannot exceed
PHY_MAX_ADDR (32), our check was off-by-one.
Fixes: 5e95329b701c ("dsa: add device tree bindings to register DSA switches") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
port_index is used as an index into an array, and this information comes
from Device Tree, make sure that port_index is not equal to the array
size before using it. Move the check against port_index earlier in the
loop.
Fixes: 5e95329b701c ("dsa: add device tree bindings to register DSA switches")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
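A hedged sketch of this bound check as well; the loop context is omitted and
be32_to_cpup()/DSA_MAX_PORTS are assumptions about the surrounding code:
	port_index = be32_to_cpup(port_reg);
	if (port_index == DSA_MAX_PORTS)	/* test the index before any use */
		break;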
Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
Pull libnvdimm fixes from Dan Williams:
"1) Fixes for a handful of smatch reports (Thanks Dan C.!) and minor
bug fixes (patches 1-6)
2) Correctness fixes to the BLK-mode nvdimm driver (patches 7-10).
Granted these are slightly large for a -rc update. They have been
out for review in one form or another since the end of May and were
deferred from the merge window while we settled on the "PMEM API"
for the PMEM-mode nvdimm driver (ie memremap_pmem, memcpy_to_pmem,
and wmb_pmem).
Now that those apis are merged we implement them in the BLK driver
to guarantee that mmio aperture moves stay ordered with respect to
incoming read/write requests, and that writes are flushed through
those mmio-windows and platform-buffers to be persistent on media.
These pass the sub-system unit tests with the updates to
tools/testing/nvdimm, and have received a successful build-report from
the kbuild robot (468 configs).
With acks from Rafael for the touches to drivers/acpi/"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm:
nfit: add support for NVDIMM "latch" flag
nfit: update block I/O path to use PMEM API
tools/testing/nvdimm: add mock acpi_nfit_flush_address entries to nfit_test
tools/testing/nvdimm: fix return code for unimplemented commands
tools/testing/nvdimm: mock ioremap_wt
pmem: add maintainer for include/linux/pmem.h
nfit: fix smatch "use after null check" report
nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails
libnvdimm: smatch cleanups in __nd_ioctl
sparse: fix misplaced __pmem definition
Filipe Manana [Thu, 9 Jul 2015 12:13:44 +0000 (13:13 +0100)]
Btrfs: fix order by which delayed references are run
When we have an extent that got N references removed and N new references
added in the same transaction, we must run the insertion of the references
first because otherwise the last removed reference will remove the extent
item from the extent tree, resulting in a failure for the insertions.
This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_cloner
_require_metadata_journaling $SCRATCH_DEV
# Now write to the last 80K of the prealloc extent plus 40K to the unallocated
# space that immediately follows it. This creates a new extent of 40K that spans
# the range [620K, 660K[.
$XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io
# At this point, there are now 2 back references to the prealloc extent in our
# extent tree. Both are for our file offset 160K and one relates to a file
# extent item with a data offset of 0 and a length of 380K, while the other
# relates to a file extent item with a data offset of 380K and a length of 80K.
# Make sure everything done so far is durably persisted (all back references are
# in the extent tree, etc).
sync
# Now clone all extents of our file that cover the offset 160K up to its eof
# (660K at this point) into itself at offset 2M. This leaves a hole in the file
# covering the range [660K, 2M[. The prealloc extent will now be referenced by
# the file twice, once for offset 160K and once for offset 2M. The 40K extent
# that follows the prealloc extent will also be referenced twice by our file,
# once for offset 620K and once for offset 2M + 460K.
$CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
$SCRATCH_MNT/foo
# Now create one new extent in our file with a size of 100Kb. It will span the
# range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
# range [2M + 460K, 3M[. Our new file size is 3M + 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io
# At this point, there are now (in memory) 4 back references to the prealloc
# extent.
#
# Two of them are for file offset 160K, related to file extent items
# matching the file offsets 160K and 540K respectively, with data offsets of
# 0 and 380K respectively, and with lengths of 380K and 80K respectively.
#
# The other two references are for file offset 2M, related to file extent items
# matching the file offsets 2M and 2M + 380K respectively, with data offsets of
# 0 and 380K respectively, and with lengths of 380K and 80K respectively.
#
# The 40K extent has 2 back references, one for file offset 620K and the other
# for file offset 2M + 460K.
#
# The 100K extent has a single back reference and it relates to file offset 3M.
# Now clone our 100K extent into offset 600K. That offset covers the last 20K
# of the prealloc extent, the whole 40K extent and 40K of the hole starting at
# offset 660K.
$CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# At this point there's only one reference to the 40K extent, at file offset
# 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
# 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
# file offset 3M and a new one for file offset 600K).
# Now fsync our file to make sure all its new data and metadata updates are durably
# persisted and present if a power failure/crash happens after a successful
# fsync and before the next transaction commit.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Silently drop all writes and unmount to simulate a crash/power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file contents.
# During log replay, the btrfs delayed references implementation used to run the
# deletion of back references before the addition of new back references, which
# made the addition fail as it didn't find the key in the extent tree that it
# was looking for. The failure triggered by this test was related to the 40K
# extent, which got 1 reference dropped and 1 reference added during the fsync
# log replay - when running the delayed references at transaction commit time,
# btrfs was applying the deletion before the insertion, resulting in a failure
# of the insertion that ended up turning the fs into read-only mode.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File digest after log replay:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
_unmount_flakey
status=0
exit
This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:
Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Signed-off-by: Filipe Manana <[email protected]>
Acked-by: Qu Wenruo <[email protected]>
Filipe Manana [Fri, 3 Jul 2015 19:30:34 +0000 (20:30 +0100)]
Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.
This produces the following warning on a kernel with linked list
debugging enabled:
On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().
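The fix itself is a one-line change in each of the two paths; a sketch with
illustrative list names:
	/* list_splice_init() re-initialises the source list, so the handle's
	 * "ordered" list cannot be spliced into pending_ordered a second time */
	list_splice_init(&trans->ordered, &cur_trans->pending_ordered);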
Cc: [email protected]
Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3")
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>