Dasgupta, Romit [Thu, 6 Nov 2008 10:16:18 +0000 (15:46 +0530)]
[netdrvr] smc911x: fix for driver resume (and compilation warning)
I am trying out suspend, resume on an OMAP3 based board. What I see
during resume is that the SMC911x driver resume routing gets stuck
after trying to transmit the packet out of the controller. Some debug
messages below:
--> smc911x_drv_resume
eth0: --> smc911x_reset
eth0: smc911x_reset timeout waiting for PM restore
eth0: --> smc911x_enable
eth0: --> smc911x_phy_configure()
eth0: --> smc911x_phy_reset()
eth0: phy caps=0x782d
eth0: phy advertised caps=0x0de1
eth0: --> smc911x_phy_check_media
smc911x_phy_read: phyaddr=0x1, phyreg=0x01, phydata=0x7809
smc911x_phy_read: phyaddr=0x1, phyreg=0x01, phydata=0x7809
eth0: link down
Restarting tasks ... eth0: --> smc911x_hard_start_xmit
eth0: --> smc911x_hardware_send_pkt
eth0: --> smc911x_hard_start_xmit
eth0: --> smc911x_hardware_send_pkt
eth0: --> smc911x_hard_start_xmit
eth0: --> smc911x_hardware_send_pkt
nfs: server 172.24.190.217 not responding, still trying
nfs: server 172.24.190.217 not responding, still trying
The following change makes it work fine: (The change within
smc911x_drv_probe function was to get rid of a compilation warning).
Steve Wise [Thu, 6 Nov 2008 23:06:42 +0000 (17:06 -0600)]
RDMA/cxgb3: deadlock in iw_cxgb3 can cause hang when configuring interface.
When the iw_cxgb3 module's cxgb3_client "add" func gets called by the
cxgb3 module, the iwarp driver ends up calling the ethtool ops get_drvinfo
function in cxgb3 to get the fw version and other info. Currently the
iwarp driver grabs the rtnl lock around this down call to serialize.
As of 2.6.27 or so, things changed such that the rtnl lock is held around
the call to the netdev driver open function. Also the cxgb3_client "add"
function doesn't get called if the device is down.
So, if you load cxgb3, then load iw_cxgb3, then ifconfig up the device,
the iw_cxgb3 add func gets called with the rtnl_lock held. If you
load cxgb3, ifconfig up the device, then load iw_cxgb3, the add func
gets called without the rtnl_lock held. The former causes the deadlock,
the latter does not.
In addition, there are iw_cxgb3 sysfs handlers that also can call
down into cxgb3 to gather the fw and hw versions. These can be called
concurrently on different processors and at any time. Thus we need to
push this serialization down in the cxgb3 driver get_drvinfo func.
The fix is to remove rtnl lock usage, and use a per-device lock in cxgb3.
Brice Goglin [Mon, 10 Nov 2008 12:58:41 +0000 (13:58 +0100)]
myri10ge: fix stop/go ordering even more
The doorbell writes may be seen out of order by the firmware if they
are in WC memory since the tx spin(un)lock does not flush WC writes.
Hence if the "stop" is written on a different CPU than the "go", it
is possible that the stop will arrive after the go unless we add an
explicit memory barrier (and mmiowb() is not enough).
Paul Mackerras [Tue, 11 Nov 2008 08:39:26 +0000 (19:39 +1100)]
powerpc: Update desktop/server defconfigs
Turned off CONFIG_PCI_LEGACY and turned on EXT4, and otherwise mostly
took the defaults. This also updates ppc6xx_defconfig, which covers
the 6xx/7xx/7xxx-based embedded boards.
Andreas Schwab [Thu, 6 Nov 2008 00:49:00 +0000 (00:49 +0000)]
powerpc: Fix msr check in compat_sys_swapcontext
The new context may not be 16-byte aligned, so the real address of the
mcontext structure should be read from the uc_regs pointer instead of
directly using the (unaligned) uc_mcontext field.
Keith Packard [Sat, 8 Nov 2008 01:44:14 +0000 (11:44 +1000)]
drm/i915: Move legacy breadcrumb out of the reserved status page area
Addresses in the hardware status page below index 0x20 are reserved for use
by the hardware. The legacy breadcrumb was sitting at index 5. Move it to
index 0x21, and make sure everyone uses the defined value instead of
hard-coded constants.
Dave Airlie [Tue, 11 Nov 2008 08:02:12 +0000 (18:02 +1000)]
drm/i915: Filter pci devices based on PCI_CLASS_DISPLAY_VGA
This fixes hangs on 855-class hardware by avoiding double attachment of the
driver due to the stub second head device having the same pci id as the real
device.
Other DRM drivers probably want this treatment as well, but I'm applying it
just to this one for safety. But we should clean up the drm_pciids.h mess
now so that each driver has its own pci id list header in its own directory.
Lets do that in the next release.
Tejun Heo [Tue, 4 Nov 2008 08:08:40 +0000 (17:08 +0900)]
libata: fix last_reset timestamp handling
ehc->last_reset is used to ensure that resets are not issued too
close to each other. It's initialized to jiffies minus one minute
on EH entry. However, when new links are initialized after PMP is
probed, new links have zero for this timestamp resulting in long wait
depending on the current jiffies.
This patch makes last_set considered iff ATA_EHI_DID_RESET is set, in
which case last_reset is always initialized. As an added precaution,
WARN_ON() is added so that warning is printed if last_reset is
in future.
This problem is spotted and debugged by Shane Huang.
Roland Dreier [Tue, 4 Nov 2008 18:34:48 +0000 (10:34 -0800)]
libata: Avoid overflow in ata_tf_read_block() when tf->hba_lbal > 127
Phillip O'Donnell <[email protected]> pointed out that the same
sign extension bug that was fixed in commit ba14a9c2 ("libata: Avoid
overflow in ata_tf_to_lba48() when tf->hba_lbal > 127") also appears to
exist in ata_tf_read_block(). Fix this by adding a cast to u64.
Dave Airlie [Tue, 11 Nov 2008 07:56:16 +0000 (17:56 +1000)]
drm/radeon: map registers at load time
Now that the radeon driver has suspend/resume functions, it needs to map its
registers at load time or it will likely crash if a suspend operation occurs
before the driver has been initialized.
This patch moves the register mapping code from firstopen to load and makes
the mapping into a _DRM_DRIVER one so that the core won't remove it at
lastclose time.
Eric Anholt [Tue, 4 Nov 2008 20:01:24 +0000 (12:01 -0800)]
i915: Remove racy delayed vblank swap ioctl.
When userland detected that this ioctl was supported (by version number check),
it used it in a racy way -- dispatch delayed swap, wait for vblank, continue
rendering. As there was no mechanism for it to wait for the swap to finish,
sometimes it would render before the swap and garbage would be displayed on
the screen.
By removing the ioctl and returning -EINVAL, userland returns to its previous,
correct rendering path of waiting for a vblank then dispatching a swap. The
only path that could have used this ioctl correctly was page flipping, which
relied on only one client running and emitting wait-for-vblank-before-rendering
in the command stream. That path also falls back correctly, at the performance
cost of not being able to queue up rendering before the flip occurs.
Owen Taylor [Mon, 3 Nov 2008 22:38:17 +0000 (14:38 -0800)]
i915: Don't attempt to short-circuit object_wait_rendering by checking domains.
This could return early when reading after writing a buffer, if somebody
had already put it on the flushing list (write domains are 0, but still
active), leading to glReadPixels failure.
Keith Packard [Mon, 3 Nov 2008 07:38:20 +0000 (23:38 -0800)]
i915: Clean up sarea pointers on leavevt
This corresponds to the setup of the sarea pointers in DMA initialization,
though neither is exactly the point at which the sarea is set up or torn down.
Oleg Nesterov [Mon, 10 Nov 2008 14:39:30 +0000 (15:39 +0100)]
fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock
Impact: fix hang/crash on ia64 under high load
This is ugly, but the simplest patch by far.
Unlike other similar routines, account_group_exec_runtime() could be
called "implicitly" from within scheduler after exit_notify(). This
means we can race with the parent doing release_task(), we can't just
check ->signal != NULL.
Change __exit_signal() to do spin_unlock_wait(&task_rq(tsk)->lock)
before __cleanup_signal() to make sure ->signal can't be freed under
task_rq(tsk)->lock. Note that task_rq_unlock_wait() doesn't care
about the case when tsk changes cpu/rq under us, this should be OK.
Thanks to Ingo who nacked my previous buggy patch.
Before commit b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 ("net: only
invoke dev->change_rx_flags when device is UP"), the dsa driver could
sort-of get away with only fiddling with the master interface's
allmulti/promisc counts in ->change_rx_flags() and not touching them
in ->open() or ->stop(). After this commit (note that it was merged
almost simultaneously with the dsa patches, which is why this wasn't
caught initially), the breakage that was already there became more
apparent.
Since it makes no sense to keep the master interface's allmulti or
promisc count pinned for a slave interface that is down, copy the vlan
driver's sync logic (which does exactly what we want) over to dsa to
fix this.
dsa: fix skb->pkt_type when mac address of slave interface differs
When a dsa slave interface has a mac address that differs from that
of the master interface, eth_type_trans() won't explicitly set
skb->pkt_type back to PACKET_HOST -- we need to do this ourselves
before calling eth_type_trans().
net: fix setting of skb->tail in skb_recycle_check()
Since skb_reset_tail_pointer() reads skb->data, we need to set
skb->data before calling skb_reset_tail_pointer(). This was causing
spurious skb_over_panic()s from skb_put() being called on a recycled
skb that had its skb->tail set to beyond where it should have been.
Steven Rostedt [Tue, 11 Nov 2008 02:46:01 +0000 (21:46 -0500)]
ring-buffer: prevent infinite looping on time stamping
Impact: removal of unnecessary looping
The lockless part of the ring buffer allows for reentry into the code
from interrupts. A timestamp is taken, a test is preformed and if it
detects that an interrupt occurred that did tracing, it tries again.
The problem arises if the timestamp code itself causes a trace.
The detection will detect this and loop again. The difference between
this and an interrupt doing tracing, is that this will fail every time,
and cause an infinite loop.
Currently, we test if the loop happens 1000 times, and if so, it will
produce a warning and disable the ring buffer.
The problem with this approach is that it makes it difficult to perform
some types of tracing (tracing the timestamp code itself).
Each trace entry has a delta timestamp from the previous entry.
If a trace entry is reserved but and interrupt occurs and traces before
the previous entry is commited, the delta timestamp for that entry will
be zero. This actually makes sense in terms of tracing, because the
interrupt entry happened before the preempted entry was commited, so
one may consider the two happening at the same time. The order is
still preserved in the buffer.
With this idea, instead of trying to get a new timestamp if an interrupt
made it in between the timestamp and the test, the entry could simply
make the delta zero and continue. This will prevent interrupts or
tracers in the timer code from causing the above loop.
Steven Rostedt [Tue, 11 Nov 2008 02:46:00 +0000 (21:46 -0500)]
ftrace: disable tracing on resize
Impact: fix for bug on resize
This patch addresses the bug found here:
http://bugzilla.kernel.org/show_bug.cgi?id=11996
When ftrace converted to the new unified trace buffer, the resizing of
the buffer was not protected as much as it was originally. If tracing
is performed while the resize occurs, then the buffer can be corrupted.
This patch disables all ftrace buffer modifications before a resize
takes place.
Benjamin Thery [Tue, 11 Nov 2008 00:34:11 +0000 (16:34 -0800)]
ipv6: fix ip6_mr_init error path
The order of cleanup operations in the error/exit section of ip6_mr_init()
is completely inversed. It should be the other way around.
Also a del_timer() is missing in the error path.
Error handling needs to be modified in dma_pin_iovec_pages().
It should return NULL instead of ERR_PTR
(pinned_list is checked for NULL in tcp_recvmsg() to determine
if iovec pages have been successfully pinned down).
In case of error for the first iovec,
local_list->nr_iovecs needs to be initialized.
[1/4] I/OAT: fix channel resources free for not allocated channels
If the ioatdma driver is loaded but not used it does not allocate descriptors.
Before it frees channel resources it should first be sure
that they have been previously allocated.
While the patch is correct it does not take into account that spurious
wakeups happen on x86. A fix for this issue is available, but we just
revert to the .27 behaviour and let long running softirqs screw
themself.
Thomas Gleixner [Fri, 7 Nov 2008 10:06:00 +0000 (11:06 +0100)]
irq: call __irq_enter() before calling the tick_idle_check
Impact: avoid spurious ksoftirqd wakeups
The tick idle check which is called from irq_enter() was run before
the call to __irq_enter() which did not set the in_interrupt() bits in
preempt_count. That way the raise of a softirq woke up softirqd for
nothing as the softirq was handled on return from interrupt.
Call __irq_enter() before calling into the tick idle check code.
While investigating the failure of hibernation on 32-bit x86 with
CONFIG_NUMA set, as described in this message
http://marc.info/?l=linux-kernel&m=122634118116226&w=4
I asked some people for help and I was told that it wasn't really
worth the effort, because CONFIG_NUMA was generally broken on 32-bit
x86 systems and it shouldn't be used in such configs. For this
reason, make CONFIG_NUMA depend on BROKEN instead of EXPERIMENTAL on
x86-32.
David Howells [Mon, 10 Nov 2008 19:00:05 +0000 (19:00 +0000)]
KEYS: Make request key instantiate the per-user keyrings
Make request_key() instantiate the per-user keyrings so that it doesn't oops
if it needs to get hold of the user session keyring because there isn't a
session keyring in place.
Trent Piepho [Mon, 10 Nov 2008 21:09:21 +0000 (13:09 -0800)]
powerpc: Repair device bindings documentation
Commit d0fc2eaaf4c56a95f5ed29b6bfb609e19714fc16 "powerpc/fsl: Refactor
device bindings" split out a number of device bindings from
booting-without-of.txt into separate files. Having them all in one file
was a frequent source of merge conflicts.
However, in the next merge, 49997d75152b3d23c53b0fa730599f2f74c92c65, there
was another conflict. Some of the bindings removed from
booting-without-of.txt were mistakenly added back in and the copies in
dts-bindings were kept as well.
This patch re-removes "Freescale Display Interface" and "Freescale on board
FPGA" and fixes the table of contents.
Chris Mason [Mon, 10 Nov 2008 21:13:54 +0000 (16:13 -0500)]
Btrfs: empty_size allocation fixes again
The allocator wasn't catching all of the cases where it needed to do
extra loops because the check to enforce them wasn't happening early
enough.
When the allocator decided to increase the size of the allocation
for metadata clustering, it wasn't always setting the empty_size to
include the extra (optional) bytes. This also fixes the empty_size field
to be correct.
Chris Mason [Mon, 10 Nov 2008 18:08:31 +0000 (13:08 -0500)]
Btrfs: tune btrfs unplug functions for a small number of devices
When btrfs unplugs, it tries to find the correct device to unplug
via search through the extent_map tree. This avoids unplugging
a device that doesn't need it, but is a waste of time for filesystems
with a small number of devices.
This patch checks the total number of devices before doing the
search.
Tiger Yang [Sun, 2 Nov 2008 11:04:21 +0000 (19:04 +0800)]
ocfs2: Check search result in ocfs2_xattr_block_get()
ocfs2_xattr_block_get() calls ocfs2_xattr_search() to find an external
xattr, but doesn't check the search result that is passed back via struct
ocfs2_xattr_search. Add a check for search result, and pass back -ENODATA if
the xattr search failed. This avoids a later NULL pointer error.
Tao Ma [Sun, 26 Oct 2008 22:06:24 +0000 (06:06 +0800)]
ocfs2/xattr: Proper hash collision handle in bucket division
In ocfs2/xattr, we must make sure the xattrs which have the same hash value
exist in the same bucket so that the search schema can work. But in the old
implementation, when we want to extend a bucket, we just move half number of
xattrs to the new bucket. This works in most cases, but if we are lucky
enough we will move 2 xattrs into 2 different buckets. This means that an
xattr from the previous bucket cannot be found anymore. This patch fix this
problem by finding the right position during extending the bucket and extend
an empty bucket if needed.
Tao Ma [Mon, 6 Oct 2008 08:59:55 +0000 (16:59 +0800)]
ocfs2: return 0 in page_mkwrite to let VFS retry.
In ocfs2_page_mkwrite, we return -EINVAL when we found the page mapping
isn't updated, and it will cause the user space program get SIGBUS and
exit. The reason is that during race writeable mmap, we will do
unmap_mapping_range in ocfs2_data_downconvert_worker. The good thing is
that if we reuturn 0 in page_mkwrite, VFS will retry fault and then
call page_mkwrite again, so it is safe to return 0 here.
Sunil Mushran [Wed, 22 Oct 2008 20:24:29 +0000 (13:24 -0700)]
ocfs2: Set journal descriptor to NULL after journal shutdown
Patch sets journal descriptor to NULL after the journal is shutdown.
This ensures that jbd2_journal_release_jbd_inode(), which removes the
jbd2 inode from txn lists, can be called safely from ocfs2_clear_inode()
even after the journal has been shutdown.
Tao Ma [Thu, 23 Oct 2008 23:57:28 +0000 (07:57 +0800)]
ocfs2: Fix check of return value of ocfs2_start_trans() in xattr.c.
On failure, ocfs2_start_trans() returns values like ERR_PTR(-ENOMEM),
so we should check whether handle is NULL. Fix them to use IS_ERR().
Jan has made the patch for other part in ocfs2(thank Jan for it), so
this is just the fix for fs/ocfs2/xattr.c.
Jan Kara [Mon, 20 Oct 2008 17:23:54 +0000 (19:23 +0200)]
ocfs2: Let inode be really deleted when ocfs2_mknod_locked() fails
We forgot to set i_nlink to 0 when returning due to error from ocfs2_mknod_locked()
and thus inode was not properly released via ocfs2_delete_inode() (e.g. claimed
space was not released). Fix it.
Tao Ma [Fri, 17 Oct 2008 04:44:36 +0000 (12:44 +0800)]
ocfs2: Remove unused ocfs2_restore_xattr_block().
Since now ocfs2 supports empty xattr buckets, we will never remove
the xattr index tree even if all the xattrs are removed, so this
function will never be called. So remove it.
Joel Becker [Tue, 21 Oct 2008 01:43:07 +0000 (18:43 -0700)]
ocfs2: Don't repeat ocfs2_xattr_block_find()
ocfs2_xattr_block_get() looks up the xattr in a startlingly familiar
way; it's identical to the function ocfs2_xattr_block_find(). Let's just
use the later in the former.
Joel Becker [Tue, 21 Oct 2008 01:32:48 +0000 (18:32 -0700)]
ocfs2: Specify appropriate journal access for new xattr buckets.
There are a couple places that get an xattr bucket that may be reading
an existing one or may be allocating a new one. They should specify the
correct journal access mode depending.
Joel Becker [Tue, 21 Oct 2008 01:25:56 +0000 (18:25 -0700)]
ocfs2: Check errors from ocfs2_xattr_update_xattr_search()
The ocfs2_xattr_update_xattr_search() function can return an error when
trying to read blocks off of disk. The caller needs to check this error
before using those (possibly invalid) blocks.
Chris Mason [Mon, 10 Nov 2008 17:34:40 +0000 (12:34 -0500)]
Btrfs: Turn off extent state leak debugging
The extent_io.c code has a #define to find and cleanup extent state leaks
on module unmount. This adds a very highly contended spinlock to a
hot path for most FS operations.
Turn it off by default. A later changeset will add a .config option
for it.
Linus Torvalds [Mon, 10 Nov 2008 17:13:37 +0000 (09:13 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: hda - Make the HP EliteBook 8530p use AD1884A model laptop
ALSA: gusextreme: Fix build errors
ALSA: hdsp: check for iobox and upload firmware during ioctl
ALSA: HDSP: check for io box before uploading firmware
ALSA: hda - Add another HP model (6730s) for AD1884A
alsa: fix snd_BUG_on() and friends
ALSA: hda - Add a quirk for MEDION MD96630
ALSA: hda - Limit the number of GPIOs show in proc
Travis Place [Mon, 10 Nov 2008 16:56:23 +0000 (17:56 +0100)]
ALSA: hda - Make the HP EliteBook 8530p use AD1884A model laptop
Added a QUIRK to patch_analog.c for the HP Elitebook 8530p
(IDs 0x103c:0x30e7) to use AD1884A model 'laptop' by default.
Playback and Capture confirmed working.
Chris Mason [Mon, 10 Nov 2008 16:50:50 +0000 (11:50 -0500)]
Btrfs: Use invalidatepage when writepage finds a page outside of i_size
With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size. This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.
Chris Mason [Mon, 10 Nov 2008 16:47:09 +0000 (11:47 -0500)]
Btrfs: Try harder while searching for free space
The loop searching for free space would exit out too soon when
metadata clustering was trying to allocate a large extent. This makes
sure a full scan of the free space is done searching for only the
minimum extent size requested by the higher layers.
Chris Mason [Mon, 10 Nov 2008 16:44:58 +0000 (11:44 -0500)]
Btrfs: Fix use after free during compressed reads
Yan's fix to use the correct file offset during compressed reads used the
extent_map struct pointer after it had been freed. This saves the
fields we want for later use instead.
Matt Fleming [Sun, 2 Nov 2008 22:23:13 +0000 (22:23 +0000)]
x86: HPET: enter hpet_interrupt_handler with interrupts disabled
Some functions that may be called from this handler require that
interrupts are disabled. Also, combining IRQF_DISABLED and
IRQF_SHARED does not reliably disable interrupts in a handler, so
remove IRQF_SHARED from the irq flags (this irq is not shared anyway).
Matt Fleming [Sun, 2 Nov 2008 16:04:20 +0000 (16:04 +0000)]
x86: HPET: read from HPET_Tn_CMP() not HPET_T0_CMP
In hpet_next_event() we check that the value we just wrote to
HPET_Tn_CMP(timer) has reached the chip. Currently, we're checking that
the value we wrote to HPET_Tn_CMP(timer) is in HPET_T0_CMP, which, if
timer is anything other than timer 0, is likely to fail.
Although using block layer tagging is the right direction, due to the
tight coupling among tag number, data structure allocation and
hardware command slot allocation, libata doesn't work correctly with
the current conversion.
The biggest problem is guaranteeing that tag 0 is always used for
non-NCQ commands. Due to the way blk-tag is implemented and how SCSI
starts and finishes requests, such guarantee can't be made. I'm not
sure whether this would actually break any low level driver but it
doesn't look like a good idea to break such assumption given the
frailty of ATA controllers.
So, for the time being, keep using the old dumb in-libata qc
allocation.
Yan Zheng [Mon, 10 Nov 2008 12:34:43 +0000 (07:34 -0500)]
Btrfs: Fix csum error for compressed data
The decompress code doesn't take the logical offset in extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into wrong pages.
The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.
Chris Mason [Mon, 10 Nov 2008 12:31:30 +0000 (07:31 -0500)]
Btrfs: Make sure pages are dirty before doing delalloc for them
This adds a PageDirty check to the writeback path that locks pages
for delalloc. If a page wasn't dirty at this point, it is in the
process of being truncated away.
Chris Mason [Mon, 10 Nov 2008 12:26:33 +0000 (07:26 -0500)]
Btrfs: Don't substract too much from the allocation target (avoid wrapping)
When metadata allocation clustering has to fall back to unclustered
allocs because large free areas could not be found, it was sometimes
substracting too much from the total bytes to allocate. This would
make it wrap below zero.
Paul Mundt [Mon, 10 Nov 2008 11:00:45 +0000 (20:00 +0900)]
sh: Handle fixmap TLB eviction more coherently.
There was a race in the kmap_coherent() implementation. While we
guarded against preemption, there was nothing preventing eviction of
the pre-faulted fixmap entry from the UTLB. Under certain workloads
this would result in the fixmap entries used for cache colouring being
evicted from the UTLB in the midst of a copy_page().
In addition to pre-faulting, we also make sure to preserve the PTEs
in the kernel page table and introduce a cached PTE for kmap_coherent()
usage. This follows a similar change on MIPS ("[MIPS] Fix aliasing bug
in copy_to_user_page / copy_from_user_page").
David Chinner [Thu, 30 Oct 2008 06:40:09 +0000 (17:40 +1100)]
[XFS] XFS: Check for valid transaction headers in recovery
When we are about to add a new item to a transaction in recovery, we need
to check that it is valid first. Currently we just assert that header
magic number matches, but in production systems that is not present and we
add a corrupted transaction to the list to be processed. This results in a
kernel oops later when processing the corrupted transaction.
Instead, if we detect a corrupted transaction, abort recovery and leave
the user to clean up the mess that has occurred.
Dave Chinner [Mon, 10 Nov 2008 05:50:24 +0000 (16:50 +1100)]
[XFS] handle memory allocation failures during log initialisation
When there is no memory left in the system, xfs_buf_get_noaddr()
can fail. If this happens at mount time during xlog_alloc_log()
we fail to catch the error and oops.
Catch the error from xfs_buf_get_noaddr(), and allow other memory
allocations to fail and catch those errors too. Report the error
to the console and fail the mount with ENOMEM.
Tested by manually injecting errors into xfs_buf_get_noaddr() and
xlog_alloc_log().
Version 2:
o remove unnecessary casts of the returned pointer from kmem_zalloc()
David Chinner [Thu, 30 Oct 2008 06:38:12 +0000 (17:38 +1100)]
[XFS] Account for allocated blocks when expanding directories
When we create a directory, we reserve a number of blocks for the maximum
possible expansion of of the directory due to various btree splits,
freespace allocation, etc. Unfortunately, each allocation is not reflected
in the total number of blocks still available to the transaction, so the
maximal reservation is used over and over again.
This leads to problems where an allocation group has only enough blocks
for *some* of the allocations required for the directory modification.
After the first N allocations, the remaining blocks in the allocation
group drops below the total reservation, and subsequent allocations fail
because the allocator will not allow the allocation to proceed if the AG
does not have the enough blocks available for the entire allocation total.
This results in an ENOSPC occurring after an allocation has already
occurred. This results in aborting the directory operation (leaving the
directory in an inconsistent state) and cancelling a dirty transaction,
which results in a filesystem shutdown.
Avoid the problem by reflecting the number of blocks allocated in any
directory expansion in the total number of blocks available to the
modification in progress. This prevents a directory modification from
being aborted part way through with an ENOSPC.
Lachlan McIlroy [Thu, 30 Oct 2008 05:59:06 +0000 (16:59 +1100)]
[XFS] Wait for all I/O on truncate to zero file size
It's possible to have outstanding xfs_ioend_t's queued when the file size
is zero. This can happen in the direct I/O path when a direct I/O write
fails due to ENOSPC. In this case the xfs_ioend_t will still be queued (ie
xfs_end_io_direct() does not know that the I/O failed so can't force the
xfs_ioend_t to be flushed synchronously).
When we truncate a file on unlink we don't know to wait for these
xfs_ioend_ts and we can have a use-after-free situation if the inode is
reclaimed before the xfs_ioend_t is finally processed.
As was suggested by Dave Chinner lets wait for all I/Os to complete when
truncating the file size to zero.
Lachlan McIlroy [Thu, 30 Oct 2008 05:53:25 +0000 (16:53 +1100)]
[XFS] Fix use-after-free with log and quotas
Destroying the quota stuff on unmount can access the log - ie
XFS_QM_DONE() ends up in xfs_dqunlock() which calls
xfs_trans_unlocked_item() and then xfs_log_move_tail(). By this time the
log has already been destroyed. Just move the cleanup of the quota code
earlier in xfs_unmountfs() before the call to xfs_log_unmount(). Moving
XFS_QM_DONE() up near XFS_QM_DQPURGEALL() seems like a good spot.
It's showing up as regressions; disabling it very likely just papers
over an underlying issue, but time is running out for 2.6.28, lets get
back to this for 2.6.29
kbuild: Fixup deb-pkg target to generate separate firmware deb
The below is a simplistic fix for "make deb-pkg"; it splits the
firmware out to a linux-firmware-image package and adds an
(unversioned) Suggests to the linux package for this firmware.
Linus Torvalds [Sun, 9 Nov 2008 20:47:04 +0000 (12:47 -0800)]
Don't ask twice about not including staging drivers
The "Exclude staging drivers" question is there so that we don't build
staging drivers for allyesconfig or allnoconfig settings, but it's very
irritating when you've already said "no" to staging drivers earlier.
There is absolutely no point in declining twice - once you've declined
the staging drivers, you're done.
So make the second question depend on the first question having been
answered in the affirmative.
Linus Torvalds [Sun, 9 Nov 2008 20:20:56 +0000 (12:20 -0800)]
Merge branch 'cpus4096' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
cpumask: introduce new API, without changing anything, v3
cpumask: new API, v2
cpumask: introduce new API, without changing anything
Doug Nazar [Wed, 5 Nov 2008 11:16:28 +0000 (06:16 -0500)]
Fix nfsd truncation of readdir results
Commit 8d7c4203 "nfsd: fix failure to set eof in readdir in some
situations" introduced a bug: on a directory in an exported ext3
filesystem with dir_index unset, a READDIR will only return about 250
entries, even if the directory was larger.
Bisected it back to this commit; reverting it fixes the problem.
It turns out that in this case ext3 reads a block at a time, then
returns from readdir, which means we can end up with buf.full==0 but
with more entries in the directory still to be read. Before 8d7c4203
(but after c002a6c797 "Optimise NFS readdir hack slightly"), this would
cause us to return the READDIR result immediately, but with the eof bit
unset. That could cause a performance regression (because the client
would need more roundtrips to the server to read the whole directory),
but no loss in correctness, since the cleared eof bit caused the client
to send another readdir. After 8d7c4203, the setting of the eof bit
made this a correctness problem.
So, move nfserr_eof into the loop and remove the buf.full check so that
we loop until buf.used==0. The following seems to do the right thing
and reduces the network traffic since we don't return a READDIR result
until the buffer is full.
Tested on an empty directory & large directory; eof is properly sent and
there are no more short buffers.