Xiaotian Feng [Thu, 20 Dec 2012 23:05:44 +0000 (15:05 -0800)]
proc: fix inconsistent lock state
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq. Currently, the kernel uses spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.
Use spin_lock_bh/spin_unlock_bh to fix the following lockdep warning.
=================================
[ INFO: inconsistent lock state ]
3.7.0 #36 Not tainted
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
{SOFTIRQ-ON-W} state was registered at:
__lock_acquire+0x8ae/0xca0
lock_acquire+0x199/0x200
_raw_spin_lock+0x41/0x50
proc_alloc_inum+0x4c/0xd0
alloc_mnt_ns+0x49/0xc0
create_mnt_ns+0x25/0x70
mnt_init+0x161/0x1c7
vfs_caches_init+0x107/0x11a
start_kernel+0x348/0x38c
x86_64_start_reservations+0x131/0x136
x86_64_start_kernel+0x103/0x112
irq event stamp: 2993422
hardirqs last enabled at (2993422): _raw_spin_unlock_irqrestore+0x55/0x80
hardirqs last disabled at (2993421): _raw_spin_lock_irqsave+0x29/0x70
softirqs last enabled at (2993394): _local_bh_enable+0x13/0x20
softirqs last disabled at (2993395): call_softirq+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
Guenter Roeck [Thu, 20 Dec 2012 23:05:42 +0000 (15:05 -0800)]
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
Commit 263a523d18bc ("linux/kernel.h: Fix warning seen with W=1 due to
change in DIV_ROUND_CLOSEST") fixes a warning seen with W=1 due to
change in DIV_ROUND_CLOSEST.
Unfortunately, the C compiler converts divide operations with unsigned
divisors to unsigned, even if the dividend is signed and negative (for
example, -10 / 5U = 858993457). The C standard says "If one operand has
unsigned int type, the other operand is converted to unsigned int", so
the compiler is not to blame. As a result, DIV_ROUND_CLOSEST(0, 2U) and
similar operations now return bad values, since the automatic conversion
of expressions such as "0 - 2U/2" to unsigned was not taken into
account.
Fix by checking for the divisor variable type when deciding which
operation to perform. This fixes DIV_ROUND_CLOSEST(0, 2U), but still
returns bad values for negative dividends divided by unsigned divisors.
Mark the latter case as unsupported.
One observed effect of this problem is that the s2c_hwmon driver reports
a value of 4198403 instead of 0 if the ADC reads 0.
Other impact is unpredictable. Problem is seen if the divisor is an
unsigned variable or constant and the dividend is less than (divisor/2).
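The conversion described above can be reproduced in user space. The following is a minimal sketch, assuming a 32-bit unsigned int; the function names are hypothetical and are not part of linux/kernel.h.

```c
#include <limits.h>

/*
 * Signed dividend, unsigned divisor: the dividend is converted to
 * unsigned int before the division is performed.
 */
static unsigned int div_mixed(int dividend, unsigned int divisor)
{
	return dividend / divisor;
}

/*
 * The "x - divisor / 2" rounding step with an unsigned divisor: the
 * whole expression is unsigned, so it wraps around (towards UINT_MAX)
 * instead of going negative.
 */
static unsigned int sub_half_divisor(int x, unsigned int divisor)
{
	return x - divisor / 2;
}
```

With these helpers, div_mixed(-10, 5U) yields 858993457 and sub_half_divisor(0, 2U) yields UINT_MAX, which is why a rounding macro that subtracts half the divisor misbehaves once the divisor is unsigned.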
Tejun Heo [Thu, 20 Dec 2012 23:05:40 +0000 (15:05 -0800)]
memcg: don't register hotcpu notifier from ->css_alloc()
Commit 648bb56d076b ("cgroup: lock cgroup_mutex in cgroup_init_subsys()")
made cgroup_init_subsys() grab cgroup_mutex before invoking
->css_alloc() for the root css. Because memcg registers hotcpu notifier
from ->css_alloc() for the root css, this introduced circular locking
dependency between cgroup_mutex and cpu hotplug.
Fix it by moving hotcpu notifier registration to a subsys initcall.
======================================================
[ INFO: possible circular locking dependency detected ]
3.7.0-rc4-work+ #42 Not tainted
-------------------------------------------------------
bash/645 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110c5b7>] cgroup_lock+0x17/0x20
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
We already perform the ida_simple_remove() in rtc_device_release(),
which is an appropriate place. Commit 2830a6d20 ("rtc: recycle id when
unloading a rtc driver") caused the kernel to emit
ida_remove called for id=0 which is not allocated.
warnings when rtc_device_release() tries to release an already-released
ID.
Let's restore things to their previous state and then work out why
Vincent's kernel wasn't calling rtc_device_release() - presumably a bug
in a specific sub-driver.
hfsplus: rework processing of hfs_btree_write() returned error
Add to hfs_btree_write() a return of -EIO on failure of b-tree node
searching. Also add logic for processing errors from hfs_btree_write()
in hfsplus_system_write_inode() with a message about b-tree writing
failure.
Alan Cox [Thu, 20 Dec 2012 23:05:24 +0000 (15:05 -0800)]
hfsplus: avoid crash on failed block map free
If the read fails we kmap an error code. This doesn't end well. Instead
print a critical error and pray. This mirrors the rest of the fs
behaviour with critical error cases.
Kees Cook [Thu, 20 Dec 2012 23:05:16 +0000 (15:05 -0800)]
exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remain
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
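The allocate-on-change rule can be modelled in user space. This is only a sketch under stated assumptions: struct exec_params, change_interp(), release_interp() and script_handler() are hypothetical stand-ins, not the kernel's structures or helpers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the two fields that matter here. */
struct exec_params {
	const char *filename;	/* owned by the caller */
	const char *interp;	/* defaults to filename, no allocation */
};

static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Replace interp with a heap copy so it never points into a handler's
 * stack frame. Returns 0 on success, -1 if the allocation fails.
 */
static int change_interp(struct exec_params *p, const char *interp)
{
	char *copy = dup_string(interp);

	if (!copy)
		return -1;
	/* Free an earlier copy, but never the default value. */
	if (p->interp != p->filename)
		free((void *)p->interp);
	p->interp = copy;
	return 0;
}

/* Drop the copy, if one was made, once the exec attempt is over. */
static void release_interp(struct exec_params *p)
{
	if (p->interp != p->filename)
		free((void *)p->interp);
	p->interp = p->filename;
}

/*
 * A script-style handler: the interpreter path sits in a local buffer
 * that is gone once the handler returns, so only the copy may escape.
 */
static int script_handler(struct exec_params *p)
{
	char local[32];

	strcpy(local, "/bin/sh");
	return change_interp(p, local);
}
```

Because interp only ever holds either the default filename or a heap copy, a restarted handler cannot observe a pointer into a dead stack frame, and the common path pays no allocation.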
From SMBIOS 2.6 on, the spec uses little-endian encoding for UUIDs rather
than network byte order.
So we need to get the DMI version to distinguish the two cases. If the
version is 0.0, the real version is taken from the SMBIOS version. This is
part of the original kernel comment in the code.
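The effect of the version check on UUID formatting can be sketched in user space. The assumptions here are that the version is packed as (major << 8) | minor and that only the first three UUID fields are byte-swapped; format_dmi_uuid() is a hypothetical name, not a kernel function.

```c
#include <stdio.h>

/*
 * Format a 16-byte DMI UUID into out (at least 37 bytes).
 * version is assumed to be (major << 8) | minor, e.g. 0x0206 for 2.6.
 */
static void format_dmi_uuid(const unsigned char *d, unsigned int version,
			    char *out)
{
	/* Display order when the first three fields are little-endian. */
	static const int le_order[16] = {
		3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15
	};
	unsigned char b[16];
	int i;

	for (i = 0; i < 16; i++)
		b[i] = version >= 0x0206 ? d[le_order[i]] : d[i];

	snprintf(out, 37,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x-%02x%02x%02x%02x%02x%02x",
		 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
		 b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}
```

The same 16 raw bytes therefore print differently depending on the reported version, which is why the version has to be known before the UUID is decoded.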
Remove the documentation for capability.disable. The code supporting
this parameter was removed with commit 5915eb53861c ("security: remove
dummy module")
Sonny Rao [Thu, 20 Dec 2012 23:05:07 +0000 (15:05 -0800)]
mm: fix calculation of dirtyable memory
The system uses global_dirtyable_memory() to calculate the number of
dirtyable pages/pages that can be allocated to the page cache. A bug
causes an underflow, making the page count look like a big unsigned
number. This in turn confuses the dirty writeback throttling into
aggressively writing back pages as they become dirty (usually 1 page at a
time). This generally only affects systems with highmem because the
underflowed count gets subtracted from the global count of dirtyable
memory.
Minchan Kim [Thu, 20 Dec 2012 23:05:06 +0000 (15:05 -0800)]
compaction: fix build error in CMA && !COMPACTION
isolate_freepages_block() and isolate_migratepages_range() are used for
CMA as well as compaction so it breaks build for CONFIG_CMA &&
!CONFIG_COMPACTION.
Jeff Layton [Tue, 11 Dec 2012 17:10:10 +0000 (12:10 -0500)]
vfs: fix renameat to retry on ESTALE errors
...as always, rename is the messiest of the bunch. We have to track
whether to retry or not via a separate flag since the error handling
is already quite complex.
Jeff Layton [Thu, 20 Dec 2012 19:59:40 +0000 (14:59 -0500)]
vfs: add a retry_estale helper function to handle retries on ESTALE
This function is expected to be called from path-based syscalls to help
them decide whether to try the lookup and call again in the event that
they got an -ESTALE return back on an earlier try.
Currently, we only retry the call once on an ESTALE error, but in the
event that we decide that that's not enough in the future, we should be
able to change the logic in this helper without too much effort.
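A user-space sketch of the retry-once pattern follows. Everything in it is an assumption made for illustration: the names carry a _demo suffix to mark them as hypothetical, and the flag is simply how this sketch remembers that it has already retried.

```c
#include <errno.h>
#include <stdbool.h>

#define LOOKUP_REVAL_DEMO 0x1	/* hypothetical: set on the second try */

/* Retry only on -ESTALE, and only if we have not retried already. */
static bool retry_estale_demo(long error, unsigned int flags)
{
	return error == -ESTALE && !(flags & LOOKUP_REVAL_DEMO);
}

/* Hypothetical path-based operation taking lookup flags. */
typedef long (*path_op_t)(unsigned int flags, void *ctx);

static long call_with_estale_retry(path_op_t op, void *ctx)
{
	unsigned int flags = 0;
	long err;

retry:
	err = op(flags, ctx);
	if (retry_estale_demo(err, flags)) {
		flags |= LOOKUP_REVAL_DEMO;
		goto retry;
	}
	return err;
}

/* Test double: stale on the first try, fine on the retry. */
static long stale_once(unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return (flags & LOOKUP_REVAL_DEMO) ? 0 : -ESTALE;
}

/* Test double: stale every time, so the caller must give up. */
static long always_stale(unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(void)flags;
	(*calls)++;
	return -ESTALE;
}
```

Keeping the decision in one helper means that a change of policy, such as allowing more than one retry, only touches that helper and not every caller.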
NeilBrown [Fri, 9 Nov 2012 00:09:37 +0000 (16:09 -0800)]
vfs: d_obtain_alias() needs to use "/" as default name.
NFS appears to use d_obtain_alias() to create the root dentry rather than
d_make_root. This can cause 'prepend_path()' to complain that the root
has a weird name if an NFS filesystem is lazily unmounted. e.g. if
"/mnt" is an NFS mount then
{ cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }
will cause a WARN message like
WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
...
Root dentry has weird name <>
to appear in kernel logs.
So change d_obtain_alias() to use "/" rather than "" as the anonymous
name.
David Howells [Fri, 14 Dec 2012 11:02:22 +0000 (11:02 +0000)]
FS-Cache: Clear remaining page count on retrieval cancellation
Provide fscache_cancel_op() with a pointer to a function it should invoke under
lock if it cancels an operation.
Use this to clear the remaining page count upon cancellation of a pending
retrieval operation so that fscache_release_retrieval_op() doesn't get an
assertion failure (see below). This can happen when a signal occurs, say from
CTRL-C being pressed during data retrieval.
David Howells [Thu, 13 Dec 2012 20:03:13 +0000 (20:03 +0000)]
FS-Cache: Mark cancellation of in-progress operation
Mark as cancelled an operation that is in progress rather than pending at the
time it is cancelled, and call fscache_complete_op() to cancel an operation so
that blocked ops can be started.
David Howells [Fri, 7 Dec 2012 10:41:26 +0000 (10:41 +0000)]
FS-Cache: One of the write operation paths doesn't set the object state
In fscache_write_op(), if the object is determined to have become inactive or
to have lost its cookie, we don't move the operation state from in-progress,
and so an assertion in fscache_put_operation() fails (see below).
Instrumenting fscache_op_work_func() indicates that it called
fscache_write_op() before calling fscache_put_operation() - where the assertion
failed. The assertion at line 433 indicates that the operation state is
IN_PROGRESS rather than being COMPLETE or CANCELLED.
Instrumenting fscache_write_op() showed that it was being called on an object
that had had its cookie removed and that this was due to relinquishment of the
cookie by the netfs. At this point fscache no longer has access to the pages
of netfs data that were requested to be written, and so simply cancelling the
operation is the thing to do.
David Howells [Fri, 7 Dec 2012 18:08:02 +0000 (18:08 +0000)]
FS-Cache: Fix signal handling during waits
wait_on_bit() with TASK_INTERRUPTIBLE returns 1 rather than a negative error
code, so change what we check for. This means that the signal handling in
fscache_wait_for_retrieval_activation() should now work properly.
Without this, the following bug can be seen if CTRL-C is pressed during an
fscache read operation:
David Howells [Wed, 5 Dec 2012 13:34:49 +0000 (13:34 +0000)]
FS-Cache: Add transition to handle invalidate immediately after lookup
Add a missing transition to the FS-Cache object state machine to handle an
invalidation event occurring between the back end completing the object lookup
by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE)
and the backend returning to fscache_lookup_object() and thence to
fscache_object_state_machine() which then does a goto lookup_transit to handle
the transition - but lookup_transit doesn't handle EV_INVALIDATE.
Without this, the following BUG can be logged:
FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE
------------[ cut here ]------------
kernel BUG at fs/fscache/object.c:357!
Linus Torvalds [Thu, 20 Dec 2012 22:15:53 +0000 (14:15 -0800)]
Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild changes from Michal Marek:
"The kbuild changes are minimal this time:
- scripts/pnmtologo fix for some newer format
- minor top-level Makefile cleanup
- fix for a v3.5 regression with make clean M=<directory>"
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kbuild: Do not remove vmlinux when cleaning external module
scripts/pnmtologo: fix for plain PBM
kbuild: Remove reference to uninitialised variable
Tony Lindgren [Thu, 20 Dec 2012 19:50:34 +0000 (11:50 -0800)]
ARM: OMAP2+: Trivial fix for IOMMU merge issue
Commit 787314c35fbb ("Merge tag 'iommu-updates-v3.8' of
git://git./linux/kernel/git/joro/iommu") did not account for the changed
header location.
The headers were made local to mach-omap2 as they are specific to omap2+
only, and we wanted to get most of the #include <plat/*.h> headers fixed
up anyways for the ARM multiplatform support.
We attempted to avoid this kind of merge conflict early on by setting up
a minimal git branch shared by the arm-soc tree and the iommu tree, but
looks like we still hit a merge issue there as the branches got merged
as various topic branches.
nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
- even if __GFP_WAIT is set. The reason it doesn't wait is that
fscache_maybe_release_page() might deadlock the allocator as the work threads
writing to the cache may all end up sleeping on memory allocation.
However, I wonder if that is actually a problem. There are a number of things
I can do to deal with this:
(1) Make nfs_migrate_page() wait.
(2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.
(3) Set a timeout around the wait.
(4) Make nfs_migrate_page() return an error if the page is still busy.
David Howells [Wed, 5 Dec 2012 13:34:48 +0000 (13:34 +0000)]
FS-Cache: Exclusive op submission can BUG if there's been an I/O error
The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG
if there's been an I/O error because it may see the parent cache object in an
unexpected state. It should only BUG if there hasn't been an I/O error.
In this case the problem was produced by remounting the cache partition to be
R/O. The EROFS state was detected and the cache was aborted, but not
everything handled the aborting correctly.
SysRq : Emergency Remount R/O
EXT4-fs (sda6): re-mounted. Opts: (null)
Emergency Remount complete
CacheFiles: I/O Error: Failed to update xattr with error -30
FS-Cache: Cache cachefiles stopped due to I/O error
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:128!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
David Howells [Wed, 5 Dec 2012 13:34:48 +0000 (13:34 +0000)]
FS-Cache: Limit the number of I/O error reports for a cache
Limit the number of I/O error reports for a cache to 1 to prevent massive
amounts of noise. After the first I/O error the cache is taken off line
automatically, so must be restarted to resume caching.
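A report-once policy of this kind can be sketched with a test-and-set flag. This is a user-space illustration using C11 atomics; struct demo_cache and demo_io_error() are hypothetical and do not reflect the FS-Cache structures.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_cache {
	atomic_flag io_error;	/* set once the first I/O error is seen */
	int reports;		/* number of reports actually emitted */
};

/* Returns true only for the call that gets to report the error. */
static bool demo_io_error(struct demo_cache *c)
{
	if (atomic_flag_test_and_set(&c->io_error))
		return false;	/* already reported: stay quiet */
	c->reports++;
	return true;
}
```

However many times demo_io_error() is called after the first failure, only one report is counted until the flag is cleared again, which corresponds to restarting the cache.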
David Howells [Wed, 5 Dec 2012 13:34:46 +0000 (13:34 +0000)]
FS-Cache: Convert the object event ID #defines into an enum
Convert the fscache_object event IDs from #defines into an enum. Also add an
extra label to the enum to carry the event count and redefine the event mask
in terms of that.
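The pattern of a trailing count label with a mask derived from it looks roughly like this. The event names below are hypothetical placeholders, not the real fscache_object event IDs.

```c
enum demo_event {
	DEMO_EV_REQUEUE,
	DEMO_EV_UPDATE,
	DEMO_EV_CLEARED,
	DEMO_EV_ERROR,
	NR_DEMO_EVENTS		/* not an event: counts the entries above */
};

/* One bit per event, so the mask tracks the enum automatically. */
#define DEMO_EVENTS_MASK ((1UL << NR_DEMO_EVENTS) - 1)

/* Returns nonzero if events contains only known event bits. */
static int demo_events_valid(unsigned long events)
{
	return (events & ~DEMO_EVENTS_MASK) == 0;
}
```

Adding a new event before NR_DEMO_EVENTS widens the mask with no further edits, which is the benefit over a hand-maintained #define.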
David Howells [Wed, 5 Dec 2012 13:34:45 +0000 (13:34 +0000)]
CacheFiles: Add missing retrieval completions
CacheFiles is missing some calls to fscache_retrieval_complete() in the error
handling/collision paths of its reader functions.
This can be seen by the following assertion tripping in fscache_put_operation()
whereby the operation being destroyed is still in the in-progress state and has
not been cancelled or completed:
David Howells [Thu, 20 Dec 2012 21:52:38 +0000 (21:52 +0000)]
NFS: Use FS-Cache invalidation
Use the new FS-Cache invalidation facility from NFS to deal with foreign
changes being detected on the server rather than attempting to retire the old
cookie and get a new one.
The problem with the old method was that NFS did not wait for all outstanding
storage and retrieval ops on the cache to complete. There was no automatic
wait between the calls to ->readpages() and calls to invalidate_inode_pages2()
as the latter can only wait on locked pages that have been added to the
pagecache (which they haven't yet on entry to ->readpages()).
This was leading to oopses like the one below when an outstanding read got cut
off from its cookie by a premature release.
David Howells [Thu, 20 Dec 2012 21:52:36 +0000 (21:52 +0000)]
CacheFiles: Implement invalidation
Implement invalidation for CacheFiles. This is in two parts:
(1) Provide an invalidation method (which just truncates the backing file).
(2) Abort attempts to copy anything read from the backing file whilst
invalidation is in progress.
Question: CacheFiles uses truncation in a couple of places. It has been using
notify_change() rather than sys_truncate() or something similar. This means
it bypasses a bunch of checks and suchlike that it possibly should be making
(security, file locking, lease breaking, vfsmount write). Should it be using
vfs_truncate() as added by a preceding patch or should it use notify_change()
and assume that anyone poking around in the cache files on disk gets
everything they deserve?
David Howells [Thu, 20 Dec 2012 21:52:36 +0000 (21:52 +0000)]
VFS: Make more complete truncate operation available to CacheFiles
Make a more complete truncate operation available to CacheFiles (including
security checks and suchlike) so that it can use this to clear invalidated
cache files.
- Fixes for some bugs that could be triggered by unusual compounds.
Our xdr code wasn't designed with v4 compounds in mind, and it
shows. A more thorough rewrite is still a todo.
- If you've ever seen "RPC: multiple fragments per record not
supported" logged while using some sort of odd userland NFS client,
that should now be fixed.
- Further work from Jeff Layton on our mechanism for storing
information about NFSv4 clients across reboots.
- Further work from Bryan Schumaker on his fault-injection mechanism
(which allows us to discard selective NFSv4 state, to exercise
rarely-taken recovery code paths in the client.)
- The usual mix of miscellaneous bugs and cleanup.
Thanks to everyone who tested or contributed this cycle."
* 'for-3.8' of git://linux-nfs.org/~bfields/linux: (111 commits)
nfsd4: don't leave freed stateid hashed
nfsd4: free_stateid can use the current stateid
nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
nfsd: warn on odd reply state in nfsd_vfs_read
nfsd4: fix oops on unusual readlike compound
nfsd4: disable zero-copy on non-final read ops
svcrpc: fix some printks
NFSD: Correct the size calculation in fault_inject_write
NFSD: Pass correct buffer size to rpc_ntop
nfsd: pass proper net to nfsd_destroy() from NFSd kthreads
nfsd: simplify service shutdown
nfsd: replace boolean nfsd_up flag by users counter
nfsd: simplify NFSv4 state init and shutdown
nfsd: introduce helpers for generic resources init and shutdown
nfsd: make NFSd service structure allocated per net
nfsd: make NFSd service boot time per-net
nfsd: per-net NFSd up flag introduced
nfsd: move per-net startup code to separated function
nfsd: pass net to __write_ports() and down
nfsd: pass net to nfsd_set_nrthreads()
...
David Howells [Thu, 20 Dec 2012 21:52:36 +0000 (21:52 +0000)]
FS-Cache: Provide proper invalidation
Provide a proper invalidation method rather than relying on the netfs retiring
the cookie it has and getting a new one. The problem with this is that it isn't
easy for the netfs to make sure that it has completed/cancelled all its
outstanding storage and retrieval operations on the cookie it is retiring.
Instead, have the cache provide an invalidation method that will cancel or wait
for all currently outstanding operations before invalidating the cache, and
will cause new operations to queue up behind that. Whilst invalidation is in
progress, some requests will be rejected until the cache can stack a barrier on
the operation queue to cause new operations to be deferred behind it.
Linus Torvalds [Thu, 20 Dec 2012 22:00:13 +0000 (14:00 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
"There are a few different groups of commits here. The largest is
Alex's ongoing work to enable the coming RBD features (cloning,
striping). There is some cleanup in libceph that goes along with it.
Cyril and David have fixed some problems with NFS reexport (leaking
dentries and page locks), and there is a batch of patches from Yan
fixing problems with the fs client when running against a clustered
MDS. There are a few bug fixes mixed in for good measure, many of
which will be going to the stable trees once they're upstream.
My apologies for the late pull. There is still a gremlin in the rbd
map/unmap code and I was hoping to include the fix for that as well,
but we haven't been able to confirm the fix is correct yet; I'll send
that in a separate pull once it's nailed down."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
rbd: get rid of rbd_{get,put}_dev()
libceph: register request before unregister linger
libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
libceph: init event->node in ceph_osdc_create_event()
libceph: init osd->o_node in create_osd()
libceph: report connection fault with warning
libceph: socket can close in any connection state
rbd: don't use ENOTSUPP
rbd: remove linger unconditionally
rbd: get rid of RBD_MAX_SEG_NAME_LEN
libceph: avoid using freed osd in __kick_osd_requests()
ceph: don't reference req after put
rbd: do not allow remove of mounted-on image
libceph: Unlock unprocessed pages in start_read() error path
ceph: call handle_cap_grant() for cap import message
ceph: Fix __ceph_do_pending_vmtruncate
ceph: Don't add dirty inode to dirty list if caps is in migration
ceph: Fix infinite loop in __wake_requests
ceph: Don't update i_max_size when handling non-auth cap
bdi_register: add __printf verification, fix arg mismatch
...
David Howells [Thu, 20 Dec 2012 21:52:35 +0000 (21:52 +0000)]
FS-Cache: Fix operation state management and accounting
Fix the state management of internal fscache operations and the accounting of
what operations are in what states.
This is done by:
(1) Give struct fscache_operation an enum variable that directly represents the
state it's currently in, rather than spreading this knowledge over a bunch
of flags, who's processing the operation at the moment and whether it is
queued or not.
This makes it easier to write assertions to check the state at various
points and to prevent invalid state transitions.
(2) Add an 'operation complete' state and supply a function to indicate the
completion of an operation (fscache_op_complete()) and make things call
it. The final call to fscache_put_operation() can then check that an op
is in the appropriate state (complete or cancelled).
(3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
govern the state of an object:
(a) The ->n_ops is now the number of extant operations on the object
and is now decremented by fscache_put_operation() only.
(b) The ->n_in_progress is simply the number of operations that have been
taken off the object's pending queue for the purposes of being
run. This is decremented by fscache_op_complete() only.
(c) The ->n_exclusive is the number of exclusive ops that have been
submitted and queued or are in progress. It is decremented by
fscache_op_complete() and by fscache_cancel_op().
fscache_put_operation() and fscache_operation_gc() now no longer try to
clean up ->n_exclusive and ->n_in_progress. That was leading to double
decrements against fscache_cancel_op().
fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
double decrements against fscache_put_operation().
fscache_submit_exclusive_op() now decides whether it has to queue an op
based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
will persist in being true even after all preceding operations have been
cancelled or completed. Furthermore, if an object is active and there are
runnable ops against it, there must be at least one op running.
(4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
provide a function to record completion of the pages as they complete.
When n_pages reaches 0, the operation is deemed to be complete and
fscache_op_complete() is called.
Add calls to fscache_retrieval_complete() anywhere we've finished with a
page we've been given to read or allocate for. This includes places where
we just return pages to the netfs for reading from the server and where
accessing the cache fails and we discard the proposed netfs page.
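Points (1), (2) and (4) can be illustrated with a small user-space model. This is a sketch only: the demo_ names, the exact set of states and the struct layout are assumptions, not the FS-Cache definitions.

```c
#include <assert.h>

/* One explicit state per operation instead of scattered flags. */
enum demo_op_state {
	DEMO_OP_INITIALISED,
	DEMO_OP_PENDING,
	DEMO_OP_IN_PROGRESS,
	DEMO_OP_COMPLETE,
	DEMO_OP_CANCELLED
};

struct demo_retrieval {
	enum demo_op_state state;
	unsigned int n_pages;	/* pages not yet dealt with */
};

/* Only an in-progress operation may be completed. */
static void demo_op_complete(struct demo_retrieval *op)
{
	assert(op->state == DEMO_OP_IN_PROGRESS);
	op->state = DEMO_OP_COMPLETE;
}

/* Record that n pages are finished; complete the op on the last one. */
static void demo_retrieval_complete(struct demo_retrieval *op,
				    unsigned int n)
{
	assert(n <= op->n_pages);
	op->n_pages -= n;
	if (op->n_pages == 0)
		demo_op_complete(op);
}

/* The final put may only see a completed or cancelled operation. */
static int demo_put_allowed(const struct demo_retrieval *op)
{
	return op->state == DEMO_OP_COMPLETE ||
	       op->state == DEMO_OP_CANCELLED;
}
```

With a single state variable, an invalid transition such as completing an operation twice trips an assertion at the point of the mistake rather than surfacing later as a miscounted object.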
The bugs in the unfixed state management manifest themselves as oopses like the
following where the operation completion gets out of sync with return of the
cookie by the netfs. This is possible because the cache unlocks and returns
all the netfs pages before recording its completion - which means that there's
nothing to stop the netfs discarding them and returning the cookie.
FS-Cache: Cookie 'NFS.fh' still has outstanding reads
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
CPU 1
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
David Howells [Thu, 20 Dec 2012 21:52:33 +0000 (21:52 +0000)]
FS-Cache: Check that there are no read ops when cookie relinquished
Check that the netfs isn't trying to relinquish a cookie that still has read
operations in progress upon it. If there are, then log a warning and BUG.
David Howells [Thu, 20 Dec 2012 21:52:33 +0000 (21:52 +0000)]
CacheFiles: Downgrade the requirements passed to the allocator
Downgrade the requirements passed to the allocator in the gfp flags parameter.
FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
store an object or a page in the cache.
Linus Torvalds [Thu, 20 Dec 2012 21:57:09 +0000 (13:57 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull two btrfs reverts from Chris Mason:
"I had missed that for two of the patches in my last pull, we had
included different fixes during 3.7."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
Linus Torvalds [Thu, 20 Dec 2012 21:54:51 +0000 (13:54 -0800)]
Merge tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull new F2FS filesystem from Jaegeuk Kim:
"Introduce a new file system, Flash-Friendly File System (F2FS), to
Linux 3.8.
Highlights:
- Add initial f2fs source codes
- Fix an endian conversion bug
- Fix build failures on random configs
- Fix the power-off-recovery routine
- Minor cleanup, coding style, and typos patches"
From the Kconfig help text:
F2FS is based on Log-structured File System (LFS), which supports
versatile "flash-friendly" features. The design has been focused on
addressing the fundamental issues in LFS, which are snowball effect
of wandering tree and high cleaning overhead.
Since flash-based storages show different characteristics according to
the internal geometry or flash memory management schemes aka FTL, F2FS
and tools support various parameters not only for configuring on-disk
layout, but also for selecting allocation and cleaning algorithms.
and there's an article by Neil Brown about it on lwn.net:
http://lwn.net/Articles/518988/
* tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
f2fs: fix tracking parent inode number
f2fs: cleanup the f2fs_bio_alloc routine
f2fs: introduce accessor to retrieve number of dentry slots
f2fs: remove redundant call to f2fs_put_page in delete entry
f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask
f2fs: rewrite f2fs_bio_alloc to make it simpler
f2fs: fix a typo in f2fs documentation
f2fs: remove unused variable
f2fs: move error condition for mkdir at proper place
f2fs: remove unneeded initialization
f2fs: check read only condition before beginning write out
f2fs: remove unneeded memset from init_once
f2fs: show error in case of invalid mount arguments
f2fs: fix the compiler warning for uninitialized use of variable
f2fs: resolve build failures
f2fs: adjust kernel coding style
f2fs: fix endian conversion bugs reported by sparse
f2fs: remove unneeded version.h header file from f2fs.h
f2fs: update the f2fs document
f2fs: update Kconfig and Makefile
...
David Howells [Thu, 20 Dec 2012 21:52:32 +0000 (21:52 +0000)]
CacheFiles: Fix the marking of cached pages
Under some circumstances CacheFiles defers the marking of pages with PG_fscache
so that it can take advantage of pagevecs to reduce the number of calls to
fscache_mark_pages_cached() and the netfs's hook to keep track of this.
There are, however, two problems with this:
(1) It can lead to the PG_fscache mark being applied _after_ the page is set
PG_uptodate and unlocked (by the call to fscache_end_io()).
(2) CacheFiles's ref on the page is dropped immediately following
fscache_end_io() - and so may not still be held when the mark is applied.
This can lead to the page being passed back to the allocator before the
mark is applied.
Fix this by, where appropriate, marking the page before calling
fscache_end_io() and releasing the page. This means that we can't take
advantage of pagevecs and have to make a separate call for each page to the
marking routines.
The symptoms of this are Bad Page state errors cropping up under memory
pressure, for example:
As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.
Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
set appropriately showed that sometimes it wasn't. This led to the discovery
that sometimes the page has apparently been reclaimed by the time the marker
got to see it.
Stephen Boyd [Thu, 20 Dec 2012 07:39:48 +0000 (23:39 -0800)]
lib: atomic64: Initialize locks statically to fix early users
The atomic64 library uses a handful of static spin locks to implement
atomic 64-bit operations on architectures without support for atomic
64-bit instructions.
Unfortunately, the spinlocks are initialized in a pure initcall and that
is too late for the vfs namespace code which wants to use atomic64
operations before the initcall is run.
This became a problem as of commit 8823c079ba71: "vfs: Add setns support
for the mount namespace".