Git Repo - linux.git/log

[S390] boot from NSS support

Add support to boot from a named saved segment (NSS).

Signed-off-by: Hongjie Yang <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Support for s390 Pseudo Random Number Generator

Starting with the z9 the CPU Cryptographic Assist Facility comes with
an integrated Pseudo Random Number Generator. The generator creates
random numbers by an algorithm similar to the ANSI X9.17 standard.
The pseudo-random numbers can be accessed via a character device driver
node called /dev/prandom. Similar to /dev/urandom any amount of bytes
can be read from the device without blocking.

Signed-off-by: Jan Glauber <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] ETR support.

This patch adds support for clock synchronization to an external time
reference (ETR). The external time reference sends an oscillator
signal and a synchronization signal every 2^20 microseconds to keep
the TOD clocks of all connected servers in sync. For availability
two ETR units can be connected to a machine. If the clock deviates
for more than the sync-check tolerance all cpus get a machine check
that indicates that the clock is out of sync. For the lovely details
how to get the clock back in sync see the code below.

Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] noexec protection

This provides a noexec protection on s390 hardware. Our hardware does
not have any bits left in the pte for a hw noexec bit, so this is a
different approach using shadow page tables and a special addressing
mode that allows separate address spaces for code and data.

As a special feature of our "secondary-space" addressing mode, separate
page tables can be specified for the translation of data addresses
(storage operands) and instruction addresses. The shadow page table is
used for the instruction addresses and the standard page table for the
data addresses.
The shadow page table is linked to the standard page table by a pointer
in page->lru.next of the struct page corresponding to the page that
contains the standard page table (since page->private is not really
private with the pte_lock and the page table pages are not in the LRU
list).
Depending on the software bits of a pte, it is either inserted into
both page tables or just into the standard (data) page table. Pages of
a vma that does not have the VM_EXEC bit set get mapped only in the
data address space. Any try to execute code on such a page will cause a
page translation exception. The standard reaction to this is a SIGSEGV
with two exceptions: the two system call opcodes 0x0a77 (sys_sigreturn)
and 0x0aad (sys_rt_sigreturn) are allowed. They are stored by the
kernel to the signal stack frame. Unfortunately, the signal return
mechanism cannot be modified to use an SA_RESTORER because the
exception unwinding code depends on the system call opcode stored
behind the signal stack frame.

This feature requires that user space is executed in secondary-space
mode and the kernel in home-space mode, which means that the addressing
modes need to be switched and that the noexec protection only works
for user space.
After switching the addressing modes, we cannot use the mvcp/mvcs
instructions anymore to copy between kernel and user space. A new
mvcos instruction has been added to the z9 EC/BC hardware which allows
to copy between arbitrary address spaces, but on older hardware the
page tables need to be walked manually.

Signed-off-by: Gerald Schaefer <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] move crypto options and some cleanup.

This patch moves the config options for the s390 crypto instructions
to the standard "Hardware crypto devices" menu. In addition some
cleanup has been done: use a flag for supported keylengths, add a
warning about machien limitation, return ENOTSUPP in case the
hardware has no support, remove superfluous printks and update
email addresses.

Signed-off-by: Jan Glauber <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: Don't spam debug feature.

Lower priority of "Blacklisted device detected" messages so we don't
overwrite more useful messages.

Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Cleanup of CHSC event handling.

Change CHSC event handling to be more easily extensible.

Signed-off-by: Peter Oberparleiter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: declare hardware structures packed.

Signed-off-by: Peter Oberparleiter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Add set_fs(USER_DS) to start_thread().

Currently works anyway since search_binary_handler has a
set_fs(USER_DS). But start_thread() is the place where this should be
done. Following all other architectures...

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: Catch operand exceptions on stsch.

If we have a subchannel id which has been generated via
for_each_subchannel(), it might contain an invalid subchannel set id.
We need to catch the ensuing operand exception by using stsch_err()
instead of stsch() in all possible cases.

Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Fix register usage description.

Fix description of register usage as pointed out by Andreas Krebbel.
Since this document is completely outdated and would need a lot of
fixing, it might be worth considering to get rid of it...

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] kretprobe_trampoline_holder() in wrong section.

kretprobe_trampoline_holder() is in kprobes section but used to
register a kprobe in arch_init_kprobes(). Hence register_kprobe()
and therefore arch_init_kprobes() will fail.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Fix kprobes breakpoint handling.

In case of an illegal op the die notifier gets called with DIE_TRAP
instead of DIE_BPT first.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Update maintainers file.

Use the new [email protected] mailing list instead of
[email protected].

Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] dasd: fix unconditional reserve handling.

The reserve/release IOCTLs sometimes do not work. If second system
does a 'steal lock' the pending unit check (Format 3 Msg F) is
delivered. Since ERP is disabled for reserve/release, the IOCTL call
fails. We have to allow basic ERP (retries) for reserve/release IOCTLs.

Signed-off-by: Horst Hummel <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Remove dasd_ccw_log function.

Logging of relevant information is already done by disciplines
dump_sense function.

Signed-off-by: Horst Hummel <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Small barrier() and cpu_relax() cleanup.

cpu_relax() has barrier() semantics hence there is no need to use both
of them in conjunction in sclp_sync_wait(). Also change cpu_relax()
so it's more obvious that it has barrier semantics.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: Use device_{create,remove}_bin_file.

Create/remove the channel measurement binary files with
device_{create,remove}_bin_file instead of sysfs_{create,remove}_bin_file.

Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] sclp: don't call local_bh_disable/_local_bh_enable if in_interrupt()

local_bh_disable/_local_bh_enable must not be called if in_irq() is
true. Besides that if in_interrupt() is true bottom halves are
disabled anyway.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Show loaded DCSS segments under /proc/iomem.

Currently loaded DCSS segments are now listed in /proc/iomem with
their name followed by a trailing "(DCSS)".

Signed-off-by: Gerald Schaefer <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: Restart path verification after unsolicited interrupt.

If we try to start path verification when an unsolicited interrupt
is already pending, stctl shows status pending and we delay path
verification again. We need to check for the doverify bit when the
unsolicited interrupt comes in and then do path verification.

Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Fix FCP dump feature detection.

FCP dump feature detection works only if the sclp command in head.S
was succesful. Since the sclp command is skipped if diag260 works,
we don't have any dump feature detection anymore.
Bug was introduced with d57de5a36791cb1b7285649c62f183b0d3505f7d.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] dasd: fix bug in dasd initialization cleanup

The initialization of the dasd_eer code is one of the last steps of the
dasd driver initialization. When initialization fails in one of the
earlier steps, the dasd_exit function is called to clean up what has been
done so far. So the dasd_eer_exit function may be called, although the
dasd_eer_init function wasn't called before and dasd_eer_exit tries to
unregister a misc device that wasn't registered, which results in a BUG.

Make sure that dasd_eer_exit can be called without initialization. Use a
dynamically allocated struct miscdevice instead of a static one, so we
only try to unregister the device if it exists and was actually registered.

Signed-off-by: Stefan Weinhuber <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] sclp: invalid handling of temporary 'not operational' status

Requests are aborted when the sclp interface reports 'not operational'
even though they may still be active at the sclp, leading to concurrent
writes to request memory by both the kernel and the sclp interface.
Do not abort requests for which the sclp interface reports not
operational status during request retry.

Signed-off-by: Peter Oberparleiter <[email protected]>5A
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Simplify virt_to_phys.

No need to use lrag in 64 bit addressing mode since lra will do the
same.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cio: Remove check for ssd in chpids_show().

Since ssd_info is now available before the subchannel is registered,
we don't need to check whether it is available.

Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] cpcmd with vmalloc addresses.

Change the bounce buffer logic of cpcmd. diag8 needs _real_ memory below
2GB. Therefore vmalloced data does not work. As the data might cross a
page boundary, we cannot use virt_to_page either. The solution is to use
virt_to_page only in the check for a bounce buffer.

There was a redundant check for response==NULL. response < 2GB contains
this check as well.

I also removed the rlen==0 check, since rlen=0 and response!=NULL would
be a caller bug and response==NULL is already checked.

Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Remove pointless/unreliable kernel messages.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Check the return value of kthread_run().

Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Get rid of a lot of sparse warnings.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[S390] Move init_irq_proc to the other irq related functions.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>

[IA64] add newline to PAL-code warning message

Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[IA64] kexec: Remove inline declaration of efi_get_pal_addr()

Remove the Remove inline declaration of efi_get_pal_addr() as it is
declared in linux/efi.h.

Signed-Off-By: Simon Horman <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[IA64] kexec: Minor enhancement to includes in crash.c

linux/uaccess.h was being included, but it seems that
really the following includes are needed.

asm/page.h: for __va() and PAGE_SHIFT
asm/uaccess.h: for copy_to_user()

I guess that linux/uaccess.h pulls in both asm/page.h and asm/uaccess.h.
I notices this while backporting the code to xen's linux-2.6.16.33,
which does not have linux/uaccess.h. I'm posting it as I think it is a
correct, though somewhat cosmetic fix.

Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[IA64] kexec: typo in the saved_max_pfn description in contig.c

Fix a typo in the saved_max_pfn description in contig.c

Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[IA64] Zero size /proc/vmcore on ia64

Set saved_max_pfn when discontig memory is in use.

This sets up saved_max_pfn when disctontig memory is in use.
This mirrors the code for contig memory.

This patch does not entirely solve the problem of making vmcore work,
however it does appear to be neccessary. Please consider applying.

Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[IA64] kexec: Fix CONFIG_SMP=n compilation

Kexec support for 2.6.20 on ia64 does not build properly using a config
made up by CONFIG_SMP=n and CONFIG_HOTPLUG_CPU=n:

Signed-off-by: Magnus Damm <[email protected]>
Acked-by: Simon Horman <[email protected]>
Acked-by: Jay Lan <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

[DLM] fix softlockup in dlm_recv

This patch stops the dlm_recv workqueue from busy-waiting when a node
disconnects. This can cause soft lockup errors on debug systems and bad
performance generally.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] zero new user lvbs

A new lvb for a userland lock wasn't being initialized to zero.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM/GFS2] indent help text

Indent help text as expected.

Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix unlink deadlocks

Move the glock acquisition to outside of the transactions.

Lock odering must be preserved in order to prevent ABBA
deadlocks. The current gfs2_change_nlink code would tries
to grab the glock after having started a transaction and thus is holding
the log lock. This is inconsistent with other code paths in
gfs that grab the resource group glock prior to staring
a tranactions.

One problem with this fix is that the resource group
lock is always grabbed now even if the inode still has
ref count and can not be marked for unlink.

Signed-off-by: Russell Cattelan <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Put back semaphore to avoid umount problem

Dave Teigland fixed this bug a while back, but I managed to mistakenly
remove the semaphore during later development. It is required to avoid
the list of inodes changing during an invalidate_inodes call. I have
made it an rwsem since the read side will be taken frequently during
normal filesystem operation. The write site will only happen during
umount of the file system.

Also the bug only triggers when using the DLM lock manager and only then
under certain conditions as its timing related.

Signed-off-by: Steven Whitehouse <[email protected]>
Cc: David Teigland <[email protected]>

[GFS2] more CURRENT_TIME_SEC

Whoops, quilt user error, missed this one in the previous patch.

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2/DLM] fix GFS2 circular dependency

On Sun, Jan 28, 2007 at 11:08:18AM +0100, Jiri Slaby wrote:
> Andrew Morton napsal(a):
> >Temporarily at
> >
> > http://userweb.kernel.org/~akpm/2.6.20-rc6-mm1/
>
> Unable to select IPV6. Menuconfig doesn't offer it when INET is selected.
> When it's not it appears in the menu, but after state change it gets away.
> The same behaviour in xconfig, gconfig.
>
> $ mkdir ../a/tst
> $ make O=../a/tst menuconfig
>   HOSTCC  scripts/basic/fixdep
> [...]
>   HOSTLD  scripts/kconfig/mconf
> scripts/kconfig/mconf arch/i386/Kconfig
> Warning! Found recursive dependency: INET GFS2_FS_LOCKING_DLM SYSFS
> OCFS2_FS INET
>
> Maybe this is the problem?

Yes, patch below.

> regards,

cu
Adrian

<--  snip  -->

This patch fixes a circular dependency by letting GFS2_FS_LOCKING_DLM
and DLM depend on instead of select SYSFS.

Since SYSFS depends on EMBEDDED this change shouldn't cause any problems
for users.

Signed-off-by: Adrian Bunk <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2/DLM] use sysfs

With CONFIG_DLM=m, CONFIG_PROC_FS=n, and CONFIG_SYSFS=n, kernel build
fails with:

WARNING: "kernel_subsys" [fs/gfs2/locking/dlm/lock_dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/dlm/dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/configfs/configfs.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2

Since fs/dlm/lockspace.c and fs/gfs2/locking/dlm/sysfs.c use
kernel_subsys, they should either DEPEND on it or SELECT it.

Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] make lock_dlm drop_count tunable in sysfs

We want to be able to change or disable the default drop_count (number at
which the dlm asks gfs to limit the the number of locks it's holding).
Add it to the collection of sysfs tunables for an fs.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] increase default lock limit

Increase the number of locks at which point the dlm begins asking gfs to
reduce its lock usage. The default value is largely arbitrary, but the
current value of 50,000 ends up limiting performance unnecessarily for too
many users.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix list corruption in lops.c

The patch below appears to fix the list corruption that we are seeing on
occasion. Although the transaction structure is private to a single
thread, when the queued structures are dismantled during an in-core
commit, its possible for a different thread to be trying to add the same
structure to another, new, transaction at the same time.

To avoid this, this patch takes the log spinlock during this operation.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix recursive locking attempt with NFS

In certain cases, its possible for NFS to call the lookup code while
holding the glock (when doing a readdirplus operation) so we need to
check for that and not try and lock the glock twice. This also fixes a
typo in a previous NFS related GFS2 patch.

Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] can miss clearing resend flag

A long, complicated sequence of events, beginning with the RESEND flag not
being cleared on an lkb, can result in an unlock never completing.

- lkb on waiters list for remote lookup
- the remote node is both the dir node and the master node, so
  it optimizes the lookup into a request and sends a request
  reply back
- the request reply is saved on the requestqueue to be processed
  after recovery
- recovery runs dlm_recover_waiters_pre() which sets RESEND flag
  so the lookup will be resent after recovery
- end of recovery: process_requestqueue takes saved request reply
  which removes the lkb off the waitesr list, _without_ clearing
  the RESEND flag
- end of recovery: dlm_recover_waiters_post() doesn't do anything
  with the now completed lookup lkb (would usually clear RESEND)
- later, the node unmounts, unlocks this lkb that still has RESEND
  flag set
- the lkb is on the waiters list again, now for unlock, when recovery
  occurs, dlm_recover_waiters_pre() shows the lkb for unlock with RESEND
  set, doesn't do anything since the master still exists
- end of recovery: dlm_recover_waiters_post() takes this lkb off
  the waiters list because it has the RESEND flag set, then reports
  an error because unlocks are never supposed to be handled in
  recover_waiters_post().
- later, the unlock reply is received, doesn't find the lkb on
  the waiters list because recover_waiters_post() has wrongly
  removed it.
- the unlock operation has been lost, and we're left with a
  stray granted lock
- unmount spins waiting for the unlock to complete

The visible evidence of this problem will be a node where gfs umount is
spinning, the dlm waiters list will be empty, and the dlm locks list will
show a granted lock.

The fix is simply to clear the RESEND flag when taking an lkb off the
waiters list.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] saved dlm message can be dropped

dlm_receive_message() returns 0 instead of returning 'error'.  What would
happen is that process_requestqueue would take a saved message off the
requestqueue and call receive_message on it.  receive_message would then
see that recovery had been aborted, set error to EINTR, and 'goto out',
expecting that the error would be returned.  Instead, 0 was always
returned, so process_requestqueue would think that the message had been
processed and delete it instead of saving it to process next time.  This
means the message (usually an unlock in my tests) would be lost.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] Make sock_sem into a mutex

Now that there can be multiple dlm_recv threads running we need to prevent two
recvs running for the same connection - it's unlikely but it can happen and it
causes message corruption.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix typo in glock.c

This is a one letter typo fix in glock.c, spotted by Rob Kenna.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] use CURRENT_TIME_SEC instead of get_seconds in gfs2

I was looking something else up and came across this...

I don't honestly have a good reason to change it other than to make it
like every other Linux filesystem in this regard. ;-) It doesn't
functionally change anything, but makes some lines shorter. :)

I'm also curious; why does gfs2 have 64-bits of on-disk timestamps, but
not in timespec_t format, and only stores second resolutions? Seems like
you're halfway to sub-second resolutions already.

I suppose if that gets implemented then all of the below should
instead be CURRENT_TIME not CURRENT_TIME_SEC.

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Compile fix for glock.c

This one liner got missed from the previous patch.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Remove queue_empty() function

This function is not longer required since we do not do recursive
locking in the glock layer. As a result all its callers can be
replaceed with list_empty() calls.

Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix lowcomms receiving

This patch fixes a bug whereby data on a newly accepted connection would be
ignored if it arrived soon after the accept.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Tidy up glops calls

This patch doesn't make any changes to the ordering of the various
operations related to glocking, but it does tidy up the calls to the
glops.c functions to make the structure more obvious.

The two functions: gfs2_glock_xmote_th() and gfs2_glock_drop_th() can be
made static within glock.c since they are called by every set of glock
operations. The xmote_th and drop_th glock operations are then made
conditional upon those two routines existing and called from the
previously mentioned functions in glock.c respectively.

Also it can be seen that the go_sync operation isn't needed since it can
easily be replaced by calls to xmote_bh and drop_bh respectively. This
results in no longer (confusingly) calling back into routines in glock.c
from glops.c and also reducing the glock operations by one member.

Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] lowcomms tidy

This patch removes some redundant fields from the connection structure and adds
some lockdep annotation to remove spurious warnings.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Remove local exclusive glock mode

Here is a patch for GFS2 to remove the local exclusive flag. In
the places it was used, mutex's are always held earlier in the
call path, so it appears redundant in the LM_ST_SHARED case.

Also, the GFS2 holders were setting local exclusive in any case where
the requested lock was LM_ST_EXCLUSIVE. So the other places in the glock
code where the flag was tested have been replaced with tests for the
lock state being LM_ST_EXCLUSIVE in order to ensure the logic is the
same as before (i.e. LM_ST_EXCLUSIVE is always locally exclusive as well
as globally exclusive).

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Remove unused go_callback operation

This is never used, so we might as well remove it.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Remove the "greedy" function from glock.[ch]

The "greedy" code was an attempt to retain glocks for a minimum length
of time when they relate to mmap()ed files. The current implementation
of this feature is not, however, ideal in that it required allocating
memory in order to do this and its overly complicated.

It also misses the mark by ignoring the other I/O operations which are
just as likely to suffer from the same problem. So the plan is to remove
this now and then add the functionality back as part of the glock state
machine at a later date (and thus take into account all the possible
users of this feature)

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Shrink gfs2_inode memory by half

Here is something I spotted (while looking for something entirely
different) the other day.

Rather than using a completion in each and every struct gfs2_holder,
this removes it in favour of hashed wait queues, thus saving a
considerable amount of memory both on the stack (where a number of
gfs2_holder structures are allocated) and in particular in the
gfs2_inode which has 8 gfs2_holder structures embedded within it.

As a result on x86_64 the gfs2_inode shrinks from 2488 bytes to
1912 bytes, a saving of 576 bytes per inode (no thats not a typo!).
In actual practice we get a much better result than that since
now that a gfs2_inode is under the 2048 byte barrier, we get two
per 4k slab page effectively halving the amount of memory required
to store gfs2_inodes.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Remove max_atomic_write tunable

This removes an unused sysfs tunable parameter.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Clean up/speed up readdir

This removes the extra filldir callback which gfs2 was using to
enclose an attempt at readahead for inodes during readdir. The
code was too complicated and also hurts performance badly in the
case that the getdents64/readdir call isn't being followed by
stat() and it wasn't even getting it right all the time when it
was.

As a result, on my test box an "ls" of a directory containing 250000
files fell from about 7mins (freshly mounted, so nothing cached) to
between about 15 to 25 seconds. When the directory content was cached,
the time taken fell from about 3mins to about 4 or 5 seconds.

Interestingly in the cached case, running "ls -l" once reduced the time
taken for subsequent runs of "ls" to about 6 secs even without this
patch. Now it turns out that there was a special case of glocks being
used for prefetching the metadata, but because of the timeouts for these
locks (set to 10 secs) the metadata was being timed out before it was
being used and this the prefetch code was constantly trying to prefetch
the same data over and over.

Calling "ls -l" meant that the inodes were brought into memory and once
the inodes are cached, the glocks are not disposed of until the inodes
are pushed out of the cache, thus extending the lifetime of the glocks,
and thus bringing down the time for subsequent runs of "ls"
considerably.

Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Add writepages for "data=writeback" mounts

It occurred to me that although a gfs2 specific writepages for ordered
writes and journaled data would be tricky, by hooking writepages only
for "data=writeback" mounts we could take advantage of not needing
buffer heads (we don't use them on the read side, nor have we for some
time) and create much larger I/Os for the block layer.

Using blktrace both before and after, its possible to see that for large
I/Os, most of the requests generated through writepages are now 1024
sectors after this patch is applied as opposed to 8 sectors before.

Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix master recovery

If master recovery happens on an rsb in one recovery sequence, then that
sequence is aborted before lock recovery happens, then in the next
sequence, we rely on the previous master recovery (which may now be
invalid due to another node ignoring a lookup result) and go on do to the
lock recovery where we get stuck due to an invalid master value.

recovery cycle begins: master of rsb X has left
nodes A and B send node C an rcom lookup for X to find the new master
C gets lookup from B first, sets B as new master, and sends reply back to B
C gets lookup from A next, and sends reply back to A saying B is master
A gets lookup reply from C and sets B as the new master in the rsb
recovery cycle on A, B and C is aborted to start a new recovery
B gets lookup reply from C and ignores it since there's a new recovery
recovery cycle begins: some other node has joined
B doesn't think it's the master of X so it doesn't rebuild it in the directory
C looks up the master of X, no one is master, so it becomes new master
B looks up the master of X, finds it's C
A believes that B is the master of X, so it sends its lock to B
B sends an error back to A
A resends
this repeats forever, the incorrect master value on A is never corrected

The fix is to do master recovery on an rsb that still has the NEW_MASTER
flag set from an earlier recovery sequence, and therefore didn't complete
lock recovery.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix user unlocking

When a user process exits, we clear all the locks it holds.  There is a
problem, though, with locks that the process had begun unlocking before it
exited.  We couldn't find the lkb's that were in the process of being
unlocked remotely, to flag that they are DEAD.  To solve this, we move
lkb's being unlocked onto a new list in the per-process structure that
tracks what locks the process is holding.  We can then go through this
list to flag the necessary lkb's when clearing locks for a process when it
exits.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] Use workqueues for dlm lowcomms

This patch converts the DLM TCP lowcomms to use workqueues rather than using its
own daemon functions. Simultaneously removing a lot of code and making it more
scalable on multi-processor machines.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] make gfs2_change_nlink_i() static

On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc3-mm1:
>...
> git-gfs2-nmw.patch
>...
> git trees
>...

This patch makes the needlessly globlal gfs2_change_nlink_i() static.

Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] gfs2 knows of directories which it chooses not to display

This is for Red Hat bugzilla bug bz #222302:

Moving a virtual IP from node to node between two NFS-over-GFS2
servers was causing one of the GFS2 servers to become confused and
reference a deleted inode. The problem was due to vfs dentries that did
not reference the gfs2_dops and therefore didn't call the gfs2 revalidate
code to revalidate a dentry after a directory had been deleted & recreated.
This patch is a crosswrite from a RHEL4 bug found in GFS1 as
bz #190756 and it is against the latest -nmw git tree.

Signed-off-by: Robert Peterson <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] expose dlm_config_info fields in configfs

Make the dlm_config_info values readable and writeable via configfs
entries.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] add config entry to enable log_debug

Add a new dlm_config_info field to enable log_debug output and change
log_debug() to use it.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] rename dlm_config_info fields

Add a "ci_" prefix to the fields in the dlm_config_info struct so that we
can use macros to add configfs functions to access them (in a later
patch). No functional changes in this patch, just naming changes.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] change some log_error to log_debug

Some common, non-error messages should use log_debug instead of log_error
so they can be turned off.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix gfs2_rename deadlock

Second round of gfs2_rename lock re-ordering to allow Anaconda adding
root partition on top of gfs2. Previous to this patch the recursive
lock detector in glock.c can be triggered due to attempting to lock
the rgrp twice. This fixes it by checking to see whether the rgrp
is already locked.

This fixes Red Hat bugzilla #221237

Signed-off-by: S. Wendy Cheng <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] BZ 217008 fsfuzzer fix.

Update the quilt header comments to match the
code changes.

Change gfs2_lookup_simple to return an error in the case
of a NULL inode.
The callers of gfs2_lookup_simple do not check for NULL
in the no entry case and such would end up dereferencing a NULL ptr.

This fixes:
http://projects.info-pull.com/mokb/MOKB-15-11-2006.html

Signed-off-by: Russell Cattelan <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix ordering of page disposal vs. glock_dq

In case of unlinked files with dirty pages GFS2 wasn't clearing
the pages in quite the right order. This patch clears the pages
earlier (before the qlock_dq) to avoid the situation that the
release of the glock results in attempting to write back data that
has already been deallocated.

This fixes Red Hat bugzilla: #220117

Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] Fix spin lock already unlocked bug

I just noticed this message when testing some other changes I'd made to
lowcomms (to use workqueues) but the problem seems to be in the current
git trees too. I'm amazed no-one has seen it.

BUG: spinlock already unlocked on CPU#1, dlm_recoverd/16868

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] Fix schedule() calls

I was a little over-enthusiastic turning schedule() calls int cond_sched() when fixing the DLM for Andrew Morton.

These four should really be calls to schedule() or the dlm can busy-wait.

Signed-Off-By: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fix change nlink deadlock

Bugzilla 215088

Fix deadlock in gfs2_change_nlink() while installing RHEL5 into GFS2
partition. The gfs2_rename() apparently needs block allocation for the
new name (into the directory) where it requires rg locks. At the same
time, while updating the nlink count for the replaced file,
gfs2_change_nlink() tries to return the inode meta-data back to resource
group where it needs rg locks too. Our logic doesn't allow process to
acquire these locks recursively by the same process (RHEL installer)
that results a BUG call. This only happens within rename code path and
only if the destination file exists before the rename operation.

Signed-off-by: S. Wendy Cheng <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] Fail over to readpage for stuffed files

This is partially derrived from a patch written by Russell Cattelan.
It fixes a bug where there is a race between readpages and truncate
by ignoring readpages for stuffed files. This is ok because a stuffed
file will never be more than one block (minus sizeof(struct gfs2_dinode))
in size and block size is always less than page size, so we do not lose
anything efficiency-wise by not doing readahead for stuffed files. They
will have already been "read ahead" by the action of reading the inode
in, in the first place.

This is the remaining part of the fix for Red Hat bugzilla #218966
which had not yet made it upstream.

Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Russell Cattelan <[email protected]>

[GFS2] Fix DIO deadlock

This patch fixes Red Hat bugzilla #212627 in which a deadlock occurs
due to trying to take the i_mutex while holding a glock. The correct
locking order is defined as i_mutex -> glock in all cases.

I've left dealing with allocating writes. I know that we need to do
that, but for now this should do the trick. We don't need to take the
i_mutex on write, because the VFS has already taken it for us. On read
we don't need it since the glock is enough protection. The reason that
I've made some of the checks into a separate function is that we'll need
to do the checks again in the allocating write case eventually, so this
is partly in preparation for this. Likewise the return value test of !=
1 might look a bit odd and thats because we'll need a third return value
in case of requiring an allocation.

I've made the change to deferred mode on the glock to ensure flushing
read caches on other nodes. I notice that (using blktrace to look at
whats going on) we appear to do a better job of large I/Os than ext3
after this patch (in terms of not splitting up the I/Os).

Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Wendy Cheng <[email protected]>

[DLM] fs/dlm/lowcomms-tcp.c: remove 2 functions

Remove the following unused functions:

- lowcomms_send_message()
- lowcomms_max_buffer_size()

Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Patrick Caulfield <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix lost flags in stub replies

When the dlm fakes an unlock/cancel reply from a failed node using a stub
message struct, it wasn't setting the flags in the stub message. So, in
the process of receiving the fake message the lkb flags would be updated
and cleared from the zero flags in the message. The problem observed in
tests was the loss of the USER flag which caused the dlm to think a user
lock was a kernel lock and subsequently fail an assertion checking the
validity of the ast/callback field.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix receive_request() lvb copying

LVB's are not sent as part of new requests, but the code receiving the
request was copying data into the lvb anyway. The space in the message
where it mistakenly thought the lvb lived actually contained the resource
name, so it wound up incorrectly copying this name data into the lvb. Fix
is to just create the lvb, not copy junk into it.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix send_args() lvb copying

The send_args() function is used to copy parameters into a message for a
number different message types. Only some of those types are set up
beforehand (in create_message) to include space for sending lvb data.
send_args was wrongly copying the lvb for all message types as long as the
lock had an lvb. This means that the lvb data was being written past the
end of the message into unknown space.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] add version check

Check if we receive a message from another lockspace member running a
version of the dlm with an incompatible inter-node message protocol.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix old rcom messages

A reply to a recovery message will often be received after the relevant
recovery sequence has aborted and the next recovery sequence has begun.
We need to ignore replies to these old messages from the previous
recovery.  There's already a way to do this for synchronous recovery
requests using the rc_id number, but not for async.

Each recovery sequence already has a locally unique sequence number
associated with it.  This patch adds a field to the rcom (recovery
message) structure where this recovery sequence number can be placed,
rc_seq.  When a node sends a reply to a recovery request, it copies the
rc_seq number it received into rc_seq_reply.  When the first node receives
the reply to its recovery message, it will check whether rc_seq_reply
matches the current recovery sequence number, ls_recover_seq, and if not
then it ignores the old reply.

An old, inadequate approach to filtering out old replies (checking if the
current stage of recovery has moved back to the start) has been removed
from two spots.

The protocol version number is changed to reflect the different rcom
structures.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[DLM] fix resend rcom lock

There's a chance the new master of resource hasn't learned it's the new
master before another node sends it a lock during recovery.  The node
sending the lock needs to resend if this happens.

- A sends a master lookup for resource R to C
- B sends a master lookup for resource R to C
- C receives A's lookup, assigns A to be master of R and
  sends a reply back to A
- C receives B's lookup and sends a reply back to B saying
  that A is the master
- B receives lookup reply from C and sends its lock for R to A
- A receives lock from B, doesn't think it's the master of R
  and sends an error back to B
- A receives lookup reply from C and becomes master of R
- B gets error back from A and resends its lock back to A
  (this resending is what this patch does)
- A receives lock from B, it now sees it's the master of R
  and takes the lock

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

[GFS2] don't try to lockfs after shutdown

If an fs has already been shut down, a lockfs callback should do nothing.
An fs that's been shut down can't acquire locks or do anything with
respect to the cluster.

Also, remove FIXME comment in withdraw function. The missing bits of the
withdraw procedure are now all done by user space.

Signed-off-by: David Teigland <[email protected]>
Signed-off-by: Steven Whitehouse <[email protected]>

USB HID: handle multi-interface devices for Apple macbook pro properly

Some HID devices by Apple have both keyboard and mouse interfaces; the
keyboard interface is handled by usbhid, but the mouse (really
touchpad) interface must be handled by the separate 'appletouch'
driver. Using HID_QUIRK_IGNORE will make hiddev ignore both
interfaces, therefore a new quirk flag to ignore only the mouse
interface is required.

Signed-off-by: Soeren Sonnenburg <[email protected]>
Signed-off-by: Sergey Vlasov <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

HID: move away from DEBUG defines in favor of CONFIG_HID_DEBUG

CONFIG_INPUT_DEBUG is non-existent option, so remove anything depending
on it.

Also, as we have new CONFIG_HID_DEBUG, this should be used on places
where ifdef DEBUG was used before.

Suggested by Adrian Bunk.

Signed-off-by: Jiri Kosina <[email protected]>

USB HID: fix bogus comment in hid_get_class_descriptor()

The comment in hid_get_class_descriptor() says a very obvious thing
and is also violating codingstyle. Just remove it.

Signed-off-by: Jiri Kosina <[email protected]>

USB HID: remove hid_find_field_by_usage()

The unused hid_find_field_by_usage() function has been commented out for
a pretty long time. Remove it completely.

Signed-off-by: Jiri Kosina <[email protected]>

HID: API - fix leftovers of hidinput API in USB HID

hidinput_{open,close}() functions do not belong to usbhid, but
to the generic HID layer. Move them, and fix hooks in struct
hid_device, so that now the callbacks are done to transport-specific
_open() functions, but not input_open() functions.

Signed-off-by: Jiri Kosina <[email protected]>

HID: hid debug from hid-debug.h to hid layer

hid-debug.h contains a lot of code, and should not therefore
be a header.

This patch moves the code to generic hid layer as .c source, and
introduces CONFIG_HID_DEBUG to conditionally compile it, instead
of playing with #define DEBUG and including hid-debug.h.

Signed-off-by: Jiri Kosina <[email protected]>

hid: force feedback driver for PantherLord USB/PS2 2in1 Adapter

Add a force feedback driver for PantherLord USB/PS2 2in1 Adapter,
0810:0001. The device identifies itself as "Twin USB Joystick".

Signed-off-by: Anssi Hannula <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

hid: quirk for multi-input devices with unneeded output reports

Add new quirk HID_QUIRK_SKIP_OUTPUT_REPORTS to skip output reports
when enumerating reports on a hid-input device. Add this quirk and
HID_QUIRK_MULTI_INPUT to 0810:0001.

PantherLord Twin USB Joystick, 0810:0001 has separate input reports
for 2 distinct game controllers in the same interface, so it needs
HID_QUIRK_MULTI_INPUT. However, the device also contains one output
report per controller which is used to control the force feedback
function, and we do not want those to appear as separate input
devices as well. The simplest approach seems to be to add a quirk to
skip output reports on 0810:0001, and allow the force feedback
driver to handle those.

Signed-off-by: Anssi Hannula <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

hid: allow force feedback for multi-input devices

Allow hid devices with HID_QUIRK_MULTI_INPUT to have force feedback.
This was previously disabled because there were not any force
feedback drivers for such devices. This will change with my upcoming
patch.

Signed-off-by: Anssi Hannula <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>