Borislav Petkov [Wed, 18 Aug 2010 13:11:35 +0000 (15:11 +0200)]
EDAC, MCE: Adjust DC decoders to F14h
Add a per-family data cache decoders. Since there is a certain overlap
between the different DC MCE signatures, reuse functionality between the
families as far as possible.
Add sysfs injection facilities for testing of the MCE decoding code.
Remove large parts of amd64_edac_dbg.c, as a result, which did only
NB MCE injection anyway and the new injection code supports that
functionality already.
Add an injection module so that MCE decoding code in production kernels
like those in RHEL and SLES can be tested.
Move toplevel sysfs class to the stub and make it available to
non-modularized code too. Add proper refcounting of its users and move
the registration functionality into the reference counting routines.
H. Peter Anvin [Thu, 21 Oct 2010 07:15:00 +0000 (00:15 -0700)]
x86-32, percpu: Correct the ordering of the percpu readmostly section
Checkin c957ef2c59e952803766ddc22e89981ab534606f had inconsistent
ordering of .data..percpu..page_aligned and .data..percpu..readmostly;
the still-broken version affected x86-32 at least.
The page aligned version really must be page aligned...
Remove the BKL usage added in "block: push down BKL into .locked_ioctl".
Virtio-blk doesn't use the BKL for anything, and doesn't implement any
ioctl command by itself, but only uses the generic scsi_cmd_ioctl
which is fine without the BKL.
Amit Shah [Thu, 16 Sep 2010 09:13:09 +0000 (14:43 +0530)]
virtio: console: Disable lseek(2) for port file operations
The ports are char devices; do not have seeking capabilities. Calling
nonseekable_open() from the fops_open() call and setting the llseek fops
pointer to no_llseek ensures an lseek() call from userspace returns
-ESPIPE.
Amit Shah [Thu, 2 Sep 2010 13:17:52 +0000 (18:47 +0530)]
virtio: console: Send SIGIO to processes that request it for host events
A process can request for SIGIO on host connect / disconnect events
using the O_ASYNC file flag using fcntl().
If that's requested, and if the guest-side connection for the port is
open, any host-side open/close events for that port will raise a SIGIO.
The process can then use poll() within the signal handler to find out
which port triggered the signal.
Amit Shah [Thu, 2 Sep 2010 13:08:30 +0000 (18:38 +0530)]
virtio: console: Reference counting portdev structs is not needed
Explain in a comment why there's no need to reference-count the portdev
struct: when a device is yanked out, we can't do anything more with it
anyway so just give up doing anything more with the data or the vqs and
exit cleanly.
Amit Shah [Thu, 2 Sep 2010 13:08:29 +0000 (18:38 +0530)]
virtio: console: Add reference counting for port struct
When a port got hot-unplugged, when a port was open, any file operation
after the unplugging resulted in a crash. This is fixed by ref-counting
the port structure, and releasing it only when the file is closed.
This splits the unplug operation in two parts: first marks the port
as unavailable, removes all the buffers in the vqs and removes the port
from the per-device list of ports. The second stage, invoked when all
references drop to zero, releases the chardev and frees all other memory.
Amit Shah [Thu, 2 Sep 2010 12:50:59 +0000 (18:20 +0530)]
virtio: console: Use cdev_alloc() instead of cdev_init()
This moves to using cdev on the heap instead of it being embedded in the
ports struct. This helps individual refcounting and will allow us to
properly remove cdev structs after hot-unplugs and close operations.
Amit Shah [Thu, 2 Sep 2010 12:50:58 +0000 (18:20 +0530)]
virtio: console: Add a find_port_by_devt() function
To convert to using cdev as a pointer to avoid kref troubles, we have to
use a different method to get to a port from an inode than the current
container_of method.
Add find_port_by_devt() that looks up all portdevs and ports with those
portdevs to find the right port.
Eric Paris [Tue, 19 Oct 2010 22:17:32 +0000 (18:17 -0400)]
secmark: fix config problem when CONFIG_NF_CONNTRACK_SECMARK is not set
When CONFIG_NF_CONNTRACK_SECMARK is not set we accidentally attempt to use
the secmark fielf of struct nf_conn. Problem is when that config isn't set
the field doesn't exist. whoops. Wrap the incorrect usage in the config.
Eric Paris [Wed, 13 Oct 2010 21:50:25 +0000 (17:50 -0400)]
SELinux: allow userspace to read policy back out of the kernel
There is interest in being able to see what the actual policy is that was
loaded into the kernel. The patch creates a new selinuxfs file
/selinux/policy which can be read by userspace. The actual policy that is
loaded into the kernel will be written back out to userspace.
Eric Paris [Wed, 13 Oct 2010 21:50:19 +0000 (17:50 -0400)]
SELinux: drop useless (and incorrect) AVTAB_MAX_SIZE
AVTAB_MAX_SIZE was a define which was supposed to be used in userspace to
define a maximally sized avtab when userspace wasn't sure how big of a table
it needed. It doesn't make sense in the kernel since we always know our table
sizes. The only place it is used we have a more appropiately named define
called AVTAB_MAX_HASH_BUCKETS, use that instead.
Eric Paris [Wed, 13 Oct 2010 21:50:14 +0000 (17:50 -0400)]
SELinux: deterministic ordering of range transition rules
Range transition rules are placed in the hash table in an (almost)
arbitrary order. This patch inserts them in a fixed order to make policy
retrival more predictable.
Eric Paris [Wed, 13 Oct 2010 21:50:02 +0000 (17:50 -0400)]
kernel: rounddown helper function
The roundup() helper function will round a given value up to a multiple of
another given value. aka roundup(11, 7) would give 14 = 7 * 2. This new
function does the opposite. It will round a given number down to the
nearest multiple of the second number: rounddown(11, 7) would give 7.
I need this in some future SELinux code and can carry the macro myself, but
figured I would put it in the core kernel so others might find and use it
if need be.
Eric Paris [Wed, 13 Oct 2010 20:25:00 +0000 (16:25 -0400)]
secmark: export secctx, drop secmark in procfs
The current secmark code exports a secmark= field which just indicates if
there is special labeling on a packet or not. We drop this field as it
isn't particularly useful and instead export a new field secctx= which is
the actual human readable text label.
Eric Paris [Wed, 13 Oct 2010 20:24:54 +0000 (16:24 -0400)]
conntrack: export lsm context rather than internal secid via netlink
The conntrack code can export the internal secid to userspace. These are
dynamic, can change on lsm changes, and have no meaning in userspace. We
should instead be sending lsm contexts to userspace instead. This patch sends
the secctx (rather than secid) to userspace over the netlink socket. We use a
new field CTA_SECCTX and stop using the the old CTA_SECMARK field since it did
not send particularly useful information.
Eric Paris [Wed, 13 Oct 2010 20:24:48 +0000 (16:24 -0400)]
security: secid_to_secctx returns len when data is NULL
With the (long ago) interface change to have the secid_to_secctx functions
do the string allocation instead of having the caller do the allocation we
lost the ability to query the security server for the length of the
upcoming string. The SECMARK code would like to allocate a netlink skb
with enough length to hold the string but it is just too unclean to do the
string allocation twice or to do the allocation the first time and hold
onto the string and slen. This patch adds the ability to call
security_secid_to_secctx() with a NULL data pointer and it will just set
the slen pointer.
Eric Paris [Wed, 13 Oct 2010 20:24:41 +0000 (16:24 -0400)]
secmark: make secmark object handling generic
Right now secmark has lots of direct selinux calls. Use all LSM calls and
remove all SELinux specific knowledge. The only SELinux specific knowledge
we leave is the mode. The only point is to make sure that other LSMs at
least test this generic code before they assume it works. (They may also
have to make changes if they do not represent labels as strings)
Eric Paris [Tue, 12 Oct 2010 15:40:08 +0000 (11:40 -0400)]
secmark: do not return early if there was no error
Commit 4a5a5c73 attempted to pass decent error messages back to userspace for
netfilter errors. In xt_SECMARK.c however the patch screwed up and returned
on 0 (aka no error) early and didn't finish setting up secmark. This results
in a kernel BUG if you use SECMARK.
John Johansen [Sat, 9 Oct 2010 07:47:53 +0000 (00:47 -0700)]
AppArmor: Ensure the size of the copy is < the buffer allocated to hold it
Actually I think in this case the appropriate thing to do is to BUG as there
is currently a case (remove) where the alloc_size needs to be larger than
the copy_size, and if copy_size is ever greater than alloc_size there is
a mistake in the caller code.
KOSAKI Motohiro [Thu, 14 Oct 2010 19:21:18 +0000 (04:21 +0900)]
security: remove unused parameter from security_task_setscheduler()
All security modules shouldn't change sched_param parameter of
security_task_setscheduler(). This is not only meaningless, but also
make a harmful result if caller pass a static variable.
This patch remove policy and sched_param parameter from
security_task_setscheduler() becuase none of security module is
using it.
While the previous change to the selinux Makefile reduced the window
significantly for this failure, it is still possible to see a compile
failure where cpp starts processing selinux files before the auto
generated flask.h file is completed. This is easily reproduced by
adding the following temporary change to expose the issue everytime:
This failure happens because the creation of the object files in the ss
subdir also depends on flask.h. So simply incorporate them into the
parent Makefile, as the ss/Makefile really doesn't do anything unique.
With this change, compiling of all selinux files is dependent on
completion of the header file generation, and this test case with
the "sleep 30" now confirms it is functioning as expected.
Paul Gortmaker [Mon, 9 Aug 2010 21:34:25 +0000 (17:34 -0400)]
selinux: fix parallel compile error
Selinux has an autogenerated file, "flask.h" which is included by
two other selinux files. The current makefile has a single dependency
on the first object file in the selinux-y list, assuming that will get
flask.h generated before anyone looks for it, but that assumption breaks
down in a "make -jN" situation and you get:
selinux/selinuxfs.c:35: fatal error: flask.h: No such file or directory
compilation terminated.
remake[9]: *** [security/selinux/selinuxfs.o] Error 1
Since flask.h is included by security.h which in turn is included
nearly everywhere, make the dependency apply to all of the selinux-y
list of objs.
selinux: fast status update interface (/selinux/status)
This patch provides a new /selinux/status entry which allows applications
read-only mmap(2).
This region reflects selinux_kernel_status structure in kernel space.
struct selinux_kernel_status
{
u32 length; /* length of this structure */
u32 sequence; /* sequence number of seqlock logic */
u32 enforcing; /* current setting of enforcing mode */
u32 policyload; /* times of policy reloaded */
u32 deny_unknown; /* current setting of deny_unknown */
};
When userspace object manager caches access control decisions provided
by SELinux, it needs to invalidate the cache on policy reload and setenforce
to keep consistency.
However, the applications need to check the kernel state for each accesses
on userspace avc, or launch a background worker process.
In heuristic, frequency of invalidation is much less than frequency of
making access control decision, so it is annoying to invoke a system call
to check we don't need to invalidate the userspace cache.
If we can use a background worker thread, it allows to receive invalidation
messages from the kernel. But it requires us an invasive coding toward the
base application in some cases; E.g, when we provide a feature performing
with SELinux as a plugin module, it is unwelcome manner to launch its own
worker thread from the module.
If we could map /selinux/status to process memory space, application can
know updates of selinux status; policy reload or setenforce.
A typical application checks selinux_kernel_status::sequence when it tries
to reference userspace avc. If it was changed from the last time when it
checked userspace avc, it means something was updated in the kernel space.
Then, the application can reset userspace avc or update current enforcing
mode, without any system call invocations.
This sequence number is updated according to the seqlock logic, so we need
to wait for a while if it is odd number.
Tetsuo Handa [Sat, 28 Aug 2010 05:58:44 +0000 (14:58 +0900)]
LSM: Fix security_module_enable() error.
We can set default LSM module to DAC (which means "enable no LSM module").
If default LSM module was set to DAC, security_module_enable() must return 0
unless overridden via boot time parameter.
Sage Weil [Mon, 18 Oct 2010 21:04:31 +0000 (14:04 -0700)]
ceph: do not carry i_lock for readdir from dcache
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work. Since we don't actually need it here anyway, remove
it.
We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.
Dan Carpenter [Mon, 11 Oct 2010 19:15:11 +0000 (21:15 +0200)]
rbd: passing wrong variable to bvec_kunmap_irq()
We should be passing "buf" here insead of "bv". This is tricky because
it's not the same as kmap() and kunmap(). GCC does warn about it if you
compile on i386 with CONFIG_HIGHMEM.
Randy Dunlap [Tue, 28 Sep 2010 16:53:10 +0000 (09:53 -0700)]
ceph: fix debugfs warnings
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:
fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held. Preallocate space without
the lock, and retry if the lock state changes out from underneath us.
Yehuda Sadeh [Thu, 12 Aug 2010 23:11:25 +0000 (16:11 -0700)]
rbd: introduce rados block device (rbd), based on libceph
The rados block device (rbd), based on osdblk, creates a block device
that is backed by objects stored in the Ceph distributed object storage
cluster. Each device consists of a single metadata object and data
striped over many data objects.
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:
- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).
No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.
Allow the messenger to send/receive data in a bio. This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.
We can now have trailing variable sized data for osd
ops. Also osd ops encoding is more modular.
The osd requests creation are being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.
Shaohua Li [Wed, 20 Oct 2010 03:07:03 +0000 (11:07 +0800)]
x86: Spread tlb flush vector between nodes
Currently flush tlb vector allocation is based on below equation:
sender = smp_processor_id() % 8
This isn't optimal, CPUs from different node can have the same vector, this
causes a lot of lock contention. Instead, we can assign the same vectors to
CPUs from the same node, while different node has different vectors. This has
below advantages:
a. if there is lock contention, the lock contention is between CPUs from one
node. This should be much cheaper than the contention between nodes.
b. completely avoid lock contention between nodes. This especially benefits
kswapd, which is the biggest user of tlb flush, since kswapd sets its affinity
to specific node.
In my test, this could reduce > 20% CPU overhead in extreme case.The test
machine has 4 nodes and each node has 16 CPUs. I then bind each node's kswapd
to the first CPU of the node. I run a workload with 4 sequential mmap file
read thread. The files are empty sparse file. This workload will trigger a
lot of page reclaim and tlbflush. The kswapd bind is to easy trigger the
extreme tlb flush lock contention because otherwise kswapd keeps migrating
between CPUs of a node and I can't get stable result. Sure in real workload,
we can't always see so big tlb flush lock contention, but it's possible.
[ hpa: folded in fix from Eric Dumazet to use this_cpu_read() ]
Shaohua Li [Wed, 20 Oct 2010 03:07:02 +0000 (11:07 +0800)]
percpu: Introduce a read-mostly percpu API
Add a new readmostly percpu section and API. This can be used to
avoid dirtying data lines which are generally not written to, which is
especially important for data which may be accessed by processors
other than the one for which the percpu area belongs to.
[ hpa: moved it *after* the page-aligned section, for obvious
reasons. ]
Linus Torvalds [Wed, 20 Oct 2010 20:18:21 +0000 (13:18 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus:
MIPS: O32 compat/N32: Fix to use compat syscall wrappers for AIO syscalls.
MAINTAINERS: Change list for ioc_serial to linux-serial.
SERIAL: ioc3_serial: Return -ENOMEM on memory allocation failure
MIPS: jz4740: Fix Kbuild Platform file.
MIPS: Repair Kbuild make clean breakage.
Amit Shah [Wed, 20 Oct 2010 03:15:43 +0000 (13:45 +1030)]
virtio: console: Don't block entire guest if host doesn't read data
If the host is slow in reading data or doesn't read data at all,
blocking write calls not only blocked the program that called write()
but the entire guest itself.
To overcome this, let's not block till the host signals it has given
back the virtio ring element we passed it. Instead, send the buffer to
the host and return to userspace. This operation then becomes similar
to how non-blocking writes work, so let's use the existing code for this
path as well.
This code change also ensures blocking write calls do get blocked if
there's not enough room in the virtio ring as well as they don't return
-EAGAIN to userspace.
Ilkka Koskinen [Tue, 19 Oct 2010 14:07:31 +0000 (17:07 +0300)]
spi/omap2_mcspi: Verify TX reg is empty after TX only xfer with DMA
In case of TX only with DMA, the driver assumes that the data
has been transferred once DMA callback in invoked. However,
SPI's shift register may still contain data. Thus, the driver
is supposed to verify that the register is empty and the end of
the SPI transfer has been reached.
Jason Wang [Tue, 19 Oct 2010 10:03:27 +0000 (18:03 +0800)]
spi/omap2_mcspi: disable channel after TX_ONLY transfer in PIO mode
In the TX_ONLY transfer, the SPI controller also receives data
simultaneously and saves them in the rx register. After the TX_ONLY
transfer, the rx register will hold the random data received during
the last tx transaction.
If the direct following transfer is RX_ONLY, this random data has the
possibility to affect this transfer like this:
When the SPI controller is changed from TX_ONLY to RX_ONLY,
the random data makes the rx register full immediately and
triggers a dummy write automatically(in SPI RX_ONLY transfers,
we need a dummy write to trigger the first transaction).
So the first data received in the RX_ONLY transfer will be that
random data instead of something meaningful.
We can avoid this by inserting a Disable/Re-enable toggle of the
channel after the TX_ONLY transfer, since it purges the rx register.
Jeremy Kerr [Tue, 6 Jul 2010 10:30:06 +0000 (18:30 +0800)]
arm: return both physical and virtual addresses from addruart
Rather than checking the MMU status in every instance of addruart, do it
once in kernel/debug.S, and change the existing addruart macros to
return both physical and virtual addresses. The main debug code can then
select the appropriate address to use.
This will also allow us to retreive the address of a uart for the MMU
state that we're not current in.
Nicolas Pitre [Fri, 15 Oct 2010 02:37:52 +0000 (22:37 -0400)]
ARM: make struct machine_desc definition coherent with its comment
As mentioned in the comment right at the top, the first four fields
are directly accessed by assembly code in head.S. Move nr_irqs so the
comment is true again.
Robert Richter [Wed, 6 Oct 2010 10:27:54 +0000 (12:27 +0200)]
apic, x86: Use BIOS settings for IBS and MCE threshold interrupt LVT offsets
We want the BIOS to setup the EILVT APIC registers. The offsets
were hardcoded and BIOS settings were overwritten by the OS.
Now, the subsystems for MCE threshold and IBS determine the LVT
offset from the registers the BIOS has setup. If the BIOS setup
is buggy on a family 10h system, a workaround enables IBS. If
the OS determines an invalid register setup, a "[Firmware Bug]:
" error message is reported.
We need this change also for upcomming cpu families.
Robert Richter [Wed, 6 Oct 2010 10:27:53 +0000 (12:27 +0200)]
apic, x86: Check if EILVT APIC registers are available (AMD only)
This patch implements checks for the availability of LVT entries
(APIC500-530) and reserves it if used. The check becomes
necessary since we want to let the BIOS provide the LVT offsets.
The offsets should be determined by the subsystems using it
like those for MCE threshold or IBS. On K8 only offset 0
(APIC500) and MCE interrupts are supported. Beginning with
family 10h at least 4 offsets are available.
Since offsets must be consistent for all cores, we keep track of
the LVT offsets in software and reserve the offset for the same
vector also to be used on other cores. An offset is freed by
setting the entry to APIC_EILVT_MASKED.
If the BIOS is right, there should be no conflicts. Otherwise a
"[Firmware Bug]: ..." error message is generated. However, if
software does not properly determines the offsets, it is not
necessarily a BIOS bug.
PM / Wakeup: Show wakeup sources statistics in debugfs
There may be wakeup sources that aren't associated with any devices
and their statistics information won't be available from sysfs. Also,
for debugging purposes it is convenient to have all of the wakeup
sources statistics available from one place. For these reasons,
introduce new file "wakeup_sources" in debugfs containing those
statistics.