Joe Perches [Thu, 25 Jun 2015 22:03:00 +0000 (15:03 -0700)]
checkpatch: improve output with multiple command-line files
If there are multiple patches/files on the command line,
use a prefix before the patch/file message output like:
--------------
patch/filename
--------------
to make the identifying which messages go with which
file/patch a bit easier to parse.
Move the perl version and false positive messages after
all the files have been scanned so that they are emitted
only once.
Standardize the NOTE: <...> form to always emit a blank
line before the NOTE and always use print << "EOM" style.
Joe Perches [Thu, 25 Jun 2015 22:02:57 +0000 (15:02 -0700)]
checkpatch: categorize some long line length checks
Many lines of code extend beyond the maximum line length. Some of these
are possibly justified by use type.
For instance:
structure definitions where comments are added per member like:
struct foo {
type member; /* some long description */
And lines that don't fit the typical logging message style
where a string constant is used like:
SOME_MACRO(args, "Some long string");
Categorize these long line types so that checkpatch can use a command-line
--ignore=<type> option to avoid emitting some long line warnings.
One of the existing checkpatch exclusions allowed kernel-doc argument
documentation to exceed 80 columns because old versions of kernel-doc
required single line documentation. The requirement was removed in 2009
so remove that exclusion.
Alex Dowad [Thu, 25 Jun 2015 22:02:52 +0000 (15:02 -0700)]
checkpatch: make types found in a source file/patch local
checkpatch uses various cues in its input patches and files to discover
the names of user-defined types and modifiers. It then uses that
information when processing expressions to discover potential style
issues.
Unfortunately, in rare cases, this means that checkpatch may give
different results if you run it on several input files in one execution,
or one by one!
The reason is that it may identify a type (or something that looks like a
type) in one file, and then carry this information over when processing a
different file.
For example, drivers/staging/media/bcm2048/radio-bcm2048.c contains this
line (in a macro):
size value;
and drivers/staging/media/davinci_vpfe/vpfe_video.c has this line:
while (size * *nbuffers > vpfe_dev->video_limit)
If checkpatch processes these 2 files in a single command like:
./scripts/checkpatch.pl -f $file1 $file2
the (spurious) "size" type detected in the first file will cause it to flag
the second file for improper use of the pointer dereference operator.
To fix this, store types and modifiers found in a file in separate arrays
from built-in ones, and reset the arrays of types and modifiers found in
files at the beginning of each new source file.
Joe Perches [Thu, 25 Jun 2015 22:02:46 +0000 (15:02 -0700)]
checkpatch: check for uncommented waitqueue_active()
Linus sayeth:
: Pretty much every single time people use this "if
: (waitqueue_active())" model, it tends to be a bug, because it means
: that there is zero serialization with people who are just about to go
: to sleep. It's fundamentally racy against all the "wait_event()" loops
: that carefully do memory barriers between testing conditions and going
: to sleep, because the memory barriers now don't exist on the waking
: side.
:
: So I'm making a new rule: if you use waitqueue_active(), I want an
: explanation for why it's not racy with the waiter. A big comment about
: the memory ordering, or about higher-level locks that are held by the
: caller, or something.
Yep, that's right: gcc isn't smart enough to realize that replacing '/' by
'_' cannot change the strlen(), so we call it again and again (at least
when a '/' is found). Even if gcc were that smart, this construction
would still loop over the string twice, once for the initial strlen() call
and then the open-coded loop.
Rasmus Villemoes [Thu, 25 Jun 2015 22:02:22 +0000 (15:02 -0700)]
lib/string.c: introduce strreplace()
Strings are sometimes sanitized by replacing a certain character (often
'/') by another (often '!'). In a few places, this is done the same way
Schlemiel the Painter would do it. Others are slightly smarter but still
do multiple strchr() calls. Introduce strreplace() to do this using a
single function call and a single pass over the string.
One would expect the return value to be one of three things: void, s, or
the number of replacements made. I chose the fourth, returning a pointer
to the end of the string. This is more likely to be useful (for example
allowing the caller to avoid a strlen call).
radix-tree: replace preallocated node array with linked list
Currently we use per-cpu array to hold pointers to preallocated nodes.
Let's replace it with linked list. On x86_64 it saves 256 bytes in
per-cpu ELF section which may translate into freeing up 2MB of memory for
NR_CPUS==8192.
Sudeep Holla [Thu, 25 Jun 2015 22:02:17 +0000 (15:02 -0700)]
bitmap: remove explicit newline handling using scnprintf format string
bitmap_print_to_pagebuf uses scnprintf to copy the cpumask/list to page
buffer. It handles the newline and trailing null character explicitly.
It's unnecessary and also partially duplicated as scnprintf already adds
trailing null character. The newline can be passed through format
string to scnprintf. This patch does that simplification.
However theoretically there's one behavior difference: when the buffer
is too small, the original code would still output '\n' at the end while
the new code(with this patch) would just continue to print the formatted
string. Since this function is dealing with only page buffers, it's
highly unlikely to hit that corner case.
This patch will help in auditing the users of bitmap_print_to_pagebuf to
verify that the buffer passed is large enough and get rid of it
completely by replacing them with direct scnprintf()
Daniel Wagner [Thu, 25 Jun 2015 22:02:14 +0000 (15:02 -0700)]
lib/sort: Add 64 bit swap function
In case the call side is not providing a swap function, we either use a
32 bit or a generic swap function. When swapping around pointers on 64
bit architectures falling back to use the generic swap function seems
like an unnecessary waste.
There at least 9 users ('sort' is of difficult to grep for) of sort()
and all of them use the sort function without a customized swap
function. Furthermore, they are all using pointers to swap around:
Obviously, we could improve the swap for other sizes as well
but this is overkill at this point.
A simple test shows sorting a 400 element array (try to stay in one
page) with either with u32_swap() or u64_swap() show that the theory
actually works. This test was done on a x86_64 (Intel Xeon E5-4610)
machine.
Chris Metcalf [Thu, 25 Jun 2015 22:02:08 +0000 (15:02 -0700)]
__bitmap_parselist: fix bug in empty string handling
bitmap_parselist("", &mask, nmaskbits) will erroneously set bit zero in
the mask. The same bug is visible in cpumask_parselist() since it is
layered on top of the bitmask code, e.g. if you boot with "isolcpus=",
you will actually end up with cpu zero isolated.
The bug was introduced in commit 4b060420a596 ("bitmap, irq: add
smp_affinity_list interface to /proc/irq") when bitmap_parselist() was
generalized to support userspace as well as kernelspace.
Some people prefer not to be cc'd on patches. Add an ability to have a
file (.get_maintainer.ignore) with names and email addresses that are
excluded from being listed except when specifically listed as a maintainer
in a section.
Vasily Averin [Thu, 25 Jun 2015 22:01:44 +0000 (15:01 -0700)]
security_syslog() should be called once only
The final version of commit 637241a900cb ("kmsg: honor dmesg_restrict
sysctl on /dev/kmsg") lost few hooks, as result security_syslog() are
processed incorrectly:
- open of /dev/kmsg checks syslog access permissions by using
check_syslog_permissions() where security_syslog() is not called if
dmesg_restrict is set.
- syslog syscall and /proc/kmsg calls do_syslog() where security_syslog
can be executed twice (inside check_syslog_permissions() and then
directly in do_syslog())
With this patch security_syslog() is called once only in all
syslog-related operations regardless of dmesg_restrict value.
Tejun Heo [Thu, 25 Jun 2015 22:01:41 +0000 (15:01 -0700)]
netconsole: implement extended console support
printk logbuf keeps various metadata and optional key=value dictionary for
structured messages, both of which are stripped when messages are handed
to regular console drivers.
It can be useful to have this metadata and dictionary available to
netconsole consumers. This obviously makes logging via netconsole more
complete and the sequence number in particular is useful in environments
where messages may be lost or reordered in transit - e.g. when netconsole
is used to collect messages in a large cluster where packets may have to
travel congested hops to reach the aggregator. The lost and reordered
messages can easily be identified and handled accordingly using the
sequence numbers.
printk recently added extended console support which can be selected by
setting CON_EXTENDED flag. From console driver side, not much changes.
The only difference is that the text passed to the write callback is
formatted the same way as /dev/kmsg.
This patch implements extended console support for netconsole which can be
enabled by either prepending "+" to a netconsole boot param entry or
echoing 1 to "extended" file in configfs. When enabled, netconsole
transmits extended log messages with headers identical to /dev/kmsg
output.
There's one complication due to message fragments. netconsole limits the
maximum message size to 1k and messages longer than that are split into
multiple fragments. As all extended console messages should carry
matching headers and be uniquely identifiable, each extended message
fragment carries full copy of the metadata and an extra header field to
identify the specific fragment. The optional header is of the form
"ncfrag=OFF/LEN" where OFF is the byte offset into the message body and
LEN is the total length.
To avoid unnecessarily making printk format extended messages, Extended
netconsole is registered with printk when the first extended netconsole is
configured.
Tejun Heo [Thu, 25 Jun 2015 22:01:38 +0000 (15:01 -0700)]
netconsole: make all dynamic netconsoles share a mutex
Currently, each dynamic netconsole_target uses its own separate mutex to
synchronize the configuration operations.
This patch replaces the per-netconsole_target mutexes with a single
mutex - dynamic_netconsole_mutex. The reduced granularity doesn't hurt
anything, the code is minutely simpler and this'd allow adding
operations which should be synchronized across all dynamic netconsoles.
Tejun Heo [Thu, 25 Jun 2015 22:01:33 +0000 (15:01 -0700)]
netconsole: remove unnecessary netconsole_target_get/out() from write_msg()
write_msg() grabs target_list_lock and walks target_list invoking
netpool_send_udp() on each target. Curiously, it protects each iteration
with netconsole_target_get/put() even though it never releases
target_list_lock which protects all the members.
While this doesn't harm anything, it doesn't serve any purpose either.
The items on the list can't go away while target_list_lock is held.
Remove the unnecessary get/put pair.
Tejun Heo [Thu, 25 Jun 2015 22:01:30 +0000 (15:01 -0700)]
printk: implement support for extended console drivers
printk log_buf keeps various metadata for each message including its
sequence number and timestamp. The metadata is currently available only
through /dev/kmsg and stripped out before passed onto console drivers. We
want this metadata to be available to console drivers too so that console
consumers can get full information including the metadata and dictionary,
which among other things can be used to detect whether messages got lost
in transit.
This patch implements support for extended console drivers. Consoles can
indicate that they want extended messages by setting the new CON_EXTENDED
flag and they'll be fed messages formatted the same way as /dev/kmsg.
If extended consoles exist, in-kernel fragment assembly is disabled. This
ensures that all messages emitted to consoles have full metadata including
sequence number. The contflag carries enough information to reassemble
the fragments from the reader side trivially. Note that this only affects
/dev/kmsg. Regular console and /proc/kmsg outputs are not affected by
this change.
* Extended message formatting for console drivers is enabled iff there
are registered extended consoles.
* Comment describing /dev/kmsg message format updated to add missing
contflag field and help distinguishing variable from verbatim terms.
Tejun Heo [Thu, 25 Jun 2015 22:01:27 +0000 (15:01 -0700)]
printk: factor out message formatting from devkmsg_read()
The extended message formatting used for /dev/kmsg will be used implement
extended consoles. Factor out msg_print_ext_header() and
msg_print_ext_body() from devkmsg_read().
Tejun Heo [Thu, 25 Jun 2015 22:01:24 +0000 (15:01 -0700)]
printk: guard the amount written per line by devkmsg_read()
This patchset updates netconsole so that it can emit messages with the
same header as used in /dev/kmsg which gives neconsole receiver full log
information which enables things like structured logging and detection
of lost messages.
This patch (of 7):
devkmsg_read() uses 8k buffer and assumes that the formatted output
message won't overrun which seems safe given LOG_LINE_MAX, the current use
of dict and the escaping method being used; however, we're planning to use
devkmsg formatting wider and accounting for the buffer size properly isn't
that complicated.
This patch defines CONSOLE_EXT_LOG_MAX as 8192 and updates devkmsg_read()
so that it limits output accordingly.
The KERN_INFO prefix is being prepended to KERN_DEBUG when using the
dprink macro, Remove it as it is extraneous since we are printing the
message out as debug via dprintk().
Fixes smatch warning:
drivers/misc/altera-stapl/altera.c:2454 altera_init()
warn: KERN_* level not at start of string
Josh Triplett [Thu, 25 Jun 2015 22:01:19 +0000 (15:01 -0700)]
clone: support passing tls argument via C rather than pt_regs magic
clone has some of the quirkiest syscall handling in the kernel, with a
pile of special cases, historical curiosities, and architecture-specific
calling conventions. In particular, clone with CLONE_SETTLS accepts a
parameter "tls" that the C entry point completely ignores and some
assembly entry points overwrite; instead, the low-level arch-specific
code pulls the tls parameter out of the arch-specific register captured
as part of pt_regs on entry to the kernel. That's a massive hack, and
it makes the arch-specific code only work when called via the specific
existing syscall entry points; because of this hack, any new clone-like
system call would have to accept an identical tls argument in exactly
the same arch-specific position, rather than providing a unified system
call entry point across architectures.
The first patch allows architectures to handle the tls argument via
normal C parameter passing, if they opt in by selecting
HAVE_COPY_THREAD_TLS. The second patch makes 32-bit and 64-bit x86 opt
into this.
These two patches came out of the clone4 series, which isn't ready for
this merge window, but these first two cleanup patches were entirely
uncontroversial and have acks. I'd like to go ahead and submit these
two so that other architectures can begin building on top of this and
opting into HAVE_COPY_THREAD_TLS. However, I'm also happy to wait and
send these through the next merge window (along with v3 of clone4) if
anyone would prefer that.
This patch (of 2):
clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread. sys_clone declares an int argument
tls_val in the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass along
that argument. Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes tls
in).
Apart from being awful and inscrutable, that also only works because only
one code path into copy_thread can pass the CLONE_SETTLS flag, and that
code path comes from sys_clone with its architecture-specific
argument-passing order. This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.
However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.
Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
and a new copy_thread_tls that accepts the tls parameter as an additional
unsigned long (syscall-argument-sized) argument. Change sys_clone's tls
argument to an unsigned long (which does not change the ABI), and pass
that down to copy_thread_tls.
Architectures that don't opt into copy_thread_tls will continue to ignore
the C argument to sys_clone in favor of the pt_regs captured at kernel
entry, and thus will be unable to introduce new versions of the clone
syscall.
Patch co-authored by Josh Triplett and Thiago Macieira.
Commit 3876488444e7 ("include/stddef.h: Move offsetofend() from vfio.h
to a generic kernel header") added offsetofend outside the normal
include #ifndef/#endif guard. Move it inside.
Miscellanea:
o remove unnecessary blank line
o standardize offsetof macros whitespace style
Cleanup commit 73679e508201 ("compiler-intel.h: Remove duplicate
definition") removed the double definition of __memory_barrier()
intrinsics.
However, in doing so, it also removed the preceding #undef barrier by
accident, meaning, the actual barrier() macro from compiler-gcc.h with
inline asm is still in place as __GNUC__ is provided.
Subsequently, barrier() can never be defined as __memory_barrier() from
compiler.h since it already has a definition in place and if we trust
the comment in compiler-intel.h, ecc doesn't support gcc specific asm
statements.
I don't have an ecc at hand (unsure if that's still used in the field?)
and only found this by accident during code review, a revert of that
cleanup would be simplest option.
Commit 818411616baf ("fs, proc: introduce /proc/<pid>/task/<tid>/children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.
This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt. The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process. I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.
This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
/proc/<pid>/task/<tid>/children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
Alban tested that /proc/<pid>/task/<tid>/children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE
Alexey Dobriyan [Thu, 25 Jun 2015 22:00:54 +0000 (15:00 -0700)]
proc: fix PAGE_SIZE limit of /proc/$PID/cmdline
/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
However, command line size was never limited to PAGE_SIZE but to 128 KB
and relatively recently limitation was removed altogether.
People noticed and ask questions:
http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
seq file interface is not OK, because it kmalloc's for whole output and
open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To
not do that, limit must be imposed which is incompatible with arbitrary
sized command lines.
I apologize for hairy code, but this it direct consequence of command line
layout in memory and hacks to support things like "init [3]".
The loops are "unrolled" otherwise it is either macros which hide control
flow or functions with 7-8 arguments with equal line count.
There should be real setproctitle(2) or something.
Alexey Dobriyan [Thu, 25 Jun 2015 22:00:51 +0000 (15:00 -0700)]
prctl: more prctl(PR_SET_MM_*) checks
Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
consistent view of mm->arg_start et al fields, but not enough. In
particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/ R_SET_MM_ENV_START/
PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
but don't check that the start address is lower than the end address _at
all_.
Consolidate all consistency checks, so there will be no difference in
the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
The program below makes both ARGV and ENVP areas be reversed. It makes
/proc/$PID/cmdline show garbage (it doesn't oops by luck).
Akinobu Mita [Thu, 25 Jun 2015 22:00:48 +0000 (15:00 -0700)]
avr32: use for_each_sg()
This replaces the plain loop over the sglist array with for_each_sg()
macro which consists of sg_next() function calls. Since avr32 doesn't
select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
order to loop over each sg element. But this can help find problems
with drivers that do not properly initialize their sg tables when
CONFIG_DEBUG_SG is enabled.
Akinobu Mita [Thu, 25 Jun 2015 22:00:46 +0000 (15:00 -0700)]
frv: use for_each_sg()
This replaces the plain loop over the sglist array with for_each_sg()
macro which consists of sg_next() function calls. Since frv doesn't
select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
order to loop over each sg element. But this can help find problems
with drivers that do not properly initialize their sg tables when
CONFIG_DEBUG_SG is enabled.
Dan Streetman [Thu, 25 Jun 2015 22:00:40 +0000 (15:00 -0700)]
zpool: remove zpool_evict()
Remove zpool_evict() helper function. As zbud is currently the only
zpool implementation that supports eviction, add zpool and zpool_ops
references to struct zbud_pool and directly call zpool_ops->evict(zpool,
handle) on eviction.
Currently zpool provides the zpool_evict helper which locks the zpool
list lock and searches through all pools to find the specific one
matching the caller, and call the corresponding zpool_ops->evict
function. However, this is unnecessary, as the zbud pool can simply
keep a reference to the zpool that created it, as well as the zpool_ops,
and directly call the zpool_ops->evict function, when it needs to evict
a page. This avoids a spinlock and list search in zpool for each
eviction.
Dan Streetman [Thu, 25 Jun 2015 22:00:35 +0000 (15:00 -0700)]
zswap: runtime enable/disable
Change the "enabled" parameter to be configurable at runtime. Remove the
enabled check from init(), and move it to the frontswap store() function;
when enabled, pages will be stored, and when disabled, pages won't be
stored.
This is almost identical to Seth's patch from 2 years ago:
http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html
comp_algorithm_store() silently accepts any supplied algorithm name,
because zram performs algorithm availability check later, during the
device configuration phase in disksize_store() and emits the following
error:
"zram: Cannot initialise %s compressing backend"
this error line is somewhat generic and, besides, can indicate a failed
attempt to allocate compression backend's working buffers.
add algorithm availability check to comp_algorithm_store():
Supplied sysfs values sometimes contain new-line symbols (echo vs. echo
-n), which we also copy as a compression algorithm name. it works fine
when we lookup for compression algorithm, because we use sysfs_streq()
which takes care of new line symbols. however, it doesn't look nice when
we print compression algorithm name if zcomp_create() failed:
zram: Cannot initialise LXZ
compressing backend
cut trailing new-line, so the error string will look like
zram: Cannot initialise LXZ compressing backend
we also now can replace sysfs_streq() in zcomp_available_show() with
strcmp().
`bool locked' local variable tells us if we should perform
zcomp_strm_release() or not (jumped to `out' label before
zcomp_strm_find() occurred), which is equivalent to `zstrm' being or not
being NULL. remove `locked' and check `zstrm' instead.
We currently don't support on-demand device creation. The one and only
way to have N zram devices is to specify num_devices module parameter
(default value: 1). IOW if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1. And do
this again, if needed.
This patch introduces zram control sysfs class, which has two sysfs
attrs:
- hot_add -- add a new zram device
- hot_remove -- remove a specific (device_id) zram device
hot_add sysfs attr is read-only and has only automatic device id
assignment mode (as requested by Minchan Kim). read operation performed
on this attr creates a new zram device and returns back its device_id or
error status.
Usage example:
# add a new specific zram device
cat /sys/class/zram-control/hot_add
2
# remove a specific zram device
echo 4 > /sys/class/zram-control/hot_remove
Returning zram_add() error code back to user (-ENOMEM in this case)
Commit ba6b17d68c8e ("zram: fix umount-reset_store-mount race
condition") introduced bdev->bd_mutex to protect a race between mount
and reset. At that time, we don't have dynamic zram-add/remove feature
so it was okay.
However, as we introduce dynamic device feature, bd_mutex became
trouble.
The reason we are holding bd_mutex for zram_remove is to prevent
any incoming open /dev/zram[0-9]. Otherwise, we could remove zram
others already have opened. But it causes above deadlock problem.
To fix the problem, this patch overrides block_device.open and
it returns -EBUSY if zram asserts he claims zram to reset so any
incoming open will be failed so we don't need to hold bd_mutex
for zram_remove ayn more.
This patch is to prepare for zram-add/remove feature.
This patch prepares zram to enable on-demand device creation.
zram_add() performs automatic device_id assignment and returns
new device id (>= 0) or error code (< 0).
With dynamic device creation/removal (which will be introduced later in
the series) printing num_devices in zram_init() will not make a lot of
sense, as well as printing the number of destroyed devices in
destroy_devices(). Print per-device action (added/removed) in zram_add()
and zram_remove() instead.
Limiting the number of zram devices to 32 (default max_num_devices value)
is confusing, let's drop it. A user with 2TB or 4TB of RAM, for example,
can request as many devices as he can handle.
This patch looks big, but basically it just moves code blocks.
No functional changes.
Our current code layout looks like a sandwitch.
For example,
a) between read/write handlers, we have update_used_max() helper function:
static int zram_decompress_page
static int zram_bvec_read
static inline void update_used_max
static int zram_bvec_write
static int zram_bvec_rw
b) RW request handlers __zram_make_request/zram_bio_discard are divided by
sysfs attr reset_store() function and corresponding zram_reset_device()
handler:
c) we first a bunch of sysfs read/store functions. then a number of
one-liners, then helper functions, RW functions, sysfs functions, helper
functions again, and so on.
Reorganize layout to be more logically grouped (a brief description,
`cat zram_drv.c | grep static` gives a bigger picture):
-- one-liners: zram_test_flag/etc.
-- helpers: is_partial_io/update_position/etc
-- sysfs attr show/store functions + ZRAM_ATTR_RO() generated stats
show() functions
exception: reset and disksize store functions are required to be after
meta() functions. because we do device create/destroy actions in these
sysfs handlers.
-- "mm" functions: meta get/put, meta alloc/free, page free
static inline bool zram_meta_get
static inline void zram_meta_put
static void zram_meta_free
static struct zram_meta *zram_meta_alloc
static void zram_free_page
-- a block of I/O functions
static int zram_decompress_page
static int zram_bvec_read
static int zram_bvec_write
static void zram_bio_discard
static int zram_bvec_rw
static void __zram_make_request
static void zram_make_request
static void zram_slot_free_notify
static int zram_rw_page
-- device contol: add/remove/init/reset functions (+zram-control class
will sit here)
static int zram_reset_device
static ssize_t reset_store
static ssize_t disksize_store
static int zram_add
static void zram_remove
static int __init zram_init
static void __exit zram_exit
This patch makes some preparations for on-demand device add/remove
functionality.
Remove `zram_devices' array and switch to id-to-pointer translation (idr).
idr doesn't bloat zram struct with additional members, f.e. list_head,
yet still provides ability to match the device_id with the device pointer.
We currently don't support zram on-demand device creation. The only way
to have N zram devices is to specify num_devices module parameter (default
value 1). That means that if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1.
This patchset introduces zram-control sysfs class, which has two sysfs
attrs:
- hot_add -- add a new zram device
- hot_remove -- remove a specific (device_id) zram device
Usage example:
# add a new specific zram device
cat /sys/class/zram-control/hot_add
1
# remove a specific zram device
echo 4 > /sys/class/zram-control/hot_remove
Dominik Dingel [Thu, 25 Jun 2015 21:59:52 +0000 (14:59 -0700)]
s390/mm: change HPAGE_SHIFT type to int
With making HPAGE_SHIFT an unsigned integer we also accidentally changed
pageblock_order. In order to avoid compiler warnings we make
HPAGE_SHFIT an int again.
Dominik Dingel [Thu, 25 Jun 2015 21:59:47 +0000 (14:59 -0700)]
s390/hugetlb: remove dead code for sw emulated huge pages
We now support only hugepages on hardware with EDAT1 support. So we
remove the prepare/release_hugepage hooks and simplify set_huge_pte_at
and huge_ptep_get.
Dominik Dingel [Thu, 25 Jun 2015 21:59:39 +0000 (14:59 -0700)]
s390/mm: make hugepages_supported a boot time decision
There is a potential bug with KVM and hugetlbfs if the hardware does not
support hugepages (EDAT1). We fix this by making EDAT1 a hard requirement
for hugepages and therefore removing and simplifying code.
As s390, with the sw-emulated hugepages, was the only user of
arch_prepare/release_hugepage I also removed theses calls from common and
other architecture code.
This patch (of 5):
By dropping support for hugepages on machines which do not have the
hardware feature EDAT1, we fix a potential s390 KVM bug.
The bug would happen if a guest is backed by hugetlbfs (not supported
currently), but does not get pagetables with PGSTE. This would lead to
random memory overwrites.
Linus Torvalds [Thu, 25 Jun 2015 23:49:21 +0000 (16:49 -0700)]
Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
- a number of libata core changes to better support NCQ TRIM.
- ahci now supports MSI-X in single IRQ mode to support a new
controller which doesn't implement MSI or INTX.
- ahci now supports edge-triggered IRQ mode to support a new controller
which for some odd reason did edge-triggered IRQ.
- the usual controller support additions and changes.
* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (27 commits)
libata: Do not blacklist Micron M500DC
ata: ahci_mvebu: add suspend/resume support
ahci, msix: Fix build error for !PCI_MSI
ahci: Add support for Cavium's ThunderX host controller
ahci: Add generic MSI-X support for single interrupts to SATA PCI driver
libata: finally use __initconst in ata_parse_force_one()
drivers: ata: add support for Ceva sata host controller
devicetree:bindings: add devicetree bindings for ceva ahci
ahci: added support for Freescale AHCI sata
ahci: Store irq number in struct ahci_host_priv
ahci: Move interrupt enablement code to a separate function
Doc: libata: Fix spelling typo found in libata.xml
ata:sata_nv - Change 1 to true for bool type variable.
ata: add Broadcom AHCI SATA3 driver for STB chips
Documentation: devicetree: add Broadcom SATA binding
libata: Fix regression when the NCQ Send and Receive log page is absent
ata: hpt366: fix constant cast warning
ata: ahci_xgene: potential NULL dereference in probe
ata: ahci_xgene: Add AHCI Support for 2nd HW version of APM X-Gene SoC AHCI SATA Host controller.
libahci: Add support to handle HOST_IRQ_STAT as edge trigger latch.
...
Linus Torvalds [Thu, 25 Jun 2015 23:34:39 +0000 (16:34 -0700)]
Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core cleanups:
* blk-mq request-based DM no longer uses any mempools now that
partial completions are no longer handled as part of cloned
requests
- DM raid cleanups and support for MD raid0
- DM cache core advances and a new stochastic-multi-queue (smq) cache
replacement policy
* smq is the new default dm-cache policy
- DM thinp cleanups and much more efficient large discard support
- DM statistics support for request-based DM and nanosecond resolution
timestamps
- Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
dm stats: add support for request-based DM devices
dm stats: collect and report histogram of IO latencies
dm stats: support precise timestamps
dm stats: fix divide by zero if 'number_of_areas' arg is zero
dm cache: switch the "default" cache replacement policy from mq to smq
dm space map metadata: fix occasional leak of a metadata block on resize
dm thin metadata: fix a race when entering fail mode
dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
dm thin: range discard support
dm thin metadata: add dm_thin_remove_range()
dm thin metadata: add dm_thin_find_mapped_range()
dm btree: add dm_btree_remove_leaves()
dm stats: Use kvfree() in dm_kvfree()
dm cache: age and write back cache entries even without active IO
dm cache: prefix all DMERR and DMINFO messages with cache device name
dm cache: add fail io mode and needs_check flag
dm cache: wake the worker thread every time we free a migration object
dm cache: add stochastic-multi-queue (smq) policy
dm cache: boost promotion of blocks that will be overwritten
dm cache: defer whole cells
...
Linus Torvalds [Thu, 25 Jun 2015 23:00:17 +0000 (16:00 -0700)]
Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
Linus Torvalds [Thu, 25 Jun 2015 22:22:36 +0000 (15:22 -0700)]
Merge branch 'for-4.2/sg' of git://git.kernel.dk/linux-block
Pull asm/scatterlist.h removal from Jens Axboe:
"We don't have any specific arch scatterlist anymore, since parisc
finally switched over. Kill the include"
* 'for-4.2/sg' of git://git.kernel.dk/linux-block:
remove scatterlist.h generation from arch Kbuild files
remove <asm/scatterlist.h>
tracing: Fix typo from "static inlin" to "static inline"
The trace.h header when called without CONFIG_EVENT_TRACING enabled
(seldom done), will not compile because of a typo in the protocol
of trace_event_enum_update().
The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has
The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.
tracing/filter: Do not WARN on operand count going below zero
When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).
That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.
Linus Torvalds [Thu, 25 Jun 2015 21:29:53 +0000 (14:29 -0700)]
Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
optimizations and cleanups, mixed with various fixes. In more detail,
this contains:
- Addition of policy specific data to blkcg for block cgroups. From
Arianna Avanzini.
- Various cleanups around command types from Christoph.
- Cleanup of the suspend block I/O path from Christoph.
- Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
- Eliminating atomic inc/dec of both remaining IO count and reference
count in a bio. From me.
- Fixes for SG gap and chunk size support for data-less (discards)
IO, so we can merge these better. From me.
- Small restructuring of blk-mq shared tag support, freeing drivers
from iterating hardware queues. From Keith Busch.
- A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
IOPS mode the default for non-rotational storage"
* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
cfq-iosched: move group scheduling functions under ifdef
cfq-iosched: fix the setting of IOPS mode on SSDs
blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
block, cgroup: implement policy-specific per-blkcg data
block: Make CFQ default to IOPS mode on SSDs
block: add blk_set_queue_dying() to blkdev.h
blk-mq: Shared tag enhancements
block: don't honor chunk sizes for data-less IO
block: only honor SG gap prevention for merges that contain data
block: fix returnvar.cocci warnings
block, dm: don't copy bios for request clones
block: remove management of bi_remaining when restoring original bi_end_io
block: replace trylock with mutex_lock in blkdev_reread_part()
block: export blkdev_reread_part() and __blkdev_reread_part()
suspend: simplify block I/O handling
block: collapse bio bit space
block: remove unused BIO_RW_BLOCK and BIO_EOF flags
block: remove BIO_EOPNOTSUPP
...
Linus Torvalds [Thu, 25 Jun 2015 21:11:34 +0000 (14:11 -0700)]
Merge tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
"Minor fixes for UBI and UBIFS"
* tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs:
UBI: Remove unnecessary `\'
UBI: Use static class and attribute groups
UBI: add a helper function for updatting on-flash layout volumes
UBI: Fastmap: Do not add vol if it already exists
UBI: Init vol->reserved_pebs by assignment
UBI: Fastmap: Rename variables to make them meaningful
UBI: Fastmap: Remove unnecessary `\'
UBI: Fastmap: Use max() to get the larger value
ubifs: fix to check error code of register_shrinker
UBI: block: Dynamically allocate minor numbers
Linus Torvalds [Thu, 25 Jun 2015 21:06:55 +0000 (14:06 -0700)]
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A very large number of cleanups and bug fixes --- in particular for
the ext4 encryption patches, which is a new feature added in the last
merge window. Also fix a number of long-standing xfstest failures.
(Quota writes failing due to ENOSPC, a race between truncate and
writepage in data=journalled mode that was causing generic/068 to
fail, and other corner cases.)
Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
performance eliminating locking when a buffer is modified more than
once during a transaction (which is very common for allocation
bitmaps, for example), in which case the state of the journalled
buffer head doesn't need to change"
[ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
the merge resolution, to make it clear that that function is _only_
used for encrypted symlinks. The function doesn't actually work for
non-encrypted symlinks at all, and they use the generic helpers
- Linus ]
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
ext4: set lazytime on remount if MS_LAZYTIME is set by mount
ext4: only call ext4_truncate when size <= isize
ext4: make online defrag error reporting consistent
ext4: minor cleanup of ext4_da_reserve_space()
ext4: don't retry file block mapping on bigalloc fs with non-extent file
ext4: prevent ext4_quota_write() from failing due to ENOSPC
ext4: call sync_blockdev() before invalidate_bdev() in put_super()
jbd2: speedup jbd2_journal_dirty_metadata()
jbd2: get rid of open coded allocation retry loop
ext4: improve warning directory handling messages
jbd2: fix ocfs2 corrupt when updating journal superblock fails
ext4: mballoc: avoid 20-argument function call
ext4: wait for existing dio workers in ext4_alloc_file_blocks()
ext4: recalculate journal credits as inode depth changes
jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
ext4: use swap() in mext_page_double_lock()
ext4: use swap() in memswap()
ext4: fix race between truncate and __ext4_journalled_writepage()
ext4 crypto: fail the mount if blocksize != pagesize
ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
...
Pull sparc fixes from David Miller:
"Sparc perf stack traversal fixes from David Ahern"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: perf: Use UREG_FP rather than UREG_I6
sparc64: perf: Add sanity checking on addresses in user stack
sparc64: Convert BUG_ON to warning
sparc: perf: Disable pagefaults while walking userspace stacks
Don't include ptrace uapi stuff in arch headers, it will
pollute the kernel namespace and conflict with existing
stuff.
In this case it fixes clashes with common names like R8.
um: Include sys/types.h for makedev(), major(), minor()
The functions in question are not part of the POSIX standard,
documentation however hints that the corresponding header shall
be sys/types.h. C libraries other than glibc, namely musl, did
not include that header via other ways and complained.
crypto: rsa - add .gitignore for crypto/*.-asn1.[ch] files
There are two generated files: crypto/rsakey-asn1.c and crypto/raskey-asn1.h,
after the cfc2bb32b31371d6bffc6bf2da3548f20ad48c83 commit. Let's add
.gitignore to ignore *-asn1.[ch] files.
Guenter Roeck [Wed, 24 Jun 2015 22:27:01 +0000 (15:27 -0700)]
crypto: asymmetric_keys/rsa - Use non-conflicting variable name
arm64:allmodconfig fails to build as follows.
In file included from include/acpi/platform/aclinux.h:74:0,
from include/acpi/platform/acenv.h:173,
from include/acpi/acpi.h:56,
from include/linux/acpi.h:37,
from ./arch/arm64/include/asm/dma-mapping.h:21,
from include/linux/dma-mapping.h:86,
from include/linux/skbuff.h:34,
from include/crypto/algapi.h:18,
from crypto/asymmetric_keys/rsa.c:16:
include/linux/ctype.h:15:12: error: expected ‘;’, ‘,’ or ‘)’
before numeric constant
#define _X 0x40 /* hex digit */
^
crypto/asymmetric_keys/rsa.c:123:47: note: in expansion of macro ‘_X’
static int RSA_I2OSP(MPI x, size_t xLen, u8 **_X)
^
crypto/asymmetric_keys/rsa.c: In function ‘RSA_verify_signature’:
crypto/asymmetric_keys/rsa.c:256:2: error:
implicit declaration of function ‘RSA_I2OSP’
The problem is caused by an unrelated include file change, resulting in
the inclusion of ctype.h on arm64. This in turn causes the local variable
_X to conflict with macro _X used in ctype.h.
Fixes: b6197b93fa4b ("arm64 : Introduce support for ACPI _CCA object") Cc: Suthikulpanit, Suravee <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
Stephan Mueller [Tue, 23 Jun 2015 14:18:54 +0000 (16:18 +0200)]
crypto: jitterentropy - avoid compiler warnings
The core of the Jitter RNG is intended to be compiled with -O0. To
ensure that the Jitter RNG can be compiled on all architectures,
separate out the RNG core into a stand-alone C file that can be compiled
with -O0 which does not depend on any kernel include file.
As no kernel includes can be used in the C file implementing the core
RNG, any dependencies on kernel code must be extracted.
A second file provides the link to the kernel and the kernel crypto API
that can be compiled with the regular compile options of the kernel.
David S. Miller [Thu, 25 Jun 2015 13:01:10 +0000 (06:01 -0700)]
Merge branch 'sparc-perf-stack'
David Ahern says:
====================
sparc64: perf fixes for userspace stacks
Coming back to the perf userspace callchain problem. As a reminder there are
a series of problems trying to use perf to collect callchains with scheduling
tracepoints, e.g., perf sched record -g -- <cmd>.
The first patch disables pagefaults while walking the user stack. As discussed
a couple of months ago this is the right fix, but I was puzzled as to why
processes were terminating with sigbus (and sometimes sigsegv). I believe the
root of this problem is bad addresses trying to walk the frames using frame
pointers. The bad addresses lead to faults that get handled by do_sparc64_fault
and it aborts the task though I am still puzzled as to why it gets past this
check in do_sparc64_fault:
if (in_atomic() || !mm)
goto intr_or_no_mm;
pagefault_disable bumps the preempt_count which should make in_atomic return != 0
(building kernels with preemption set to voluntar, CONFIG_PREEMPT_VOLUNTARY=y).
While this set does not fully solve the problem it does prevent a number of
pain points with the current code, most notably able to lock up the system.
====================