Davidlohr Bueso [Fri, 7 May 2021 01:04:07 +0000 (18:04 -0700)]
fs/epoll: restore waking from ep_done_scan()
Commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
epoll") changed the userspace visible behavior of exclusive waiters
blocked on a common epoll descriptor upon a single event becoming ready.
Previously, all tasks doing epoll_wait would awake, and now only one is
awoken, potentially causing missed wakeups on applications that rely on
this behavior, such as Apache Qpid.
While the aforementioned commit aims at having only a wakeup single path
in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to
restore the wakeup in what was the old ep_scan_ready_list() such that
the next thread can be awoken, in a cascading style, after the waker's
corresponding ep_send_events().
Davidlohr Bueso [Fri, 7 May 2021 01:04:04 +0000 (18:04 -0700)]
kselftest: introduce new epoll test case
Patch series "fs/epoll: restore user-visible behavior upon event ready".
This series tries to address a change in user visible behavior, reported
in https://bugzilla.kernel.org/show_bug.cgi?id=208943.
Epoll does not report an event to all the threads running epoll_wait()
on the same epoll descriptor. Unsurprisingly, this was bisected back to 339ddb53d373 (fs/epoll: remove unnecessary wakeups of nested epoll), which
has had various problems in the past, beyond only nested epoll usage.
This patch (of 2):
This incorporates the testcase originally reported in:
Which ensures an event is reported to all threads blocked on the same
epoll descriptor, otherwise only a single thread will receive the wakeup
once the event become ready.
Vincent Mailhol [Fri, 7 May 2021 01:03:58 +0000 (18:03 -0700)]
checkpatch: exclude four preprocessor sub-expressions from MACRO_ARG_REUSE
__must_be_array, offsetof, sizeof_field and __stringify are all
preprocessor macros and do not evaluate their arguments. As such, it is
safe not to warn when arguments are being reused in those four
sub-expressions.
Randy Dunlap [Fri, 7 May 2021 01:03:49 +0000 (18:03 -0700)]
lib: parser: clean up kernel-doc
Mark match_uint() as kernel-doc notation since it is already fully
annotated as such. Use % prefix on constants in kernel-doc comments.
Convert function return descriptions to use the "Return:" kernel-doc
notation.
Alex Shi [Fri, 7 May 2021 01:03:46 +0000 (18:03 -0700)]
lib/genalloc: add parameter description to fix doc compile warning
Commit 52fbf1134d47 ("lib/genalloc.c: fix allocation of aligned buffer
from non-aligned chunk") added a new parameter 'start_addr' w/o
description for it. That causes some doc compile warning:
lib/genalloc.c:649: warning: Function parameter or member 'start_addr' not described in 'gen_pool_first_fit'
lib/genalloc.c:667: warning: Function parameter or member 'start_addr' not described in 'gen_pool_first_fit_align'
lib/genalloc.c:694: warning: Function parameter or member 'start_addr' not described in 'gen_pool_fixed_alloc'
lib/genalloc.c:729: warning: Function parameter or member 'start_addr' not described in 'gen_pool_first_fit_order_align'
lib/genalloc.c:752: warning: Function parameter or member 'start_addr' not described in 'gen_pool_best_fit'
Alex Shi [Fri, 7 May 2021 01:03:43 +0000 (18:03 -0700)]
lib/percpu_counter: tame kernel-doc compile warning
commit 3e8f399da490 ("writeback: rework wb_[dec|inc]_stat family of
functions") add some function description of percpu_counter_add_batch.
but the double '*' in comments means a kernel-doc format comment which
isn't right.
Since the whole file of lib/percpu_counter.c has no any other kernel-doc
format comments, we'd better to remove this incomplete one to tame the
kernel-doc warning:
lib/percpu_counter.c:83: warning: Function parameter or member 'fbc' not described in 'percpu_counter_add_batch'
lib/percpu_counter.c:83: warning: Function parameter or member 'amount' not described in 'percpu_counter_add_batch'
lib/percpu_counter.c:83: warning: Function parameter or member 'batch' not described in 'percpu_counter_add_batch'
Zqiang [Fri, 7 May 2021 01:03:40 +0000 (18:03 -0700)]
lib: stackdepot: turn depot_lock spinlock to raw_spinlock
In RT system, the spin_lock will be replaced by sleepable rt_mutex lock,
in __call_rcu(), disable interrupts before calling
kasan_record_aux_stack(), will trigger this calltrace:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:951
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 19, name: pgdatinit0
Call Trace:
___might_sleep.cold+0x1b2/0x1f1
rt_spin_lock+0x3b/0xb0
stack_depot_save+0x1b9/0x440
kasan_save_stack+0x32/0x40
kasan_record_aux_stack+0xa5/0xb0
__call_rcu+0x117/0x880
__exit_signal+0xafb/0x1180
release_task+0x1d6/0x480
exit_notify+0x303/0x750
do_exit+0x678/0xcf0
kthread+0x364/0x4f0
ret_from_fork+0x22/0x30
crc8() does not change the data passed to it, so the pointer argument
should be declared const. This avoids callers that receive const data
having to cast it to a non-const pointer to call crc8().
Yury Norov [Fri, 7 May 2021 01:03:11 +0000 (18:03 -0700)]
lib: add fast path for find_next_*_bit()
Similarly to bitmap functions, find_next_*_bit() users will benefit if
we'll handle a case of bitmaps that fit into a single word inline. In the
very best case, the compiler may replace a function call with a few
instructions.
This is the quite typical find_next_bit() user:
unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
/* -1 is a legal arg here. */
if (n != -1)
cpumask_check(n);
return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1);
}
EXPORT_SYMBOL(cpumask_next);
Yury Norov [Fri, 7 May 2021 01:03:03 +0000 (18:03 -0700)]
lib: inline _find_next_bit() wrappers
lib/find_bit.c declares five single-line wrappers for _find_next_bit().
We may turn those wrappers to inline functions. It eliminates unneeded
function calls and opens room for compile-time optimizations.
Yury Norov [Fri, 7 May 2021 01:02:53 +0000 (18:02 -0700)]
arch: rearrange headers inclusion order in asm/bitops for m68k, sh and h8300
m68k and sh include bitmap/{find,le}.h prior to ffs/fls headers. New
fast-path implementation in find.h requires ffs/fls. Reordering the
headers inclusion sequence helps to prevent compile-time implicit function
declaration error.
Yury Norov [Fri, 7 May 2021 01:02:42 +0000 (18:02 -0700)]
tools: disable -Wno-type-limits
Patch series "lib/find_bit: fast path for small bitmaps", v6.
Bitmap operations are much simpler and faster in case of small bitmaps
which fit into a single word. In linux/bitmap.c we have a machinery that
allows compiler to replace actual function call with a few instructions if
bitmaps passed into the function are small and their size is known at
compile time.
find_*_bit() API lacks this functionality; but users will benefit from it
a lot. One important example is cpumask subsystem when NR_CPUS <=
BITS_PER_LONG.
This patch (of 12):
GENMASK(h, l) may be passed with unsigned types. In such case,
type-limits warning is generated for example in case of GENMASK(h, 0).
init_groups is declared in both cred.h and init_task.h, but it is not
actually referenced anywhere outside of cred.c where it is defined. So
make it static and remove the declarations.
Wan Jiabing [Fri, 7 May 2021 01:02:33 +0000 (18:02 -0700)]
linux/profile.h: remove unnecessary declaration
Declaring struct pt_regs is unnecessary. On the one hand, there is no
function using it; on the other hand, struct pt_regs has been declared in
linux/kernel.h. Remove them.
Andy Shevchenko [Fri, 7 May 2021 01:02:30 +0000 (18:02 -0700)]
kernel.h: drop inclusion in bitmap.h
The bitmap.h header is used in a lot of code around the kernel. Besides
that it includes kernel.h which sometimes makes a loop.
The problem here is many unneeded loops that make header hell
dependencies. For example, how may you move bitmap_zalloc() from C-file
to the header? Currently it's impossible. And bitmap.h here is only the
tip of an iceberg.
kerne.h is a dump of everything that even has nothing in common at all.
We may still have it, but in my new code I prefer to include only the
headers that I want to use, without the bulk of unneeded kernel code.
Break the loop by introducing align.h, including it in kernel.h and
bitmap.h followed by replacing kernel.h with limits.h.
My UEK-derived config has 1030 files depending on pagemap.h before this
change. Afterwards, just 326 files need to be rebuilt when I touch
pagemap.h. I think blkdev.h is probably included too widely, but
untangling that dependency is harder and this solves my problem. x86
allmodconfig builds, but there may be implicit include problems on other
architectures.
Alexey Dobriyan [Fri, 7 May 2021 01:02:16 +0000 (18:02 -0700)]
proc: mandate ->proc_lseek in "struct proc_ops"
Now that proc_ops are separate from file_operations and other operations
it easy to check all instances to have ->proc_lseek hook and remove check
in main code.
Currently the pde_is_permanent() check is being run on root multiple times
rather than on the next proc directory entry. This looks like a
copy-paste error. Fix this by replacing root with next.
Randy Dunlap [Fri, 7 May 2021 01:02:04 +0000 (18:02 -0700)]
alpha: eliminate old-style function definitions
'make ARCH=alpha W=1' reports a couple of old-style function
definitions with missing parameter list, so fix those.
arch/alpha/kernel/pc873xx.c: In function 'pc873xx_get_base':
arch/alpha/kernel/pc873xx.c:16:21: warning: old-style function definition [-Wold-style-definition]
16 | unsigned int __init pc873xx_get_base()
arch/alpha/kernel/pc873xx.c: In function 'pc873xx_get_model':
arch/alpha/kernel/pc873xx.c:21:14: warning: old-style function definition [-Wold-style-definition]
21 | char *__init pc873xx_get_model()
Arjun Roy [Thu, 6 May 2021 22:35:30 +0000 (15:35 -0700)]
tcp: Specify cmsgbuf is user pointer for receive zerocopy.
A prior change (1f466e1f15cf) introduces separate handling for
->msg_control depending on whether the pointer is a kernel or user
pointer. However, while tcp receive zerocopy is using this field, it
is not properly annotating that the buffer in this case is a user
pointer. This can cause faults when the improper mechanism is used
within put_cmsg().
This patch simply annotates tcp receive zerocopy's use as explicitly
being a user pointer.
Ido Schimmel [Thu, 6 May 2021 07:23:08 +0000 (10:23 +0300)]
mlxsw: spectrum_mr: Update egress RIF list before route's action
Each multicast route that is forwarding packets (as opposed to trapping
them) points to a list of egress router interfaces (RIFs) through which
packets are replicated.
A route's action can transition from trap to forward when a RIF is
created for one of the route's egress virtual interfaces (eVIF). When
this happens, the route's action is first updated and only later the
list of egress RIFs is committed to the device.
This results in the route pointing to an invalid list. In case the list
pointer is out of range (due to uninitialized memory), the device will
complain:
Fix this by first committing the list of egress RIFs to the device and
only later update the route's action.
Note that a fix is not needed in the reverse function (i.e.,
mlxsw_sp_mr_route_evif_unresolve()), as there the route's action is
first updated and only later the RIF is removed from the list.
Jakub Kicinski [Thu, 6 May 2021 23:24:31 +0000 (16:24 -0700)]
Merge tag 'linux-can-fixes-for-5.13-20210506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2021-05-06
The first two patches target the mcp251xfd driver. Dan Carpenter's
patch fixes a NULL pointer dereference in the probe function's error
path. A patch by me adds the missing can_rx_offload_del() in error
path of the probe function.
Frieder Schrempf contributes a patch for the mcp251x driver, the patch
fixes the resume from sleep before interface was brought up.
The last patch is by me and fixes a race condition in the TX path of
the m_can driver for peripheral (SPI) based m_can cores.
* tag 'linux-can-fixes-for-5.13-20210506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
can: mcp251x: fix resume from sleep before interface was brought up
can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe
====================
Hansem Ro [Thu, 6 May 2021 20:27:10 +0000 (13:27 -0700)]
Input: ili210x - add missing negation for touch indication on ili210x
This adds the negation needed for proper finger detection on Ilitek
ili2107/ili210x. This fixes polling issues (on Amazon Kindle Fire)
caused by returning false for the cooresponding finger on the touchscreen.
Linus Torvalds [Thu, 6 May 2021 21:39:50 +0000 (14:39 -0700)]
Merge tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- add support for system call stack randomization
- handle stale PCI deconfiguration events
- couple of defconfig updates
- some fixes and cleanups
* tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: fix detection of vector enhancements facility 1 vs. vector packed decimal facility
s390/entry: add support for syscall stack randomization
s390/configs: change CONFIG_VIRTIO_CONSOLE to "m"
s390/cio: remove invalid condition on IO_SCH_UNREG
s390/cpumf: remove call to perf_event_update_userpage
s390/cpumf: move counter set size calculation to common place
s390/cpumf: beautify if-then-else indentation
s390/configs: enable CONFIG_PCI_IOV
s390/pci: handle stale deconfiguration events
s390/pci: rename zpci_configure_device()
Linus Torvalds [Thu, 6 May 2021 21:22:58 +0000 (14:22 -0700)]
Merge tag 'vfio-v5.13-rc1pt2' of git://github.com/awilliam/linux-vfio
Pull more VFIO updates from Alex Williamson:
"A second small set of commits for this merge window, primarily to
unbreak some deletions from our uAPI header.
Thomas Gleixner [Thu, 22 Apr 2021 19:44:21 +0000 (21:44 +0200)]
futex: Make syscall entry points less convoluted
The futex and the compat syscall entry points do pretty much the same
except for the timespec data types and the corresponding copy from
user function.
Split out the rest into inline functions and share the functionality.
Thomas Gleixner [Thu, 22 Apr 2021 19:44:20 +0000 (21:44 +0200)]
futex: Get rid of the val2 conditional dance
There is no point in checking which FUTEX operand treats the utime pointer
as 'val2' argument because that argument to do_futex() is only used by
exactly these operands.
So just handing it in unconditionally is not making any difference, but
removes a lot of pointless gunk.
Thomas Gleixner [Thu, 22 Apr 2021 19:44:19 +0000 (21:44 +0200)]
futex: Do not apply time namespace adjustment on FUTEX_LOCK_PI
FUTEX_LOCK_PI does not require to have the FUTEX_CLOCK_REALTIME bit set
because it has been using CLOCK_REALTIME based absolute timeouts
forever. Due to that, the time namespace adjustment which is applied when
FUTEX_CLOCK_REALTIME is not set, will wrongly take place for FUTEX_LOCK_PI
and wreckage the timeout.
Thomas Gleixner [Thu, 22 Apr 2021 19:44:18 +0000 (21:44 +0200)]
Revert 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
The FUTEX_WAIT operand has historically a relative timeout which means that
the clock id is irrelevant as relative timeouts on CLOCK_REALTIME are not
subject to wall clock changes and therefore are mapped by the kernel to
CLOCK_MONOTONIC for simplicity.
If a caller would set FUTEX_CLOCK_REALTIME for FUTEX_WAIT the timeout is
still treated relative vs. CLOCK_MONOTONIC and then the wait arms that
timeout based on CLOCK_REALTIME which is broken and obviously has never
been used or even tested.
Reject any attempt to use FUTEX_CLOCK_REALTIME with FUTEX_WAIT again.
The desired functionality can be achieved with FUTEX_WAIT_BITSET and a
FUTEX_BITSET_MATCH_ANY argument.
Linus Torvalds [Thu, 6 May 2021 18:08:50 +0000 (11:08 -0700)]
Merge branch 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull pcmcia updates from Dominik Brodowski:
"A number of patches fixing W=1 kernel build warnings, and one patch
removing an always-false NULL check"
* 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
pcmcia: rsrc_nonstatic: Fix call-back function as reference formatting
pcmcia: pcmcia_resource: Fix some kernel-doc formatting/disparities and demote others
pcmcia: ds: Fix function name disparity in header
pcmcia: pcmcia_cis: Demote non-conforming kernel-doc headers to standard kernel-doc
pcmcia: cistpl: Demote non-conformant kernel-doc headers to standard comments
pcmcia: rsrc_nonstatic: Demote kernel-doc abuses
pcmcia: ds: Remove if with always false condition
Linus Torvalds [Thu, 6 May 2021 17:27:02 +0000 (10:27 -0700)]
Merge tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Notable items here are
- a series to take advantage of David Howells' netfs helper library
from Jeff
- three new filesystem client metrics from Xiubo
- ceph.dir.rsnaps vxattr from Yanhu
- two auth-related fixes from myself, marked for stable.
Interspersed is a smattering of assorted fixes and cleanups across the
filesystem"
* tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits)
libceph: allow addrvecs with a single NONE/blank address
libceph: don't set global_id until we get an auth ticket
libceph: bump CephXAuthenticate encoding version
ceph: don't allow access to MDS-private inodes
ceph: fix up some bare fetches of i_size
ceph: convert some PAGE_SIZE invocations to thp_size()
ceph: support getting ceph.dir.rsnaps vxattr
ceph: drop pinned_page parameter from ceph_get_caps
ceph: fix inode leak on getattr error in __fh_to_dentry
ceph: only check pool permissions for regular files
ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
ceph: avoid counting the same request twice or more
ceph: rename the metric helpers
ceph: fix kerneldoc copypasta over ceph_start_io_direct
ceph: use attach/detach_page_private for tracking snap context
ceph: don't use d_add in ceph_handle_snapdir
ceph: don't clobber i_snap_caps on non-I_NEW inode
ceph: fix fall-through warnings for Clang
ceph: convert ceph_readpages to ceph_readahead
ceph: convert ceph_write_begin to netfs_write_begin
...
Linus Torvalds [Thu, 6 May 2021 17:06:39 +0000 (10:06 -0700)]
Merge tag 'ecryptfs-5.13-rc1-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs updates from Tyler Hicks:
"Code cleanups and a bug fix
- W=1 compiler warning cleanups
- Mutex initialization simplification
- Protect against NULL pointer exception during mount"
* tag 'ecryptfs-5.13-rc1-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: fix kernel panic with null dev_name
ecryptfs: remove unused helpers
ecryptfs: Fix typo in message
eCryptfs: Use DEFINE_MUTEX() for mutex lock
ecryptfs: keystore: Fix some kernel-doc issues and demote non-conformant headers
ecryptfs: inode: Help out nearly-there header and demote non-conformant ones
ecryptfs: mmap: Help out one function header and demote other abuses
ecryptfs: crypto: Supply some missing param descriptions and demote abuses
ecryptfs: miscdev: File headers are not good kernel-doc candidates
ecryptfs: main: Demote a bunch of non-conformant kernel-doc headers
ecryptfs: messaging: Add missing param descriptions and demote abuses
ecryptfs: super: Fix formatting, naming and kernel-doc abuses
ecryptfs: file: Demote kernel-doc abuses
ecryptfs: kthread: Demote file header and provide description for 'cred'
ecryptfs: dentry: File headers are not good candidates for kernel-doc
ecryptfs: debug: Demote a couple of kernel-doc abuses
ecryptfs: read_write: File headers do not make good candidates for kernel-doc
ecryptfs: use DEFINE_MUTEX() for mutex lock
eCryptfs: add a semicolon
Linus Torvalds [Thu, 6 May 2021 17:03:38 +0000 (10:03 -0700)]
Merge tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix probes written to the set_ftrace_filter file
Now that there's a library that accesses the tracefs file system
(libtracefs), the way the files are interacted with is slightly
different than the command line. For instance, the write() system call
is used directly instead of an echo. This exposes some old bugs.
If a probe is written to "set_ftrace_filter" without any white space
after it, it will be ignored. This is because the write expects that a
string written to it that does not end with white spaces thinks there
is more to come. But if the file is closed, the release function needs
to finish it. The "set_ftrace_filter" release function handles the
filter part of the "set_ftrace_filter" file, but did not handle the
probe part"
* tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Handle commands when closing set_ftrace_filter file
Linus Torvalds [Thu, 6 May 2021 16:56:26 +0000 (09:56 -0700)]
Merge tag 'acpi-5.13-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert one recent commit that turned out to be problematic,
address two issues in the ACPI "custom method" interface and update
GPIO properties documentation.
Specifics:
- Revent recent commit related to the handling of ACPI power
resources during initialization, because it turned out to cause
problems to occur on some systems (Rafael Wysocki).
- Fix potential use-after-free and potential memory leak in the ACPI
"custom method" debugfs interface (Mark Langsdorf).
* tag 'acpi-5.13-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI: scan: Turn off unused power resources during initialization"
ACPI: custom_method: fix a possible memory leak
ACPI: custom_method: fix potential use-after-free issue
Documentation: firmware-guide: gpio-properties: Add note to SPI CS case
Since commit 79b1feba5455 ("RISC-V: Setup exception vector early")
exception vectors are setup early and the handle_exception symbol from
the asm files is no longer referenced in traps.c. Remove it.
riscv: Consistify protect_kernel_linear_mapping_text_rodata() use
The various uses of protect_kernel_linear_mapping_text_rodata() are
not consistent:
- Its definition depends on "64BIT && !XIP_KERNEL",
- Its forward declaration depends on MMU,
- Its single caller depends on "STRICT_KERNEL_RWX && 64BIT && MMU &&
!XIP_KERNEL".
Fix this by settling on the dependencies of the caller, which can be
simplified as STRICT_KERNEL_RWX depends on "MMU && !XIP_KERNEL".
Provide a dummy definition, as the caller is protected by
"IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)" instead of "#ifdef
CONFIG_STRICT_KERNEL_RWX".
Vincent Chen [Thu, 29 Apr 2021 07:58:36 +0000 (00:58 -0700)]
riscv: enable SiFive errata CIP-453 and CIP-1200 Kconfig only if CONFIG_64BIT=y
The corresponding hardware issues of CONFIG_ERRATA_SIFIVE_CIP_453 and
CONFIG_ERRATA_SIFIVE_CIP_1200 only exist in the SiFive 64bit CPU cores.
Therefore, these two errata are required only if CONFIG_64BIT=y
riscv: Only extend kernel reservation if mapped read-only
When the kernel mapping was moved outside of the linear mapping, the
kernel memory reservation was increased, to take into account mapping
granularity. However, this is done unconditionally, regardless of
whether the kernel memory is mapped read-only or not.
If this extension is not needed, up to 2 MiB may be lost, which has a
big impact on e.g. Canaan K210 (64-bit nommu) platforms with only 8 MiB
of RAM.
Reclaim the lost memory by only extending the reserved region when
needed, i.e. depending on a simplified version of the conditional logic
around the call to protect_kernel_linear_mapping_text_rodata().
Linus Torvalds [Thu, 6 May 2021 16:24:18 +0000 (09:24 -0700)]
Merge tag 'riscv-for-linus-5.13-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for the memtest= kernel command-line argument.
- Support for building the kernel with FORTIFY_SOURCE.
- Support for generic clockevent broadcasts.
- Support for the buildtar build target.
- Some build system cleanups to pass more LLVM-friendly arguments.
- Support for kprobes.
- A rearranged kernel memory map, the first part of supporting sv48
systems.
- Improvements to kexec, along with support for kdump and crash
kernels.
- An alternatives-based errata framework, along with support for
handling a pair of errata that manifest on some SiFive designs
(including the HiFive Unmatched).
- Support for XIP.
- A device tree for the Microchip PolarFire ICICLE SoC and associated
dev board.
... along with a bunch of cleanups. There are already a handful of fixes
on the list so there will likely be a part 2.
* tag 'riscv-for-linus-5.13-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (45 commits)
RISC-V: Always define XIP_FIXUP
riscv: Remove 32b kernel mapping from page table dump
riscv: Fix 32b kernel build with CONFIG_DEBUG_VIRTUAL=y
RISC-V: Fix error code returned by riscv_hartid_to_cpuid()
RISC-V: Enable Microchip PolarFire ICICLE SoC
RISC-V: Initial DTS for Microchip ICICLE board
dt-bindings: riscv: microchip: Add YAML documentation for the PolarFire SoC
RISC-V: Add Microchip PolarFire SoC kconfig option
RISC-V: enable XIP
RISC-V: Add crash kernel support
RISC-V: Add kdump support
RISC-V: Improve init_resources()
RISC-V: Add kexec support
RISC-V: Add EM_RISCV to kexec UAPI header
riscv: vdso: fix and clean-up Makefile
riscv/mm: Use BUG_ON instead of if condition followed by BUG.
riscv/kprobe: fix kernel panic when invoking sys_read traced by kprobe
riscv: Set ARCH_HAS_STRICT_MODULE_RWX if MMU
riscv: module: Create module allocations without exec permissions
riscv: bpf: Avoid breaking W^X
...
Linus Torvalds [Thu, 6 May 2021 15:40:38 +0000 (08:40 -0700)]
Merge tag 'hexagon-5.13-0' of git://git.kernel.org/pub/scm/linux/kernel/git/bcain/linux
Pull Hexagon updates from Brian Cain:
"Hexagon architecture build fixes + builtins
Small build fixes applied:
- use -mlong-calls to build
- extend jumps in futex_atomic_*
- etc
Also, for convenience and portability, the hexagon compiler builtin
functions like memcpy etc have been added to the kernel -- following
the idiom used by other architectures"
* tag 'hexagon-5.13-0' of git://git.kernel.org/pub/scm/linux/kernel/git/bcain/linux:
Hexagon: add target builtins to kernel
Hexagon: remove DEBUG from comet config
Hexagon: change jumps to must-extend in futex_atomic_*
Hexagon: fix build errors
Linus Torvalds [Thu, 6 May 2021 15:33:54 +0000 (08:33 -0700)]
Merge tag 'docs-5.13-2' of git://git.lwn.net/linux
Pull documentation fixes from Jonathan Corbet:
"A few late-arriving documentation fixes, including some oprofile
cleanup, a kernel-doc fix, some regression-reporting updates, and the
usual minor fixes"
* tag 'docs-5.13-2' of git://git.lwn.net/linux:
Enlisted oprofile version line removed
oprofiled version output line removed from the list
Removed the oprofiled version option
docs: reporting-issues.rst: CC subsystem and maintainers on regressions
docs: correct URL to bios and kernel developer's guide
docs/core-api: Consistent code style
docs/zh_CN: Adjust order and content of zh_CN/index.rst
Documentation: input: joydev file corrections
docs: Fix typo in Documentation/x86/x86_64/5level-paging.rst
kernel-doc: Add support for __deprecated
blkdev_read_iter can truncate iov_iter's count since the count + pos may
exceed the size of the blkdev. This will confuse io_read that we have
consume the iovec. And once we do the iov_iter_revert in io_read, we
will trigger the slab-out-of-bounds. Fix it by reexpand the count with
size has been truncated.
Make the code more readable by replacing the atomic_cmpxchg_acquire()
by an equivalent atomic_try_cmpxchg_acquire() and change atomic_add()
to atomic_or().
For architectures that use qrwlock, I do not find one that has an
atomic_add() defined but not an atomic_or(). I guess it should be fine
by changing atomic_add() to atomic_or().
Note that the previous use of atomic_add() isn't wrong as only one
writer that is the wait_lock owner can set the waiting flag and the
flag will be cleared later on when acquiring the write lock.
Arnd Bergmann [Wed, 5 May 2021 21:12:42 +0000 (23:12 +0200)]
smp: Fix smp_call_function_single_async prototype
As of commit 966a967116e6 ("smp: Avoid using two cache lines for struct
call_single_data"), the smp code prefers 32-byte aligned call_single_data
objects for performance reasons, but the block layer includes an instance
of this structure in the main 'struct request' that is more senstive
to size than to performance here, see 4ccafe032005 ("block: unalign
call_single_data in struct request").
The result is a violation of the calling conventions that clang correctly
points out:
block/blk-mq.c:630:39: warning: passing 8-byte aligned argument to 32-byte aligned parameter 2 of 'smp_call_function_single_async' may result in an unaligned pointer access [-Walign-mismatch]
smp_call_function_single_async(cpu, &rq->csd);
It does seem that the usage of the call_single_data without cache line
alignment should still be allowed by the smp code, so just change the
function prototype so it accepts both, but leave the default alignment
unchanged for the other users. This seems better to me than adding
a local hack to shut up an otherwise correct warning in the caller.
x86/events/amd/iommu: Fix invalid Perf result due to IOMMU PMC power-gating
On certain AMD platforms, when the IOMMU performance counter source
(csource) field is zero, power-gating for the counter is enabled, which
prevents write access and returns zero for read access.
This can cause invalid perf result especially when event multiplexing
is needed (i.e. more number of events than available counters) since
the current logic keeps track of the previously read counter value,
and subsequently re-program the counter to continue counting the event.
With power-gating enabled, we cannot gurantee successful re-programming
of the counter.
Workaround this issue by :
1. Modifying the ordering of setting/reading counters and enabing/
disabling csources to only access the counter when the csource
is set to non-zero.
2. Since AMD IOMMU PMU does not support interrupt mode, the logic
can be simplified to always start counting with value zero,
and accumulate the counter value when stopping without the need
to keep track and reprogram the counter with the previously read
counter value.
This has been tested on systems with and without power-gating.
Odin Ugedal [Sat, 1 May 2021 14:19:50 +0000 (16:19 +0200)]
sched/fair: Fix unfairness caused by missing load decay
This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting a respective 99% and 1%(!!) of cpu time.
When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
to cfs_rq->removed from the sched_entity. Such a move will happen when
enforcing a cpuset on the task (eg. via a cgroup) that force it to move.
The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than they
are supposed to; causing severe fairness issues. If no other task is
started on the given cfs_rq, and due to the cpuset it would not happen,
this load would never be properly unloaded. With this patch the load
will be properly removed inside update_blocked_averages. This also
applies to tasks moved to the fair scheduling class and moved to another
cpu, and this path will also fix that. For fork, the entity is queued
right away, so this problem does not affect that.
This applies to cases where the new process is the first in the cfs_rq,
issue introduced 3d30544f0212 ("sched/fair: Apply more PELT fixes"), and
when there has previously been load on the cgroup but the cgroup was
removed from the leaflist due to having null PELT load, indroduced
in 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing
path").
For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, that in theory should get 50/50 of cpu time each, it often leads
to a load of 60/40 or 70/30.
If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.
This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.
Util-clamp places tasks in different buckets based on their clamp values
for performance reasons. However, the size of buckets is currently
computed using a rounding division, which can lead to an off-by-one
error in some configurations.
For instance, with 20 buckets, the bucket size will be 1024/20=51. A
task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly,
correct indexes are in range [0,19], hence leading to an out of bound
memory access.
Johannes Weiner [Mon, 3 May 2021 17:49:17 +0000 (13:49 -0400)]
psi: Fix psi state corruption when schedule() races with cgroup move
4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:
psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
(schedule() decreasing RUNNING and ONCPU, both of which are 0)
psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
(cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)
What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.
However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.
A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.
Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.
[ We used the scheduler state historically in order to not rely on
task->psi_flags for anything but debugging. But that ship has sailed
anyway, and this is simpler and more robust.
We previously already batched TSK_ONCPU clearing with the
TSK_RUNNING update inside the deactivation call from schedule(). But
that ordering was safe and didn't result in TSK_ONCPU corruption:
unlike most places in the scheduler, cgroup_move_task() only checked
task_current() and handled TSK_ONCPU if the task was still queued. ]
Shaokun Zhang [Thu, 6 May 2021 05:54:22 +0000 (13:54 +0800)]
arm64: kernel: Update the stale comment
Commit af391b15f7b5 ("arm64: kernel: rename __cpu_suspend to keep it aligned with arm")
has used @index instead of @arg, but the comment is stale, update it.
Hui Wang [Tue, 4 May 2021 07:39:17 +0000 (15:39 +0800)]
ALSA: hda: generic: change the DAC ctl name for LO+SPK or LO+HP
Without this change, the DAC ctl's name could be changed only when
the machine has both Speaker and Headphone, but we met some machines
which only has Lineout and Headhpone, and the Lineout and Headphone
share the Audio Mixer0 and DAC0, the ctl's name is set to "Front".
On most of machines, the "Front" is used for Speaker only or Lineout
only, but on this machine it is shared by Lineout and Headphone,
This introduces an issue in the pipewire and pulseaudio, suppose users
want the Headphone to be on and the Speaker/Lineout to be off, they
could turn off the "Front", this works on most of the machines, but on
this machine, the "Front" couldn't be turned off otherwise the
headphone will be off too. Here we do some change to let the ctl's
name change to "Headphone+LO" on this machine, and pipewire and
pulseaudio already could handle "Headphone+LO" and "Speaker+LO".
(https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/747)
The m_can_start_xmit() function checks if the cdev->tx_skb is NULL and
returns with NETDEV_TX_BUSY in case tx_sbk is not NULL.
There is a race condition in the m_can_tx_work_queue(), where first
the skb is send to the driver and then the case tx_sbk is set to NULL.
A TX complete IRQ might come in between and wake the queue, which
results in tx_skb not being cleared yet.
Fixes: f524f829b75a ("can: m_can: Create a m_can platform framework") Tested-by: Torin Cooper-Bennun <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
can: mcp251x: fix resume from sleep before interface was brought up
Since 8ce8c0abcba3 the driver queues work via priv->restart_work when
resuming after suspend, even when the interface was not previously
enabled. This causes a null dereference error as the workqueue is only
allocated and initialized in mcp251x_open().
To fix this we move the workqueue init to mcp251x_can_probe() as there
is no reason to do it later and repeat it whenever mcp251x_open() is
called.
Dan Carpenter [Mon, 3 May 2021 14:49:09 +0000 (17:49 +0300)]
can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe
When we converted this code to use dev_err_probe() we accidentally
removed a return. It means that if devm_clk_get() it will lead to an
Oops when we call clk_get_rate() on the next line.
drm/amdgpu: Use device specific BO size & stride check.
The builtin size check isn't really the right thing for AMD
modifiers due to a couple of reasons:
1) In the format structs we don't do set any of the tilesize / blocks
etc. to avoid having format arrays per modifier/GPU
2) The pitch on the main plane is pixel_pitch * bytes_per_pixel even
for tiled ...
3) The pitch for the DCC planes is really the pixel pitch of the main
surface that would be covered by it ...
Note that we only handle GFX9+ case but we do this after converting
the implicit modifier to an explicit modifier, so on GFX9+ all
framebuffers should be checked here.
There is a TODO about DCC alignment, but it isn't worse than before
and I'd need to dig a bunch into the specifics. Getting this out in
a reasonable timeframe to make sure it gets the appropriate testing
seemed more important.
Finally as I've found that debugging addfb2 failures is a pita I was
generous adding explicit error messages to every failure case.
Fixes: f258907fdd83 ("drm/amdgpu: Verify bo size can fit framebuffer size on init.") Tested-by: Simon Ser <[email protected]> Signed-off-by: Bas Nieuwenhuizen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
x86/process: setup io_threads more like normal user space threads
As io_threads are fully set up USER threads it's clearer to separate the
code path from the KTHREAD logic.
The only remaining difference to user space threads is that io_threads
never return to user space again. Instead they loop within the given
worker function.
The fact that they never return to user space means they don't have an
user space thread stack. In order to indicate that to tools like gdb we
reset the stack and instruction pointers to 0.
This allows gdb attach to user space processes using io-uring, which like
means that they have io_threads, without printing worrying message like
this:
warning: Selected architecture i386:x86-64 is not compatible with reported target architecture i386
netfilter: remove BUG_ON() after skb_header_pointer()
Several conntrack helpers and the TCP tracker assume that
skb_header_pointer() never fails based on upfront header validation.
Even if this should not ever happen, BUG_ON() is a too drastic measure,
remove them.
KVM: x86: Consolidate guest enter/exit logic to common helpers
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers. Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.
context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage
its context tracking vs. vtime accounting without bleeding too many KVM
details into the context tracking code.
Consolidate the guest enter/exit wrappers, providing and tweaking stubs
as needed. This will allow moving the wrappers under KVM without having
to bleed #ifdefs into the soon-to-be KVM code.
sched/vtime: Move guest enter/exit vtime accounting to vtime.h
Provide separate helpers for guest enter vtime accounting (in addition to
the existing guest exit helpers), and move all vtime accounting helpers
to vtime.h where the existing #ifdef infrastructure can be leveraged to
better delineate the different types of accounting. This will also allow
future cleanups via deduplication of context tracking code.
Opportunstically delete the vtime_account_kernel() stub now that all
callers are wrapped with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y.
Move the blob of external declarations (and their stubs) above the set of
inline definitions (and their stubs) for vtime accounting. This will
allow a future patch to bring in more inline definitions without also
having to shuffle large chunks of code.
Wanpeng Li [Wed, 5 May 2021 00:27:30 +0000 (17:27 -0700)]
KVM: x86: Defer vtime accounting 'til after IRQ handling
Defer the call to account guest time until after servicing any IRQ(s)
that happened in the guest or immediately after VM-Exit. Tick-based
accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
handler runs, and IRQs are blocked throughout the main sequence of
vcpu_enter_guest(), including the call into vendor code to actually
enter and exit the guest.
This fixes a bug where reported guest time remains '0', even when
running an infinite loop in the guest:
Wanpeng Li [Wed, 5 May 2021 00:27:29 +0000 (17:27 -0700)]
context_tracking: Move guest exit vtime accounting to separate helpers
Provide separate vtime accounting functions for guest exit instead of
open coding the logic within the context tracking code. This will allow
KVM x86 to handle vtime accounting slightly differently when using
tick-based accounting.
Wanpeng Li [Wed, 5 May 2021 00:27:28 +0000 (17:27 -0700)]
context_tracking: Move guest exit context tracking to separate helpers
Provide separate context tracking helpers for guest exit, the standalone
helpers will be called separately by KVM x86 in later patches to fix
tick-based accounting.
Lai Jiangshan [Tue, 4 May 2021 19:50:14 +0000 (21:50 +0200)]
KVM/VMX: Invoke NMI non-IST entry instead of IST entry
In VMX, the host NMI handler needs to be invoked after NMI VM-Exit.
Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via indirect
call instead of INTn"), this was done by INTn ("int $2"). But INTn
microcode is relatively expensive, so the commit reworked NMI VM-Exit
handling to invoke the kernel handler by function call.
But this missed a detail. The NMI entry point for direct invocation is
fetched from the IDT table and called on the kernel stack. But on 64-bit
the NMI entry installed in the IDT expects to be invoked on the IST stack.
It relies on the "NMI executing" variable on the IST stack to work
correctly, which is at a fixed position in the IST stack. When the entry
point is unexpectedly called on the kernel stack, the RSP-addressed "NMI
executing" variable is obviously also on the kernel stack and is
"uninitialized" and can cause the NMI entry code to run in the wrong way.
Provide a non-ist entry point for VMX which shares the C-function with
the regular NMI entry and invoke the new asm entry point instead.
On 32-bit this just maps to the regular NMI entry point as 32-bit has no
ISTs and is not affected.
[ tglx: Made it independent for backporting, massaged changelog ]
Linus Torvalds [Wed, 5 May 2021 20:50:15 +0000 (13:50 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"The remainder of the main mm/ queue.
143 patches.
Subsystems affected by this patch series (all mm): pagecache, hugetlb,
userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
kfence"
* emailed patches from Andrew Morton <[email protected]>: (143 commits)
kfence: use power-efficient work queue to run delayed work
kfence: maximize allocation wait timeout duration
kfence: await for allocation using wait_event
kfence: zero guard page after out-of-bounds access
mm/process_vm_access.c: remove duplicate include
mm/mempool: minor coding style tweaks
mm/highmem.c: fix coding style issue
btrfs: use memzero_page() instead of open coded kmap pattern
iov_iter: lift memzero_page() to highmem.h
mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
mm/zswap.c: switch from strlcpy to strscpy
arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
mm,memory_hotplug: allocate memmap from the added memory range
mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
mm,memory_hotplug: relax fully spanned sections check
drivers/base/memory: introduce memory_block_{online,offline}
mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
...
Linus Torvalds [Wed, 5 May 2021 20:44:19 +0000 (13:44 -0700)]
Merge tag 'nfsd-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull more nfsd updates from Chuck Lever:
"Additional fixes and clean-ups for NFSD since tags/nfsd-5.13,
including a fix to grant read delegations for files open for writing"
* tag 'nfsd-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix null pointer dereference in svc_rqst_free()
SUNRPC: fix ternary sign expansion bug in tracing
nfsd: Fix fall-through warnings for Clang
nfsd: grant read delegations to clients holding writes
nfsd: reshuffle some code
nfsd: track filehandle aliasing in nfs4_files
nfsd: hash nfs4_files by inode number
nfsd: ensure new clients break delegations
nfsd: removed unused argument in nfsd_startup_generic()
nfsd: remove unused function
svcrdma: Pass a useful error code to the send_err tracepoint
svcrdma: Rename goto labels in svc_rdma_sendto()
svcrdma: Don't leak send_ctxt on Send errors