Liming Sun [Sat, 8 May 2021 00:30:12 +0000 (20:30 -0400)]
platform/mellanox: mlxbf-tmfifo: Fix a memory barrier issue
The virtio framework uses wmb() when updating avail->idx. It
guarantees the write order, but not necessarily loading order
for the code accessing the memory. This commit adds a load barrier
after reading the avail->idx to make sure all the data in the
descriptor is visible. It also adds a barrier when returning the
packet to virtio framework to make sure read/writes are visible to
the virtio code.
Maximilian Luz [Thu, 13 May 2021 13:44:37 +0000 (15:44 +0200)]
platform/surface: dtx: Fix poll function
The poll function should not return -ERESTARTSYS.
Furthermore, locking in this function is completely unnecessary. The
ddev->lock protects access to the main device and controller (ddev->dev
and ddev->ctrl), ensuring that both are and remain valid while being
accessed by clients. Both are, however, never accessed in the poll
function. The shutdown test (via atomic bit flags) be safely done
without locking, so drop locking here entirely.
Maximilian Luz [Fri, 14 May 2021 22:19:54 +0000 (00:19 +0200)]
platform/surface: aggregator: Add platform-drivers-x86 list to MAINTAINERS entry
The Surface System Aggregator Module driver entry is currently missing a
mailing list. Surface platform drivers are discussed on the
platform-driver-x86 list and all other Surface platform drivers have a
reference to that list in their entries. So let's add one here as well.
Clang complains about the assignment of SSAM_ANY_IID to
ssam_device_uid->instance:
drivers/platform/surface/surface_aggregator_registry.c:478:25: error: implicit conversion from 'int' to '__u8' (aka 'unsigned char') changes value from 65535 to 255 [-Werror,-Wconstant-conversion]
{ SSAM_VDEV(HUB, 0x02, SSAM_ANY_IID, 0x00) },
~ ^~~~~~~~~~~~
include/linux/surface_aggregator/device.h:71:23: note: expanded from macro 'SSAM_ANY_IID'
#define SSAM_ANY_IID 0xffff
^~~~~~
include/linux/surface_aggregator/device.h:126:63: note: expanded from macro 'SSAM_VDEV'
SSAM_DEVICE(SSAM_DOMAIN_VIRTUAL, SSAM_VIRTUAL_TC_##cat, tid, iid, fun)
^~~
include/linux/surface_aggregator/device.h:102:41: note: expanded from macro 'SSAM_DEVICE'
.instance = ((iid) != SSAM_ANY_IID) ? (iid) : 0, \
^~~
The assignment doesn't actually happen, but clang checks the type limits
before checking whether this assignment is reached. Replace the ?:
operator with a __builtin_choose_expr() invocation that avoids the
warning for the untaken part.
Maximilian Luz [Wed, 5 May 2021 13:36:35 +0000 (15:36 +0200)]
platform/surface: aggregator: Do not mark interrupt as shared
Having both IRQF_NO_AUTOEN and IRQF_SHARED set causes
request_threaded_irq() to return with -EINVAL (see comment in flag
validation in that function). As the interrupt is currently not shared
between multiple devices, drop the IRQF_SHARED flag.
Commit b33fff07e3e3 ("x86, build: allow LTO to be selected") added a
couple of '-plugin-opt=' flags to KBUILD_LDFLAGS because the code model
and stack alignment are not stored in LLVM bitcode.
However, these flags were added to KBUILD_LDFLAGS prior to the
emulation flag assignment, which uses ':=', so they were overwritten
and never added to $(LD) invocations.
The absence of these flags caused misalignment issues in the
AMDGPU driver when compiling with CONFIG_LTO_CLANG, resulting in
general protection faults.
Shuffle the assignment below the initial one so that the flags are
properly passed along and all of the linker flags stay together.
At the same time, avoid any future issues with clobbering flags by
changing the emulation flag assignment to '+=' since KBUILD_LDFLAGS is
already defined with ':=' in the main Makefile before being exported for
modification here as a result of commit:
ce99d0bf312d ("kbuild: clear LDFLAGS in the top Makefile")
Simon Rettberg [Mon, 26 Apr 2021 14:11:24 +0000 (16:11 +0200)]
drm/i915/gt: Disable HiZ Raw Stall Optimization on broken gen7
When resetting CACHE_MODE registers, don't enable HiZ Raw Stall
Optimization on Ivybridge GT1 and Baytrail, as it causes severe glitches
when rendering any kind of 3D accelerated content.
This optimization is disabled on these platforms by default according to
official documentation from 01.org.
Chris Wilson [Mon, 17 May 2021 08:46:40 +0000 (09:46 +0100)]
drm/i915/gem: Pin the L-shape quirked object as unshrinkable
When instantiating a tiled object on an L-shaped memory machine, we mark
the object as unshrinkable to prevent the shrinker from trying to swap
out the pages. We have to do this as we do not know the swizzling on the
individual pages, and so the data will be scrambled across swap out/in.
Not only do we need to move the object off the shrinker list, we need to
mark the object with shrink_pin so that the counter is consistent across
calls to madvise.
v2: in the madvise ioctl we need to check if the object is currently
shrinkable/purgeable, not if the object type supports shrinking
James Smart [Tue, 11 May 2021 04:56:35 +0000 (21:56 -0700)]
nvme-fc: clear q_live at beginning of association teardown
The __nvmf_check_ready() routine used to bounce all filesystem io if the
controller state isn't LIVE. However, a later patch changed the logic so
that it rejection ends up being based on the Q live check. The FC
transport has a slightly different sequence from rdma and tcp for
shutting down queues/marking them non-live. FC marks its queue non-live
after aborting all ios and waiting for their termination, leaving a
rather large window for filesystem io to continue to hit the transport.
Unfortunately this resulted in filesystem I/O or applications seeing I/O
errors.
Change the FC transport to mark the queues non-live at the first sign of
teardown for the association (when I/O is initially terminated).
Keith Busch [Mon, 17 May 2021 22:36:43 +0000 (15:36 -0700)]
nvme-tcp: rerun io_work if req_list is not empty
A possible race condition exists where the request to send data is
enqueued from nvme_tcp_handle_r2t()'s will not be observed by
nvme_tcp_send_all() if it happens to be running. The driver relies on
io_work to send the enqueued request when it is runs again, but the
concurrently running nvme_tcp_send_all() may not have released the
send_mutex at that time. If no future commands are enqueued to re-kick
the io_work, the request will timeout in the SEND_H2C state, resulting
in a timeout error like:
nvme nvme0: queue 1: timeout request 0x3 type 6
Ensure the io_work continues to run as long as the req_list is not empty.
Fixes: db5ad6b7f8cdd ("nvme-tcp: try to send request in queue_rq context") Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
Sagi Grimberg [Mon, 17 May 2021 21:07:45 +0000 (14:07 -0700)]
nvme-tcp: fix possible use-after-completion
Commit db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq
context") added a second context that may perform a network send.
This means that now RX and TX are not serialized in nvme_tcp_io_work
and can run concurrently.
While there is correct mutual exclusion in the TX path (where
the send_mutex protect the queue socket send activity) RX activity,
and more specifically request completion may run concurrently.
This means we must guarantee that any mutation of the request state
related to its lifetime, bytes sent must not be accessed when a completion
may have possibly arrived back (and processed).
The race may trigger when a request completion arrives, processed
_and_ reused as a fresh new request, exactly in the (relatively short)
window between the last data payload sent and before the request iov_iter
is advanced.
Consider the following race:
1. 16K write request is queued
2. The nvme command and the data is sent to the controller (in-capsule
or solicited by r2t)
3. After the last payload is sent but before the req.iter is advanced,
the controller sends back a completion.
4. The completion is processed, the request is completed, and reused
to transfer a new request (write or read)
5. The new request is queued, and the driver reset the request parameters
(nvme_tcp_setup_cmd_pdu).
6. Now context in (2) resumes execution and advances the req.iter
==> use-after-completion as this is already a new request.
Fix this by making sure the request is not advanced after the last
data payload send, knowing that a completion may have arrived already.
An alternative solution would have been to delay the request completion
or state change waiting for reference counting on the TX path, but besides
adding atomic operations to the hot-path, it may present challenges in
multi-stage R2T scenarios where a r2t handler needs to be deferred to
an async execution.
Wu Bo [Wed, 19 May 2021 05:01:09 +0000 (13:01 +0800)]
nvmet: fix memory leak in nvmet_alloc_ctrl()
When creating ctrl in nvmet_alloc_ctrl(), if the cntlid_min is larger
than cntlid_max of the subsystem, and jumps to the
"out_free_changed_ns_list" label, but the ctrl->sqs lack of be freed.
Fix this by jumping to the "out_free_sqs" label.
signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
With the addition of ssi_perf_data and ssi_perf_type struct signalfd_siginfo
is dangerously close to running out of space. All that remains is just
enough space for two additional 64bit fields. A practice of adding all
possible siginfo_t fields into struct singalfd_siginfo can not be supported
as adding the missing fields ssi_lower, ssi_upper, and ssi_pkey would
require two 64bit fields and one 32bit fields. In practice the fields
ssi_perf_data and ssi_perf_type can never be used by signalfd as the signal
that generates them always delivers them synchronously to the thread that
triggers them.
Therefore until someone actually needs the fields ssi_perf_data and
ssi_perf_type in signalfd_siginfo remove them. This leaves a bit more room
for future expansion.
signal: Deliver all of the siginfo perf data in _perf
Don't abuse si_errno and deliver all of the perf data in _perf member
of siginfo_t.
Note: The data field in the perf data structures in a u64 to allow a
pointer to be encoded without needed to implement a 32bit and 64bit
version of the same structure. There already exists a 32bit and 64bit
versions siginfo_t, and the 32bit version can not include a 64bit
member as it only has 32bit alignment. So unsigned long is used in
siginfo_t instead of a u64 as unsigned long can encode a pointer on
all architectures linux supports.
Separate filling in siginfo for TRAP_PERF from deciding that
siginal needs to be sent.
There are enough little details that need to be correct when
properly filling in siginfo_t that it is easy to make mistakes
if filling in the siginfo_t is in the same function with other
logic. So factor out force_sig_perf to reduce the cognative
load of on reviewers, maintainers and implementors.
Now that si_trapno is part of the union in _si_fault and available on
all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to
return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid.
There is room for future changes to reduce when si_trapno is valid but
this is all that is needed to make si_trapno and the other members of
the the union in _sigfault mutually exclusive.
Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO
and have the same code ignore si_trapno in in all other cases.
siginfo: Move si_trapno inside the union inside _si_fault
It turns out that linux uses si_trapno very sparingly, and as such it
can be considered extra information for a very narrow selection of
signals, rather than information that is present with every fault
reported in siginfo.
As such move si_trapno inside the union inside of _si_fault. This
results in no change in placement, and makes it eaiser
to extend _si_fault in the future as this reduces the number of
special cases. In particular with si_trapno included in the union it
is no longer a concern that the union must be pointer aligned on most
architectures because the union follows immediately after si_addr
which is a pointer.
This change results in a difference in siginfo field placement on
sparc and alpha for the fields si_addr_lsb, si_lower, si_upper,
si_pkey, and si_perf. These architectures do not implement the
signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and
si_perf. Further these architecture have not yet implemented the
userspace that would use si_perf.
The point of this change is in fact to correct these placement issues
before sparc or alpha grow userspace that cares. This change was
discussed[1] and the agreement is that this change is currently safe.
Arnd Bergmann [Fri, 14 May 2021 14:00:08 +0000 (16:00 +0200)]
kcsan: Fix debugfs initcall return type
clang with CONFIG_LTO_CLANG points out that an initcall function should
return an 'int' due to the changes made to the initcall macros in commit 3578ad11f3fb ("init: lto: fix PREL32 relocations"):
kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int'
late_initcall(kcsan_debugfs_init);
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
include/linux/init.h:292:46: note: expanded from macro 'late_initcall'
#define late_initcall(fn) __define_initcall(fn, 7)
Shay Drory [Tue, 11 May 2021 05:48:28 +0000 (08:48 +0300)]
RDMA/core: Don't access cm_id after its destruction
restrack should only be attached to a cm_id while the ID has a valid
device pointer. It is set up when the device is first loaded, but not
cleared when the device is removed. There is also two copies of the device
pointer, one private and one in the public API, and these were left out of
sync.
Make everything go to NULL together and manipulate restrack right around
the device assignments.
Found by syzcaller:
BUG: KASAN: wild-memory-access in __list_del include/linux/list.h:112 [inline]
BUG: KASAN: wild-memory-access in __list_del_entry include/linux/list.h:135 [inline]
BUG: KASAN: wild-memory-access in list_del include/linux/list.h:146 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_listens drivers/infiniband/core/cma.c:1767 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_operation drivers/infiniband/core/cma.c:1795 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_operation+0x1f4/0x4b0 drivers/infiniband/core/cma.c:1783
Write of size 8 at addr dead000000000108 by task syz-executor716/334
Zqiang [Mon, 17 May 2021 03:40:05 +0000 (11:40 +0800)]
locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal
When a interruptible mutex locker is interrupted by a signal
without acquiring this lock and removed from the wait queue.
if the mutex isn't contended enough to have a waiter
put into the wait queue again, the setting of the WAITER
bit will force mutex locker to go into the slowpath to
acquire the lock every time, so if the wait queue is empty,
the WAITER bit need to be clear.
Leo Yan [Wed, 12 May 2021 12:09:37 +0000 (20:09 +0800)]
locking/lockdep: Correct calling tracepoints
The commit eb1f00237aca ("lockdep,trace: Expose tracepoints") reverses
tracepoints for lock_contended() and lock_acquired(), thus the ftrace
log shows the wrong locking sequence that "acquired" event is prior to
"contended" event:
Like Xu [Fri, 30 Apr 2021 05:22:47 +0000 (13:22 +0800)]
perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:
[] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
[] Call Trace:
[] kmem_cache_alloc+0x36/0x250
[] intel_pmu_lbr_add+0x152/0x170
[] x86_pmu_add+0x83/0xd0
Make it symmetric with the release_lbr_buffers() call and mirror the
existing DS buffers.
Like Xu [Fri, 30 Apr 2021 05:22:46 +0000 (13:22 +0800)]
perf/x86: Avoid touching LBR_TOS MSR for Arch LBR
The Architecture LBR does not have MSR_LBR_TOS (0x000001c9).
In a guest that should support Architecture LBR, check_msr()
will be a non-related check for the architecture MSR 0x0
(IA32_P5_MC_ADDR) that is also not supported by KVM.
The failure will cause x86_pmu.lbr_nr = 0, thereby preventing
the initialization of the guest Arch LBR. Fix it by avoiding
this extraneous check in intel_pmu_init() for Arch LBR.
Takashi Sakamoto [Tue, 18 May 2021 01:26:12 +0000 (10:26 +0900)]
ALSA: dice: fix stream format for TC Electronic Konnekt Live at high sampling transfer frequency
At high sampling transfer frequency, TC Electronic Konnekt Live
transfers/receives 6 audio data frames in multi bit linear audio data
channel of data block in CIP payload. Current hard-coded stream format
is wrong.
Although many DICE-based devices have a quirk at high sampling transfer
frequency to multiplex double number of PCM frames into data block than
the number in IEC 61883-1/6, the above devices are just compliant to
IEC 61883-1/6.
This commit disables the mode of double_pcm_frames for the models.
Tom Lendacky [Mon, 17 May 2021 17:42:33 +0000 (12:42 -0500)]
x86/sev-es: Invalidate the GHCB after completing VMGEXIT
Since the VMGEXIT instruction can be issued from userspace, invalidate
the GHCB after performing VMGEXIT processing in the kernel.
Invalidation is only required after userspace is available, so call
vc_ghcb_invalidate() from sev_es_put_ghcb(). Update vc_ghcb_invalidate()
to additionally clear the GHCB exit code so that it is always presented
as 0 when VMGEXIT has been issued by anything else besides the kernel.
Tom Lendacky [Mon, 17 May 2021 17:42:32 +0000 (12:42 -0500)]
x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
Move the location of sev_es_put_ghcb() in preparation for an update to it
in a follow-on patch. This will better highlight the changes being made
to the function.
Rob Herring [Mon, 10 May 2021 20:45:24 +0000 (15:45 -0500)]
dt-bindings: More removals of type references on common properties
Users of common properties shouldn't have a type definition as the
common schemas already have one. A few new ones slipped in and
*-names was missed in the last clean-up pass. Drop all the unnecessary
type references in the tree.
Rob Herring [Mon, 10 May 2021 20:35:14 +0000 (15:35 -0500)]
dt-bindings: media: renesas,drif: Use graph schema
Convert the renesas,drif binding schema to use the graph schema. The
binding referred to video-interfaces.txt, but it doesn't actually use any
properties from it as 'sync-active' is a custom property. As 'sync-active'
is custom, it needs a type definition.
Thomas Gleixner [Mon, 17 May 2021 17:57:47 +0000 (19:57 +0200)]
Merge tag 'irqchip-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip fixes from Marc Zyngier:
- Fix PXA Mainstone CPLD irq allocation in legacy mode
- Restrict the Apple AIC controller to the Apple platform
- Remove a few supperfluous messages on devm_ioremap_resource() failure
Linus Torvalds [Mon, 17 May 2021 16:55:10 +0000 (09:55 -0700)]
Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- fix fiemap to print extents that could get misreported due to
internal extent splitting and logical merging for fiemap output
- fix RCU stalls during delayed iputs
- fix removed dentries still existing after log is synced"
* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removed dentries still existing after log is synced
btrfs: return whole extents in fiemap
btrfs: avoid RCU stalls while running delayed iputs
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
Leon Romanovsky [Tue, 11 May 2021 05:48:31 +0000 (08:48 +0300)]
RDMA/rxe: Return CQE error if invalid lkey was supplied
RXE is missing update of WQE status in LOCAL_WRITE failures. This caused
the following kernel panic if someone sent an atomic operation with an
explicitly wrong lkey.
Maor Gottlieb [Tue, 11 May 2021 05:48:29 +0000 (08:48 +0300)]
RDMA/mlx5: Recover from fatal event in dual port mode
When there is fatal event on the slave port, the device is marked as not
active. We need to mark it as active again when the slave is recovered to
regain full functionality.
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1230 Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Kevin Hilman [Tue, 11 May 2021 19:00:54 +0000 (12:00 -0700)]
MAINTAINERS: ARM/Amlogic SoCs: add Neil as primary maintainer
Add Neil as primary maintainer for the Amlogic family of Arm SoCs. I
will now act as co-maintainer.
Neil is already doing lots of the reviewing, testing and behind the
scenes support for users of the upstream kernel on these SoCs, so this
is just to formalize the current state of affairs.
Thanks Neil for all of your efforts, and keep up the great work!
We can get -EIO or any number of legitimate errors from
btrfs_search_slot(), panicing here is not the appropriate response. The
error path for this code handles errors properly, simply return the
error.
Filipe Manana [Fri, 14 May 2021 09:03:40 +0000 (10:03 +0100)]
btrfs: release path before starting transaction when cloning inline extent
When cloning an inline extent there are a few cases, such as when we have
an implicit hole at file offset 0, where we start a transaction while
holding a read lock on a leaf. Starting the transaction results in a call
to sb_start_intwrite(), which results in doing a read lock on a percpu
semaphore. Lockdep doesn't like this and complains about it:
This should be a false positive, as both locks are acquired in read mode.
Nevertheless, we don't need to hold a leaf locked when we start the
transaction, so just release the leaf (path) before starting it.
Pavel Begunkov [Mon, 17 May 2021 11:43:34 +0000 (12:43 +0100)]
io_uring: don't modify req->poll for rw
__io_queue_proc() is used by both poll and apoll, so we should not
access req->poll directly but selecting right struct io_poll_iocb
depending on use case.
Jan Kara [Mon, 17 May 2021 12:39:56 +0000 (14:39 +0200)]
quota: Disable quotactl_path syscall
In commit fa8b90070a80 ("quota: wire up quotactl_path") we have wired up
new quotactl_path syscall. However some people in LWN discussion have
objected that the path based syscall is missing dirfd and flags argument
which is mostly standard for contemporary path based syscalls. Indeed
they have a point and after a discussion with Christian Brauner and
Sascha Hauer I've decided to disable the syscall for now and update its
API. Since there is no userspace currently using that syscall and it
hasn't been released in any major release, we should be fine.
Zhen Lei [Tue, 11 May 2021 11:27:33 +0000 (19:27 +0800)]
drm/exynos/decon5433: Remove redundant error printing in exynos5433_decon_probe()
When devm_ioremap_resource() fails, a clear enough error message will be
printed by its subfunction __devm_ioremap_resource(). The error
information contains the device name, failure cause, and possibly resource
information.
Therefore, remove the error printing here to simplify code and reduce the
binary size.
Zhen Lei [Tue, 11 May 2021 09:40:04 +0000 (17:40 +0800)]
drm/exynos: Remove redundant error printing in exynos_dsi_probe()
When devm_ioremap_resource() fails, a clear enough error message will be
printed by its subfunction __devm_ioremap_resource(). The error
information contains the device name, failure cause, and possibly resource
information.
Therefore, remove the error printing here to simplify code and reduce the
binary size.
Correct the kerneldoc of fimd_shadow_protect_win() to fix W=1 warnings:
drivers/gpu/drm/exynos/exynos_drm_fimd.c:734: warning:
expecting prototype for shadow_protect_win(). Prototype was for fimd_shadow_protect_win() instead
Zhenyu Wang [Thu, 13 May 2021 08:39:02 +0000 (16:39 +0800)]
drm/i915/gvt: Move mdev attribute groups into kvmgt module
As kvmgt module contains all handling for VFIO/mdev, leaving mdev attribute
groups in gvt module caused dependency issue. Although it was there for possible
other hypervisor usage, that turns out never to be true. So this moves all mdev
handling into kvmgt module completely to resolve dependency issue.
With this fix, no config workaround is required. So revert previous workaround
commits: adaeb718d46f ("vfio/gvt: fix DRM_I915_GVT dependency on VFIO_MDEV")
and 07e543f4f9d1 ("vfio/gvt: Make DRM_I915_GVT depend on VFIO_MDEV").
Jessica Yu [Wed, 12 May 2021 13:45:46 +0000 (15:45 +0200)]
module: check for exit sections in layout_sections() instead of module_init_section()
Previously, when CONFIG_MODULE_UNLOAD=n, the module loader just does not
attempt to load exit sections since it never expects that any code in those
sections will ever execute. However, dynamic code patching (alternatives,
jump_label and static_call) can have sites in __exit code, even if __exit is
never executed. Therefore __exit must be present at runtime, at least for as
long as __init code is.
Commit 33121347fb1c ("module: treat exit sections the same as init
sections when !CONFIG_MODULE_UNLOAD") solves the requirements of
jump_labels and static_calls by putting the exit sections in the init
region of the module so that they are at least present at init, and
discarded afterwards. It does this by including a check for exit
sections in module_init_section(), so that it also returns true for exit
sections, and the module loader will automatically sort them in the init
region of the module.
However, the solution there was not completely arch-independent. ARM is
a special case where it supplies its own module_{init, exit}_section()
functions. Instead of pushing the exit section checks into
module_init_section(), just implement the exit section check in
layout_sections(), so that we don't have to touch arch-dependent code.
Fixes: 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD") Reviewed-by: Russell King (Oracle) <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
wenhuizhang [Thu, 13 May 2021 16:55:16 +0000 (12:55 -0400)]
cifs: remove deadstore in cifs_close_all_deferred_files()
Deadstore detected by Lukas Bulwahn's CodeChecker Tool (ELISA group).
line 741 struct cifsInodeInfo *cinode;
line 747 cinode = CIFS_I(d_inode(cfile->dentry));
could be deleted.
cinode on filesystem should not be deleted when files are closed,
they are representations of some data fields on a physical disk,
thus no further action is required.
The virtual inode on vfs will be handled by vfs automatically,
and the denotation is inode, which is different from the cinode.
Michal Kubecek [Sat, 15 May 2021 10:11:13 +0000 (12:11 +0200)]
kbuild: dummy-tools: adjust to stricter stackprotector check
Commit 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular
percpu variable") modified the stackprotector check on 32-bit x86 to check
if gcc supports using %fs as canary. Adjust dummy-tools gcc script to pass
this new test by returning "%fs" rather than "%gs" if it detects
-mstack-protector-guard-reg=fs on command line.
Fixes: 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable") Signed-off-by: Michal Kubecek <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
Darrick J. Wong [Sun, 9 May 2021 23:22:54 +0000 (16:22 -0700)]
xfs: adjust rt allocation minlen when extszhint > rtextsize
xfs_bmap_rtalloc doesn't handle realtime extent files with extent size
hints larger than the rt volume's extent size properly, because
xfs_bmap_extsize_align can adjust the offset/length parameters to try to
fit the extent size hint.
Under these conditions, minlen has to be large enough so that any
allocation returned by xfs_rtallocate_extent will be large enough to
cover at least one of the blocks that the caller asked for. If the
allocation is too short, bmapi_write will return no mapping for the
requested range, which causes ENOSPC errors in other parts of the
filesystem.
Therefore, adjust minlen upwards to fix this. This can be found by
running generic/263 (g/127 or g/522) with a realtime extent size hint
that's larger than the rt volume extent size.
Linus Torvalds [Sun, 16 May 2021 17:13:14 +0000 (10:13 -0700)]
Merge tag 'driver-core-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two driver fixes for driver core changes that happened in
5.13-rc1.
The clk driver fix resolves a many-reported issue with booting some
devices, and the USB typec fix resolves the reported problem of USB
systems on some embedded boards.
Both of these have been in linux-next this week with no reported
issues"
* tag 'driver-core-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
clk: Skip clk provider registration when np is NULL
usb: typec: tcpm: Don't block probing of consumers of "connector" nodes
Linus Torvalds [Sun, 16 May 2021 16:55:05 +0000 (09:55 -0700)]
Merge tag 'usb-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB fixes for 5.13-rc2. They consist of a number
of resolutions for reported issues:
- typec fixes for found problems
- xhci fixes and quirk additions
- dwc3 driver fixes
- minor fixes found by Coverity
- cdc-wdm fixes for reported problems
All of these have been in linux-next for a few days with no reported
issues"
* tag 'usb-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (28 commits)
usb: core: hub: fix race condition about TRSMRCY of resume
usb: typec: tcpm: Fix SINK_DISCOVERY current limit for Rp-default
xhci: Add reset resume quirk for AMD xhci controller.
usb: xhci: Increase timeout for HC halt
xhci: Do not use GFP_KERNEL in (potentially) atomic context
xhci: Fix giving back cancelled URBs even if halted endpoint can't reset
xhci-pci: Allow host runtime PM as default for Intel Alder Lake xHCI
usb: musb: Fix an error message
usb: typec: tcpm: Fix wrong handling for Not_Supported in VDM AMS
usb: typec: tcpm: Send DISCOVER_IDENTITY from dedicated work
usb: typec: ucsi: Retrieve all the PDOs instead of just the first 4
usb: fotg210-hcd: Fix an error message
docs: usb: function: Modify path name
usb: dwc3: omap: improve extcon initialization
usb: typec: ucsi: Put fwnode in any case during ->probe()
usb: typec: tcpm: Fix wrong handling in GET_SINK_CAP
usb: dwc2: Remove obsolete MODULE_ constants from platform.c
usb: dwc3: imx8mp: fix error return code in dwc3_imx8mp_probe()
usb: dwc3: imx8mp: detect dwc3 core node via compatible string
usb: dwc3: gadget: Return success always for kick transfer in ep queue
...
Linus Torvalds [Sun, 16 May 2021 16:42:13 +0000 (09:42 -0700)]
Merge tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for timers:
- Use the ALARM feature check in the alarmtimer core code insted of
the old method of checking for the set_alarm() callback.
Drivers can have that callback set but the feature bit cleared. If
such a RTC device is selected then alarms wont work.
- Use a proper define to let the preprocessor check whether Hyper-V
VDSO clocksource should be active.
The code used a constant in an enum with #ifdef, which evaluates to
always false and disabled the clocksource for VDSO"
* tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86
alarmtimer: Check RTC features instead of ops
Linus Torvalds [Sun, 16 May 2021 16:39:04 +0000 (09:39 -0700)]
Merge tag 'for-linus-5.13b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- two patches for error path fixes
- a small series for fixing a regression with swiotlb with Xen on Arm
* tag 'for-linus-5.13b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/swiotlb: check if the swiotlb has already been initialized
arm64: do not set SWIOTLB_NO_FORCE when swiotlb is required
xen/arm: move xen_swiotlb_detect to arm/swiotlb-xen.h
xen/unpopulated-alloc: fix error return code in fill_list()
xen/gntdev: fix gntdev_mmap() error exit path
Linus Torvalds [Sun, 16 May 2021 16:31:06 +0000 (09:31 -0700)]
Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The three SEV commits are not really urgent material. But we figured
since getting them in now will avoid a huge amount of conflicts
between future SEV changes touching tip, the kvm and probably other
trees, sending them to you now would be best.
The idea is that the tip, kvm etc branches for 5.14 will all base
ontop of -rc2 and thus everything will be peachy. What is more, those
changes are purely mechanical and defines movement so they should be
fine to go now (famous last words).
Summary:
- Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development"
* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/compressed: Enable -Wundef
x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
x86/sev-es: Rename sev-es.{ch} to sev.{ch}
Takashi Iwai [Sun, 16 May 2021 16:17:55 +0000 (18:17 +0200)]
ALSA: intel8x0: Don't update period unless prepared
The interrupt handler of intel8x0 calls snd_intel8x0_update() whenever
the hardware sets the corresponding status bit for each stream. This
works fine for most cases as long as the hardware behaves properly.
But when the hardware gives a wrong bit set, this leads to a zero-
division Oops, and reportedly, this seems what happened on a VM.
For fixing the crash, this patch adds a internal flag indicating that
the stream is ready to be updated, and check it (as well as the flag
being in suspended) to ignore such spurious update.
Zhen Lei [Tue, 11 May 2021 12:54:28 +0000 (20:54 +0800)]
irqchip: Remove redundant error printing
When devm_ioremap_resource() fails, a clear enough error message will be
printed by its subfunction __devm_ioremap_resource(). The error
information contains the device name, failure cause, and possibly resource
information.
Therefore, remove the error printing here to simplify code and reduce the
binary size.
Ajish Koshy [Wed, 5 May 2021 12:01:03 +0000 (17:31 +0530)]
scsi: pm80xx: Fix drives missing during rmmod/insmod loop
When driver is loaded after rmmod some drives are not showing up during
discovery.
SATA drives are directly attached to the controller connected phys. During
device discovery, the IDENTIFY command (qc timeout (cmd 0xec)) is timing out
during revalidation. This will trigger abort from host side and controller
successfully aborts the command and returns success. Post this successful
abort response ATA library decides to mark the disk as NODEV.
To overcome this, inside pm8001_scan_start() after phy_start() call, add get
start response and wait for few milliseconds to trigger next phy start.
This millisecond delay will give sufficient time for the controller state
machine to accept next phy start.
Linus Torvalds [Sat, 15 May 2021 17:24:48 +0000 (10:24 -0700)]
Merge tag 'sched-urgent-2021-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix an idle CPU selection bug, and an AMD Ryzen maximum frequency
enumeration bug"
* tag 'sched-urgent-2021-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()
Linus Torvalds [Sat, 15 May 2021 16:42:27 +0000 (09:42 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"13 patches.
Subsystems affected by this patch series: resource, squashfs, hfsplus,
modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan,
pagemap, and ioremap)"
* emailed patches from Andrew Morton <[email protected]>:
mm/ioremap: fix iomap_max_page_shift
docs: admin-guide: update description for kernel.modprobe sysctl
hfsplus: prevent corruption in shrinking truncate
mm/filemap: fix readahead return types
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
mm: fix struct page layout on 32-bit systems
ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"
userfaultfd: release page in error path to avoid BUG_ON
squashfs: fix divide error in calculate_skip()
kernel/resource: fix return code check in __request_free_mem_region
mm, slub: move slub_debug static key enabling outside slab_mutex
mm/hugetlb: fix cow where page writtable in child
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
Linus Torvalds [Sat, 15 May 2021 16:01:45 +0000 (09:01 -0700)]
Merge tag 'arc-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- PAE fixes
- syscall num check off-by-one bug
- misc fixes
* tag 'arc-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: mm: Use max_high_pfn as a HIGHMEM zone border
ARC: mm: PAE: use 40-bit physical page mask
ARC: entry: fix off-by-one error in syscall number validation
ARC: kgdb: add 'fallthrough' to prevent a warning
arc: Fix typos/spellos
Linus Torvalds [Sat, 15 May 2021 15:52:30 +0000 (08:52 -0700)]
Merge tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Fix for shared tag set exit (Bart)
- Correct ioctl range for zoned ioctls (Damien)
- Removed dead/unused function (Lin)
- Fix perf regression for shared tags (Ming)
- Fix out-of-bounds issue with kyber and preemption (Omar)
- BFQ merge fix (Paolo)
- Two error handling fixes for nbd (Sun)
- Fix weight update in blk-iocost (Tejun)
- NVMe pull request (Christoph):
- correct the check for using the inline bio in nvmet (Chaitanya
Kulkarni)
- demote unsupported command warnings (Chaitanya Kulkarni)
- fix corruption due to double initializing ANA state (me, Hou Pu)
- reset ns->file when open fails (Daniel Wagner)
- fix a NULL deref when SEND is completed with error in nvmet-rdma
(Michal Kalderon)
- Fix kernel-doc warning (Bart)
* tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
block/partitions/efi.c: Fix the efi_partition() kernel-doc header
blk-mq: Swap two calls in blk_mq_exit_queue()
blk-mq: plug request for shared sbitmap
nvmet: use new ana_log_size instead the old one
nvmet: seset ns->file when open fails
nbd: share nbd_put and return by goto put_nbd
nbd: Fix NULL pointer in flush_workqueue
blkdev.h: remove unused codes blk_account_rq
block, bfq: avoid circular stable merges
blk-iocost: fix weight updates of inner active iocgs
nvmet: demote fabrics cmd parse err msg to debug
nvmet: use helper to remove the duplicate code
nvmet: demote discovery cmd parse err msg to debug
nvmet-rdma: Fix NULL deref when SEND is completed with error
nvmet: fix inline bio check for passthru
nvmet: fix inline bio check for bdev-ns
nvme-multipath: fix double initialization of ANA state
kyber: fix out of bounds access when preempted
block: uapi: fix comment about block device ioctl
Linus Torvalds [Sat, 15 May 2021 15:43:44 +0000 (08:43 -0700)]
Merge tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just a few minor fixes/changes:
- Fix issue with double free race for linked timeout completions
- Fix reference issue with timeouts
- Remove last few places that make SQPOLL special, since it's just an
io thread now.
- Bump maximum allowed registered buffers, as we don't allocate as
much anymore"
* tag 'io_uring-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
io_uring: increase max number of reg buffers
io_uring: further remove sqpoll limits on opcodes
io_uring: fix ltout double free on completion race
io_uring: fix link timeout refs
Linus Torvalds [Sat, 15 May 2021 15:37:21 +0000 (08:37 -0700)]
Merge tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"This mainly fixes 1 lcluster-sized pclusters for the big pcluster
feature, which can be forcely generated by mkfs as a specific on-disk
case for per-(sub)file compression strategies but missed to handle in
runtime properly.
Also, documentation updates are included to fix the broken
illustration due to the ReST conversion by accident and complete the
big pcluster introduction.
Summary:
- update documentation to fix the broken illustration due to ReST
conversion by accident at that time and complete the big pcluster
introduction
- fix 1 lcluster-sized pclusters for the big pcluster feature"
* tag 'erofs-for-5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix 1 lcluster-sized pcluster for big pcluster
erofs: update documentation about data compression
erofs: fix broken illustration in documentation
Linus Torvalds [Sat, 15 May 2021 15:32:51 +0000 (08:32 -0700)]
Merge tag 'libnvdimm-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A regression fix for a bootup crash condition introduced in this merge
window and some other minor fixups:
- Fix regression in ACPI NFIT table handling leading to crashes and
driver load failures.
- Move the nvdimm mailing list
- Miscellaneous minor fixups"
* tag 'libnvdimm-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
ACPI: NFIT: Fix support for variable 'SPA' structure size
MAINTAINERS: Move nvdimm mailing list
tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static
libnvdimm: Remove duplicate struct declaration
Linus Torvalds [Sat, 15 May 2021 15:28:08 +0000 (08:28 -0700)]
Merge tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A fix for a hang condition due to missed wakeups in the filesystem-dax
core when exercised by virtiofs.
This bug has been there from the beginning, but the condition has
not triggered on other filesystems since they hold a lock over
invalidation events"
* tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Wake up all waiters after invalidating dax entry
dax: Add a wakeup mode parameter to put_unlocked_entry()
dax: Add an enum for specifying dax wakup mode
Linus Torvalds [Sat, 15 May 2021 15:18:29 +0000 (08:18 -0700)]
Merge tag 'drm-fixes-2021-05-15' of git://anongit.freedesktop.org/drm/drm
Pull more drm fixes from Dave Airlie:
"Looks like I wasn't the only one not fully switched on this week. The
msm pull has a missing tag so I missed it, and i915 team were a bit
late. In my defence I did have a day with the roof of my home office
removed, so was sitting at my kids desk.
i915:
- Fix active callback alignment annotations and subsequent crashes
- Retract link training strategy to slow and wide, again
- Avoid division by zero on gen2
- Use correct width reads for C0DRB3/C1DRB3 registers
- Fix double free in pdp allocation failure path
- Fix HDMI 2.1 PCON downstream caps check"
* tag 'drm-fixes-2021-05-15' of git://anongit.freedesktop.org/drm/drm:
drm/i915: Use correct downstream caps for check Src-Ctl mode for PCON
drm/i915/overlay: Fix active retire callback alignment
drm/i915: Fix crash in auto_retire
drm/i915/gt: Fix a double free in gen8_preallocate_top_level_pdp
drm/i915: Read C0DRB3/C1DRB3 as 16 bits again
drm/i915: Avoid div-by-zero on gen2
drm/i915/dp: Use slow and wide link training for everything
drm/msm/dp: initialize audio_comp when audio starts
drm/msm/dp: check sink_count before update is_connected status
drm/msm: fix minor version to indicate MSM_PARAM_SUSPENDS support
drm/msm/dsi: fix msm_dsi_phy_get_clk_provider return code
drm/msm/dsi: dsi_phy_28nm_8960: fix uninitialized variable access
drm/msm: fix LLC not being enabled for mmu500 targets
drm/msm: Do not unpin/evict exported dma-buf's
syzbot is reporting OOB write at vga16fb_imageblit() [1], for
resize_screen() from ioctl(VT_RESIZE) returns 0 without checking whether
requested rows/columns fit the amount of memory reserved for the graphical
screen if current mode is KD_GRAPHICS.
Peter Zijlstra [Wed, 14 Apr 2021 12:45:43 +0000 (14:45 +0200)]
openrisc: Define memory barrier mb
This came up in the discussion of the requirements of qspinlock on an
architecture. OpenRISC uses qspinlock, but it was noticed that the
memmory barrier was not defined.
Peter defined it in the mail thread writing:
As near as I can tell this should do. The arch spec only lists
this one instruction and the text makes it sound like a completion
barrier.
API qedf_link_update() is getting called from QED but by that time
shost_data is not initialised. This results in a NULL pointer dereference
when we try to dereference shost_data while updating supported_speeds.
Add a NULL pointer check before dereferencing shost_data.
Christophe Leroy [Sat, 15 May 2021 00:27:39 +0000 (17:27 -0700)]
mm/ioremap: fix iomap_max_page_shift
iomap_max_page_shift is expected to contain a page shift, so it can't be a
'bool', has to be an 'unsigned int'
And fix the default values: P4D_SHIFT is when huge iomap is allowed.
However, on some architectures (eg: powerpc book3s/64), P4D_SHIFT is not a
constant so it can't be used to initialise a static variable. So,
initialise iomap_max_page_shift with a maximum shift supported by the
architecture, it is gated by P4D_SHIFT in vmap_try_huge_p4d() anyway.
Rasmus Villemoes [Sat, 15 May 2021 00:27:36 +0000 (17:27 -0700)]
docs: admin-guide: update description for kernel.modprobe sysctl
When I added CONFIG_MODPROBE_PATH, I neglected to update Documentation/.
It's still true that this defaults to /sbin/modprobe, but now via a level
of indirection. So document that the kernel might have been built with
something other than /sbin/modprobe as the initial value.
Jouni Roivas [Sat, 15 May 2021 00:27:33 +0000 (17:27 -0700)]
hfsplus: prevent corruption in shrinking truncate
I believe there are some issues introduced by commit 31651c607151
("hfsplus: avoid deadlock on file truncation")
HFS+ has extent records which always contains 8 extents. In case the
first extent record in catalog file gets full, new ones are allocated from
extents overflow file.
In case shrinking truncate happens to middle of an extent record which
locates in extents overflow file, the logic in hfsplus_file_truncate() was
changed so that call to hfs_brec_remove() is not guarded any more.
Right action would be just freeing the extents that exceed the new size
inside extent record by calling hfsplus_free_extents(), and then check if
the whole extent record should be removed. However since the guard
(blk_cnt > start) is now after the call to hfs_brec_remove(), this has
unfortunate effect that the last matching extent record is removed
unconditionally.
To reproduce this issue, create a file which has at least 10 extents, and
then perform shrinking truncate into middle of the last extent record, so
that the number of remaining extents is not under or divisible by 8. This
causes the last extent record (8 extents) to be removed totally instead of
truncating into middle of it. Thus this causes corruption, and lost data.
Fix for this is simply checking if the new truncated end is below the
start of this extent record, making it safe to remove the full extent
record. However call to hfs_brec_remove() can't be moved to it's previous
place since we're dropping ->tree_lock and it can cause a race condition
and the cached info being invalidated possibly corrupting the node data.
Another issue is related to this one. When entering into the block
(blk_cnt > start) we are not holding the ->tree_lock. We break out from
the loop not holding the lock, but hfs_find_exit() does unlock it. Not
sure if it's possible for someone else to take the lock under our feet,
but it can cause hard to debug errors and premature unlocking. Even if
there's no real risk of it, the locking should still always be kept in
balance. Thus taking the lock now just before the check.
A readahead request will not allocate more memory than can be represented
by a size_t, even on systems that have HIGHMEM available. Change the
length functions from returning an loff_t to a size_t.
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
These tests deliberately access these arrays out of bounds, which will
cause the dynamic local bounds checks inserted by
CONFIG_UBSAN_LOCAL_BOUNDS to fail and panic the kernel. To avoid this
problem, access the arrays via volatile pointers, which will prevent the
compiler from being able to determine the array bounds.
These accesses use volatile pointers to char (char *volatile) rather than
the more conventional pointers to volatile char (volatile char *) because
we want to prevent the compiler from making inferences about the pointer
itself (i.e. its array bounds), not the data that it refers to.
32-bit architectures which expect 8-byte alignment for 8-byte integers and
need 64-bit DMA addresses (arm, mips, ppc) had their struct page
inadvertently expanded in 2019. When the dma_addr_t was added, it forced
the alignment of the union to 8 bytes, which inserted a 4 byte gap between
'flags' and the union.
Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long. We always
store the low bits in the first word to prevent the PageTail bit from
being inadvertently set on a big endian platform. If that happened,
get_user_pages_fast() racing against a page which was freed and
reallocated to the page_pool could dereference a bogus compound_head(),
which would be hard to trace back to this cause.
Hugh Dickins [Sat, 15 May 2021 00:27:22 +0000 (17:27 -0700)]
ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"
This reverts commit 3e96b6a2e9ad929a3230a22f4d64a74671a0720b. General
Protection Fault in rmap_walk_ksm() under memory pressure:
remove_rmap_item_from_tree() needs to take page lock, of course.
Axel Rasmussen [Sat, 15 May 2021 00:27:19 +0000 (17:27 -0700)]
userfaultfd: release page in error path to avoid BUG_ON
Consider the following sequence of events:
1. Userspace issues a UFFD ioctl, which ends up calling into
shmem_mfill_atomic_pte(). We successfully account the blocks, we
shmem_alloc_page(), but then the copy_from_user() fails. We return
-ENOENT. We don't release the page we allocated.
2. Our caller detects this error code, tries the copy_from_user() after
dropping the mmap_lock, and retries, calling back into
shmem_mfill_atomic_pte().
3. Meanwhile, let's say another process filled up the tmpfs being used.
4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
immediately returns - without releasing the page.
This triggers a BUG_ON in our caller, which asserts that the page
should always be consumed, unless -ENOENT is returned.
To fix this, detect if we have such a "dangling" page when accounting
fails, and if so, release it before returning.
Phillip Lougher [Sat, 15 May 2021 00:27:16 +0000 (17:27 -0700)]
squashfs: fix divide error in calculate_skip()
Sysbot has reported a "divide error" which has been identified as being
caused by a corrupted file_size value within the file inode. This value
has been corrupted to a much larger value than expected.
Calculate_skip() is passed i_size_read(inode) >> msblk->block_log. Due to
the file_size value corruption this overflows the int argument/variable in
that function, leading to the divide error.
This patch changes the function to use u64. This will accommodate any
unexpectedly large values due to corruption.
The value returned from calculate_skip() is clamped to be never more than
SQUASHFS_CACHED_BLKS - 1, or 7. So file_size corruption does not lead to
an unexpectedly large return result here.
Alistair Popple [Sat, 15 May 2021 00:27:13 +0000 (17:27 -0700)]
kernel/resource: fix return code check in __request_free_mem_region
Splitting an earlier version of a patch that allowed calling
__request_region() while holding the resource lock into a series of
patches required changing the return code for the newly introduced
__request_region_locked().
Unfortunately this change was not carried through to a subsequent commit 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region")
in the series. This resulted in a use-after-free due to freeing the
struct resource without properly releasing it. Fix this by correcting the
return code check so that the struct is not freed if the request to add it
was successful.
Vlastimil Babka [Sat, 15 May 2021 00:27:10 +0000 (17:27 -0700)]
mm, slub: move slub_debug static key enabling outside slab_mutex
Paul E. McKenney reported [1] that commit 1f0723a4c0df ("mm, slub: enable
slub_debug static key when creating cache with explicit debug flags")
results in the lockdep complaint:
======================================================
WARNING: possible circular locking dependency detected
5.12.0+ #15 Not tainted
------------------------------------------------------
rcu_torture_sta/109 is trying to acquire lock: ffffffff96063cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20
but task is already holding lock: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
This is because there's one order of locking from the hotplug callbacks:
lock(cpu_hotplug_lock); // from hotplug machinery itself
lock(slab_mutex); // in e.g. slab_mem_going_offline_callback()
And commit 1f0723a4c0df made the reverse sequence possible:
lock(slab_mutex); // in kmem_cache_create_usercopy()
lock(cpu_hotplug_lock); // kmem_cache_open() -> static_key_enable()
The simplest fix is to move static_key_enable() to a place before slab_mutex is
taken. That means kmem_cache_create_usercopy() in mm/slab_common.c which is not
ideal for SLUB-specific code, but the #ifdef CONFIG_SLUB_DEBUG makes it
at least self-contained and obvious.
Peter Xu [Sat, 15 May 2021 00:27:07 +0000 (17:27 -0700)]
mm/hugetlb: fix cow where page writtable in child
When rework early cow of pinned hugetlb pages, we moved huge_ptep_get()
upper but overlooked a side effect that the huge_ptep_get() will fetch the
pte after wr-protection. After moving it upwards, we need explicit
wr-protect of child pte or we will keep the write bit set in the child
process, which could cause data corrution where the child can write to the
original page directly.
This issue can also be exposed by "memfd_test hugetlbfs" kselftest.
Peter Xu [Sat, 15 May 2021 00:27:04 +0000 (17:27 -0700)]
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.
Hugh reported issue with F_SEAL_FUTURE_WRITE not applied correctly to
hugetlbfs, which I can easily verify using the memfd_test program, which
seems that the program is hardly run with hugetlbfs pages (as by default
shmem).
Meanwhile I found another probably even more severe issue on that hugetlb
fork won't wr-protect child cow pages, so child can potentially write to
parent private pages. Patch 2 addresses that.
After this series applied, "memfd_test hugetlbfs" should start to pass.
This patch (of 2):
F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.
Bart Van Assche [Thu, 13 May 2021 16:49:12 +0000 (09:49 -0700)]
scsi: ufs: core: Increase the usable queue depth
With the current implementation of the UFS driver active_queues is 1
instead of 0 if all UFS request queues are idle. That causes
hctx_may_queue() to divide the queue depth by 2 when queueing a request and
hence reduces the usable queue depth.
The shared tag set code in the block layer keeps track of the number of
active request queues. blk_mq_tag_busy() is called before a request is
queued onto a hwq and blk_mq_tag_idle() is called some time after the hwq
became idle. blk_mq_tag_idle() is called from inside blk_mq_timeout_work().
Hence, blk_mq_tag_idle() is only called if a timer is associated with each
request that is submitted to a request queue that shares a tag set with
another request queue.
Adds a blk_mq_start_request() call in ufshcd_exec_dev_cmd(). This doubles
the queue depth on my test setup from 16 to 32.
In addition to increasing the usable queue depth, also fix the
documentation of the 'timeout' parameter in the header above
ufshcd_exec_dev_cmd().
Matt Wang [Tue, 11 May 2021 03:04:37 +0000 (03:04 +0000)]
scsi: BusLogic: Fix 64-bit system enumeration error for Buslogic
Commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit")
introduced a serious issue for 64-bit systems. With this commit,
64-bit kernel will enumerate 8*15 non-existing disks. This is caused
by the broken CCB structure. The change from u32 data to void *data
increased CCB length on 64-bit system, which introduced an extra 4
byte offset of the CDB. This leads to incorrect response to INQUIRY
commands during enumeration.
Fix disk enumeration failure by reverting the portion of the commit
above which switched the data pointer from u32 to void.
Peter Wang [Wed, 12 May 2021 10:01:45 +0000 (18:01 +0800)]
scsi: ufs: ufs-mediatek: Fix power down spec violation
As per spec, e.g. JESD220E chapter 7.2, while powering off the UFS device,
RST_N signal should be between VSS(Ground) and VCCQ/VCCQ2. The power down
sequence after fixing:
Linus Torvalds [Fri, 14 May 2021 20:44:51 +0000 (13:44 -0700)]
Merge tag 'trace-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix trace_check_vprintf() for %.*s
The sanity check of all strings being read from the ring buffer to
make sure they are in safe memory space did not account for the %.*s
notation having another parameter to process (the length).
Add that to the check"
* tag 'trace-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Handle %.*s in trace_check_vprintf()