events: Reuse value read using READ_ONCE instead of re-reading it
In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
Pavel Begunkov [Wed, 15 Sep 2021 10:00:05 +0000 (11:00 +0100)]
io_uring: move iopoll reissue into regular IO path
230d50d448acb ("io_uring: move reissue into regular IO path")
made non-IOPOLL I/O to not retry from ki_complete handler. Follow it
steps and do the same for IOPOLL. Same problems, same implementation,
same -EAGAIN assumptions.
Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.
We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.
Merge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.15
- fix ANA state updates when a namespace is not present (Anton Eidelman)
- nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
(Dan Carpenter)
- avoid race in shutdown namespace removal (Daniel Wagner)
- fix io_work priority inversion in nvme-tcp (Keith Busch)
- destroy cm id before destroy qp to avoid use after free (Ruozhu Li)"
* tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme:
nvme-tcp: fix io_work priority inversion
nvme-rdma: destroy cm id before destroy qp to avoid use after free
nvme-multipath: fix ANA state updates when a namespace is not present
nvme: avoid race in shutdown namespace removal
nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
Some PHYs pointed to by "phy-handle" will never bind to a driver until a
consumer attaches to it. And when the consumer attaches to it, they get
forcefully bound to a generic PHY driver. In such cases, parsing the
phy-handle property and creating a device link will prevent the consumer
from ever probing. We don't want that. So revert support for
"phy-handle" property until we come up with a better mechanism for
binding PHYs to generic drivers before a consumer tries to attach to it.
s390 is the only architecture which allows to set the
-mwarn-dynamicstack compile option. This however will also always
generate a warning with system call stack randomization, which uses
alloca to generate some random sized stack frame.
On the other hand Linus just enabled "-Werror" by default with commit 3fe617ccafd6 ("Enable '-Werror' by default for all kernel builds"),
which means compiles will always fail by default.
So instead of playing once again whack-a-mole for something which is
s390 specific, simply remove this option.
drivers/s390/crypto/ap_bus.c:216: warning:
bad line:
drivers/s390/crypto/ap_bus.c:444:
warning: Function parameter or member 'floating' not described in 'ap_interrupt_handler'
s390/pci_mmio: fully validate the VMA before calling follow_pte()
We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap").
find_vma() does not check if the address is >= the VMA start address;
use vma_lookup() instead.
Reviewed-by: Niklas Schnelle <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]>
powerpc/xics: Set the IRQ chip data for the ICS native backend
The ICS native driver relies on the IRQ chip data to find the struct
'ics_native' describing the ICS controller but it was removed by commit 248af248a8f4 ("powerpc/xics: Rename the map handler in a check handler").
Revert this change to fix the Microwatt SoC platform.
Jan Beulich [Tue, 7 Sep 2021 12:07:47 +0000 (14:07 +0200)]
swiotlb-xen: drop DEFAULT_NSLABS
It was introduced by 4035b43da6da ("xen-swiotlb: remove xen_set_nslabs")
and then not removed by 2d29960af0be ("swiotlb: dynamically allocate
io_tlb_default_mem").
Jan Beulich [Tue, 7 Sep 2021 12:07:21 +0000 (14:07 +0200)]
swiotlb-xen: arrange to have buffer info logged
I consider it unhelpful that address and size of the buffer aren't put
in the log file; it makes diagnosing issues needlessly harder. The
majority of callers of swiotlb_init() also passes 1 for the "verbose"
parameter.
Jan Beulich [Tue, 7 Sep 2021 12:06:55 +0000 (14:06 +0200)]
swiotlb-xen: drop leftover __ref
Commit a98f565462f0 ("xen-swiotlb: split xen_swiotlb_init") should not
only have added __init to the split off function, but also should have
dropped __ref from the one left.
Jan Beulich [Tue, 7 Sep 2021 12:06:37 +0000 (14:06 +0200)]
swiotlb-xen: limit init retries
Due to the use of max(1024, ...) there's no point retrying (and issuing
bogus log messages) when the number of slabs is already no larger than
this minimum value.
Jan Beulich [Tue, 7 Sep 2021 12:05:54 +0000 (14:05 +0200)]
swiotlb-xen: suppress certain init retries
Only on the 2nd of the paths leading to xen_swiotlb_init()'s "error"
label it is useful to retry the allocation; the first one did already
iterate through all possible order values.
Jan Beulich [Tue, 7 Sep 2021 12:05:12 +0000 (14:05 +0200)]
swiotlb-xen: maintain slab count properly
Generic swiotlb code makes sure to keep the slab count a multiple of the
number of slabs per segment. Yet even without checking whether any such
assumption is made elsewhere, it is easy to see that xen_swiotlb_fixup()
might alter unrelated memory when calling xen_create_contiguous_region()
for the last segment, when that's not a full one - the function acts on
full order-N regions, not individual pages.
Align the slab count suitably when halving it for a retry. Add a build
time check and a runtime one. Replace the no longer useful local
variable "slabs" by an "order" one calculated just once, outside of the
loop. Re-use "order" for calculating "dma_bits", and change the type of
the latter as well as the one of "i" while touching this anyway.
Jan Beulich [Tue, 7 Sep 2021 12:04:47 +0000 (14:04 +0200)]
swiotlb-xen: fix late init retry
The commit referenced below removed the assignment of "bytes" from
xen_swiotlb_init() without - like done for xen_swiotlb_init_early() -
adding an assignment on the retry path, thus leading to excessively
sized allocations upon retries.
xen: fix usage of pmd_populate in mremap for pv guests
Commit 0881ace292b662 ("mm/mremap: use pmd/pud_poplulate to update page
table entries") introduced a regression when running as Xen PV guest.
Today pmd_populate() for Xen PV assumes that the PFN inserted is
referencing a not yet used page table. In case of move_normal_pmd()
this is not true, resulting in WARN splats like:
Randy Dunlap [Mon, 13 Sep 2021 22:06:05 +0000 (15:06 -0700)]
ptp: dp83640: don't define PAGE0
Building dp83640.c on arch/parisc/ produces a build warning for
PAGE0 being redefined. Since the macro is not used in the dp83640
driver, just make it a comment for documentation purposes.
In file included from ../drivers/net/phy/dp83640.c:23:
../drivers/net/phy/dp83640_reg.h:8: warning: "PAGE0" redefined
8 | #define PAGE0 0x0000
from ../drivers/net/phy/dp83640.c:11:
../arch/parisc/include/asm/page.h:187: note: this is the location of the previous definition
187 | #define PAGE0 ((struct zeropage *)__PAGE_OFFSET)
block: flush the integrity workqueue in blk_integrity_unregister
When the integrity profile is unregistered there can still be integrity
reads queued up which could see a NULL verify_fn as shown by the race
window below:
CPU0 CPU1
process_one_work nvme_validate_ns
bio_integrity_verify_fn nvme_update_ns_info
nvme_update_disk_info
blk_integrity_unregister
---set queue->integrity as 0
bio_integrity_process
--access bi->profile->verify_fn(bi is a pointer of queue->integity)
Before calling blk_integrity_unregister in nvme_update_disk_info, we must
make sure that there is no work item in the kintegrityd_wq. Just call
blk_flush_integrity to flush the work queue so the bug can be resolved.
block: check if a profile is actually registered in blk_integrity_unregister
While clearing the profile itself is harmless, we really should not clear
the stable writes flag if it wasn't set due to a registered integrity
profile.
On sparc64, __fls() returns an "int", but the drm TTM code expected it
to be "unsigned long" as on x86. As a result, on sparc (and arc, and
m68k) you get build errors because 'min()' checks that the types match.
As suggested by Linus, it can use min_t instead of min to force the type
to be "unsigned int".
The boot-time allocation interface for memblock is a mess, with
'memblock_alloc()' returning a virtual pointer, but then you are
supposed to free it with 'memblock_free()' that takes a _physical_
address.
Not only is that all kinds of strange and illogical, but it actually
causes bugs, when people then use it like a normal allocation function,
and it fails spectacularly on a NULL pointer:
I really don't want to apply patches that treat the symptoms, when the
fundamental cause is this horribly confusing interface.
I started out looking at just automating a sane replacement sequence,
but because of this mix or virtual and physical addresses, and because
people have used the "__pa()" macro that can take either a regular
kernel pointer, or just the raw "unsigned long" address, it's all quite
messy.
So this just introduces a new saner interface for freeing a virtual
address that was allocated using 'memblock_alloc()', and that was kept
as a regular kernel pointer. And then it converts a couple of users
that are obvious and easy to test, including the 'xbc_nodes' case in
lib/bootconfig.c that caused problems.
In amdgpu_dm_atomic_check, dc_validate_global_state is called. On
failure this logs a warning to the kernel journal. However warnings
shouldn't be used for atomic test-only commit failures: user-space
might be perfoming a lot of atomic test-only commits to find the
best hardware configuration.
Downgrade the log to a regular DRM atomic message. While at it, use
the new device-aware logging infrastructure.
This fixes error messages in the kernel when running gamescope [1].
Ernst Sjöstrand [Thu, 2 Sep 2021 07:50:27 +0000 (09:50 +0200)]
drm/amd/amdgpu: Increase HWIP_MAX_INSTANCE to 10
Seems like newer cards can have even more instances now.
Found by UBSAN: array-index-out-of-bounds in
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c:318:29
index 8 is out of range for type 'uint32_t *[8]'
ipc: remove memcg accounting for sops objects in do_semtimedop()
Linus proposes to revert an accounting for sops objects in
do_semtimedop() because it's really just a temporary buffer
for a single semtimedop() system call.
This object can consume up to 2 pages, syscall is sleeping
one, size and duration can be controlled by user, and this
allocation can be repeated by many thread at the same time.
However Shakeel Butt pointed that there are much more popular
objects with the same life time and similar memory
consumption, the accounting of which was decided to be
rejected for performance reasons.
Considering at least 2 pages for task_struct and 2 pages for
the kernel stack, a back of the envelope calculation gives a
footprint amplification of <1.5 so this temporal buffer can be
safely ignored.
The factor would IMO be interesting if it was >> 2 (from the
PoV of excessive (ab)use, fine-grained accounting seems to be
currently unfeasible due to performance impact).
io_uring: allow retry for O_NONBLOCK if async is supported
A common complaint is that using O_NONBLOCK files with io_uring can be a
bit of a pain. Be a bit nicer and allow normal retry IFF the file does
support async behavior. This makes it possible to use io_uring more
reliably with O_NONBLOCK files, for use cases where it either isn't
possible or feasible to modify the file flags.
James Morse [Tue, 14 Sep 2021 16:56:23 +0000 (16:56 +0000)]
cpufreq: schedutil: Destroy mutex before kobject_put() frees the memory
Since commit e5c6b312ce3c ("cpufreq: schedutil: Use kobject release()
method to free sugov_tunables") kobject_put() has kfree()d the
attr_set before gov_attr_set_put() returns.
kobject_put() isn't the last user of attr_set in gov_attr_set_put(),
the subsequent mutex_destroy() triggers a use-after-free:
| BUG: KASAN: use-after-free in mutex_is_locked+0x20/0x60
| Read of size 8 at addr ffff000800ca4250 by task cpuhp/2/20
|
| CPU: 2 PID: 20 Comm: cpuhp/2 Not tainted 5.15.0-rc1 #12369
| Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development
| Platform, BIOS EDK II Jul 30 2018
| Call trace:
| dump_backtrace+0x0/0x380
| show_stack+0x1c/0x30
| dump_stack_lvl+0x8c/0xb8
| print_address_description.constprop.0+0x74/0x2b8
| kasan_report+0x1f4/0x210
| kasan_check_range+0xfc/0x1a4
| __kasan_check_read+0x38/0x60
| mutex_is_locked+0x20/0x60
| mutex_destroy+0x80/0x100
| gov_attr_set_put+0xfc/0x150
| sugov_exit+0x78/0x190
| cpufreq_offline.isra.0+0x2c0/0x660
| cpuhp_cpufreq_offline+0x14/0x24
| cpuhp_invoke_callback+0x430/0x6d0
| cpuhp_thread_fun+0x1b0/0x624
| smpboot_thread_fn+0x5e0/0xa6c
| kthread+0x3a0/0x450
| ret_from_fork+0x10/0x20
Swap the order of the calls.
Fixes: e5c6b312ce3c ("cpufreq: schedutil: Use kobject release() method to free sugov_tunables") Cc: 4.7+ <[email protected]> # 4.7+ Signed-off-by: James Morse <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
This warning helps catch uninitialized variables. It should have been
enabled at the same time as commit b2423184ac33 ("drm/i915: Enable
-Wuninitialized") but I did not realize they were disabled separately.
Enable it now that i915 is clean so that it stays that way.
drm/i915/selftests: Always initialize err in igt_dmabuf_import_same_driver_lmem()
Clang warns:
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:127:13: warning:
variable 'err' is used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
} else if (PTR_ERR(import) != -EOPNOTSUPP) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:138:9: note:
uninitialized use occurs here
return err;
^~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:127:9: note: remove
the 'if' if its condition is always true
} else if (PTR_ERR(import) != -EOPNOTSUPP) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:95:9: note:
initialize the variable 'err' to silence this warning
int err;
^
= 0
The test is expected to pass if i915_gem_prime_import() returns
-EOPNOTSUPP so initialize err to zero in this case.
drm/i915/selftests: Do not use import_obj uninitialized
Clang warns a couple of times:
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:63:6: warning:
variable 'import_obj' is used uninitialized whenever 'if' condition is
true [-Wsometimes-uninitialized]
if (import != &obj->base) {
^~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:80:22: note:
uninitialized use occurs here
i915_gem_object_put(import_obj);
^~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:63:2: note: remove
the 'if' if its condition is always false
if (import != &obj->base) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:38:46: note:
initialize the variable 'import_obj' to silence this warning
struct drm_i915_gem_object *obj, *import_obj;
^
= NULL
Shuffle the import_obj initialization above these if statements so that
it is not used uninitialized.
Pavel Begunkov [Tue, 14 Sep 2021 15:12:52 +0000 (16:12 +0100)]
io_uring: auto-removal for direct open/accept
It might be inconvenient that direct open/accept deviates from the
update semantics and fails if the slot is taken instead of removing a
file sitting there. Implement this auto-removal.
Note that removal might need to allocate and so may fail. However, if an
empty slot is specified, it's guaraneed to not fail on the fd
installation side for valid userspace programs. It's needed for users
who can't tolerate such failures, e.g. accept where the other end
never retries.
Michael Ellerman [Tue, 14 Sep 2021 12:17:23 +0000 (22:17 +1000)]
powerpc/boot: Fix build failure since GCC 4.9 removal
Stephen reported that the build was broken since commit 6d2ef226f2f1 ("compiler_attributes.h: drop __has_attribute() support for
gcc4"), with errors such as:
include/linux/compiler_attributes.h:296:5: warning: "__has_attribute" is not defined, evaluates to 0 [-Wundef]
296 | #if __has_attribute(__warning__)
| ^~~~~~~~~~~~~~~
make[2]: *** [arch/powerpc/boot/Makefile:225: arch/powerpc/boot/crt0.o] Error 1
But we expect __has_attribute() to always be defined now that we've
stopped using GCC 4.
Linus debugged it to the point of reading the GCC sources, and noticing
that the problem is that __has_attribute() is not defined when
preprocessing assembly files, which is what we're doing here.
Our assembly files don't include, or need, compiler_attributes.h, but
they are getting it unconditionally from the -include in BOOT_CFLAGS,
which is then added in its entirety to BOOT_AFLAGS.
That -include was added in commit 77433830ed16 ("powerpc: boot: include
compiler_attributes.h") so that we'd have "fallthrough" and other
attributes defined for the C files in arch/powerpc/boot. But it's not
needed for assembly files.
The minimal fix is to move the addition to BOOT_CFLAGS of -include
compiler_attributes.h until after we've copied BOOT_CFLAGS into
BOOT_AFLAGS. That avoids including compiler_attributes.h for asm files,
but makes no other change to BOOT_CFLAGS or BOOT_AFLAGS.
In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().
This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.
Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.
As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.
ip_cmsg_send() can loop over multiple cmsghdr()
If IP_RETOPTS has been successful, but following cmsghdr generates an error,
we do not free ipc.ok
If IP_RETOPTS is not successful, we have freed the allocated temporary space,
not the one currently in ipc.opt.
Sure, code could be refactored, but let's not bring back old bugs.
Fixes: d7807a9adf48 ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"") Signed-off-by: Eric Dumazet <[email protected]> Cc: Yajun Deng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
Commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retrans a multiple segments TSO/GSO packet without
split, Since this commit, we can no longer assume that a retransmitted
packet is a single segment.
This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
that use the actual segments(pcount) of the retransmitted packet.
Before that commit (10d3be569243), the assumption underlying the
tp->undo_retrans-- seems correct.
Andreas Larsson [Wed, 8 Sep 2021 07:48:22 +0000 (09:48 +0200)]
sparc32: page align size in arch_dma_alloc
Commit 53b7670e5735 ("sparc: factor the dma coherent mapping into
helper") lost the page align for the calls to dma_make_coherent and
srmmu_unmapiorange. The latter cannot handle a non page aligned len
argument.
The following pull-request contains BPF updates for your *net* tree.
We've added 7 non-merge commits during the last 13 day(s) which contain
a total of 18 files changed, 334 insertions(+), 193 deletions(-).
The main changes are:
1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.
2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.
3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.
4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.
5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
====================
Fixes: cc36a070b590 ("net-caif: add CAIF netdevice") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Keith Busch [Thu, 9 Sep 2021 15:54:52 +0000 (08:54 -0700)]
nvme-tcp: fix io_work priority inversion
Dispatching requests inline with the .queue_rq() call may block while
holding the send_mutex. If the tcp io_work also happens to schedule, it
may see the req_list is non-empty, leaving "pending" true and remaining
in TASK_RUNNING. Since io_work is of higher scheduling priority, the
.queue_rq task may not get a chance to run, blocking forward progress
and leading to io timeouts.
Instead of checking for pending requests within io_work, let the queueing
restart io_work outside the send_mutex lock if there is more work to be
done.
Ruozhu Li [Mon, 6 Sep 2021 03:51:34 +0000 (11:51 +0800)]
nvme-rdma: destroy cm id before destroy qp to avoid use after free
We should always destroy cm_id before destroy qp to avoid to get cma
event after qp was destroyed, which may lead to use after free.
In RDMA connection establishment error flow, don't destroy qp in cm
event handler.Just report cm_error to upper level, qp will be destroy
in nvme_rdma_alloc_queue() after destroy cm id.
Anton Eidelman [Sun, 12 Sep 2021 18:54:57 +0000 (12:54 -0600)]
nvme-multipath: fix ANA state updates when a namespace is not present
nvme_update_ana_state() has a deficiency that results in a failure to
properly update the ana state for a namespace in the following case:
NSIDs in ctrl->namespaces: 1, 3, 4
NSIDs in desc->nsids: 1, 2, 3, 4
Loop iteration 0:
ns index = 0, n = 0, ns->head->ns_id = 1, nsid = 1, MATCH.
Loop iteration 1:
ns index = 1, n = 1, ns->head->ns_id = 3, nsid = 2, NO MATCH.
Loop iteration 2:
ns index = 2, n = 2, ns->head->ns_id = 4, nsid = 4, MATCH.
Where the update to the ANA state of NSID 3 is missed. To fix this
increment n and retry the update with the same ns when ns->head->ns_id is
higher than nsid,
Tony Luck [Mon, 13 Sep 2021 21:52:39 +0000 (14:52 -0700)]
x86/mce: Avoid infinite loop for copy from user recovery
There are two cases for machine check recovery:
1) The machine check was triggered by ring3 (application) code.
This is the simpler case. The machine check handler simply queues
work to be executed on return to user. That code unmaps the page
from all users and arranges to send a SIGBUS to the task that
triggered the poison.
2) The machine check was triggered in kernel code that is covered by
an exception table entry. In this case the machine check handler
still queues a work entry to unmap the page, etc. but this will
not be called right away because the #MC handler returns to the
fix up code address in the exception table entry.
Problems occur if the kernel triggers another machine check before the
return to user processes the first queued work item.
Specifically, the work is queued using the ->mce_kill_me callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user, it enters an infinite loop processing the same entry for ever.
There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.
1) Some code (e.g. futex) first tries a get_user() with page faults
disabled. If this fails, the code retries with page faults enabled
expecting that this will resolve the page fault.
2) Copy from user code retries a copy in byte-at-time mode to check
whether any additional bytes can be copied.
On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.
Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work call back is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).
Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever enforce a limit of 10.
Daniel Vetter [Thu, 2 Sep 2021 14:20:48 +0000 (16:20 +0200)]
drm/i915: Release ctx->syncobj on final put, not on ctx close
gem context refcounting is another exercise in least locking design it
seems, where most things get destroyed upon context closure (which can
race with anything really). Only the actual memory allocation and the
locks survive while holding a reference.
This tripped up Jason when reimplementing the single timeline feature
in
drm/i915: Implement SINGLE_TIMELINE with a syncobj (v4)
We could fix the bug by holding ctx->mutex in execbuf and clear the
pointer (again while holding the mutex) context_close, but it's
cleaner to just make the context object actually invariant over its
_entire_ lifetime. This way any other ioctl that's potentially racing,
but holding a full reference, can still rely on ctx->syncobj being
an immutable pointer. Which without this change, is not the case.
Thomas Hellström [Tue, 31 Aug 2021 12:29:31 +0000 (14:29 +0200)]
drm/i915/gem: Fix the mman selftest
Using the I915_MMAP_TYPE_FIXED mmap type requires the TTM backend, so
for that mmap type, use __i915_gem_object_create_user() instead of
i915_gem_object_create_internal(), as we really want to tests objects
mmap-able by user-space.
This also means that the out-of-space error happens at object creation
and returns -ENXIO rather than -ENOSPC, so fix the code up to expect
that on out-of-offset-space errors.
Finally only use I915_MMAP_TYPE_FIXED for LMEM and SMEM for now if
testing on LMEM-capable devices. For stolen LMEM, we still take the
same path as for integrated, as that haven't been moved over to TTM yet,
and user-space should not be able to create out of stolen LMEM anyway.
v2:
- Check the presence of the obj->ops->mmap_offset callback rather than
hardcoding the supported mmap regions in can_mmap() (Maarten Lankhorst)
The function is only used from within GEM_BUG_ON(), which is causing
warnings with Wunneeded-internal-declaration in some builds. Since the
function is a simple wrapper around a CT function, we can just call the
CT function directly instead.
Kai-Heng Feng [Fri, 20 Aug 2021 07:52:59 +0000 (15:52 +0800)]
drm/i915/dp: Use max params for panels < eDP 1.4
Users reported that after commit 2bbd6dba84d4 ("drm/i915: Try to use
fast+narrow link on eDP again and fall back to the old max strategy on
failure"), the screen starts to have wobbly effect.
Commit a5c936add6a2 ("drm/i915/dp: Use slow and wide link training for
everything") doesn't help either, that means the affected eDP 1.2 panels
only work with max params.
So use max params for panels < eDP 1.4 as Windows does to solve the
issue.
v3:
- Do the eDP rev check in intel_edp_init_dpcd()
v2:
- Check eDP 1.4 instead of DPCD 1.1 to apply max params
Lee Shawn C [Tue, 6 Jul 2021 15:25:41 +0000 (23:25 +0800)]
drm/i915/dp: return proper DPRX link training result
After DPRX link training, intel_dp_link_train_phy() did not
return the training result properly. If link training failed,
i915 driver would not run into link train fallback function.
And no hotplug uevent would be received by user space application.
Juergen Gross [Fri, 27 Aug 2021 12:32:06 +0000 (14:32 +0200)]
xen/balloon: use a kernel thread instead a workqueue
Today the Xen ballooning is done via delayed work in a workqueue. This
might result in workqueue hangups being reported in case of large
amounts of memory are being ballooned in one go (here 16GB):
io_uring: pin SQPOLL data before unlocking ring lock
We need to re-check sqd->thread after we've dropped the lock. Pin
the sqd before doing the lockdep lock dance, and check if the thread
is alive after that. It's either NULL or alive, as the SQPOLL thread
cannot exit without holding the same sqd->lock.
Reported-and-tested-by: [email protected] Fixes: fa84693b3c89 ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL") Signed-off-by: Jens Axboe <[email protected]>
Daniel Borkmann [Mon, 13 Sep 2021 23:07:59 +0000 (01:07 +0200)]
bpf, selftests: Add test case for mixed cgroup v1/v2
Minimal selftest which implements a small BPF policy program to the
connect(2) hook which rejects TCP connection requests to port 60123
with EPERM. This is being attached to a non-root cgroup v2 path. The
test asserts that this works under cgroup v2-only and under a mixed
cgroup v1/v2 environment where net_classid is set in the former case.
Minimal set of helpers for net_cls classid cgroupv1 management in order
to set an id, join from a process, initiate setup and teardown. cgroupv2
helpers are left as-is, but reused where possible.
Daniel Borkmann [Mon, 13 Sep 2021 23:07:57 +0000 (01:07 +0200)]
bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
Bixuan Cui [Sat, 11 Sep 2021 00:55:57 +0000 (08:55 +0800)]
bpf: Add oversize check before call kvcalloc()
Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") add the
oversize check. When the allocation is larger than what kmalloc() supports,
the following warning triggered:
This occurs after commit 4e59869aa655 ("compiler-gcc.h: drop checks for
older GCC versions"), which removed the GCC_VERSION gating. The removed
version check just so happened to prevent __compiletime_error() from
being defined with clang because it pretends to be GCC 4.2.1 for
compatibility but the error attribute was not added to clang until
14.0.0.
Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") and commit a3f8a30f3f00 ("Compiler Attributes: use
feature checks instead of version checks") refactored the handling of
attributes in the main kernel to avoid situations like this but that
refactoring has never been done for the tools directory.
Refactoring is a rather large undertaking and this has never been an
issue before so instead, just guard the definition of
__compiletime_error() with __has_attribute() so that there are no more
errors.
Fixes: 4e59869aa655 ("compiler-gcc.h: drop checks for older GCC versions") Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Merge branch 'gcc-min-version-5.1' (make gcc-5.1 the minimum version)
Merge patch series from Nick Desaulniers to update the minimum gcc
version to 5.1.
This is some of the left-overs from the merge window that I didn't want
to deal with yesterday, so it comes in after -rc1 but was sent before.
Gcc-4.9 support has been an annoyance for some time, and with -Werror I
had the choice of applying a fairly big patch from Kees Cook to remove a
fair number of initializer warnings (still leaving some), or this patch
series from Nick that just removes the source of the problem.
The initializer cleanups might still be worth it regardless, but
honestly, I preferred just tackling the problem with gcc-4.9 head-on.
We've been more aggressiuve about no longer having to care about
compilers that were released a long time ago, and I think it's been a
good thing.
I added a couple of patches on top to sort out a few left-overs now that
we no longer support gcc-4.x.
As noted by Arnd, as a result of this minimum compiler version upgrade
we can probably change our use of '--std=gnu89' to '--std=gnu11', and
finally start using local loop declarations etc. But this series does
_not_ yet do that.
Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/lkml/CAK7LNASs6dvU6D3jL2GG3jW58fXfaj6VNOe55NJnTB8UPuk2pA@mail.gmail.com/ Link: https://github.com/ClangBuiltLinux/linux/issues/1438
* emailed patches from Nick Desaulniers <[email protected]>:
Drop some straggling mentions of gcc-4.9 as being stale
compiler_attributes.h: drop __has_attribute() support for gcc4
vmlinux.lds.h: remove old check for GCC 4.9
compiler-gcc.h: drop checks for older GCC versions
Makefile: drop GCC < 5 -fno-var-tracking-assignments workaround
arm64: remove GCC version check for ARCH_SUPPORTS_INT128
powerpc: remove GCC version check for UPD_CONSTR
riscv: remove Kconfig check for GCC version for ARCH_RV64I
Kconfig.debug: drop GCC 5+ version check for DWARF5
mm/ksm: remove old GCC 4.9+ check
compiler.h: drop fallback overflow checkers
Documentation: raise minimum supported version of GCC to 5.1
cpufreq: intel_pstate: Override parameters if HWP forced by BIOS
If HWP has been already been enabled by BIOS, it may be
necessary to override some kernel command line parameters.
Once it has been enabled it requires a reset to be disabled.
compiler_attributes.h: drop __has_attribute() support for gcc4
Now that GCC 5.1 is the minimally supported default, the manual
workaround for older gcc versions not having __has_attribute() are no
longer relevant and can be removed.
Nick Desaulniers [Fri, 10 Sep 2021 23:40:47 +0000 (16:40 -0700)]
vmlinux.lds.h: remove old check for GCC 4.9
Now that GCC 5.1 is the minimally supported version of GCC, we can
effectively revert commit 85c2ce9104eb ("sched, vmlinux.lds: Increase
STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
Nick Desaulniers [Fri, 10 Sep 2021 23:40:38 +0000 (16:40 -0700)]
Documentation: raise minimum supported version of GCC to 5.1
commit fad7cd3310db ("nbd: add the check to prevent overflow in
__nbd_ioctl()") raised an issue from the fallback helpers added in
commit f0907827a8a9 ("compiler.h: enable builtin overflow checkers and
add fallback code")
Specifically, the helpers for checking whether the results of a
multiplication overflowed (__unsigned_mul_overflow,
__signed_add_overflow) use the division operator when
!COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW. This is problematic for 64b
operands on 32b hosts.
Also, because the macro is type agnostic, it is very difficult to write
a similarly type generic macro that dispatches to one of:
* div64_s64
* div64_u64
* div_s64
* div_u64
Raising the minimum supported versions allows us to remove all of the
fallback helpers for !COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW, instead
dispatching the compiler builtins.
arm64 has already raised the minimum supported GCC version to 5.1, do
this for all targets now. See the link below for the previous
discussion.
Will Deacon [Mon, 13 Sep 2021 16:35:47 +0000 (17:35 +0100)]
x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y
Commit 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
added an optimised version of __get_user_asm() for x86 using 'asm goto'.
Like the non-optimised code, the 32-bit implementation of 64-bit
get_user() expands to a pair of 32-bit accesses. Unlike the
non-optimised code, the _original_ pointer is incremented to copy the
high word instead of loading through a new pointer explicitly
constructed to point at a 32-bit type. Consequently, if the pointer
points at a 64-bit type then we end up loading the wrong data for the
upper 32-bits.
This was observed as a mount() failure in Android targeting i686 after b0cfcdd9b967 ("d_path: make 'prepend()' fill up the buffer exactly on
overflow") because the call to copy_from_kernel_nofault() from
prepend_copy() ends up in __get_kernel_nofault() and casts the source
pointer to a 'u64 __user *'. An attempt to mount at "/debug_ramdisk"
therefore ends up failing trying to mount "/debumdismdisk".
Use the existing '__gu_ptr' source pointer to unsigned int for 32-bit
__get_user_asm_u64() instead of the original pointer.
io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
The items passed in the array pointed by the arg parameter
of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation
carry certain semantics: they refer to different io-wq worker categories;
provide IO_WQ_* constants in the UAPI, so these categories can be referenced
in the user space code.
Hamza Mahfooz [Fri, 10 Sep 2021 23:53:37 +0000 (19:53 -0400)]
dma-debug: prevent an error message from causing runtime problems
For some drivers, that use the DMA API. This error message can be reached
several millions of times per second, causing spam to the kernel's printk
buffer and bringing the CPU usage up to 100% (so, it should be rate
limited). However, since there is at least one driver that is in the
mainline and suffers from the error condition, it is more useful to
err_printk() here instead of just rate limiting the error message (in hopes
that it will make it easier for other drivers that suffer from this issue
to be spotted).
Daniel Wagner [Thu, 2 Sep 2021 09:20:02 +0000 (11:20 +0200)]
nvme: avoid race in shutdown namespace removal
When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.
To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.
Fixes: 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are down") Signed-off-by: Daniel Wagner <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
Dan Carpenter [Thu, 9 Sep 2021 09:14:40 +0000 (12:14 +0300)]
nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
This was intended to limit the number of characters printed from
"subsys->serial" to NVMET_SN_MAX_SIZE. But accidentally the width
specifier was used instead of the precision specifier so it only
affects the alignment and not the number of characters printed.
Fixes: f04064814c2a ("nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
net: hns3: fix the timing issue of VF clearing interrupt sources
Currently, the VF does not clear the interrupt source immediately after
receiving the interrupt. As a result, if the second interrupt task is
triggered when processing the first interrupt task, clearing the
interrupt source before exiting will clear the interrupt sources of the
two tasks at the same time. As a result, no interrupt is triggered for
the second task. The VF detects the missed message only when the next
interrupt is generated.
Clearing it immediately after executing check_evt_cause ensures that:
1. Even if two interrupt tasks are triggered at the same time, they can
be processed.
2. If the second task is triggered during the processing of the first
task and the interrupt source is not cleared, the interrupt is reported
after vector0 is enabled.
Fixes: b90fcc5bd904 ("net: hns3: add reset handling for VF when doing Core/Global/IMP reset") Signed-off-by: Jiaran Zhang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
When the command for querying imp info is issued to the firmware,
if the firmware does not support the command, the returned value
of bd num is 0.
Add protection mechanism before alloc memory to prevent apply for
0-length memory.
Fixes: 0b198b0d80ea ("net: hns3: refactor dump m7 info of debugfs") Signed-off-by: Jiaran Zhang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yufeng Mo [Mon, 13 Sep 2021 13:08:23 +0000 (21:08 +0800)]
net: hns3: disable mac in flr process
The firmware will not disable mac in flr process. Therefore, the driver
needs to proactively disable mac during flr, which is the same as the
function reset.
Fixes: 35d93a30040c ("net: hns3: adjust the process of PF reset") Signed-off-by: Yufeng Mo <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yufeng Mo [Mon, 13 Sep 2021 13:08:22 +0000 (21:08 +0800)]
net: hns3: change affinity_mask to numa node range
Currently, affinity_mask is set to a single cpu. As a result,
irqbalance becomes invalid in SUBSET or EXACT mode. To solve
this problem, change affinity_mask to numa node range. In this
way, irqbalance can be performed on the cpu of the numa node.
Fixes: 0812545487ec ("net: hns3: add interrupt affinity support for misc interrupt") Signed-off-by: Yufeng Mo <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yufeng Mo [Mon, 13 Sep 2021 13:08:21 +0000 (21:08 +0800)]
net: hns3: pad the short tunnel frame before sending to hardware
The hardware cannot handle short tunnel frames below 65 bytes,
and will cause vlan tag missing problem. So pads packet size to
65 bytes for tunnel frames to fix this bug.
Fixes: 3db084d28dc0("net: hns3: Fix for vxlan tx checksum bug") Signed-off-by: Yufeng Mo <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yunsheng Lin [Mon, 13 Sep 2021 13:08:20 +0000 (21:08 +0800)]
net: hns3: add option to turn off page pool feature
When page pool is added to the hns3 driver, it is always
enabled unconditionally, which means spilt page handling
in the hns3 driver is dead code.
As there is a requirement to test the performance between
spilt page handling in driver and page pool, so add a module
param to support disabling the page pool.
When the page pool is proved to perform better in most case,
the spilt page handling in driver can be removed.
Fixes: 93188e9642c3 ("net: hns3: support skb's frag page recycling based on page pool") Signed-off-by: Yunsheng Lin <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
We queue an irq work for deferred processing of mce event in realmode
mce handler, where translation is disabled. Queuing of the work may
result in accessing memory outside RMO region, such access needs the
translation to be enabled for an LPAR running with hash mmu else the
kernel crashes.
After enabling translation in mce_handle_error() we used to leave it
enabled to avoid crashing here, but now with the commit 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from
handler") we are restoring the MSR to disable translation.
Hence to fix this enable the translation before queuing the work.
Without this change following trace is seen on injecting SLB multihit in
an LPAR running with hash mmu.
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 5 PID: 1883 Comm: insmod Tainted: G OE 5.14.0-mce+ #137
NIP: c000000000735d60 LR: c000000000318640 CTR: 0000000000000000
REGS: c00000001ebff9a0 TRAP: 0300 Tainted: G OE (5.14.0-mce+)
MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28008228 XER: 00000001
CFAR: c00000000031863c DAR: c00000027fa8fe08 DSISR: 40000000 IRQMASK: 0
...
NIP llist_add_batch+0x0/0x40
LR __irq_work_queue_local+0x70/0xc0
Call Trace:
0xc00000001ebffc0c (unreliable)
irq_work_queue+0x40/0x70
machine_check_queue_event+0xbc/0xd0
machine_check_early_common+0x16c/0x1f4
Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from handler") Signed-off-by: Ganesh Goudar <[email protected]>
[mpe: Fix comment formatting, trim oops in change log for readability] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
POWER9 DD2.2 and 2.3 hardware implements a "fake-suspend" mode where
certain TM instructions executed in HV=0 mode cause softpatch interrupts
so the hypervisor can emulate them and prevent problematic processor
conditions. In this fake-suspend mode, the treclaim. instruction does
not modify registers.
Unfortunately the rfscv instruction executed by the guest do not
generate softpatch interrupts, which can cause the hypervisor to lose
track of the fake-suspend mode, and it can execute this treclaim. while
not in fake-suspend mode. This modifies GPRs and crashes the hypervisor.
It's not trivial to disable scv in the guest with HFSCR now, because
they assume a POWER9 has scv available. So this fix saves and restores
checkpointed registers across the treclaim.
Nicholas Piggin [Wed, 8 Sep 2021 10:17:17 +0000 (20:17 +1000)]
powerpc/64s: system call rfscv workaround for TM bugs
The rfscv instruction does not work correctly with the fake-suspend mode
in POWER9, which can end up with the hypervisor restoring an incorrect
checkpoint.
Work around this by setting the _TIF_RESTOREALL flag if a system call
returns to a transaction active state, causing rfid to be used instead
of rfscv to return, which will do the right thing. The contents of the
registers are irrelevant because they will be overwritten in this case
anyway.
Nicholas Piggin [Fri, 3 Sep 2021 12:57:07 +0000 (22:57 +1000)]
selftests/powerpc: Add scv versions of the basic TM syscall tests
The basic TM vs syscall test code hard codes an sc instruction for the
system call, which fails to cover scv even when the userspace libc has
support for it.
Duplicate the tests with hard coded scv variants so both are tested
when possible.
Nicholas Piggin [Fri, 3 Sep 2021 12:57:06 +0000 (22:57 +1000)]
powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state
If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.
This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.
This can't easily be fixed in asm because at this point an async irq may
have hit, which is soft-masked and marked pending. The pending interrupt
has to be replayed before returning to userspace. The fix is to move the
tabort_syscall code to C in the main syscall handler, and just skip the
system call but otherwise return as usual, which will take care of the
pending irqs. This also does a bunch of other things including possible
signal delivery to the process, but the doomed transaction should still
be aborted when it is eventually returned to.
The sc system call path is changed to use the new C function as well to
reduce code and path differences. This slows down how quickly system
calls are aborted when called while a transaction is active, which could
potentially impact TM performance. But making any system call is already
bad for performance, and TM is on the way out, so go with simpler over
faster.