Git Repo - linux.git/log

memblock: introduce saner 'memblock_free_ptr()' interface

The boot-time allocation interface for memblock is a mess, with
'memblock_alloc()' returning a virtual pointer, but then you are
supposed to free it with 'memblock_free()' that takes a _physical_
address.

Not only is that all kinds of strange and illogical, but it actually
causes bugs, when people then use it like a normal allocation function,
and it fails spectacularly on a NULL pointer:

   https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/

or just random memory corruption if the debug checks don't catch it:

   https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/

I really don't want to apply patches that treat the symptoms, when the
fundamental cause is this horribly confusing interface.

I started out looking at just automating a sane replacement sequence,
but because of this mix or virtual and physical addresses, and because
people have used the "__pa()" macro that can take either a regular
kernel pointer, or just the raw "unsigned long" address, it's all quite
messy.

So this just introduces a new saner interface for freeing a virtual
address that was allocated using 'memblock_alloc()', and that was kept
as a regular kernel pointer.  And then it converts a couple of users
that are obvious and easy to test, including the 'xbc_nodes' case in
lib/bootconfig.c that caused problems.

Reported-by: kernel test robot <[email protected]>
Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
Cc: Steven Rostedt <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drm/amdgpu: use IS_ERR for debugfs APIs

debugfs APIs returns encoded error so use
IS_ERR for checking return value.

v2: return PTR_ERR(ent)

References: https://gitlab.freedesktop.org/drm/amd/-/issues/1686
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-By: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

amd/display: downgrade validation failure log level

In amdgpu_dm_atomic_check, dc_validate_global_state is called. On
failure this logs a warning to the kernel journal. However warnings
shouldn't be used for atomic test-only commit failures: user-space
might be perfoming a lot of atomic test-only commits to find the
best hardware configuration.

Downgrade the log to a regular DRM atomic message. While at it, use
the new device-aware logging infrastructure.

This fixes error messages in the kernel when running gamescope [1].

[1]: https://github.com/Plagman/gamescope/issues/245

Reviewed-by: Nicholas Kazlauskas <[email protected]>
Signed-off-by: Simon Ser <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Harry Wentland <[email protected]>
Cc: Nicholas Kazlauskas <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: fix use after free during BO move

The memory backing old_mem is already freed at that point, move the
check a bit more up.

Signed-off-by: Christian König <[email protected]>
Fixes: bfa3357ef9ab ("drm/ttm: allocate resource object instead of embedding it v2")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1699
Acked-by: Nirmoy Das <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amd/pm: fix the issue of uploading powerplay table

fix the issue of uploading powerplay table due to the dependancy of rlc.

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Jack Gui <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amd/amdgpu: Increase HWIP_MAX_INSTANCE to 10

Seems like newer cards can have even more instances now.
Found by UBSAN: array-index-out-of-bounds in
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c:318:29
index 8 is out of range for type 'uint32_t *[8]'

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1697
Cc: [email protected]
Signed-off-by: Ernst Sjöstrand <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

ipc: remove memcg accounting for sops objects in do_semtimedop()

Linus proposes to revert an accounting for sops objects in
do_semtimedop() because it's really just a temporary buffer
for a single semtimedop() system call.

This object can consume up to 2 pages, syscall is sleeping
one, size and duration can be controlled by user, and this
allocation can be repeated by many thread at the same time.

However Shakeel Butt pointed that there are much more popular
objects with the same life time and similar memory
consumption, the accounting of which was decided to be
rejected for performance reasons.

Considering at least 2 pages for task_struct and 2 pages for
the kernel stack, a back of the envelope calculation gives a
footprint amplification of <1.5 so this temporal buffer can be
safely ignored.

The factor would IMO be interesting if it was >> 2 (from the
PoV of excessive (ab)use, fine-grained accounting seems to be
currently unfeasible due to performance impact).

Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 18319498fdd4 ("memcg: enable accounting of ipc resources")
Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

io_uring: allow retry for O_NONBLOCK if async is supported

A common complaint is that using O_NONBLOCK files with io_uring can be a
bit of a pain. Be a bit nicer and allow normal retry IFF the file does
support async behavior. This makes it possible to use io_uring more
reliably with O_NONBLOCK files, for use cases where it either isn't
possible or feasible to modify the file flags.

Cc: [email protected]
Reported-and-tested-by: Dan Melnic <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

cpufreq: schedutil: Destroy mutex before kobject_put() frees the memory

Since commit e5c6b312ce3c ("cpufreq: schedutil: Use kobject release()
method to free sugov_tunables") kobject_put() has kfree()d the
attr_set before gov_attr_set_put() returns.

kobject_put() isn't the last user of attr_set in gov_attr_set_put(),
the subsequent mutex_destroy() triggers a use-after-free:
| BUG: KASAN: use-after-free in mutex_is_locked+0x20/0x60
| Read of size 8 at addr ffff000800ca4250 by task cpuhp/2/20
|
| CPU: 2 PID: 20 Comm: cpuhp/2 Not tainted 5.15.0-rc1 #12369
| Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development
| Platform, BIOS EDK II Jul 30 2018
| Call trace:
|  dump_backtrace+0x0/0x380
|  show_stack+0x1c/0x30
|  dump_stack_lvl+0x8c/0xb8
|  print_address_description.constprop.0+0x74/0x2b8
|  kasan_report+0x1f4/0x210
|  kasan_check_range+0xfc/0x1a4
|  __kasan_check_read+0x38/0x60
|  mutex_is_locked+0x20/0x60
|  mutex_destroy+0x80/0x100
|  gov_attr_set_put+0xfc/0x150
|  sugov_exit+0x78/0x190
|  cpufreq_offline.isra.0+0x2c0/0x660
|  cpuhp_cpufreq_offline+0x14/0x24
|  cpuhp_invoke_callback+0x430/0x6d0
|  cpuhp_thread_fun+0x1b0/0x624
|  smpboot_thread_fn+0x5e0/0xa6c
|  kthread+0x3a0/0x450
|  ret_from_fork+0x10/0x20

Swap the order of the calls.

Fixes: e5c6b312ce3c ("cpufreq: schedutil: Use kobject release() method to free sugov_tunables")
Cc: 4.7+ <[email protected]> # 4.7+
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

drm/i915: Enable -Wsometimes-uninitialized

This warning helps catch uninitialized variables. It should have been
enabled at the same time as commit b2423184ac33 ("drm/i915: Enable
-Wuninitialized") but I did not realize they were disabled separately.
Enable it now that i915 is clean so that it stays that way.

Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 43192617f7816bb74584c1df06f57363afd15337)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/selftests: Always initialize err in igt_dmabuf_import_same_driver_lmem()

Clang warns:

drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:127:13: warning:
variable 'err' is used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
        } else if (PTR_ERR(import) != -EOPNOTSUPP) {
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:138:9: note:
uninitialized use occurs here
        return err;
               ^~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:127:9: note: remove
the 'if' if its condition is always true
        } else if (PTR_ERR(import) != -EOPNOTSUPP) {
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:95:9: note:
initialize the variable 'err' to silence this warning
        int err;
               ^
                = 0

The test is expected to pass if i915_gem_prime_import() returns
-EOPNOTSUPP so initialize err to zero in this case.

Fixes: cdb35d1ed6d2 ("drm/i915/gem: Migrate to system at dma-buf attach time (v7)")
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 46f20a353b80d02492655d99714f0566018a17e8)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/selftests: Do not use import_obj uninitialized

Clang warns a couple of times:

drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:63:6: warning:
variable 'import_obj' is used uninitialized whenever 'if' condition is
true [-Wsometimes-uninitialized]
        if (import != &obj->base) {
            ^~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:80:22: note:
uninitialized use occurs here
        i915_gem_object_put(import_obj);
                            ^~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:63:2: note: remove
the 'if' if its condition is always false
        if (import != &obj->base) {
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c:38:46: note:
initialize the variable 'import_obj' to silence this warning
        struct drm_i915_gem_object *obj, *import_obj;
                                                    ^
                                                     = NULL

Shuffle the import_obj initialization above these if statements so that
it is not used uninitialized.

Fixes: d7b2cb380b3a ("drm/i915/gem: Correct the locking and pin pattern for dma-buf (v8)")
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 4796054b381a586f4177a24e3d8b5a6a0a32ce62)
Signed-off-by: Jani Nikula <[email protected]>

io_uring: auto-removal for direct open/accept

It might be inconvenient that direct open/accept deviates from the
update semantics and fails if the slot is taken instead of removing a
file sitting there. Implement this auto-removal.

Note that removal might need to allocate and so may fail. However, if an
empty slot is specified, it's guaraneed to not fail on the fd
installation side for valid userspace programs. It's needed for users
who can't tolerate such failures, e.g. accept where the other end
never retries.

Suggested-by: Franz-B. Tuneke <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/c896f14ea46b0eaa6c09d93149e665c2c37979b4.1631632300.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

powerpc/boot: Fix build failure since GCC 4.9 removal

Stephen reported that the build was broken since commit
6d2ef226f2f1 ("compiler_attributes.h: drop __has_attribute() support for
gcc4"), with errors such as:

  include/linux/compiler_attributes.h:296:5: warning: "__has_attribute" is not defined, evaluates to 0 [-Wundef]
    296 | #if __has_attribute(__warning__)
        |     ^~~~~~~~~~~~~~~
  make[2]: *** [arch/powerpc/boot/Makefile:225: arch/powerpc/boot/crt0.o] Error 1

But we expect __has_attribute() to always be defined now that we've
stopped using GCC 4.

Linus debugged it to the point of reading the GCC sources, and noticing
that the problem is that __has_attribute() is not defined when
preprocessing assembly files, which is what we're doing here.

Our assembly files don't include, or need, compiler_attributes.h, but
they are getting it unconditionally from the -include in BOOT_CFLAGS,
which is then added in its entirety to BOOT_AFLAGS.

That -include was added in commit 77433830ed16 ("powerpc: boot: include
compiler_attributes.h") so that we'd have "fallthrough" and other
attributes defined for the C files in arch/powerpc/boot. But it's not
needed for assembly files.

The minimal fix is to move the addition to BOOT_CFLAGS of -include
compiler_attributes.h until after we've copied BOOT_CFLAGS into
BOOT_AFLAGS. That avoids including compiler_attributes.h for asm files,
but makes no other change to BOOT_CFLAGS or BOOT_AFLAGS.

Reported-by: Stephen Rothwell <[email protected]>
Debugged-by: Linus Torvalds <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

io_uring: fix missing sigmask restore in io_cqring_wait()

Move get_timespec() section in io_cqring_wait() before the sigmask
saving, otherwise we'll fail to restore sigmask once get_timespec()
returns error.

Fixes: c73ebb685fb6 ("io_uring: add timeout support for io_uring_enter()")
Signed-off-by: Xiaoguang Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

iov_iter: add helper to save iov_iter state

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <[email protected]>

Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers""

This reverts commit d7807a9adf4856171f8441f13078c33941df48ab.

As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.

  ip_cmsg_send() can loop over multiple cmsghdr()

  If IP_RETOPTS has been successful, but following cmsghdr generates an error,
  we do not free ipc.ok

  If IP_RETOPTS is not successful, we have freed the allocated temporary space,
  not the one currently in ipc.opt.

Sure, code could be refactored, but let's not bring back old bugs.

Fixes: d7807a9adf48 ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Yajun Deng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()

Commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retrans a multiple segments TSO/GSO packet without
split, Since this commit, we can no longer assume that a retransmitted
packet is a single segment.

This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
that use the actual segments(pcount) of the retransmitted packet.

Before that commit (10d3be569243), the assumption underlying the
tp->undo_retrans-- seems correct.

Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: zhenggy <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Yuchung Cheng <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sparc32: page align size in arch_dma_alloc

Commit 53b7670e5735 ("sparc: factor the dma coherent mapping into
helper") lost the page align for the calls to dma_make_coherent and
srmmu_unmapiorange. The latter cannot handle a non page aligned len
argument.

Signed-off-by: Andreas Larsson <[email protected]>
Reviewed-by: Sam Ravnborg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2021-09-14

The following pull-request contains BPF updates for your *net* tree.

We've added 7 non-merge commits during the last 13 day(s) which contain
a total of 18 files changed, 334 insertions(+), 193 deletions(-).

The main changes are:

1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.

2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.

3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.

4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.

5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
====================

Signed-off-by: David S. Miller <[email protected]>

net-caif: avoid user-triggerable WARN_ON(1)

syszbot triggers this warning, which looks something
we can easily prevent.

If we initialize priv->list_field in chnl_net_init(),
then always use list_del_init(), we can remove robust_list_del()
completely.

WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 robust_list_del net/caif/chnl_net.c:67 [inline]
WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Modules linked in:
CPU: 0 PID: 3233 Comm: syz-executor.3 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:robust_list_del net/caif/chnl_net.c:67 [inline]
RIP: 0010:chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
Code: 89 eb e8 3a a3 ba f8 48 89 d8 48 c1 e8 03 42 80 3c 28 00 0f 85 bf 01 00 00 48 81 fb 00 14 4e 8d 48 8b 2b 75 d0 e8 17 a3 ba f8 <0f> 0b 5b 5d 41 5c 41 5d e9 0a a3 ba f8 4c 89 e3 e8 02 a3 ba f8 4c
RSP: 0018:ffffc90009067248 EFLAGS: 00010202
RAX: 0000000000008780 RBX: ffffffff8d4e1400 RCX: ffffc9000fd34000
RDX: 0000000000040000 RSI: ffffffff88bb6e49 RDI: 0000000000000003
RBP: ffff88802cd9ee08 R08: 0000000000000000 R09: ffffffff8d0e6647
R10: ffffffff88bb6dc2 R11: 0000000000000000 R12: ffff88803791ae08
R13: dffffc0000000000 R14: 00000000e600ffce R15: ffff888073ed3480
FS: 00007fed10fa0700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2c322000 CR3: 00000000164a6000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
register_netdevice+0xadf/0x1500 net/core/dev.c:10347
ipcaif_newlink+0x4c/0x260 net/caif/chnl_net.c:468
__rtnl_newlink+0x106d/0x1750 net/core/rtnetlink.c:3458
rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3506
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
__sys_sendto+0x21c/0x320 net/socket.c:2036
__do_sys_sendto net/socket.c:2048 [inline]
__se_sys_sendto net/socket.c:2044 [inline]
__x64_sys_sendto+0xdd/0x1b0 net/socket.c:2044
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: cc36a070b590 ("net-caif: add CAIF netdevice")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nvme-tcp: fix io_work priority inversion

Dispatching requests inline with the .queue_rq() call may block while
holding the send_mutex. If the tcp io_work also happens to schedule, it
may see the req_list is non-empty, leaving "pending" true and remaining
in TASK_RUNNING. Since io_work is of higher scheduling priority, the
.queue_rq task may not get a chance to run, blocking forward progress
and leading to io timeouts.

Instead of checking for pending requests within io_work, let the queueing
restart io_work outside the send_mutex lock if there is more work to be
done.

Fixes: a0fdd1418007f ("nvme-tcp: rerun io_work if req_list is not empty")
Reported-by: Samuel Jones <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

nvme-rdma: destroy cm id before destroy qp to avoid use after free

We should always destroy cm_id before destroy qp to avoid to get cma
event after qp was destroyed, which may lead to use after free.
In RDMA connection establishment error flow, don't destroy qp in cm
event handler.Just report cm_error to upper level, qp will be destroy
in nvme_rdma_alloc_queue() after destroy cm id.

Signed-off-by: Ruozhu Li <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

nvme-multipath: fix ANA state updates when a namespace is not present

nvme_update_ana_state() has a deficiency that results in a failure to
properly update the ana state for a namespace in the following case:

  NSIDs in ctrl->namespaces: 1, 3,    4
  NSIDs in desc->nsids: 1, 2, 3, 4

Loop iteration 0:
    ns index = 0, n = 0, ns->head->ns_id = 1, nsid = 1, MATCH.
Loop iteration 1:
    ns index = 1, n = 1, ns->head->ns_id = 3, nsid = 2, NO MATCH.
Loop iteration 2:
    ns index = 2, n = 2, ns->head->ns_id = 4, nsid = 4, MATCH.

Where the update to the ANA state of NSID 3 is missed.  To fix this
increment n and retry the update with the same ns when ns->head->ns_id is
higher than nsid,

Signed-off-by: Anton Eidelman <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>

rtc: cmos: Disable irq around direct invocation of cmos_interrupt()

As previously noted in commit 66e4f4a9cc38 ("rtc: cmos: Use
spin_lock_irqsave() in cmos_interrupt()"):

<4>[  254.192378] WARNING: inconsistent lock state
<4>[  254.192384] 5.12.0-rc1-CI-CI_DRM_9834+ #1 Not tainted
<4>[  254.192396] --------------------------------
<4>[  254.192400] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
<4>[  254.192409] rtcwake/5309 [HC0[0]:SC0[0]:HE1:SE1] takes:
<4>[  254.192429] ffffffff8263c5f8 (rtc_lock){?...}-{2:2}, at: cmos_interrupt+0x18/0x100
<4>[  254.192481] {IN-HARDIRQ-W} state was registered at:
<4>[  254.192488]   lock_acquire+0xd1/0x3d0
<4>[  254.192504]   _raw_spin_lock+0x2a/0x40
<4>[  254.192519]   cmos_interrupt+0x18/0x100
<4>[  254.192536]   rtc_handler+0x1f/0xc0
<4>[  254.192553]   acpi_ev_fixed_event_detect+0x109/0x13c
<4>[  254.192574]   acpi_ev_sci_xrupt_handler+0xb/0x28
<4>[  254.192596]   acpi_irq+0x13/0x30
<4>[  254.192620]   __handle_irq_event_percpu+0x43/0x2c0
<4>[  254.192641]   handle_irq_event_percpu+0x2b/0x70
<4>[  254.192661]   handle_irq_event+0x2f/0x50
<4>[  254.192680]   handle_fasteoi_irq+0x9e/0x150
<4>[  254.192693]   __common_interrupt+0x76/0x140
<4>[  254.192715]   common_interrupt+0x96/0xc0
<4>[  254.192732]   asm_common_interrupt+0x1e/0x40
<4>[  254.192750]   _raw_spin_unlock_irqrestore+0x38/0x60
<4>[  254.192767]   resume_irqs+0xba/0xf0
<4>[  254.192786]   dpm_resume_noirq+0x245/0x3d0
<4>[  254.192811]   suspend_devices_and_enter+0x230/0xaa0
<4>[  254.192835]   pm_suspend.cold.8+0x301/0x34a
<4>[  254.192859]   state_store+0x7b/0xe0
<4>[  254.192879]   kernfs_fop_write_iter+0x11d/0x1c0
<4>[  254.192899]   new_sync_write+0x11d/0x1b0
<4>[  254.192916]   vfs_write+0x265/0x390
<4>[  254.192933]   ksys_write+0x5a/0xd0
<4>[  254.192949]   do_syscall_64+0x33/0x80
<4>[  254.192965]   entry_SYSCALL_64_after_hwframe+0x44/0xae
<4>[  254.192986] irq event stamp: 43775
<4>[  254.192994] hardirqs last  enabled at (43775): [<ffffffff81c00c42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
<4>[  254.193023] hardirqs last disabled at (43774): [<ffffffff81aa691a>] sysvec_apic_timer_interrupt+0xa/0xb0
<4>[  254.193049] softirqs last  enabled at (42548): [<ffffffff81e00342>] __do_softirq+0x342/0x48e
<4>[  254.193074] softirqs last disabled at (42543): [<ffffffff810b45fd>] irq_exit_rcu+0xad/0xd0
<4>[  254.193101]
                  other info that might help us debug this:
<4>[  254.193107]  Possible unsafe locking scenario:

<4>[  254.193112]        CPU0
<4>[  254.193117]        ----
<4>[  254.193121]   lock(rtc_lock);
<4>[  254.193137]   <Interrupt>
<4>[  254.193142]     lock(rtc_lock);
<4>[  254.193156]
                   *** DEADLOCK ***

<4>[  254.193161] 6 locks held by rtcwake/5309:
<4>[  254.193174]  #0: ffff888104861430 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x5a/0xd0
<4>[  254.193232]  #1: ffff88810f823288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xe7/0x1c0
<4>[  254.193282]  #2: ffff888100cef3c0 (kn->active#285
<7>[  254.192706] i915 0000:00:02.0: [drm:intel_modeset_setup_hw_state [i915]] [CRTC:51:pipe A] hw state readout: disabled
<4>[  254.193307] ){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf0/0x1c0
<4>[  254.193333]  #3: ffffffff82649fa8 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend.cold.8+0xce/0x34a
<4>[  254.193387]  #4: ffffffff827a2108 (acpi_scan_lock){+.+.}-{3:3}, at: acpi_suspend_begin+0x47/0x70
<4>[  254.193433]  #5: ffff8881019ea178 (&dev->mutex){....}-{3:3}, at: device_resume+0x68/0x1e0
<4>[  254.193485]
                  stack backtrace:
<4>[  254.193492] CPU: 1 PID: 5309 Comm: rtcwake Not tainted 5.12.0-rc1-CI-CI_DRM_9834+ #1
<4>[  254.193514] Hardware name: Google Soraka/Soraka, BIOS MrChromebox-4.10 08/25/2019
<4>[  254.193524] Call Trace:
<4>[  254.193536]  dump_stack+0x7f/0xad
<4>[  254.193567]  mark_lock.part.47+0x8ca/0xce0
<4>[  254.193604]  __lock_acquire+0x39b/0x2590
<4>[  254.193626]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
<4>[  254.193660]  lock_acquire+0xd1/0x3d0
<4>[  254.193677]  ? cmos_interrupt+0x18/0x100
<4>[  254.193716]  _raw_spin_lock+0x2a/0x40
<4>[  254.193735]  ? cmos_interrupt+0x18/0x100
<4>[  254.193758]  cmos_interrupt+0x18/0x100
<4>[  254.193785]  cmos_resume+0x2ac/0x2d0
<4>[  254.193813]  ? acpi_pm_set_device_wakeup+0x1f/0x110
<4>[  254.193842]  ? pnp_bus_suspend+0x10/0x10
<4>[  254.193864]  pnp_bus_resume+0x5e/0x90
<4>[  254.193885]  dpm_run_callback+0x5f/0x240
<4>[  254.193914]  device_resume+0xb2/0x1e0
<4>[  254.193942]  ? pm_dev_err+0x25/0x25
<4>[  254.193974]  dpm_resume+0xea/0x3f0
<4>[  254.194005]  dpm_resume_end+0x8/0x10
<4>[  254.194030]  suspend_devices_and_enter+0x29b/0xaa0
<4>[  254.194066]  pm_suspend.cold.8+0x301/0x34a
<4>[  254.194094]  state_store+0x7b/0xe0
<4>[  254.194124]  kernfs_fop_write_iter+0x11d/0x1c0
<4>[  254.194151]  new_sync_write+0x11d/0x1b0
<4>[  254.194183]  vfs_write+0x265/0x390
<4>[  254.194207]  ksys_write+0x5a/0xd0
<4>[  254.194232]  do_syscall_64+0x33/0x80
<4>[  254.194251]  entry_SYSCALL_64_after_hwframe+0x44/0xae
<4>[  254.194274] RIP: 0033:0x7f07d79691e7
<4>[  254.194293] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
<4>[  254.194312] RSP: 002b:00007ffd9cc2c768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[  254.194337] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f07d79691e7
<4>[  254.194352] RDX: 0000000000000004 RSI: 0000556ebfc63590 RDI: 000000000000000b
<4>[  254.194366] RBP: 0000556ebfc63590 R08: 0000000000000000 R09: 0000000000000004
<4>[  254.194379] R10: 0000556ebf0ec2a6 R11: 0000000000000246 R12: 0000000000000004

which breaks S3-resume on fi-kbl-soraka presumably as that's slow enough
to trigger the alarm during the suspend.

Fixes: 6950d046eb6e ("rtc: cmos: Replace spin_lock_irqsave with spin_lock in hard IRQ")
References: 66e4f4a9cc38 ("rtc: cmos: Use spin_lock_irqsave() in cmos_interrupt()"):
Signed-off-by: Chris Wilson <[email protected]>
Cc: Xiaofei Tan <[email protected]>
Cc: Alexandre Belloni <[email protected]>
Cc: Alessandro Zummo <[email protected]>
Cc: Ville Syrjälä <[email protected]>
Reviewed-by: Ville Syrjälä <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

drm/i915: Get PM ref before accessing HW register

Seeing these errors when GT is likely in suspend state-
"RPM wakelock ref not held during HW access"

Ensure GT is awake before trying to access HW registers. Avoid
reading the register if that is not the case.

Signed-off-by: Vinay Belgaumkar <[email protected]>
Fixes: 41e5c17ebfc2 ("drm/i915/guc/slpc: Sysfs hooks for SLPC")
Reviewed-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: John Harrison <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit f25e3908b9cd4a3fe819e9bdcdde58f20bacb34c)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Release ctx->syncobj on final put, not on ctx close

gem context refcounting is another exercise in least locking design it
seems, where most things get destroyed upon context closure (which can
race with anything really). Only the actual memory allocation and the
locks survive while holding a reference.

This tripped up Jason when reimplementing the single timeline feature
in

commit 00dae4d3d35d4f526929633b76e00b0ab4d3970d
Author: Jason Ekstrand <[email protected]>
Date: Thu Jul 8 10:48:12 2021 -0500

drm/i915: Implement SINGLE_TIMELINE with a syncobj (v4)

We could fix the bug by holding ctx->mutex in execbuf and clear the
pointer (again while holding the mutex) context_close, but it's
cleaner to just make the context object actually invariant over its
_entire_ lifetime. This way any other ioctl that's potentially racing,
but holding a full reference, can still rely on ctx->syncobj being
an immutable pointer. Which without this change, is not the case.

Reviewed-by: Maarten Lankhorst <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Fixes: 00dae4d3d35d ("drm/i915: Implement SINGLE_TIMELINE with a syncobj (v4)")
Cc: Jason Ekstrand <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: "Thomas Hellström" <[email protected]>
Cc: Lionel Landwerlin <[email protected]>
Cc: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit c238980efd3b35af70fc926066cf7440f50a97a9)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/gem: Fix the mman selftest

Using the I915_MMAP_TYPE_FIXED mmap type requires the TTM backend, so
for that mmap type, use __i915_gem_object_create_user() instead of
i915_gem_object_create_internal(), as we really want to tests objects
mmap-able by user-space.

This also means that the out-of-space error happens at object creation
and returns -ENXIO rather than -ENOSPC, so fix the code up to expect
that on out-of-offset-space errors.

Finally only use I915_MMAP_TYPE_FIXED for LMEM and SMEM for now if
testing on LMEM-capable devices. For stolen LMEM, we still take the
same path as for integrated, as that haven't been moved over to TTM yet,
and user-space should not be able to create out of stolen LMEM anyway.

v2:
- Check the presence of the obj->ops->mmap_offset callback rather than
hardcoding the supported mmap regions in can_mmap() (Maarten Lankhorst)

Fixes: 7961c5b60f23 ("drm/i915: Add TTM offset argument to mmap.")
Cc: Maarten Lankhorst <[email protected]>
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 450cede7f3804ca7f8b3da210ebefa61c0958f22)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/guc: drop guc_communication_enabled

The function is only used from within GEM_BUG_ON(), which is causing
warnings with Wunneeded-internal-declaration in some builds. Since the
function is a simple wrapper around a CT function, we can just call the
CT function directly instead.

Fixes: 1fb12c587152 ("drm/i915/guc: skip disabling CTBs before sanitizing the GuC")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daniele Ceraolo Spurio <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: John Harrison <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Signed-off-by: John Harrison <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 5db1856781e45c9610f7652a19cc656b984235e7)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/dp: Use max params for panels < eDP 1.4

Users reported that after commit 2bbd6dba84d4 ("drm/i915: Try to use
fast+narrow link on eDP again and fall back to the old max strategy on
failure"), the screen starts to have wobbly effect.

Commit a5c936add6a2 ("drm/i915/dp: Use slow and wide link training for
everything") doesn't help either, that means the affected eDP 1.2 panels
only work with max params.

So use max params for panels < eDP 1.4 as Windows does to solve the
issue.

v3:
- Do the eDP rev check in intel_edp_init_dpcd()

v2:
- Check eDP 1.4 instead of DPCD 1.1 to apply max params

Cc: [email protected]
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3714
Fixes: 2bbd6dba84d4 ("drm/i915: Try to use fast+narrow link on eDP again and fall back to the old max strategy on failure")
Fixes: a5c936add6a2 ("drm/i915/dp: Use slow and wide link training for everything")
Suggested-by: Ville Syrjälä <[email protected]>
Signed-off-by: Kai-Heng Feng <[email protected]>
Signed-off-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit d7f213c131adf0bec8b731553eb82990cdac265d)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/dp: return proper DPRX link training result

After DPRX link training, intel_dp_link_train_phy() did not
return the training result properly. If link training failed,
i915 driver would not run into link train fallback function.
And no hotplug uevent would be received by user space application.

Fixes: b30edfd8d0b4 ("drm/i915: Switch to LTTPR non-transparent mode link training")
Cc: Ville Syrjala <[email protected]>
Cc: Imre Deak <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Cooper Chiou <[email protected]>
Cc: William Tseng <[email protected]>
Signed-off-by: Lee Shawn C <[email protected]>
Reviewed-by: Imre Deak <[email protected]>
Signed-off-by: Imre Deak <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit dab1b47e57e053b2a02c22ead8e7449f79961335)
Signed-off-by: Jani Nikula <[email protected]>

PM: base: power: don't try to use non-existing RTC for storing data

If there is no legacy RTC device, don't try to use it for storing trace
data across suspend/resume.

Cc: <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>

xen/balloon: use a kernel thread instead a workqueue

Today the Xen ballooning is done via delayed work in a workqueue. This
might result in workqueue hangups being reported in case of large
amounts of memory are being ballooned in one go (here 16GB):

BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
    in-flight: 229:balloon_process
    pending: cache_reap
workqueue events_freezable_power_: flags=0x84
  pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    pending: disk_events_workfn
workqueue mm_percpu_wq: flags=0x8
  pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    pending: vmstat_update
pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43

This can easily be avoided by using a dedicated kernel thread for doing
the ballooning work.

Reported-by: Jan Beulich <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>

io_uring: pin SQPOLL data before unlocking ring lock

We need to re-check sqd->thread after we've dropped the lock. Pin
the sqd before doing the lockdep lock dance, and check if the thread
is alive after that. It's either NULL or alive, as the SQPOLL thread
cannot exit without holding the same sqd->lock.

Reported-and-tested-by: [email protected]
Fixes: fa84693b3c89 ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL")
Signed-off-by: Jens Axboe <[email protected]>

bpf, selftests: Add test case for mixed cgroup v1/v2

Minimal selftest which implements a small BPF policy program to the
connect(2) hook which rejects TCP connection requests to port 60123
with EPERM. This is being attached to a non-root cgroup v2 path. The
test asserts that this works under cgroup v2-only and under a mixed
cgroup v1/v2 environment where net_classid is set in the former case.

Before fix:

  # ./test_progs -t cgroup_v1v2
  test_cgroup_v1v2:PASS:server_fd 0 nsec
  test_cgroup_v1v2:PASS:client_fd 0 nsec
  test_cgroup_v1v2:PASS:cgroup_fd 0 nsec
  test_cgroup_v1v2:PASS:server_fd 0 nsec
  run_test:PASS:skel_open 0 nsec
  run_test:PASS:prog_attach 0 nsec
  test_cgroup_v1v2:PASS:cgroup-v2-only 0 nsec
  run_test:PASS:skel_open 0 nsec
  run_test:PASS:prog_attach 0 nsec
  run_test:PASS:join_classid 0 nsec
  (network_helpers.c:219: errno: None) Unexpected success to connect to server
  test_cgroup_v1v2:FAIL:cgroup-v1v2 unexpected error: -1 (errno 0)
  #27 cgroup_v1v2:FAIL
  Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

After fix:

  # ./test_progs -t cgroup_v1v2
  #27 cgroup_v1v2:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf, selftests: Add cgroup v1 net_cls classid helpers

Minimal set of helpers for net_cls classid cgroupv1 management in order
to set an id, join from a process, initiate setup and teardown. cgroupv2
helpers are left as-is, but reused where possible.

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.

[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Add oversize check before call kvcalloc()

Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") add the
oversize check. When the allocation is larger than what kmalloc() supports,
the following warning triggered:

WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597
Modules linked in:
CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597
Call Trace:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
check_btf_line kernel/bpf/verifier.c:9925 [inline]
check_btf_info kernel/bpf/verifier.c:10049 [inline]
bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759
bpf_prog_load kernel/bpf/syscall.c:2301 [inline]
__sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587
__do_sys_bpf kernel/bpf/syscall.c:4691 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4689 [inline]
__x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported-by: [email protected]
Signed-off-by: Bixuan Cui <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

tools: compiler-gcc.h: Guard error attribute use with __has_attribute

When building objtool with HOSTCC=clang, there are several errors along
the lines of

orc_dump.c:201:28: error: unknown attribute 'error' ignored [-Werror,-Wunknown-attributes]

This occurs after commit 4e59869aa655 ("compiler-gcc.h: drop checks for
older GCC versions"), which removed the GCC_VERSION gating. The removed
version check just so happened to prevent __compiletime_error() from
being defined with clang because it pretends to be GCC 4.2.1 for
compatibility but the error attribute was not added to clang until
14.0.0.

Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") and commit a3f8a30f3f00 ("Compiler Attributes: use
feature checks instead of version checks") refactored the handling of
attributes in the main kernel to avoid situations like this but that
refactoring has never been done for the tools directory.

Refactoring is a rather large undertaking and this has never been an
issue before so instead, just guard the definition of
__compiletime_error() with __has_attribute() so that there are no more
errors.

Fixes: 4e59869aa655 ("compiler-gcc.h: drop checks for older GCC versions")
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'gcc-min-version-5.1' (make gcc-5.1 the minimum version)

Merge patch series from Nick Desaulniers to update the minimum gcc
version to 5.1.

This is some of the left-overs from the merge window that I didn't want
to deal with yesterday, so it comes in after -rc1 but was sent before.

Gcc-4.9 support has been an annoyance for some time, and with -Werror I
had the choice of applying a fairly big patch from Kees Cook to remove a
fair number of initializer warnings (still leaving some), or this patch
series from Nick that just removes the source of the problem.

The initializer cleanups might still be worth it regardless, but
honestly, I preferred just tackling the problem with gcc-4.9 head-on.
We've been more aggressiuve about no longer having to care about
compilers that were released a long time ago, and I think it's been a
good thing.

I added a couple of patches on top to sort out a few left-overs now that
we no longer support gcc-4.x.

As noted by Arnd, as a result of this minimum compiler version upgrade
we can probably change our use of '--std=gnu89' to '--std=gnu11', and
finally start using local loop declarations etc.  But this series does
_not_ yet do that.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/lkml/CAK7LNASs6dvU6D3jL2GG3jW58fXfaj6VNOe55NJnTB8UPuk2pA@mail.gmail.com/
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
* emailed patches from Nick Desaulniers <[email protected]>:
  Drop some straggling mentions of gcc-4.9 as being stale
  compiler_attributes.h: drop __has_attribute() support for gcc4
  vmlinux.lds.h: remove old check for GCC 4.9
  compiler-gcc.h: drop checks for older GCC versions
  Makefile: drop GCC < 5 -fno-var-tracking-assignments workaround
  arm64: remove GCC version check for ARCH_SUPPORTS_INT128
  powerpc: remove GCC version check for UPD_CONSTR
  riscv: remove Kconfig check for GCC version for ARCH_RV64I
  Kconfig.debug: drop GCC 5+ version check for DWARF5
  mm/ksm: remove old GCC 4.9+ check
  compiler.h: drop fallback overflow checkers
  Documentation: raise minimum supported version of GCC to 5.1

Drop some straggling mentions of gcc-4.9 as being stale

Fix up the admin-guide README file to the new gcc-5.1 requirement, and
remove a stale comment about gcc support for the __assume_aligned__
attribute.

Signed-off-by: Linus Torvalds <[email protected]>

cpufreq: intel_pstate: Override parameters if HWP forced by BIOS

If HWP has been already been enabled by BIOS, it may be
necessary to override some kernel command line parameters.
Once it has been enabled it requires a reset to be disabled.

Suggested-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Doug Smythies <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

compiler_attributes.h: drop __has_attribute() support for gcc4

Now that GCC 5.1 is the minimally supported default, the manual
workaround for older gcc versions not having __has_attribute() are no
longer relevant and can be removed.

Signed-off-by: Linus Torvalds <[email protected]>

vmlinux.lds.h: remove old check for GCC 4.9

Now that GCC 5.1 is the minimally supported version of GCC, we can
effectively revert commit 85c2ce9104eb ("sched, vmlinux.lds: Increase
STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")

Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Nick Desaulniers <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler-gcc.h: drop checks for older GCC versions

Now that GCC 5.1 is the minimally supported default, drop the values we
don't use.

Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Makefile: drop GCC < 5 -fno-var-tracking-assignments workaround

Now that GCC 5.1 is the minimally supported version, we can drop this
workaround for older versions of GCC.

Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

arm64: remove GCC version check for ARCH_SUPPORTS_INT128

Now that GCC 5.1 is the minimally supported compiler version, this
Kconfig check is no longer necessary.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

powerpc: remove GCC version check for UPD_CONSTR

Now that GCC 5.1 is the minimum supported version, we can drop this
workaround for older versions of GCC. This adversely affected clang,
too.

Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Segher Boessenkool <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: [email protected]
Signed-off-by: Nick Desaulniers <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

riscv: remove Kconfig check for GCC version for ARCH_RV64I

The minimum supported version of GCC is now 5.1. The check wasn't
correct as written anyways since GCC_VERSION is 0 when CC=clang.

Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: [email protected]
Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Kconfig.debug: drop GCC 5+ version check for DWARF5

Now that the minimum supported version of GCC is 5.1, we no longer need
this Kconfig version check for CONFIG_DEBUG_INFO_DWARF5.

Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/ksm: remove old GCC 4.9+ check

The minimum supported version of GCC has been raised to GCC 5.1.

Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

compiler.h: drop fallback overflow checkers

Once upgrading the minimum supported version of GCC to 5.1, we can drop
the fallback code for !COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW.

This is effectively a revert of commit f0907827a8a9 ("compiler.h: enable
builtin overflow checkers and add fallback code")

Link: https://github.com/ClangBuiltLinux/linux/issues/1438#issuecomment-916745801
Suggested-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Nick Desaulniers <[email protected]>
Acked-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Documentation: raise minimum supported version of GCC to 5.1

commit fad7cd3310db ("nbd: add the check to prevent overflow in
__nbd_ioctl()") raised an issue from the fallback helpers added in
commit f0907827a8a9 ("compiler.h: enable builtin overflow checkers and
add fallback code")

Specifically, the helpers for checking whether the results of a
multiplication overflowed (__unsigned_mul_overflow,
__signed_add_overflow) use the division operator when
!COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW. This is problematic for 64b
operands on 32b hosts.

Also, because the macro is type agnostic, it is very difficult to write
a similarly type generic macro that dispatches to one of:
* div64_s64
* div64_u64
* div_s64
* div_u64

Raising the minimum supported versions allows us to remove all of the
fallback helpers for !COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW, instead
dispatching the compiler builtins.

arm64 has already raised the minimum supported GCC version to 5.1, do
this for all targets now. See the link below for the previous
discussion.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/lkml/CAK7LNASs6dvU6D3jL2GG3jW58fXfaj6VNOe55NJnTB8UPuk2pA@mail.gmail.com/
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reported-by: Stephen Rothwell <[email protected]>
Reported-by: Nathan Chancellor <[email protected]>
Suggested-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y

Commit 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
added an optimised version of __get_user_asm() for x86 using 'asm goto'.

Like the non-optimised code, the 32-bit implementation of 64-bit
get_user() expands to a pair of 32-bit accesses.  Unlike the
non-optimised code, the _original_ pointer is incremented to copy the
high word instead of loading through a new pointer explicitly
constructed to point at a 32-bit type.  Consequently, if the pointer
points at a 64-bit type then we end up loading the wrong data for the
upper 32-bits.

This was observed as a mount() failure in Android targeting i686 after
b0cfcdd9b967 ("d_path: make 'prepend()' fill up the buffer exactly on
overflow") because the call to copy_from_kernel_nofault() from
prepend_copy() ends up in __get_kernel_nofault() and casts the source
pointer to a 'u64 __user *'.  An attempt to mount at "/debug_ramdisk"
therefore ends up failing trying to mount "/debumdismdisk".

Use the existing '__gu_ptr' source pointer to unsigned int for 32-bit
__get_user_asm_u64() instead of the original pointer.

Cc: Bill Wendling <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Greg Kroah-Hartman <[email protected]>
Fixes: 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items

The items passed in the array pointed by the arg parameter
of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation
carry certain semantics: they refer to different io-wq worker categories;
provide IO_WQ_* constants in the UAPI, so these categories can be referenced
in the user space code.

Suggested-by: Jens Axboe <[email protected]>
Complements: 2e480058ddc21ec5 ("io-wq: provide a way to limit max number of workers")
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

dma-debug: prevent an error message from causing runtime problems

For some drivers, that use the DMA API. This error message can be reached
several millions of times per second, causing spam to the kernel's printk
buffer and bringing the CPU usage up to 100% (so, it should be rate
limited). However, since there is at least one driver that is in the
mainline and suffers from the error condition, it is more useful to
err_printk() here instead of just rate limiting the error message (in hopes
that it will make it easier for other drivers that suffer from this issue
to be spotted).

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Jeremy Linton <[email protected]>
Signed-off-by: Hamza Mahfooz <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

nvme: avoid race in shutdown namespace removal

When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.

To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.

Fixes: 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are down")
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Tested-by: Hannes Reinecke <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()

This was intended to limit the number of characters printed from
"subsys->serial" to NVMET_SN_MAX_SIZE. But accidentally the width
specifier was used instead of the precision specifier so it only
affects the alignment and not the number of characters printed.

Fixes: f04064814c2a ("nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

Merge branch 'hns3-fixes'

Guangbin Huang says:

====================
net: hns3: add some fixes for -net

This series adds some fixes for the HNS3 ethernet driver.
====================

Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix the timing issue of VF clearing interrupt sources

Currently, the VF does not clear the interrupt source immediately after
receiving the interrupt. As a result, if the second interrupt task is
triggered when processing the first interrupt task, clearing the
interrupt source before exiting will clear the interrupt sources of the
two tasks at the same time. As a result, no interrupt is triggered for
the second task. The VF detects the missed message only when the next
interrupt is generated.

Clearing it immediately after executing check_evt_cause ensures that:
1. Even if two interrupt tasks are triggered at the same time, they can
be processed.
2. If the second task is triggered during the processing of the first
task and the interrupt source is not cleared, the interrupt is reported
after vector0 is enabled.

Fixes: b90fcc5bd904 ("net: hns3: add reset handling for VF when doing Core/Global/IMP reset")
Signed-off-by: Jiaran Zhang <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix the exception when query imp info

When the command for querying imp info is issued to the firmware,
if the firmware does not support the command, the returned value
of bd num is 0.
Add protection mechanism before alloc memory to prevent apply for
0-length memory.

Fixes: 0b198b0d80ea ("net: hns3: refactor dump m7 info of debugfs")
Signed-off-by: Jiaran Zhang <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: disable mac in flr process

The firmware will not disable mac in flr process. Therefore, the driver
needs to proactively disable mac during flr, which is the same as the
function reset.

Fixes: 35d93a30040c ("net: hns3: adjust the process of PF reset")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: change affinity_mask to numa node range

Currently, affinity_mask is set to a single cpu. As a result,
irqbalance becomes invalid in SUBSET or EXACT mode. To solve
this problem, change affinity_mask to numa node range. In this
way, irqbalance can be performed on the cpu of the numa node.

Fixes: 0812545487ec ("net: hns3: add interrupt affinity support for misc interrupt")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: pad the short tunnel frame before sending to hardware

The hardware cannot handle short tunnel frames below 65 bytes,
and will cause vlan tag missing problem. So pads packet size to
65 bytes for tunnel frames to fix this bug.

Fixes: 3db084d28dc0("net: hns3: Fix for vxlan tx checksum bug")
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: add option to turn off page pool feature

When page pool is added to the hns3 driver, it is always
enabled unconditionally, which means spilt page handling
in the hns3 driver is dead code.

As there is a requirement to test the performance between
spilt page handling in driver and page pool, so add a module
param to support disabling the page pool.

When the page pool is proved to perform better in most case,
the spilt page handling in driver can be removed.

Fixes: 93188e9642c3 ("net: hns3: support skb's frag page recycling based on page pool")
Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lantiq_gswip: Add 200ms assert delay

The delay is especially needed by the xRX300 and xRX330 SoCs. Without
this patch, some phys are sometimes not properly detected.

The patch was tested on BT Home Hub 5A and D-Link DWR-966.

Fixes: a09d042b0862 ("net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330")
Signed-off-by: Aleksander Jan Bajkowski <[email protected]>
Signed-off-by: Martin Blumenstingl <[email protected]>
Acked-by: Hauke Mehrtens <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: delay fib6_sernum increase in fib6_add

only increase fib6_sernum in net namespace after add fib6_info
successfully.

Signed-off-by: zhang kai <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tipc: increase timeout in tipc_sk_enqueue()

In tipc_sk_enqueue() we use hardcoded 2 jiffies to extract
socket buffer from generic queue to particular socket.
The 2 jiffies is too short in case there are other high priority
tasks get CPU cycles for multiple jiffies update. As result, no
buffer could be enqueued to particular socket.

To solve this, we switch to use constant timeout 20msecs.
Then, the function will be expired between 2 jiffies (CONFIG_100HZ)
and 20 jiffies (CONFIG_1000HZ).

Fixes: c637c1035534 ("tipc: resolve race problem at unicast message reception")
Acked-by: Jon Maloy <[email protected]>
Signed-off-by: Hoang Le <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

udp_tunnel: Fix udp_tunnel_nic work-queue type

Turn udp_tunnel_nic work-queue to an ordered work-queue. This queue
holds the UDP-tunnel configuration commands of the different netdevs.
When the netdevs are functions of the same NIC the order of
execution may be crucial.

Problem example:
NIC with 2 PFs, both PFs declare offload quota of up to 3 UDP-ports.
$ifconfig eth2 1.1.1.1/16 up

$ip link add eth2_19503 type vxlan id 5049 remote 1.1.1.2 dev eth2 dstport 19053
$ip link set dev eth2_19503 up

$ip link add eth2_19504 type vxlan id 5049 remote 1.1.1.3 dev eth2 dstport 19054
$ip link set dev eth2_19504 up

$ip link add eth2_19505 type vxlan id 5049 remote 1.1.1.4 dev eth2 dstport 19055
$ip link set dev eth2_19505 up

$ip link add eth2_19506 type vxlan id 5049 remote 1.1.1.5 dev eth2 dstport 19056
$ip link set dev eth2_19506 up

NIC RX port offload infrastructure offloads the first 3 UDP-ports (on
all devices which sets NETIF_F_RX_UDP_TUNNEL_PORT feature) and not
UDP-port 19056. So both PFs gets this offload configuration.

$ip link set dev eth2_19504 down

This triggers udp-tunnel-core to remove the UDP-port 19504 from
offload-ports-list and offload UDP-port 19056 instead.

In this scenario it is important that the UDP-port of 19504 will be
removed from both PFs before trying to add UDP-port 19056. The NIC can
stop offloading a UDP-port only when all references are removed.
Otherwise the NIC may report exceeding of the offload quota.

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Signed-off-by: Aya Levin <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"

This reverts commit 919483096bfe75dda338e98d56da91a263746a0a.

There is only when ip_options_get() return zero need to free.
It already called kfree() when return error.

Fixes: 919483096bfe ("ipv4: fix memory leaks in ip_cmsg_send() callers")
Signed-off-by: Yajun Deng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bnxt_en-fixes'

Michael Chan says:

====================
bnxt_en: Bug fixes.

The first patch fixes an error recovery regression just introduced
about a week ago. The other two patches fix issues related to
freeing rings in the bnxt_close() path under error conditions.
====================

Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Clean up completion ring page arrays completely

We recently changed the completion ring page arrays to be dynamically
allocated to better support the expanded range of ring depths.  The
cleanup path for this was not quite complete.  It might cause the
shutdown path to crash if we need to abort before the completion ring
arrays have been allocated and initialized.

Fix it by initializing the ring_mem->pg_arr to NULL after freeing the
completion ring page array.  Add a check in bnxt_free_ring() to skip
referencing the rmem->pg_arr if it is NULL.

Fixes: 03c7448790b8 ("bnxt_en: Don't use static arrays for completion ring pages")
Reviewed-by: Andy Gospodarek <[email protected]>
Reviewed-by: Edwin Peer <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: make bnxt_free_skbs() safe to call after bnxt_free_mem()

The call to bnxt_free_mem(..., false) in the bnxt_half_open_nic() error
path will deallocate ring descriptor memory via bnxt_free_?x_rings(),
but because irq_re_init is false, the ring info itself is not freed.

To simplify error paths, deallocation functions have generally been
written to be safe when called on unallocated memory. It should always
be safe to call dev_close(), which calls bnxt_free_skbs() a second time,
even in this semi- allocated ring state.

Calling bnxt_free_skbs() a second time with the rings already freed will
cause NULL pointer dereference. Fix it by checking the rings are valid
before proceeding in bnxt_free_tx_skbs() and
bnxt_free_one_rx_ring_skbs().

Fixes: 975bc99a4a39 ("bnxt_en: Refactor bnxt_free_rx_skbs().")
Signed-off-by: Edwin Peer <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix error recovery regression

The recent patch has introduced a regression by not reading the reset
count in the ERROR_RECOVERY async event handler. We may have just
gone through a reset and the reset count has just incremented. If
we don't update the reset count in the ERROR_RECOVERY event handler,
the health check timer will see that the reset count has changed and
will initiate an unintended reset.

Restore the unconditional update of the reset count in
bnxt_async_event_process() if error recovery watchdog is enabled.
Also, update the reset count at the end of the reset sequence to
make it even more robust.

Fixes: 1b2b91831983 ("bnxt_en: Fix possible unintended driver initiated error recovery")
Reviewed-by: Edwin Peer <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

m68k: mvme: Remove overdue #warnings in RTC handling

The warnings were introduced when converting the MVME147 and MVME16x
RTC handling from gettod to hwclk. Replace the #warning by a comment,
and return an error to inform the upper layer that writing to the RTC is
not yet supported.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

m68k: Double cast io functions to unsigned long

m68k builds fail widely with errors such as

arch/m68k/include/asm/raw_io.h:20:19: error:
cast to pointer from integer of different size
arch/m68k/include/asm/raw_io.h:30:32: error:
cast to pointer from integer of different size [-Werror=int-to-p

On m68k, io functions are defined as macros. The problem is seen if the
macro parameter variable size differs from the size of a pointer. Cast
the parameter of all io macros to unsigned long before casting it to
a pointer to fix the problem.

Signed-off-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>

blk-mq: avoid to iterate over stale request

blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: [email protected]
Cc: "Martin K. Petersen" <[email protected]>
Reported-by: luojiaxing <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

io-wq: fix potential race of acct->nr_workers

Given max_worker is 1, and we currently have 1 running and it is
exiting. There may be race like:
io_wqe_enqueue                   worker1
                               no work there and timeout
                               unlock(wqe->lock)
->insert work
                               -->io_worker_exit
lock(wqe->lock)
->if(!nr_workers) //it's still 1
unlock(wqe->lock)
    goto run_cancel
                                  lock(wqe->lock)
                                  nr_workers--
                                  ->dec_running
                                    ->worker creation fails
                                  unlock(wqe->lock)

We enqueued one work but there is no workers, causes hung.

Signed-off-by: Hao Xu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io-wq: code clean of io_wqe_create_worker()

Remove do_create to save a local variable.

Signed-off-by: Hao Xu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: ensure symmetry in handling iter types in loop_rw_iter()

When setting up the next segment, we check what type the iter is and
handle it accordingly. However, when incrementing and processed amount
we do not, and both iter advance and addr/len are adjusted, regardless
of type. Split the increment side just like we do on the setup side.

Fixes: 4017eb91a9e7 ("io_uring: make loop_rw_iter() use original user supplied pointers")
Cc: [email protected]
Reported-by: Valentina Palmiotti <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Linux 5.15-rc1

Merge tag 'perf-tools-for-v5.15-2021-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull more perf tools updates from Arnaldo Carvalho de Melo:

- Add missing fields and remove some duplicate fields when printing a
   perf_event_attr.

- Fix hybrid config terms list corruption.

- Update kernel header copies, some resulted in new kernel features
   being automagically added to 'perf trace' syscall/tracepoint argument
   id->string translators.

- Add a file generated during the documentation build to .gitignore.

- Add an option to build without libbfd, as some distros, like Debian
   consider its ABI unstable.

- Add support to print a textual representation of IBS raw sample data
   in 'perf report'.

- Fix bpf 'perf test' sample mismatch reporting

- Fix passing arguments to stackcollapse report in a 'perf script'
   python script.

- Allow build-id with trailing zeros.

- Look for ImageBase in PE file to compute .text offset.

* tag 'perf-tools-for-v5.15-2021-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (25 commits)
  tools headers UAPI: Update tools's copy of drm.h headers
  tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
  tools headers UAPI: Sync linux/fs.h with the kernel sources
  tools headers UAPI: Sync linux/in.h copy with the kernel sources
  perf tools: Add an option to build without libbfd
  perf tools: Allow build-id with trailing zeros
  perf tools: Fix hybrid config terms list corruption
  perf tools: Factor out copy_config_terms() and free_config_terms()
  perf tools: Fix perf_event_attr__fprintf() missing/dupl. fields
  perf tools: Ignore Documentation dependency file
  perf bpf: Provide a weak btf__load_from_kernel_by_id() for older libbpf versions
  tools include UAPI: Update linux/mount.h copy
  perf beauty: Cover more flags in the  move_mount syscall argument beautifier
  tools headers UAPI: Sync linux/prctl.h with the kernel sources
  tools include UAPI: Sync sound/asound.h copy with the kernel sources
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
  perf report: Add support to print a textual representation of IBS raw sample data
  perf report: Add tools/arch/x86/include/asm/amd-ibs.h
  perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings
  ...

Merge tag 'compiler-attributes-for-linus-v5.15-rc1-v2' of git://github.com/ojeda/linux

Pull compiler attributes updates from Miguel Ojeda:

- Fix __has_attribute(__no_sanitize_coverage__) for GCC 4 (Marco Elver)

- Add Nick as Reviewer for compiler_attributes.h (Nick Desaulniers)

- Move __compiletime_{error|warning} (Nick Desaulniers)

* tag 'compiler-attributes-for-linus-v5.15-rc1-v2' of git://github.com/ojeda/linux:
  compiler_attributes.h: move __compiletime_{error|warning}
  MAINTAINERS: add Nick as Reviewer for compiler_attributes.h
  Compiler Attributes: fix __has_attribute(__no_sanitize_coverage__) for GCC 4

Merge tag 'auxdisplay-for-linus-v5.15-rc1' of git://github.com/ojeda/linux

Pull auxdisplay updates from Miguel Ojeda:
"An assortment of improvements for auxdisplay:

   - Replace symbolic permissions with octal permissions (Jinchao Wang)

   - ks0108: Switch to use module_parport_driver() (Andy Shevchenko)

   - charlcd: Drop unneeded initializers and switch to C99 style (Andy
     Shevchenko)

   - hd44780: Fix oops on module unloading (Lars Poeschel)

   - Add I2C gpio expander example (Ralf Schlatterbeck)"

* tag 'auxdisplay-for-linus-v5.15-rc1' of git://github.com/ojeda/linux:
  auxdisplay: Replace symbolic permissions with octal permissions
  auxdisplay: ks0108: Switch to use module_parport_driver()
  auxdisplay: charlcd: Drop unneeded initializers and switch to C99 style
  auxdisplay: hd44780: Fix oops on module unloading
  auxdisplay: Add I2C gpio expander example

Merge tag 'smp-urgent-2021-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:
"Updates for the SMP and CPU hotplug:

   - Remove DEFINE_SMP_CALL_CACHE_FUNCTION() which is a left over of the
     original hotplug code and now causing trouble with the ARM64 cache
     topology setup due to the pointless SMP function call.

     It's not longer required as the hotplug callbacks are guaranteed to
     be invoked on the upcoming CPU.

   - Remove the deprecated and now unused CPU hotplug functions

   - Rewrite the CPU hotplug API documentation"

* tag 'smp-urgent-2021-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: core-api/cpuhotplug: Rewrite the API section
  cpu/hotplug: Remove deprecated CPU-hotplug functions.
  thermal: Replace deprecated CPU-hotplug functions.
  drivers: base: cacheinfo: Get rid of DEFINE_SMP_CALL_CACHE_FUNCTION()

Merge tag 'char-misc-5.15-rc1-lkdtm' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull misc driver fix from Greg KH:
"Here is a single patch for 5.15-rc1, for the lkdtm misc driver.

  It resolves a build issue that many people were hitting with your
  current tree, and Kees and others felt would be good to get merged
  before -rc1 comes out, to prevent them from having to constantly hit
  it as many development trees restart on -rc1, not older -rc releases.

  It has NOT been in linux-next, but has passed 0-day testing and looks
  'obviously correct' when reviewing it locally :)"

* tag 'char-misc-5.15-rc1-lkdtm' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  lkdtm: Use init_uts_ns.name instead of macros

Merge tag 'for-linus-5.15-1' of git://github.com/cminyard/linux-ipmi

Pull IPMI updates from Corey Minyard:
"A couple of very minor fixes for style and rate limiting.

  Nothing big, but probably needs to go in"

* tag 'for-linus-5.15-1' of git://github.com/cminyard/linux-ipmi:
  char: ipmi: use DEVICE_ATTR helper macro
  ipmi: rate limit ipmi smi_event failure message

Merge tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

- Make sure the idle timer expires in hardirq context, on PREEMPT_RT

- Make sure the run-queue balance callback is invoked only on the
   outgoing CPU

* tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Prevent balance_push() on remote runqueues
  sched/idle: Make the idle timer expire in hard interrupt context

Merge tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

- Fix the futex PI requeue machinery to not return to userspace in
   inconsistent state

- Avoid a potential null pointer dereference in the ww_mutex deadlock
   check

- Other smaller cleanups and optimizations

* tag 'locking_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Fix ww_mutex deadlock check
  futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic()
  futex: Avoid redundant task lookup
  futex: Clarify comment for requeue_pi_wake_futex()
  futex: Prevent inconsistent state and exit race
  futex: Return error code instead of assigning it without effect
  locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT

Merge tag 'timers_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

- Handle negative second values properly when converting a timespec64
to nanoseconds.

* tag 'timers_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Handle negative seconds correctly in timespec64_to_ns()

Merge branch 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull namei updates from Al Viro:
"Clearing fallout from mkdirat in io_uring series. The fix in the
  kern_path_locked() patch plus associated cleanups"

* 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  putname(): IS_ERR_OR_NULL() is wrong here
  namei: Standardize callers of filename_create()
  namei: Standardize callers of filename_lookup()
  rename __filename_parentat() to filename_parentat()
  namei: Fix use after free in kern_path_locked

Merge tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smbfs updates from Steve French:
"cifs/smb3 updates:

   - DFS reconnect fix

   - begin creating common headers for server and client

   - rename the cifs_common directory to smbfs_common to be more
     consistent ie change use of the name cifs to smb (smb3 or smbfs is
     more accurate, as the very old cifs dialect has long been
     superseded by smb3 dialects).

  In the future we can rename the fs/cifs directory to fs/smbfs.

  This does not include the set of multichannel fixes nor the two
  deferred close fixes (they are still being reviewed and tested)"

* tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: properly invalidate cached root handle when closing it
  cifs: move SMB FSCTL definitions to common code
  cifs: rename cifs_common to smbfs_common
  cifs: update FSCTL definitions

net: mana: Prefer struct_size over open coded arithmetic

As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kzalloc() function.

[1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftest: net: fix typo in altname test

If altname deletion of the short alternative name fails, the error
message printed is: "Failed to add short alternative name".
This is obviously a typo, as we are testing altname deletion.

Fix this using a proper error message.

Fixes: f95e6c9c4617 ("selftest: net: add alternative names test")
Signed-off-by: Andrea Claudi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: qca8k: fix kernel panic with legacy mdio mapping

When the mdio legacy mapping is used the mii_bus priv registered by DSA
refer to the dsa switch struct instead of the qca8k_priv struct and
causes a kernel panic. Create dedicated function when the internal
dedicated mdio driver is used to properly handle the 2 different
implementation.

Fixes: 759bafb8a322 ("net: dsa: qca8k: add support for internal phy and internal mdio")
Signed-off-by: Ansuel Smith <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- vduse driver ("vDPA Device in Userspace") supporting emulated virtio
   block devices

- virtio-vsock support for end of record with SEQPACKET

- vdpa: mac and mq support for ifcvf and mlx5

- vdpa: management netlink for ifcvf

- virtio-i2c, gpio dt bindings

- misc fixes and cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
  Documentation: Add documentation for VDUSE
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: Implement an MMU-based software IOTLB
  vdpa: Support transferring virtual addressing during DMA mapping
  vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
  vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
  vhost-iotlb: Add an opaque pointer for vhost IOTLB
  vhost-vdpa: Handle the failure of vdpa_reset()
  vdpa: Add reset callback in vdpa_config_ops
  vdpa: Fix some coding style issues
  file: Export receive_fd() to modules
  eventfd: Export eventfd_wake_count to modules
  iova: Export alloc_iova_fast() and free_iova_fast()
  virtio-blk: remove unneeded "likely" statements
  virtio-balloon: Use virtio_find_vqs() helper
  vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
  vsock_test: update message bounds test for MSG_EOR
  af_vsock: rename variables in receive loop
  virtio/vsock: support MSG_EOR bit processing
  vhost/vsock: support MSG_EOR bit processing
  ...

Merge tag 'riscv-for-linus-5.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

- A pair of defconfig additions, for NVMe and the EFI filesystem
   localization options.

- A larger address space for stack randomization.

- A cleanup to our install rules.

- A DTS update for the Microchip Icicle board, to fix the serial
   console.

- Support for build-time table sorting, which allows us to have
   __ex_table read-only.

* tag 'riscv-for-linus-5.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Move EXCEPTION_TABLE to RO_DATA segment
  riscv: Enable BUILDTIME_TABLE_SORT
  riscv: dts: microchip: mpfs-icicle: Fix serial console
  riscv: move the (z)install rules to arch/riscv/Makefile
  riscv: Improve stack randomisation on RV64
  riscv: defconfig: enable NLS_CODEPAGE_437, NLS_ISO8859_1
  riscv: defconfig: enable BLK_DEV_NVME

Merge branch 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux

Pull coccinelle updates from Julia Lawall:
"These changes update some existing semantic patches with
  respect to some recent changes in the kernel.

  Specifically, the change to kvmalloc.cocci searches for
  kfree_sensitive rather than kzfree, and the change to
  use_after_iter.cocci adds list_entry_is_head as a valid
  use of a list iterator index variable after the end of
  the loop"

* 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
  scripts: coccinelle: allow list_entry_is_head() to use pos
  coccinelle: api: rename kzfree to kfree_sensitive

tools headers UAPI: Update tools's copy of drm.h headers

Picking the changes from:

  17ce9c61c71cbc0d ("drm: document DRM_IOCTL_MODE_RMFB")

Doesn't result in any tooling changes:

  $ tools/perf/trace/beauty/drm_ioctl.sh  > before
  $ cp include/uapi/drm/drm.h tools/include/uapi/drm/drm.h
  $ tools/perf/trace/beauty/drm_ioctl.sh  > after
  $ diff -u before after

Silencing these perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/drm/drm.h' differs from latest version at 'include/uapi/drm/drm.h'
  diff -u tools/include/uapi/drm/drm.h include/uapi/drm/drm.h

Cc: Simon Ser <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers UAPI: Sync drm/i915_drm.h with the kernel sources

To pick the changes in:

  b65a9489730a2494 ("drm/i915/userptr: Probe existence of backing struct pages upon creation")
  ee242ca704d38699 ("drm/i915/guc: Implement GuC priority management")
  81340cf3bddded4f ("drm/i915/uapi: reject set_domain for discrete")
  7961c5b60f23dff5 ("drm/i915: Add TTM offset argument to mmap.")
  aef7b67a79564f6c ("drm/i915/uapi: convert drm_i915_gem_userptr to kernel doc")
  e7737b67ab46ee0e ("drm/i915/uapi: reject caching ioctls for discrete")
  3aa8c57fe25a9247 ("drm/i915/uapi: convert drm_i915_gem_set_domain to kernel doc")
  289f5a72009b8f67 ("drm/i915/uapi: convert drm_i915_gem_caching to kernel doc")
  4a766ae40ec83301 ("drm/i915: Drop the CONTEXT_CLONE API (v2)")
  6ff6d61dd2a943bd ("drm/i915: Drop I915_CONTEXT_PARAM_NO_ZEROMAP")
  fe4751c3d513ff4f ("drm/i915: Drop I915_CONTEXT_PARAM_RINGSIZE")
  577729533cdc4e37 ("drm/i915: Document the Virtual Engine uAPI")
  c649432e86ca677d ("drm/i915: Fix busy ioctl commentary")

That doesn't result in any changes to tooling as no new ioctl were
added (at least not perceived by tools/perf/trace/beauty/drm_ioctl.sh).

Addressing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
  diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h

Cc: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Jason Ekstrand <[email protected]>
Cc: John Harrison <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>