Git Repo - linux.git/log

netfilter: nf_tables: reject table flag and netdev basechain updates

netdev basechain updates are stored in the transaction object hook list.
When setting on the table dormant flag, it iterates over the existing
hooks in the basechain. Thus, skipping the hooks that are being
added/deleted in this transaction, which leaves hook registration in
inconsistent state.

Reject table flag updates in combination with netdev basechain updates
in the same batch:

- Update table flags and add/delete basechain: Check from basechain update
path if there are pending flag updates for this table.
- add/delete basechain and update table flags: Iterate over the transaction
list to search for basechain updates from the table update path.

In both cases, the batch is rejected. Based on suggestion from Florian Westphal.

Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Fixes: 7d937b107108f ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: reject destroy command to remove basechain hooks

Report EOPNOTSUPP if NFT_MSG_DESTROYCHAIN is used to delete hooks in an
existing netdev basechain, thus, only NFT_MSG_DELCHAIN is allowed.

Fixes: 7d937b107108f ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <[email protected]>

modpost: do not make find_tosym() return NULL

As mentioned in commit 397586506c3d ("modpost: Add '.ltext' and
'.ltext.*' to TEXT_SECTIONS"), modpost can result in a segmentation
fault due to a NULL pointer dereference in default_mismatch_handler().

find_tosym() can return the original symbol pointer instead of NULL
if a better one is not found.

This fixes the reported segmentation fault.

Fixes: a23e7584ecf3 ("modpost: unify 'sym' and 'to' in default_mismatch_handler()")
Reported-by: Nathan Chancellor <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

export.h: remove include/asm-generic/export.h

Commit 3a6dd5f614a1 ("riscv: remove unneeded #include
<asm-generic/export.h>") removed the last use of
include/asm-generic/export.h.

This deprecated header can go away.

Signed-off-by: Masahiro Yamada <[email protected]>

kconfig: do not reparent the menu inside a choice block

The boolean 'choice' is used to list exclusively selected config
options.

You must not add a dependency between choice members, because such a
dependency would create an invisible entry.

In the following test case, it is impossible to choose 'C'.

[Test Case 1]

  choice
          prompt "Choose one, but how to choose C?"

  config A
          bool "A"

  config B
          bool "B"

  config C
          bool "C"
          depends on A

  endchoice

Hence, Kconfig shows the following error message:

  Kconfig:1:error: recursive dependency detected!
  Kconfig:1:      choice <choice> contains symbol C
  Kconfig:10:     symbol C is part of choice A
  Kconfig:4:      symbol A is part of choice <choice>
  For a resolution refer to Documentation/kbuild/kconfig-language.rst
  subsection "Kconfig recursive dependency limitations"

However, Kconfig does not report anything for the following similar code:

[Test Case 2]

  choice
         prompt "Choose one, but how to choose B?"

  config A
          bool "A"

  config B
          bool "B"
          depends on A

  config C
          bool "C"

  endchoice

This is because menu_finalize() reparents the menu tree when an entry
depends on the preceding one.

With reparenting, the menu tree:

  choice
   |- A
   |- B
   \- C

... will be transformed into the following structure:

  choice
   |- A
   |  \- B
   \- C

Consequently, Kconfig considers only 'A' and 'C' as choice members.
This behavior is awkward. The second test case should be an error too.

This commit stops reparenting inside a choice.

Signed-off-by: Masahiro Yamada <[email protected]>

Merge tag 'wireless-2024-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Kalle Valo says:

====================
wireless fixes for v6.9-rc2

The first fixes for v6.9. Ping-Ke Shih now maintains a separate tree
for Realtek drivers, document that in the MAINTAINERS. Plenty of fixes
for both to stack and iwlwifi. Our kunit tests were working only on um
architecture but that's fixed now.

* tag 'wireless-2024-03-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (21 commits)
  MAINTAINERS: wifi: mwifiex: add Francesco as reviewer
  kunit: fix wireless test dependencies
  wifi: iwlwifi: mvm: include link ID when releasing frames
  wifi: iwlwifi: mvm: handle debugfs names more carefully
  wifi: iwlwifi: mvm: guard against invalid STA ID on removal
  wifi: iwlwifi: read txq->read_ptr under lock
  wifi: iwlwifi: fw: don't always use FW dump trig
  wifi: iwlwifi: mvm: rfi: fix potential response leaks
  wifi: mac80211: correctly set active links upon TTLM
  wifi: iwlwifi: mvm: Configure the link mapping for non-MLD FW
  wifi: iwlwifi: mvm: consider having one active link
  wifi: iwlwifi: mvm: pick the version of SESSION_PROTECTION_NOTIF
  wifi: mac80211: fix prep_connection error path
  wifi: cfg80211: fix rdev_dump_mpp() arguments order
  wifi: iwlwifi: mvm: disable MLO for the time being
  wifi: cfg80211: add a flag to disable wireless extensions
  wifi: mac80211: fix ieee80211_bss_*_flags kernel-doc
  wifi: mac80211: check/clear fast rx for non-4addr sta VLAN changes
  wifi: mac80211: fix mlme_link_id_dbg()
  MAINTAINERS: wifi: add git tree for Realtek WiFi drivers
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
"Two of these fix syzbot reported issues, and the other fixes a unused
  variable in some configurations"

* tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: fix uninitialized values during inode evict
  fs/9p: remove redundant pointer v9ses
  fs/9p: fix uaf in in v9fs_stat2inode_dotl

Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix race when reading extent buffer and 'uptodate' status is missed
   by one thread (introduced in 6.5)

- do additional validation of devices using major:minor numbers

- zoned mode fixes:
     - use zone-aware super block access during scrub
     - fix use-after-free during device replace (found by KASAN)
     - also delete zones that are 100% unusable to reclaim space

- extent unpinning fixes:
     - fix extent map leak after error handling
     - print correct range in error message

- error code and message updates

* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race in read_extent_buffer_pages()
  btrfs: return accurate error code on open failure in open_fs_devices()
  btrfs: zoned: don't skip block groups with 100% zone unusable
  btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
  btrfs: fix message not properly printing interval when adding extent map
  btrfs: fix warning messages not printing interval at unpin_extent_range()
  btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
  btrfs: validate device maj:min during open
  btrfs: zoned: fix use-after-free in do_zone_finish()
  btrfs: zoned: use zone aware sb location for scrub

Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"Various hotfixes. About half are cc:stable and the remainder address
  post-6.8 issues or aren't considered suitable for backporting.

  zswap figures prominently in the post-6.8 issues - folloup against the
  large amount of changes we have just made to that code.

  Apart from that, all over the map"

* tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  crash: use macro to add crashk_res into iomem early for specific arch
  mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
  selftests/mm: fix ARM related issue with fork after pthread_create
  hexagon: vmlinux.lds.S: handle attributes section
  userfaultfd: fix deadlock warning when locking src and dst VMAs
  tmpfs: fix race on handling dquot rbtree
  selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM
  mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion
  ARM: prctl: reject PR_SET_MDWE on pre-ARMv6
  prctl: generalize PR_SET_MDWE support check to be per-arch
  MAINTAINERS: remove incorrect M: tag for [email protected]
  mm: zswap: fix kernel BUG in sg_init_one
  selftests: mm: restore settings from only parent process
  tools/Makefile: remove cgroup target
  mm: cachestat: fix two shmem bugs
  mm: increase folio batch size
  mm,page_owner: fix recursion
  mailmap: update entry for Leonard Crestez
  init: open /initrd.image with O_LARGEFILE
  selftests/mm: Fix build with _FORTIFY_SOURCE
  ...

drm/vmwgfx: Create debugfs ttm_resource_manager entry only if needed

The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the
corresponding ttm_resource_manager is not allocated.
This leads to a crash when trying to read from this file.

Add a check to create mob_ttm, system_mob_ttm, and gmr_ttm debug file
only when the corresponding ttm_resource_manager is allocated.

crash> bt
PID: 3133409  TASK: ffff8fe4834a5000  CPU: 3    COMMAND: "grep"
#0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3
#1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a
#2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1
#3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1
#4 [ffffb954506b3c70] no_context at ffffffffb2a7e913
#5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c
#6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887
#7 [ffffb954506b3d40] page_fault at ffffffffb360116e
    [exception RIP: ttm_resource_manager_debug+0x11]
    RIP: ffffffffc04afd11  RSP: ffffb954506b3df0  RFLAGS: 00010246
    RAX: ffff8fe41a6d1200  RBX: 0000000000000000  RCX: 0000000000000940
    RDX: 0000000000000000  RSI: ffffffffc04b4338  RDI: 0000000000000000
    RBP: ffffb954506b3e08   R8: ffff8fee3ffad000   R9: 0000000000000000
    R10: ffff8fe41a76a000  R11: 0000000000000001  R12: 00000000ffffffff
    R13: 0000000000000001  R14: ffff8fe5bb6f3900  R15: ffff8fe41a6d1200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm]
#9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3
    RIP: 00007f4c4eda8985  RSP: 00007ffdbba9e9f8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000000000037e000  RCX: 00007f4c4eda8985
    RDX: 000000000037e000  RSI: 00007f4c41573000  RDI: 0000000000000003
    RBP: 000000000037e000   R8: 0000000000000000   R9: 000000000037fe30
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f4c41573000
    R13: 0000000000000003  R14: 00007f4c41572010  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

Signed-off-by: Jocelyn Falempe <[email protected]>
Fixes: af4a25bbe5e7 ("drm/vmwgfx: Add debugfs entries for various ttm resource managers")
Cc: <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

bpf: update BPF LSM designated reviewer list

Adding myself in place of both Brendan and Florent as both have since
moved on from working on the BPF LSM and will no longer be devoting
their time to maintaining the BPF LSM.

Signed-off-by: Matt Bobrowski <[email protected]>
Acked-by: KP Singh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

NFSD: CREATE_SESSION must never cache NFS4ERR_DELAY replies

There are one or two cases where CREATE_SESSION returns
NFS4ERR_DELAY in order to force the client to wait a bit and try
CREATE_SESSION again. However, after commit e4469c6cc69b ("NFSD: Fix
the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in
the CREATE_SESSION slot. Thus, when the client resends the
CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY
response rather than actually executing the request and properly
recording its outcome. This blocks the client from making further
progress.

RFC 8881 Section 15.1.1.3 says:
> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.

Neither the original NFSD code nor the discussion in section 18.36.4
refer explicitly to this important requirement, so I missed it.

Note also that not only must the server not cache NFS4ERR_DELAY, but
it has to not advance the CREATE_SESSION slot sequence number so
that it can properly recognize and accept the client's retry.

Reported-by: Dai Ngo <[email protected]>
Fixes: e4469c6cc69b ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

cifs: Fix duplicate fscache cookie warnings

fscache emits a lot of duplicate cookie warnings with cifs because the
index key for the fscache cookies does not include everything that the
cifs_find_inode() function does. The latter is used with iget5_locked() to
distinguish between inodes in the local inode cache.

Fix this by adding the creation time and file type to the fscache cookie
key.

Additionally, add a couple of comments to note that if one is changed the
other must be also.

Signed-off-by: David Howells <[email protected]>
Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
cc: Shyam Prasad N <[email protected]>
cc: Rohith Surabattula <[email protected]>
cc: Jeff Layton <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
Signed-off-by: Steve French <[email protected]>

Merge tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixlet from Masami Hiramatsu:

- tracing/probes: initialize a 'val' local variable with zero.

   This variable is read by FETCH_OP_ST_EDATA in a loop, and is
   initialized by FETCH_OP_ARG in the same loop. Since this
   initialization is not obvious, smatch warns about it.

   Explicitly initializing 'val' with zero fixes this warning.

* tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: probes: Fix to zero initialize a local variable

Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

- Fix selftests to conform to the TAP output format (Muhammad Usama
   Anjum)

- Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)

- Replace deprecated strncpy usage (Justin Stitt)

- Replace another /bin/sh instance in selftests

* tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt: replace deprecated strncpy
  exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
  selftests/exec: Convert remaining /bin/sh to /bin/bash
  selftests/exec: execveat: Improve debug reporting
  selftests/exec: recursion-depth: conform test to TAP format output
  selftests/exec: load_address: conform test to TAP format output
  selftests/exec: binfmt_script: Add the overall result line according to TAP

Merge branch 'check-bloom-filter-map-value-size'

Andrei Matei says:

====================
Check bloom filter map value size

v1->v2:
- prepend a patch addressing the bloom map specifically
- change low-level rejection error to EFAULT, to indicate a bug
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Protect against int overflow for stack access size

This patch re-introduces protection against the size of access to stack
memory being negative; the access size can appear negative as a result
of overflowing its signed int representation. This should not actually
happen, as there are other protections along the way, but we should
protect against it anyway. One code path was missing such protections
(fixed in the previous patch in the series), causing out-of-bounds array
accesses in check_stack_range_initialized(). This patch causes the
verification of a program with such a non-sensical access size to fail.

This check used to exist in a more indirect way, but was inadvertendly
removed in a833a17aeac7.

Fixes: a833a17aeac7 ("bpf: Fix verification of indirect var-off stack access")
Reported-by: [email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/bpf/CAADnVQLORV5PT0iTAhRER+iLBTkByCYNBYyvBSgjN1T31K+gOw@mail.gmail.com/
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Andrei Matei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Check bloom filter map value size

This patch adds a missing check to bloom filter creating, rejecting
values above KMALLOC_MAX_SIZE. This brings the bloom map in line with
many other map types.

The lack of this protection can cause kernel crashes for value sizes
that overflow int's. Such a crash was caught by syzkaller. The next
patch adds more guard-rails at a lower level.

Signed-off-by: Andrei Matei <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Fix build errors due to new UIO_MEM_DMA_COHERENT mess

Commit 576882ef5e7f ("uio: introduce UIO_MEM_DMA_COHERENT type")
introduced a new use-case for 'struct uio_mem' where the 'mem' field now
contains a kernel virtual address when 'memtype' is set to
UIO_MEM_DMA_COHERENT.

That in turn causes build errors, because 'mem' is of type
'phys_addr_t', and a virtual address is a pointer type. When the code
just blindly uses cast to mix the two, it caused problems when
phys_addr_t isn't the same size as a pointer - notably on 32-bit
architectures with PHYS_ADDR_T_64BIT.

The proper thing to do would probably be to use a union member, and not
have any casts, and make the 'mem' member be a union of 'mem.physaddr'
and 'mem.vaddr', based on 'memtype'.

This is not that proper thing. This is just fixing the ugly casts to be
even uglier, but at least not cause build errors on 32-bit platforms
with 64-bit physical addresses.

Reported-by: Guenter Roeck <[email protected]>
Fixes: 576882ef5e7f ("uio: introduce UIO_MEM_DMA_COHERENT type")
Fixes: 7722151e4651 ("uio_pruss: UIO_MEM_DMA_COHERENT conversion")
Fixes: 019947805a8d ("uio_dmem_genirq: UIO_MEM_DMA_COHERENT conversion")
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: Nilesh Javali <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Fix memory leak in posix_clock_open()

If the clk ops.open() function returns an error, we don't release the
pccontext we allocated for this clock.

Re-organize the code slightly to make it all more obvious.

Reported-by: Rohit Keshri <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Fixes: 60c6946675fc ("posix-clock: introduce posix_clock_context concept")
Cc: Jakub Kicinski <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bpf: fix warning for crash_kexec

With [1], crash dump specific code is moved out of CONFIG_KEXEC_CORE
and placed under CONFIG_CRASH_DUMP, where it is more appropriate.
And since CONFIG_KEXEC & !CONFIG_CRASH_DUMP build option is supported
with that, it led to the below warning:

"WARN: resolve_btfids: unresolved symbol crash_kexec"

Fix it by using the appropriate #ifdef.

[1] https://lore.kernel.org/all/20240124051254 [email protected]/

Acked-by: Baoquan He <[email protected]>
Fixes: 02aff8480533 ("crash: split crash dumping code out from kexec_core.c")
Acked-by: Jiri Olsa <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Hari Bathini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

thermal: devfreq_cooling: Fix perf state when calculate dfc res_util

The issue occurs when the devfreq cooling device uses the EM power model
and the get_real_power() callback is provided by the driver.

The EM power table is sorted ascending，can't index the table by cooling
device state，so convert cooling state to performance state by
dfc->max_state - dfc->capped_state.

Fixes: 615510fe13bd ("thermal: devfreq_cooling: remove old power model and use EM")
Cc: 5.11+ <[email protected]> # 5.11+
Signed-off-by: Ye Zhang <[email protected]>
Reviewed-by: Dhruva Gole <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

MAINTAINERS: Add co-maintainers for time[rs]

Anna-Maria and Frederic are working in this area for years. Volunteer them
into co-maintainer roles.

While at it bring the file lists up to date.

Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Anna-Maria Behnsen <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

drm/amdgpu: fix deadlock while reading mqd from debugfs

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082]        __might_fault+0x58/0x80
[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082]  Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082]        CPU0                    CPU1
[ 1318.038101][ T1082]        ----                    ----
[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082]  *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082]  #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082]  <TASK>
[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043856][ T1082]  __might_fault+0x58/0x80
[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.046201][ T1082]  ? lockdep_hardirqs_on+0x7d/0x100
[ 1318.046487][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.046773][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047057][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047337][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047611][ T1082]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
[ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
[ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
[ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
[ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
[ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
[ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
[ 1318.050638][ T1082]  </TASK>

amdgpu_debugfs_mqd_read() holds a reservation when it calls
put_user(), which may fault and acquire the mmap_sem. This violates
the established locking order.

Bounce the mqd data through a kernel buffer to get put_user() out of
the illegal section.

Fixes: 445d85e3c1df ("drm/amdgpu: add debugfs interface for reading MQDs")
Cc: [email protected] # v6.5+
Reviewed-by: Shashank Sharma <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: enable UMSCH 4.0.6

Share same codes with 4.0.5 and enable collaborate mode for VPE.

Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Veerabadhran Gopalakrishnan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu/umsch: update UMSCH 4.0 FW interface

Align with FW changes.

Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Veerabadhran Gopalakrishnan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Set DCN351 BB and IP the same as DCN35

[WHY & HOW]
DCN351 and DCN35 should use the same bounding box and IP settings.

Cc: Mario Limonciello <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: [email protected]
Reviewed-by: Jun Lei <[email protected]>
Acked-by: Alex Hung <[email protected]>
Signed-off-by: Xi Liu <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Fix bounds check for dcn35 DcfClocks

[Why]
NumFclkLevelsEnabled is used for DcfClocks bounds check
instead of designated NumDcfClkLevelsEnabled.
That can cause array index out-of-bounds access.

[How]
Use designated variable for dcn35 DcfClocks bounds check.

Fixes: a8edc9cc0b14 ("drm/amd/display: Fix array-index-out-of-bounds in dcn35_clkmgr")
Cc: Mario Limonciello <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: [email protected]
Reviewed-by: Sun peng Li <[email protected]>
Acked-by: Tom Chung <[email protected]>
Signed-off-by: Roman Li <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Remove MPC rate control logic from DCN30 and above

[Why]
MPC flow rate control is not needed for DCN30 and above. Current logic
that uses it can result in underflow for certain edge cases (such as
DSC N422 + ODM combine + 422 left edge pixel).

[How]
Remove MPC flow rate control logic and programming for DCN30 and above.

Cc: Mario Limonciello <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: [email protected]
Reviewed-by: Wenjing Liu <[email protected]>
Acked-by: Tom Chung <[email protected]>
Signed-off-by: George Shen <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: fix a dereference of a NULL pointer

[why&how]
In some platform out_transfer_func may not be popualted. We need to check
for null before dereferencing it.

Fixes: d2dea1f14038 ("drm/amd/display: Generalize new minimal transition path")
Reviewed-by: Alvin Lee <[email protected]>
Acked-by: Tom Chung <[email protected]>
Signed-off-by: Wenjing Liu <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Send DTBCLK disable message on first commit

[Why]
Previous patch to allow DTBCLK disable didn't address boot case. Driver
thinks DTBCLK is disabled by default, so we don't send disable message to
PMFW. DTBCLK is then enabled at idle desktop on boot, burning power.

[How]
Set dtbclk_en to true on boot so that disable message is sent during first
commit.

Fixes: 27750e176a4f ("drm/amd/display: Allow DTBCLK disable for DCN35")
Reviewed-by: Charlene Liu <[email protected]>
Acked-by: Tom Chung <[email protected]>
Signed-off-by: Taimur Hassan <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Update dcn351 to latest dcn35 config

[why & how]
There were some fixes in dcn35 that need
to be ported over to dcn351 to prevent any
regression.

Signed-off-by: Sung Joon Kim <[email protected]>
Reviewed-by: Liu, Xi (Alex) <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: fix IPX enablement

We need to re-enable idle power optimizations after entering PSR. Since,
we get kicked out of idle power optimizations before entering PSR
(entering PSR requires us to write to DCN registers, which isn't allowed
while we are in IPS).

Fixes: a9b1a4f684b3 ("drm/amd/display: Add more checks for exiting idle in DC")
Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Nicholas Kazlauskas <[email protected]>
Signed-off-by: Hamza Mahfooz <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd: Flush GFXOFF requests in prepare stage

If the system hasn't entered GFXOFF when suspend starts it can cause
hangs accessing GC and RLC during the suspend stage.

Cc: <[email protected]> # 6.1.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback")
Cc: <[email protected]> # 6.1.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks")
Cc: <[email protected]> # 6.1.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()")
Cc: <[email protected]> # 6.1.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend")
Cc: <[email protected]> # 6.6.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback")
Cc: <[email protected]> # 6.6.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks")
Cc: <[email protected]> # 6.6.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()")
Cc: <[email protected]> # 6.6.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend")
Cc: <[email protected]> # 6.1+
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3132
Fixes: ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdkfd: range check cp bad op exception interrupts

Due to a CP interrupt bug, bad packet garbage exception codes are raised.
Do a range check so that the debugger and runtime do not receive garbage
codes.
Update the user api to guard exception code type checking as well.

Signed-off-by: Jonathan Kim <[email protected]>
Tested-by: Jesse Zhang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revert "drm/amd/display: Fix sending VSC (+ colorimetry) packets for DP/eDP displays without PSR"

This causes flicker on a bunch of eDP panels. The info_packet code
also caused regressions on other OSes that we haven't' seen on Linux
yet, but that is likely due to the fact that we haven't had a chance
to test those environments on Linux.

We'll need to revisit this.

This reverts commit 202260f64519e591b5cd99626e441b6559f571a3.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3207
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3151
Signed-off-by: Harry Wentland <[email protected]>
Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdkfd: fix TLB flush after unmap for GFX9.4.2

TLB flush after unmap accidentially was removed on
gfx9.4.2. It is to add it back.

Signed-off-by: Eric Huang <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdgpu/vpe: power on vpe when hw_init

To fix mode2 reset failure.
Should power on VPE when hw_init.

Signed-off-by: Peyton Lee <[email protected]>
Reviewed-by: Lang Yu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: increase bb clock for DCN351

[Why and how]

Bounding box clocks for DCN351 should be increased as per request

Reviewed-by: Swapnil Patel <[email protected]>
Acked-by: Wayne Lin <[email protected]>
Signed-off-by: Xi Liu <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Prevent crash when disable stream

[Why]
Disabling stream encoder invokes a function that no longer exists.

[How]
Check if the function declaration is NULL in disable stream encoder.

Cc: Mario Limonciello <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: [email protected]
Reviewed-by: Charlene Liu <[email protected]>
Acked-by: Wayne Lin <[email protected]>
Signed-off-by: Chris Park <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Increase Z8 watermark times.

Increase Z8 watermark times from 210->250us and 320->350us.

Reviewed-by: Nicholas Kazlauskas <[email protected]>
Acked-by: Wayne Lin <[email protected]>
Signed-off-by: Natanel Roizenman <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdkfd: Check cgroup when returning DMABuf info

Check cgroup permissions when returning DMA-buf info and
based on cgroup info return the GPU id of the GPU that have
access to the BO.

Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/swsmu: add smu 14.0.1 vcn and jpeg msg

add new vcn and jpeg msg

v2: squash in updates (Alex)
v3: rework code for better compat with other smu14.x variants (Alex)

Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: lima1002 <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

selftests: netdevsim: set test timeout to 10 minutes

The longest running netdevsim test, nexthop.sh, currently takes
5 min to finish. Around 260s to be exact, and 310s on a debug kernel.
The default timeout in selftest is 45sec, so we need an explicit
config. Give ourselves some headroom and use 10min.

Commit under Fixes isn't really to "blame" but prior to that
netdevsim tests weren't integrated with kselftest infra
so blaming the tests themselves doesn't seem right, either.

Fixes: 8ff25dac88f6 ("netdevsim: add Makefile for selftests")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: wan: framer: Add missing static inline qualifiers

Compilation with CONFIG_GENERIC_FRAMER disabled lead to the following
warnings:
  framer.h:184:16: warning: no previous prototype for function 'framer_get' [-Wmissing-prototypes]
  184 | struct framer *framer_get(struct device *dev, const char *con_id)
  framer.h:184:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
  184 | struct framer *framer_get(struct device *dev, const char *con_id)
  framer.h:189:6: warning: no previous prototype for function 'framer_put' [-Wmissing-prototypes]
  189 | void framer_put(struct device *dev, struct framer *framer)
  framer.h:189:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
  189 | void framer_put(struct device *dev, struct framer *framer)

Add missing 'static inline' qualifiers for these functions.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Fixes: 82c944d05b1a ("net: wan: Add framer framework support")
Cc: [email protected]
Signed-off-by: Herve Codina <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ALSA: hda/tas2781: remove useless dev_dbg from playback_hook

The debug message "Playback action not supported: action" is not useful,
because the action was previously printed, and the list of supported
actions are intentional.

Remove the debug statement from the default switch case.

Signed-off-by: Gergo Koteles <[email protected]>
Message-ID: <8b9546db6c92dea4476a7247a88d56248c2ba8c2.1711469583 [email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/tas2781: add debug statements to kcontrols

Sometimes it is useful to examine the timing of kcontrol events.

Add debug statements to each kcontrol.

Signed-off-by: Gergo Koteles <[email protected]>
Message-ID: <18ff4b0caab90a2dacf907e62346fd5079a9eb1a.1711469583 [email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/tas2781: add locks to kcontrols

The rcabin.profile_cfg_id, cur_prog, cur_conf, force_fwload_status
variables are acccessible from multiple threads and therefore require
locking.

Fixes: 5be27f1e3ec9 ("ALSA: hda/tas2781: Add tas2781 HDA driver")
CC: [email protected]
Signed-off-by: Gergo Koteles <[email protected]>
Message-ID: <e35b867f6fe5fa1f869dd658a0a1f2118b737f57.1711469583 [email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/tas2781: remove digital gain kcontrol

The "Speaker Digital Gain" kcontrol controls the TAS2781_DVC_LVL (0x1A)
register. Unfortunately the tas2563 does not have DVC_LVL, but has
INT_MASK0 in 0x1A, which has been misused so far.

Since commit c1947ce61ff4 ("ALSA: hda/realtek: tas2781: enable subwoofer
volume control") the volume of the tas2781 amplifiers can be controlled
by the master volume, so this digital gain kcontrol is not needed.

Remove it.

Fixes: 5be27f1e3ec9 ("ALSA: hda/tas2781: Add tas2781 HDA driver")
CC: [email protected]
Signed-off-by: Gergo Koteles <[email protected]>
Message-ID: <741fc21db994efd58f83e7aef38931204961e5b2.1711469583 [email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: aoa: avoid false-positive format truncation warning

clang warns about what it interprets as a truncated snprintf:

sound/aoa/soundbus/i2sbus/core.c:171:6: error: 'snprintf' will always be truncated; specified size is 6, but format string expands to at least 7 [-Werror,-Wformat-truncation-non-kprintf]

The actual problem here is that it does not understand the special
%pOFn format string and assumes that it is a pointer followed by
the string "OFn", which would indeed not fit.

Slightly increasing the size of the buffer to its natural alignment
avoids the warning, as it is now long enough for the correct and
the incorrect interprations.

Fixes: b917d58dcfaa ("ALSA: aoa: Convert to using %pOFn instead of device_node.name")
Signed-off-by: Arnd Bergmann <[email protected]>
Message-ID: <20240326223825.4084412 [email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-03-25 (ice, ixgbe, igc)

This series contains updates to ice, ixgbe, and igc drivers.

Steven fixes incorrect casting of bitmap type for ice driver.

Jesse fixes memory corruption issue with suspend flow on ice.

Przemek adds GFP_ATOMIC flag to avoid sleeping in IRQ context for ixgbe.

Kurt Kanzenbach removes no longer valid comment on igc.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igc: Remove stale comment about Tx timestamping
  ixgbe: avoid sleeping allocation in ixgbe_ipsec_vf_add_sa()
  ice: fix memory corruption bug with suspend and rebuild
  ice: Refactor FW data type and fix bitmap casting issue
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mlxbf_gige: call request_irq() after NAPI initialized

The mlxbf_gige driver encounters a NULL pointer exception in
mlxbf_gige_open() when kdump is enabled.  The sequence to reproduce
the exception is as follows:
a) enable kdump
b) trigger kdump via "echo c > /proc/sysrq-trigger"
c) kdump kernel executes
d) kdump kernel loads mlxbf_gige module
e) the mlxbf_gige module runs its open() as the
   the "oob_net0" interface is brought up
f) mlxbf_gige module will experience an exception
   during its open(), something like:

     Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
     Mem abort info:
       ESR = 0x0000000086000004
       EC = 0x21: IABT (current EL), IL = 32 bits
       SET = 0, FnV = 0
       EA = 0, S1PTW = 0
       FSC = 0x04: level 0 translation fault
     user pgtable: 4k pages, 48-bit VAs, pgdp=00000000e29a4000
     [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
     Internal error: Oops: 0000000086000004 [#1] SMP
     CPU: 0 PID: 812 Comm: NetworkManager Tainted: G           OE     5.15.0-1035-bluefield #37-Ubuntu
     Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.6.0.13024 Jan 19 2024
     pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
     pc : 0x0
     lr : __napi_poll+0x40/0x230
     sp : ffff800008003e00
     x29: ffff800008003e00 x28: 0000000000000000 x27: 00000000ffffffff
     x26: ffff000066027238 x25: ffff00007cedec00 x24: ffff800008003ec8
     x23: 000000000000012c x22: ffff800008003eb7 x21: 0000000000000000
     x20: 0000000000000001 x19: ffff000066027238 x18: 0000000000000000
     x17: ffff578fcb450000 x16: ffffa870b083c7c0 x15: 0000aaab010441d0
     x14: 0000000000000001 x13: 00726f7272655f65 x12: 6769675f6662786c
     x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa870b0842398
     x8 : 0000000000000004 x7 : fe5a48b9069706ea x6 : 17fdb11fc84ae0d2
     x5 : d94a82549d594f35 x4 : 0000000000000000 x3 : 0000000000400100
     x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000066027238
     Call trace:
      0x0
      net_rx_action+0x178/0x360
      __do_softirq+0x15c/0x428
      __irq_exit_rcu+0xac/0xec
      irq_exit+0x18/0x2c
      handle_domain_irq+0x6c/0xa0
      gic_handle_irq+0xec/0x1b0
      call_on_irq_stack+0x20/0x2c
      do_interrupt_handler+0x5c/0x70
      el1_interrupt+0x30/0x50
      el1h_64_irq_handler+0x18/0x2c
      el1h_64_irq+0x7c/0x80
      __setup_irq+0x4c0/0x950
      request_threaded_irq+0xf4/0x1bc
      mlxbf_gige_request_irqs+0x68/0x110 [mlxbf_gige]
      mlxbf_gige_open+0x5c/0x170 [mlxbf_gige]
      __dev_open+0x100/0x220
      __dev_change_flags+0x16c/0x1f0
      dev_change_flags+0x2c/0x70
      do_setlink+0x220/0xa40
      __rtnl_newlink+0x56c/0x8a0
      rtnl_newlink+0x58/0x84
      rtnetlink_rcv_msg+0x138/0x3c4
      netlink_rcv_skb+0x64/0x130
      rtnetlink_rcv+0x20/0x30
      netlink_unicast+0x2ec/0x360
      netlink_sendmsg+0x278/0x490
      __sock_sendmsg+0x5c/0x6c
      ____sys_sendmsg+0x290/0x2d4
      ___sys_sendmsg+0x84/0xd0
      __sys_sendmsg+0x70/0xd0
      __arm64_sys_sendmsg+0x2c/0x40
      invoke_syscall+0x78/0x100
      el0_svc_common.constprop.0+0x54/0x184
      do_el0_svc+0x30/0xac
      el0_svc+0x48/0x160
      el0t_64_sync_handler+0xa4/0x12c
      el0t_64_sync+0x1a4/0x1a8
     Code: bad PC value
     ---[ end trace 7d1c3f3bf9d81885 ]---
     Kernel panic - not syncing: Oops: Fatal exception in interrupt
     Kernel Offset: 0x2870a7a00000 from 0xffff800008000000
     PHYS_OFFSET: 0x80000000
     CPU features: 0x0,000005c1,a3332a5a
     Memory Limit: none
     ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

The exception happens because there is a pending RX interrupt before the
call to request_irq(RX IRQ) executes.  Then, the RX IRQ handler fires
immediately after this request_irq() completes. The RX IRQ handler runs
"napi_schedule()" before NAPI is fully initialized via "netif_napi_add()"
and "napi_enable()", both which happen later in the open() logic.

The logic in mlxbf_gige_open() must fully initialize NAPI before any calls
to request_irq() execute.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: David Thompson <[email protected]>
Reviewed-by: Asmaa Mnebhi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'tls-recvmsg-fixes'

Sabrina Dubroca says:

====================
tls: recvmsg fixes

The first two fixes are again related to async decrypt. The last one
is unrelated but I stumbled upon it while reading the code.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tls: get psock ref after taking rxlock to avoid leak

At the start of tls_sw_recvmsg, we take a reference on the psock, and
then call tls_rx_reader_lock. If that fails, we return directly
without releasing the reference.

Instead of adding a new label, just take the reference after locking
has succeeded, since we don't need it before.

Fixes: 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/fe2ade22d030051ce4c3638704ed58b67d0df643.1711120964.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: tls: add test with a partially invalid iov

Make sure that we don't return more bytes than we actually received if
the userspace buffer was bogus. We expect to receive at least the rest
of rec1, and possibly some of rec2 (currently, we don't, but that
would be ok).

Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/720e61b3d3eab40af198a58ce2cd1ee019f0ceb1.1711120964.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

tls: adjust recv return with async crypto and failed copy to userspace

process_rx_list may not copy as many bytes as we want to the userspace
buffer, for example in case we hit an EFAULT during the copy. If this
happens, we should only count the bytes that were actually copied,
which may be 0.

Subtracting async_copy_bytes is correct in both peek and !peek cases,
because decrypted == async_copy_bytes + peeked for the peek case: peek
is always !ZC, and we can go through either the sync or async path. In
the async case, we add chunk to both decrypted and
async_copy_bytes. In the sync case, we add chunk to both decrypted and
peeked. I missed that in commit 6caaf104423d ("tls: fix peeking with
sync+async decryption").

Fixes: 4d42cd6bc2ac ("tls: rx: fix return value for async crypto")
Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/1b5a1eaab3c088a9dd5d9f1059ceecd7afe888d1.1711120964.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

tls: recv: process_rx_list shouldn't use an offset with kvec

Only MSG_PEEK needs to copy from an offset during the final
process_rx_list call, because the bytes we copied at the beginning of
tls_sw_recvmsg were left on the rx_list. In the KVEC case, we removed
data from the rx_list as we were copying it, so there's no need to use
an offset, just like in the normal case.

Fixes: 692d7b5d1f91 ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Sabrina Dubroca <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/e5487514f828e0347d2b92ca40002c62b58af73d.1711120964.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>

RAS: Avoid build errors when CONFIG_DEBUG_FS=n

A new helper was introduced for RAS modules to be able to get the RAS
subsystem debugfs root directory. The helper is defined in debugfs.c
which is only built when CONFIG_DEBUG_FS=y.

However, it's possible that the modules would include debugfs support
for optional functionality. One current example is the fmpm module. In
this case, a build error will occur when CONFIG_RAS_FMPM is selected and
CONFIG_DEBUG_FS=n.

Add an inline helper function stub for the CONFIG_DEBUG_FS=n case as the
fmpm module can function without the debugfs functionality too.

Fixes: 9d2b6fa09d15 ("RAS: Export helper to get ras_debugfs_dir")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218640
Reported-by: anthony s. knowles <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: anthony s. knowles <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

smb3: add trace event for mknod

Add trace points to help debug mknod and mkfifo:

   smb3_mknod_done
   smb3_mknod_enter
   smb3_mknod_err

Example output:

      TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
         | |         |   |||||     |         |
    mkfifo-6163    [003] .....   960.425558: smb3_mknod_enter: xid=12 sid=0xb55130f6 tid=0x46e6241c path=\fifo1
    mkfifo-6163    [003] .....   960.432719: smb3_mknod_done: xid=12 sid=0xb55130f6 tid=0x46e6241c

Reviewed-by: Bharath SM <[email protected]>
Reviewed-by: Meetakshi Setiya <[email protected]>
Signed-off-by: Steve French <[email protected]>

crash: use macro to add crashk_res into iomem early for specific arch

There are regression reports[1][2] that crashkernel region on x86_64 can't
be added into iomem tree sometime. This causes the later failure of kdump
loading.

This happened after commit 4a693ce65b18 ("kdump: defer the insertion of
crashkernel resources") was merged.

Even though, these reported issues are proved to be related to other
component, they are just exposed after above commmit applied, I still
would like to keep crashk_res and crashk_low_res being added into iomem
early as before because the early adding has been always there on x86_64
and working very well. For safety of kdump, Let's change it back.

Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that
only ARCH defining the macro can have the early adding
crashk_res/_low_res into iomem. Then define
HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it.

Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res
handling which was mistakenly added back in commit 85fcde402db1 ("kexec:
split crashkernel reservation code out from crash_core.c").

[1]
[PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
https://lore.kernel.org/all/[email protected]/T/#u

[2]
Question about Address Range Validation in Crash Kernel Allocation
https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u

Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv
Fixes: 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources")
Signed-off-by: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Li Huafei <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices

Zhongkun He reports data corruption when combining zswap with zram.

The issue is the exclusive loads we're doing in zswap. They assume
that all reads are going into the swapcache, which can assume
authoritative ownership of the data and so the zswap copy can go.

However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try to
bypass the swapcache. This results in an optimistic read of the swap data
into a page that will be dismissed if the fault fails due to races. In
this case, zswap mustn't drop its authoritative copy.

Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Zhongkun He <[email protected]>
Tested-by: Zhongkun He <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Barry Song <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Chris Li <[email protected]>
Cc: <[email protected]> [6.5+]
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: fix ARM related issue with fork after pthread_create

Following issue was observed while running the uffd-unit-tests selftest
on ARM devices. On x86_64 no issues were detected:

pthread_create followed by fork caused deadlock in certain cases wherein
fork required some work to be completed by the created thread. Used
synchronization to ensure that created thread's start function has started
before invoking fork.

[[email protected]: refactored to use atomic_bool]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 760aee0b71e3 ("selftests/mm: add tests for RO pinning vs fork()")
Signed-off-by: Lokesh Gidra <[email protected]>
Signed-off-by: Edward Liaw <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hexagon: vmlinux.lds.S: handle attributes section

After the linked LLVM change, the build fails with
CONFIG_LD_ORPHAN_WARN_LEVEL="error", which happens with allmodconfig:

ld.lld: error: vmlinux.a(init/main.o):(.hexagon.attributes) is being placed in '.hexagon.attributes'

Handle the attributes section in a similar manner as arm and riscv by
adding it after the primary ELF_DETAILS grouping in vmlinux.lds.S, which
fixes the error.

Link: https://lkml.kernel.org/r/20240319-hexagon-handle-attributes-section-vmlinux-lds-s-v1-1-59855dab8872@kernel.org
Fixes: 113616ec5b64 ("hexagon: select ARCH_WANT_LD_ORPHAN_WARN")
Link: https://github.com/llvm/llvm-project/commit/31f4b329c8234fab9afa59494d7f8bdaeaefeaad
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Brian Cain <[email protected]>
Cc: Bill Wendling <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

userfaultfd: fix deadlock warning when locking src and dst VMAs

Use down_read_nested() to avoid the warning.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 867a43a34ff8 ("userfaultfd: use per-vma locks in userfaultfd operations")
Reported-by: [email protected]
Signed-off-by: Lokesh Gidra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jann Horn <[email protected]> [Bug #2]
Cc: Kalesh Singh <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Nicolas Geoffray <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

tmpfs: fix race on handling dquot rbtree

A syzkaller reproducer found a race while attempting to remove dquot
information from the rb tree.

Fetching the rb_tree root node must also be protected by the
dqopt->dqio_sem, otherwise, giving the right timing, shmem_release_dquot()
will trigger a warning because it couldn't find a node in the tree, when
the real reason was the root node changing before the search starts:

Thread 1 Thread 2
- shmem_release_dquot() - shmem_{acquire,release}_dquot()

- fetch ROOT - Fetch ROOT

- acquire dqio_sem
- wait dqio_sem

- do something, triger a tree rebalance
- release dqio_sem

- acquire dqio_sem
- start searching for the node, but
from the wrong location, missing
the node, and triggering a warning.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: eafc474e2029 ("shmem: prepare shmem quota infrastructure")
Signed-off-by: Carlos Maiolino <[email protected]>
Reported-by: Ubisectech Sirius <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM

The sigbus-wp test requires the UFFD_FEATURE_WP_HUGETLBFS_SHMEM flag for
shmem and hugetlb targets. Otherwise it is not backwards compatible with
kernels <5.19 and fails with EINVAL.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 73c1ea939b65 ("selftests/mm: move uffd sig/events tests into uffd unit tests")
Signed-off-by: Edward Liaw <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Peter Xu <[email protected]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion

Kent forwards this bug report of zswap re-entering the block layer
from an IO request allocation and locking up:

[10264.128242] sysrq: Show Blocked State
[10264.128268] task:kworker/20:0H   state:D stack:0     pid:143   tgid:143   ppid:2      flags:0x00004000
[10264.128271] Workqueue: bcachefs_io btree_write_submit [bcachefs]
[10264.128295] Call Trace:
[10264.128295]  <TASK>
[10264.128297]  __schedule+0x3e6/0x1520
[10264.128303]  schedule+0x32/0xd0
[10264.128304]  schedule_timeout+0x98/0x160
[10264.128308]  io_schedule_timeout+0x50/0x80
[10264.128309]  wait_for_completion_io_timeout+0x7f/0x180
[10264.128310]  submit_bio_wait+0x78/0xb0
[10264.128313]  swap_writepage_bdev_sync+0xf6/0x150
[10264.128317]  zswap_writeback_entry+0xf2/0x180
[10264.128319]  shrink_memcg_cb+0xe7/0x2f0
[10264.128322]  __list_lru_walk_one+0xb9/0x1d0
[10264.128325]  list_lru_walk_one+0x5d/0x90
[10264.128326]  zswap_shrinker_scan+0xc4/0x130
[10264.128327]  do_shrink_slab+0x13f/0x360
[10264.128328]  shrink_slab+0x28e/0x3c0
[10264.128329]  shrink_one+0x123/0x1b0
[10264.128331]  shrink_node+0x97e/0xbc0
[10264.128332]  do_try_to_free_pages+0xe7/0x5b0
[10264.128333]  try_to_free_pages+0xe1/0x200
[10264.128334]  __alloc_pages_slowpath.constprop.0+0x343/0xde0
[10264.128337]  __alloc_pages+0x32d/0x350
[10264.128338]  allocate_slab+0x400/0x460
[10264.128339]  ___slab_alloc+0x40d/0xa40
[10264.128345]  kmem_cache_alloc+0x2e7/0x330
[10264.128348]  mempool_alloc+0x86/0x1b0
[10264.128349]  bio_alloc_bioset+0x200/0x4f0
[10264.128352]  bio_alloc_clone+0x23/0x60
[10264.128354]  alloc_io+0x26/0xf0 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128361]  dm_submit_bio+0xb8/0x580 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128366]  __submit_bio+0xb0/0x170
[10264.128367]  submit_bio_noacct_nocheck+0x159/0x370
[10264.128368]  bch2_submit_wbio_replicas+0x21c/0x3a0 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128391]  btree_write_submit+0x1cf/0x220 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128406]  process_one_work+0x178/0x350
[10264.128408]  worker_thread+0x30f/0x450
[10264.128409]  kthread+0xe5/0x120

The zswap shrinker resumes the swap_writepage()s that were intercepted
by the zswap store. This will enter the block layer, and may even
enter the filesystem depending on the swap backing file.

Make it respect GFP_NOIO and GFP_NOFS.

Link: https://lore.kernel.org/linux-mm/rc4pk2r42oyvjo4dc62z6sovquyllq56i5cdgcaqbd7wy3hfzr@n4nbxido3fme/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Reported-by: Jérôme Poulin <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Cc: [email protected] [v6.8]
Signed-off-by: Andrew Morton <[email protected]>

ARM: prctl: reject PR_SET_MDWE on pre-ARMv6

On v5 and lower CPUs we can't provide MDWE protection, so ensure we fail
any attempt to enable it via prctl(PR_SET_MDWE).

Previously such an attempt would misleadingly succeed, leading to any
subsequent mmap(PROT_READ|PROT_WRITE) or execve() failing unconditionally
(the latter somewhat violently via force_fatal_sig(SIGSEGV) due to
READ_IMPLIES_EXEC).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zev Weiss <[email protected]>
Cc: <[email protected]> [6.3+]
Cc: Borislav Petkov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ondrej Mosnacek <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Russell King (Oracle) <[email protected]>
Cc: Sam James <[email protected]>
Cc: Stefan Roesch <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

prctl: generalize PR_SET_MDWE support check to be per-arch

Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".

I noticed after a recent kernel update that my ARM926 system started
segfaulting on any execve() after calling prctl(PR_SET_MDWE). After some
investigation it appears that ARMv5 is incapable of providing the
appropriate protections for MDWE, since any readable memory is also
implicitly executable.

The prctl_set_mdwe() function already had some special-case logic added
disabling it on PARISC (commit 793838138c15, "prctl: Disable
prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that
check to use an arch_*() function, and (2) adds a corresponding override
for ARM to disable MDWE on pre-ARMv6 CPUs.

With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and
subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can
succeed instead of unconditionally failing; on ARMv6 the prctl works as it
did previously.

[0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/

This patch (of 2):

There exist systems other than PARISC where MDWE may not be feasible to
support; rather than cluttering up the generic code with additional
arch-specific logic let's add a generic function for checking MDWE support
and allow each arch to override it as needed.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zev Weiss <[email protected]>
Acked-by: Helge Deller <[email protected]> [parisc]
Cc: Borislav Petkov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ondrej Mosnacek <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Russell King (Oracle) <[email protected]>
Cc: Sam James <[email protected]>
Cc: Stefan Roesch <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: <[email protected]> [6.3+]
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: remove incorrect M: tag for [email protected]

The [email protected] mailing list should only be listed under the
L: (List) tag in the MAINTAINERS file. However, it was incorrectly listed
under both L: and M: (Maintainers) tags, which is not accurate. Remove
the M: tag for [email protected] in the MAINTAINERS file to reflect
the correct categorization.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kuan-Wei Chiu <[email protected]>
Cc: Ching-Chun (Jim) Huang <[email protected]>
Cc: Matthew Sakai <[email protected]>
Cc: Michael Sclafani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: zswap: fix kernel BUG in sg_init_one

sg_init_one() relies on linearly mapped low memory for the safe
utilization of virt_to_page().  Otherwise, we trigger a kernel BUG,

kernel BUG at include/linux/scatterlist.h:187!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 2997 Comm: syz-executor198 Not tainted 6.8.0-syzkaller #0
Hardware name: ARM-Versatile Express
PC is at sg_set_buf include/linux/scatterlist.h:187 [inline]
PC is at sg_init_one+0x9c/0xa8 lib/scatterlist.c:143
LR is at sg_init_table+0x2c/0x40 lib/scatterlist.c:128
Backtrace:
[<807e16ac>] (sg_init_one) from [<804c1824>] (zswap_decompress+0xbc/0x208 mm/zswap.c:1089)
r7:83471c80 r6:def6d08c r5:844847d0 r4:ff7e7ef4
[<804c1768>] (zswap_decompress) from [<804c4468>] (zswap_load+0x15c/0x198 mm/zswap.c:1637)
r9:8446eb80 r8:8446eb80 r7:8446eb84 r6:def6d08c r5:00000001 r4:844847d0
[<804c430c>] (zswap_load) from [<804b9644>] (swap_read_folio+0xa8/0x498 mm/page_io.c:518)
r9:844ac800 r8:835e6c00 r7:00000000 r6:df955d4c r5:00000001 r4:def6d08c
[<804b959c>] (swap_read_folio) from [<804bb064>] (swap_cluster_readahead+0x1c4/0x34c mm/swap_state.c:684)
r10:00000000 r9:00000007 r8:df955d4b r7:00000000 r6:00000000 r5:00100cca
r4:00000001
[<804baea0>] (swap_cluster_readahead) from [<804bb3b8>] (swapin_readahead+0x68/0x4a8 mm/swap_state.c:904)
r10:df955eb8 r9:00000000 r8:00100cca r7:84476480 r6:00000001 r5:00000000
r4:00000001
[<804bb350>] (swapin_readahead) from [<8047cde0>] (do_swap_page+0x200/0xcc4 mm/memory.c:4046)
r10:00000040 r9:00000000 r8:844ac800 r7:84476480 r6:00000001 r5:00000000
r4:df955eb8
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_pte_fault mm/memory.c:5301 [inline])
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (__handle_mm_fault mm/memory.c:5439 [inline])
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_mm_fault+0x3d8/0x12b8 mm/memory.c:5604)
r10:00000040 r9:842b3900 r8:7eb0d000 r7:84476480 r6:7eb0d000 r5:835e6c00
r4:00000254
[<8047e2ec>] (handle_mm_fault) from [<80215d28>] (do_page_fault+0x148/0x3a8 arch/arm/mm/fault.c:326)
r10:00000007 r9:842b3900 r8:7eb0d000 r7:00000207 r6:00000254 r5:7eb0d9b4
r4:df955fb0
[<80215be0>] (do_page_fault) from [<80216170>] (do_DataAbort+0x38/0xa8 arch/arm/mm/fault.c:558)
r10:7eb0da7c r9:00000000 r8:80215be0 r7:df955fb0 r6:7eb0d9b4 r5:00000207
r4:8261d0e0
[<80216138>] (do_DataAbort) from [<80200e3c>] (__dabt_usr+0x5c/0x60 arch/arm/kernel/entry-armv.S:427)
Exception stack(0xdf955fb0 to 0xdf955ff8)
5fa0:                                     00000000 00000000 22d5f800 0008d158
5fc0: 00000000 7eb0d9a4 00000000 00000109 00000000 00000000 7eb0da7c 7eb0da3c
5fe0: 00000000 7eb0d9a0 00000001 00066bd4 00000010 ffffffff
r8:824a9044 r7:835e6c00 r6:ffffffff r5:00000010 r4:00066bd4
Code: 1a000004 e1822003 e8860094 e89da8f0 (e7f001f2)
---[ end trace 0000000000000000 ]---
----------------
Code disassembly (best guess):
   0: 1a000004 bne 0x18
   4: e1822003 orr r2, r2, r3
   8: e8860094 stm r6, {r2, r4, r7}
   c: e89da8f0 ldm sp, {r4, r5, r6, r7, fp, sp, pc}
* 10: e7f001f2 udf #18 <-- trapping instruction

Consequently, we have two choices: either employ kmap_to_page() alongside
sg_set_page(), or resort to copying high memory contents to a temporary
buffer residing in low memory.  However, considering the introduction of
the WARN_ON_ONCE in commit ef6e06b2ef870 ("highmem: fix kmap_to_page() for
kmap_local_page() addresses"), which specifically addresses high memory
concerns, it appears that memcpy remains the sole viable option.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 270700dd06ca ("mm/zswap: remove the memcpy if acomp is not sleepable")
Signed-off-by: Barry Song <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Tested-by: [email protected]
Acked-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests: mm: restore settings from only parent process

The atexit() is called from parent process as well as forked processes.
Hence the child restores the settings at exit while the parent is still
executing. Fix this by checking pid of atexit() calling process and only
restore THP number from parent process.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c23ea61726d5 ("selftests/mm: protection_keys: save/restore nr_hugepages settings")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Tested-by: Joey Gouly <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

tools/Makefile: remove cgroup target

The tools/cgroup directory no longer contains a Makefile.  This patch
updates the top-level tools/Makefile to remove references to building and
installing cgroup components.  This change reflects the current structure
of the tools directory and fixes the build failure when building tools in
the top-level directory.

linux/tools$ make cgroup
  DESCEND cgroup
make[1]: *** No targets specified and no makefile found.  Stop.
make: *** [Makefile:73: cgroup] Error 2

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Cong Liu <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Dmitry Rokosov <[email protected]>
Cc: Cong Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: cachestat: fix two shmem bugs

When cachestat on shmem races with swapping and invalidation, there
are two possible bugs:

1) A swapin error can have resulted in a poisoned swap entry in the
   shmem inode's xarray. Calling get_shadow_from_swap_cache() on it
   will result in an out-of-bounds access to swapper_spaces[].

   Validate the entry with non_swap_entry() before going further.

2) When we find a valid swap entry in the shmem's inode, the shadow
   entry in the swapcache might not exist yet: swap IO is still in
   progress and we're before __remove_mapping; swapin, invalidation,
   or swapoff have removed the shadow from swapcache after we saw the
   shmem swap entry.

   This will send a NULL to workingset_test_recent(). The latter
   purely operates on pointer bits, so it won't crash - node 0, memcg
   ID 0, eviction timestamp 0, etc. are all valid inputs - but it's a
   bogus test. In theory that could result in a false "recently
   evicted" count.

   Such a false positive wouldn't be the end of the world. But for
   code clarity and (future) robustness, be explicit about this case.

   Bail on get_shadow_from_swap_cache() returning NULL.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cf264e1329fb ("cachestat: implement cachestat syscall")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Chengming Zhou <[email protected]> [Bug #1]
Reported-by: Jann Horn <[email protected]> [Bug #2]
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: <[email protected]> [v6.5+]
Signed-off-by: Andrew Morton <[email protected]>

mm: increase folio batch size

On a 104 thread, 2 socket Skylake system, Intel report a 4.7% performance
reduction with will-it-scale page_fault2.  This was due to reducing the
size of the batch from 32 to 15.  Increasing the folio batch size from 15
to 31 gives a performance increase of 12.5% relative to the original, or
17.2% relative to the reduced performance commit.

The penalty of this commit is an additional 128 bytes of stack usage.  Six
folio_batches are also allocated from percpu memory in cpu_fbatches so
that will be an additional 768 bytes of percpu memory (per CPU).  Tim Chen
originally submitted a patch like this in 2020:
https://lore.kernel.org/linux-mm/d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965 [email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 99fbb6bfc16f ("mm: make folios_put() the basis of release_pages()")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Yujie Liu <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Andrew Morton <[email protected]>

mm,page_owner: fix recursion

Prior to 217b2119b9e2 ("mm,page_owner: implement the tracking of the
stacks count") the only place where page_owner could potentially go into
recursion due to its need of allocating more memory was in save_stack(),
which ends up calling into stackdepot code with the possibility of
allocating memory.

We made sure to guard against that by signaling that the current task was
already in page_owner code, so in case a recursion attempt was made, we
could catch that and return dummy_handle.

After above commit, a new place in page_owner code was introduced where we
could allocate memory, meaning we could go into recursion would we take
that path.

Make sure to signal that we are in page_owner in that codepath as well.
Move the guard code into two helpers {un}set_current_in_page_owner() and
use them prior to calling in the two functions that might allocate memory.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count")
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mailmap: update entry for Leonard Crestez

Put my personal email first because NXP employment ended some time ago.
Also add my old intel email address.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Leonard Crestez <[email protected]>
Cc: Florian Fainelli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

init: open /initrd.image with O_LARGEFILE

If initrd data is larger than 2Gb, we'll eventually fail to write to the
/initrd.image file when we hit that limit, unless O_LARGEFILE is set.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Sperbeck <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests/mm: Fix build with _FORTIFY_SOURCE

Add missing flags argument to open(2) call with O_CREAT.

Some tests fail to compile if _FORTIFY_SOURCE is defined (to any valid
value) (together with -O), resulting in similar error messages such as:

  In file included from /usr/include/fcntl.h:342,
                   from gup_test.c:1:
  In function 'open',
      inlined from 'main' at gup_test.c:206:10:
  /usr/include/bits/fcntl2.h:50:11: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
     50 |           __open_missing_mode ();
        |           ^~~~~~~~~~~~~~~~~~~~~~

_FORTIFY_SOURCE is enabled by default in some distributions, so the
tests are not built by default and are skipped.

open(2) man-page warns about missing flags argument: "if it is not
supplied, some arbitrary bytes from the stack will be applied as the
file mode."

Link: https://lkml.kernel.org/r/[email protected]
Fixes: aeb85ed4f41a ("tools/testing/selftests/vm/gup_benchmark.c: allow user specified file")
Fixes: fbe37501b252 ("mm: huge_memory: debugfs for file-backed THP split")
Fixes: c942f5bd17b3 ("selftests: soft-dirty: add test for mprotect")
Signed-off-by: Vitaly Chikunov <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/memory: fix missing pte marker for !page on pte zaps

Commit 0cf18e839f64 of large folio zap work broke uffd-wp.  Now mm's uffd
unit test "wp-unpopulated" will trigger this WARN_ON_ONCE().

The WARN_ON_ONCE() asserts that an VMA cannot be registered with
userfaultfd-wp if it contains a !normal page, but it's actually possible.
One example is an anonymous vma, register with uffd-wp, read anything will
install a zero page.  Then when zap on it, this should trigger.

What's more, removing that WARN_ON_ONCE may not be enough either, because
we should also not rely on "whether it's a normal page" to decide whether
pte marker is needed.  For example, one can register wr-protect over some
DAX regions to track writes when UFFD_FEATURE_WP_ASYNC enabled, in which
case it can have page==NULL for a devmap but we may want to keep the
marker around.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0cf18e839f64 ("mm/memory: handle !page case in zap_present_pte() separately")
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

block: don't reject too large max_user_sectors in blk_validate_limits

We already cap down the actual max_sectors to the max of the hardware
and user limit, so don't reject the configuration.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

block: Make blk_rq_set_mixed_merge() static

Since commit 8e756373d7c8 ("block: Move bio merge related functions into
blk-merge.c"), blk_rq_set_mixed_merge() has only been referenced in
blk-merge.c, so make it static.

Signed-off-by: John Garry <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice

The reason is described in 5033ad566016 ("MIPS: move unselectable
entries out of the "CPU type" choice").

At the same time, commit 101bd58fde10 ("MIPS: Add support for Mobileye
EyeQ5") introduced another unselectable choice member.

(In fact, 5033ad566016 and 101bd58fde10 have the same commit time.)

Signed-off-by: Masahiro Yamada <[email protected]>

cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig

Commit 5d7107c72796 ("perf: CXL Performance Monitoring Unit driver")
added the config entries for CXL_PMU in drivers/cxl/Kconfig and
drivers/perf/Kconfig, so it can be toggled from multiple locations:

[1] Device Drivers
     -> PCI support
       -> CXL (Compute Expres Link) Devices
         -> CXL Performance Monitoring Unit

[2] Device Drivers
     -> Performance monitor support
       -> CXL Performance Monitoring Unit

This complicates things, and nobody else does this.

I kept the one in drivers/perf/Kconfig because CONFIG_CXL_PMU controls
the compilation of drivers/perf/cxl_pmu.c.

Acked-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

Merge tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

- Prevent scheduling in an atomic context when printk() takes over the
console flushing duty

* tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: Update @console_may_schedule in console_trylock_spinning()

Merge tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux

Pull pwm fix from Uwe Kleine-König:
"This contains a single fix for a regression introduced in v5.18-rc1
which made the img pwm driver fail to bind"

* tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
pwm: img: fix pwm clock lookup

btrfs: fix race in read_extent_buffer_pages()

There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning.  Unfortunately, there is
a racey interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |    (1)   |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |    (2)   |
             |          |    (1)

When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress.  This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256
    block=8550954455682405139 owner mismatch, have 11858205567642294356
    expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 6.5+
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tavian Barnes <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ minor update of changelog ]
Signed-off-by: David Sterba <[email protected]>

btrfs: return accurate error code on open failure in open_fs_devices()

When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.

In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.

Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: zoned: don't skip block groups with 100% zone unusable

Commit f4a9f219411f ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.

So if a block group is 100% zone_unusable it is skipped from the deletion
step.

In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.

Fixes: f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon")
CC: [email protected] # 6.1+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()

At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.

Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix message not properly printing interval when adding extent map

At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.

Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix warning messages not printing interval at unpin_extent_range()

At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.

Fixes: 00deaf04df35 ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()

At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.

Fixes: c03c89f821e5 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: validate device maj:min during open

Boris managed to create a device capable of changing its maj:min without
altering its device path.

Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.

Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.

To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.

CC: [email protected] # 6.7+
Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <[email protected]>
Co-developed-by: Boris Burkov <[email protected]>
Reviewed-by: Boris Burkov <[email protected]>#
Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: zoned: fix use-after-free in do_zone_finish()

Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.

BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
==================================================================
BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007

CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x90
  print_report+0xcf/0x670
  ? __virt_addr_valid+0x200/0x3e0
  kasan_report+0xd8/0x110
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  do_zone_finish+0x91a/0xb90 [btrfs]
  btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
  ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
  ? btrfs_put_root+0x2d/0x220 [btrfs]
  ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
  cleaner_kthread+0x21e/0x380 [btrfs]
  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
  kthread+0x2e3/0x3c0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x70
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

Allocated by task 3493983:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  __kasan_kmalloc+0xaa/0xb0
  btrfs_alloc_device+0xb3/0x4e0 [btrfs]
  device_list_add.constprop.0+0x993/0x1630 [btrfs]
  btrfs_scan_one_device+0x219/0x3d0 [btrfs]
  btrfs_control_ioctl+0x26e/0x310 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

Freed by task 3494056:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  kasan_save_free_info+0x3f/0x60
  poison_slab_object+0x102/0x170
  __kasan_slab_free+0x32/0x70
  kfree+0x11b/0x320
  btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
  btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
  btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
  btrfs_ioctl+0xb27/0x57d0 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

The buggy address belongs to the object at ffff8881543c8000
  which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 96 bytes inside of
  freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)

The buggy address belongs to the physical page:
page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().

The sequence of events is as follows:

btrfs_dev_replace_start
  btrfs_scrub_dev
   btrfs_dev_replace_finishing
    btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
    btrfs_rm_dev_replace_free_srcdev
     btrfs_free_device                              <-- device freed

cleaner_kthread
btrfs_delete_unused_bgs
  btrfs_zone_finish
   do_zone_finish              <-- refers the freed device

The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.

The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.

Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.

Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.

Many thanks to Shinichiro for analyzing the bug.

Reported-by: Shinichiro Kawasaki <[email protected]>
CC: [email protected] # 6.8
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Merge branch 'there-are-some-bugfix-for-the-hns3-ethernet-driver'

Jijie Shao says:

====================
There are some bugfix for the HNS3 ethernet driver
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: hns3: mark unexcuted loopback test result as UNEXECUTED

Currently, loopback test may be skipped when resetting, but the test
result will still show as 'PASS', because the driver doesn't set
ETH_TEST_FL_FAILED flag. Fix it by setting the flag and
initializating the value to UNEXECUTED.

Fixes: 4c8dab1c709c ("net: hns3: reconstruct function hns3_self_test")
Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Michal Kubiak <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: hns3: fix kernel crash when devlink reload during pf initialization

The devlink reload process will access the hardware resources,
but the register operation is done before the hardware is initialized.
So, processing the devlink reload during initialization may lead to kernel
crash. This patch fixes this by taking devl_lock during initialization.

Fixes: b741269b2759 ("net: hns3: add support for registering devlink for PF")
Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: hns3: fix index limit to support all queue stats

Currently, hns hardware supports more than 512 queues and the index limit
in hclge_comm_tqps_update_stats is wrong. So this patch removes it.

Fixes: 287db5c40d15 ("net: hns3: create new set of common tqp stats APIs for PF and VF reuse")
Signed-off-by: Jie Wang <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Michal Kubiak <[email protected]>
Reviewed-by: Kalesh AP <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

x86/sev: Skip ROM range scans and validation for SEV-SNP guests

SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.

The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:

* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
  CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
  attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
  falls in the ROM range) prior to validation.

  For example, Project Oak Stage 0 provides a minimal guest firmware
  that currently meets these configuration conditions, meaning guests
  booting atop Oak Stage 0 firmware encounter a problematic call chain
  during dmi_setup() -> dmi_scan_machine() that results in a crash
  during boot if SEV-SNP is enabled.

* With regards to necessity, SEV-SNP guests generally read garbage
  (which changes across boots) from the ROM range, meaning these scans
  are unnecessary. The guest reads garbage because the legacy ROM range
  is unencrypted data but is accessed via an encrypted PMD during early
  boot (where the PMD is marked as encrypted due to potentially mapping
  actually-encrypted data in other PMD-contained ranges).

In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.

Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.

While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).

As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.

In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.

In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.

  [ bp: Massage commit message, remove "we"s. ]

Fixes: 9704c07bf9f7 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]