Git Repo - linux.git/log

Merge tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:

- fix possible CPUs setup logical-physical CPU mapping, in order to
   avoid CPU hotplug issue

- fix some KASAN bugs

- fix AP booting issue in VM mode

- some trivial cleanups

* tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: Fix AP booting issue in VM mode
  LoongArch: Add WriteCombine shadow mapping in KASAN
  LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
  LoongArch: Make KASAN work with 5-level page-tables
  LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
  LoongArch: Fix early_numa_add_cpu() usage for FDT systems
  LoongArch: For all possible CPUs setup logical-physical CPU mapping

Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
  singletons"

* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: swapfile: fix cluster reclaim work crash on rotational devices
  selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
  mm/thp: fix deferred split queue not partially_mapped: fix
  mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
  nommu: pass NULL argument to vma_iter_prealloc()
  ocfs2: fix UBSAN warning in ocfs2_verify_volume()
  nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
  nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
  mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  mm: count zeromap read and set for swapout and swapin

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fix from Michael Tsirkin:
"A last minute mlx5 bugfix"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

mm: swapfile: fix cluster reclaim work crash on rotational devices

syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:

> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  __list_del_entry_valid include/linux/list.h:124 [inline]
>  __list_del_entry include/linux/list.h:215 [inline]
>  list_move_tail include/linux/list.h:310 [inline]
>  swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
>  swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile
is on a rotational device.  In this case, clusters aren't actually used,
and si->full_clusters is not initialized.  Daan's report is from qemu, so
likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are
actually in use.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://github.com/systemd/systemd/issues/35044
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Reported-by: [email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Daan De Meyer <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.

For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.

Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Cc: [email protected]
Signed-off-by: Si-Wei Liu <[email protected]>
Signed-off-by: Dragos Tatulea <[email protected]>
Message-Id: <20241021134040 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"x86 and selftests fixes.

  x86:

   - When emulating a guest TLB flush for a nested guest, flush vpid01,
     not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if
     L2 and L1 are sharing VPID '0' (from L1's perspective).

   - Fix a bug in the SNP initialization flow where KVM would return '0'
     to userspace instead of -errno on failure.

   - Move the Intel PT virtualization (i.e. outputting host trace to
     host buffer and guest trace to guest buffer) behind CONFIG_BROKEN.

   - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START

   - Fix a bug where KVM fails to inject an interrupt from the IRR after
     KVM_SET_LAPIC.

  Selftests:

   - Increase the timeout for the memslot performance selftest to avoid
     false failures on arm64 and nested x86 platforms.

   - Fix a goof in the guest_memfd selftest where a for-loop initialized
     a bit mask to zero instead of BIT(0).

   - Disable strict aliasing when building KVM selftests to prevent the
     compiler from treating things like "u64 *" to "uint64_t *" cases as
     undefined behavior, which can lead to nasty, hard to debug
     failures.

   - Force -march=x86-64-v2 for KVM x86 selftests if and only if the
     uarch is supported by the compiler.

   - Fix broken compilation of kvm selftests after a header sync in
     tools/"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
  KVM: x86: Unconditionally set irr_pending when updating APICv state
  kvm: svm: Fix gctx page leak on invalid inputs
  KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB
  KVM: SVM: Propagate error from snp_guest_req_init() to userspace
  KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
  KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported
  KVM: selftests: Disable strict aliasing
  KVM: selftests: fix unintentional noop test in guest_memfd_test.c
  KVM: selftests: memslot_perf_test: increase guest sync timeout

Merge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

- fix warnings about duplicate slab cache names

* tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-cache: fix warnings about duplicate slab caches
dm-bufio: fix warnings about duplicate slab caches

Merge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity fixes from Mimi Zohar:
"One bug fix, one performance improvement, and the use of
  static_assert:

   - The bug fix addresses "only a cosmetic change" commit, which didn't
     take into account the original 'ima' template definition.

  - The performance improvement limits the atomic_read()"

* tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Use static_assert() to check struct sizes
  evm: stop avoidably reading i_writecount in evm_file_release
  ima: fix buffer overrun in ima_eventdigest_init_common

Merge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock fixes from Mickaël Salaün:
"This fixes issues in the Landlock's sandboxer sample and
  documentation, slightly refactors helpers (required for ongoing patch
  series), and improve/fix a feature merged in v6.12 (signal and
  abstract UNIX socket scoping)"

* tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Optimize scope enforcement
  landlock: Refactor network access mask management
  landlock: Refactor filesystem access mask management
  samples/landlock: Clarify option parsing behaviour
  samples/landlock: Refactor help message
  samples/landlock: Fix port parsing in sandboxer
  landlock: Fix grammar issues in documentation
  landlock: Improve documentation of previous limitations

selftests: hugetlb_dio: fixup check for initial conditions to skip in the start

This test verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping. To test this, we read the
count of free hugepages before and after the mmap, DIO, and munmap
operations, then check if the free hugepage count is the same.

Reading free hugepages before the test was removed by commit 0268d4579901
('selftests: hugetlb_dio: check for initial conditions to skip at the
start'), causing the test to always fail.

This patch adds back reading the free hugepages before starting the test.
With this patch, the tests are now passing.

Test results without this patch:

./tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 1 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 2 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 3 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 4 : Huge pages not freed!
# Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0

Test results with this patch:

/tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 1 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 2 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 3 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 4 : Huge pages freed successfully !

# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions to skip in the start")
Signed-off-by: Donet Tom <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/thp: fix deferred split queue not partially_mapped: fix

Though even more elusive than before, list_del corruption has still been
seen on THP's deferred split queue.

The idea in commit e66f3185fa04 was right, but its implementation wrong.
The context omitted an important comment just before the critical test:
"split_folio() removes folio from list on success." In ignoring that
comment, when a THP split succeeded, the code went on to release the
preceding safe folio, preserving instead an irrelevant (formerly head)
folio: which gives no safety because it's not on the list. Fix the logic.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Usama Arif <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases

commit 53ba78de064b ("mm/gup: introduce
check_and_migrate_movable_folios()") created a new constraint on the
pin_user_pages*() API family: a potentially large internal allocation must
now occur, for FOLL_LONGTERM cases.

A user-visible consequence has now appeared: user space can no longer pin
more than 2GB of memory anymore on x86_64.  That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().

In addition to the directly visible effect described above, there is also
the problem of adding an unnecessary allocation.  The **pages array
argument has already been allocated, and there is no need for a redundant
**folios array allocation in this case.

Fix this by avoiding the new allocation entirely.  This is done by
referring to either the original page[i] within **pages, or to the
associated folio.  Thanks to David Hildenbrand for suggesting this
approach and for providing the initial implementation (which I've tested
and adjusted slightly) as well.

[[email protected]: whitespace tweak, per David]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: bypass pofs_get_folio(), per Oscar]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
Signed-off-by: John Hubbard <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Vivek Kasireddy <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

LoongArch: Fix AP booting issue in VM mode

Native IPI is used for AP booting, because it is the booting interface
between OS and BIOS firmware. The paravirt IPI is only used inside OS,
and native IPI is necessary to boot AP.

When booting AP, we write the kernel entry address in the HW mailbox of
AP and send IPI interrupt to it. AP executes idle instruction and waits
for interrupts or SW events, then clears IPI interrupt and jumps to the
kernel entry from HW mailbox.

Between writing HW mailbox and sending IPI, AP can be woken up by SW
events and jumps to the kernel entry, so ACTION_BOOT_CPU IPI interrupt
will keep pending during AP booting. And native IPI interrupt handler
needs be registered so that it can clear pending native IPI, else there
will be endless interrupts during AP booting stage.

Here native IPI interrupt is initialized even if paravirt IPI is used.

Cc: [email protected]
Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side")
Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Add WriteCombine shadow mapping in KASAN

Currently, the kernel couldn't boot when ARCH_IOREMAP, ARCH_WRITECOMBINE
and KASAN are enabled together. Because DMW2 is used by kernel now which
is configured as 0xa000000000000000 for WriteCombine, but KASAN has no
segment mapping for it. This patch fix this issue.

Solution: Add the relevant definitions for WriteCombine (DMW2) in KASAN.

Cc: [email protected]
Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()")
Signed-off-by: Kanglong Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits

If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will
overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are
aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks
like a user space address.

For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large
for Loongson-2K series whose cpu_vabits = 39.

Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <=
39, we just disable KASAN via early return in kasan_init(). Otherwise we
get a boot failure.

Moreover, we change KASAN_SHADOW_END from the first address after KASAN
shadow area to the last address in KASAN shadow area, in order to avoid
the end address exactly overflow to 0 (which is a legal case). We don't
need to worry about alignment because pgd_addr_end() can handle it.

Cc: [email protected]
Reviewed-by: Jiaxun Yang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Make KASAN work with 5-level page-tables

Make KASAN work with 5-level page-tables, including:
1. Implement and use __pgd_none() and kasan_p4d_offset().
2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict
the loop conditions of kasan_p4d_populate() and kasan_pud_populate()
to avoid unnecessary population.

Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS

This is a trivial cleanup, commit c62da0c35d58518d ("mm/vma: define a
default value for VM_DATA_DEFAULT_FLAGS") has unified default values of
VM_DATA_DEFAULT_FLAGS across different platforms.

Apply the same consistency to LoongArch.

Suggested-by: Wentao Guan <[email protected]>
Signed-off-by: Yuli Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Fix early_numa_add_cpu() usage for FDT systems

early_numa_add_cpu() applies on physical CPU id rather than logical CPU
id, so use cpuid instead of cpu.

Cc: [email protected]
Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0")
Reported-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: For all possible CPUs setup logical-physical CPU mapping

In order to support ACPI-based physical CPU hotplug, we suppose for all
"possible" CPUs cpu_logical_map() can work. Because some drivers want to
use cpu_logical_map() for all "possible" CPUs, while currently we only
setup logical-physical mapping for "present" CPUs. This lack of mapping
also causes cpu_to_node() cannot work for hot-added CPUs.

All "possible" CPUs are listed in MADT, and the "present" subset is
marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for
all possible CPUs and keep present CPUs continuous in cpu_present_mask,
we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED
and the second pass handles CPUs without ACPI_MADT_ENABLED.

The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls
cpu_number_map() rather than set_processor_mask() now.

Reported-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

nommu: pass NULL argument to vma_iter_prealloc()

When deleting a vma entry from a maple tree, it has to pass NULL to
vma_iter_prealloc() in order to calculate internal state of the tree, but
it passed a wrong argument. As a result, nommu kernels crashed upon
accessing a vma iterator, such as acct_collect() reading the size of vma
entries after do_munmap().

This commit fixes this issue by passing a right argument to the
preallocation call.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b5df09226450 ("mm: set up vma iterator for vma_iter_prealloc() calls")
Signed-off-by: Hajime Tazaki <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

ocfs2: fix UBSAN warning in ocfs2_verify_volume()

Syzbot has reported the following splat triggered by UBSAN:

UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10
shift exponent 32768 is too large for 32-bit type 'int'
CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x241/0x360
? __pfx_dump_stack_lvl+0x10/0x10
? __pfx__printk+0x10/0x10
? __asan_memset+0x23/0x50
? lockdep_init_map_type+0xa1/0x910
__ubsan_handle_shift_out_of_bounds+0x3c8/0x420
ocfs2_fill_super+0xf9c/0x5750
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? validate_chain+0x11e/0x5920
? __lock_acquire+0x1384/0x2050
? __pfx_validate_chain+0x10/0x10
? string+0x26a/0x2b0
? widen_string+0x3a/0x310
? string+0x26a/0x2b0
? bdev_name+0x2b1/0x3c0
? pointer+0x703/0x1210
? __pfx_pointer+0x10/0x10
? __pfx_format_decode+0x10/0x10
? __lock_acquire+0x1384/0x2050
? vsnprintf+0x1ccd/0x1da0
? snprintf+0xda/0x120
? __pfx_lock_release+0x10/0x10
? do_raw_spin_lock+0x14f/0x370
? __pfx_snprintf+0x10/0x10
? set_blocksize+0x1f9/0x360
? sb_set_blocksize+0x98/0xf0
? setup_bdev_super+0x4e6/0x5d0
mount_bdev+0x20c/0x2d0
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_mount_bdev+0x10/0x10
? vfs_parse_fs_string+0x190/0x230
? __pfx_vfs_parse_fs_string+0x10/0x10
legacy_get_tree+0xf0/0x190
? __pfx_ocfs2_mount+0x10/0x10
vfs_get_tree+0x92/0x2b0
do_new_mount+0x2be/0xb40
? __pfx_do_new_mount+0x10/0x10
__se_sys_mount+0x2d6/0x3c0
? __pfx___se_sys_mount+0x10/0x10
? do_syscall_64+0x100/0x230
? __x64_sys_mount+0x20/0xc0
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f37cae96fda
Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda
RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240
RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000
R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0
R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000
</TASK>

For a really damaged superblock, the value of 'i_super.s_blocksize_bits'
may exceed the maximum possible shift for an underlying 'int'. So add an
extra check whether the aforementioned field represents the valid block
size, which is 512 bytes, 1K, 2K, or 4K.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Dmitry Antipov <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint

When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.

This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.

In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device, but instead does so only if the buffer is
not in the "uptodate" state for each of its caller block reading
functions. However, if the uptodate flag is set on a folio/page, and the
buffer heads are detached from it by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag may
be restored to each buffer without the block device being set to
bh->b_bdev, and mark_buffer_dirty() may be called later in that state,
resulting in the bug mentioned above.

Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Ubisectech Sirius <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint

Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".

This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.

This patch (of 2):

It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.

This happens because since the tracepoint was added in touch_buffer(), it
references the dev_t member bh->b_bdev->bd_dev regardless of whether the
buffer head has a pointer to a block_device structure. In the current
implementation, the block_device structure is set after the function
returns to the caller.

Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.

So this solves the issue by eliminating the touch_buffer() call itself.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <[email protected]>
Reported-by: Ubisectech Sirius <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: page_alloc: move mlocked flag clearance into free_pages_prepare()

Syzbot reported a bad page state problem caused by a page being freed
using free_page() still having a mlocked flag at free_pages_prepare()
stage:

  BUG: Bad page state in process syz.5.504  pfn:61f45
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45
  flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff)
  raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  page_owner tracks the page as allocated
  page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
   kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
   kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline]
   kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530
   __do_compat_sys_ioctl fs/ioctl.c:1007 [inline]
   __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  page last free pid 8399 tgid 8399 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686
   folios_put_refs+0x76c/0x860 mm/swap.c:1007
   free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335
   __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
   tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
   tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
   tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373
   tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465
   exit_mmap+0x496/0xc40 mm/mmap.c:1926
   __mmput+0x115/0x390 kernel/fork.c:1348
   exit_mm+0x220/0x310 kernel/exit.c:571
   do_exit+0x9b2/0x28e0 kernel/exit.c:926
   do_group_exit+0x207/0x2c0 kernel/exit.c:1088
   __do_sys_exit_group kernel/exit.c:1099 [inline]
   __se_sys_exit_group kernel/exit.c:1097 [inline]
   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
   x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  Modules linked in:
  CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   bad_page+0x176/0x1d0 mm/page_alloc.c:501
   free_page_is_bad mm/page_alloc.c:918 [inline]
   free_pages_prepare mm/page_alloc.c:1100 [inline]
   free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638
   kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline]
   kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386
   kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143
   __fput+0x23f/0x880 fs/file_table.c:431
   task_work_run+0x24f/0x310 kernel/task_work.c:239
   exit_task_work include/linux/task_work.h:43 [inline]
   do_exit+0xa2f/0x28e0 kernel/exit.c:939
   do_group_exit+0x207/0x2c0 kernel/exit.c:1088
   __do_sys_exit_group kernel/exit.c:1099 [inline]
   __se_sys_exit_group kernel/exit.c:1097 [inline]
   __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
   ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  RIP: 0023:0xf745d579
  Code: Unable to access opcode bytes at 0xf745d54f.
  RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4
  RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

The problem was originally introduced by commit b109b87050df ("mm/munlock:
replace clear_page_mlock() by final clearance"): it was focused on
handling pagecache and anonymous memory and wasn't suitable for lower
level get_page()/free_page() API's used for example by KVM, as with this
reproducer.

Fix it by moving the mlocked flag clearance down to free_page_prepare().

The bug itself if fairly old and harmless (aside from generating these
warnings), aside from a small memory leak - "bad" pages are stopped from
being allocated again.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Acked-by: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

- The fair sched class currently has a bug where its balance() returns
   true telling the sched core that it has tasks to run but then NULL
   from pick_task(). This makes sched core call sched_ext's pick_task()
   without preceding balance() which can lead to stalls in partial mode.

   For now, work around by detecting the condition and forcing the CPU
   to go through another scheduling cycle.

- Add a missing newline to an error message and fix drgn introspection
   tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
"Several small bugfixes all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Fix error path during device add
  vp_vdpa: fix id_table array not null terminated error
  virtio_pci: Fix admin vq cleanup by using correct info pointer
  vDPA/ifcvf: Fix pci_read_config_byte() return code handling
  Fix typo in vringh_test.c
  vdpa: solidrun: Fix UB bug with devres
  vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans

dm-cache: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache
code.

The dm-cache code allocates a slab cache for each device. This commit
changes it to allocate just one slab cache in the module init function.

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")

dm-bufio: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.

So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")

mm: count zeromap read and set for swapout and swapin

When the proportion of folios from the zeromap is small, missing their
accounting may not significantly impact profiling.  However, it's easy to
construct a scenario where this becomes an issue—for example, allocating
1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT,
and then swapping it back in.  In this case, the swap-out and swap-in
counts seem to vanish into a black hole, potentially causing semantic
ambiguity.

On the other hand, Usama reported that zero-filled pages can exceed 10% in
workloads utilizing zswap, while Hailong noted that some app in Android
have more than 6% zero-filled pages.  Before commit 0ca0c24e3211 ("mm:
store zero pages to be swapped out in a bitmap"), both zswap and zRAM
implemented similar optimizations, leading to these optimized-out pages
being counted in either zswap or zRAM counters (with pswpin/pswpout also
increasing for zRAM).  With zeromap functioning prior to both zswap and
zRAM, userspace will no longer detect these swap-out and swap-in actions.

We have three ways to address this:

1. Introduce a dedicated counter specifically for the zeromap.

2. Use pswpin/pswpout accounting, treating the zero map as a standard
   backend.  This approach aligns with zRAM's current handling of
   same-page fills at the device level.  However, it would mean losing the
   optimized-out page counters previously available in zRAM and would not
   align with systems using zswap.  Additionally, as noted by Nhat Pham,
   pswpin/pswpout counters apply only to I/O done directly to the backend
   device.

3. Count zeromap pages under zswap, aligning with system behavior when
   zswap is enabled.  However, this would not be consistent with zRAM, nor
   would it align with systems lacking both zswap and zRAM.

Given the complications with options 2 and 3, this patch selects
option 1.

We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).

For example:

$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536

$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985

This patch does not address any specific zeromap bug, but the missing
swpout and swpin counts for zero-filled pages can be highly confusing and
may mislead user-space agents that rely on changes in these counters as
indicators.  Therefore, we add a Fixes tag to encourage the inclusion of
this counter in any kernel versions with zeromap.

Many thanks to Kanchana for the contribution of changing
count_objcg_event() to count_objcg_events() to support large folios[1],
which has now been incorporated into this patch.

[1] https://lkml.kernel.org/r/20241001053222 [email protected]

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Co-developed-by: Kanchana P Sridhar <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Hailong Liu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Linux 6.12-rc7

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"A handful of Qualcomm clk driver fixes:

   - Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks

   - Fix alpha PLL post_div mask for the cases where width is not
     specified

   - Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL
     trigger feature on the video clocks"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags
  clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks
  clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set
  clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs

Merge tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"i2c-host fixes for v6.12-rc7 (from Andi):

   - Fix designware incorrect behavior when concluding a transmission

   - Fix Mule multiplexer error value evaluation"

* tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set
  i2c: muxes: Fix return value check in mule_i2c_mux_probe()

filemap: Fix bounds checking in filemap_read()

If the caller supplies an iocb->ki_pos value that is close to the
filesystem upper limit, and an iterator with a count that causes us to
overflow that limit, then filemap_read() enters an infinite loop.

This behaviour was discovered when testing xfstests generic/525 with the
"localio" optimisation for loopback NFS mounts.

Reported-by: Mike Snitzer <[email protected]>
Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Tested-by: Mike Snitzer <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

- Make sure GICv3 controller interrupt activation doesn't race with a
   concurrent deactivation due to propagation delays of the register
   write

* tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3: Force propagation of the active state with a read-back

Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped

Merge tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB/Thunderbolt fixes from Greg KH:
"Here are some small remaining USB and Thunderbolt fixes and device ids
  for 6.12-rc7. Included in here are:

   - new USB serial driver device ids

   - thunderbolt driver fixes for reported problems

   - typec bugfixes

   - dwc3 driver fix

   - musb driver fix

  All of these have been in linux-next this past week with no reported
  issues"

* tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: serial: qcserial: add support for Sierra Wireless EM86xx
  thunderbolt: Fix connection issue with Pluggable UD-4VPD dock
  usb: typec: fix potential out of bounds in ucsi_ccg_update_set_new_cam_cmd()
  usb: dwc3: fix fault at system suspend if device was already runtime suspended
  usb: typec: qcom-pmic: init value of hdr_len/txbuf_len earlier
  usb: musb: sunxi: Fix accessing an released usb phy
  USB: serial: io_edgeport: fix use after free in debug printk
  USB: serial: option: add Quectel RG650V
  USB: serial: option: add Fibocom FG132 0x0112 composition
  thunderbolt: Add only on-board retimers when !CONFIG_USB4_DEBUGFS_MARGINING

Merge tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are two small memory leak fixes for the vchiq_arm staging driver
  that have been sitting in my tree for weeks and should get merged for
  6.12-rc7 so that people don't keep tripping over them.

  They both have been in linux-next for a while with no reported
  problems"

* tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: vchiq_arm: Use devm_kzalloc() for drv_mgmt allocation
  staging: vchiq_arm: Use devm_kzalloc() for vchiq_arm_state allocation

Merge tag 'i2c-host-fixes-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host fixes for v6.12-rc7

In designware an incorrect behavior has been fixes when
concluding a transmission.

Fixed return error value evaluation in the Mule multiplexer.

Merge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

- Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD

* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix READDIR on NFSv3 mounts of ext4 exports

Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fix from Steve French:
"Fix net namespace refcount use after free issue"

* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Fix use-after-free of network namespace.

Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
"Single fix for an issue triggered with PROVE_RCU=y, with nvme using
the wrong iterators for an SRCU protected list"

* tag 'block-6.12-20241108' of git://git.kernel.dk/linux:
nvme/host: Fix RCU list traversal to use SRCU primitive

sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()

sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().

Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]

landlock: Optimize scope enforcement

Do not walk through the domain hierarchy when the required scope is not
supported by this domain. This is the same approach as for filesystem
and network restrictions.

Cc: Mikhail Ivanov <[email protected]>
Cc: Tahera Fahimi <[email protected]>
Reviewed-by: Günther Noack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mickaël Salaün <[email protected]>

landlock: Refactor network access mask management

Replace get_raw_handled_net_accesses() and get_current_net_domain() with
a call to landlock_get_applicable_domain().

Cc: Konstantin Meskhidze <[email protected]>
Cc: Mikhail Ivanov <[email protected]>
Reviewed-by: Günther Noack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mickaël Salaün <[email protected]>

landlock: Refactor filesystem access mask management

Replace get_raw_handled_fs_accesses() with a generic
landlock_union_access_masks(), and replace get_fs_domain() with a
generic landlock_get_applicable_domain(). These helpers will also be
useful for other types of access.

Cc: Mikhail Ivanov <[email protected]>
Reviewed-by: Günther Noack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[mic: Slightly improve doc as suggested by Günther]
Signed-off-by: Mickaël Salaün <[email protected]>

Merge tag 'thermal-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control fixes from Rafael Wysocki:
"These fix one issue in the qcom lmh thermal driver, a DT handling
  issue in the thermal core and two issues in the userspace thermal
  library:

   - Allow tripless thermal zones defined in a DT to be registered in
     accordance with the thermal DT bindings (Icenowy Zheng)

   - Annotate LMH IRQs with lockdep classes to prevent lockdep from
     reporting a possible recursive locking issue that cannot really
     occur (Dmitry Baryshkov)

   - Improve the thermal library "make clean" to remove a leftover
     symbolic link created during compilation and fix the sampling
     handler invocation in that library to pass the correct pointer to
     it (Emil Dahl Juhl, zhang jiao)"

* tag 'thermal-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal/of: support thermal zones w/o trips subnode
  tools/lib/thermal: Remove the thermal.h soft link when doing make clean
  tools/lib/thermal: Fix sampling handler context ptr
  thermal/drivers/qcom/lmh: Remove false lockdep backtrace

Merge tag 'pm-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
"Fix the asymmetric CPU capacity support code in the intel_pstate
  driver, added during this develompent cycle, to address a corner case
  in which the capacity of a CPU going online is not updated (Rafael
  Wysocki)"

* tag 'pm-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Update asym capacity for CPUs that were offline initially
  cpufreq: intel_pstate: Clear hybrid_max_perf_cpu before driver registration

Merge tag 'acpi-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Fix the ACPI processor driver initialization ordering after recent
  changes to avoid calling init_freq_invariance_cppc() too early on AMD
  platforms (Mario Limonciello)"

* tag 'acpi-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Move arch_init_invariance_cppc() call later

Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
"Four fixes, all also marked for stable:

   - fix two potential use after free issues

   - fix OOM issue with many simultaneous requests

   - fix missing error check in RPC pipe handling"

* tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: check outstanding simultaneous SMB operations
  ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
  ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
  ksmbd: Fix the missing xa_store error check

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Two small fixes, the drivers one in ufs simply delays running a work
  queue and the generic one in zoned storage switches to a more correct
  API that tries the standard buddy allocator first (for small
  allocations); this fixes an allocation problem with small allocations
  seen under memory pressure"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Start the RTC update work later
  scsi: sd_zbc: Use kvzalloc() to allocate REPORT ZONES buffer

Merge tag 'drm-fixes-2024-11-09' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly fixes, usual leaders in amdgpu and xe, then a panel quirk, and
  some fixes to imagination and panthor drivers. Seems around the usual
  level for this time and don't know of any big problems.

  amdgpu:
   - Brightness fix
   - DC vbios parsing fix
   - ACPI fix
   - SMU 14.x fix
   - Power workload profile fix
   - GC partitioning fix
   - Debugfs fixes

  imagination:
   - Track PVR context per file
   - Break ref-counting cycle

  panel-orientation-quirks:
   - Fix matching Lenovo Yoga Tab 3 X90F

  panthor:
   - Lock VM array
   - Be strict about I/O mapping flags

  xe:
   - Fix ccs_mode setting for Xe2 and later
   - Synchronize ccs_mode setting with client creation
   - Apply scheduling WA for LNL in additional places as needed
   - Fix leak and lock handling in error paths of xe_exec ioctl
   - Fix GGTT allocation leak leading to eventual crash in SR-IOV
   - Move run_ticks update out of job handling to avoid synchronization
     with reader"

* tag 'drm-fixes-2024-11-09' of https://gitlab.freedesktop.org/drm/kernel: (23 commits)
  drm/panthor: Be stricter about IO mapping flags
  drm/panthor: Lock XArray when getting entries for the VM
  drm: panel-orientation-quirks: Make Lenovo Yoga Tab 3 X90F DMI match less strict
  drm/xe: Stop accumulating LRC timestamp on job_free
  drm/xe/pf: Fix potential GGTT allocation leak
  drm/xe: Drop VM dma-resv lock on xe_sync_in_fence_get failure in exec IOCTL
  drm/xe: Fix possible exec queue leak in exec IOCTL
  drm/amdgpu: add missing size check in amdgpu_debugfs_gprwave_read()
  drm/amdgpu: Adjust debugfs eviction and IB access permissions
  drm/amdgpu: Adjust debugfs register access permissions
  drm/amdgpu: Fix DPX valid mode check on GC 9.4.3
  drm/amd/pm: correct the workload setting
  drm/amd/pm: always pick the pptable from IFWI
  drm/amdgpu: prevent NULL pointer dereference if ATIF is not supported
  drm/amd/display: parse umc_info or vram_info based on ASIC
  drm/amd/display: Fix brightness level not retained over reboot
  drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
  drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout
  drm/xe: Move LNL scheduling WA to xe_device.h
  drm/xe: Use the filelist from drm for ccs_mode change
  ...

Merge tag 'drm-xe-fixes-2024-11-08' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Fix ccs_mode setting for Xe2 and later (Balasubramani)
- Synchronize ccs_mode setting with client creation (Balasubramani)
- Apply scheduling WA for LNL in additional places as needed
  (Nirmoy)
- Fix leak and lock handling in error paths of xe_exec ioctl
  (Matthew Brost)
- Fix GGTT allocation leak leading to eventual crash in SR-IOV
  (Michal Wajdeczko)
- Move run_ticks update out of job handling to avoid synchronization
  with reader (Lucas)

Signed-off-by: Dave Airlie <[email protected]>
From: Lucas De Marchi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/4ffcebtluaaaohquxfyf5babpihmtscxwad3jjmt5nggwh2xpm@ztw67ucywttg

Merge tag 'drm-misc-fixes-2024-11-08' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Short summary of fixes pull:

imagination:
- Track PVR context per file
- Break ref-counting cycle

panel-orientation-quirks:
- Fix matching Lenovo Yoga Tab 3 X90F

panthor:
- Lock VM array
- Be strict about I/O mapping flags

Signed-off-by: Dave Airlie <[email protected]>
From: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set

When the Tx FIFO is empty and the last command has no STOP bit
set, the master holds SCL low. If I2C_DYNAMIC_TAR_UPDATE is not
set, BIT(13) MST_ON_HOLD of IC_RAW_INTR_STAT is not enabled,
causing the __i2c_dw_disable() timeout. This is quite similar to
commit 2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in
case master is holding SCL low"). Also check BIT(7)
MST_HOLD_TX_FIFO_EMPTY in IC_STATUS, which is available when
IC_STAT_FOR_CLK_STRETCH is set.

Fixes: 2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low")
Co-developed-by: Xiaowu Ding <[email protected]>
Signed-off-by: Xiaowu Ding <[email protected]>
Co-developed-by: Angus Chen <[email protected]>
Signed-off-by: Angus Chen <[email protected]>
Signed-off-by: Liu Peibao <[email protected]>
Acked-by: Jarkko Nikula <[email protected]>
Signed-off-by: Andi Shyti <[email protected]>

Merge tag 'sound-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Still more changes floating than wished at this late stage, but all
  are small device-specific fixes, and look less troublesome.

  Including a few ASoC quirk / ID additoins, a series of ASoC STM fixes,
  HD-audio conexant codec regression fix, and other various quirks and
  device-specific fixes"

* tag 'sound-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: SOF: sof-client-probes-ipc4: Set param_size extension bits
  ASoC: stm: Prevent potential division by zero in stm32_sai_get_clk_div()
  ASoC: stm: Prevent potential division by zero in stm32_sai_mclk_round_rate()
  ASoC: amd: yc: Support dmic on another model of Lenovo Thinkpad E14 Gen 6
  ASoC: SOF: amd: Fix for incorrect DMA ch status register offset
  ASoC: amd: yc: fix internal mic on Xiaomi Book Pro 14 2022
  ASoC: stm32: spdifrx: fix dma channel release in stm32_spdifrx_remove
  MAINTAINERS: Generic Sound Card section
  ALSA: usb-audio: Add quirk for HP 320 FHD Webcam
  ASoC: tas2781: Add new driver version for tas2563 & tas2781 qfn chip
  ALSA: firewire-lib: fix return value on fail in amdtp_tscm_init()
  ALSA: ump: Don't enumeration invalid groups for legacy rawmidi
  Revert "ALSA: hda/conexant: Mute speakers at suspend / shutdown"

Merge tag 'media/v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

- dvb-core fixes for vb2 check and device registration

- v4l2-core: fix an issue with error handling for VIDIOC_G_CTRL

- vb2 core: fix an issue with vb plane copy logic

- videobuf2-core: copy vb planes unconditionally

- vivid: fix buffer overwrite when using > 32 buffers

- vivid: fix a potential division by zero due to an issue at v4l2-tpg

- some spectre vulnerability fixes

- several OOM access fixes

- some buffer overflow fixes

* tag 'media/v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: videobuf2-core: copy vb planes unconditionally
  media: dvbdev: fix the logic when DVB_DYNAMIC_MINORS is not set
  media: vivid: fix buffer overwrite when using > 32 buffers
  media: pulse8-cec: fix data timestamp at pulse8_setup()
  media: cec: extron-da-hd-4k-plus: don't use -1 as an error code
  media: stb0899_algo: initialize cfr before using it
  media: adv7604: prevent underflow condition when reporting colorspace
  media: cx24116: prevent overflows on SNR calculus
  media: ar0521: don't overflow when checking PLL values
  media: s5p-jpeg: prevent buffer overflows
  media: av7110: fix a spectre vulnerability
  media: mgb4: protect driver against spectre
  media: dvb_frontend: don't play tricks with underflow values
  media: dvbdev: prevent the risk of out of memory access
  media: v4l2-tpg: prevent the risk of a division by zero
  media: v4l2-ctrls-api: fix error handling for v4l2_g_ctrl()
  media: dvb-core: add missing buffer index check

Merge tag 'slab-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:

- Fix for duplicate caches in some arm64 configurations with
CONFIG_SLAB_BUCKETS (Koichiro Den)

* tag 'slab-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: fix warning caused by duplicate kmem_cache creation in kmem_buckets_create

Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few more one-liners that fix some user visible problems:

   - use correct range when clearing qgroup reservations after COW

   - properly reset freed delayed ref list head

   - fix ro/rw subvolume mounts to be backward compatible with old and
     new mount API"

* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix the length of reserved qgroup to free
  btrfs: reinitialize delayed ref list after deleting it from the list
  btrfs: fix per-subvolume RO/RW flags with new mount API

Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"Some trivial syzbot fixes, two more serious btree fixes found by
  looping single_devices.ktest small_nodes:

   - Topology error on split after merge, where we accidentaly picked
     the node being deleted for the pivot, resulting in an assertion pop

   - New nodes being preallocated were left on the freedlist, unlocked,
     resulting in them sometimes being accidentally freed: this dated
     from pre-cycle detector, when we could leave them locked. This
     should have resulted in more explosions and fireworks, but turned
     out to be surprisingly hard to hit because the preallocated nodes
     were being used right away.

     The fix for this is bigger than we'd like - reworking btree list
     handling was a bit invasive - but we've now got more assertions and
     it's well tested.

   - Also another mishandled transaction restart fix (in
     btree_node_prefetch) - we're almost done with those"

* tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix UAF in __promote_alloc() error path
  bcachefs: Change OPT_STR max to be 1 less than the size of choices array
  bcachefs: btree_cache.freeable list fixes
  bcachefs: check the invalid parameter for perf test
  bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
  bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
  bcachefs: Fix topology errors on split after merge
  bcachefs: Ancient versions with bad bkey_formats are no longer supported
  bcachefs: Fix error handling in bch2_btree_node_prefetch()
  bcachefs: Fix null ptr deref in bucket_gen_get()

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
"Here is a (hopefully) final round of arm64 fixes for 6.12 that address
  some user-visible floating point register corruption. Both of the
  Marks have been working on this for a couple of weeks and we've ended
  up in a position where SVE is solid but SME still has enough pending
  issues that the most pragmatic solution for the release and stable
  backports is to disable the feature. Yes, it's a shame, but the
  hardware is rare as hen's teeth at the moment and we're better off
  getting back to a known good state before fixing it all properly.
  We're also improving the selftests for 6.13 to help avoid merging
  broken code in the future.

  Anyway, the good news is that we're removing a lot more code than
  we're adding.

  Summary:

   - Fix handling of SVE traps from userspace on preemptible kernels
     when converting the saved floating point state into SVE state.

   - Remove broken support for the SMCCCv1.3 "SVE discard hint"
     optimisation.

   - Disable SME support, as the current support code suffers from
     numerous issues around signal delivery, ptrace access and
     context-switch which can lead to user-visible corruption of the
     register state"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Kconfig: Make SME depend on BROKEN for now
  arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hint
  arm64/sve: Discard stale CPU state when handling SVE traps

Merge tag 'powerpc-6.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Madhavan Srinivasan:

- Fix spurious interrupts in Book3S HV Nested KVM

Thanks to Gautam Menghani.

* tag 'powerpc-6.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running it to avoid spurious interrupts

KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN

Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support
for virtualizing Intel PT via guest/host mode unless BROKEN=y.  There are
myriad bugs in the implementation, some of which are fatal to the guest,
and others which put the stability and health of the host at risk.

For guest fatalities, the most glaring issue is that KVM fails to ensure
tracing is disabled, and *stays* disabled prior to VM-Enter, which is
necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing
is enabled (enforced via a VMX consistency check).  Per the SDM:

  If the logical processor is operating with Intel PT enabled (if
  IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
  IA32_RTIT_CTL" VM-entry control must be 0.

On the host side, KVM doesn't validate the guest CPUID configuration
provided by userspace, and even worse, uses the guest configuration to
decide what MSRs to save/load at VM-Enter and VM-Exit.  E.g. configuring
guest CPUID to enumerate more address ranges than are supported in hardware
will result in KVM trying to passthrough, save, and load non-existent MSRs,
which generates a variety of WARNs, ToPA ERRORs in the host, a potential
deadlock, etc.

Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode")
Cc: [email protected]
Cc: Adrian Hunter <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Tested-by: Adrian Hunter <[email protected]>
Message-ID: <20241101185031.1799556 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Unconditionally set irr_pending when updating APICv state

Always set irr_pending (to true) when updating APICv status to fix a bug
where KVM fails to set irr_pending when userspace sets APIC state and
APICv is disabled, which ultimate results in KVM failing to inject the
pending interrupt(s) that userspace stuffed into the vIRR, until another
interrupt happens to be emulated by KVM.

Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
be true if APICv is enabled, because not all vIRR updates will be visible
to KVM.

Hit the bug with a big hammer, even though strictly speaking KVM can scan
the vIRR and set/clear irr_pending as appropriate for this specific case.
The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't
touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as
the shortlog suggests, deleted code that updated irr_pending.

Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with
with the crucial difference that kvm_apic_update_apicv() did the scan even
when APICv was being *disabled*, e.g. due to an AVIC inhibition.

        struct kvm_lapic *apic = vcpu->arch.apic;

        if (vcpu->arch.apicv_active) {
                /* irr_pending is always true when apicv is activated. */
                apic->irr_pending = true;
                apic->isr_count = 1;
        } else {
                apic->irr_pending = (apic_search_irr(apic) != -1);
                apic->isr_count = count_vectors(apic->regs + APIC_ISR);
        }

And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78
("kvm: lapic: Introduce APICv update helper function"), prior to which KVM
unconditionally set irr_pending to true in kvm_apic_set_state(), i.e.
assumed that the new virtual APIC state could have a pending IRQ.

Furthermore, in addition to introducing this issue, commit 755c2bf87860
also papered over the underlying bug: KVM doesn't ensure CPUs and devices
see APICv as disabled prior to searching the IRR.  Waiting until KVM
emulates an EOI to update irr_pending "works", but only because KVM won't
emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of
memory barriers in between.  I.e. leaving irr_pending set is basically
hacking around bad ordering.

So, effectively revert to the pre-b26a695a1d78 behavior for state restore,
even though it's sub-optimal if no IRQs are pending, in order to provide a
minimal fix, but leave behind a FIXME to document the ugliness.  With luck,
the ordering issue will be fixed and the mess will be cleaned up in the
not-too-distant future.

Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
Cc: [email protected]
Cc: Maxim Levitsky <[email protected]>
Reported-by: Yong He <[email protected]>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <20241106015135.2462147 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

kvm: svm: Fix gctx page leak on invalid inputs

Ensure that snp gctx page allocation is adequately deallocated on
failure during snp_launch_start.

Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command")
CC: Sean Christopherson <[email protected]>
CC: Paolo Bonzini <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Ashish Kalra <[email protected]>
CC: Tom Lendacky <[email protected]>
CC: John Allen <[email protected]>
CC: Herbert Xu <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Michael Roth <[email protected]>
CC: Luis Chamberlain <[email protected]>
CC: Russ Weight <[email protected]>
CC: Danilo Krummrich <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: "Rafael J. Wysocki" <[email protected]>
CC: Tianfei zhang <[email protected]>
CC: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Dionna Glaze <[email protected]>
Message-ID: <20241105010558.1266699 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB

In 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the
kernel sources"), VMX_BASIC_MEM_TYPE_WB was removed. Use X86_MEMTYPE_WB
instead.

Fixes: 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the
kernel sources")
Signed-off-by: John Sperbeck <[email protected]>
Message-ID: <20241106034031 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvm-x86-fixes-6.12-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 and selftests fixes for 6.12:

- Increase the timeout for the memslot performance selftest to avoid false
   failures on arm64 and nested x86 platforms.

- Fix a goof in the guest_memfd selftest where a for-loop initialized a
   bit mask to zero instead of BIT(0).

- Disable strict aliasing when building KVM selftests to prevent the
   compiler from treating things like "u64 *" to "uint64_t *" cases as
   undefined behavior, which can lead to nasty, hard to debug failures.

- Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch
   is supported by the compiler.

- When emulating a guest TLB flush for a nested guest, flush vpid01, not
   vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and
   L1 are sharing VPID '0' (from L1's perspective).

- Fix a bug in the SNP initialization flow where KVM would return '0' to
   userspace instead of -errno on failure.

Merge tag 'asoc-fix-v6.12-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.12

A moderately large pile of small changes here, split fairly evenly
between fixes and ID additions/quirks and all of it driver specific.

Merge tag 'usb-serial-6.12-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus

Johan writes:

USB-serial fixes for 6.12-rc7

Here's a fix for a long-standing use-after-free in an io_edgeport debug
printk and some new modem device ids.

All have been in linux-next with no reported issues.

* tag 'usb-serial-6.12-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
  USB: serial: qcserial: add support for Sierra Wireless EM86xx
  USB: serial: io_edgeport: fix use after free in debug printk
  USB: serial: option: add Quectel RG650V
  USB: serial: option: add Fibocom FG132 0x0112 composition

Merge tag 'amd-drm-fixes-6.12-2024-11-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.12-2024-11-07:

amdgpu:
- Brightness fix
- DC vbios parsing fix
- ACPI fix
- SMU 14.x fix
- Power workload profile fix
- GC partitioning fix
- Debugfs fixes

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'spi-fix-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fix from Mark Brown:
"An update for the maintainers of the AMD driver following some job
changes there"

* tag 'spi-fix-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
MAINTAINERS: update AMD SPI maintainer

Merge tag 'regulator-fix-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
"A couple of small fixes for drivers, nothing particularly remarkable"

* tag 'regulator-fix-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: rk808: Add apply_bit for BUCK3 on RK809
regulator: rtq2208: Fix uninitialized use of regulator_config

mailmap: add entry for Thorsten Blum

Map my previously used email address to my @linux.dev address.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thorsten Blum <[email protected]>
Cc: Alex Elder <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geliang Tang <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mathieu Othacehe <[email protected]>
Cc: Matthieu Baerts (NGI0) <[email protected]>
Cc: Matt Ranostay <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Quentin Monnet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()

Syzkaller is able to provoke null-ptr-dereference in ocfs2_xa_remove():

[   57.319872] (a.out,1161,7):ocfs2_xa_remove:2028 ERROR: status = -12
[   57.320420] (a.out,1161,7):ocfs2_xa_cleanup_value_truncate:1999 ERROR: Partial truncate while removing xattr overlay.upper.  Leaking 1 clusters and removing the entry
[   57.321727] BUG: kernel NULL pointer dereference, address: 0000000000000004
[...]
[   57.325727] RIP: 0010:ocfs2_xa_block_wipe_namevalue+0x2a/0xc0
[...]
[   57.331328] Call Trace:
[   57.331477]  <TASK>
[...]
[   57.333511]  ? do_user_addr_fault+0x3e5/0x740
[   57.333778]  ? exc_page_fault+0x70/0x170
[   57.334016]  ? asm_exc_page_fault+0x2b/0x30
[   57.334263]  ? __pfx_ocfs2_xa_block_wipe_namevalue+0x10/0x10
[   57.334596]  ? ocfs2_xa_block_wipe_namevalue+0x2a/0xc0
[   57.334913]  ocfs2_xa_remove_entry+0x23/0xc0
[   57.335164]  ocfs2_xa_set+0x704/0xcf0
[   57.335381]  ? _raw_spin_unlock+0x1a/0x40
[   57.335620]  ? ocfs2_inode_cache_unlock+0x16/0x20
[   57.335915]  ? trace_preempt_on+0x1e/0x70
[   57.336153]  ? start_this_handle+0x16c/0x500
[   57.336410]  ? preempt_count_sub+0x50/0x80
[   57.336656]  ? _raw_read_unlock+0x20/0x40
[   57.336906]  ? start_this_handle+0x16c/0x500
[   57.337162]  ocfs2_xattr_block_set+0xa6/0x1e0
[   57.337424]  __ocfs2_xattr_set_handle+0x1fd/0x5d0
[   57.337706]  ? ocfs2_start_trans+0x13d/0x290
[   57.337971]  ocfs2_xattr_set+0xb13/0xfb0
[   57.338207]  ? dput+0x46/0x1c0
[   57.338393]  ocfs2_xattr_trusted_set+0x28/0x30
[   57.338665]  ? ocfs2_xattr_trusted_set+0x28/0x30
[   57.338948]  __vfs_removexattr+0x92/0xc0
[   57.339182]  __vfs_removexattr_locked+0xd5/0x190
[   57.339456]  ? preempt_count_sub+0x50/0x80
[   57.339705]  vfs_removexattr+0x5f/0x100
[...]

Reproducer uses faultinject facility to fail ocfs2_xa_remove() ->
ocfs2_xa_value_truncate() with -ENOMEM.

In this case the comment mentions that we can return 0 if
ocfs2_xa_cleanup_value_truncate() is going to wipe the entry
anyway. But the following 'rc' check is wrong and execution flow do
'ocfs2_xa_remove_entry(loc);' twice:
* 1st: in ocfs2_xa_cleanup_value_truncate();
* 2nd: returning back to ocfs2_xa_remove() instead of going to 'out'.

Fix this by skipping the 2nd removal of the same entry and making
syzkaller repro happy.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 399ff3a748cf ("ocfs2: Handle errors while setting external xattr values.")
Signed-off-by: Andrew Kanner <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/T/
Tested-by: [email protected]
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

signal: restore the override_rlimit logic

Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of
signals.  However now it's enforced unconditionally, even if
override_rlimit is set.  This behavior change caused production issues.

For example, if the limit is reached and a process receives a SIGSEGV
signal, sigqueue_alloc fails to allocate the necessary resources for the
signal delivery, preventing the signal from being delivered with siginfo.
This prevents the process from correctly identifying the fault address and
handling the error.  From the user-space perspective, applications are
unaware that the limit has been reached and that the siginfo is
effectively 'corrupted'.  This can lead to unpredictable behavior and
crashes, as we observed with java applications.

Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip
the comparison to max there if override_rlimit is set.  This effectively
restores the old behavior.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: Roman Gushchin <[email protected]>
Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: Alexey Gladkov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

fs/proc: fix compile warning about variable 'vmcore_mmap_ops'

When build with !CONFIG_MMU, the variable 'vmcore_mmap_ops'
is defined but not used:

>> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops'
458 | static const struct vm_operations_struct vmcore_mmap_ops = {

Fix this by only defining it when CONFIG_MMU is enabled.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9cb218131de1 ("vmcore: introduce remap_oldmem_pfn_range()")
Signed-off-by: Qi Xi <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Michael Holzheu <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Wang ShaoBo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

ucounts: fix counter leak in inc_rlimit_get_ucounts()

The inc_rlimit_get_ucounts() increments the specified rlimit counter and
then checks its limit. If the value exceeds the limit, the function
returns an error without decrementing the counter.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15bc01effefe ("ucounts: Fix signal ucount refcounting")
Signed-off-by: Andrei Vagin <[email protected]>
Co-developed-by: Roman Gushchin <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Tested-by: Roman Gushchin <[email protected]>
Acked-by: Alexey Gladkov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Alexey Gladkov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

selftests: hugetlb_dio: check for initial conditions to skip in the start

The test should be skipped if initial conditions aren't fulfilled in the
start instead of failing and outputting non-compliant TAP logs. This kind
of failure pollutes the results. The initial conditions are:

- The test should only execute if /tmp file can be allocated.
- The test should only execute if huge pages are free.

Before:
TAP version 13
1..4
Bail out! Error opening file
: Read-only file system (30)
# Planned tests != run tests (4 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0

After:
TAP version 13
1..0 # SKIP Unable to allocate file: Read-only file system

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Fixes: 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked during __bio_release_pages()")
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Donet Tom <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: fix docs for the kernel parameter ``thp_anon=``

If we add ``thp_anon=32,64K:always`` to the kernel command line, we
will see the following error:

[ 0.000000] huge_memory: thp_anon=32,64K:always: error parsing string, ignoring setting

This happens because the correct format isn't ``thp_anon=<size>,<size>[KMG]:<state>```,
as [KMG] must follow each number to especify its unit. So, the correct
format is ``thp_anon=<size>[KMG],<size>[KMG]:<state>```.

Therefore, adjust the documentation to reflect the correct format of the
parameter ``thp_anon=``.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: dd4d30d1cdbe ("mm: override mTHP "enabled" defaults at kernel cmdline")
Signed-off-by: Maíra Canal <[email protected]>
Acked-by: Barry Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/core: avoid overflow in damon_feed_loop_next_input()

damon_feed_loop_next_input() is inefficient and fragile to overflows.
Specifically, 'score_goal_diff_bp' calculation can overflow when 'score'
is high.  The calculation is actually unnecessary at all because 'goal' is
a constant of value 10,000.  Calculation of 'compensation' is again
fragile to overflow.  Final calculation of return value for under-achiving
case is again fragile to overflow when the current score is
under-achieving the target.

Add two corner cases handling at the beginning of the function to make the
body easier to read, and rewrite the body of the function to avoid
overflows and the unnecessary bp value calcuation.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning")
Signed-off-by: SeongJae Park <[email protected]>
Reported-by: Guenter Roeck <[email protected]>
Closes: https://lore.kernel.org/[email protected]
Tested-by: Guenter Roeck <[email protected]>
Cc: <[email protected]> [6.8.x]
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/core: handle zero schemes apply interval

DAMON's logics to determine if this is the time to apply damos schemes
assumes next_apply_sis is always set larger than current
passed_sample_intervals.  And therefore assume continuously incrementing
passed_sample_intervals will make it reaches to the next_apply_sis in
future.  The logic hence does apply the scheme and update next_apply_sis
only if passed_sample_intervals is same to next_apply_sis.

If Schemes apply interval is set as zero, however, next_apply_sis is set
same to current passed_sample_intervals, respectively.  And
passed_sample_intervals is incremented before doing the next_apply_sis
check.  Hence, next_apply_sis becomes larger than next_apply_sis, and the
logic says it is not the time to apply schemes and update next_apply_sis.
In other words, DAMON stops applying schemes until passed_sample_intervals
overflows.

Based on the documents and the common sense, a reasonable behavior for
such inputs would be applying the schemes for every sampling interval.
Handle the case by removing the assumption.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 42f994b71404 ("mm/damon/core: implement scheme-specific apply interval")
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]> [6.7.x]
Signed-off-by: Andrew Morton <[email protected]>

mm/damon/core: handle zero {aggregation,ops_update} intervals

Patch series "mm/damon/core: fix handling of zero non-sampling intervals".

DAMON's internal intervals accounting logic is not correctly handling
non-sampling intervals of zero values for a wrong assumption.  This could
cause unexpected monitoring behavior, and even result in infinite hang of
DAMON sysfs interface user threads in case of zero aggregation interval.
Fix those by updating the intervals accounting logic.  For details of the
root case and solutions, please refer to commit messages of fixes.

This patch (of 2):

DAMON's logics to determine if this is the time to do aggregation and ops
update assumes next_{aggregation,ops_update}_sis are always set larger
than current passed_sample_intervals.  And therefore it further assumes
continuously incrementing passed_sample_intervals every sampling interval
will make it reaches to the next_{aggregation,ops_update}_sis in future.
The logic therefore make the action and update
next_{aggregation,ops_updaste}_sis only if passed_sample_intervals is same
to the counts, respectively.

If Aggregation interval or Ops update interval are zero, however,
next_aggregation_sis or next_ops_update_sis are set same to current
passed_sample_intervals, respectively.  And passed_sample_intervals is
incremented before doing the next_{aggregation,ops_update}_sis check.
Hence, passed_sample_intervals becomes larger than
next_{aggregation,ops_update}_sis, and the logic says it is not the time
to do the action and update next_{aggregation,ops_update}_sis forever,
until an overflow happens.  In other words, DAMON stops doing aggregations
or ops updates effectively forever, and users cannot get monitoring
results.

Based on the documents and the common sense, a reasonable behavior for
such inputs is doing an aggregation and an ops update for every sampling
interval.  Handle the case by removing the assumption.

Note that this could incur particular real issue for DAMON sysfs interface
users, in case of zero Aggregation interval.  When user starts DAMON with
zero Aggregation interval and asks online DAMON parameter tuning via DAMON
sysfs interface, the request is handled by the aggregation callback.
Until the callback finishes the work, the user who requested the online
tuning just waits.  Hence, the user will be stuck until the
passed_sample_intervals overflows.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4472edf63d66 ("mm/damon/core: use number of passed access sampling as a timer")
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]> [6.7.x]
Signed-off-by: Andrew Morton <[email protected]>

mm/mlock: set the correct prev on failure

After commit 94d7d9233951 ("mm: abstract the vma_merge()/split_vma()
pattern for mprotect() et al."), if vma_modify_flags() return error, the
vma is set to an error code. This will lead to an invalid prev be
returned.

Generally this shouldn't matter as the caller should treat an error as
indicating state is now invalidated, however unfortunately
apply_mlockall_flags() does not check for errors and assumes that
mlock_fixup() correctly maintains prev even if an error were to occur.

This patch fixes that assumption.

[[email protected]: provide a better fix and rephrase the log]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.")
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

objpool: fix to make percpu slot allocation more robust

Since gfp & GFP_ATOMIC == GFP_ATOMIC is true for GFP_KERNEL | GFP_HIGH, it
will use kmalloc if user specifies that combination.  Here the reason why
combining the __vmalloc_node() and kmalloc_node() is that the vmalloc does
not support all GFP flag, especially GFP_ATOMIC.  So we should check if
gfp & (GFP_ATOMIC | GFP_KERNEL) != GFP_ATOMIC for vmalloc first.  This
ensures caller can sleep.  And for the robustness, even if vmalloc fails,
it should retry with kmalloc to allocate it.

Link: https://lkml.kernel.org/r/173008598713.1262174.2959179484209897252.stgit@mhiramat.roam.corp.google.com
Fixes: aff1871bfc81 ("objpool: fix choosing allocation for percpu slots")
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Reported-by: Linus Torvalds <[email protected]>
Closes: https://lore.kernel.org/all/CAHk-=whO+vSH+XVRio8byJU8idAWES0SPGVZ7KAVdc4qrV0VUA@mail.gmail.com/
Cc: Leo Yan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Wu <[email protected]>
Cc: Mikel Rychliski <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Cc: Viktor Malik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/page_alloc: keep track of free highatomic

OOM kills due to vastly overestimated free highatomic reserves were
observed:

  ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
  Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
  Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

The second line above shows that the OOM kill was due to the following
condition:

  free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)

And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'.
Therefore __zone_watermark_unusable_free() underestimated the usable free
memory by over 1GB, which resulted in the unnecessary OOM kill above.

The comments in __zone_watermark_unusable_free() warns about the potential
risk, i.e.,

  If the caller does not have rights to reserves below the min
  watermark then subtract the high-atomic reserves. This will
  over-estimate the size of the atomic reserve but it avoids a search.

However, it is possible to keep track of free pages in reserved highatomic
pageblocks with a new per-zone counter nr_free_highatomic protected by the
zone lock, to avoid a search when calculating the usable free memory.  And
the cost would be minimal, i.e., simple arithmetics in the highatomic
alloc/free/move paths.

Note that since nr_free_highatomic can be relatively small, using a
per-cpu counter might cause too much drift and defeat its purpose, in
addition to the extra memory overhead.

Dependson e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") - see [1]

[[email protected]: s/if/else if/, per Johannes, stealth whitespace tweak]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Link Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

vdpa/mlx5: Fix error path during device add

In the error recovery path of mlx5_vdpa_dev_add(), the cleanup is
executed and at the end put_device() is called which ends up calling
mlx5_vdpa_free(). This function will execute the same cleanup all over
again. Most resources support being cleaned up twice, but the recent
mlx5_vdpa_destroy_mr_resources() doesn't.

This change drops the explicit cleanup from within the
mlx5_vdpa_dev_add() and lets mlx5_vdpa_free() do its work.

This issue was discovered while trying to add 2 vdpa devices with the
same name:
$> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.2
$> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.3

... yields the following dump:

  BUG: kernel NULL pointer dereference, address: 00000000000000b8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP
  CPU: 4 UID: 0 PID: 2811 Comm: vdpa Not tainted 6.12.0-rc6 #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:destroy_workqueue+0xe/0x2a0
  Code: ...
  RSP: 0018:ffff88814920b9a8 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff888105c10000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff888100400168 RDI: 0000000000000000
  RBP: 0000000000000000 R08: ffff888100120c00 R09: ffffffff828578c0
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff888131fd99a0 R14: 0000000000000000 R15: ffff888105c10580
  FS:  00007fdfa6b4f740(0000) GS:ffff88852ca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b8 CR3: 000000018db09006 CR4: 0000000000372eb0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die+0x20/0x60
   ? page_fault_oops+0x150/0x3e0
   ? exc_page_fault+0x74/0x130
   ? asm_exc_page_fault+0x22/0x30
   ? destroy_workqueue+0xe/0x2a0
   mlx5_vdpa_destroy_mr_resources+0x2b/0x40 [mlx5_vdpa]
   mlx5_vdpa_free+0x45/0x150 [mlx5_vdpa]
   vdpa_release_dev+0x1e/0x50 [vdpa]
   device_release+0x31/0x90
   kobject_put+0x8d/0x230
   mlx5_vdpa_dev_add+0x328/0x8b0 [mlx5_vdpa]
   vdpa_nl_cmd_dev_add_set_doit+0x2b8/0x4c0 [vdpa]
   genl_family_rcv_msg_doit+0xd0/0x120
   genl_rcv_msg+0x180/0x2b0
   ? __vdpa_alloc_device+0x1b0/0x1b0 [vdpa]
   ? genl_family_rcv_msg_dumpit+0xf0/0xf0
   netlink_rcv_skb+0x54/0x100
   genl_rcv+0x24/0x40
   netlink_unicast+0x1fc/0x2d0
   netlink_sendmsg+0x1e4/0x410
   __sock_sendmsg+0x38/0x60
   ? sockfd_lookup_light+0x12/0x60
   __sys_sendto+0x105/0x160
   ? __count_memcg_events+0x53/0xe0
   ? handle_mm_fault+0x100/0x220
   ? do_user_addr_fault+0x40d/0x620
   __x64_sys_sendto+0x20/0x30
   do_syscall_64+0x4c/0x100
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7fdfa6c66b57
  Code: ...
  RSP: 002b:00007ffeace22998 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
  RAX: ffffffffffffffda RBX: 000055a498608350 RCX: 00007fdfa6c66b57
  RDX: 000000000000006c RSI: 000055a498608350 RDI: 0000000000000003
  RBP: 00007ffeace229c0 R08: 00007fdfa6d35200 R09: 000000000000000c
  R10: 0000000000000000 R11: 0000000000000202 R12: 000055a4986082a0
  R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeace233f3
   </TASK>
  Modules linked in: ...
  CR2: 00000000000000b8

Fixes: 62111654481d ("vdpa/mlx5: Postpone MR deletion")
Signed-off-by: Dragos Tatulea <[email protected]>
Message-Id: <20241105185101.1323272 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>

bcachefs: Fix UAF in __promote_alloc() error path

If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.

Reported-by: Reed Riley <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Change OPT_STR max to be 1 less than the size of choices array

Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.

The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b2545382 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <[email protected]>
Signed-off-by: Piotr Zalewski <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: btree_cache.freeable list fixes

When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.

Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.

Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.

But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.

We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: check the invalid parameter for perf test

The perf_test does not check the number of iterations and threads
when it is zero. If nr_thread is 0, the perf test will keep
waiting for wakekup. If iteration is 0, it will cause exception
of division by zero. This can be reproduced by:
echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test

Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket

bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.

Signed-off-by: Pei Xiao <[email protected]>
Fixes: ac10a9611d87 ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery

If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.

This must be set before multithreaded use commences.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix topology errors on split after merge

If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.

Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.

Found by single_devic.ktest small_nodes.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Ancient versions with bad bkey_formats are no longer supported

Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.

This breaks key comparisons in all sorts of painful ways.

Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.

Reported-by: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix error handling in bch2_btree_node_prefetch()

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix null ptr deref in bucket_gen_get()

bucket_gen() checks if we're lookup up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.

Also do a bit of cleanup on callers.

Signed-off-by: Kent Overstreet <[email protected]>

Merge tag 'net-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from can and netfilter.

  Things are slowing down quite a bit, mostly driver fixes here. No
  known ongoing investigations.

  Current release - new code bugs:

   - eth: ti: am65-cpsw:
      - fix multi queue Rx on J7
      - fix warning in am65_cpsw_nuss_remove_rx_chns()

  Previous releases - regressions:

   - mptcp: do not require admin perm to list endpoints, got missed in a
     refactoring

   - mptcp: use sock_kfree_s instead of kfree

  Previous releases - always broken:

   - sctp: properly validate chunk size in sctp_sf_ootb() fix OOB access

   - virtio_net: make RSS interact properly with queue number

   - can: mcp251xfd: mcp251xfd_get_tef_len(): fix length calculation

   - can: mcp251xfd: mcp251xfd_ring_alloc(): fix coalescing
     configuration when switching CAN modes

  Misc:

   - revert earlier hns3 fixes, they were ignoring IOMMU abstractions
     and need to be reworked

   - can: {cc770,sja1000}_isa: allow building on x86_64"

* tag 'net-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
  drivers: net: ionic: add missed debugfs cleanup to ionic_probe() error path
  net/smc: do not leave a dangling sk pointer in __smc_create()
  rxrpc: Fix missing locking causing hanging calls
  net/smc: Fix lookup of netdev by using ib_device_get_netdev()
  net: arc: rockchip: fix emac mdio node support
  net: arc: fix the device for dma_map_single/dma_unmap_single
  virtio_net: Update rss when set queue
  virtio_net: Sync rss config to device when virtnet_probe
  virtio_net: Add hash_key_length check
  virtio_net: Support dynamic rss indirection table size
  netfilter: nf_tables: wait for rcu grace period on net_device removal
  net: stmmac: Fix unbalanced IRQ wake disable warning on single irq case
  net: vertexcom: mse102x: Fix possible double free of TX skb
  mptcp: use sock_kfree_s instead of kfree
  mptcp: no admin perm to list endpoints
  net: phy: ti: add PHY_RST_AFTER_CLK_EN flag
  net: ethernet: ti: am65-cpsw: fix warning in am65_cpsw_nuss_remove_rx_chns()
  net: ethernet: ti: am65-cpsw: Fix multi queue Rx on J7
  net: hns3: fix kernel crash when uninstalling driver
  Revert "Merge branch 'there-are-some-bugfix-for-the-hns3-ethernet-driver'"
  ...

Merge tag 'nvme-6.12-2024-11-07' of git://git.infradead.org/nvme into block-6.12

Pull NVMe fix from Keith:

"nvme fix for Linux 6.13

- Use correct list traversal for srcu lists (Breno)"

* tag 'nvme-6.12-2024-11-07' of git://git.infradead.org/nvme:
nvme/host: Fix RCU list traversal to use SRCU primitive

drivers: net: ionic: add missed debugfs cleanup to ionic_probe() error path

The ionic_setup_one() creates a debugfs entry for ionic upon
successful execution. However, the ionic_probe() does not
release the dentry before returning, resulting in a memory
leak.

To fix this bug, we add the ionic_debugfs_del_dev() to release
the resources in a timely manner before returning.

Fixes: 0de38d9f1dba ("ionic: extract common bits from ionic_probe")
Signed-off-by: Wentao Liang <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/smc: do not leave a dangling sk pointer in __smc_create()

Thanks to commit 4bbd360a5084 ("socket: Print pf->create() when
it does not clear sock->sk on failure."), syzbot found an issue with AF_SMC:

smc_create must clear sock->sk on failure, family: 43, type: 1, protocol: 0
WARNING: CPU: 0 PID: 5827 at net/socket.c:1565 __sock_create+0x96f/0xa30 net/socket.c:1563
Modules linked in:
CPU: 0 UID: 0 PID: 5827 Comm: syz-executor259 Not tainted 6.12.0-rc6-next-20241106-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:__sock_create+0x96f/0xa30 net/socket.c:1563
Code: 03 00 74 08 4c 89 e7 e8 4f 3b 85 f8 49 8b 34 24 48 c7 c7 40 89 0c 8d 8b 54 24 04 8b 4c 24 0c 44 8b 44 24 08 e8 32 78 db f7 90 <0f> 0b 90 90 e9 d3 fd ff ff 89 e9 80 e1 07 fe c1 38 c1 0f 8c ee f7
RSP: 0018:ffffc90003e4fda0 EFLAGS: 00010246
RAX: 099c6f938c7f4700 RBX: 1ffffffff1a595fd RCX: ffff888034823c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000ffffffe9 R08: ffffffff81567052 R09: 1ffff920007c9f50
R10: dffffc0000000000 R11: fffff520007c9f51 R12: ffffffff8d2cafe8
R13: 1ffffffff1a595fe R14: ffffffff9a789c40 R15: ffff8880764298c0
FS:  000055557b518380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa62ff43225 CR3: 0000000031628000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
  sock_create net/socket.c:1616 [inline]
  __sys_socket_create net/socket.c:1653 [inline]
  __sys_socket+0x150/0x3c0 net/socket.c:1700
  __do_sys_socket net/socket.c:1714 [inline]
  __se_sys_socket net/socket.c:1712 [inline]

For reference, see commit 2d859aff775d ("Merge branch
'do-not-leave-dangling-sk-pointers-in-pf-create-functions'")

Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Ignat Korchagin <[email protected]>
Cc: D. Wythe <[email protected]>
Cc: Dust Li <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

rxrpc: Fix missing locking causing hanging calls

If a call gets aborted (e.g. because kafs saw a signal) between it being
queued for connection and the I/O thread picking up the call, the abort
will be prioritised over the connection and it will be removed from
local->new_client_calls by rxrpc_disconnect_client_call() without a lock
being held. This may cause other calls on the list to disappear if a race
occurs.

Fix this by taking the client_call_lock when removing a call from whatever
list its ->wait_link happens to be on.

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
Reported-by: Marc Dionne <[email protected]>
Fixes: 9d35d880e0e4 ("rxrpc: Move client call connection to the I/O thread")
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>