Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"The main and larger change here is a workaround for AMD's lack of
cache coherency for encrypted-memory guests.
I have another patch pending, but it's waiting for review from the
architecture maintainers.
RISC-V:
- Remove 's' & 'u' as valid ISA extension
- Do not allow disabling the base extensions 'i'/'m'/'a'/'c'
x86:
- Fix NMI watchdog in guests on AMD
- Fix for SEV cache incoherency issues
- Don't re-acquire SRCU lock in complete_emulated_io()
- Avoid NULL pointer deref if VM creation fails
- Fix race conditions between APICv disabling and vCPU creation
- Bugfixes for disabling of APICv
- Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
selftests:
- Do not use bitfields larger than 32-bits, they differ between GCC
and clang"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: selftests: introduce and use more page size-related constants
kvm: selftests: do not use bitfields larger than 32-bits for PTEs
KVM: SEV: add cache flush to solve SEV cache incoherency issues
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
KVM: SVM: Simplify and harden helper to flush SEV guest page(s)
KVM: selftests: Silence compiler warning in the kvm_page_table_test
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
KVM: SPDX style and spelling fixes
KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused
KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy
KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()
RISC-V: KVM: Restrict the extensions that can be disabled
RISC-V: KVM: Remove 's' & 'u' as valid ISA extension
Commit 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members
before initialization") attempted to fix a race condition that lead to a
NULL pointer, but in the process caused a regression for _AEI/_EVT
declared GPIOs.
This manifests in messages showing deferred probing while trying to
allocate IRQs like so:
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x002C to IRQ, err -517
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517
[ .. more of the same .. ]
The code for walking _AEI doesn't handle deferred probing and so this
leads to non-functional GPIO interrupts.
Fix this issue by moving the call to `acpi_gpiochip_request_interrupts`
to occur after gc->irc.initialized is set.
Merge tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes Palmer Dabbelt:
- A pair of build fixes for the recent cpuidle driver
- A fix for systems without sv57 that manifests as a crash
early in boot
* tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLE
RISC-V: mm: Fix set_satp_mode() for platform not having Sv57
cpuidle: riscv: support non-SMP config
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There's no real pattern to the fixes, but the main one fixes our
pmd_leaf() definition to resolve a NULL dereference on the migration
path.
- Fix PMU event validation in the absence of any event counters
- Fix allmodconfig build using clang in conjunction with binutils
- Fix definitions of pXd_leaf() to handle PROT_NONE entries
- More typo fixes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mm: fix p?d_leaf()
arm64: fix typos in comments
arm64: Improve HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang
arm_pmu: Validate single/group leader events
Merge tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray
Pull xarray fixes from Matthew Wilcox:
"Syzbot found a nasty race between large page splitting and page
lookup. Details in the commit log, but fortunately it has a reliable
reproducer. I thought it better to send this one to you straight away.
Also fix the test suite build for kmem_cache_alloc_lru()"
* tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray:
XArray: Disallow sibling entries of nodes
tools: Add kmem_cache_alloc_lru()
Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four fixes, two of them for stable:
- fcollapse fix
- reconnect lock fix
- DFS oops fix
- minor cleanup patch"
* tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: destage any unwritten data to the server before calling copychunk_write
cifs: use correct lock type in cifs_reconnect()
cifs: fix NULL ptr dereference in refresh_mounts()
cifs: Use kzalloc instead of kmalloc/memset
Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr fix from Christian Brauner:
"The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the
control flow and calling conventions") switched the mount attribute
codepaths from do-while to for loops as they are more idiomatic when
walking mounts.
However, we did originally choose do-while constructs because if we
request a mount or mount tree to be made read-only we need to hold
writers in the following way: The mount attribute code will grab
lock_mount_hash() and then call mnt_hold_writers() which will
_unconditionally_ set MNT_WRITE_HOLD on the mount.
Any callers that need write access have to call mnt_want_write(). They
will immediately see that MNT_WRITE_HOLD is set on the mount and the
caller will then either spin (on non-preempt-rt) or wait on
lock_mount_hash() (on preempt-rt).
The fact that MNT_WRITE_HOLD is set unconditionally means that once
mnt_hold_writers() returns we need to _always_ pair it with
mnt_unhold_writers() in both the failure and success paths.
The do-while constructs did take care of this. But Al's change to a
for loop in the failure path stops on the first mount we failed to
change mount attributes _without_ going into the loop to call
mnt_unhold_writers().
This in turn means that once we failed to make a mount read-only via
mount_setattr() - i.e. there are already writers on that mount - we
will block any writers indefinitely. Fix this by ensuring that the for
loop always unsets MNT_WRITE_HOLD including the first mount we failed
to change to read-only. Also sprinkle a few comments into the cleanup
code to remind people about what is happening including myself. After
all, I didn't catch it during review.
This is only relevant on mainline and was reported by syzbot. Details
about the syzbot reports are all in the commit message"
* tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: unset MNT_WRITE_HOLD on failure
Merge tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"At this time, the majority of changes are for pending ASoC fixes while
a few usual HD-audio and USB-audio quirks are found.
Almost all patches are small device-specific fixes, and nothing
worrisome stands out, so far"
* tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits)
ALSA: hda/realtek: Add quirk for Clevo NP70PNP
ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs
ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9
ALSA: usb-audio: Clear MIDI port active flag after draining
ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
ALSA: hda/i915: Fix one too many pci_dev_put()
ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P
ALSA: hda/hdmi: fix warning about PCM count when used with SOF
sound/oss/dmasound: fix 'dmasound_setup' defined but not used
firmware: cs_dsp: Fix overrun of unterminated control name string
ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe()
ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15
ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port
ASoC: Intel: sof_es8336: support a separate gpio to control headphone
ASoC: Intel: sof_es8336: simplify speaker gpio naming
ASoC: wm8731: Disable the regulator when probing fails
ASoC: Intel: soc-acpi: correct device endpoints for max98373
ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
ASoC: SOF: topology: Fix memory leak in sof_control_load()
ASoC: SOF: topology: cleanup dailinks on widget unload
...
There is a race between xas_split() and xas_load() which can result in
the wrong page being returned, and thus data corruption. Fortunately,
it's hard to hit (syzbot took three months to find it) and often guarded
with VM_BUG_ON().
The anatomy of this race is:
thread A thread B
order-9 page is stored at index 0x200
lookup of page at index 0x274
page split starts
load of sibling entry at offset 9
stores nodes at offsets 8-15
load of entry at offset 8
The entry at offset 8 turns out to be a node, and so we descend into it,
and load the page at index 0x234 instead of 0x274. This is hard to fix
on the split side; we could replace the entire node that contains the
order-9 page instead of replacing the eight entries. Fixing it on
the lookup side is easier; just disallow sibling entries that point
to nodes. This cannot ever be a useful thing as the descent would not
know the correct offset to use within the new node.
The test suite continues to pass, but I have not added a new test for
this bug.
Turn kmem_cache_alloc() into a wrapper around kmem_cache_alloc_lru().
Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reported-by: Liam R. Howlett <[email protected]> Reported-by: Li Wang <[email protected]>
Subsystems affected by this patch series: mm (memory-failure, memcg,
userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov"
* emailed patches from Andrew Morton <[email protected]>:
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
kcov: don't generate a warning on vm_insert_page()'s failure
MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
selftest/vm: add skip support to mremap_test
selftest/vm: support xfail in mremap_test
selftest/vm: verify remap destination address in mremap_test
selftest/vm: verify mmap addr in mremap_test
mm, hugetlb: allow for "high" userspace addresses
userfaultfd: mark uffd_wp regardless of VM_WRITE flag
memcg: sync flush only if periodic flush is delayed
mm/memory-failure.c: skip huge_zero_page in memory_failure()
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Nicholas Piggin [Fri, 22 Apr 2022 06:01:05 +0000 (16:01 +1000)]
mm/vmalloc: huge vmalloc backing pages should be split rather than compound
Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).
However a similar problem exists for other struct page fields callers
use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
not only refcounts it but uses ->lru, ->mapping, ->index.
This is not compatible with compound sub-pages, and can cause bad page
state issues like
The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.
Muchun Song [Fri, 22 Apr 2022 06:00:33 +0000 (14:00 +0800)]
arm64: mm: fix p?d_leaf()
The pmd_leaf() is used to test a leaf mapped PMD, however, it misses
the PROT_NONE mapped PMD on arm64. Fix it. A real world issue [1]
caused by this was reported by Qian Cai. Also fix pud_leaf().
Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Extra quiet after Easter, only have minor i915 and msm pulls. However
I haven't seen a PR from our misc tree in a little while, I've cc'ed
all the suspects. Once that unblocks I expect a bit larger bunch of
patches to arrive.
Otherwise as I said, one msm revert and two i915 fixes.
msm:
- revert iommu change that broke some platforms.
i915:
- Unset enable_psr2_sel_fetch if PSR2 detection fails
- Fix to detect when VRR is turned off from panel settings"
* tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm:
drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
drm/msm: Revert "drm/msm: Stop using iommu_present()"
drm/i915/display/vrr: Reset VRR capable property on a long hpd
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
As the wait_event() condition is true it will return immediately. This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is
still on a deferred list. Fix this by taking the appropriate lock when
reading invalidate_seq to ensure proper synchronisation.
I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).
Aleksandr Nogikh [Thu, 21 Apr 2022 23:36:07 +0000 (16:36 -0700)]
kcov: don't generate a warning on vm_insert_page()'s failure
vm_insert_page()'s failure is not an unexpected condition, so don't do
WARN_ONCE() in such a case.
Instead, print a kernel message and just return an error code.
This flaw has been reported under an OOM condition by sysbot [1].
The message is mainly for the benefit of the test log, in this case the
fuzzer's log so that humans inspecting the log can figure out what was
going on. KCOV is a testing tool, so I think being a little more chatty
when KCOV unexpectedly is about to fail will save someone debugging
time.
We don't want the WARN, because it's not a kernel bug that syzbot should
report, and failure can happen if the fuzzer tries hard enough (as
above).
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper. This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.
A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:
selftest/vm: verify remap destination address in mremap_test
Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy
existing mappings. This causes a segfault when regions such as text are
remapped and the permissions are changed.
Verify the requested mremap destination address does not overlap any
existing mappings by using mmap's MAP_FIXED_NOREPLACE flag. Keep
incrementing the destination address until a valid mapping is found or
fail the current test once the max address is reached.
Avoid calling mmap with requested addresses that are less than the
system's mmap_min_addr. When run as root, mmap returns EACCES when
trying to map addresses < mmap_min_addr. This is not one of the error
codes for the condition to retry the mmap in the test.
Rather than arbitrarily retrying on EACCES, don't attempt an mmap until
addr > vm.mmap_min_addr.
Add a munmap call after an alignment check as the mappings are retained
after the retry and can reach the vm.max_map_count sysctl.
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
Catalin (ARM64) said
"We should have fixed hugetlb_get_unmapped_area() as well when we added
support for 52-bit VA. The reason for commit f6795053dac8 was to
prevent normal mmap() from returning addresses above 48-bit by default
as some user-space had hard assumptions about this.
It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
but I doubt anyone would notice. It's more likely that the current
behaviour would cause issues, so I'd rather have them consistent.
Basically when arm64 gained support for 52-bit addresses we did not
want user-space calling mmap() to suddenly get such high addresses,
otherwise we could have inadvertently broken some programs (similar
behaviour to x86 here). Hence we added commit f6795053dac8. But we
missed hugetlbfs which could still get such high mmap() addresses. So
in theory that's a potential regression that should have bee addressed
at the same time as commit f6795053dac8 (and before arm64 enabled
52-bit addresses)"
Nadav Amit [Thu, 21 Apr 2022 23:35:43 +0000 (16:35 -0700)]
userfaultfd: mark uffd_wp regardless of VM_WRITE flag
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
currently only marked as write-protected if the VMA has VM_WRITE flag
set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed
on a certain page:
At this point the user would expect to still get UFFD notification when
the page is accessed for write, but the user would not get one, since
the PTE was not marked as UFFD_WP during UFFDIO_COPY.
Fix it by always marking PTEs as UFFD_WP regardless on the
write-permission in the VMA flags.
memcg: sync flush only if periodic flush is delayed
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flush are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than batch value within short amount of
time. Since the rstat flush can happen in the performance critical
codepaths like page faults, such workload can suffer greatly.
This patch fixes this regression by making the rstat flushing
conditional in the performance critical codepaths. More specifically,
the kernel relies on the async periodic rstat flusher to flush the stats
and only if the periodic flusher is delayed by more than twice the
amount of its normal time window then the kernel allows rstat flushing
from the performance critical codepaths.
Now the question: what are the side-effects of this change? The worst
that can happen is the refault codepath will see 4sec old lruvec stats
and may cause false (or missed) activations of the refaulted page which
may under-or-overestimate the workingset size. Though that is not very
concerning as the kernel can already miss or do false activations.
There are two more codepaths whose flushing behavior is not changed by
this patch and we may need to come to them in future. One is the
writeback stats used by dirty throttling and second is the deactivation
heuristic in the reclaim. For now keeping an eye on them and if there
is report of regression due to these codepaths, we will reevaluate then.
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.
get_hwpoison_page -- page is not what we want now...
The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
Merge tag 'dmaengine-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine fixes from Vinod Koul:
"A bunch of driver fixes:
- idxd device RO checks and device cleanup
- dw-edma unaligned access and alignment
- qcom: missing minItems in binding
- mediatek pm usage fix
- imx init script"
* tag 'dmaengine-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
dt-bindings: dmaengine: qcom: gpi: Add minItems for interrupts
dmaengine: idxd: skip clearing device context when device is read-only
dmaengine: idxd: add RO check for wq max_transfer_size write
dmaengine: idxd: add RO check for wq max_batch_size write
dmaengine: idxd: fix retry value to be constant for duration of function call
dmaengine: idxd: match type for retries var in idxd_enqcmds()
dmaengine: dw-edma: Fix inconsistent indenting
dmaengine: dw-edma: Fix unaligned 64bit access
dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
dmaengine: imx-sdma: Fix error checking in sdma_event_remap
dma: at_xdmac: fix a missing check on list iterator
dmaengine: imx-sdma: fix init of uart scripts
dmaengine: idxd: fix device cleanup on disable
RISC-V: mm: Fix set_satp_mode() for platform not having Sv57
When Sv57 is not available the satp.MODE test in set_satp_mode() will
fail and lead to pgdir re-programming for Sv48. The pgdir re-programming
will fail as well due to pre-existing pgdir entry used for Sv57 and as
a result kernel fails to boot on RISC-V platform not having Sv57.
To fix above issue, we should clear the pgdir memory in set_satp_mode()
before re-programming.
Populating memory : 0.684014577s
Writing to populated memory : 0.006230175s
Reading from populated memory : 0.004557805s
==== Test Assertion Failure ====
lib/kvm_util.c:1411: false
pid=125806 tid=125809 errno=4 - Interrupted system call
1 0x0000000000402f7c: addr_gpa2hva at kvm_util.c:1411
2 (inlined by) addr_gpa2hva at kvm_util.c:1405
3 0x0000000000401f52: lookup_pfn at access_tracking_perf_test.c:98
4 (inlined by) mark_vcpu_memory_idle at access_tracking_perf_test.c:152
5 (inlined by) vcpu_thread_main at access_tracking_perf_test.c:232
6 0x00007fefe9ff81ce: ?? ??:0
7 0x00007fefe9c64d82: ?? ??:0
No vm physical memory at 0xffbffff000
I can easily reproduce it with a Intel(R) Xeon(R) CPU E5-2630 with 46 bits
PA.
It turns out that the address translation for clearing idle page tracking
returned a wrong result; addr_gva2gpa()'s last step, which is based on
"pte[index[0]].pfn", did the calculation with 40 bits length and the
high 12 bits got truncated. In above case the GPA address to be returned
should be 0x3fffbffff000 for GVA 0xc0000000, but it got truncated into
0xffbffff000 and the subsequent gpa2hva lookup failed.
The width of operations on bit fields greater than 32-bit is
implementation defined, and differs between GCC (which uses the bitfield
precision) and clang (which uses 64-bit arithmetic), so this is a
potential minefield. Remove the bit fields and using manual masking
instead.
KVM: SEV: add cache flush to solve SEV cache incoherency issues
Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots). Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.
Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.
KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.
Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.
Merge tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from xfrm and can.
Current release - regressions:
- rxrpc: restore removed timer deletion
Current release - new code bugs:
- gre: fix device lookup for l3mdev use-case
- xfrm: fix egress device lookup for l3mdev use-case
Previous releases - regressions:
- sched: cls_u32: fix netns refcount changes in u32_change()
- smc: fix sock leak when release after smc_shutdown()
- xfrm: limit skb_page_frag_refill use to a single page
- eth: atlantic: invert deep par in pm functions, preventing null
derefs
- eth: stmmac: use readl_poll_timeout_atomic() in atomic state
Previous releases - always broken:
- gre: fix skb_under_panic on xmit
- openvswitch: fix OOB access in reserve_sfa_size()
- dsa: hellcreek: calculate checksums in tagger
- eth: ice: fix crash in switchdev mode
- eth: igc:
- fix infinite loop in release_swfw_sync
- fix scheduling while atomic"
* tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
drivers: net: hippi: Fix deadlock in rr_close()
selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
nfc: MAINTAINERS: add Bug entry
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
doc/ip-sysctl: add bc_forwarding
netlink: reset network and mac headers in netlink_dump()
net: mscc: ocelot: fix broken IP multicast flooding
net: dsa: hellcreek: Calculate checksums in tagger
net: atlantic: invert deep par in pm functions, preventing null derefs
can: isotp: stop timeout monitoring when no first frame was sent
bonding: do not discard lowest hash bit for non layer3+4 hashing
net: lan966x: Make sure to release ptp interrupt
ipv6: make ip6_rt_gc_expire an atomic_t
net: Handle l3mdev in ip_tunnel_init_flow
l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
net/sched: cls_u32: fix possible leak in u32_init_knode()
net/sched: cls_u32: fix netns refcount changes in u32_change()
powerpc: Update MAINTAINERS for ibmvnic and VAS
net: restore alpha order to Ethernet devices in config
...
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
Use clflush_cache_range() to flush the confidential memory when
SME_COHERENT is supported in AMD CPU. Cache flush is still needed since
SME_COHERENT only support cache invalidation at CPU side. All confidential
cache lines are still incoherent with DMA devices.
KVM: SVM: Simplify and harden helper to flush SEV guest page(s)
Rework sev_flush_guest_memory() to explicitly handle only a single page,
and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page
flushing is currently used only to flush the VMSA, and in its current
form, the helper is completely broken with respect to flushing actual
guest memory, i.e. won't work correctly for an arbitrary memory range.
VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
walks, i.e. will fault if the address is not present in the host page
tables or does not have the correct permissions. Current AMD CPUs also
do not honor SMAP overrides (undocumented in kernel versions of the APM),
so passing in a userspace address is completely out of the question. In
other words, KVM would need to manually walk the host page tables to get
the pfn, ensure the pfn is stable, and then use the direct map to invoke
VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is
particularly evil/clever and backs the guest with Secret Memory (which
unmaps memory from the direct map).
Like Xu [Sat, 9 Apr 2022 01:52:26 +0000 (09:52 +0800)]
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
NMI-watchdog is one of the favorite features of kernel developers,
but it does not work in AMD guest even with vPMU enabled and worse,
the system misrepresents this capability via /proc.
This is a PMC emulation error. KVM does not pass the latest valid
value to perf_event in time when guest NMI-watchdog is running, thus
the perf_event corresponding to the watchdog counter will enter the
old state at some point after the first guest NMI injection, forcing
the hardware register PMC0 to be constantly written to 0x800000000001.
Meanwhile, the running counter should accurately reflect its new value
based on the latest coordinated pmc->counter (from vPMC's point of view)
rather than the value written directly by the guest.
Wanpeng Li [Mon, 18 Apr 2022 07:42:32 +0000 (00:42 -0700)]
x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
host-side polling after suspend/resume. Non-bootstrap CPUs are
restored correctly by the haltpoll driver because they are hot-unplugged
during suspend and hot-plugged during resume; however, the BSP
is not hotpluggable and remains in host-sde polling mode after
the guest resume. The makes the guest pay for the cost of vmexits
every time the guest enters idle.
Fix it by recording BSP's haltpoll state and resuming it during guest
resume.
KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
disabled at the module level to avoid having to acquire the mutex and
potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
never be lifted, so piling on more inhibits is unnecessary.
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
in-kernel local APIC and APICv enabled at the module level. Consuming
kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
before the vCPU is fully onlined, i.e. it won't get the request made to
"all" vCPUs. If APICv is globally inhibited between setting apicv_active
and onlining the vCPU, the vCPU will end up running with APICv enabled
and trigger KVM's sanity check.
Mark APICv as active during vCPU creation if APICv is enabled at the
module level, both to be optimistic about it's final state, e.g. to avoid
additional VMWRITEs on VMX, and because there are likely bugs lurking
since KVM checks apicv_active in multiple vCPU creation paths. While
keeping the current behavior of consuming kvm_apicv_activated() is
arguably safer from a regression perspective, force apicv_active so that
vCPU creation runs with deterministic state and so that if there are bugs,
they are found sooner than later, i.e. not when some crazy race condition
is hit.
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes unhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.
In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).
IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.
Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.
Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspaces restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.
KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
module param. A recent refactoring to add a wrapper for setting/clearing
inhibits unintentionally changed the flag, probably due to a copy+paste
goof.
KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
Initialize debugfs_entry to its semi-magical -ENOENT value when the VM
is created. KVM's teardown when VM creation fails is kludgy and calls
kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero
initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.
KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused
Add wrappers to acquire/release KVM's SRCU lock when stashing the index
in vcpu->src_idx, along with rudimentary detection of illegal usage,
e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the
SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
unnoticed for quite some time and only cause problems when the nested
lock happens to get a different index.
Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
likely yell so loudly that it will bring the kernel to its knees.
KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy
Use the generic kvm_vcpu's srcu_idx instead of using an indentical field
in RISC-V's version of kvm_vcpu_arch. Generic KVM very intentionally
does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul
of common code.
KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()
Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
lock in kvm_arch_vcpu_ioctl_run(). More importantly, don't overwrite
vcpu->srcu_idx. If the index acquired by complete_emulated_io() differs
from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
leak a lock and hang if/when synchronize_srcu() is invoked for the
relevant grace period.
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD
and consequently we always need to pair mnt_hold_writers() with
mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a
do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for
the first mount that was changed. Fix this and make sure that the first mount
will be cleaned up and add some comments to make it more obvious.
We hold rrpriv->lock in position (1) of thread 1 and
use del_timer_sync() to wait timer to stop, but timer handler
also need rrpriv->lock in position (2) of thread 2.
As a result, rr_close() will block forever.
This patch extracts del_timer_sync() from the protection of
spin_lock_irqsave(), which could let timer handler to obtain
the needed lock.
Ronnie Sahlberg [Thu, 21 Apr 2022 01:15:36 +0000 (11:15 +1000)]
cifs: destage any unwritten data to the server before calling copychunk_write
because the copychunk_write might cover a region of the file that has not yet
been sent to the server and thus fail.
A simple way to reproduce this is:
truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile
the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with
the 'fcollapse 16k 24k' due to write-back caching.
fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call
and it will fail serverside since the file is still 0b in size serverside
until the writes have been destaged.
To avoid this we must ensure that we destage any unwritten data to the
server before calling COPYCHUNK_WRITE.
Paulo Alcantara [Thu, 21 Apr 2022 00:05:45 +0000 (21:05 -0300)]
cifs: fix NULL ptr dereference in refresh_mounts()
Either mount(2) or automount might not have server->origin_fullpath
set yet while refresh_cache_worker() is attempting to refresh DFS
referrals. Add missing NULL check and locking around it.
Merge tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:
- fix patching CPU selection in patch_text
- fix potential deadlock in ISS platform serial driver
- fix potential register clobbering in coprocessor exception handler
* tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix a7 clobbering in coprocessor context load/store
arch: xtensa: platforms: Fix deadlock in rs_close()
xtensa: patch_text: Fixup last cpu should be master
Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"One patch to fix a use-after-free race related to the on-stack
z_erofs_decompressqueue, which happens very rarely but needs to be
fixed properly soon.
The other patch fixes some sysfs Sphinx warnings"
* tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors
erofs: fix use-after-free of on-stack io[]
It turns out that making the pipe almost arbitrarily large has some
rather unexpected downsides. The kernel test robot reports a kernel
warning that is due to pipe->max_usage now growing to the point where
the iter_file_splice_write() buffer allocation can no longer be
satisfied as a slab allocation, and the
code sequence there will now always fail as a result.
That code could be modified to use kvcalloc() too, but I feel very
uncomfortable making those kinds of changes for a very niche use case
that really should have other options than make these kinds of
fundamental changes to pipe behavior.
Maybe the CRIU process dumping should be multi-threaded, and use
multiple pipes and multiple cores, rather than try to use one larger
pipe to minimize splice() calls.
x86: __memcpy_flushcache: fix wrong alignment if size > 2^32
The first "if" condition in __memcpy_flushcache is supposed to align the
"dest" variable to 8 bytes and copy data up to this alignment. However,
this condition may misbehave if "size" is greater than 4GiB.
The statement min_t(unsigned, size, ALIGN(dest, 8) - dest); casts both
arguments to unsigned int and selects the smaller one. However, the
cast truncates high bits in "size" and it results in misbehavior.
For example:
suppose that size == 0x100000001, dest == 0x200000002
min_t(unsigned, size, ALIGN(dest, 8) - dest) == min_t(0x1, 0xe) == 0x1;
...
dest += 0x1;
so we copy just one byte "and" dest remains unaligned.
This patch fixes the bug by replacing unsigned with size_t.
selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets
The test verifies that packets are correctly flooded by the bridge and
the VXLAN device by matching on the encapsulated packets at the other
end. However, if packets other than those generated by the test also
ingress the bridge (e.g., MLD packets), they will be flooded as well and
interfere with the expected count.
Make the test more robust by making sure that only the packets generated
by the test can ingress the bridge. Drop all the rest using tc filters
on the egress of 'br0' and 'h1'.
In the software data path, the problem can be solved by matching on the
inner destination MAC or dropping unwanted packets at the egress of the
VXLAN device, but this is not currently supported by mlxsw.
Fixes: d01724dd2a66 ("selftests: mlxsw: spectrum-2: Add a test for VxLAN flooding with IPv6") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
The test verifies that packets are correctly flooded by the bridge and
the VXLAN device by matching on the encapsulated packets at the other
end. However, if packets other than those generated by the test also
ingress the bridge (e.g., MLD packets), they will be flooded as well and
interfere with the expected count.
Make the test more robust by making sure that only the packets generated
by the test can ingress the bridge. Drop all the rest using tc filters
on the egress of 'br0' and 'h1'.
In the software data path, the problem can be solved by matching on the
inner destination MAC or dropping unwanted packets at the egress of the
VXLAN device, but this is not currently supported by mlxsw.
Fixes: 94d302deae25 ("selftests: mlxsw: Add a test for VxLAN flooding") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
ALSA: usb-audio: Clear MIDI port active flag after draining
When a rawmidi output stream is closed, it calls the drain at first,
then does trigger-off only when the drain returns -ERESTARTSYS as a
fallback. It implies that each driver should turn off the stream
properly after the drain. Meanwhile, USB-audio MIDI interface didn't
change the port->active flag after the drain. This may leave the
output work picking up the port that is closed right now, which
eventually leads to a use-after-free for the already released rawmidi
object.
This patch fixes the bug by properly clearing the port->active flag
after the output drain.
Dave Jiang [Mon, 11 Apr 2022 22:06:34 +0000 (15:06 -0700)]
dmaengine: idxd: skip clearing device context when device is read-only
If the device shows up as read-only configuration, skip the clearing of the
state as the context must be preserved for device re-enable after being
disabled.
Dave Jiang [Mon, 18 Apr 2022 21:33:21 +0000 (14:33 -0700)]
dmaengine: idxd: fix retry value to be constant for duration of function call
When retries is compared to wq->enqcmds_retries each loop of idxd_enqcmds(),
wq->enqcmds_retries can potentially changed by user. Assign the value
of retries to wq->enqcmds_retries during initialization so it is the
original value set when entering the function.
Dave Jiang [Mon, 18 Apr 2022 21:31:10 +0000 (14:31 -0700)]
dmaengine: idxd: match type for retries var in idxd_enqcmds()
wq->enqcmds_retries is defined as unsigned int. However, retries on the
stack is defined as int. Change retries to unsigned int to compare the same
type.
Kevin Hao [Tue, 19 Apr 2022 08:42:26 +0000 (16:42 +0800)]
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
The init_systime() may be invoked in atomic state. We have observed the
following call trace when running "phc_ctl /dev/ptp0 set" on a Intel
Agilex board.
BUG: sleeping function called from invalid context at drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c:74
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 381, name: phc_ctl
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffff80000892ef78>] stmmac_set_time+0x34/0x8c
CPU: 2 PID: 381 Comm: phc_ctl Not tainted 5.18.0-rc2-next-20220414-yocto-standard+ #567
Hardware name: SoCFPGA Agilex SoCDK (DT)
Call trace:
dump_backtrace.part.0+0xc4/0xd0
show_stack+0x24/0x40
dump_stack_lvl+0x7c/0xa0
dump_stack+0x18/0x34
__might_resched+0x154/0x1c0
__might_sleep+0x58/0x90
init_systime+0x78/0x120
stmmac_set_time+0x64/0x8c
ptp_clock_settime+0x60/0x9c
pc_clock_settime+0x6c/0xc0
__arm64_sys_clock_settime+0x88/0xf0
invoke_syscall+0x5c/0x130
el0_svc_common.constprop.0+0x4c/0x100
do_el0_svc+0x7c/0xa0
el0_svc+0x58/0xcc
el0t_64_sync_handler+0xa4/0x130
el0t_64_sync+0x18c/0x190
So we should use readl_poll_timeout_atomic() here instead of
readl_poll_timeout().
Also adjust the delay time to 10us to fix a "__bad_udelay" build error
reported by "kernel test robot <[email protected]>". I have tested this on
Intel Agilex and NXP S32G boards, there is no delay needed at all.
So the 10us delay should be long enough for most cases.
Fixes: ff8ed737860e ("net: stmmac: use readl_poll_timeout() function in init_systime()") Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Nicolas Dichtel [Wed, 13 Apr 2022 14:00:00 +0000 (16:00 +0200)]
doc/ip-sysctl: add bc_forwarding
Let's describe this sysctl.
Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Signed-off-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
RISC-V: KVM: Restrict the extensions that can be disabled
Currently, the config isa register allows us to disable all allowed
single letter ISA extensions. It shouldn't be the case as vmm shouldn't
be able to disable base extensions (imac).
These extensions should always be enabled as long as they are enabled
in the host ISA.
RISC-V: KVM: Remove 's' & 'u' as valid ISA extension
There are no ISA extension defined as 's' & 'u' in RISC-V specifications.
The misa register defines 's' & 'u' bit as Supervisor/User privilege mode
enabled. But it should not appear in the ISA extension in the device tree.
Remove those from the allowed ISA extension for kvm.
drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
If any of the PSR2 checks after intel_psr2_sel_fetch_config_valid()
fails, enable_psr2_sel_fetch will be kept enabled causing problems
in the functions that only checks for it and not for has_psr2.
So here moving the check that do not depend on enable_psr2_sel_fetch
and for the remaning ones jumping to a section that unset
enable_psr2_sel_fetch in case of failure to support PSR2.
Randy Dunlap [Sun, 3 Apr 2022 03:13:55 +0000 (20:13 -0700)]
cpuidle: riscv: support non-SMP config
Add <asm/smp.h> for cpuid_to_hartid_map etc.
This is needed for both SMP and non-SMP builds, but not having it
causes a build error for non-SMP:
drivers/cpuidle/cpuidle-riscv-sbi.c: In function 'sbi_cpuidle_init_cpu':
drivers/cpuidle/cpuidle-riscv-sbi.c:350:26: error: implicit declaration of function 'cpuid_to_hartid_map' [-Werror=implicit-function-declaration]
Song Liu [Fri, 15 Apr 2022 16:44:10 +0000 (09:44 -0700)]
vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP
Huge page backed vmalloc memory could benefit performance in many cases.
However, some users of vmalloc may not be ready to handle huge pages for
various reasons: hardware constraints, potential pages split, etc.
VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt-out huge
pages. However, it is not easy to track down all the users that require
the opt-out, as the allocation are passed different stacks and may cause
issues in different layers.
To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages could ask
specificially.
Also, remove vmalloc_no_huge() and add opt-in helper vmalloc_huge().
drm/msm: Revert "drm/msm: Stop using iommu_present()"
This reverts commit e2a88eabb02410267519b838fb9b79f5206769be. The commit
in question makes msm_use_mmu() check whether the DRM 'component master'
device is translated by the IOMMU. At this moment it is the 'mdss'
device.
However on platforms using the MDP5 driver (e.g. MSM8916/APQ8016,
MSM8996/APQ8096) it's the mdp5 device, which has the iommus property
(and thus is "translated by the IOMMU"). This results in these devices
being broken with the following lines in the dmesg.
[drm] Initialized msm 1.9.0 20130625 for 1a00000.mdss on minor 0
msm 1a00000.mdss: [drm:adreno_request_fw] loaded qcom/a300_pm4.fw from new location
msm 1a00000.mdss: [drm:adreno_request_fw] loaded qcom/a300_pfp.fw from new location
msm 1a00000.mdss: [drm:get_pages] *ERROR* could not get pages: -28
msm 1a00000.mdss: could not allocate stolen bo
msm 1a00000.mdss: [drm:get_pages] *ERROR* could not get pages: -28
msm 1a00000.mdss: [drm:msm_alloc_stolen_fb] *ERROR* failed to allocate buffer object
msm 1a00000.mdss: [drm:msm_fbdev_create] *ERROR* failed to allocate fb
Getting the mdp5 device pointer from this function is not that easy at
this moment. Thus this patch is reverted till the MDSS rework [1] lands.
It will make the mdp5/dpu1 device component master and the check will be
legit.
Merge tag 'spi-fix-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few more fixes for SPI, plus one new PCI ID for another Intel
chipset.
All device specific stuff"
* tag 'spi-fix-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
spi: cadence-quadspi: fix incorrect supports_op() return value
spi: intel: Add support for Raptor Lake-S SPI serial flash
spi: spi-mtk-nor: initialize spi controller after resume
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (No such filesystem yet exist.).
Since then, the meaning of an idmapped mount is a mount whose idmapping
is different from the filesystems idmapping.
While doing that work we missed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough. But
they need to use the no_idmapping() helper instead.
Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.
Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping. Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).
I verified the regression reported in [1] and verified that this patch
fixes it. A regression test will be added to xfstests in parallel.
Merge tag 'asoc-fix-v5.18-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.18
A collection of fixes that came in since the merge window, plus one new
device ID for an x86 laptop. Nothing that really stands out with
particularly big impact outside of the affected device.
Eric Dumazet [Fri, 15 Apr 2022 18:14:42 +0000 (11:14 -0700)]
netlink: reset network and mac headers in netlink_dump()
netlink_dump() is allocating an skb, reserves space in it
but forgets to reset network header.
This allows a BPF program, invoked later from sk_filter()
to access uninitialized kernel memory from the reserved
space.
Theorically mac header reset could be omitted, because
it is set to a special initial value.
bpf_internal_load_pointer_neg_helper calls skb_mac_header()
without checking skb_mac_header_was_set().
Relying on skb->len not being too big seems fragile.
We also could add a sanity check in bpf_internal_load_pointer_neg_helper()
to avoid surprises in the future.
Vladimir Oltean [Fri, 15 Apr 2022 15:19:50 +0000 (18:19 +0300)]
net: mscc: ocelot: fix broken IP multicast flooding
When the user runs:
bridge link set dev $br_port mcast_flood on
this command should affect not only L2 multicast, but also IPv4 and IPv6
multicast.
In the Ocelot switch, unknown multicast gets flooded according to
different PGIDs according to its type, and PGID_MC only handles L2
multicast. Therefore, by leaving PGID_MCIPV4 and PGID_MCIPV6 at their
default value of 0, unknown IP multicast traffic is never flooded.
Kurt Kanzenbach [Fri, 15 Apr 2022 10:33:20 +0000 (12:33 +0200)]
net: dsa: hellcreek: Calculate checksums in tagger
In case the checksum calculation is offloaded to the DSA master network
interface, it will include the switch trailing tag. As soon as the switch strips
that tag on egress, the calculated checksum is wrong.
Therefore, add the checksum calculation to the tagger (if required) before
adding the switch tag. This way, the hellcreek code works with all DSA master
interfaces regardless of their declared feature set.
Manasi Navare [Thu, 3 Mar 2022 23:32:22 +0000 (15:32 -0800)]
drm/i915/display/vrr: Reset VRR capable property on a long hpd
With some VRR panels, user can turn VRR ON/OFF on the fly from the panel settings.
When VRR is turned OFF ,sends a long HPD to the driver clearing the Ignore MSA bit
in the DPCD. Currently the driver parses that onevery HPD but fails to reset
the corresponding VRR Capable Connector property.
Hence the userspace still sees this as VRR Capable panel which is incorrect.
Fix this by explicitly resetting the connector property.
v2: Reset vrr capable if status == connector_disconnected
v3: Use i915 and use bool vrr_capable (Jani Nikula)
v4: Move vrr_capable to after update modes call (Jani N)
Remove the redundant comment (Jan N)
v5: Fixes the regression on older platforms by resetting the VRR
only if HAS_VRR
v6: Remove the checks from driver, add in drm core before
setting VRR prop (Ville)
v7: Move VRR set/reset to set/unset_edid (Ville)
ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
The USB audio device 0db0:a073 based on the Realtek ALC4080 chipset
exposes all playback volume controls as "PCM". This makes
distinguishing the individual functions hard.
The mapping already adopted for device 0db0:419c based on the same
chipset fixes the issue, apply it for this device too.
Lucas De Marchi [Sat, 16 Apr 2022 06:44:18 +0000 (23:44 -0700)]
ALSA: hda/i915: Fix one too many pci_dev_put()
pci_get_class() will already unref the pci device passed as argument.
So if it's unconditionally unref'ed, even if the loop is not stopped,
there will be one too many unref for each device not matched.
Manuel Ullmann [Sun, 17 Apr 2022 22:20:01 +0000 (00:20 +0200)]
net: atlantic: invert deep par in pm functions, preventing null derefs
This will reset deeply on freeze and thaw instead of suspend and
resume and prevent null pointer dereferences of the uninitialized ring
0 buffer while thawing.
The impact is an indefinitely hanging kernel. You can't switch
consoles after this and the only possible user interaction is SysRq.
The bug has been present since the introduction of the new pm code in 8aaa112a57c1 ("net: atlantic: refactoring pm logic") and was hidden
until 8ce84271697a ("net: atlantic: changes for multi-TC support"),
which refactored the aq_vec_{free,alloc} functions into
aq_vec_{,ring}_{free,alloc}, but is technically not wrong. The
original functions just always reinitialized the buffers on S3/S4. If
the interface is down before freezing, the bug does not occur. It does
not matter, whether the initrd contains and loads the module before
thawing.
So the fix is to invert the boolean parameter deep in all pm function
calls, which was clearly intended to be set like that.
First report was on Github [1], which you have to guess from the
resume logs in the posted dmesg snippet. Recently I posted one on
Bugzilla [2], since I did not have an AQC device so far.
Merge tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixlet from Juergen Gross:
"A single cleanup patch for the Xen balloon driver"
* tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: don't use PV mode extra memory for zone device allocations
Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two x86 fixes related to TSX:
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX
to cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update
which provides TSX development mode does not suddenly make the
system vulnerable to TSX Asynchronous Abort"
* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Disable TSX development mode at boot
x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
Merge tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for the timers core:
- Fix the warning condition in __run_timers() which does not take
into account that a CPU base (especially the deferrable base) never
has a timer armed on it and therefore the next_expiry value can
become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to
prevent endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format"
* tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Fix non-kernel-doc comment
tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
timers: Fix warning condition in __run_timers()
Merge tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
"Two fixes for the SMP core:
- Make the warning condition in flush_smp_call_function_queue()
correct, which checked a just emptied list head for being empty
instead of validating that there was no pending entry on the
offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the
CPU hotplug thread for the upcoming CPU is created. That's too late
because the creation of the thread can fail and then the following
rollback operates on CPU0. Get rid of the CPU member and hand the
CPU number to the involved functions directly"
* tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state
smp: Fix offline cpu check in flush_smp_call_function_queue()
Merge tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for the interrupt affinity spreading logic to take into
account that there can be an imbalance between present and possible
CPUs, which causes already assigned bits to be overwritten"
* tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Consider that CPUs on nodes can be unbalanced