Anand Jain [Tue, 13 Feb 2024 01:13:56 +0000 (09:13 +0800)]
btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Tue, 12 Mar 2024 19:28:34 +0000 (12:28 -0700)]
Merge tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly stabilization, refactoring and cleanup changes. There rest are
minor performance optimizations due to caching or lock contention
reduction and a few notable fixes.
Performance improvements:
- minor speedup in logging when repeatedly allocated structure is
preallocated only once, improves latency and decreases lock
contention
- minor throughput increase (+6%), reduced lock contention after
clearing delayed allocation bits, applies to several common
workload types
- skip full quota rescan if a new relation is added in the same
transaction
Fixes:
- zstd fix for inline compressed file in subpage mode, updated
version from the 6.8 time
- proper qgroup inheritance ioctl parameter validation
- more fiemap followup fixes after reduced locking done in 6.8:
- fix race when detecting delalloc ranges
Core changes:
- more debugging code:
- added assertions for a very rare crash in raid56 calculation
- tree-checker dumps page state to give more insights into
possible reference counting issues
- add checksum calculation offloading sysfs knob, for now enabled
under DEBUG only to determine a good heuristic for deciding the
offload or synchronous, depends on various factors (block group
profile, device speed) and is not as clear as initially thought
(checksum type)
- error handling improvements, added assertions
- more page to folio conversion (defrag, truncate), cached size and
shift
- preparation for more fine grained locking of sectors in subpage
mode
- cleanups and refactoring:
- include cleanups, forward declarations
- pointer-to-structure helpers
- redundant argument removals
- removed unused code
- slab cache updates, last use of SLAB_MEM_SPREAD removed"
* tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
btrfs: fix race when detecting delalloc ranges during fiemap
btrfs: fix off-by-one chunk length calculation at contains_pending_extent()
btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent
btrfs: qgroup: validate btrfs_qgroup_inherit parameter
btrfs: include device major and minor numbers in the device scan notice
btrfs: mark btrfs_put_caching_control() static
btrfs: remove SLAB_MEM_SPREAD flag use
btrfs: qgroup: always free reserved space for extent records
btrfs: tree-checker: dump the page status if hit something wrong
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
btrfs: subpage: make writer lock utilize bitmap
btrfs: subpage: make reader lock utilize bitmap
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
btrfs: pass a valid extent map cache pointer to __get_extent_map()
btrfs: merge btrfs_del_delalloc_inode() helpers
btrfs: pass btrfs_device to btrfs_scratch_superblocks()
btrfs: handle transaction commit errors in flush_reservations()
btrfs: use KMEM_CACHE() to create btrfs_free_space cache
btrfs: use KMEM_CACHE() to create delayed ref caches
...
Linus Torvalds [Tue, 12 Mar 2024 19:24:40 +0000 (12:24 -0700)]
Merge tag 'zonefs-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
- A single change for this cycle to convert zonefs to use the new
mount API
* tag 'zonefs-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: convert zonefs to use the new mount api
Linus Torvalds [Tue, 12 Mar 2024 17:56:28 +0000 (10:56 -0700)]
Merge tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Just two small updates this time:
- A series I did to unify the definition of PAGE_SIZE through
Kconfig, intended to help with a vdso rework that needs the
constant but cannot include the normal kernel headers when building
the compat VDSO on arm64 and potentially others
- a patch from Yan Zhao to remove the pfn_to_virt() definitions from
a couple of architectures after finding they were both incorrect
and entirely unused"
* tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: define CONFIG_PAGE_SIZE_*KB on all architectures
arch: simplify architecture specific page size configuration
arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc
Linus Torvalds [Tue, 12 Mar 2024 17:52:18 +0000 (10:52 -0700)]
Merge tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM defconfig updates from Arnd Bergmann:
"This has the usual updates to enable platform specific driver modules
as new hardware gets supported, as well as an update to the
virt.config fragment so we disable all newly added platforms again"
* tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
arm64: defconfig: Enable support for cbmem entries in the coreboot table
ARM: defconfig: enable STMicroelectronics accelerometer and gyro for Exynos
arm64: defconfig: drop ext2 filesystem and redundant ext3
arm64: defconfig: Enable Rockchip HDMI/eDP Combo PHY
arm64: defconfig: Enable Wave5 Video Encoder/Decoder
arm64: config: disable new platforms in virt.config
arm64: defconfig: Enable QCOM PBS
arm64: deconfig: enable Goodix Berlin SPI touchscreen driver as module
arm64: defconfig: Enable X1E80100 multimedia clock controllers configs
arm64: defconfig: Enable GCC and interconnect for QDU1000/QRU1000
arm64: defconfig: enable i.MX8MP ldb bridge
arm64: defconfig: enable the vf610 gpio driver
ARM: imx_v6_v7_defconfig: enable the vf610 gpio driver
ARM: multi_v7_defconfig: Add more TI Keystone support
arm64: defconfig: enable WCD939x USBSS driver as module
arm64: defconfig: enable audio drivers for SM8650 QRD board
arm64: defconfig: Enable Qualcomm interconnect providers
ARM: multi_v7_defconfig: Enable BACKLIGHT_CLASS_DEVICE
arm64: defconfig: Enable i.MX8QXP device drivers
ARM: multi_v7_defconfig: Add more TI Keystone support
...
Linus Torvalds [Tue, 12 Mar 2024 17:47:33 +0000 (10:47 -0700)]
Merge tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC code updates from Arnd Bergmann:
"These are mostly minor updates, including a number of kerneldoc fixes
from Randy Dunlap across multiple platforms. OMAP gets a few bugfixes,
and the MAINTAINERS file gets updated for AMD Zynq and NXP S32G"
* tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
ARM: s32c: update MAINTAINERS entry
ARM: AM33xx: PRM: Implement REBOOT_COLD
ARM: AM33xx: PRM: Remove redundand defines
ARM: omap1: remove duplicated 'select ARCH_OMAP'
ARM: s3c64xx: make bus_type const
ARM: imx: Remove usage of the deprecated ida_simple_xx() API
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix a kernel-doc warning
ARM: OMAP2+: PRM: fix kernel-doc warnings
ARM: OMAP2+: prm44xx: fix a kernel-doc warning
ARM: OMAP2+: pmic-cpcap: fix kernel-doc warnings
ARM: OMAP2+: hwmod: fix kernel-doc warnings
ARM: OMAP2+: hwmod: remove misuse of kernel-doc
ARM: OMAP2+: CMINST: use matching function name in kernel-doc
ARM: OMAP2+: cm33xx: use matching function name in kernel-doc
ARM: OMAP2+: clock: fix a function name in kernel-doc
ARM: OMAP2+: clockdomain: fix kernel-doc warnings
ARM: OMAP2+: am33xx-restart: fix function name in kernel-doc
soc: xilinx: update maintainer of event manager driver
...
Linus Torvalds [Tue, 12 Mar 2024 17:35:24 +0000 (10:35 -0700)]
Merge tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code
- Feature updates for Tegra and Qualcomm platforms for added hardware
support
- A number of platforms get soc_device additions for identifying
newly added chips from Renesas, Qualcomm, Mediatek and Google
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems"
* tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits)
tee: make tee_bus_type const
soc: qcom: aoss: add missing kerneldoc for qmp members
soc: qcom: geni-se: drop unused kerneldoc struct geni_wrapper param
soc: qcom: spm: fix building with CONFIG_REGULATOR=n
bus: ti-sysc: constify the struct device_type usage
memory: stm32-fmc2-ebi: keep power domain on
memory: stm32-fmc2-ebi: add MP25 RIF support
memory: stm32-fmc2-ebi: add MP25 support
memory: stm32-fmc2-ebi: check regmap_read return value
dt-bindings: memory-controller: st,stm32: add MP25 support
dt-bindings: bus: imx-weim: convert to YAML
watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs
soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regs
MAINTAINERS: Update SCMI entry with HWMON driver
MAINTAINERS: samsung: gs101: match patches touching Google Tensor SoC
memory: tegra: Fix indentation
memory: tegra: Add BPMP and ICC info for DLA clients
memory: tegra: Correct DLA client names
dt-bindings: memory: renesas,rpc-if: Document R-Car V4M support
firmware: arm_scmi: Update the supported clock protocol version
...
Linus Torvalds [Tue, 12 Mar 2024 17:29:57 +0000 (10:29 -0700)]
Merge tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC device tree updates from Arnd Bergmann:
"There is very little going on with new SoC support this time, all the
new chips are variations of others that we already support, and they
are all based on ARMv8 cores:
- Mediatek MT7981B (Filogic 820) and MT7988A (Filogic 880) are
networking SoCs designed to be used in wireless routers, similar to
the already supported MT7986A (Filogic 830).
- NXP i.MX8DXP is a variant of i.MX8QXP, with two CPU cores less.
These are used in many embedded and industrial applications.
- Renesas R8A779G2 (R-Car V4H ES2.0) and R8A779H0 (R-Car V4M) are
automotive SoCs.
- TI J722S is another automotive variant of its K3 family, related to
the AM62 series.
There are a total of 7 new arm32 machines and 45 arm64 ones, including
- Two Android phones based on the old Tegra30 chip
- Two machines using Cortex-A53 SoCs from Allwinner, a mini PC and a
SoM development board
- A set-top box using Amlogic Meson G12A S905X2
- Eight embedded board using NXP i.MX6/8/9
- Three machines using Mediatek network router chips
- Ten Chromebooks, all based on Mediatek MT8186
- One development board based on Mediatek MT8395 (Genio 1200)
- Seven tablets and phones based on Qualcomm SoCs, most of them from
Samsung.
- A third development board for Qualcomm SM8550 (Snapdragon 8 Gen 2)
- Three variants of the "White Hawk" board for Renesas automotive
SoCs
- Ten Rockchips RK35xx based machines, including NAS, Tablet, Game
console and industrial form factors.
- Three evaluation boards for TI K3 based SoCs
The other changes are mainly the usual feature additions for existing
hardware, cleanups, and dtc compile time fixes. One notable change is
the inclusion of PowerVR SGX GPU nodes on TI SoCs"
* tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (824 commits)
riscv: dts: Move BUILTIN_DTB_SOURCE to common Kconfig
riscv: dts: starfive: jh7100: fix root clock names
ARM: dts: samsung: exynos4412: decrease memory to account for unusable region
arm64: dts: qcom: sm8250-xiaomi-elish: set rotation
arm64: dts: qcom: sm8650: Fix SPMI channels size
arm64: dts: qcom: sm8550: Fix SPMI channels size
arm64: dts: rockchip: Fix name for UART pin header on qnap-ts433
arm: dts: marvell: clearfog-gtr-l8: align port numbers with enclosure
arm: dts: marvell: clearfog-gtr-l8: add support for second sfp connector
dt-bindings: soc: renesas: renesas-soc: Add pattern for gray-hawk
dtc: Enable dtc interrupt_provider check
arm64: dts: st: add video encoder support to stm32mp255
arm64: dts: st: add video decoder support to stm32mp255
ARM: dts: stm32: enable crypto accelerator on stm32mp135f-dk
ARM: dts: stm32: enable CRC on stm32mp135f-dk
ARM: dts: stm32: add CRC on stm32mp131
ARM: dts: add stm32f769-disco-mb1166-reva09
ARM: dts: stm32: add display support on stm32f769-disco
ARM: dts: stm32: rename mmc_vcard to vcc-3v3 on stm32f769-disco
ARM: dts: stm32: add DSI support on stm32f769
...
Linus Torvalds [Tue, 12 Mar 2024 17:27:52 +0000 (10:27 -0700)]
Merge tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Make the Zorro bus type constant
- defconfig updates
* tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: defconfig: Update defconfigs for v6.8-rc1
zorro: Make zorro_bus_type const
Linus Torvalds [Tue, 12 Mar 2024 17:14:22 +0000 (10:14 -0700)]
Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a too small heap
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support witch clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using singe instruction inline assemblies. This
increases readability, but also allows makes it easier to add
proper instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
Linus Torvalds [Tue, 12 Mar 2024 16:58:57 +0000 (09:58 -0700)]
Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Continuing work by Ard Biesheuvel to improve the x86 early startup
code, with the long-term goal to make it position independent:
- Get rid of early accesses to global objects, either by moving
them to the stack, deferring the access until later, or dropping
the globals entirely
- Move all code that runs early via the 1:1 mapping into
.head.text, and move code that does not out of it, so that build
time checks can be added later to ensure that no inadvertent
absolute references were emitted into code that does not
tolerate them
- Remove fixup_pointer() and occurrences of __pa_symbol(), which
rely on the compiler emitting absolute references, which is not
guaranteed
- Improve the early console code
- Add early console message about ignored NMIs, so that users are at
least warned about their existence - even if we cannot do anything
about them
- Improve the kexec code's kernel load address handling
- Enable more X86S (simplified x86) bits
- Simplify early boot GDT handling
- Micro-optimize the boot code a bit
- Misc cleanups
* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/sev: Move early startup code into .head.text section
x86/sme: Move early SME kernel encryption handling into .head.text
x86/boot: Move mem_encrypt= parsing to the decompressor
efi/libstub: Add generic support for parsing mem_encrypt=
x86/startup_64: Simplify virtual switch on primary boot
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify CR4 handling in startup code
x86/boot: Use 32-bit XOR to clear registers
efi/x86: Set the PE/COFF header's NX compat flag unconditionally
x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
x86/boot/64: Use RIP_REL_REF() to access early page tables
x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
x86/boot/64: Simplify global variable accesses in GDT/IDT programming
x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
kexec: Allocate kernel above bzImage's pref_address
x86/boot: Add a message about ignored early NMIs
...
Linus Torvalds [Tue, 12 Mar 2024 16:45:34 +0000 (09:45 -0700)]
Merge tag 'x86-apic-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC fixup from Dave Hansen:
"Revert VERW fixed addressing patch.
The reverted commit is not x86/apic material and was cruft left over
from a merge.
I believe the sequence of events went something like this:
- The commit in question was added to x86/urgent
- x86/urgent was merged into x86/apic to resolve a conflict
- The commit was zapped from x86/urgent, but *not* from x86/apic
- x86/apic got pullled (yesterday)
I think we need to be a bit more vigilant when zapping things to make
sure none of the other branches are depending on the zapped material"
* tag 'x86-apic-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86/bugs: Use fixed addressing for VERW operand"
Linus Torvalds [Tue, 12 Mar 2024 16:31:39 +0000 (09:31 -0700)]
Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
"RFDS is a CPU vulnerability that may allow a malicious userspace to
infer stale register values from kernel space. Kernel registers can
have all kinds of secrets in them so the mitigation is basically to
wait until the kernel is about to return to userspace and has user
values in the registers. At that point there is little chance of
kernel secrets ending up in the registers and the microarchitectural
state can be cleared.
This leverages some recent robustness fixes for the existing MDS
vulnerability. Both MDS and RFDS use the VERW instruction for
mitigation"
* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
x86/rfds: Mitigate Register File Data Sampling (RFDS)
Documentation/hw-vuln: Add documentation for RFDS
x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
Dave Hansen [Tue, 12 Mar 2024 14:27:57 +0000 (07:27 -0700)]
Revert "x86/bugs: Use fixed addressing for VERW operand"
This was reverts commit
8009479ee919b9a91674f48050ccbff64eafedaa.
It was originally in x86/urgent, but was deemed wrong so got zapped.
But in the meantime, x86/urgent had been merged into x86/apic to
resolve a conflict. I didn't notice the merge so didn't zap it
from x86/apic and it managed to make it up with the x86/apic
material.
The reverted commit is known to cause some KASAN problems.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Ingo Molnar [Tue, 12 Mar 2024 08:49:52 +0000 (09:49 +0100)]
Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:
38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:
1c811d403afd x86/sev: Fix position dependent variable references in startup code
... which was also done by an internal merge resolution:
2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree
But Linus is right in
38b334fc767e, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.
So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.
Conflicts:
arch/x86/include/asm/coco.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Tue, 12 Mar 2024 03:20:36 +0000 (20:20 -0700)]
Merge tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tdx update from Dave Hansen:
- Fix sparse warning from TDX use of movdir64b()
* tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
Linus Torvalds [Tue, 12 Mar 2024 03:07:52 +0000 (20:07 -0700)]
Merge tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
- Add a warning when memory encryption conversions fail. These
operations require VMM cooperation, even in CoCo environments where
the VMM is untrusted. While it's _possible_ that memory pressure
could trigger the new warning, the odds are that a guest would only
see this from an attacking VMM.
- Simplify page fault code by re-enabling interrupts unconditionally
- Avoid truncation issues when pfns are passed in to pfn_to_kaddr()
with small (<64-bit) types.
* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
x86/mm: Get rid of conditional IF flag handling in page fault path
x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
Linus Torvalds [Tue, 12 Mar 2024 02:53:15 +0000 (19:53 -0700)]
Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
- The biggest change is the rework of the percpu code, to support the
'Named Address Spaces' GCC feature, by Uros Bizjak:
- This allows C code to access GS and FS segment relative memory
via variables declared with such attributes, which allows the
compiler to better optimize those accesses than the previous
inline assembly code.
- The series also includes a number of micro-optimizations for
various percpu access methods, plus a number of cleanups of %gs
accesses in assembly code.
- These changes have been exposed to linux-next testing for the
last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally working handling
of FPU switching - which also generates better code
- Propagate more RIP-relative addressing in assembly code, to generate
slightly better code
- Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
make it easier for distros to follow & maintain these options
- Rework the x86 idle code to cure RCU violations and to clean up the
logic
- Clean up the vDSO Makefile logic
- Misc cleanups and fixes
* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/idle: Select idle routine only once
x86/idle: Let prefer_mwait_c1_over_halt() return bool
x86/idle: Cleanup idle_setup()
x86/idle: Clean up idle selection
x86/idle: Sanitize X86_BUG_AMD_E400 handling
sched/idle: Conditionally handle tick broadcast in default_idle_call()
x86: Increase brk randomness entropy for 64-bit systems
x86/vdso: Move vDSO to mmap region
x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
x86/retpoline: Ensure default return thunk isn't used at runtime
x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
x86/vdso: Use $(addprefix ) instead of $(foreach )
x86/vdso: Simplify obj-y addition
x86/vdso: Consolidate targets and clean-files
x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
...
Linus Torvalds [Tue, 12 Mar 2024 02:37:56 +0000 (19:37 -0700)]
Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, including a large series from Thomas Gleixner to cure
sparse warnings"
* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Drop unused declaration of proc_nmi_enabled()
x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
x86/percpu: Cure per CPU madness on UP
smp: Consolidate smp_prepare_boot_cpu()
x86/msr: Add missing __percpu annotations
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
perf/x86/amd/uncore: Fix __percpu annotation
x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
x86/apm_32: Remove dead function apm_get_battery_status()
x86/insn-eval: Fix function param name in get_eff_addr_sib()
Linus Torvalds [Tue, 12 Mar 2024 02:23:16 +0000 (19:23 -0700)]
Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
- Reduce <asm/bootparam.h> dependencies
- Simplify <asm/efi.h>
- Unify *_setup_data definitions into <asm/setup_data.h>
- Reduce the size of <asm/bootparam.h>
* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Do not include <asm/bootparam.h> in several files
x86/efi: Implement arch_ima_efi_boot_mode() in source file
x86/setup: Move internal setup_data structures into setup_data.h
x86/setup: Move UAPI setup structures into setup_data.h
Linus Torvalds [Tue, 12 Mar 2024 02:13:06 +0000 (19:13 -0700)]
Merge tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Two changes to simplify the x86 decoder logic a bit"
* tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Directly assign x86_64 state in insn_init()
x86/insn: Remove superfluous checks from instruction decoding routines
Linus Torvalds [Tue, 12 Mar 2024 01:45:16 +0000 (18:45 -0700)]
Merge tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Fix inconsistency in misfit task load-balancing
- Fix CPU isolation bugs in the task-wakeup logic
- Rework and unify the sched_use_asym_prio() and sched_asym_prefer()
logic
- Clean up and simplify ->avg_* accesses
- Misc cleanups and fixes
* tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
sched/fair: Remove unused parameter from sched_asym()
sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
sched/fair: Simplify the update_sd_pick_busiest() logic
sched/fair: Do strict inequality check for busiest misfit task group
sched/fair: Remove unnecessary goto in update_sd_lb_stats()
sched/fair: Take the scheduling domain into account in select_idle_core()
sched/fair: Take the scheduling domain into account in select_idle_smt()
sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
sched/core: Simplify code by removing duplicate #ifdefs
Linus Torvalds [Tue, 12 Mar 2024 01:33:03 +0000 (18:33 -0700)]
Merge tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Micro-optimize local_xchg() and the rtmutex code on x86
- Fix percpu-rwsem contention tracepoints
- Simplify debugging Kconfig dependencies
- Update/clarify the documentation of atomic primitives
- Misc cleanups
* tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
locking/percpu-rwsem: Trigger contention tracepoints only if contended
locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
locking/mutex: Simplify <linux/mutex.h>
locking/qspinlock: Fix 'wait_early' set but not used warning
locking/atomic: scripts: Clarify ordering of conditional atomics
Linus Torvalds [Tue, 12 Mar 2024 01:14:06 +0000 (18:14 -0700)]
Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add a FRU (Field Replaceable Unit) memory poison manager which
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.
The main use case is stacked, on-chip memory which cannot simply be
replaced so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.
- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.
- igen6: Add support for Alder Lake-N SoC
- i10nm: Add Grand Ridge support
- The usual fixlets and cleanups
* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/versal: Convert to platform remove callback returning void
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
EDAC/versal: Make the bit position of injected errors configurable
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
Linus Torvalds [Tue, 12 Mar 2024 01:02:44 +0000 (18:02 -0700)]
Merge tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Borislav Petkov:
- Fix a wrong check in the function reporting whether a CPU executes
(or not) a NMI handler
- Ratelimit unknown NMIs messages in order to not potentially slow down
the machine
- Other fixlets
* tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the inverse "in NMI handler" check
Documentation/maintainer-tip: Add C++ tail comments exception
Documentation/maintainer-tip: Add Closes tag
x86/nmi: Rate limit unknown NMI messages
Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
Linus Torvalds [Tue, 12 Mar 2024 00:44:11 +0000 (17:44 -0700)]
Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the x86 part of the SEV-SNP host support.
This will allow the kernel to be used as a KVM hypervisor capable of
running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
is the ultimate goal of the AMD confidential computing side,
providing the most comprehensive confidential computing environment
up to date.
This is the x86 part and there is a KVM part which did not get ready
in time for the merge window so latter will be forthcoming in the
next cycle.
- Rework the early code's position-dependent SEV variable references in
order to allow building the kernel with clang and -fPIE/-fPIC and
-mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place
* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/sev: Disable KMSAN for memory encryption TUs
x86/sev: Dump SEV_STATUS
crypto: ccp - Have it depend on AMD_IOMMU
iommu/amd: Fix failure return from snp_lookup_rmpentry()
x86/sev: Fix position dependent variable references in startup code
crypto: ccp: Make snp_range_list static
x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
Documentation: virt: Fix up pre-formatted text block for SEV ioctls
crypto: ccp: Add the SNP_SET_CONFIG command
crypto: ccp: Add the SNP_COMMIT command
crypto: ccp: Add the SNP_PLATFORM_STATUS command
x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
crypto: ccp: Handle legacy SEV commands when SNP is enabled
crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
x86/sev: Introduce an SNP leaked pages list
crypto: ccp: Provide an API to issue SEV and SNP commands
...
Linus Torvalds [Tue, 12 Mar 2024 00:29:55 +0000 (17:29 -0700)]
Merge tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull resource control updates from Borislav Petkov:
- Rework different aspects of the resctrl code like adding
arch-specific accessors and splitting the locking, in order to
accomodate ARM's MPAM implementation of hw resource control and be
able to use the same filesystem control interface like on x86. Work
by James Morse
- Improve the memory bandwidth throttling heuristic to handle workloads
with not too regular load levels which end up penalized unnecessarily
- Use CPUID to detect the memory bandwidth enforcement limit on AMD
- The usual set of fixes
* tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/resctrl: Remove lockdep annotation that triggers false positive
x86/resctrl: Separate arch and fs resctrl locks
x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
x86/resctrl: Add CPU offline callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
x86/resctrl: Add CPU online callback for resctrl work
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
x86/resctrl: Track the number of dirty RMID a CLOSID has
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Access per-rmid structures by index
...
Linus Torvalds [Tue, 12 Mar 2024 00:27:12 +0000 (17:27 -0700)]
Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MTRR update from Borislav Petkov:
- Relax the PAT MSR programming which was unnecessarily using the MTRR
programming protocol of disabling the cache around the changes. The
reason behind this is the current algorithm triggering a #VE
exception for TDX guests and unnecessarily complicating things
* tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pat: Simplify the PAT programming protocol
Linus Torvalds [Tue, 12 Mar 2024 00:25:45 +0000 (17:25 -0700)]
Merge tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu update from Borislav Petkov:
- Have AMD Zen common init code run on all families from Zen1 onwards
in order to save some future enablement effort
* tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Do the common init on future Zens too
Linus Torvalds [Tue, 12 Mar 2024 00:22:57 +0000 (17:22 -0700)]
Merge tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixlet from Borislav Petkov:
- Constify yet another static struct bus_type instance now that the
driver core can handle that
* tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Make mce_subsys const
Linus Torvalds [Tue, 12 Mar 2024 00:11:28 +0000 (17:11 -0700)]
Revert "dm: use queue_limits_set"
This reverts commit
8e0ef412869430d114158fc3b9b1fb111e247bd3.
It's broken, and causes the boot to fail on encrypted volumes.
Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 11 Mar 2024 23:15:43 +0000 (16:15 -0700)]
Merge tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Thomas Gleixner:
"A single update for the x86 entry code:
The current CR3 handling for kernel page table isolation in the
paranoid return paths which are relevant for #NMI, #MCE, #VC, #DB and
#DF is unconditionally writing CR3 with the value retrieved on
exception entry.
In the vast majority of cases when returning to the kernel this is a
pointless exercise because CR3 was not modified on exception entry.
The only situation where this is necessary is when the exception
interrupts a entry from user before switching to kernel CR3 or
interrupts an exit to user after switching back to user CR3.
As CR3 writes can be expensive on some systems this becomes measurable
overhead with high frequency #NMIs such as perf.
Avoid this overhead by checking the CR3 value, which was saved on
entry, and write it back to CR3 only when it is a user CR3"
* tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Avoid redundant CR3 write on paranoid returns
Linus Torvalds [Mon, 11 Mar 2024 23:00:17 +0000 (16:00 -0700)]
Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:
"Support for x86 Fast Return and Event Delivery (FRED).
FRED is a replacement for IDT event delivery on x86 and addresses most
of the technical nightmares which IDT exposes:
1) Exception cause registers like CR2 need to be manually preserved
in nested exception scenarios.
2) Hardware interrupt stack switching is suboptimal for nested
exceptions as the interrupt stack mechanism rewinds the stack on
each entry which requires a massive effort in the low level entry
of #NMI code to handle this.
3) No hardware distinction between entry from kernel or from user
which makes establishing kernel context more complex than it needs
to be especially for unconditionally nestable exceptions like NMI.
4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
is a problem when the perf NMI takes a fault when collecting a
stack trace.
5) Partial restore of ESP when returning to a 16-bit segment
6) Limitation of the vector space which can cause vector exhaustion
on large systems.
7) Inability to differentiate NMI sources
FRED addresses these shortcomings by:
1) An extended exception stack frame which the CPU uses to save
exception cause registers. This ensures that the meta information
for each exception is preserved on stack and avoids the extra
complexity of preserving it in software.
2) Hardware interrupt stack switching is non-rewinding if a nested
exception uses the currently interrupt stack.
3) The entry points for kernel and user context are separate and GS
BASE handling which is required to establish kernel context for
per CPU variable access is done in hardware.
4) NMIs are now nesting protected. They are only reenabled on the
return from NMI.
5) FRED guarantees full restore of ESP
6) FRED does not put a limitation on the vector space by design
because it uses a central entry points for kernel and user space
and the CPUstores the entry type (exception, trap, interrupt,
syscall) on the entry stack along with the vector number. The
entry code has to demultiplex this information, but this removes
the vector space restriction.
The first hardware implementations will still have the current
restricted vector space because lifting this limitation requires
further changes to the local APIC.
7) FRED stores the vector number and meta information on stack which
allows having more than one NMI vector in future hardware when the
required local APIC changes are in place.
The series implements the initial FRED support by:
- Reworking the existing entry and IDT handling infrastructure to
accomodate for the alternative entry mechanism.
- Expanding the stack frame to accomodate for the extra 16 bytes FRED
requires to store context and meta information
- Providing FRED specific C entry points for events which have
information pushed to the extended stack frame, e.g. #PF and #DB.
- Providing FRED specific C entry points for #NMI and #MCE
- Implementing the FRED specific ASM entry points and the C code to
demultiplex the events
- Providing detection and initialization mechanisms and the necessary
tweaks in context switching, GS BASE handling etc.
The FRED integration aims for maximum code reuse vs the existing IDT
implementation to the extent possible and the deviation in hot paths
like context switching are handled with alternatives to minimalize the
impact. The low level entry and exit paths are seperate due to the
extended stack frame and the hardware based GS BASE swichting and
therefore have no impact on IDT based systems.
It has been extensively tested on existing systems and on the FRED
simulation and as of now there are no outstanding problems"
* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/fred: Fix init_task thread stack pointer initialization
MAINTAINERS: Add a maintainer entry for FRED
x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
x86/fred: Invoke FRED initialization code to enable FRED
x86/fred: Add FRED initialization functions
x86/syscall: Split IDT syscall setup code into idt_syscall_init()
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
x86/traps: Add sysvec_install() to install a system interrupt handler
x86/fred: FRED entry/exit and dispatch code
x86/fred: Add a machine check entry stub for FRED
x86/fred: Add a NMI entry stub for FRED
x86/fred: Add a debug fault entry stub for FRED
x86/idtentry: Incorporate definitions/declarations of the FRED entries
x86/fred: Make exc_page_fault() work for FRED
x86/fred: Allow single-step trap and NMI when starting a new task
x86/fred: No ESPFIX needed when FRED is enabled
...
Linus Torvalds [Mon, 11 Mar 2024 22:45:55 +0000 (15:45 -0700)]
Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
"Rework of APIC enumeration and topology evaluation.
The current implementation has a couple of shortcomings:
- It fails to handle hybrid systems correctly.
- The APIC registration code which handles CPU number assignents is
in the middle of the APIC code and detached from the topology
evaluation.
- The various mechanisms which enumerate APICs, ACPI, MPPARSE and
guest specific ones, tweak global variables as they see fit or in
case of XENPV just hack around the generic mechanisms completely.
- The CPUID topology evaluation code is sprinkled all over the vendor
code and reevaluates global variables on every hotplug operation.
- There is no way to analyze topology on the boot CPU before bringing
up the APs. This causes problems for infrastructure like PERF which
needs to size certain aspects upfront or could be simplified if
that would be possible.
- The APIC admission and CPU number association logic is
incomprehensible and overly complex and needs to be kept around
after boot instead of completing this right after the APIC
enumeration.
This update addresses these shortcomings with the following changes:
- Rework the CPUID evaluation code so it is common for all vendors
and provides information about the APIC ID segments in a uniform
way independent of the number of segments (Thread, Core, Module,
..., Die, Package) so that this information can be computed instead
of rewriting global variables of dubious value over and over.
- A few cleanups and simplifcations of the APIC, IO/APIC and related
interfaces to prepare for the topology evaluation changes.
- Seperation of the parser stages so the early evaluation which tries
to find the APIC address can be seperately overridden from the late
evaluation which enumerates and registers the local APIC as further
preparation for sanitizing the topology evaluation.
- A new registration and admission logic which
- encapsulates the inner workings so that parsers and guest logic
cannot longer fiddle in it
- uses the APIC ID segments to build topology bitmaps at
registration time
- provides a sane admission logic
- allows to detect the crash kernel case, where CPU0 does not run
on the real BSP, automatically. This is required to prevent
sending INIT/SIPI sequences to the real BSP which would reset
the whole machine. This was so far handled by a tedious command
line parameter, which does not even work in nested crash
scenarios.
- Associates CPU number after the enumeration completed and
prevents the late registration of APICs, which was somehow
tolerated before.
- Converting all parsers and guest enumeration mechanisms over to the
new interfaces.
This allows to get rid of all global variable tweaking from the
parsers and enumeration mechanisms and sanitizes the XEN[PV]
handling so it can use CPUID evaluation for the first time.
- Mopping up existing sins by taking the information from the APIC ID
segment bitmaps.
This evaluates hybrid systems correctly on the boot CPU and allows
for cleanups and fixes in the related drivers, e.g. PERF.
The series has been extensively tested and the minimal late fallout
due to a broken ACPI/MADT table has been addressed by tightening the
admission logic further"
* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
x86/topology: Ignore non-present APIC IDs in a present package
x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
smp: Provide 'setup_max_cpus' definition on UP too
smp: Avoid 'setup_max_cpus' namespace collision/shadowing
x86/bugs: Use fixed addressing for VERW operand
x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
x86/cpu/topology: Provide __num_[cores|threads]_per_package
x86/cpu/topology: Rename topology_max_die_per_package()
x86/cpu/topology: Rename smp_num_siblings
x86/cpu/topology: Retrieve cores per package from topology bitmaps
x86/cpu/topology: Use topology logical mapping mechanism
x86/cpu/topology: Provide logical pkg/die mapping
x86/cpu/topology: Simplify cpu_mark_primary_thread()
x86/cpu/topology: Mop up primary thread mask handling
x86/cpu/topology: Use topology bitmaps for sizing
x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
x86/xen/smp_pv: Count number of vCPUs early
x86/cpu/topology: Assign hotpluggable CPUIDs during init
x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
x86/topology: Add a mechanism to track topology via APIC IDs
...
Linus Torvalds [Mon, 11 Mar 2024 21:38:26 +0000 (14:38 -0700)]
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer
wheel of a CPU which is likely to be busy at the time of expiry.
This is done to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is
close to zero.
2) Due to #1 it is possible that timers are accumulated on
a single target CPU
3) The required computation in the enqueue path is just overhead
for dubious value especially under the consideration that the
vast majority of timer wheel timers are either canceled or
rearmed before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on
which they get armed.
This is achieved by having separate wheels for CPU pinned timers
and global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they
expire.
- If the first expiring timer is a global timer, then the expiry
time is propagated into the timer pull hierarchy and the CPU
makes sure to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to
the point where no further aggregation of groups is required, i.e.
the number of levels is log8(NR_CPUS). The magic number of eight
has been established by experimention, but can be adjusted if
needed.
In each group one busy CPU acts as the migrator. It's only one CPU
to avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether
there are other CPUs in the group which have gone idle and have
global timers to expire. If there are global timers to expire, the
migrator locks the remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can
require to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point
the CPU is the systemwide migrator at the top of the hierarchy and
it therefore cannot delegate to the hierarchy. It needs to arm its
own timer device to expire either at the first expiring timer in
the hierarchy or at the first CPU local timer, which ever expires
first.
This completely removes the overhead from the enqueue path, which
is e.g. for networking a true hotpath and trades it for a slightly
more complex idle path.
This has been in development for a couple of years and the final
series has been extensively tested by various teams from silicon
vendors and ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them
to power down a die completely on a mult-die socket for the first
time in a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific
overloaded netperf test which is currently investigated, but the
rest is either positive or neutral performance wise and positive on
the power management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware
timers and interpolated them back to clock MONOTONIC. The changes
address a few corner cases in the interpolation code which got the
math and logic wrong.
- Simplifcation of the clocksource watchdog retry logic to
automatically adjust to handle larger systems correctly instead of
having more incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place"
* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...
Linus Torvalds [Mon, 11 Mar 2024 21:25:18 +0000 (14:25 -0700)]
Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:
"Updates for timekeeping and PTP core.
The cross-timestamp mechanism which allows to correlate hardware
clocks uses clocksource pointers for describing the correlation.
That's suboptimal as drivers need to obtain the pointer, which
requires needless exports and exposing internals. This can all be
completely avoided by assigning clocksource IDs and using them for
describing the correlated clock source.
So this adds clocksource IDs to all clocksources in the tree which can
be exposed to this mechanism and removes the pointer and now needless
exports.
A related improvement for the core and the correlation handling has
not made it this time, but is expected to get ready for the next
round"
* tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kvmclock: Unexport kvmclock clocksource
treewide: Remove system_counterval_t.cs, which is never read
timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
timekeeping: Add clocksource ID to struct system_counterval_t
x86/tsc: Correct kernel-doc notation
Linus Torvalds [Mon, 11 Mar 2024 21:14:09 +0000 (14:14 -0700)]
Merge tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu core updates from Thomas Gleixner:
"A small boring set of cleanups for the SMP and CPU hotplug code"
* tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Remove stray semicolon
smp: Make __smp_processor_id() 0-argument macro
cpu: Mark cpu_possible_mask as __ro_after_init
kernel/cpu: Convert snprintf() to sysfs_emit()
cpu/hotplug: Delete an extraneous kernel-doc description
Linus Torvalds [Mon, 11 Mar 2024 21:03:03 +0000 (14:03 -0700)]
Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the MSI interrupt subsystem and initial RISC-V MSI
support.
The core changes have been adopted from previous work which converted
ARM[64] to the new per device MSI domain model, which was merged to
support multiple MSI domain per device. The ARM[64] changes are being
worked on too, but have not been ready yet. The core and platform-MSI
changes have been split out to not hold up RISC-V and to avoid that
RISC-V builds on the scheduled for removal interfaces.
The core support provides new interfaces to handle wire to MSI bridges
in a straight forward way and introduces new platform-MSI interfaces
which are built on top of the per device MSI domain model.
Once ARM[64] is converted over the old platform-MSI interfaces and the
related ugliness in the MSI core code will be removed.
The actual MSI parts for RISC-V were finalized late and have been
post-poned for the next merge window.
Drivers:
- Add a new driver for the Andes hart-level interrupt controller
- Rework the SiFive PLIC driver to prepare for MSI suport
- Expand the RISC-V INTC driver to support the new RISC-V AIA
controller which provides the basis for MSI on RISC-V
- A few fixup for the fallout of the core changes"
* tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
genirq/matrix: Dynamic bitmap allocation
irqchip/riscv-intc: Add support for RISC-V AIA
irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
irqchip/sifive-plic: Use devm_xyz() for managed allocation
irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
irqchip/sifive-plic: Convert PLIC driver into a platform driver
irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
irqchip/riscv-intc: Allow large non-standard interrupt number
genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
irqchip/imx-intmux: Handle pure domain searches correctly
genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
genirq/irqdomain: Reroute device MSI create_mapping
genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
genirq/msi: Optionally use dev->fwnode for device domain
genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
...
Linus Torvalds [Mon, 11 Mar 2024 20:50:30 +0000 (13:50 -0700)]
Merge tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Make affinity changes take effect immediately for interrupt
threads. This reduces the impact on isolated CPUs as it pulls over
the thread right away instead of doing it after the next hardware
interrupt arrived.
- Cleanup and improvements for the interrupt chip simulator
- Deduplication of the interrupt descriptor initialization code so
the sparse and non-sparse mode share more code.
Drivers:
- A set of conversions to platform_drivers::remove_new() which gets
rid of the pointless return value.
- A new driver for the Starfive JH8100 SoC
- Support for Amlogic-T7 SoCs
- Improvement for the interrupt handling and EOI management for the
loongson interrupt controller.
- The usual fixes and improvements all over the place"
* tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
irqchip/ts4800: Convert to platform_driver::remove_new() callback
irqchip/stm32-exti: Convert to platform_driver::remove_new() callback
irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback
irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback
irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback
irqchip/pruss-intc: Convert to platform_driver::remove_new() callback
irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback
irqchip/madera: Convert to platform_driver::remove_new() callback
irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback
irqchip/keystone: Convert to platform_driver::remove_new() callback
irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback
irqchip/imx-intmux: Convert to platform_driver::remove_new() callback
irqchip/imgpdc: Convert to platform_driver::remove_new() callback
irqchip: Add StarFive external interrupt controller
dt-bindings: interrupt-controller: Add starfive,jh8100-intc
arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs
irqchip/meson-gpio: Add support for Amlogic-T7 SoCs
dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs
irqchip/vic: Fix a kernel-doc warning
genirq: Wake interrupt threads immediately when changing affinity
...
Pawan Gupta [Mon, 11 Mar 2024 19:29:43 +0000 (12:29 -0700)]
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated
by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it
to guests so that they can deploy the mitigation.
RFDS_NO indicates that the system is not vulnerable to RFDS, export it
to guests so that they don't deploy the mitigation unnecessarily. When
the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize
RFDS_NO to the guest.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Pawan Gupta [Mon, 11 Mar 2024 19:29:43 +0000 (12:29 -0700)]
x86/rfds: Mitigate Register File Data Sampling (RFDS)
RFDS is a CPU vulnerability that may allow userspace to infer kernel
stale data previously used in floating point registers, vector registers
and integer registers. RFDS only affects certain Intel Atom processors.
Intel released a microcode update that uses VERW instruction to clear
the affected CPU buffers. Unlike MDS, none of the affected cores support
SMT.
Add RFDS bug infrastructure and enable the VERW based mitigation by
default, that clears the affected buffers just before exiting to
userspace. Also add sysfs reporting and cmdline parameter
"reg_file_data_sampling" to control the mitigation.
For details see:
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Pawan Gupta [Mon, 11 Mar 2024 19:29:43 +0000 (12:29 -0700)]
Documentation/hw-vuln: Add documentation for RFDS
Add the documentation for transient execution vulnerability Register
File Data Sampling (RFDS) that affects Intel Atom CPUs.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Pawan Gupta [Mon, 11 Mar 2024 19:29:43 +0000 (12:29 -0700)]
x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
Currently MMIO Stale Data mitigation for CPUs not affected by MDS/TAA is
to only deploy VERW at VMentry by enabling mmio_stale_data_clear static
branch. No mitigation is needed for kernel->user transitions. If such
CPUs are also affected by RFDS, its mitigation may set
X86_FEATURE_CLEAR_CPU_BUF to deploy VERW at kernel->user and VMentry.
This could result in duplicate VERW at VMentry.
Fix this by disabling mmio_stale_data_clear static branch when
X86_FEATURE_CLEAR_CPU_BUF is enabled.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Linus Torvalds [Mon, 11 Mar 2024 20:13:22 +0000 (13:13 -0700)]
Merge tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A quiet cycle. One trivial doc update patch. Two patches to drop the
now defunct memory_spread_slab feature from cgroup1 cpuset"
* tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Mark memory_spread_slab as obsolete
cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
docs: cgroup-v1: add missing code-block tags
Linus Torvalds [Mon, 11 Mar 2024 20:05:19 +0000 (13:05 -0700)]
Merge tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue BH conversions from Tejun Heo:
"This contains two patches that convert tasklet users to BH workqueues:
backtracetest and usb hcd.
DM conversions are being routed through the respective subsystem tree.
Hopefully, the next cycle will see a lot more conversions"
* tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
usb: core: hcd: Convert from tasklet to BH workqueue
backtracetest: Convert from tasklet to BH workqueue
Linus Torvalds [Mon, 11 Mar 2024 19:50:42 +0000 (12:50 -0700)]
Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"This cycle, a lot of workqueue changes including some that are
significant and invasive.
- During v6.6 cycle, unbound workqueues were updated so that they are
more topology aware and flexible, which among other things improved
workqueue behavior on modern multi-L3 CPUs. In the process, commit
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") switched unbound workqueues to use per-CPU
frontend pool_workqueues as a part of increasing front-back mapping
flexibility.
An unwelcome side effect of this change was that this made max
concurrency enforcement per-CPU blowing up the maximum number of
allowed concurrent executions. I incorrectly assumed that this
wouldn't cause practical problems as most unbound workqueue users
are self-regulate max concurrency; however, there definitely are
which don't (e.g. on IO paths) and the drastic increase in the
allowed max concurrency led to noticeable perf regressions in some
use cases.
This is now addressed by separating out max concurrency enforcement
to a separate struct - wq_node_nr_active - which makes @max_active
consistently mean system-wide max concurrency regardless of the
number of CPUs or (finally) NUMA nodes. This is a rather invasive
and, in places, a bit clunky; however, the clunkiness rises from
the the inherent requirement to handle the disagreement between the
execution locality domain and max concurrency enforcement domain on
some modern machines.
See commit
5797b1c18919 ("workqueue: Implement system-wide
nr_active enforcement for unbound workqueues") for more details.
- BH workqueue support is added.
They are similar to per-CPU workqueues but execute work items in
the softirq context. This is expected to replace tasklet. However,
currently, it's missing the ability to disable and enable work
items which is needed to convert many tasklet users. To avoid
crowding this merge window too much, this will be included in the
next merge window. A separate pull request will be sent for the
couple conversion patches that are currently pending.
- Waiman plugged a long-standing hole in workqueue CPU isolation
where ordered workqueues didn't follow wq_unbound_cpumask updates.
Ordered workqueues now follow the same rules as other unbound
workqueues.
- More CPU isolation improvements: Juri fixed another deficit in
workqueue isolation where unbound rescuers don't respect
wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on
isolated CPUs.
- Other misc changes"
* tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits)
workqueue: Drain BH work items on hot-unplugged CPUs
workqueue: Introduce from_work() helper for cleaner callback declarations
workqueue: Control intensive warning threshold through cmdline
workqueue: Make @flags handling consistent across set_work_data() and friends
workqueue: Remove clear_work_data()
workqueue: Factor out work_grab_pending() from __cancel_work_sync()
workqueue: Clean up enum work_bits and related constants
workqueue: Introduce work_cancel_flags
workqueue: Use variable name irq_flags for saving local irq flags
workqueue: Reorganize flush and cancel[_sync] functions
workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
workqueue: Cosmetic changes
workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
workqueue: Fix queue_work_on() with BH workqueues
async: Use a dedicated unbound workqueue with raised min_active
workqueue: Implement workqueue_set_min_active()
workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
kernel/workqueue: Let rescuers follow unbound wq cpumask changes
...
Linus Torvalds [Mon, 11 Mar 2024 19:31:28 +0000 (12:31 -0700)]
Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux
Pull Rust updates from Miguel Ojeda:
"Another routine one in terms of features. We got two version upgrades
this time, but in terms of lines, 'alloc' changes are not very large.
Toolchain and infrastructure:
- Upgrade to Rust 1.76.0
This time around, due to how the kernel and Rust schedules have
aligned, there are two upgrades in fact. These allow us to remove
two more unstable features ('const_maybe_uninit_zeroed' and
'ptr_metadata') from the list, among other improvements
- Mark 'rustc' (and others) invocations as recursive, which fixes a
new warning and prepares us for the future in case we eventually
take advantage of the Make jobserver
'kernel' crate:
- Add the 'container_of!' macro
- Stop using the unstable 'ptr_metadata' feature by employing the now
stable 'byte_sub' method to implement 'Arc::from_raw()'
- Add the 'time' module with a 'msecs_to_jiffies()' conversion
function to begin with, to be used by Rust Binder
- Add 'notify_sync()' and 'wait_interruptible_timeout()' methods to
'CondVar', to be used by Rust Binder
- Update integer types for 'CondVar'
- Rename 'wait_list' field to 'wait_queue_head' in 'CondVar'
- Implement 'Display' and 'Debug' for 'BStr'
- Add the 'try_from_foreign()' method to the 'ForeignOwnable' trait
- Add reexports for macros so that they can be used from the right
module (in addition to the root)
- A series of code documentation improvements, including adding
intra-doc links, consistency improvements, typo fixes...
'macros' crate:
- Place generated 'init_module()' function in '.init.text'
Documentation:
- Add documentation on Rust doctests and how they work"
* tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux: (29 commits)
rust: upgrade to Rust 1.76.0
kbuild: mark `rustc` (and others) invocations as recursive
rust: add `container_of!` macro
rust: str: implement `Display` and `Debug` for `BStr`
rust: module: place generated init_module() function in .init.text
rust: types: add `try_from_foreign()` method
docs: rust: Add description of Rust documentation test as KUnit ones
docs: rust: Move testing to a separate page
rust: kernel: stop using ptr_metadata feature
rust: kernel: add reexports for macros
rust: locked_by: shorten doclink preview
rust: kernel: remove unneeded doclink targets
rust: kernel: add doclinks
rust: kernel: add blank lines in front of code blocks
rust: kernel: mark code fragments in docs with backticks
rust: kernel: unify spelling of refcount in docs
rust: str: move SAFETY comment in front of unsafe block
rust: str: use `NUL` instead of 0 in doc comments
rust: kernel: add srctree-relative doclinks
rust: ioctl: end top-level module docs with full stop
...
Linus Torvalds [Mon, 11 Mar 2024 19:18:20 +0000 (12:18 -0700)]
Merge tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux
Pull compiler attributes update from Miguel Ojeda:
"Trivial fixes to the __counted_by comments"
* tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux:
Compiler Attributes: counted_by: fixup clang URL
Compiler Attributes: counted_by: bump min gcc version
Linus Torvalds [Mon, 11 Mar 2024 19:02:50 +0000 (12:02 -0700)]
Merge tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux
Pull RCU updates from Boqun Feng:
- Eliminate deadlocks involving do_exit() and RCU tasks, by Paul:
Instead of SRCU read side critical sections, now a percpu list is
used in do_exit() for scaning yet-to-exit tasks
- Fix a deadlock due to the dependency between workqueue and RCU
expedited grace period, reported by Anna-Maria Behnsen and Thomas
Gleixner and fixed by Frederic: Now RCU expedited always uses its own
kthread worker instead of a workqueue
- RCU NOCB updates, code cleanups, unnecessary barrier removals and
minor bug fixes
- Maintain real-time response in rcu_tasks_postscan() and a minor fix
for tasks trace quiescence check
- Misc updates, comments and readibility improvement, boot time
parameter for lazy RCU and rcutorture improvement
- Documentation updates
* tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux: (34 commits)
rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
rcu-tasks: Initialize callback lists at rcu_init() time
rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
rcu-tasks: Repair RCU Tasks Trace quiescence check
rcu/sync: remove un-used rcu_sync_enter_start function
rcutorture: Suppress rtort_pipe_count warnings until after stalls
srcu: Improve comments about acceleration leak
rcu: Provide a boot time parameter to control lazy RCU
rcu: Rename jiffies_till_flush to jiffies_lazy_flush
doc: Update checklist.rst discussion of callback execution
doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCU
context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()
doc: Add EARLY flag to early-parsed kernel boot parameters
doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rst
doc: Make checklist.rst note that spinlocks are implied RCU readers
doc: Make whatisRCU.rst note that spinlocks are RCU readers
doc: Spinlocks are implied RCU readers
...
Linus Torvalds [Mon, 11 Mar 2024 18:43:44 +0000 (11:43 -0700)]
Merge tag 'for-6.9/block-
20240310' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD pull requests via Song:
- Cleanup redundant checks (Yu Kuai)
- Remove deprecated headers (Marc Zyngier, Song Liu)
- Concurrency fixes (Li Lingfeng)
- Memory leak fix (Li Nan)
- Refactor raid1 read_balance (Yu Kuai, Paul Luse)
- Clean up and fix for md_ioctl (Li Nan)
- Other small fixes (Gui-Dong Han, Heming Zhao)
- MD atomic limits (Christoph)
- NVMe pull request via Keith:
- RDMA target enhancements (Max)
- Fabrics fixes (Max, Guixin, Hannes)
- Atomic queue_limits usage (Christoph)
- Const use for class_register (Ricardo)
- Identification error handling fixes (Shin'ichiro, Keith)
- Improvement and cleanup for cached request handling (Christoph)
- Moving towards atomic queue limits. Core changes and driver bits so
far (Christoph)
- Fix UAF issues in aoeblk (Chun-Yi)
- Zoned fix and cleanups (Damien)
- s390 dasd cleanups and fixes (Jan, Miroslav)
- Block issue timestamp caching (me)
- noio scope guarding for zoned IO (Johannes)
- block/nvme PI improvements (Kanchan)
- Ability to terminate long running discard loop (Keith)
- bdev revalidation fix (Li)
- Get rid of old nr_queues hack for kdump kernels (Ming)
- Support for async deletion of ublk (Ming)
- Improve IRQ bio recycling (Pavel)
- Factor in CPU capacity for remote vs local completion (Qais)
- Add shared_tags configfs entry for null_blk (Shin'ichiro
- Fix for a regression in page refcounts introduced by the folio
unification (Tony)
- Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
Ricardo, Roman, Tang, Uwe)
* tag 'for-6.9/block-
20240310' of git://git.kernel.dk/linux: (221 commits)
block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
block/swim: Convert to platform remove callback returning void
cdrom: gdrom: Convert to platform remove callback returning void
block: remove disk_stack_limits
md: remove mddev->queue
md: don't initialize queue limits
md/raid10: use the atomic queue limit update APIs
md/raid5: use the atomic queue limit update APIs
md/raid1: use the atomic queue limit update APIs
md/raid0: use the atomic queue limit update APIs
md: add queue limit helpers
md: add a mddev_is_dm helper
md: add a mddev_add_trace_msg helper
md: add a mddev_trace_remap helper
bcache: move calculation of stripe_size and io_opt into bcache_device_init
virtio_blk: Do not use disk_set_max_open/active_zones()
aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
block: move capacity validation to blkpg_do_ioctl()
block: prevent division by zero in blk_rq_stat_sum()
drbd: atomically update queue limits in drbd_reconsider_queue_parameters
...
Linus Torvalds [Mon, 11 Mar 2024 18:35:31 +0000 (11:35 -0700)]
Merge tag 'for-6.9/io_uring-
20240310' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Make running of task_work internal loops more fair, and unify how the
different methods deal with them (me)
- Support for per-ring NAPI. The two minor networking patches are in a
shared branch with netdev (Stefan)
- Add support for truncate (Tony)
- Export SQPOLL utilization stats (Xiaobing)
- Multishot fixes (Pavel)
- Fix for a race in manipulating the request flags via poll (Pavel)
- Cleanup the multishot checking by making it generic, moving it out of
opcode handlers (Pavel)
- Various tweaks and cleanups (me, Kunwu, Alexander)
* tag 'for-6.9/io_uring-
20240310' of git://git.kernel.dk/linux: (53 commits)
io_uring: Fix sqpoll utilization check racing with dying sqpoll
io_uring/net: dedup io_recv_finish req completion
io_uring: refactor DEFER_TASKRUN multishot checks
io_uring: fix mshot io-wq checks
io_uring/net: add io_req_msg_cleanup() helper
io_uring/net: simplify msghd->msg_inq checking
io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
io_uring/net: correctly handle multishot recvmsg retry setup
io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
io_uring: fix io_queue_proc modifying req->flags
io_uring: fix mshot read defer taskrun cqe posting
io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
io_uring/net: correct the type of variable
io_uring/sqpoll: statistics of the true utilization of sq threads
io_uring/net: move recv/recvmsg flags out of retry loop
io_uring/kbuf: flag request if buffer pool is empty after buffer pick
io_uring/net: improve the usercopy for sendmsg/recvmsg
io_uring/net: move receive multishot out of the generic msghdr path
io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
...
Linus Torvalds [Mon, 11 Mar 2024 18:02:06 +0000 (11:02 -0700)]
Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs uuid updates from Christian Brauner:
"This adds two new ioctl()s for getting the filesystem uuid and
retrieving the sysfs path based on the path of a mounted filesystem.
Getting the filesystem uuid has been implemented in filesystem
specific code for a while it's now lifted as a generic ioctl"
* tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: add support for FS_IOC_GETFSSYSFSPATH
fs: add FS_IOC_GETFSSYSFSPATH
fat: Hook up sb->s_uuid
fs: FS_IOC_GETUUID
ovl: convert to super_set_uuid()
fs: super_set_uuid()
Linus Torvalds [Mon, 11 Mar 2024 17:52:34 +0000 (10:52 -0700)]
Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull block handle updates from Christian Brauner:
"Last cycle we changed opening of block devices, and opening a block
device would return a bdev_handle. This allowed us to implement
support for restricting and forbidding writes to mounted block
devices. It was accompanied by converting and adding helpers to
operate on bdev_handles instead of plain block devices.
That was already a good step forward but ultimately it isn't necessary
to have special purpose helpers for opening block devices internally
that return a bdev_handle.
Fundamentally, opening a block device internally should just be
equivalent to opening files. So now all internal opens of block
devices return files just as a userspace open would. Instead of
introducing a separate indirection into bdev_open_by_*() via struct
bdev_handle bdev_file_open_by_*() is made to just return a struct
file. Opening and closing a block device just becomes equivalent to
opening and closing a file.
This all works well because internally we already have a pseudo fs for
block devices and so opening block devices is simple. There's a few
places where we needed to be careful such as during boot when the
kernel is supposed to mount the rootfs directly without init doing it.
Here we need to take care to ensure that we flush out any asynchronous
file close. That's what we already do for opening, unpacking, and
closing the initramfs. So nothing new here.
The equivalence of opening and closing block devices to regular files
is a win in and of itself. But it also has various other advantages.
We can remove struct bdev_handle completely. Various low-level helpers
are now private to the block layer. Other helpers were simply
removable completely.
A follow-up series that is already reviewed build on this and makes it
possible to remove bdev->bd_inode and allows various clean ups of the
buffer head code as well. All places where we stashed a bdev_handle
now just stash a file and use simple accessors to get to the actual
block device which was already the case for bdev_handle"
* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
block: remove bdev_handle completely
block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
bdev: remove bdev pointer from struct bdev_handle
bdev: make struct bdev_handle private to the block layer
bdev: make bdev_{release, open_by_dev}() private to block layer
bdev: remove bdev_open_by_path()
reiserfs: port block device access to file
ocfs2: port block device access to file
nfs: port block device access to files
jfs: port block device access to file
f2fs: port block device access to files
ext4: port block device access to file
erofs: port device access to file
btrfs: port device access to file
bcachefs: port block device access to file
target: port block device access to file
s390: port block device access to file
nvme: port block device access to file
block2mtd: port device access to files
bcache: port block device access to files
...
Linus Torvalds [Mon, 11 Mar 2024 17:37:45 +0000 (10:37 -0700)]
Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull file locking updates from Christian Brauner:
"A few years ago struct file_lock_context was added to allow for
separate lists to track different types of file locks instead of using
a singly-linked list for all of them.
Now leases no longer need to be tracked using struct file_lock.
However, a lot of the infrastructure is identical for leases and locks
so separating them isn't trivial.
This splits a group of fields used by both file locks and leases into
a new struct file_lock_core. The new core struct is embedded in struct
file_lock. Coccinelle was used to convert a lot of the callers to deal
with the move, with the remaining 25% or so converted by hand.
Afterwards several internal functions in fs/locks.c are made to work
with struct file_lock_core. Ultimately this allows to split struct
file_lock into struct file_lock and struct file_lease. The file lease
APIs are then converted to take struct file_lease"
* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
filelock: fix deadlock detection in POSIX locking
filelock: always define for_each_file_lock()
smb: remove redundant check
filelock: don't do security checks on nfsd setlease calls
filelock: split leases out of struct file_lock
filelock: remove temporary compatibility macros
smb/server: adapt to breakup of struct file_lock
smb/client: adapt to breakup of struct file_lock
ocfs2: adapt to breakup of struct file_lock
nfsd: adapt to breakup of struct file_lock
nfs: adapt to breakup of struct file_lock
lockd: adapt to breakup of struct file_lock
fuse: adapt to breakup of struct file_lock
gfs2: adapt to breakup of struct file_lock
dlm: adapt to breakup of struct file_lock
ceph: adapt to breakup of struct file_lock
afs: adapt to breakup of struct file_lock
9p: adapt to breakup of struct file_lock
filelock: convert seqfile handling to use file_lock_core
filelock: convert locks_translate_pid to take file_lock_core
...
Linus Torvalds [Mon, 11 Mar 2024 17:21:06 +0000 (10:21 -0700)]
Merge tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pdfd updates from Christian Brauner:
- Until now pidfds could only be created for thread-group leaders but
not for threads. There was no technical reason for this. We simply
had no users that needed support for this. Now we do have users that
need support for this.
This introduces a new PIDFD_THREAD flag for pidfd_open(). If that
flag is set pidfd_open() creates a pidfd that refers to a specific
thread.
In addition, we now allow clone() and clone3() to be called with
CLONE_PIDFD | CLONE_THREAD which wasn't possible before.
A pidfd that refers to an individual thread differs from a pidfd that
refers to a thread-group leader:
(1) Pidfds are pollable. A task may poll a pidfd and get notified
when the task has exited.
For thread-group leader pidfds the polling task is woken if the
thread-group is empty. In other words, if the thread-group
leader task exits when there are still threads alive in its
thread-group the polling task will not be woken when the
thread-group leader exits but rather when the last thread in the
thread-group exits.
For thread-specific pidfds the polling task is woken if the
thread exits.
(2) Passing a thread-group leader pidfd to pidfd_send_signal() will
generate thread-group directed signals like kill(2) does.
Passing a thread-specific pidfd to pidfd_send_signal() will
generate thread-specific signals like tgkill(2) does.
The default scope of the signal is thus determined by the type
of the pidfd.
Since use-cases exist where the default scope of the provided
pidfd needs to be overriden the following flags are added to
pidfd_send_signal():
- PIDFD_SIGNAL_THREAD
Send a thread-specific signal.
- PIDFD_SIGNAL_THREAD_GROUP
Send a thread-group directed signal.
- PIDFD_SIGNAL_PROCESS_GROUP
Send a process-group directed signal.
The scope change will only work if the struct pid is actually
used for this scope.
For example, in order to send a thread-group directed signal the
provided pidfd must be used as a thread-group leader and
similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be
used as a process group leader.
- Move pidfds from the anonymous inode infrastructure to a tiny pseudo
filesystem. This will unblock further work that we weren't able to do
simply because of the very justified limitations of anonymous inodes.
Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds
to become useful for the first time. They can now be compared by
inode number which are unique for the system lifetime.
Instead of stashing struct pid in file->private_data we can now stash
it in inode->i_private. This makes it possible to introduce concepts
that operate on a process once all file descriptors have been closed.
A concrete example is kill-on-last-close. Another side-effect is that
file->private_data is now freed up for per-file options for pidfds.
Now, each struct pid will refer to a different inode but the same
struct pid will refer to the same inode if it's opened multiple
times. In contrast to now where each struct pid refers to the same
inode.
The tiny pseudo filesystem is not visible anywhere in userspace
exactly like e.g., pipefs and sockfs. There's no lookup, there's no
complex inode operations, nothing. Dentries and inodes are always
deleted when the last pidfd is closed.
We allocate a new inode and dentry for each struct pid and we reuse
that inode and dentry for all pidfds that refer to the same struct
pid. The code is entirely optional and fairly small. If it's not
selected we fallback to anonymous inodes. Heavily inspired by nsfs.
The dentry and inode allocation mechanism is moved into generic
infrastructure that is now shared between nsfs and pidfs. The
path_from_stashed() helper must be provided with a stashing location,
an inode number, a mount, and the private data that is supposed to be
used and it will provide a path that can be passed to dentry_open().
The helper will try retrieve an existing dentry from the provided
stashing location. If a valid dentry is found it is reused. If not a
new one is allocated and we try to stash it in the provided location.
If this fails we retry until we either find an existing dentry or the
newly allocated dentry could be stashed. Subsequent openers of the
same namespace or task are then able to reuse it.
- Currently it is only possible to get notified when a task has exited,
i.e., become a zombie and userspace gets notified with EPOLLIN. We
now also support waiting until the task has been reaped, notifying
userspace with EPOLLHUP.
- Ensure that ESRCH is reported for getfd if a task is exiting instead
of the confusing EBADF.
- Various smaller cleanups to pidfd functions.
* tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
libfs: improve path_from_stashed()
libfs: add stashed_dentry_prune()
libfs: improve path_from_stashed() helper
pidfs: convert to path_from_stashed() helper
nsfs: convert to path_from_stashed() helper
libfs: add path_from_stashed()
pidfd: add pidfs
pidfd: move struct pidfd_fops
pidfd: allow to override signal scope in pidfd_send_signal()
pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
signal: fill in si_code in prepare_kill_siginfo()
selftests: add ESRCH tests for pidfd_getfd()
pidfd: getfd should always report ESRCH if a task is exiting
pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
pidfd: exit: kill the no longer used thread_group_exited()
pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
pidfd: kill the no longer needed do_notify_pidfd() in de_thread()
pidfd_poll: report POLLHUP when pid_task() == NULL
pidfd: implement PIDFD_THREAD flag for pidfd_open()
...
Linus Torvalds [Mon, 11 Mar 2024 17:07:03 +0000 (10:07 -0700)]
Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- Restore read-write hints in struct bio through the bi_write_hint
member for the sake of UFS devices in mobile applications. This can
result in up to 40% lower write amplification in UFS devices. The
patch series that builds on this will be coming in via the SCSI
maintainers (Bart)
- Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
to map multiple blocks at once as long as they're in the same folio.
This reduces CPU usage for buffered write workloads on e.g., xfs on
systems with lots of cores (Christoph)
- Record processed bytes in iomap_iter() trace event (Kassey)
- Extend iomap_writepage_map() trace event after Christoph's
->map_block() changes to map mutliple blocks at once (Zhang)
* tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
iomap: Add processed for iomap_iter
iomap: add pos and dirty_len into trace_iomap_writepage_map
block, fs: Restore the per-bio/request data lifetime fields
fs: Propagate write hints to the struct block_device inode
fs: Move enum rw_hint into a new header file
fs: Split fcntl_rw_hint()
fs: Verify write lifetime constants at compile time
fs: Fix rw_hint validation
iomap: pass the length of the dirty region to ->map_blocks
iomap: map multiple blocks at a time
iomap: submit ioends immediately
iomap: factor out a iomap_writepage_map_block helper
iomap: only call mapping_set_error once for each failed bio
iomap: don't chain bios
iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
iomap: clean up the iomap_alloc_ioend calling convention
iomap: move all remaining per-folio logic into iomap_writepage_map
iomap: factor out a iomap_writepage_handle_eof helper
iomap: move the PF_MEMALLOC check to iomap_writepages
iomap: move the io_folios field out of struct iomap_ioend
...
Linus Torvalds [Mon, 11 Mar 2024 16:55:17 +0000 (09:55 -0700)]
Merge tag 'vfs-6.9.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull ntfs update from Christian Brauner:
"This removes the old ntfs driver. The new ntfs3 driver is a full
replacement that was merged over two years ago. We've went through
various userspace and either they use ntfs3 or they use the fuse
version of ntfs and thus build neither ntfs nor ntfs3. I think that's
a clear sign that we should risk removing the legacy ntfs driver.
Quoting from Arch Linux and Debian:
- Debian does neither build the legacy ntfs nor the new ntfs3:
"Not currently built with Debian's kernel packages, 'ntfs' has been
symlinked to 'ntfs-3g' as it relates to fstab and mount commands.
Debian kernels are built without support of the ntfs3 driver
developed by Paragon Software." (cf. [2])
- Archlinux provides ntfs3 as their default since 5.15:
"All officially supported kernels with versions 5.15 or newer are
built with CONFIG_NTFS3_FS=m and thus support it. Before 5.15,
NTFS read and write support is provided by the NTFS-3G FUSE file
system." (cf. [1]).
It's unmaintained apart from various odd fixes as well. Worst case we
have to reintroduce it if someone really has a valid dependency on it.
But it's worth trying to see whether we can remove it"
Link: https://wiki.archlinux.org/title/NTFS
Link: https://wiki.debian.org/NTFS
* tag 'vfs-6.9.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: remove NTFS classic from docum. index
fs: Remove NTFS classic
Linus Torvalds [Mon, 11 Mar 2024 16:38:17 +0000 (09:38 -0700)]
Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Misc features, cleanups, and fixes for vfs and individual filesystems.
Features:
- Support idmapped mounts for hugetlbfs.
- Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
where the passed offset is ignored if the file is O_APPEND. The new
flag allows a caller to enforce that the offset is honored to
conform to posix even if the file was opened in append mode.
- Move i_mmap_rwsem in struct address_space to avoid false sharing
between i_mmap and i_mmap_rwsem.
- Convert efs, qnx4, and coda to use the new mount api.
- Add a generic is_dot_dotdot() helper that's used by various
filesystems and the VFS code instead of open-coding it multiple
times.
- Recently we've added stable offsets which allows stable ordering
when iterating directories exported through NFS on e.g., tmpfs
filesystems. Originally an xarray was used for the offset map but
that caused slab fragmentation issues over time. This switches the
offset map to the maple tree which has a dense mode that handles
this scenario a lot better. Includes tests.
- Finally merge the case-insensitive improvement series Gabriel has
been working on for a long time. This cleanly propagates case
insensitive operations through ->s_d_op which in turn allows us to
remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
It also improves performance by trying a case-sensitive comparison
first and then fallback to case-insensitive lookup if that fails.
This also fixes a bug where overlayfs would be able to be mounted
over a case insensitive directory which would lead to all sort of
odd behaviors.
Cleanups:
- Make file_dentry() a simple accessor now that ->d_real() is
simplified because of the backing file work we did the last two
cycles.
- Use the dedicated file_mnt_idmap helper in ntfs3.
- Use smp_load_acquire/store_release() in the i_size_read/write
helpers and thus remove the hack to handle i_size reads in the
filemap code.
- The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
fs/
- It's no longer necessary to perform a second built-in initramfs
unpack call because we retain the contents of the previous
extraction. Remove it.
- Now that we have removed various allocators kfree_rcu() always
works with kmem caches and kmalloc(). So simplify various places
that only use an rcu callback in order to handle the kmem cache
case.
- Convert the pipe code to use a lockdep comparison function instead
of open-coding the nesting making lockdep validation easier.
- Move code into fs-writeback.c that was located in a header but can
be made static as it's only used in that one file.
- Rewrite the alignment checking iterators for iovec and bvec to be
easier to read, and also significantly more compact in terms of
generated code. This saves 270 bytes of text on x86-64 (with
clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
saves a bit of time for the same workload.
- Switch various places to use KMEM_CACHE instead of
kmem_cache_create().
- Use inode_set_ctime_to_ts() in inode_set_ctime_current()
- Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.
- Various smaller cleanups for eventfds.
Fixes:
- Fix various comments and typos, and unneeded initializations.
- Fix stack allocation hack for clang in the select code.
- Improve dump_mapping() debug code on a best-effort basis.
- Fix build errors in various selftests.
- Avoid wrap-around instrumentation in various places.
- Don't allow user namespaces without an idmapping to be used for
idmapped mounts.
- Fix sysv sb_read() call.
- Fix fallback implementation of the get_name() export operation"
* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
hugetlbfs: support idmapped mounts
qnx4: convert qnx4 to use the new mount api
fs: use inode_set_ctime_to_ts to set inode ctime to current time
libfs: Drop generic_set_encrypted_ci_d_ops
ubifs: Configure dentry operations at dentry-creation time
f2fs: Configure dentry operations at dentry-creation time
ext4: Configure dentry operations at dentry-creation time
libfs: Add helper to choose dentry operations at mount-time
libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
fscrypt: Drop d_revalidate once the key is added
fscrypt: Drop d_revalidate for valid dentries during lookup
fscrypt: Factor out a helper to configure the lookup dentry
ovl: Always reject mounting over case-insensitive directories
libfs: Attempt exact-match comparison first during casefolded lookup
efs: remove SLAB_MEM_SPREAD flag usage
jfs: remove SLAB_MEM_SPREAD flag usage
minix: remove SLAB_MEM_SPREAD flag usage
openpromfs: remove SLAB_MEM_SPREAD flag usage
proc: remove SLAB_MEM_SPREAD flag usage
qnx6: remove SLAB_MEM_SPREAD flag usage
...
Linus Torvalds [Mon, 11 Mar 2024 16:32:28 +0000 (09:32 -0700)]
Merge tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
- fix to make kunit_bus_type const
- kunit tool change to Print UML command
- DRM device creation helpers are now using the new kunit device
creation helpers. This change resulted in DRM helpers switching from
using a platform_device, to a dedicated bus and device type used by
kunit. kunit devices don't set DMA mask and this caused regression on
some drm tests as they can't allocate DMA buffers. Fix this problem
by setting DMA masks on the kunit device during initialization.
- KUnit has several macros which accept a log message, which can
contain printf format specifiers. Some of these (the explicit log
macros) already use the __printf() gcc attribute to ensure the format
specifiers are valid, but those which could fail the test, and hence
used __kunit_do_failed_assertion() behind the scenes, did not.
These include: KUNIT_EXPECT_*_MSG(), KUNIT_ASSERT_*_MSG(), and
KUNIT_FAIL()
A nine-patch series adds the __printf() attribute, and fixes all of
the issues uncovered.
* tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Annotate _MSG assertion variants with gnu printf specifiers
drm: tests: Fix invalid printf format specifiers in KUnit tests
drm/xe/tests: Fix printf format specifiers in xe_migrate test
net: test: Fix printf format specifier in skb_segment kunit test
rtc: test: Fix invalid format specifier.
time: test: Fix incorrect format specifier
lib: memcpy_kunit: Fix an invalid format specifier in an assertion msg
lib/cmdline: Fix an invalid format specifier in an assertion msg
kunit: test: Log the correct filter string in executor_test
kunit: Setup DMA masks on the kunit device
kunit: make kunit_bus_type const
kunit: Mark filter* params as rw
kunit: tool: Print UML command
Linus Torvalds [Mon, 11 Mar 2024 16:25:33 +0000 (09:25 -0700)]
Merge tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest update from Shuah Khan:
- livepatch restructuring to move the module out of lib to be built as
a out-of-tree modules during kselftest build. This makes it easier
change, debug and rebuild the tests by running make on the
selftests/livepatch directory, which is not currently possible since
the modules on lib/livepatch are build and installed using the main
makefile modules target.
- livepatch restructuring fixes for problems found by kernel test
robot. The change skips the test if kernel-devel isn't installed
(default value of KDIR), or if KDIR variable passed doesn't exists.
- resctrl test restructuring and new non-contiguous CBMs CAT test
- new ktap_helpers to print diagnostic messages, pass/fail tests based
on exit code, abort test, and finish the test.
- a new test verify power supply properties.
- a new ftrace to exercise function tracer across cpu hotplug.
- timeout increase for mqueue test to allow the test to run on i3.metal
AWS instances.
- minor spelling corrections in several tests.
- missing gitignore files and changes to existing gitignore files.
* tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (57 commits)
kselftest: Add basic test for probing the rust sample modules
selftests: lib.mk: Do not process TEST_GEN_MODS_DIR
selftests: livepatch: Avoid running the tests if kernel-devel is missing
selftests: livepatch: Add initial .gitignore
selftests/resctrl: Add non-contiguous CBMs CAT test
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add a helper for the non-contiguous test
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
selftests: sched: Fix spelling mistake "hiearchy" -> "hierarchy"
selftests/mqueue: Set timeout to 180 seconds
selftests/ftrace: Add test to exercize function tracer across cpu hotplug
selftest: ftrace: fix minor typo in log
selftests: thermal: intel: workload_hint: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: uevent: add missing gitignore
selftests: Add test to verify power supply properties
selftests: ktap_helpers: Add a helper to finish the test
selftests: ktap_helpers: Add a helper to abort the test
selftests: ktap_helpers: Add helper to pass/fail test based on exit code
...
Borislav Petkov (AMD) [Mon, 11 Mar 2024 15:24:20 +0000 (16:24 +0100)]
Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9
* ras/edac-drivers:
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
* ras/edac-misc:
EDAC/versal: Convert to platform remove callback returning void
EDAC/versal: Make the bit position of injected errors configurable
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
* ras/edac-amd-atl:
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Arnd Bergmann [Mon, 11 Mar 2024 06:59:30 +0000 (07:59 +0100)]
Merge tag 'riscv-dt-fixes-for-v6.8-final' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into soc/dt
RISC-V Devicetree fixes for v6.8-final
Starfive:
The previous cleanup broke boot on the jh7100 as the driver depended on
the fallback clock name created based on the node-name when
clock-output-names is not present. Add clock-output-names to restore
working order.
Generic:
BUILTIN_DTB has been broken for ages on any platform other than the
nommu Canaan k210 SoC as the first dtb built (in alphanumerical order),
would get built into the image. This didn't get fixed for ages because
nobody actually cared about running it other than the k210 enough to
fix it. The folks doing Sophgo SG2042 development have come along and
fixed it, as they want to use builtin dtbs. linux-boot on that platform
reuses the dtb it was provided by OpenSBI when booting linux proper,
which is unfortunately not possible to boot a mainline kernel with.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-dt-fixes-for-v6.8-final' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
riscv: dts: Move BUILTIN_DTB_SOURCE to common Kconfig
riscv: dts: starfive: jh7100: fix root clock names
Link: https://lore.kernel.org/r/20240306-waltz-facial-9e4e1b792053@spud
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Linus Torvalds [Sun, 10 Mar 2024 20:38:09 +0000 (13:38 -0700)]
Linux 6.8
Linus Torvalds [Sun, 10 Mar 2024 18:53:21 +0000 (11:53 -0700)]
Merge tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Do not allow large strings (> 4096) as single write to trace_marker
The size of a string written into trace_marker was determined by the
size of the sub-buffer in the ring buffer. That size is dependent on
the PAGE_SIZE of the architecture as it can be mapped into user
space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of
the string of writing into trace_marker 64K.
One of the selftests looks at the size of the ring buffer sub-buffers
and writes that plus more into the trace_marker. The write will take
what it can and report back what it consumed so that the user space
application (like echo) will write the rest of the string. The string
is stored in the ring buffer and can be read via the "trace" or
"trace_pipe" files.
The reading of the ring buffer uses vsnprintf(), which uses a
precision "%.*s" to make sure it only reads what is stored in the
buffer, as a bug could cause the string to be non terminated.
With the combination of the precision change and the PAGE_SIZE of 64K
allowing huge strings to be added into the ring buffer, plus the test
that would actually stress that limit, a bug was reported that the
precision used was too big for "%.*s" as the string was close to 64K
in size and the max precision of vsnprintf is 32K.
Linus suggested not to have that precision as it could hide a bug if
the string was again stored without a nul byte.
Another issue that was brought up is that the trace_seq buffer is
also based on PAGE_SIZE even though it is not tied to the
architecture limit like the ring buffer sub-buffer is. Having it be
64K * 2 is simply just too big and wasting memory on systems with 64K
page sizes. It is now hardcoded to 8K which is what all other
architectures with 4K PAGE_SIZE has.
Finally, the write to trace_marker is now limited to 4K as there is
no reason to write larger strings into trace_marker.
- ring_buffer_wait() should not loop.
The ring_buffer_wait() does not have the full context (yet) on if it
should loop or not. Just exit the loop as soon as its woken up and
let the callers decide to loop or not (they already do, so it's a bit
redundant).
- Fix shortest_full field to be the smallest amount in the ring buffer
that a waiter is waiting for. The "shortest_full" field is updated
when a new waiter comes in and wants to wait for a smaller amount of
data in the ring buffer than other waiters. But after all waiters are
woken up, it's not reset, so if another waiter comes in wanting to
wait for more data, it will be woken up when the ring buffer has a
smaller amount from what the previous waiters were waiting for.
- The wake up all waiters on close is incorrectly called frome
.release() and not from .flush() so it will never wake up any waiters
as the .release() will not get called until all .read() calls are
finished. And the wakeup is for the waiters in those .read() calls.
* tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Use .flush() call to wake up readers
ring-buffer: Fix resetting of shortest_full
ring-buffer: Fix waking up ring buffer readers
tracing: Limit trace_marker writes to just 4K
tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE
tracing: Remove precision vsnprintf() check from print event
Linus Torvalds [Sun, 10 Mar 2024 18:39:48 +0000 (11:39 -0700)]
Merge tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy fixes from Vinod Koul:
- fixes for Qualcomm qmp-combo driver for ordering of drm and type-c
switch registartion due to drivers might not probe defer after having
registered child devices to avoid triggering a probe deferral loop.
This fixes internal display on Lenovo ThinkPad X13s
* tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom-qmp-combo: fix type-c switch registration
phy: qcom-qmp-combo: fix drm bridge registration
Steven Rostedt (Google) [Fri, 8 Mar 2024 20:24:05 +0000 (15:24 -0500)]
tracing: Use .flush() call to wake up readers
The .release() function does not get called until all readers of a file
descriptor are finished.
If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.
The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would wake
up the other tasks because they are blocked on another descriptor then the
one that was closed(). But there's other wake ups that solve that issue.
When a thread is blocked on a read, it can still hang even when another
thread closed its descriptor.
This is what the .flush() callback is for. Have the .flush() wake up the
readers.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Fri, 8 Mar 2024 20:24:04 +0000 (15:24 -0500)]
ring-buffer: Fix resetting of shortest_full
The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a tasks waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.
As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.
The problem is that the smallest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace).
This means that tasks may be woken up more often then when they want to
be. Instead, have the shortest_full field get reset just before waking up
all the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.
Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sun, 10 Mar 2024 16:27:39 +0000 (09:27 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not
writable from userspace, so there would be no way to write to a
read-only guest_memfd).
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely for development and testing.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term
plan is to support confidential VMs with deterministic private
memory (SNP and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD dirty logging test that caused false
passes.
x86 fixes:
- Fix missing marking of a guest page as dirty when emulating an
atomic access.
- Check for mmu_notifier invalidation events before faulting in the
pfn, and before acquiring mmu_lock, to avoid unnecessary work and
lock contention with preemptible kernels (including
CONFIG_PREEMPT_DYNAMIC in non-preemptible mode).
- Disable AMD DebugSwap by default, it breaks VMSA signing and will
be re-enabled with a better VM creation API in 6.10.
- Do the cache flush of converted pages in svm_register_enc_region()
before dropping kvm->lock, to avoid a race with unregistering of
the same region and the consequent use-after-free issue"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
SEV: disable SEV-ES DebugSwap by default
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive
KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases
KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
KVM: x86: Mark target gfn of emulated atomic instruction as dirty
Steven Rostedt (Google) [Fri, 8 Mar 2024 20:24:03 +0000 (15:24 -0500)]
ring-buffer: Fix waking up ring buffer readers
A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for and if the ring buffer is past that, it will wake up all the
waiters.
The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.
If a file that uses a ring buffer closes, and there's threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used.
Before entering the wait loop, the waiter will read the wait_index. On
wakeup, it will check if the wait_index is different than when it entered
the loop, and will exit the loop if it is. The waker will only need to
update the wait_index before waking up the waiters.
This had a couple of bugs. One trivial one and one broken by design.
The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule() which it was not.
The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
the ring_buffer_wait() doesn't have enough context to know if it should
break out of the loop.
The loop itself is not needed, because all the callers to the
ring_buffer_wait() also has their own loop, as the callers have a better
sense of what the context is to decide whether to break out of the loop
or not.
Just have the ring_buffer_wait() block once, and if it gets woken up, exit
the function and let the callers decide what to do next.
Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sat, 9 Mar 2024 18:32:03 +0000 (10:32 -0800)]
Merge tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two patches from Heiner for the i801 are targeting muxes discovered
while working on some other features. Essentially, there is a
reordering when adding optional slaves and proper cleanup upon
registering a mux device.
Christophe fixes the exit path in the wmt driver that was leaving the
clocks hanging, and the last fix from Tommy avoids false error reports
in IRQ"
* tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: aspeed: Fix the dummy irq expected print
i2c: wmt: Fix an error handling path in wmt_i2c_probe()
i2c: i801: Avoid potential double call to gpiod_remove_lookup_table
i2c: i801: Fix using mux_pdev before it's set
Linus Torvalds [Sat, 9 Mar 2024 18:25:14 +0000 (10:25 -0800)]
Merge tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire fix from Takashi Sakamoto:
"A fix to suppress a warning about unreleased IRQ for 1394 OHCI
hardware when disabling MSI.
In Linux kernel v6.5, a PCI driver for 1394 OHCI hardware was
optimized into the managed device resources. Edmund Raile points out
that the change brings the warning about unreleased IRQ at the call of
pci_disable_msi(), since the API expects that the relevant IRQ has
already been released in advance.
As long as the API is called in .remove callback of PCI device
operation, it is prohibited to maintain the IRQ as the part of managed
device resource. As a workaround, the IRQ is explicitly released at
.remove callback, before the call of pci_disable_msi().
pci_disable_msi() is legacy API nowadays in PCI MSI implementation. I
have a plan to replace it with the modern API in the development for
the future version of Linux kernel. So at present I keep them as is"
* tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: ohci: prevent leak of left-over IRQ on unbind
Paolo Bonzini [Sat, 9 Mar 2024 16:24:58 +0000 (11:24 -0500)]
SEV: disable SEV-ES DebugSwap by default
The DebugSwap feature of SEV-ES provides a way for confidential guests to use
data breakpoints. However, because the status of the DebugSwap feature is
recorded in the VMSA, enabling it by default invalidates the attestation
signatures. In 6.10 we will introduce a new API to create SEV VMs that
will allow enabling DebugSwap based on what the user tells KVM to do.
Contextually, we will change the legacy KVM_SEV_ES_INIT API to never
enable DebugSwap.
For compatibility with kernels that pre-date the introduction of DebugSwap,
as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable
the feature by default. If anybody wants to use it, for now they can enable
the sev_es_debug_swap_enabled module parameter, but this will result in a
warning.
Fixes: d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 9 Mar 2024 16:20:44 +0000 (11:20 -0500)]
Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
avoid creating ABI that KVM can't sanely support.
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely a development and testing vehicle, and
come with zero guarantees.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
is to support confidential VMs with deterministic private memory (SNP
and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
Paolo Bonzini [Sat, 9 Mar 2024 16:18:46 +0000 (11:18 -0500)]
Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.8, round 2:
- When emulating an atomic access, mark the gfn as dirty in the memslot
to fix a bug where KVM could fail to mark the slot as dirty during live
migration, ultimately resulting in guest data corruption due to a dirty
page not being re-copied from the source to the target.
- Check for mmu_notifier invalidation events before faulting in the pfn,
and before acquiring mmu_lock, to avoid unnecessary work and lock
contention. Contending mmu_lock is especially problematic on preemptible
kernels, as KVM may yield mmu_lock in response to the contention, which
severely degrades overall performance due to vCPUs making it difficult
for the task that triggered invalidation to make forward progress.
Note, due to another kernel bug, this fix isn't limited to preemtible
kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield
contended rwlocks and spinlocks.
https://lore.kernel.org/all/
20240110214723.695930-1-seanjc@google.com
Colin Ian King [Fri, 8 Mar 2024 13:39:21 +0000 (13:39 +0000)]
block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
The helper function mac_fix_string is only required with CONFIG_PPC_PMAC,
add #if CONFIG_PPC_PMAC and #endif around the function.
Cleans up clang scan build warning:
block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gabriel Krisman Bertazi [Sat, 9 Mar 2024 00:32:56 +0000 (19:32 -0500)]
io_uring: Fix sqpoll utilization check racing with dying sqpoll
Commit
3fcb9d17206e ("io_uring/sqpoll: statistics of the true
utilization of sq threads"), currently in Jens for-next branch, peeks at
io_sq_data->thread to report utilization statistics. But, If
io_uring_show_fdinfo races with sqpoll terminating, even though we hold
the ctx lock, sqd->thread might be NULL and we hit the Oops below.
Note that we could technically just protect the getrusage() call and the
sq total/work time calculations. But showing some sq
information (pid/cpu) and not other information (utilization) is more
confusing than not reporting anything, IMO. So let's hide it all if we
happen to race with a dying sqpoll.
This can be triggered consistently in my vm setup running
sqpoll-cancel-hang.t in a loop.
BUG: kernel NULL pointer dereference, address:
00000000000007b0
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted
6.8.0-rc3-g3fcb9d17206e-dirty #69
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:
ffffa166c671bb80 EFLAGS:
00010282
RAX:
00000000000040ca RBX:
ffffa166c671bc60 RCX:
ffffa166c671bc60
RDX:
ffffa166c671bc60 RSI:
0000000000000000 RDI:
0000000000000000
RBP:
ffffa166c671bbe0 R08:
ffff9448cc3930c0 R09:
0000000000000000
R10:
ffffa166c671bd50 R11:
ffffffff9ee89260 R12:
0000000000000000
R13:
ffff9448ce099480 R14:
0000000000000000 R15:
ffff9448cff5b000
FS:
00007f786e225900(0000) GS:
ffff94493bc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000000007b0 CR3:
000000010d39c000 CR4:
0000000000750ef0
PKRU:
55555554
Call Trace:
<TASK>
? __die_body+0x1a/0x60
? page_fault_oops+0x154/0x440
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x174/0x7c0
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x63/0x140
? asm_exc_page_fault+0x22/0x30
? getrusage+0x21/0x3e0
? seq_printf+0x4e/0x70
io_uring_show_fdinfo+0x9db/0xa10
? srso_alias_return_thunk+0x5/0xfbef5
? vsnprintf+0x101/0x4d0
? srso_alias_return_thunk+0x5/0xfbef5
? seq_vprintf+0x34/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? seq_printf+0x4e/0x70
? seq_show+0x16b/0x1d0
? __pfx_io_uring_show_fdinfo+0x10/0x10
seq_show+0x16b/0x1d0
seq_read_iter+0xd7/0x440
seq_read+0x102/0x140
vfs_read+0xae/0x320
? srso_alias_return_thunk+0x5/0xfbef5
? __do_sys_newfstat+0x35/0x60
ksys_read+0xa5/0xe0
do_syscall_64+0x50/0x110
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f786ec1db4d
Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
RSP: 002b:
00007ffcb361a4b8 EFLAGS:
00000246 ORIG_RAX:
0000000000000000
RAX:
ffffffffffffffda RBX:
000055a4c8fe42f0 RCX:
00007f786ec1db4d
RDX:
0000000000000400 RSI:
000055a4c8fe48a0 RDI:
0000000000000006
RBP:
00007f786ecfb0b0 R08:
00007f786ecfb2a8 R09:
0000000000000001
R10:
0000000000000000 R11:
0000000000000246 R12:
00007f786ecfaf60
R13:
000055a4c8fe42f0 R14:
0000000000000000 R15:
00007ffcb361a628
</TASK>
Modules linked in:
CR2:
00000000000007b0
---[ end trace
0000000000000000 ]---
RIP: 0010:getrusage+0x21/0x3e0
Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
RSP: 0018:
ffffa166c671bb80 EFLAGS:
00010282
RAX:
00000000000040ca RBX:
ffffa166c671bc60 RCX:
ffffa166c671bc60
RDX:
ffffa166c671bc60 RSI:
0000000000000000 RDI:
0000000000000000
RBP:
ffffa166c671bbe0 R08:
ffff9448cc3930c0 R09:
0000000000000000
R10:
ffffa166c671bd50 R11:
ffffffff9ee89260 R12:
0000000000000000
R13:
ffff9448ce099480 R14:
0000000000000000 R15:
ffff9448cff5b000
FS:
00007f786e225900(0000) GS:
ffff94493bc00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000000007b0 CR3:
000000010d39c000 CR4:
0000000000750ef0
PKRU:
55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 9 Mar 2024 02:05:21 +0000 (18:05 -0800)]
Merge tag 'ceph-for-6.8-rc8' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A follow-up for sparse read fixes that went into -rc4 -- msgr2 case
was missed and is corrected here"
* tag 'ceph-for-6.8-rc8' of https://github.com/ceph/ceph-client:
libceph: init the cursor when preparing sparse read in msgr2
Linus Torvalds [Fri, 8 Mar 2024 21:39:28 +0000 (13:39 -0800)]
Merge tag 'char-misc-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a few small char/misc and other driver subsystem fixes for
reported issues that have been in my tree.
Included in here are fixes for:
- iio driver fixes for reported problems
- much reported bugfix for a lis3lv02d_i2c regression
- comedi driver bugfix
- mei new device ids
- mei driver fixes
- counter core fix
All of these have been in linux-next with no reported issues, some for
many weeks"
* tag 'char-misc-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mei: gsc_proxy: match component when GSC is on different bus
misc: fastrpc: Pass proper arguments to scm call
comedi: comedi_test: Prevent timers rescheduling during deletion
comedi: comedi_8255: Correct error in subdevice initialization
misc: lis3lv02d_i2c: Fix regulators getting en-/dis-abled twice on suspend/resume
iio: accel: adxl367: fix I2C FIFO data register
iio: accel: adxl367: fix DEVID read after reset
iio: pressure: dlhl60d: Initialize empty DLH bytes
iio: imu: inv_mpu6050: fix frequency setting when chip is off
iio: pressure: Fixes BMP38x and BMP390 SPI support
iio: imu: inv_mpu6050: fix FIFO parsing when empty
mei: Add Meteor Lake support for IVSC device
mei: me: add arrow lake point H DID
mei: me: add arrow lake point S DID
counter: fix privdata alignment
Linus Torvalds [Fri, 8 Mar 2024 21:33:04 +0000 (13:33 -0800)]
Merge tag 'tty-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial fixes from Greg KH:
"Here are some small remaining tty/serial driver fixes. Included in
here is fixes for:
- vt unicode buffer corruption fix
- imx serial driver fixes, again
- port suspend fix
- 8250_dw driver fix
- fsl_lpuart driver fix
- revert for the qcom_geni_serial driver to fix a reported regression
All of these have been in linux-next with no reported issues"
* tag 'tty-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "tty: serial: simplify qcom_geni_serial_send_chunk_fifo()"
tty: serial: fsl_lpuart: avoid idle preamble pending if CTS is enabled
vt: fix unicode buffer corruption when deleting characters
serial: port: Don't suspend if the port is still busy
serial: 8250_dw: Do not reclock if already at correct rate
tty: serial: imx: Fix broken RS485
Linus Torvalds [Fri, 8 Mar 2024 21:19:01 +0000 (13:19 -0800)]
Merge tag 'usb-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt fixes from Greg KH:
"Here are some small remaining fixes for USB and Thunderbolt drivers.
Included in here are fixes for:
- thunderbold NULL dereference fix
- typec driver fixes
- xhci driver regression fix
- usb-storage divide-by-0 fix
- ncm gadget driver fix
All of these have been in linux-next with no reported issues"
* tag 'usb-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: Fix failure to detect ring expansion need.
usb: port: Don't try to peer unused USB ports based on location
usb: gadget: ncm: Fix handling of zero block length packets
usb: typec: altmodes/displayport: create sysfs nodes as driver's default device attribute group
usb: typec: tpcm: Fix PORT_RESET behavior for self powered devices
usb: typec: ucsi: fix UCSI on SM8550 & SM8650 Qualcomm devices
USB: usb-storage: Prevent divide-by-0 error in isd200_ata_command
thunderbolt: Fix NULL pointer dereference in tb_port_update_credits()
Linus Torvalds [Fri, 8 Mar 2024 21:13:20 +0000 (13:13 -0800)]
Merge tag 'pinctrl-v6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix the PM suspend callback in the STM32 ST32MP257 driver to properly
support suspend
- Drop an extraneous reference put in the debugfs code, this was
confusing the reference counts and causing unsolicited calls to
__free()
* tag 'pinctrl-v6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: don't put the reference to GPIO device in pinctrl_pins_show()
pinctrl: stm32: fix PM support for stm32mp257
Linus Torvalds [Fri, 8 Mar 2024 21:06:35 +0000 (13:06 -0800)]
Merge tag 'input-for-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a revert of endpoint checks in bcm5974 - the driver is being naughty
and pokes at unclaimed USB interface, so the check fails. We need to
fix the driver to claim both interfaces, and then re-implement the
endpoints check
- a fix to Synaptics RMI driver to avoid UAF on driver unload or device
unbinding
- a few new VID/PIDs added to xpad game controller driver
- a change to gpio_keys_polled driver to quiet it when GPIO causes
probe deferral.
* tag 'input-for-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: synaptics-rmi4 - fix UAF of IRQ domain on driver removal
Input: gpio_keys_polled - suppress deferred probe error for gpio
Revert "Input: bcm5974 - check endpoint type before starting traffic"
Input: xpad - add additional HyperX Controller Identifiers
Linus Torvalds [Fri, 8 Mar 2024 21:01:16 +0000 (13:01 -0800)]
Merge tag 'sound-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. Half of them are HD-audio quirks while
the rest are various device-specific ASoC fixes"
* tag 'sound-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: wm8962: Fix up incorrect error message in wm8962_set_fll
ASoC: wm8962: Enable both SPKOUTR_ENA and SPKOUTL_ENA in mono mode
ASoC: wm8962: Enable oscillator if selecting WM8962_FLL_OSC
ASoC: dt-bindings: nvidia: Fix 'lge' vendor prefix
ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook
ASoC: amd: yc: Add HP Pavilion Aero Laptop 13-be2xxx(8BD6) into DMI quirk table
ASoC: rcar: adg: correct TIMSEL setting for SSI9
ALSA: hda: cs35l41: Overwrite CS35L41 configuration for ASUS UM5302LA
ALSA: hda/realtek: Add quirks for Lenovo Thinkbook 16P laptops
ALSA: hda: cs35l41: Support Lenovo Thinkbook 16P
ALSA: hda/realtek - Add Headset Mic supported Acer NB platform
ALSA: hda: optimize the probe codec process
ALSA: hda/realtek - Fix headset Mic no show at resume back for Lenovo ALC897 platform
ASoC: Intel: bytcr_rt5640: Add an extra entry for the Chuwi Vi8 tablet
ASoC: madera: Fix typo in madera_set_fll_clks shift value
Linus Torvalds [Fri, 8 Mar 2024 20:44:56 +0000 (12:44 -0800)]
Merge tag 'drm-fixes-2024-03-08' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular fixes (two weeks for i915), scattered across drivers, amdgpu
and i915 being the main ones, with nouveau having a couple of fixes.
One patch got applied for udl, but reverted soon after as the
maintainer has missed some crucial prior discussion.
Seems quiet and normal enough for this stage.
MAINTAINERS
- update email address
core:
- fix polling in certain configurations
buddy:
- fix kunit test warning
panel:
- boe-tv101wum-nl6: timing tuning fixes
i915:
- Fix to extract HDCP information from primary connector
- Check for NULL mmu_interval_notifier before removing
- Fix for #10184: Kernel crash on UHD Graphics 730 (Cc stable)
- Fix for #10284: Boot delay regresion with PSR
- Fix DP connector DSC HW state readout
- Selftest fix to convert msecs to jiffies
xe:
- error path fix
amdgpu:
- SMU14 fix
- Fix possible NULL pointer
- VRR fix
- pwm fix
nouveau:
- fix deadlock in new ioctls fail path
- fix missing locking around object rbtree
udl:
- apply and revert format change"
* tag 'drm-fixes-2024-03-08' of https://gitlab.freedesktop.org/drm/kernel: (21 commits)
nouveau: lock the client object tree.
drm/tests/buddy: fix print format
drm/xe: Return immediately on tile_init failure
drm/amdgpu/pm: Fix the error of pwm1_enable setting
drm/amd/display: handle range offsets in VRR ranges
drm/amd/display: check dc_link before dereferencing
drm/amd/swsmu: modify the gfx activity scaling
Revert "drm/udl: Add ARGB8888 as a format"
drm/i915/panelreplay: Move out psr_init_dpcd() from init_connector()
drm/i915/dp: Fix connector DSC HW state readout
drm/i915/selftests: Fix dependency of some timeouts on HZ
drm/udl: Add ARGB8888 as a format
drm/nouveau: fix stale locked mutex in nouveau_gem_ioctl_pushbuf
drm/i915: Don't explode when the dig port we don't have an AUX CH
MAINTAINERS: Update email address for Tvrtko Ursulin
drm/panel: boe-tv101wum-nl6: Fine tune Himax83102-j02 panel HFP and HBP (again)
drm: Fix output poll work for drm_kms_helper_poll=n
drm/i915: Check before removing mm notifier
drm/i915/hdcp: Extract hdcp structure from correct connector
drm/i915/hdcp: Remove additional timing for reading mst hdcp message
...
Uwe Kleine-König [Fri, 8 Mar 2024 08:51:04 +0000 (09:51 +0100)]
block/swim: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/a00aea8201ea85ae726411bb0fb015ea026ff40a.1709886922.git.u.kleine-koenig@pengutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 8 Mar 2024 13:55:58 +0000 (13:55 +0000)]
io_uring/net: dedup io_recv_finish req completion
There are two block in io_recv_finish() completing the request, which we
can combine and remove jumping.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 8 Mar 2024 13:55:57 +0000 (13:55 +0000)]
io_uring: refactor DEFER_TASKRUN multishot checks
We disallow DEFER_TASKRUN multishots from running by io-wq, which is
checked by individual opcodes in the issue path. We can consolidate all
it in io_wq_submit_work() at the same time moving the checks out of the
hot path.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 8 Mar 2024 13:55:56 +0000 (13:55 +0000)]
io_uring: fix mshot io-wq checks
When checking for concurrent CQE posting, we're not only interested in
requests running from the poll handler but also strayed requests ended
up in normal io-wq execution. We're disallowing multishots in general
from io-wq, not only when they came in a certain way.
Cc: stable@vger.kernel.org
Fixes: 17add5cea2bba ("io_uring: force multishot CQEs into task context")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Mar 2024 14:57:57 +0000 (07:57 -0700)]
io_uring/net: add io_req_msg_cleanup() helper
For the fast inline path, we manually recycle the io_async_msghdr and
free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid
that needing doing in the slower path. We already do that in 2 spots, and
in preparation for adding more, add a helper and use it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Mar 2024 17:57:33 +0000 (10:57 -0700)]
io_uring/net: simplify msghd->msg_inq checking
Just check for larger than zero rather than check for non-zero and
not -1. This is easier to read, and also protects against any errants
< 0 values that aren't -1.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 7 Mar 2024 19:53:24 +0000 (12:53 -0700)]
io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
We only use the flag for this purpose, so rename it accordingly. This
further prevents various other use cases of it, keeping it clean and
consistent. Then we can also check it in one spot, when it's being
attempted recycled, and remove some dead code in io_kbuf_recycle_ring().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 7 Mar 2024 19:43:22 +0000 (12:43 -0700)]
io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
Ensure that prep handlers always initialize sr->done_io before any
potential failure conditions, and with that, we now it's always been
set even for the failure case.
With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that.
Additionally, we should not overwrite req->cqe.res unless sr->done_io is
actually positive.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uwe Kleine-König [Fri, 8 Mar 2024 08:51:06 +0000 (09:51 +0100)]
EDAC/versal: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve this, there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Link: https://lore.kernel.org/r/83deca1ce260f7e17ff3cb106c9a6946d4ca4505.1709886922.git.u.kleine-koenig@pengutronix.de
Tommy Huang [Tue, 5 Mar 2024 01:19:06 +0000 (09:19 +0800)]
i2c: aspeed: Fix the dummy irq expected print
When the i2c error condition occurred and master state was not
idle, the master irq function will goto complete state without any
other interrupt handling. It would cause dummy irq expected print.
Under this condition, assign the irq_status into irq_handle.
For example, when the abnormal start / stop occurred (bit 5) with
normal stop status (bit 4) at same time. Then the normal stop status
would not be handled and it would cause irq expected print in
the aspeed_i2c_bus_irq.
...
aspeed-i2c-bus x. i2c-bus: irq handled != irq.
Expected 0x00000030, but was 0x00000020
...
Fixes: 3e9efc3299dd ("i2c: aspeed: Handle master/slave combined irq events properly")
Cc: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
Signed-off-by: Tommy Huang <tommy_huang@aspeedtech.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Christophe JAILLET [Fri, 5 Jan 2024 14:39:35 +0000 (15:39 +0100)]
i2c: wmt: Fix an error handling path in wmt_i2c_probe()
wmt_i2c_reset_hardware() calls clk_prepare_enable(). So, should an error
occur after it, it should be undone by a corresponding
clk_disable_unprepare() call, as already done in the remove function.
Fixes: 560746eb79d3 ("i2c: vt8500: Add support for I2C bus on Wondermedia SoCs")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Heiner Kallweit [Mon, 4 Mar 2024 20:31:06 +0000 (21:31 +0100)]
i2c: i801: Avoid potential double call to gpiod_remove_lookup_table
If registering the platform device fails, the lookup table is
removed in the error path. On module removal we would try to
remove the lookup table again. Fix this by setting priv->lookup
only if registering the platform device was successful.
In addition free the memory allocated for the lookup table in
the error path.
Fixes: d308dfbf62ef ("i2c: mux/i801: Switch to use descriptor passing")
Cc: stable@vger.kernel.org
Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Heiner Kallweit [Sun, 3 Mar 2024 10:45:22 +0000 (11:45 +0100)]
i2c: i801: Fix using mux_pdev before it's set
i801_probe_optional_slaves() is called before i801_add_mux().
This results in mux_pdev being checked before it's set by
i801_add_mux(). Fix this by changing the order of the calls.
I consider this safe as I see no dependencies.
Fixes: 80e56b86b59e ("i2c: i801: Simplify class-based client device instantiation")
Cc: stable@vger.kernel.org
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Changbin Du [Fri, 8 Mar 2024 04:44:01 +0000 (12:44 +0800)]
x86/sev: Disable KMSAN for memory encryption TUs
Instrumenting sev.c and mem_encrypt_identity.c with KMSAN will result in
a triple-faulting kernel. Some of the code is invoked too early during
boot, before KMSAN is ready.
Disable KMSAN instrumentation for the two translation units.
[ bp: Massage commit message. ]
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240308044401.1120395-1-changbin.du@huawei.com
Takashi Iwai [Fri, 8 Mar 2024 07:53:36 +0000 (08:53 +0100)]
Merge tag 'asoc-fix-v6.8-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.8
Some more driver specific fixes for v6.8, plus one new x86 platform
quirk. All good fixes to have if you have systems that use the relevant
hardware.
Dave Airlie [Wed, 28 Feb 2024 06:19:47 +0000 (16:19 +1000)]
nouveau: lock the client object tree.
It appears the client object tree has no locking unless I've missed
something else. Fix races around adding/removing client objects,
mostly vram bar mappings.
4562.099306] general protection fault, probably for non-canonical address 0x6677ed422bceb80c: 0000 [#1] PREEMPT SMP PTI
[ 4562.099314] CPU: 2 PID: 23171 Comm: deqp-vk Not tainted 6.8.0-rc6+ #27
[ 4562.099324] Hardware name: Gigabyte Technology Co., Ltd. Z390 I AORUS PRO WIFI/Z390 I AORUS PRO WIFI-CF, BIOS F8 11/05/2021
[ 4562.099330] RIP: 0010:nvkm_object_search+0x1d/0x70 [nouveau]
[ 4562.099503] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 89 f8 48 85 f6 74 39 48 8b 87 a0 00 00 00 48 85 c0 74 12 <48> 8b 48 f8 48 39 ce 73 15 48 8b 40 10 48 85 c0 75 ee 48 c7 c0 fe
[ 4562.099506] RSP: 0000:
ffffa94cc420bbf8 EFLAGS:
00010206
[ 4562.099512] RAX:
6677ed422bceb814 RBX:
ffff98108791f400 RCX:
ffff9810f26b8f58
[ 4562.099517] RDX:
0000000000000000 RSI:
ffff9810f26b9158 RDI:
ffff98108791f400
[ 4562.099519] RBP:
ffff9810f26b9158 R08:
0000000000000000 R09:
0000000000000000
[ 4562.099521] R10:
ffffa94cc420bc48 R11:
0000000000000001 R12:
ffff9810f02a7cc0
[ 4562.099526] R13:
0000000000000000 R14:
00000000000000ff R15:
0000000000000007
[ 4562.099528] FS:
00007f629c5017c0(0000) GS:
ffff98142c700000(0000) knlGS:
0000000000000000
[ 4562.099534] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 4562.099536] CR2:
00007f629a882000 CR3:
000000017019e004 CR4:
00000000003706f0
[ 4562.099541] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 4562.099542] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 4562.099544] Call Trace:
[ 4562.099555] <TASK>
[ 4562.099573] ? die_addr+0x36/0x90
[ 4562.099583] ? exc_general_protection+0x246/0x4a0
[ 4562.099593] ? asm_exc_general_protection+0x26/0x30
[ 4562.099600] ? nvkm_object_search+0x1d/0x70 [nouveau]
[ 4562.099730] nvkm_ioctl+0xa1/0x250 [nouveau]
[ 4562.099861] nvif_object_map_handle+0xc8/0x180 [nouveau]
[ 4562.099986] nouveau_ttm_io_mem_reserve+0x122/0x270 [nouveau]
[ 4562.100156] ? dma_resv_test_signaled+0x26/0xb0
[ 4562.100163] ttm_bo_vm_fault_reserved+0x97/0x3c0 [ttm]
[ 4562.100182] ? __mutex_unlock_slowpath+0x2a/0x270
[ 4562.100189] nouveau_ttm_fault+0x69/0xb0 [nouveau]
[ 4562.100356] __do_fault+0x32/0x150
[ 4562.100362] do_fault+0x7c/0x560
[ 4562.100369] __handle_mm_fault+0x800/0xc10
[ 4562.100382] handle_mm_fault+0x17c/0x3e0
[ 4562.100388] do_user_addr_fault+0x208/0x860
[ 4562.100395] exc_page_fault+0x7f/0x200
[ 4562.100402] asm_exc_page_fault+0x26/0x30
[ 4562.100412] RIP: 0033:0x9b9870
[ 4562.100419] Code: 85 a8 f7 ff ff 8b 8d 80 f7 ff ff 89 08 e9 18 f2 ff ff 0f 1f 84 00 00 00 00 00 44 89 32 e9 90 fa ff ff 0f 1f 84 00 00 00 00 00 <44> 89 32 e9 f8 f1 ff ff 0f 1f 84 00 00 00 00 00 66 44 89 32 e9 e7
[ 4562.100422] RSP: 002b:
00007fff9ba2dc70 EFLAGS:
00010246
[ 4562.100426] RAX:
0000000000000004 RBX:
000000000dd65e10 RCX:
000000fff0000000
[ 4562.100428] RDX:
00007f629a882000 RSI:
00007f629a882000 RDI:
0000000000000066
[ 4562.100432] RBP:
00007fff9ba2e570 R08:
0000000000000000 R09:
0000000123ddf000
[ 4562.100434] R10:
0000000000000001 R11:
0000000000000246 R12:
000000007fffffff
[ 4562.100436] R13:
0000000000000000 R14:
0000000000000000 R15:
0000000000000000
[ 4562.100446] </TASK>
[ 4562.100448] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink cmac bnep sunrpc iwlmvm intel_rapl_msr intel_rapl_common snd_sof_pci_intel_cnl x86_pkg_temp_thermal intel_powerclamp snd_sof_intel_hda_common mac80211 coretemp snd_soc_acpi_intel_match kvm_intel snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof kvm snd_sof_utils snd_soc_core snd_hda_codec_realtek libarc4 snd_hda_codec_generic snd_compress snd_hda_ext_core vfat fat snd_hda_intel snd_intel_dspcfg irqbypass iwlwifi snd_hda_codec snd_hwdep snd_hda_core btusb btrtl mei_hdcp iTCO_wdt rapl mei_pxp btintel snd_seq iTCO_vendor_support btbcm snd_seq_device intel_cstate bluetooth snd_pcm cfg80211 intel_wmi_thunderbolt wmi_bmof intel_uncore snd_timer mei_me snd ecdh_generic i2c_i801
[ 4562.100541] ecc mei i2c_smbus soundcore rfkill intel_pch_thermal acpi_pad zram nouveau drm_ttm_helper ttm gpu_sched i2c_algo_bit drm_gpuvm drm_exec mxm_wmi drm_display_helper drm_kms_helper drm crct10dif_pclmul crc32_pclmul nvme e1000e crc32c_intel nvme_core ghash_clmulni_intel video wmi pinctrl_cannonlake ip6_tables ip_tables fuse
[ 4562.100616] ---[ end trace
0000000000000000 ]---
Signed-off-by: Dave Airlie <airlied@redhat.com>
Cc: stable@vger.kernel.org
This page took 0.214856 seconds and 4 git commands to generate.