workqueue: Unbind kworkers before sending them to exit()
It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.
A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).
Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.
Changing the affinity does require a sleepable context, leverage the newly
introduced pool->idle_cull_work to get that.
Remove dying workers from pool->workers and keep track of them in a
separate list. This intentionally prevents for_each_loop_worker() from
iterating over workers that are marked for death.
Rename destroy_worker() to set_working_dying() to better reflect its
effects and relationship with wake_dying_workers().
workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
put_unbound_pool() currently passes wq_manager_inactive() as exit condition
to rcuwait_wait_event(), which grabs pool->lock to check for
pool->flags & POOL_MANAGER_ACTIVE
A later patch will require destroy_worker() to be invoked with
wq_pool_attach_mutex held, which needs to be acquired before
pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
it could clobber the task state set by rcuwait_wait_event()
Instead, restructure the waiting logic to acquire any necessary lock
outside of rcuwait_wait_event().
Since further work cannot be inserted into unbound pwqs that have reached
->refcnt==0, this is bound to make forward progress as eventually the
worklist will be drained and need_more_worker(pool) will remain false,
preventing any worker from stealing the manager position from us.
workqueue: Convert the idle_timer to a timer + work_struct
A later patch will require a sleepable context in the idle worker timeout
function. Converting worker_pool.idle_timer to a delayed_work gives us just
that, however this would imply turning all idle_timer expiries into
scheduler events (waking up a worker to handle the dwork).
Instead, implement a "custom dwork" where the timer callback does some
extra checks before queuing the associated work.
Lai Jiangshan [Thu, 12 Jan 2023 16:14:27 +0000 (16:14 +0000)]
workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs") Reported-by: Valentin Schneider <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
Richard Clark [Tue, 13 Dec 2022 04:39:36 +0000 (12:39 +0800)]
workqueue: Add a new flag to spot the potential UAF error
Currently if the user queues a new work item unintentionally
into a wq after the destroy_workqueue(wq), the work still can
be queued and scheduled without any noticeable kernel message
before the end of a RCU grace period.
As a debug-aid facility, this commit adds a new flag
__WQ_DESTROYING to spot that issue by triggering a kernel WARN
message.
Linus Torvalds [Wed, 4 Jan 2023 20:11:29 +0000 (12:11 -0800)]
Merge tag 'x86-urgent-2023-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
"Fix a double-free bug, a binutils warning, a header namespace clash
and a bug in ib_prctl_set()"
* tag 'x86-urgent-2023-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Flush IBP in ib_prctl_set()
x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type
x86/asm: Fix an assembler warning with current binutils
x86/kexec: Fix double-free of elf header buffer
Linus Torvalds [Wed, 4 Jan 2023 20:02:26 +0000 (12:02 -0800)]
Merge tag 'f2fs-fix-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fix a null pointer dereference in f2fs_issue_flush, which occurs by
the combination of mount/remount options.
- fix a bug in per-block age-based extent_cache newly introduced in
6.2-rc1, which reported a wrong age information in extent_cache.
- fix a kernel panic if extent_tree was not created, which was caught
by a wrong BUG_ON
* tag 'f2fs-fix-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: let's avoid panic if extent_tree is not created
f2fs: should use a temp extent_info for lookup
f2fs: don't mix to use union values in extent_info
f2fs: initialize extent_cache parameter
f2fs: fix to avoid NULL pointer dereference in f2fs_issue_flush()
Linus Torvalds [Wed, 4 Jan 2023 19:26:36 +0000 (11:26 -0800)]
Merge tag 'nfsd-6.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a filecache UAF during NFSD shutdown
- Avoid exposing automounted mounts on NFS re-exports
* tag 'nfsd-6.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix handling of readdir in v4root vs. mount upcall timeout
nfsd: shut down the NFSv4 state objects before the filecache
b) we will never allocate memory for SM_I(sbi)->fcc_info w/ below
testcase,
- mount -o ro /dev/vda /mnt/f2fs
- mount -o rw,remount /dev/vda /mnt/f2fs
- dd if=/dev/zero of=/mnt/f2fs/file bs=4k count=1 conv=fsync
In order to fix this issue, let change as below:
- fix error path handling in f2fs_create_flush_cmd_control().
- allocate SM_I(sbi)->fcc_info even if readonly is on.
Linus Torvalds [Mon, 2 Jan 2023 19:06:18 +0000 (11:06 -0800)]
Merge tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"First batch of regression and regular fixes:
- regressions:
- fix error handling after conversion to qstr for paths
- fix raid56/scrub recovery caused by uninitialized variable
after conversion to error bitmaps
- restore qgroup backref lookup behaviour after recent
refactoring
- fix leak of device lists at module exit time
- fix resolving backrefs for inline extent followed by prealloc
- reset defrag ioctl buffer on memory allocation error"
* tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix fscrypt name leak after failure to join log transaction
btrfs: scrub: fix uninitialized return value in recover_scrub_rbio
btrfs: fix resolving backrefs for inline extent followed by prealloc
btrfs: fix trace event name typo for FLUSH_DELAYED_REFS
btrfs: restore BTRFS_SEQ_LAST when looking up qgroup backref lookup
btrfs: fix leak of fs devices after removing btrfs module
btrfs: fix an error handling path in btrfs_defrag_leaves()
btrfs: fix an error handling path in btrfs_rename()
Tetsuo Handa [Mon, 2 Jan 2023 14:05:33 +0000 (23:05 +0900)]
fs/ntfs3: don't hold ni_lock when calling truncate_setsize()
syzbot is reporting hung task at do_user_addr_fault() [1], for there is
a silent deadlock between PG_locked bit and ni_lock lock.
Since filemap_update_page() calls filemap_read_folio() after calling
folio_trylock() which will set PG_locked bit, ntfs_truncate() must not
call truncate_setsize() which will wait for PG_locked bit to be cleared
when holding ni_lock lock.
Takashi Iwai [Tue, 22 Nov 2022 11:51:22 +0000 (12:51 +0100)]
x86/kexec: Fix double-free of elf header buffer
After
b3e34a47f989 ("x86/kexec: fix memory leak of elf header buffer"),
freeing image->elf_headers in the error path of crash_load_segments()
is not needed because kimage_file_post_load_cleanup() will take
care of that later. And not clearing it could result in a double-free.
Drop the superfluous vfree() call at the error path of
crash_load_segments().
Jeff Layton [Tue, 13 Dec 2022 18:08:26 +0000 (13:08 -0500)]
nfsd: fix handling of readdir in v4root vs. mount upcall timeout
If v4 READDIR operation hits a mountpoint and gets back an error,
then it will include that entry in the reply and set RDATTR_ERROR for it
to the error.
That's fine for "normal" exported filesystems, but on the v4root, we
need to be more careful to only expose the existence of dentries that
lead to exports.
If the mountd upcall times out while checking to see whether a
mountpoint on the v4root is exported, then we have no recourse other
than to fail the whole operation.
Linus Torvalds [Sun, 1 Jan 2023 19:27:00 +0000 (11:27 -0800)]
Merge tag 'perf_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Pass only an initialized perf event attribute to the LSM hook
- Fix a use-after-free on the perf syscall's error path
- A potential integer overflow fix in amd_core_pmu_init()
- Fix the cgroup events tracking after the context handling rewrite
- Return the proper value from the inherit_event() function on error
* tag 'perf_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Call LSM hook after copying perf_event_attr
perf: Fix use-after-free in error path
perf/x86/amd: fix potential integer overflow on shift of a int
perf/core: Fix cgroup events tracking
perf core: Return error pointer if inherit_event() fails to find pmu_ctx
Linus Torvalds [Sun, 1 Jan 2023 19:11:13 +0000 (11:11 -0800)]
Merge tag 'drm-fixes-2023-01-01' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
"I'm just back from the mountains, and Dave is out at the beach and
should be back in a week again. Just i915 fixes and since Rodrigo
bothered to make the pull last week I figured I should warm up gpg and
forward this in a nice signed tag as a new years present!
- i915 fixes for newer platforms
- i915 locking rework to not give up in vm eviction fallback path too
early"
* tag 'drm-fixes-2023-01-01' of git://anongit.freedesktop.org/drm/drm:
drm/i915/dsi: fix MIPI_BKLT_EN_1 native GPIO index
drm/i915/dsi: add support for ICL+ native MIPI GPIO sequence
drm/i915/uc: Fix two issues with over-size firmware files
drm/i915: improve the catch-all evict to handle lock contention
drm/i915: Remove __maybe_unused from mtl_info
drm/i915: fix TLB invalidation for Gen12.50 video and compute engines
Linus Torvalds [Sat, 31 Dec 2022 18:21:47 +0000 (10:21 -0800)]
Merge tag 'kbuild-fixes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix broken BuildID
- Add srcrpm-pkg to the help message
- Fix the option order for modpost built with musl libc
- Fix the build dependency of rpm-pkg for openSUSE
* tag 'kbuild-fixes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
fixdep: remove unneeded <stdarg.h> inclusion
kbuild: sort single-targets alphabetically again
kbuild: rpm-pkg: add libelf-devel as alternative for BuildRequires
kbuild: Fix running modpost with musl libc
kbuild: add a missing line for help message
.gitignore: ignore *.rpm
arch: fix broken BuildID for arm64 and riscv
kconfig: Add static text for search information in help menu
Linus Torvalds [Fri, 30 Dec 2022 18:47:25 +0000 (10:47 -0800)]
Merge tag 'acpi-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These are new ACPI IRQ override quirks, low-power S0 idle (S0ix)
support adjustments and ACPI backlight handling fixes, mostly for
platforms using AMD chips.
Specifics:
- Add ACPI IRQ override quirks for Asus ExpertBook B2502, Lenovo
14ALC7, and XMG Core 15 (Hans de Goede, Adrian Freund, Erik
Schumacher).
- Adjust ACPI video detection fallback path to prevent
non-operational ACPI backlight devices from being created on
systems where the native driver does not detect a suitable panel
(Mario Limonciello).
- Fix Apple GMUX backlight detection (Hans de Goede).
- Add a low-power S0 idle (S0ix) handling quirk for HP Elitebook 865
and stop using AMD-specific low-power S0 idle code path for systems
with Rembrandt chips and newer (Mario Limonciello)"
* tag 'acpi-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: x86: s2idle: Stop using AMD specific codepath for Rembrandt+
ACPI: x86: s2idle: Force AMD GUID/_REV 2 on HP Elitebook 865
ACPI: video: Fix Apple GMUX backlight detection
ACPI: resource: Add Asus ExpertBook B2502 to Asus quirks
ACPI: resource: do IRQ override on Lenovo 14ALC7
ACPI: resource: do IRQ override on XMG Core 15
ACPI: video: Don't enable fallback path for creating ACPI backlight by default
drm/amd/display: Report to ACPI video if no panels were found
ACPI: video: Allow GPU drivers to report no panels
Linus Torvalds [Fri, 30 Dec 2022 18:30:54 +0000 (10:30 -0800)]
Merge tag 'sound-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a few small fixes:
- A regression fix for HDMI audio on HD-audio AMD codecs
- Fixes for LINE6 MIDI handling
- HD-audio quirk for Dell laptops"
* tag 'sound-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/hdmi: Static PCM mapping again with AMD HDMI codecs
ALSA: hda/realtek: Apply dual codec fixup for Dell Latitude laptops
ALSA: line6: fix stack overflow in line6_midi_transmit
ALSA: line6: correct midi status byte when receiving data from podxt
Merge ACPI resource handling quirks and ACPI backlight handling fixes
for 6.2-rc2:
- Add ACPI IRQ override quirks for Asus ExpertBook B2502, Lenovo
14ALC7, and XMG Core 15 (Hans de Goede, Adrian Freund, Erik
Schumacher).
- Adjust ACPI video detection fallback path to prevent non-operational
ACPI backlight devices from being created on systems where the native
driver does not detect a suitable panel (Mario Limonciello).
- Fix Apple GMUX backlight detection (Hans de Goede).
* acpi-resource:
ACPI: resource: Add Asus ExpertBook B2502 to Asus quirks
ACPI: resource: do IRQ override on Lenovo 14ALC7
ACPI: resource: do IRQ override on XMG Core 15
* acpi-video:
ACPI: video: Fix Apple GMUX backlight detection
ACPI: video: Don't enable fallback path for creating ACPI backlight by default
drm/amd/display: Report to ACPI video if no panels were found
ACPI: video: Allow GPU drivers to report no panels
Jani Nikula [Mon, 19 Dec 2022 10:59:55 +0000 (12:59 +0200)]
drm/i915/dsi: add support for ICL+ native MIPI GPIO sequence
Starting from ICL, the default for MIPI GPIO sequences seems to be using
native GPIOs i.e. GPIOs available in the GPU. These native GPIOs reuse
many pins that quite frankly seem scary to poke based on the VBT
sequences. We pretty much have to trust that the board is configured
such that the relevant HPD, PP_CONTROL and GPIO bits aren't used for
anything else.
MIPI sequence v4 also adds a flag to fall back to non-native sequences.
v5:
- Wrap SHOTPLUG_CTL_DDI modification in spin_lock() in icp_irq_handler()
too (Ville)
- References instead of Closes issue 6131 because this does not fix everything
v4:
- Wrap SHOTPLUG_CTL_DDI modification in spin_lock_irq() (Ville)
v3:
- Fix -Wbitwise-conditional-parentheses (kernel test robot <[email protected]>)
v2:
- Fix HPD pin output set (impacts GPIOs 0 and 5)
- Fix GPIO data output direction set (impacts GPIOs 4 and 9)
- Reduce register accesses to single intel_de_rwm()
Masahiro Yamada [Wed, 28 Dec 2022 19:10:14 +0000 (04:10 +0900)]
kbuild: rpm-pkg: add libelf-devel as alternative for BuildRequires
Guoqing Jiang reports that openSUSE cannot compile the kernel rpm due
to "BuildRequires: elfutils-libelf-devel" added by commit 8818039f959b
("kbuild: add ability to make source rpm buildable using koji").
The relevant package name in openSUSE is libelf-devel.
Add it as an alternative package.
BTW, if it is impossible to solve the build requirement, the final
resort would be:
$ make RPMOPTS=--nodeps rpm-pkg
This passes --nodeps to the rpmbuild command so it will not verify
build dependencies. This is useful to test rpm builds on non-rpm
system. On Debian/Ubuntu, for example, you can install rpmbuild by
'apt-get install rpm'.
NOTE1:
Likewise, it is possible to bypass the build dependency check for
debian package builds:
$ make DPKG_FLAGS=-d deb-pkg
NOTE2:
The 'or' operator is supported since RPM 4.13. So, old distros such
as CentOS 7 will break. I suggest installing newer rpmbuild in such
cases.
Samuel Holland [Tue, 27 Dec 2022 21:48:21 +0000 (15:48 -0600)]
kbuild: Fix running modpost with musl libc
commit 3d57e1b7b1d4 ("kbuild: refactor the prerequisites of the modpost
rule") moved 'vmlinux.o' inside modpost-args, possibly before some of
the other options. However, getopt() in musl libc follows POSIX and
stops looking for options upon reaching the first non-option argument.
As a result, the '-T' option is misinterpreted as a positional argument,
and the build fails:
make -f ./scripts/Makefile.modpost
scripts/mod/modpost -E -o Module.symvers vmlinux.o -T modules.order
-T: No such file or directory
make[1]: *** [scripts/Makefile.modpost:137: Module.symvers] Error 1
make: *** [Makefile:1960: modpost] Error 2
The fix is to move all options before 'vmlinux.o' in modpost-args.
Fixes: 3d57e1b7b1d4 ("kbuild: refactor the prerequisites of the modpost rule") Signed-off-by: Samuel Holland <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
Masahiro Yamada [Mon, 26 Dec 2022 18:54:44 +0000 (03:54 +0900)]
.gitignore: ignore *.rpm
Previously, *.rpm files were created under $HOME/rpmbuild/, but since
commit 8818039f959b ("kbuild: add ability to make source rpm buildable
using koji"), srcrpm-pkg creates the source rpm in the kernel tree
because it sets '_srcrpmdir'.
Masahiro Yamada [Mon, 26 Dec 2022 18:45:37 +0000 (03:45 +0900)]
arch: fix broken BuildID for arm64 and riscv
Dennis Gilmore reports that the BuildID is missing in the arm64 vmlinux
since commit 994b7ac1697b ("arm64: remove special treatment for the
link order of head.o").
The issue is that the type of .notes section, which contains the BuildID,
changed from NOTES to PROGBITS.
Ard Biesheuvel figured out that whichever object gets linked first gets
to decide the type of a section. The PROGBITS type is the result of the
compiler emitting .note.GNU-stack as PROGBITS rather than NOTE.
While Ard provided a fix for arm64, I want to fix this globally because
the same issue is happening on riscv since commit 2348e6bf4421 ("riscv:
remove special treatment for the link order of head.o"). This problem
will happen in general for other architectures if they start to drop
unneeded entries from scripts/head-object-list.txt.
Discard .note.GNU-stack in include/asm-generic/vmlinux.lds.h.
John Harrison [Wed, 21 Dec 2022 19:30:31 +0000 (11:30 -0800)]
drm/i915/uc: Fix two issues with over-size firmware files
In the case where a firmware file is too large (e.g. someone
downloaded a web page ASCII dump from github...), the firmware object
is released but the pointer is not zerod. If no other firmware file
was found then release would be called again leading to a double kfree.
Also, the size check was only being applied to the initial firmware
load not any of the subsequent attempts. So move the check into a
wrapper that is used for all loads.
Matthew Auld [Fri, 16 Dec 2022 11:34:56 +0000 (11:34 +0000)]
drm/i915: improve the catch-all evict to handle lock contention
The catch-all evict can fail due to object lock contention, since it
only goes as far as trylocking the object, due to us already holding the
vm->mutex. Doing a full object lock here can deadlock, since the
vm->mutex is always our inner lock. Add another execbuf pass which drops
the vm->mutex and then tries to grab the object will the full lock,
before then retrying the eviction. This should be good enough for now to
fix the immediate regression with userspace seeing -ENOSPC from execbuf
due to contended object locks during GTT eviction.
v2 (Mani)
- Also revamp the docs for the different passes.
Lucas De Marchi [Wed, 14 Dec 2022 19:49:44 +0000 (11:49 -0800)]
drm/i915: Remove __maybe_unused from mtl_info
The attribute __maybe_unused should remain only until the respective
info is not in the pciidlist. The info can't be added together
with its definition because that would cause the driver to automatically
probe for the device, while it's still not ready for that. However once
pciidlist contains it, the attribute can be removed.
Andrzej Hajda [Wed, 14 Dec 2022 07:54:39 +0000 (08:54 +0100)]
drm/i915: fix TLB invalidation for Gen12.50 video and compute engines
In case of Gen12.50 video and compute engines, TLB_INV registers are
masked - to modify one bit, corresponding bit in upper half of the register
must be enabled, otherwise nothing happens.
Linus Torvalds [Fri, 30 Dec 2022 00:57:29 +0000 (16:57 -0800)]
Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Mostly just NVMe, but also a single fixup for BFQ for a regression
that happened during the merge window. In detail:
- NVMe pull requests via Christoph:
- Fix doorbell buffer value endianness (Klaus Jensen)
- Fix Linux vs NVMe page size mismatch (Keith Busch)
- Fix a potential use memory access beyong the allocation limit
(Keith Busch)
- Fix a multipath vs blktrace NULL pointer dereference (Yanjun
Zhang)
- Fix various problems in handling the Command Supported and
Effects log (Christoph Hellwig)
- Don't allow unprivileged passthrough of commands that don't
transfer data but modify logical block content (Christoph
Hellwig)
- Add a features and quirks policy document (Christoph Hellwig)
- Fix some really nasty code that was correct but made smatch
complain (Sagi Grimberg)
- Use-after-free regression in BFQ from this merge window (Yu)"
* tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux:
nvme-auth: fix smatch warning complaints
nvme: consult the CSE log page for unprivileged passthrough
nvme: also return I/O command effects from nvme_command_effects
nvmet: don't defer passthrough commands with trivial effects to the workqueue
nvmet: set the LBCC bit for commands that modify data
nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
docs, nvme: add a feature and quirk policy document
nvme-pci: update sqsize when adjusting the queue depth
nvme: fix setting the queue depth in nvme_alloc_io_tag_set
block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
nvme: fix multipath crash caused by flush request when blktrace is enabled
nvme-pci: fix page size checks
nvme-pci: fix mempool alloc size
nvme-pci: fix doorbell buffer value endianness
Linus Torvalds [Fri, 30 Dec 2022 00:48:21 +0000 (16:48 -0800)]
Merge tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Two fixes for mutex grabbing when the task state is != TASK_RUNNING
(me)
- Check for invalid opcode in io_uring_register() a bit earlier, to
avoid going through the quiesce machinery just to return -EINVAL
later in the process (me)
- Fix for the uapi io_uring header, skipping including time_types.h
when necessary (Stefan)
* tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux:
uapi:io_uring.h: allow linux/time_types.h to be skipped
io_uring: check for valid register opcode earlier
io_uring/cancel: re-grab ctx mutex after finishing wait
io_uring: finish waiting before flushing overflow entries
Linus Torvalds [Thu, 29 Dec 2022 18:56:13 +0000 (10:56 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Changes that were posted too late for 6.1, or after the release.
x86:
- several fixes to nested VMX execution controls
- fixes and clarification to the documentation for Xen emulation
- do not unnecessarily release a pmu event with zero period
- MMU fixes
- fix Coverity warning in kvm_hv_flush_tlb()
selftests:
- fixes for the ucall mechanism in selftests
- other fixes mostly related to compilation with clang"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (41 commits)
KVM: selftests: restore special vmmcall code layout needed by the harness
Documentation: kvm: clarify SRCU locking order
KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET
KVM: x86/xen: Documentation updates and clarifications
KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi
KVM: x86/xen: Simplify eventfd IOCTLs
KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports
KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly
KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()
KVM: Delete extra block of "};" in the KVM API documentation
kvm: x86/mmu: Remove duplicated "be split" in spte.h
kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()
MAINTAINERS: adjust entry after renaming the vmx hyperv files
KVM: selftests: Mark correct page as mapped in virt_map()
KVM: arm64: selftests: Don't identity map the ucall MMIO hole
KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap
KVM: selftests: Use magic value to signal ucall_alloc() failure
KVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning
KVM: selftests: Include lib.mk before consuming $(CC)
KVM: selftests: Explicitly disable builtins for mem*() overrides
...
Jens Axboe [Thu, 29 Dec 2022 18:31:45 +0000 (11:31 -0700)]
Merge tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme into block-6.2
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.2
- fix various problems in handling the Command Supported and Effects log
(Christoph Hellwig)
- don't allow unprivileged passthrough of commands that don't transfer
data but modify logical block content (Christoph Hellwig)
- add a features and quirks policy document (Christoph Hellwig)
- fix some really nasty code that was correct but made smatch complain
(Sagi Grimberg)"
* tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme:
nvme-auth: fix smatch warning complaints
nvme: consult the CSE log page for unprivileged passthrough
nvme: also return I/O command effects from nvme_command_effects
nvmet: don't defer passthrough commands with trivial effects to the workqueue
nvmet: set the LBCC bit for commands that modify data
nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
docs, nvme: add a feature and quirk policy document
nvme: consult the CSE log page for unprivileged passthrough
Commands like Write Zeros can change the contents of a namespaces without
actually transferring data. To protect against this, check the Commands
Supported and Effects log is supported by the controller for any
unprivileg command passthrough and refuse unprivileged passthrough if the
command has any effects that can change data or metadata.
Note: While the Commands Support and Effects log page has only been
mandatory since NVMe 2.0, it is widely supported because Windows requires
it for any command passthrough from userspace.
nvme: also return I/O command effects from nvme_command_effects
To be able to use the Commands Supported and Effects Log for allowing
unprivileged passtrough, it needs to be corretly reported for I/O
commands as well. Return the I/O command effects from
nvme_command_effects, and also add a default list of effects for the
NVM command set. For other command sets, the Commands Supported and
Effects log is required to be present already.
nvmet: don't defer passthrough commands with trivial effects to the workqueue
Mask out the "Command Supported" and "Logical Block Content Change" bits
and only defer execution of commands that have non-trivial effects to
the workqueue for synchronous execution. This allows to execute admin
commands asynchronously on controllers that provide a Command Supported
and Effects log page, and will keep allowing to execute Write commands
asynchronously once command effects on I/O commands are taken into
account.
nvmet: set the LBCC bit for commands that modify data
Write, Write Zeroes, Zone append and a Zone Reset through
Zone Management Send modify the logical block content of a namespace,
so make sure the LBCC bit is reported for them.
docs, nvme: add a feature and quirk policy document
This adds a document about what specification features are supported by
the Linux NVMe driver, and what qualifies for a quirk if an implementation
has problems following the specification.
Takashi Iwai [Wed, 28 Dec 2022 12:57:14 +0000 (13:57 +0100)]
ALSA: hda/hdmi: Static PCM mapping again with AMD HDMI codecs
The recent code refactoring for HD-audio HDMI codec driver caused a
regression on AMD/ATI HDMI codecs; namely, PulseAudioand pipewire
don't recognize HDMI outputs any longer while the direct output via
ALSA raw access still works.
The problem turned out that, after the code refactoring, the driver
assumes only the dynamic PCM assignment, and when a PCM stream that
still isn't assigned to any pin gets opened, the driver tries to
assign any free converter to the PCM stream. This behavior is OK for
Intel and other codecs, as they have arbitrary connections between
pins and converters. OTOH, on AMD chips that have a 1:1 mapping
between pins and converters, this may end up with blocking the open of
the next PCM stream for the pin that is tied with the formerly taken
converter.
Also, with the code refactoring, more PCM streams are exposed than
necessary as we assume all converters can be used, while this isn't
true for AMD case. This may change the PCM stream assignment and
confuse users as well.
This patch fixes those problems by:
- Introducing a flag spec->static_pcm_mapping, and if it's set, the
driver applies the static mapping between pins and converters at the
probe time
- Limiting the number of PCM streams per pins, too; this avoids the
superfluous PCM streams
Paolo Bonzini [Wed, 30 Nov 2022 18:11:47 +0000 (13:11 -0500)]
KVM: selftests: restore special vmmcall code layout needed by the harness
Commit 8fda37cf3d41 ("KVM: selftests: Stuff RAX/RCX with 'safe' values
in vmmcall()/vmcall()", 2022-11-21) broke the svm_nested_soft_inject_test
because it placed a "pop rbp" instruction after vmmcall. While this is
correct and mimics what is done in the VMX case, this particular test
expects a ud2 instruction right after the vmmcall, so that it can skip
over it in the L1 part of the test.
Inline a suitably-modified version of vmmcall() to restore the
functionality of the test.
Paolo Bonzini [Wed, 28 Dec 2022 11:00:22 +0000 (06:00 -0500)]
Documentation: kvm: clarify SRCU locking order
Currently only the locking order of SRCU vs kvm->slots_arch_lock
and kvm->slots_lock is documented. Extend this to kvm->lock
since Xen emulation got it terribly wrong.
Paolo Bonzini [Wed, 28 Dec 2022 10:33:41 +0000 (05:33 -0500)]
KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET
While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running,
if that happened it could cause a deadlock. This is due to
kvm_xen_eventfd_reset() doing a synchronize_srcu() inside
a kvm->lock critical section.
To avoid this, first collect all the evtchnfd objects in an
array and free all of them once the kvm->lock critical section
is over and th SRCU grace period has expired.
futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error
In a scenario where kcalloc() fails to allocate memory, the futex_waitv
system call immediately returns -ENOMEM without invoking
destroy_hrtimer_on_stack(). When CONFIG_DEBUG_OBJECTS_TIMERS=y, this
results in leaking a timer debug object.
x86/kprobes: Fix optprobe optimization check with CONFIG_RETHUNK
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping
speculative execution after function return, kprobe jump optimization
always fails on the functions with such INT3 inside the function body.
(It already checks the INT3 padding between functions, but not inside
the function)
To avoid this issue, as same as kprobes, check whether the INT3 comes
from kgdb or not, and if so, stop decoding and make it fail. The other
INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be
treated as a one-byte instruction.
x86/kprobes: Fix kprobes instruction boudary check with CONFIG_RETHUNK
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping
speculative execution after RET instruction, kprobes always failes to
check the probed instruction boundary by decoding the function body if
the probed address is after such sequence. (Note that some conditional
code blocks will be placed after function return, if compiler decides
it is not on the hot path.)
This is because kprobes expects kgdb puts the INT3 as a software
breakpoint and it will replace the original instruction.
But these INT3 are not such purpose, it doesn't need to recover the
original instruction.
To avoid this issue, kprobes checks whether the INT3 is owned by
kgdb or not, and if so, stop decoding and make it fail. The other
INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be
treated as a one-byte instruction.
The addition of callthunks_translate_call_dest means that
skip_addr() and patch_dest() can no longer be discarded
as part of the __init section freeing:
Colin Ian King [Fri, 2 Dec 2022 13:51:49 +0000 (13:51 +0000)]
perf/x86/amd: fix potential integer overflow on shift of a int
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit
arithmetic and then passed as a 64 bit function argument. In the case where
i is 32 or more this can lead to an overflow. Avoid this by shifting
using the BIT_ULL macro instead.
We can see from above that we wrongly use percpu atomic perf_cgroup_events
to check if we need to perf_cgroup_switch(), which should only be used
when we know this CPU has cgroup events enabled.
The commit bd2756811766 ("perf: Rewrite core context handling") change
to have only one context per-CPU, so we can just use cpuctx->cgrp to
check if this CPU has cgroup events enabled.
So percpu atomic perf_cgroup_events is not needed.
Ravi Bangoria [Fri, 18 Nov 2022 05:15:39 +0000 (10:45 +0530)]
perf core: Return error pointer if inherit_event() fails to find pmu_ctx
inherit_event() returns NULL only when it finds orphaned events
otherwise it returns either valid child_event pointer or an error
pointer. Follow the same when it fails to find pmu_ctx.
David Woodhouse [Mon, 26 Dec 2022 12:03:19 +0000 (12:03 +0000)]
KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi
These are (uint64_t)-1 magic values are a userspace ABI, allowing the
shared info pages and other enlightenments to be disabled. This isn't
a Xen ABI because Xen doesn't let the guest turn these off except with
the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is
expected to handle soft reset, and tear down the kernel parts of the
enlightenments accordingly.
Paolo Bonzini [Mon, 26 Dec 2022 12:03:17 +0000 (12:03 +0000)]
KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports
The evtchnfd structure itself must be protected by either kvm->lock or
SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is
being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and
does not need kvm->lock, and is called in SRCU critical section from the
kvm_x86_handle_exit function.
It is also important to use rcu_read_{lock,unlock}() in
kvm_xen_hcall_evtchn_send(), because idr_remove() will *not*
use synchronize_srcu() to wait for readers to complete.
Remove a superfluous if (kvm) check before calling synchronize_srcu()
in kvm_xen_eventfd_deassign() where kvm has been dereferenced already.
David Woodhouse [Mon, 26 Dec 2022 12:03:16 +0000 (12:03 +0000)]
KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly
In particular, we shouldn't assume that being contiguous in guest virtual
address space means being contiguous in guest *physical* address space.
In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop
the srcu_read_lock() that was around them. All call sites are reached
from kvm_xen_hypercall() which is called from the handle_exit function
with the read lock already held.
However, since 58cd407ca4c6 ("KVM: Fix multiple races in gfn=>pfn cache
refresh", 2022-05-25) the code is only relying on the MMU notifier's
invalidation count and sequence number.
Lukas Bulwahn [Mon, 5 Dec 2022 08:20:44 +0000 (09:20 +0100)]
MAINTAINERS: adjust entry after renaming the vmx hyperv files
Commit a789aeba4196 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to
"vmx/hyperv.{ch}"") renames the VMX specific Hyper-V files, but does not
adjust the entry in MAINTAINERS.
Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a
broken reference.
Repair this file reference in KVM X86 HYPER-V (KVM/hyper-v).
Oliver Upton [Fri, 9 Dec 2022 01:53:02 +0000 (01:53 +0000)]
KVM: selftests: Mark correct page as mapped in virt_map()
The loop marks vaddr as mapped after incrementing it by page size,
thereby marking the *next* page as mapped. Set the bit in vpages_mapped
first instead.
Oliver Upton [Fri, 9 Dec 2022 01:53:04 +0000 (01:53 +0000)]
KVM: arm64: selftests: Don't identity map the ucall MMIO hole
Currently the ucall MMIO hole is placed immediately after slot0, which
is a relatively safe address in the PA space. However, it is possible
that the same address has already been used for something else (like the
guest program image) in the VA space. At least in my own testing,
building the vgic_irq test with clang leads to the MMIO hole appearing
underneath gicv3_ops.
Stop identity mapping the MMIO hole and instead find an unused VA to map
to it. Yet another subtle detail of the KVM selftests library is that
virt_pg_map() does not update vm->vpages_mapped. Switch over to
virt_map() instead to guarantee that the chosen VA isn't to something
else.
Paolo Bonzini [Mon, 12 Dec 2022 10:36:53 +0000 (05:36 -0500)]
KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap
Explain the meaning of the bit manipulations of vm_vaddr_populate_bitmap.
These correspond to the "canonical addresses" of x86 and other
architectures, but that is not obvious.
KVM: selftests: Use magic value to signal ucall_alloc() failure
Use a magic value to signal a ucall_alloc() failure instead of simply
doing GUEST_ASSERT(). GUEST_ASSERT() relies on ucall_alloc() and so a
failure puts the guest into an infinite loop.
Use -1 as the magic value, as a real ucall struct should never wrap.
Disable gnu-variable-sized-type-not-at-end so that tests and libraries
can create overlays of variable sized arrays at the end of structs when
using a fixed number of entries, e.g. to get/set a single MSR.
It's possible to fudge around the warning, e.g. by defining a custom
struct that hardcodes the number of entries, but that is a burden for
both developers and readers of the code.
lib/x86_64/processor.c:664:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
lib/x86_64/processor.c:772:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
lib/x86_64/processor.c:787:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
3 warnings generated.
x86_64/hyperv_tlb_flush.c:54:18: warning: field 'hv_vp_set' with variable sized type 'struct hv_vpset'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct hv_vpset hv_vp_set;
^
1 warning generated.
x86_64/xen_shinfo_test.c:137:25: warning: field 'info' with variable sized type 'struct kvm_irq_routing'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_irq_routing info;
^
1 warning generated.
KVM: selftests: Include lib.mk before consuming $(CC)
Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which makes the environment override the Makefile). Including lib.mk
after using it for probing, e.g. for -no-pie, can lead to weirdness.
KVM: selftests: Explicitly disable builtins for mem*() overrides
Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset(). Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.
/usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc':
(.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o:
tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
clang: error: linker command failed with exit code 1 (use -v to see invocation)
KVM: selftests: Use proper function prototypes in probing code
Make the main() functions in the probing code proper prototypes so that
compiling the probing code with more strict flags won't generate false
negatives.
<stdin>:1:5: error: function declaration isn’t a prototype [-Werror=strict-prototypes]
KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86
Rename UNAME_M to ARCH_DIR and explicitly set it directly for x86. At
this point, the name of the arch directory really doesn't have anything
to do with `uname -m`, and UNAME_M is unnecessarily confusing given that
its purpose is purely to identify the arch specific directory.
KVM: selftests: Fix a typo in x86-64's kvm_get_cpu_address_width()
Fix a == vs. = typo in kvm_get_cpu_address_width() that results in
@pa_bits being left unset if the CPU doesn't support enumerating its
MAX_PHY_ADDR. Flagged by clang's unusued-value warning.
KVM: selftests: Use pattern matching in .gitignore
Use pattern matching to exclude everything except .c, .h, .S, and .sh
files from Git. Manually adding every test target has an absurd
maintenance cost, is comically error prone, and leads to bikeshedding
over whether or not the targets should be listed in alphabetical order.
Deliberately do not include the one-off assets, e.g. config, settings,
.gitignore itself, etc as Git doesn't ignore files that are already in
the repository. Adding the one-off assets won't prevent mistakes where
developers forget to --force add files that don't match the "allowed".
KVM: selftests: Fix divide-by-zero bug in memslot_perf_test
Check that the number of pages per slot is non-zero in get_max_slots()
prior to computing the remaining number of pages. clang generates code
that uses an actual DIV for calculating the remaining, which causes a #DE
if the total number of pages is less than the number of slots.
traps: memslot_perf_te[97611] trap divide error ip:4030c4 sp:7ffd18ae58f0
error:0 in memslot_perf_test[401000+cb000]
KVM: selftests: Define literal to asm constraint in aarch64 as unsigned long
Define a literal '0' asm input constraint to aarch64/page_fault_test's
guest_cas() as an unsigned long to make clang happy.
tools/testing/selftests/kvm/aarch64/page_fault_test.c:120:16: error:
value size does not match register size specified by the constraint
and modifier [-Werror,-Wasm-operand-widths]
:: "r" (0), "r" (TEST_DATA), "r" (guest_test_memory));
^
tools/testing/selftests/kvm/aarch64/page_fault_test.c:119:15: note:
use constraint modifier "w"
"casal %0, %1, [%2]\n"
^~
%w0