block.c: assertions to the block layer permissions API
Now that we "covered" the three main cases where the
permission API was being used under BQL (fuse,
amend and invalidate_cache), we can safely assert for
the permission functions implemented in block.c
block/block-backend.c: assertions for block-backend
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.
block/export/fuse.c: allow writable exports to take RESIZE permission
Allow writable exports to get BLK_PERM_RESIZE permission
from creation, in fuse_export_create().
In this way, there is no need to give the permission in
fuse_do_truncate(), which might be run in an iothread.
Permissions should be set only in the main thread, so
in any case if an iothread tries to set RESIZE, it will
be blocked.
Also assert in fuse_do_truncate that if we give the
RESIZE permission we can then restore the original ones.
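To illustrate granting RESIZE at creation time, here is a minimal sketch (not the code from block/export/fuse.c) that computes the permission mask once, up front. The helper name export_initial_perms() is hypothetical, and the BLK_PERM_* values are local stand-ins defined only so the example compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the block layer permission bits (illustrative values). */
enum {
    BLK_PERM_CONSISTENT_READ = 1u << 0,
    BLK_PERM_WRITE           = 1u << 1,
    BLK_PERM_RESIZE          = 1u << 3,
};

/* Hypothetical helper: permissions requested when the export is created. */
static uint64_t export_initial_perms(bool writable)
{
    uint64_t perm = BLK_PERM_CONSISTENT_READ;

    if (writable) {
        /* Take RESIZE up front so truncate never has to request it later. */
        perm |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
    }
    return perm;
}
```

With the mask fixed at creation, a truncate handler running in an iothread only needs to use the permission and never has to change it.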
All the global state (GS) API functions will check that
qemu_in_main_thread() returns true. If not, it means
that the safety of BQL cannot be guaranteed, and
they need to be moved to I/O.
include/block/block: split header into I/O and global state API
block.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
Some others can only be called by either the main loop or the
iothread running the AioContext (and not other iothreads),
and using them in another thread would cause deadlocks, and therefore
it is not ideal to define them as I/O.
It is not easy to understand which function is part of which
group (I/O vs GS vs "I/O or GS"), and this patch aims to clarify it.
The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O functions are instead thread safe, and can run in
any AioContext.
"I/O or GS" functions run instead in the main loop or in
a single iothread, and use BDRV_POLL_WHILE().
By splitting the header in two files, block-io.h
and block-global-state.h we have a clearer view on what
needs what kind of protection. block-common.h
contains common structures shared by both headers.
block.h is left there for legacy and to avoid changing
all includes in all c files that use the block APIs.
Right now, IO_CODE and IO_OR_GS_CODE are no-ops, as there isn't
really a way to check that a function is only called in I/O.
On the other side, we can use qemu_in_main_thread() to check if
we are in the main loop.
The usage of macros makes it easy to extend them in the future without
making changes in all callers. They will also visually help understanding
in which category each function is, without looking at the header.
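The marker macros described above can be sketched as follows. This is a standalone model, not the QEMU header: qemu_in_main_thread() is replaced by a stub that reads a flag, the macro name GLOBAL_STATE_CODE is an assumption (the text above only names IO_CODE and IO_OR_GS_CODE), and the example_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stub standing in for qemu_in_main_thread(); here it only reads a flag. */
static bool fake_in_main_thread = true;

static bool qemu_in_main_thread(void)
{
    return fake_in_main_thread;
}

/* GS code can be checked today; the I/O markers are placeholders for now. */
#define GLOBAL_STATE_CODE() assert(qemu_in_main_thread())
#define IO_CODE()           do { } while (0)
#define IO_OR_GS_CODE()     do { } while (0)

static int graph_nodes;

/* Hypothetical GS function: modifies shared state, so it asserts first. */
static void example_gs_add_node(void)
{
    GLOBAL_STATE_CODE();
    graph_nodes++;
}

/* Hypothetical I/O function: the marker only documents its category. */
static int example_io_node_count(void)
{
    IO_CODE();
    return graph_nodes;
}
```

Because every function starts with one of the markers, tightening a check later means editing the macro definition only, not each caller.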
When invoked from the main loop, this function is the same
as qemu_mutex_iothread_locked, and returns true if the BQL is held.
When invoked from iothreads or tests, it returns true only
if the current AioContext is the Main Loop.
This essentially just extends qemu_mutex_iothread_locked to work
also in unit tests or other users like storage-daemon, that run
in the Main Loop but end up using the implementation in
stubs/iothread-lock.c.
Using qemu_mutex_iothread_locked in unit tests defaults to false
because they use the implementation in stubs/iothread-lock,
making all assertions added in next patches fail even though the
AioContext is still the main loop.
See the comment in the function header for more information.
Hanna Reitz [Thu, 3 Mar 2022 16:48:14 +0000 (17:48 +0100)]
iotests/185: Add post-READY quit tests
185 tests quitting qemu while a block job is active. It does not
specifically test quitting qemu while a mirror or active commit job is
in its READY phase.
Add two test cases for this, where we respectively mirror or commit to
an external QSD instance, which provides a throttled block device. qemu
is supposed to cancel the job so that it can quit as soon as possible
instead of waiting for the job to complete (which it did before 6.2).
Hanna Reitz [Thu, 3 Mar 2022 16:48:13 +0000 (17:48 +0100)]
qsd: Add --daemonize
To implement this, we reuse the existing daemonizing functions from the
system emulator, which mainly do the following:
- Fork off a child process, and set up a pipe between parent and child
- The parent process waits until the child sends a status byte over the
pipe (0 means that the child was set up successfully; anything else
(including errors or EOF) means that the child was not set up
successfully), and then exits with an appropriate exit status
- The child process enters a new session (forking off again), changes
the umask, and will ignore terminal signals from then on
- Once set-up is complete, the child will chdir to /, redirect all
standard I/O streams to /dev/null, and tell the parent that set-up has
been completed successfully
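The parent/child handshake in the steps above can be sketched with plain POSIX calls. This is a simplified model under stated assumptions, not os_daemonize() itself: all function names are hypothetical, the umask value is illustrative, the second fork and the redirection to /dev/null are omitted, and the child exits right away so the sketch terminates.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent side: 0 means success; any other byte, an error, or EOF is failure. */
static int parent_wait_status(int read_fd)
{
    unsigned char status;
    ssize_t n;

    do {
        n = read(read_fd, &status, 1);
    } while (n == -1 && errno == EINTR);

    return (n == 1 && status == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}

/* Child side: send one status byte over the pipe. */
static int child_report(int write_fd, unsigned char status)
{
    return write(write_fd, &status, 1) == 1 ? 0 : -1;
}

/* Returns the exit status the parent would use; the child never returns. */
static int daemonize_handshake_sketch(void)
{
    int fds[2];
    int status;
    pid_t pid;

    if (pipe(fds) == -1) {
        return EXIT_FAILURE;
    }
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return EXIT_FAILURE;
    }
    if (pid == 0) {
        close(fds[0]);
        setsid();                       /* enter a new session */
        umask(027);                     /* illustrative value */
        if (chdir("/") == -1) {
            child_report(fds[1], 1);
            _exit(EXIT_FAILURE);
        }
        child_report(fds[1], 0);        /* set-up complete */
        _exit(EXIT_SUCCESS);            /* a real daemon would keep running */
    }
    close(fds[1]);
    status = parent_wait_status(fds[0]);
    close(fds[0]);
    waitpid(pid, NULL, 0);              /* only because the sketch child exits */
    return status;
}
```

Treating EOF as failure means a child that dies before reporting is detected without any extra signalling.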
In contrast to qemu-nbd's --fork implementation, during the set up
phase, error messages are not piped through the parent process.
qemu-nbd mainly does this to detect errors, though (while os_daemonize()
has the child explicitly signal success after set up); because we do not
redirect stderr after forking, error messages continue to appear on
whatever the parent's stderr was (until set up is complete).
Hanna Reitz [Thu, 3 Mar 2022 16:48:12 +0000 (17:48 +0100)]
qsd: Add pre-init argument parsing pass
In contrast to qemu-nbd (where it is called --fork) and the system
emulator, QSD does not have a --daemonize switch yet. Just like them,
QSD allows setting up block devices and exports on the command line.
When doing so, it is often necessary for whoever invoked the QSD to wait
until these exports are fully set up. A --daemonize switch allows
precisely this, by virtue of the parent process exiting once everything
is set up.
Note that there are alternative ways of waiting for all exports to be
set up, for example:
- Passing the --pidfile option and waiting until the respective file
exists (but I do not know if there is a way of implementing this
without a busy wait loop)
- Set up some network server (e.g. on a Unix socket) and have the QSD
connect to it after all arguments have been processed by appending
corresponding --chardev and --monitor options to the command line,
and then wait until the QSD connects
Having a --daemonize option would make this simpler, though, without
having to rely on additional tools (to set up a network server) or busy
waiting.
Implementing a --daemonize switch means having to fork the QSD process.
Ideally, we should do this as early as possible: All the parent process
has to do is to wait for the child process to signal completion of its
set-up phase, and therefore there is basically no initialization that
needs to be done before the fork. On the other hand, forking after
initialization steps means having to consider how those steps (like
setting up the block layer or QMP) interact with a later fork, which is
often not trivial.
In order to fork this early, we must scan the command line for
--daemonize long before our current process_options() call. Instead of
adding custom new code to do so, just reuse process_options() and give
it a @pre_init_pass argument to distinguish the two passes. I believe
there are some other switches besides --daemonize that deserve parsing in
the first pass:
- --help and --version are supposed to only print some text and then
immediately exit (so any initialization we do would be for naught).
This changes behavior, because now "--blockdev inv-drv --help" will
print a help text instead of complaining about the --blockdev
argument.
Note that this is similar in behavior to other tools, though: "--help"
is generally immediately acted upon when finding it in the argument
list, potentially before other arguments (even ones before it) are
acted on. For example, "ls /does-not-exist --help" prints a help text
and does not complain about ENOENT.
- --pidfile does not need initialization, and is already exempted from
the sequential order that process_options() claims to strictly follow
(the PID file is only created after all arguments are processed, not
at the time the --pidfile argument appears), so it makes sense to
include it in the same category as --daemonize.
- Invalid arguments should always be reported as soon as possible. (The
same caveat with --help applies: That means that "--blockdev inv-drv
--inv-arg" will now complain about --inv-arg, not inv-drv.)
This patch does make some references to --daemonize without having
implemented it yet, but that will happen in the next patch.
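The two-pass idea can be sketched with getopt_long(). This is a reduced model, not the storage daemon's process_options(): the function name, the struct, and the option codes are hypothetical, and only three switches are handled. Each switch is acted on in exactly one pass, while invalid arguments are rejected in whichever pass sees them first, which is the pre-init pass.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result structure for the sketch. */
struct opts_sketch {
    bool daemonize;
    const char *pidfile;
    int blockdevs;
};

enum { OPT_DAEMONIZE = 256, OPT_PIDFILE, OPT_BLOCKDEV };

static int process_options_sketch(int argc, char **argv, bool pre_init_pass,
                                  struct opts_sketch *o)
{
    static const struct option long_opts[] = {
        { "daemonize", no_argument,       NULL, OPT_DAEMONIZE },
        { "pidfile",   required_argument, NULL, OPT_PIDFILE },
        { "blockdev",  required_argument, NULL, OPT_BLOCKDEV },
        { NULL, 0, NULL, 0 },
    };
    int c;

    optind = 1;     /* restart scanning so the function can run twice */
    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case OPT_DAEMONIZE:
            if (pre_init_pass) {
                o->daemonize = true;
            }
            break;
        case OPT_PIDFILE:
            if (pre_init_pass) {
                o->pidfile = optarg;
            }
            break;
        case OPT_BLOCKDEV:
            /* Needs an initialized block layer: second pass only. */
            if (!pre_init_pass) {
                o->blockdevs++;
            }
            break;
        default:
            return -1;  /* invalid argument, reported as early as possible */
        }
    }
    return 0;
}
```

A caller would run the function once with pre_init_pass set, fork if daemonize was requested, initialize, and then run it again with pre_init_pass cleared.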
Hanna Reitz [Thu, 3 Mar 2022 16:48:11 +0000 (17:48 +0100)]
os-posix: Add os_set_daemonize()
The daemonizing functions in os-posix (os_daemonize() and
os_setup_post()) only daemonize the process if the static `daemonize`
variable is set. Right now, it can only be set by os_parse_cmd_args().
In order to use os_daemonize() and os_setup_post() from the storage
daemon to have it be daemonized, we need some other way to set this
`daemonize` variable, because I would rather not tap into the system
emulator's arg-parsing code. Therefore, this patch adds an
os_set_daemonize() function, which will return an error on os-win32
(because daemonizing is not supported there).
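A setter of this shape could look roughly like the following. Both functions are hypothetical stand-ins written for this note: one models a POSIX-like build where the flag can be set, the other models a build without daemonizing support. Returning success when the caller asks for false on the unsupported variant is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Models the file-local flag that the daemonizing functions consult. */
static bool daemonize_flag;

/* POSIX-like variant: record the request. */
static int set_daemonize_posix_sketch(bool d)
{
    daemonize_flag = d;
    return 0;
}

/* Variant for a platform without daemonizing: refuse a request to enable. */
static int set_daemonize_unsupported_sketch(bool d)
{
    return d ? -ENOTSUP : 0;
}
```

Giving both variants the same signature lets a caller report the error without any platform-specific code.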
Stefan Hajnoczi [Tue, 22 Feb 2022 14:01:50 +0000 (14:01 +0000)]
cpus: use coroutine TLS macros for iothread_locked
qemu_mutex_iothread_locked() may be used from coroutines. Standard
__thread variables cannot be used by coroutines. Use the coroutine TLS
macros instead.
Stefan Hajnoczi [Tue, 22 Feb 2022 14:01:47 +0000 (14:01 +0000)]
tls: add macros for coroutine-safe TLS variables
Compiler optimizations can cache TLS values across coroutine yield
points, resulting in stale values from the previous thread when a
coroutine is re-entered by a new thread.
Serge Guelton developed an __attribute__((noinline)) wrapper and tested
it with clang and gcc. I formatted his idea according to QEMU's coding
style and wrote documentation.
The compiler can still optimize based on analyzing noinline code, so an
asm volatile barrier with an output constraint is required to prevent
unwanted optimizations.
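The technique can be sketched as a macro that hides a thread-local variable behind noinline accessors. This is an illustration for GCC and Clang only, not the macro set added by the patch: DEFINE_CO_TLS_SKETCH and the generated accessor names are hypothetical. The empty asm statement takes the pointer as an in/out operand so the compiler cannot assume it still knows the address after the barrier.

```c
/*
 * Define a thread-local variable plus get_/set_ accessors.
 * The accessors are noinline, and the asm barrier with an output
 * constraint hides the TLS address computation from the optimizer.
 */
#define DEFINE_CO_TLS_SKETCH(type, var)                                  \
    static __thread type co_tls_##var;                                   \
    static __attribute__((noinline, unused)) type get_##var(void)        \
    {                                                                    \
        type *ptr = &co_tls_##var;                                       \
        __asm__ volatile("" : "+rm" (ptr));                              \
        return *ptr;                                                     \
    }                                                                    \
    static __attribute__((noinline, unused)) void set_##var(type v)      \
    {                                                                    \
        type *ptr = &co_tls_##var;                                       \
        __asm__ volatile("" : "+rm" (ptr));                              \
        *ptr = v;                                                        \
    }

/* Hypothetical variable, named after the use case in this series. */
DEFINE_CO_TLS_SKETCH(int, iothread_locked_sketch)
```

Code that may run in a coroutine would then call get_iothread_locked_sketch() on every access instead of naming the variable directly.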
block: move BQL logic of bdrv_co_invalidate_cache in bdrv_activate
Split bdrv_co_invalidate_cache in two: the Global State (under BQL)
code that takes care of permissions and running GS callbacks,
and leave only the I/O code (->bdrv_co_invalidate_cache) running in
the I/O coroutine.
The only side effect is that bdrv_co_invalidate_cache is not
recursive anymore, and so is every direct call to
bdrv_invalidate_cache().
This function is currently just a wrapper for bdrv_invalidate_cache(),
but in future will contain the code of bdrv_co_invalidate_cache() that
has to always be protected by BQL, and leave the rest in the I/O
coroutine.
Replace all bdrv_invalidate_cache() invocations with bdrv_activate().
crypto: distinguish between main loop and I/O in block_crypto_amend_options_generic_luks
block_crypto_amend_options_generic_luks uses the block layer
permission API, therefore it should be called with the BQL held.
However, the same function is being called by two BlockDriver
callbacks: bdrv_amend_options (under BQL) and bdrv_co_amend (I/O).
The latter is I/O because it is invoked by block/amend.c's
blockdev_amend_run(), a .run callback of the amend JobDriver.
Therefore we want to change this function to still perform
the permission check, but making sure it is done under BQL regardless
of the caller context.
Remove the permission check in block_crypto_amend_options_generic_luks()
and:
- in block_crypto_amend_options_luks() (BQL case, called by
.bdrv_amend_options()), reuse helper functions
block_crypto_amend_{prepare/cleanup} that take care of checking
permissions.
- for block_crypto_co_amend_luks() (I/O case, called by
.bdrv_co_amend()), don't check for permissions but delegate
.bdrv_amend_pre_run() and .bdrv_amend_clean() to do it,
performing these checks before and after the job runs in its aiocontext.
Move the permission API calls into driver-specific callbacks
that always run under BQL. In this case, bdrv_crypto_luks
needs to perform permission checks before and after
qcrypto_block_amend_options(). The problem is that the caller,
block_crypto_amend_options_generic_luks(), can also run in I/O
from .bdrv_co_amend(). This does not comply with Global State-I/O API split,
as permissions API must always run under BQL.
Firstly, introduce .bdrv_amend_pre_run() and .bdrv_amend_clean()
callbacks. These two callbacks are guaranteed to be invoked under
BQL, respectively before and after .bdrv_co_amend().
They take care of performing the permission checks
in the same way as they are currently done before and after
qcrypto_block_amend_options().
These callbacks are in preparation for next patch, where we
delete the original permission check. Right now they just add redundant
control.
Then, call .bdrv_amend_pre_run() before job_start in
qmp_x_blockdev_amend(), so that it will be run before the job coroutine
is created and stay in the main loop.
As a cleanup, use JobDriver's .clean() callback to call
.bdrv_amend_clean(), and run amend-specific cleanup callbacks under BQL.
After this patch, permission failures occur early in the blockdev-amend
job to update a LUKS volume's keys. iotest 296 must now expect them in
x-blockdev-amend's QMP reply instead of waiting for the actual job to
fail later.
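The ordering of the callbacks can be modelled in a few lines. Everything below is a hypothetical model rather than the block layer code: the struct, the function names, and the trace helper exist only for this sketch, and the job body runs synchronously here whereas in QEMU it runs later in the job coroutine. Whether cleanup also runs after a failed pre-run step is not stated above, so the sketch simply skips it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback table: pre_run and clean run under the BQL. */
typedef struct AmendOpsSketch {
    int  (*pre_run)(void *opaque);
    int  (*run)(void *opaque);
    void (*clean)(void *opaque);
} AmendOpsSketch;

/* Model of the command handler: pre_run, then the job, then clean. */
static int amend_start_sketch(const AmendOpsSketch *ops, void *opaque)
{
    int ret;

    if (ops->pre_run) {
        ret = ops->pre_run(opaque);
        if (ret < 0) {
            return ret;     /* reported in the command reply; job never starts */
        }
    }
    ret = ops->run(opaque);
    if (ops->clean) {
        ops->clean(opaque);
    }
    return ret;
}

/* Demo callbacks that record the order in which they were called. */
typedef struct TraceSketch {
    char log[8];
    size_t len;
    int deny;
} TraceSketch;

static void trace_add(TraceSketch *t, char c)
{
    if (t->len + 1 < sizeof(t->log)) {
        t->log[t->len++] = c;
        t->log[t->len] = '\0';
    }
}

static int demo_pre_run(void *opaque)
{
    TraceSketch *t = opaque;

    trace_add(t, 'P');
    return t->deny ? -EPERM : 0;
}

static int demo_run(void *opaque)
{
    trace_add(opaque, 'R');
    return 0;
}

static void demo_clean(void *opaque)
{
    trace_add(opaque, 'C');
}
```

In this model a permission failure surfaces from amend_start_sketch() itself, which mirrors why the iotest now sees the error in the command reply.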
Peter Maydell [Fri, 4 Mar 2022 15:31:23 +0000 (15:31 +0000)]
Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging
hw/nvme updates
- add enhanced protection information (64-bit guard)
# gpg: Signature made Fri 04 Mar 2022 06:23:36 GMT
# gpg: using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <[email protected]>" [unknown]
# gpg: aka "Klaus Jensen <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468 4272 63D5 6FC5 E55D A838
# Subkey fingerprint: 5228 33AA 75E2 DCE6 A247 66C0 4DE1 AF31 6D4F 0DE9
* remotes/nvme/tags/nvme-next-pull-request:
hw/nvme: 64-bit pi support
hw/nvme: add pi tuple size helper
hw/nvme: add support for the lbafee hbs feature
hw/nvme: move format parameter parsing
hw/nvme: add host behavior support feature
hw/nvme: move dif/pi prototypes into dif.h
Peter Maydell [Fri, 4 Mar 2022 10:32:12 +0000 (10:32 +0000)]
Merge remote-tracking branch 'remotes/rth-gitlab/tags/pull-nios-20220303' into staging
Rewrite nios2 interrupt handling
# gpg: Signature made Thu 03 Mar 2022 19:52:33 GMT
# gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg: issuer "[email protected]"
# gpg: Good signature from "Richard Henderson <[email protected]>" [full]
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F
* remotes/rth-gitlab/tags/pull-nios-20220303:
target/nios2: Rewrite interrupt handling
target/nios2: Special case ipending in rdctl and wrctl
target/nios2: Split mmu_write
target/nios2: Hoist R_ZERO check in rdctl
target/nios2: Only build mmu.c for system mode
target/nios2: Replace MMU_LOG with tracepoints
target/nios2: Remove mmu_read_debug
Peter Maydell [Thu, 3 Mar 2022 19:59:38 +0000 (19:59 +0000)]
Merge remote-tracking branch 'remotes/alistair/tags/pull-riscv-to-apply-20220303' into staging
Fifth RISC-V PR for QEMU 7.0
* Fixup checks for ext_zb[abcs]
* Add AIA support for virt machine
* Increase maximum number of CPUs in virt machine
* Fixup OpenTitan SPI address
* Add support for zfinx, zdinx and zhinx{min} extensions
# gpg: Signature made Thu 03 Mar 2022 05:26:55 GMT
# gpg: using RSA key F6C4AC46D4934868D3B8CE8F21E10D29DF977054
# gpg: Good signature from "Alistair Francis <[email protected]>" [full]
# Primary key fingerprint: F6C4 AC46 D493 4868 D3B8 CE8F 21E1 0D29 DF97 7054
* remotes/alistair/tags/pull-riscv-to-apply-20220303:
target/riscv: expose zfinx, zdinx, zhinx{min} properties
target/riscv: add support for zhinx/zhinxmin
target/riscv: add support for zdinx
target/riscv: add support for zfinx
target/riscv: hardwire mstatus.FS to zero when enable zfinx
target/riscv: add cfg properties for zfinx, zdinx and zhinx{min}
hw: riscv: opentitan: fixup SPI addresses
hw/riscv: virt: Increase maximum number of allowed CPUs
docs/system: riscv: Document AIA options for virt machine
hw/riscv: virt: Add optional AIA IMSIC support to virt machine
hw/intc: Add RISC-V AIA IMSIC device emulation
hw/riscv: virt: Add optional AIA APLIC support to virt machine
target/riscv: fix inverted checks for ext_zb[abcs]
Previously, we would avoid setting CPU_INTERRUPT_HARD when interrupts
are disabled at a particular point in time, instead queuing the value
into cpu->irq_pending. This is more complicated than required.
Instead, set CPU_INTERRUPT_HARD any time there is a pending interrupt,
and exclusively check for interrupts disabled in nios2_cpu_exec_interrupt.
target/nios2: Special case ipending in rdctl and wrctl
It was never correct to be able to write to ipending.
Until the rest of the irq code is tidied, the read of
ipending will generate an "unnecessary" mask.
Create three separate functions for the three separate registers.
Avoid extra dispatch through op_helper.c.
Dispatch to the correct function in translation.
Clean up the ifdefs in wrctl.
* remotes/pmaydell/tags/pull-target-arm-20220302: (26 commits)
ui/cocoa.m: Remove unnecessary NSAutoreleasePools
ui/cocoa.m: Fix updateUIInfo threading issues
target/arm: Report KVM's actual PSCI version to guest in dtb
target/arm: Implement FEAT_LPA2
target/arm: Advertise all page sizes for -cpu max
target/arm: Validate tlbi TG matches translation granule in use
target/arm: Fix TLBIRange.base for 16k and 64k pages
target/arm: Introduce tlbi_aa64_get_range
target/arm: Extend arm_fi_to_lfsc to level -1
target/arm: Implement FEAT_LPA
target/arm: Implement FEAT_LVA
target/arm: Prepare DBGBVR and DBGWVR for FEAT_LVA
target/arm: Honor TCR_ELx.{I}PS
target/arm: Use MAKE_64BIT_MASK to compute indexmask
target/arm: Pass outputsize down to check_s2_mmu_setup
target/arm: Move arm_pamax out of line
target/arm: Fault on invalid TCR_ELx.TxSZ
target/arm: Set TCR_EL1.TSZ for user-only
hw/registerfields: Add FIELD_SEX<N> and FIELD_SDP<N>
tests/qtest: add qtests for npcm7xx sdhci
...
Naveen Nagar [Tue, 16 Nov 2021 13:26:52 +0000 (18:56 +0530)]
hw/nvme: 64-bit pi support
This adds support for one possible new protection information format
introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard
and 48-bit reference tag. This version does not support storage tags.
Like the CRC16 support already present, this uses a software
implementation of CRC64 (so it is naturally pretty slow). But it's good
enough for verification purposes.
This may go nicely hand-in-hand with the support that Keith submitted
for the Linux kernel[1].
Anup Patel [Sun, 20 Feb 2022 08:55:24 +0000 (14:25 +0530)]
hw/riscv: virt: Add optional AIA IMSIC support to virt machine
We extend virt machine to emulate both AIA IMSIC and AIA APLIC
devices only when "aia=aplic-imsic" parameter is passed along
with machine name in the QEMU command-line. The AIA IMSIC is
only a per-HART MSI controller so we use AIA APLIC in MSI-mode
to forward all wired interrupts as MSIs to the AIA IMSIC.
We also provide "aia-guests=<xyz>" parameter which can be used
to specify number of VS-level AIA IMSIC Guests MMIO pages for
each HART.
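The values accepted for the "aia" machine parameter in this series ("none", "aplic" and "aplic-imsic") can be mapped to an enum as in the sketch below. The enum and function names are hypothetical, and treating a missing value like "none" follows the fallback described for the APLIC patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical enum for the sketch. */
typedef enum {
    AIA_SKETCH_NONE,          /* original PLIC emulation */
    AIA_SKETCH_APLIC,         /* wired interrupts through the APLIC */
    AIA_SKETCH_APLIC_IMSIC,   /* APLIC in MSI mode forwarding to the IMSIC */
    AIA_SKETCH_INVALID,
} AiaSketch;

static AiaSketch parse_aia_sketch(const char *val)
{
    if (val == NULL || strcmp(val, "none") == 0) {
        return AIA_SKETCH_NONE;
    }
    if (strcmp(val, "aplic") == 0) {
        return AIA_SKETCH_APLIC;
    }
    if (strcmp(val, "aplic-imsic") == 0) {
        return AIA_SKETCH_APLIC_IMSIC;
    }
    return AIA_SKETCH_INVALID;
}
```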
Anup Patel [Sun, 20 Feb 2022 08:55:23 +0000 (14:25 +0530)]
hw/intc: Add RISC-V AIA IMSIC device emulation
The RISC-V AIA (Advanced Interrupt Architecture) defines a new
interrupt controller for MSIs (message-signaled interrupts) called
IMSIC (Incoming Message Signal Interrupt Controller). The IMSIC
is a per-HART device and also supports virtualization of MSIs using
dedicated VS-level guest interrupt files.
This patch adds device emulation for RISC-V AIA IMSIC which
supports M-level, S-level, and VS-level MSIs.
Anup Patel [Sun, 20 Feb 2022 08:55:22 +0000 (14:25 +0530)]
hw/riscv: virt: Add optional AIA APLIC support to virt machine
We extend virt machine to emulate AIA APLIC devices only when
"aia=aplic" parameter is passed along with machine name in QEMU
command-line. When "aia=none" is passed or the parameter is not
specified, we fall back to the original PLIC device emulation.
Philipp Tomsich [Thu, 3 Feb 2022 15:39:45 +0000 (16:39 +0100)]
target/riscv: fix inverted checks for ext_zb[abcs]
While changing to the use of cfg_ptr, the conditions for REQUIRE_ZB[ABCS]
inadvertently became inverted and slipped through the initial testing (which
used RV64GC_XVentanaCondOps as a target).
This fixes the regression.
Tested against SPEC2017 w/ GCC 12 (prerelease) for RV64GC_zba_zbb_zbc_zbs.
Peter Maydell [Wed, 2 Mar 2022 20:55:48 +0000 (20:55 +0000)]
Merge remote-tracking branch 'remotes/dgilbert-gitlab/tags/pull-migration-20220302b' into staging
Migration/HMP/Virtio pull 2022-03-02
A bit of a mix this time:
* Minor fixes from myself, Hanna, and Jack
* VNC password rework by Stefan and Fabian
* Postcopy changes from Peter X that are
the start of a larger series to come
* Removing the prehistoric load_state_old
code from Peter M
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
# gpg: Signature made Wed 02 Mar 2022 18:25:12 GMT
# gpg: using RSA key 45F5C71B4A0CB7FB977A9FA90516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <[email protected]>" [full]
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A 9FA9 0516 331E BC5B FDE7
* remotes/dgilbert-gitlab/tags/pull-migration-20220302b:
migration: Remove load_state_old and minimum_version_id_old
tests: Pass in MigrateStart** into test_migrate_start()
migration: Add migration_incoming_transport_cleanup()
migration: postcopy_pause_fault_thread() never fails
migration: Enlarge postcopy recovery to capture !-EIO too
migration: Move static var in ram_block_from_stream() into global
migration: Add postcopy_thread_create()
migration: Dump ramblock and offset too when non-same-page detected
migration: Introduce postcopy channels on dest node
migration: Tracepoint change in postcopy-run bottom half
migration: Finer grained tracepoints for POSTCOPY_LISTEN
migration: Dump sub-cmd name in loadvm_process_command tp
migration/rdma: set the REUSEADDR option for destination
qapi/monitor: allow VNC display id in set/expire_password
qapi/monitor: refactor set/expire_password with enums
monitor/hmp: add support for flag argument with value
virtiofsd: Let meson check for statx.stx_mnt_id
clock-vmstate: Add missing END_OF_LIST
Peter Maydell [Thu, 24 Feb 2022 10:13:30 +0000 (10:13 +0000)]
ui/cocoa.m: Remove unnecessary NSAutoreleasePools
In commit 6e657e64cdc478 in 2013 we added some autorelease pools to
deal with complaints from macOS when we made calls into Cocoa from
threads that didn't have automatically created autorelease pools.
Later on, macOS got stricter about forbidding cross-thread Cocoa
calls, and in commit 5588840ff77800e839d8 we restructured the code to
avoid them. This left the autorelease pool creation in several
functions without any purpose; delete it.
We still need the pool in cocoa_refresh() for the clipboard related
code which is called directly there.
Peter Maydell [Thu, 24 Feb 2022 10:13:29 +0000 (10:13 +0000)]
ui/cocoa.m: Fix updateUIInfo threading issues
The updateUIInfo method makes Cocoa API calls. It also calls back
into QEMU functions like dpy_set_ui_info(). To do this safely, we
need to follow two rules:
* Cocoa API calls are made on the Cocoa UI thread
* When calling back into QEMU we must hold the iothread lock
Fix the places where we got this wrong, by taking the iothread lock
while executing updateUIInfo, and moving the call in cocoa_switch()
inside the dispatch_async block.
Some of the Cocoa UI methods which call updateUIInfo are invoked as
part of the initial application startup, while we're still doing the
little cross-thread dance described in the comment just above
call_qemu_main(). This meant they were calling back into the QEMU UI
layer before we'd actually finished initializing our display and
registered the DisplayChangeListener, which isn't really valid. Once
updateUIInfo takes the iothread lock, we no longer get away with
this, because during this startup phase the iothread lock is held by
the QEMU main-loop thread which is waiting for us to finish our
display initialization. So we must suppress updateUIInfo until
applicationDidFinishLaunching allows the QEMU main-loop thread to
continue.
Peter Maydell [Thu, 24 Feb 2022 13:46:54 +0000 (13:46 +0000)]
target/arm: Report KVM's actual PSCI version to guest in dtb
When we're using KVM, the PSCI implementation is provided by the
kernel, but QEMU has to tell the guest about it via the device tree.
Currently we look at the KVM_CAP_ARM_PSCI_0_2 capability to determine
if the kernel is providing at least PSCI 0.2, but if the kernel
provides a newer version than that we will still only tell the guest
it has PSCI 0.2. (This is fairly harmless; it just means the guest
won't use newer parts of the PSCI API.)
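A minimal sketch of mapping a PSCI version number to a device tree compatible string follows. It assumes the usual PSCI encoding (major version in the upper 16 bits, minor in the lower 16) and the binding strings "arm,psci", "arm,psci-0.2" and "arm,psci-1.0". The macro and function names are hypothetical, and the sketch returns only the most specific string rather than building the full property.

```c
#include <stdint.h>
#include <string.h>

/* PSCI version encoding: major in bits [31:16], minor in bits [15:0]. */
#define PSCI_VERSION_SKETCH(major, minor) \
    (((uint32_t)(major) << 16) | (uint32_t)(minor))

/* Pick the most specific binding the reported version satisfies. */
static const char *psci_compatible_sketch(uint32_t version)
{
    if (version >= PSCI_VERSION_SKETCH(1, 0)) {
        return "arm,psci-1.0";
    }
    if (version >= PSCI_VERSION_SKETCH(0, 2)) {
        return "arm,psci-0.2";
    }
    return "arm,psci";
}
```

Any version from 1.0 upward maps to the same string, since the binding has no more specific level to offer.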
The kernel exposes the specific PSCI version it is implementing via
the ONE_REG API; use this to report in the dtb that the PSCI
implementation is 1.0-compatible if appropriate. (The device tree
binding currently only distinguishes "pre-0.2", "0.2-compatible" and
"1.0-compatible".)
This feature widens physical addresses (and intermediate physical
addresses for 2-stage translation) from 48 to 52 bits, when using
4k or 16k pages.
This introduces the DS bit to TCR_ELx, which is RES0 unless the
page size is enabled and supports LPA2, resulting in the effective
value of DS for a given table walk. The DS bit changes the format
of the page table descriptor slightly, moving the PS field out to
TCR so that all pages have the same shareability and repurposing
those bits of the page table descriptor for the highest bits of
the output address.
Do not yet enable FEAT_LPA2; we need extra plumbing to avoid
tickling an old kernel bug.
We support 16k pages, but do not advertise that in ID_AA64MMFR0.
The value 0 in the TGRAN*_2 fields indicates that stage2 lookups defer
to the same support as stage1 lookups. This setting is deprecated, so
indicate support for all stage2 page sizes directly.
target/arm: Validate tlbi TG matches translation granule in use
For FEAT_LPA2, we will need other ARMVAParameters, which themselves
depend on the translation granule in use. We might as well validate
that the given TG matches; the architecture "does not require that
the instruction invalidates any entries" if this is not true.
Merge tlbi_aa64_range_get_length and tlbi_aa64_range_get_base,
returning a structure containing both results. Pass in the
ARMMMUIdx, rather than the digested two_ranges boolean.
This is in preparation for FEAT_LPA2, where the interpretation
of 'value' depends on the effective value of DS for the regime.
With FEAT_LPA2, rather than introducing translation level 4,
we introduce level -1, below the current level 0. Extend
arm_fi_to_lfsc to handle these faults.
Assert that this new translation level does not leak into
fault types for which it is not defined, which allows some
masking of fi->level to be removed.
This feature widens physical addresses (and intermediate physical
addresses for 2-stage translation) from 48 to 52 bits, when using
64k pages. The only thing left at this point is to handle the
extra bits in the TTBR and in the table descriptors.
Note that PAR_EL1 and HPFAR_EL2 are nominally extended, but we don't
mask out the high bits when writing to those registers, so no changes
are required there.
This feature is relatively small, as it applies only to
64k pages and thus requires no additional changes to the
table descriptor walking algorithm, only a change to the
minimum TSZ (which is the inverse of the maximum virtual
address space size).
Note that this feature widens VBAR_ELx, but we already
treat the register as being 64 bits wide.
target/arm: Prepare DBGBVR and DBGWVR for FEAT_LVA
The original A.a revision of the AArch64 ARM required that we
force-extend the addresses in these registers from 49 bits.
This language has been loosened via a combination of IMPLEMENTATION
DEFINED and CONSTRAINED UNPREDICTABLE to allow consideration of
the entire aligned address.
This means that we do not have to consider whether or not FEAT_LVA
is enabled, and decide from which bit an address might need to be
extended.
This field controls the output (intermediate) physical address size
of the translation process. V8 requires raising an AddressSize
fault if the page tables are programmed incorrectly, such that any
intermediate descriptor address, or the final translated address,
is out of range.
Add a PS field to ARMVAParameters, and properly compute outputsize
in get_phys_addr_lpae. Test the descaddr as extracted from TTBR
and from page table entries.
Restrict descaddrmask so that we won't raise the fault for v7.
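The range test behind the AddressSize fault can be sketched as a check for any bits set at or above the configured output size. The function name is hypothetical, and the sketch covers only the comparison itself, not how outputsize is derived from the PS field.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr does not fit in outputsize bits, i.e. the case
 * that would be reported as an AddressSize fault.
 */
static bool addr_exceeds_outputsize_sketch(uint64_t addr, unsigned outputsize)
{
    if (outputsize >= 64) {
        return false;   /* every 64-bit value fits; also avoids an invalid shift */
    }
    return (addr >> outputsize) != 0;
}
```

The same test applies to the address taken from TTBR, to each intermediate descriptor address, and to the final translated address.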
target/arm: Pass outputsize down to check_s2_mmu_setup
Pass down the width of the output address from translation.
For now this is still just PAMax, but a subsequent patch will
compute the correct value from TCR_ELx.{I}PS.
Without FEAT_LVA, the behaviour of programming an invalid value
is IMPLEMENTATION DEFINED. With FEAT_LVA, programming an invalid
minimum value requires a Translation fault.
It is most self-consistent to choose to generate the fault always.
Wentao_Liang [Fri, 25 Feb 2022 04:01:42 +0000 (12:01 +0800)]
target/arm: Fix early free of TCG temp in handle_simd_shift_fpint_conv()
handle_simd_shift_fpint_conv() was accidentally freeing the TCG
temporary tcg_fpstatus too early, before the last use of it. Move
the free down to where it belongs.
Akihiko Odaki [Sun, 13 Feb 2022 03:57:53 +0000 (12:57 +0900)]
target/arm: Support PSCI 1.1 and SMCCC 1.0
Support the latest PSCI on TCG and HVF. A 64-bit function called from
AArch32 now returns NOT_SUPPORTED, which is necessary to adhere to SMC
Calling Convention 1.0. It is still not compliant with SMCCC 1.3 since
they do not implement mandatory functions.
Peter Maydell [Mon, 21 Feb 2022 14:07:50 +0000 (14:07 +0000)]
hw/input/tsc210x: Don't abort on bad SPI word widths
The tsc210x doesn't support anything other than 16-bit reads on the
SPI bus, but the guest can program the SPI controller to attempt
them anyway. If this happens, don't abort QEMU, just log this as
a guest error.
This fixes our machine_arm_n8x0.py:N8x0Machine.test_n800
acceptance test, which hits this assertion.
The reason we hit the assertion is because the guest kernel thinks
there is a TSC2005 on this SPI bus address, not a TSC210x. (The n810
*does* have a TSC2005 at this address.) The TSC2005 supports the
24-bit accesses which the guest driver makes, and the TSC210x does
not (that is, our TSC210x emulation is not missing support for a word
width the hardware can handle). It's not clear whether the problem
here is that the guest kernel incorrectly thinks the n800 has the
same device at this SPI bus address as the n810, or that QEMU's n810
board model doesn't get the SPI devices right. At this late date
there no longer appears to be any reliable information on the web
about the hardware behaviour, but I am inclined to think this is a
guest kernel bug. In any case, we prefer not to abort QEMU for
guest-triggerable conditions, so logging the error is the right thing
to do.
Peter Maydell [Mon, 21 Feb 2022 09:41:44 +0000 (09:41 +0000)]
hw/arm/mps2-tz.c: Update AN547 documentation URL
The AN547 application note URL has changed: update our comment
accordingly. (Rev B is still downloadable from the old URL,
but there is a new Rev C of the document now.)
Jimmy Brisson [Thu, 10 Feb 2022 21:02:27 +0000 (15:02 -0600)]
mps3-an547: Add missing user ahb interfaces
With these interfaces missing, TFM would delegate peripherals 0, 1,
2, 3 and 8, and qemu would ignore the delegation of interface 8, as
it thought interface 4 was eth & USB.
This patch corrects this behavior and allows TFM to delegate the
eth & USB peripheral to NS mode.
(The old QEMU behaviour was based on revision B of the AN547
appnote; revision C corrects this error in the documentation,
and this commit brings QEMU into line with how the FPGA
image really behaves.)
Peter Maydell [Tue, 15 Feb 2022 17:57:05 +0000 (17:57 +0000)]
migration: Remove load_state_old and minimum_version_id_old
There are no longer any VMStateDescription structs in the tree which
use the load_state_old support for custom handling of incoming
migration from very old QEMU. Remove the mechanism entirely.
This includes removing one stray useless setting of
minimum_version_id_old in a VMStateDescription with no load_state_old
function, which crept in after the global weeding-out of them in
commit 17e313406126.
Peter Xu [Tue, 1 Mar 2022 08:39:25 +0000 (16:39 +0800)]
tests: Pass in MigrateStart** into test_migrate_start()
test_migrate_start() releases the MigrateStart structure that is passed
in, but that is not obvious to the caller, because after the call
returns the caller can still dereference the pointer. That can easily
become a source of use-after-free.
Pass in a double pointer instead, so that we can safely clear the
caller's pointer after the struct is released.
While doing so, also nullify the cleanup hook and its data, so that it is
even safe to call it multiple times.
Move the socket_address_list cleanup along with it, because that list mirrors
the listener channels and exists only for the purpose of query-migrate. Hence
anyone who wants to clean up the listener transport will always want to clean
up the socket list too.
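The double-pointer idea can be sketched as below. The struct layout and the helper names are illustrative assumptions, not copied from the test code; only the "take a pointer to the caller's pointer and clear it" pattern is the point.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the test's MigrateStart arguments. */
typedef struct MigrateStart {
    char *opts_source;
} MigrateStart;

static MigrateStart *migrate_start_new(void)
{
    return calloc(1, sizeof(MigrateStart));
}

/* Takes MigrateStart ** so the caller's pointer is cleared once freed. */
static void migrate_start_destroy(MigrateStart **pargs)
{
    MigrateStart *args = *pargs;

    if (!args) {
        return;             /* already released: calling again is harmless */
    }
    free(args->opts_source);
    free(args);
    *pargs = NULL;          /* the caller no longer holds a stale pointer */
}
```

After migrate_start_destroy(&args) the caller's args is NULL, so an accidental later use fails loudly instead of touching freed memory.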
Peter Xu [Tue, 1 Mar 2022 08:39:10 +0000 (16:39 +0800)]
migration: Enlarge postcopy recovery to capture !-EIO too
We used to have quite a few places making sure -EIO happened and that's the
only way to trigger postcopy recovery. That's based on the assumption that
we'll only return -EIO for channel issues.
It'll work in 99.99% of cases, but logically it won't cover some corner cases.
One example: ram_block_from_stream() could fail because of an interrupted
network, in which case -EINVAL will be returned instead of -EIO.
I remember Dave Gilbert pointing that out before, but somehow it was
overlooked. I have not encountered anything other than the -EIO error either.
However, we'd better fix that up before it triggers a rare VM data loss during
live migration.
To cover as many of those cases as possible, remove the -EIO restriction on
triggering the postcopy recovery, because even if it's not a channel failure,
we can't do anything better than halting QEMU anyway - the corpse of the
process may even be used by a good hand to dig out useful memory regions, or
the admin could simply kill the process later on.
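In rough terms the decision changes as sketched below. Both helper names and the in_postcopy flag are assumptions made for illustration; the real checks are spread across several call sites.

```c
#include <errno.h>
#include <stdbool.h>

/* Old rule (as described above): only -EIO triggers postcopy recovery. */
static bool recover_old(bool in_postcopy, int ret)
{
    return in_postcopy && ret == -EIO;
}

/* New rule: any failure during postcopy triggers recovery. */
static bool recover_new(bool in_postcopy, int ret)
{
    return in_postcopy && ret < 0;
}
```

With the new rule, the -EINVAL from the ram_block_from_stream() example pauses the migration for recovery instead of failing it outright.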
Peter Xu [Tue, 1 Mar 2022 08:39:06 +0000 (16:39 +0800)]
migration: Add postcopy_thread_create()
Postcopy creates threads. A common pattern is to initialise a semaphore and
use it to synchronise with the new thread: fault_thread_sem and
listen_thread_sem exist only for this purpose.
Make this shared infrastructure so that it is easier to create yet another thread.
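The shared pattern looks roughly like this when written with plain POSIX threads and semaphores. It is a sketch under assumptions: ThreadStart, worker() and thread_create_synced() are hypothetical names, and QEMU's own thread and semaphore wrappers are deliberately not used here.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

typedef struct ThreadStart {
    sem_t ready;            /* posted by the thread once it is set up */
    int initialised;        /* hypothetical per-thread setup result */
} ThreadStart;

static void *worker(void *opaque)
{
    ThreadStart *ts = opaque;

    ts->initialised = 1;    /* thread-side setup would go here */
    sem_post(&ts->ready);   /* tell the creator that we are running */
    return NULL;
}

/*
 * Create a thread and return only after it has signalled readiness.
 * The caller joins the thread and then destroys ts->ready.
 */
static int thread_create_synced(pthread_t *tid, ThreadStart *ts,
                                void *(*fn)(void *))
{
    if (sem_init(&ts->ready, 0, 0) != 0) {
        return -1;
    }
    if (pthread_create(tid, NULL, fn, ts) != 0) {
        sem_destroy(&ts->ready);
        return -1;
    }
    while (sem_wait(&ts->ready) == -1 && errno == EINTR) {
        /* interrupted by a signal: wait again */
    }
    return 0;
}
```

Folding the init, create and wait steps into one helper means each new thread needs only a thread function, not its own dedicated semaphore field.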
Peter Xu [Tue, 1 Mar 2022 08:39:05 +0000 (16:39 +0800)]
migration: Dump ramblock and offset too when non-same-page detected
In ram_load_postcopy() we try to detect the non-same-page case and dump an
error. This error is very helpful for debugging. Add the ramblock and offset
to the error log too.
Peter Xu [Tue, 1 Mar 2022 08:39:04 +0000 (16:39 +0800)]
migration: Introduce postcopy channels on dest node
Postcopy handles huge pages in a special way, such that currently we can only
have one "channel" to transfer the page.
That is because when we install pages using UFFDIO_COPY we need to have the
whole huge page ready; it also means we need a temp huge page while receiving
the whole content of the page.
Currently all maintenance around this tmp page is global: first we
allocate a temp huge page, then we maintain its status mostly within
ram_load_postcopy().
To enable multiple channels for postcopy, the first thing we need to do is to
prepare N temp huge pages as caching, one for each channel.
Meanwhile we need to maintain the tmp huge page status per-channel too.
As an example, some of the local variables maintained in ram_load_postcopy()
are listed below; they are responsible for maintaining temp huge page status:
- all_zero: this keeps whether this huge page contains all zeros
- target_pages: this counts how many target pages have been copied
- host_page: this keeps the host ptr for the page to install
Move all these fields to be together with the temp huge pages to form a new
structure called PostcopyTmpPage. Then for each (future) postcopy channel, we
need one structure to keep the state around.
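A sketch of what such a per-channel structure and its setup could look like follows. The three listed field names come from the description above; tmp_huge_page, the helper names and the use of malloc() in place of a real huge-page mapping are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Per-channel temp huge page state; a sketch, not the real definition. */
typedef struct PostcopyTmpPage {
    void *tmp_huge_page;        /* staging buffer for one host huge page */
    void *host_page;            /* host ptr for the page to install */
    unsigned int target_pages;  /* target pages copied so far */
    bool all_zero;              /* whether the huge page is all zeros */
} PostcopyTmpPage;

/* Allocate one temp page state per channel. */
static PostcopyTmpPage *tmp_pages_setup(unsigned int channels, size_t size)
{
    PostcopyTmpPage *pages = calloc(channels, sizeof(*pages));
    unsigned int i;

    if (!pages) {
        return NULL;
    }
    for (i = 0; i < channels; i++) {
        pages[i].tmp_huge_page = malloc(size);
        if (!pages[i].tmp_huge_page) {
            while (i > 0) {
                free(pages[--i].tmp_huge_page);
            }
            free(pages);
            return NULL;
        }
        pages[i].all_zero = true;   /* nothing non-zero received yet */
    }
    return pages;
}

static void tmp_pages_cleanup(PostcopyTmpPage *pages, unsigned int channels)
{
    unsigned int i;

    if (!pages) {
        return;
    }
    for (i = 0; i < channels; i++) {
        free(pages[i].tmp_huge_page);
    }
    free(pages);
}
```

Keeping the status fields next to the buffer they describe is what lets each channel track a partially received huge page independently.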
For vanilla postcopy, obviously there's only one channel. It contains both
precopy and postcopy pages.
This patch teaches the dest migration node to be aware of the possible number
of postcopy channels by introducing the "postcopy_channels" variable. Its
value is calculated when setting up postcopy on the dest node (during the
POSTCOPY_LISTEN phase).
Vanilla postcopy will have channels=1, but when postcopy-preempt capability is
enabled (in the future), we will boost it to 2 because even during partial
sending of a precopy huge page we still want to preempt it and start sending
the postcopy requested page right away (so we start to keep two temp huge
pages; more if we want to enable multifd). In this patch there's a TODO marked
for that; so far the channel count is always set to 1.
We need to send one "host huge page" on one channel only and we cannot split
it, because otherwise the data for the same huge page could end up on more
than one channel, and managing that would need more complicated logic. One
temp host huge page for each channel will be enough for us for now.
Postcopy will still always use the index=0 huge page even after this patch.
However it prepares for the later patches, where it can start to use multiple
channels (which needs src intervention, because only src knows which channel we
should use).
Jack Wang [Tue, 8 Feb 2022 08:56:40 +0000 (09:56 +0100)]
migration/rdma: set the REUSEADDR option for destination
We hit the following error while testing the RDMA transport:
in case of a migration error, the mgmt daemon picks one migration port,
incoming rdma:[::]:8089: RDMA ERROR: Error: could not rdma_bind_addr
Then it tries another, -incoming rdma:[::]:8103; sometimes that works,
sometimes it needs another try with a different port number.
Set the REUSEADDR option for the destination. This allows the address to
be reused, which avoids the rdma_bind_addr error.
Stefan Reiter [Fri, 25 Feb 2022 08:49:49 +0000 (09:49 +0100)]
qapi/monitor: allow VNC display id in set/expire_password
It is possible to specify more than one VNC server on the command line,
either with an explicit ID or the auto-generated ones à la "default",
"vnc2", "vnc3", ...
It is not possible to change the password on one of these extra VNC
displays though. Fix this by adding a "display" parameter to the
"set_password" and "expire_password" QMP and HMP commands.
For HMP, the display is specified using the "-d" value flag.
For QMP, the schema is updated to explicitly express the supported
variants of the commands with protocol-discriminated unions.
Stefan Reiter [Fri, 25 Feb 2022 08:49:47 +0000 (09:49 +0100)]
monitor/hmp: add support for flag argument with value
Adds support for the "-xs" parameter type, where "-x" denotes a flag
name and the "s" suffix indicates that this flag is supposed to take
an arbitrary string parameter.
These parameters are always optional; the entry in the qdict will be
omitted if the flag is not given.
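The "optional flag with a string value" behaviour can be modelled with a small lookup over already-tokenised arguments. This is a simplified sketch: flag_value() is a hypothetical helper, and it does not reproduce the monitor's real argument parser or its qdict.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the string that follows `flag` in args, or NULL if the flag is
 * absent or has no value after it. NULL models "entry omitted".
 */
static const char *flag_value(const char *const *args, size_t n,
                              const char *flag)
{
    size_t i;

    for (i = 0; i + 1 < n; i++) {
        if (strcmp(args[i], flag) == 0) {
            return args[i + 1];
        }
    }
    return NULL;
}
```

A caller checks for NULL to tell "flag not given" apart from any real value, which matches the entry simply being absent when the flag is omitted.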
Hanna Reitz [Wed, 23 Feb 2022 09:23:40 +0000 (10:23 +0100)]
virtiofsd: Let meson check for statx.stx_mnt_id
In virtiofsd, we assume that the presence of the STATX_MNT_ID macro
implies existence of the statx.stx_mnt_id field. Unfortunately, that is
not necessarily the case: glibc has introduced the macro in its commit
88a2cf6c4bab6e94a65e9c0db8813709372e9180, but the statx.stx_mnt_id field
is still missing from its own headers.
Let meson.build actually check for both STATX_MNT_ID and
statx.stx_mnt_id, and set CONFIG_STATX_MNT_ID if both are present.
Then, use this config macro in virtiofsd.
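On the C side, guarding on the config macro rather than on STATX_MNT_ID itself could look like the sketch below. get_mount_id() is a hypothetical helper, not a function from virtiofsd; only CONFIG_STATX_MNT_ID, STATX_MNT_ID and stx_mnt_id come from the description above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* statx() is a GNU extension in glibc */
#endif
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Fetch the mount ID of path; returns false if it is unavailable. */
static bool get_mount_id(const char *path, uint64_t *mnt_id)
{
#ifdef CONFIG_STATX_MNT_ID
    struct statx stx;

    /* Both the macro and the struct field are known to exist here. */
    if (statx(AT_FDCWD, path, 0, STATX_MNT_ID, &stx) == 0 &&
        (stx.stx_mask & STATX_MNT_ID)) {
        *mnt_id = stx.stx_mnt_id;
        return true;
    }
#else
    /* Built without the field: the caller must use a fallback. */
    (void)path;
    (void)mnt_id;
#endif
    return false;
}
```

Because the field access sits behind CONFIG_STATX_MNT_ID, a libc that defines the macro but lacks the field still compiles and simply takes the fallback path.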