Peter Maydell [Mon, 16 Sep 2019 14:15:36 +0000 (15:15 +0100)]
target/arm/arm-semi: Factor out implementation of SYS_CLOSE
Currently for the semihosting calls which take a file descriptor
(SYS_CLOSE, SYS_WRITE, SYS_READ, SYS_ISTTY, SYS_SEEK, SYS_FLEN)
we have effectively two implementations, one for real host files
and one for when we indirect via the gdbstub. We want to add a
third one to deal with the magic :semihosting-features file.
Instead of having a three-way if statement in each of these
cases, factor out the implementation of the calls to separate
functions which we dispatch to via function pointers selected
via the GuestFDType for the guest fd.
In this commit, we set up the framework for the dispatch,
and convert the SYS_CLOSE call to use it.
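A minimal sketch of the dispatch framework, with illustrative names (the
table layout here is an assumption, not necessarily the exact QEMU symbols):

    /* One function table per GuestFDType; only SYS_CLOSE is wired up here. */
    typedef struct GuestFDFunctions {
        uint32_t (*closefn)(ARMCPU *cpu, GuestFD *gf);
        /* readfn, writefn, isattyfn, seekfn, flenfn follow as each
         * remaining SYS_* call is converted. */
    } GuestFDFunctions;

    static const GuestFDFunctions guestfd_fns[] = {
        [GuestFDHost] = { .closefn = host_closefn },
        [GuestFDGDB]  = { .closefn = gdb_closefn  },
    };

    /* The SYS_CLOSE case in do_arm_semihosting() then reduces to: */
    return guestfd_fns[gf->type].closefn(cpu, gf);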
Peter Maydell [Mon, 16 Sep 2019 14:15:35 +0000 (15:15 +0100)]
target/arm/arm-semi: Use set_swi_errno() in gdbstub callback functions
When we are routing semihosting operations through the gdbstub, the
work of sorting out the return value and setting errno if necessary
is done by callback functions which are invoked by the gdbstub code.
Clean up some ifdeffery in those functions by having them call
set_swi_errno() to set the semihosting errno.
Peter Maydell [Mon, 16 Sep 2019 14:15:34 +0000 (15:15 +0100)]
target/arm/arm-semi: Restrict use of TaskState*
The semihosting code needs access to the linux-user-only
TaskState pointer so it can set the semihosting errno per-thread
for linux-user mode. At the moment we do this by having some
ifdefs so that we define a 'ts' local in do_arm_semihosting()
which is either a real TaskState * or just a CPUARMState *,
depending on which mode we're compiling for.
This is awkward if we want to refactor do_arm_semihosting()
into other functions which might need to be passed the TaskState.
Restrict usage of the TaskState local by:
* making set_swi_errno() always take the CPUARMState pointer
and (for the linux-user version) get TaskState from that
* creating a new get_swi_errno() which reads the errno
* having the two semihosting calls which need the TaskState
for other purposes (SYS_GET_CMDLINE and SYS_HEAPINFO)
define a variable with scope restricted to just that code
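A hedged sketch of the two helpers in their linux-user flavour (assumes
env_cpu() and the TaskState hanging off CPUState::opaque; not verbatim):

    static target_ulong set_swi_errno(CPUARMState *env, target_ulong code)
    {
        if (code == (target_ulong)-1) {
            /* Capture the host errno per-thread via the TaskState. */
            TaskState *ts = env_cpu(env)->opaque;
            ts->swi_errno = errno;
        }
        return code;
    }

    static target_ulong get_swi_errno(CPUARMState *env)
    {
        TaskState *ts = env_cpu(env)->opaque;
        return ts->swi_errno;
    }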
Peter Maydell [Mon, 16 Sep 2019 14:15:33 +0000 (15:15 +0100)]
target/arm/arm-semi: Make semihosting code hand out its own file descriptors
Currently the Arm semihosting code returns the guest file descriptors
(handles) which are simply the fd values from the host OS or the
remote gdbstub. Part of the semihosting 2.0 specification requires
that we implement special handling of opening a ":semihosting-features"
filename. Guest fds which result from opening the special file
won't correspond to host fds, so to ensure that we don't end up
with duplicate fds we need to have QEMU code control the allocation
of the fd values we give the guest.
Add in an abstraction layer which lets us allocate new guest FD
values, and translate from a guest FD value back to the host one.
This also fixes an odd hole where a semihosting guest could
use the semihosting API to read, write or close file descriptors
that it had never allocated but which were being used by QEMU itself.
(This isn't a security hole, because enabling semihosting permits
the guest to do arbitrary file access to the whole host filesystem,
and so should only be done if the guest is completely trusted.)
Currently the only kind of guest fd is one which maps to a
host fd, but in a following commit we will add one which maps
to the :semihosting-features magic data.
If the guest is migrated with an open semihosting file descriptor
then subsequent attempts to use the fd will all fail; this is
not a change from the previous situation (where the host fd
being used on the source end would not be re-opened on the
destination end).
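A sketch of the allocation side of such a layer, GArray-based and with
illustrative names (close in spirit to the description above, not verbatim):

    typedef enum GuestFDType {
        GuestFDUnused = 0,
        GuestFDHost,
        GuestFDGDB,
    } GuestFDType;

    typedef struct GuestFD {
        GuestFDType type;
        int hostfd;                     /* valid only for GuestFDHost */
    } GuestFD;

    static GArray *guestfd_array;

    /* Return a fresh guest fd value; QEMU, not the host, picks it. */
    static int alloc_guestfd(void)
    {
        guint i;

        if (!guestfd_array) {
            /* Entries are zero-initialized, i.e. type GuestFDUnused. */
            guestfd_array = g_array_new(FALSE, TRUE, sizeof(GuestFD));
        }
        /* Skip slot 0 so guest handles are always non-zero. */
        for (i = 1; i < guestfd_array->len; i++) {
            GuestFD *gf = &g_array_index(guestfd_array, GuestFD, i);
            if (gf->type == GuestFDUnused) {
                return i;
            }
        }
        g_array_set_size(guestfd_array, i + 1);
        return i;
    }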
Peter Maydell [Mon, 16 Sep 2019 14:15:32 +0000 (15:15 +0100)]
target/arm/arm-semi: Correct comment about gdb syscall races
In arm_gdb_syscall() we have a comment suggesting a race
because the syscall completion callback might not happen
before the gdb_do_syscallv() call returns. The comment is
correct that the callback may not happen but incorrect about
the effects. Correct it and note the important caveat that
callers must never do any work of any kind after return from
arm_gdb_syscall() that depends on its return value.
Peter Maydell [Mon, 16 Sep 2019 14:15:31 +0000 (15:15 +0100)]
target/arm/arm-semi: Always set some kind of errno for failed calls
If we fail a semihosting call we should always set the
semihosting errno to something; we were failing to do
this for some of the "check inputs for sanity" cases.
Peter Maydell [Mon, 16 Sep 2019 14:15:30 +0000 (15:15 +0100)]
target/arm/arm-semi: Capture errno in softmmu version of set_swi_errno()
The set_swi_errno() function is called to capture the errno
from a host system call, so that we can return -1 from the
semihosting function and later allow the guest to get a more
specific error code with the SYS_ERRNO function. It comes in
two versions, one for user-only and one for softmmu. We forgot
to capture the errno in the softmmu version; fix the error.
(Semihosting calls directed to gdb are unaffected because
they go through a different code path that captures the
error return from the gdbstub call in arm_semi_cb() or
arm_semi_flen_cb().)
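The shape of the softmmu fix, as a hedged sketch (assumes the file's
existing syscall_err static; the essential change is recording errno):

    static target_ulong syscall_err;

    static inline uint32_t set_swi_errno(CPUARMState *env, uint32_t code)
    {
        if (code == (uint32_t)-1) {
            syscall_err = errno;   /* previously errno was silently dropped */
        }
        return code;
    }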
Peter Maydell [Tue, 8 Oct 2019 17:17:40 +0000 (18:17 +0100)]
hw/net/lan9118.c: Switch to transaction-based ptimer API
Switch the lan9118 code away from bottom-half based
ptimers to the new transaction-based ptimer API. This just requires
adding begin/commit calls around the various places that modify the
ptimer state, and using the new ptimer_init() function to create the
timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:39 +0000 (18:17 +0100)]
hw/watchdog/cmsdk-apb-watchdog.c: Switch to transaction-based ptimer API
Switch the cmsdk-apb-watchdog code away from bottom-half based
ptimers to the new transaction-based ptimer API. This just requires
adding begin/commit calls around the various places that modify the
ptimer state, and using the new ptimer_init() function to create the
timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:38 +0000 (18:17 +0100)]
hw/timer/mss-timer.c: Switch to transaction-based ptimer API
Switch the mss-timer code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:37 +0000 (18:17 +0100)]
hw/timer/imx_gpt.c: Switch to transaction-based ptimer API
Switch the imx_gpt.c code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:36 +0000 (18:17 +0100)]
hw/timer/imx_epit.c: Switch to transaction-based ptimer API
Switch the imx_epit.c code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:33 +0000 (18:17 +0100)]
hw/timer/exynos4210_pwm.c: Switch to transaction-based ptimer API
Switch the exynos4210_pwm code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:30 +0000 (18:17 +0100)]
hw/timer/exynos4210_mct.c: Switch GFRC to transaction-based ptimer API
We want to switch the exynos MCT code away from bottom-half based ptimers to
the new transaction-based ptimer API. The MCT is complicated
and uses multiple different ptimers, so it's clearer to switch
it a piece at a time. Here we change over only the GFRC.
Peter Maydell [Tue, 8 Oct 2019 17:17:29 +0000 (18:17 +0100)]
hw/timer/digic-timer.c: Switch to transaction-based ptimer API
Switch the digic-timer.c code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:28 +0000 (18:17 +0100)]
hw/timer/cmsdk-apb-timer.c: Switch to transaction-based ptimer API
Switch the cmsdk-apb-timer code away from bottom-half based ptimers
to the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:27 +0000 (18:17 +0100)]
hw/timer/cmsdk-apb-dualtimer.c: Switch to transaction-based ptimer API
Switch the cmsdk-apb-dualtimer code away from bottom-half based
ptimers to the new transaction-based ptimer API. This just requires
adding begin/commit calls around the various places that modify the
ptimer state, and using the new ptimer_init() function to create the
timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:26 +0000 (18:17 +0100)]
hw/timer/arm_mptimer.c: Switch to transaction-based ptimer API
Switch the arm_mptimer.c code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:25 +0000 (18:17 +0100)]
hw/timer/allwinner-a10-pit.c: Switch to transaction-based ptimer API
Switch the allwinner-a10-pit code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:24 +0000 (18:17 +0100)]
hw/arm/musicpal.c: Switch to transaction-based ptimer API
Switch the musicpal code away from bottom-half based ptimers to
the new transaction-based ptimer API. This just requires adding
begin/commit calls around the various places that modify the ptimer
state, and using the new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:23 +0000 (18:17 +0100)]
hw/timer/arm_timer.c: Switch to transaction-based ptimer API
Switch the arm_timer.c code away from bottom-half based ptimers
to the new transaction-based ptimer API. This just requires
adding begin/commit calls around the various arms of
arm_timer_write() that modify the ptimer state, and using the
new ptimer_init() function to create the timer.
Peter Maydell [Tue, 8 Oct 2019 17:17:22 +0000 (18:17 +0100)]
tests/ptimer-test: Switch to transaction-based ptimer API
Convert the ptimer test cases to the transaction-based ptimer API,
by changing to ptimer_init(), dropping the now-unused QEMUBH
variables, and surrounding each set of changes to the ptimer
state in ptimer_transaction_begin/commit calls.
Peter Maydell [Tue, 8 Oct 2019 17:17:21 +0000 (18:17 +0100)]
ptimer: Provide new transaction-based API
Provide the new transaction-based API. If a ptimer is created
using ptimer_init() rather than ptimer_init_with_bh(), then
instead of providing a QEMUBH, it provides a pointer to the
callback function directly, and has opted into the transaction
API. All calls to functions which modify ptimer state:
- ptimer_set_period()
- ptimer_set_freq()
- ptimer_set_limit()
- ptimer_set_count()
- ptimer_run()
- ptimer_stop()
must be between matched calls to ptimer_transaction_begin()
and ptimer_transaction_commit(). When ptimer_transaction_commit()
is called it will evaluate the state of the timer after all the
changes in the transaction, and call the callback if necessary.
In the old API the individual update functions generally would
call ptimer_trigger() immediately, which would schedule the QEMUBH.
In the new API the update functions will instead defer the
"set s->next_event and call ptimer_reload()" work to
ptimer_transaction_commit().
Because ptimer_trigger() can now immediately call into the
device code which may then call other ptimer functions that
update ptimer_state fields, we must be more careful in
ptimer_reload() not to cache fields from ptimer_state across
the ptimer_trigger() call. (This was harmless with the QEMUBH
mechanism as the BH would not be invoked until much later.)
We use assertions to check that:
* the functions modifying ptimer state are not called outside
a transaction block
* ptimer_transaction_begin() and _commit() calls are paired
* the transaction API is not used with a QEMUBH ptimer
There is some slight repetition of code:
* most of the set functions have similar looking "if s->bh
call ptimer_reload, otherwise set s->need_reload" code
* ptimer_init() and ptimer_init_with_bh() have similar code
We deliberately don't try to avoid this repetition, because
it will all be deleted when the QEMUBH version of the API
is removed.
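In practice a conversion to the new API looks like this sketch (device
callback and field names are illustrative):

    /* Creation: pass the callback directly instead of a QEMUBH. */
    s->timer = ptimer_init(mydev_timer_tick, s, PTIMER_POLICY_DEFAULT);

    /* Every state modification must sit inside a transaction: */
    ptimer_transaction_begin(s->timer);
    ptimer_set_freq(s->timer, 1000000);
    ptimer_set_limit(s->timer, s->reload, 1);
    ptimer_run(s->timer, 0);              /* 0 = periodic mode */
    ptimer_transaction_commit(s->timer);  /* may invoke mydev_timer_tick */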
Peter Maydell [Tue, 8 Oct 2019 17:17:20 +0000 (18:17 +0100)]
ptimer: Rename ptimer_init() to ptimer_init_with_bh()
Currently the ptimer design uses a QEMU bottom-half as its
mechanism for calling back into the device model using the
ptimer when the timer has expired. Unfortunately this design
is fatally flawed, because it means that there is a lag
between the ptimer updating its own state and the device
callback function updating device state, and guest accesses
to device registers between the two can return inconsistent
device state.
We want to replace the bottom-half design with one where
the guest device's callback is called either immediately
(when the ptimer triggers by timeout) or when the device
model code closes a transaction-begin/end section (when the
ptimer triggers because the device model changed the
ptimer's count value or other state). As the first step,
rename ptimer_init() to ptimer_init_with_bh(), to free up
the ptimer_init() name for the new API. We can then convert
all the ptimer users away from ptimer_init_with_bh() before
removing it entirely.
(Commit created with
git grep -l ptimer_init | xargs sed -i -e 's/ptimer_init/ptimer_init_with_bh/'
and three overlong lines folded by hand.)
Eric Auger [Thu, 3 Oct 2019 15:46:40 +0000 (17:46 +0200)]
ARM: KVM: Check KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 for smp_cpus > 256
Host kernels within [4.18, 5.3] report an erroneous KVM_MAX_VCPUS=512
for ARM. The actual capability to instantiate more than 256 vcpus
was fixed in 5.4 with the upgrade of the KVM_IRQ_LINE ABI to support
vcpu id encoded on 12 bits instead of 8 and a redistributor consuming
a single KVM IO device instead of 2.
So let's check this capability when attempting to use more than 256
vcpus within any ARM kvm accelerated machine.
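The check is of roughly this shape, in kvm_arch_init(MachineState *ms,
KVMState *s) (a hedged sketch, not the verbatim patch):

    /* More than 256 vcpus requires the 12-bit KVM_IRQ_LINE vcpu encoding. */
    if (ms->smp.cpus > 256 &&
        !kvm_check_extension(s, KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)) {
        error_report("Using more than 256 vcpus requires a host kernel "
                     "with KVM_CAP_ARM_IRQ_LINE_LAYOUT_2");
        ret = -EINVAL;
    }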
Eric Auger [Thu, 3 Oct 2019 15:46:39 +0000 (17:46 +0200)]
intc/arm_gic: Support IRQ injection for more than 256 vcpus
Host kernels that expose the KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 capability
allow injection of interrupts along with vcpu ids larger than 255.
Let's encode the vcpu id on 12 bits according to the upgraded KVM_IRQ_LINE
ABI when needed.
Given that we have two callsites that need to assemble
the value for kvm_set_irq(), a new helper routine, kvm_arm_set_irq(),
is introduced.
Without this patch, qemu exits with a "kvm_set_irq: Invalid argument"
message.
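A hedged sketch of the helper, assuming the upgraded uapi bit layout
(vcpu2_index in bits [31:28], irq_type in [27:24], vcpu_index in [23:16],
irq_number in [15:0]):

    int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
    {
        int kvm_irq = (irqtype << KVM_ARM_IRQ_TYPE_SHIFT) | irq;

        /* Split the vcpu id across the 8-bit and 4-bit fields. */
        kvm_irq |= ((cpu % 256) << KVM_ARM_IRQ_VCPU_SHIFT) |
                   ((cpu / 256) << KVM_ARM_IRQ_VCPU2_SHIFT);

        return kvm_set_irq(kvm_state, kvm_irq, !!level);
    }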
Peter Maydell [Tue, 15 Oct 2019 12:25:05 +0000 (13:25 +0100)]
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- block: Fix crash with qcow2 partial cluster COW with small cluster
sizes (misaligned write requests with BDRV_REQ_NO_FALLBACK)
- qcow2: Fix integer overflow potentially causing corruption with huge
requests
- vhdx: Detect truncated image files
- tools: Support help options for --object
- Various block-related replay improvements
- iotests/028: Fix for long $TEST_DIRs
# gpg: Signature made Mon 14 Oct 2019 17:02:54 BST
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <[email protected]>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
iotests: Test large write request to qcow2 file
qcow2: Limit total allocation range to INT_MAX
qemu-nbd: Support help options for --object
qemu-img: Support help options for --object
qemu-io: Support help options for --object
vl: Split off user_creatable_print_help()
iotests/028: Fix for long $TEST_DIRs
block: Reject misaligned write requests with BDRV_REQ_NO_FALLBACK
replay: add BH oneshot event for block layer
replay: finish record/replay before closing the disks
replay: don't drain/flush bdrv queue while RR is working
replay: update docs for record/replay with block devices
replay: disable default snapshot for record/replay
block: implement bdrv_snapshot_goto for blkreplay
block/vhdx: add check for truncated image files
Stefan Hajnoczi [Wed, 9 Oct 2019 13:51:54 +0000 (14:51 +0100)]
trace: add --group=all to tracing.txt
tracetool needs to know the group name ("all", "root", or a specific
subdirectory). Also remove the stdin redirection because tracetool.py
needs the path to the trace-events file. Update the documentation.
Max Reitz [Thu, 10 Oct 2019 10:08:58 +0000 (12:08 +0200)]
iotests: Test large write request to qcow2 file
Without HEAD^, the following happens when you attempt a large write
request to a qcow2 file such that the number of bytes covered by all
clusters involved in a single allocation will exceed INT_MAX:
(A) handle_alloc_space() decides to fill the whole area with zeroes and
fails because bdrv_co_pwrite_zeroes() fails (the request is too
large).
(B) If handle_alloc_space() does not do anything, but merge_cow()
decides that the requests can be merged, it will create a too long
IOV that later cannot be written.
(C) Otherwise, all parts will be written separately, so those requests
will work.
In either B or C, though, qcow2_alloc_cluster_link_l2() will have an
overflow: We use an int (i) to iterate over nb_clusters, and then
calculate the L2 entry based on "i << s->cluster_bits" -- which will
overflow if the range covers more than INT_MAX bytes. This then leads
to image corruption because the L2 entry will be wrong (it will be
recognized as a compressed cluster).
Even if that were not the case, the .cow_end area would be empty
(because handle_alloc() will cap avail_bytes and nb_bytes at INT_MAX, so
their difference (which is the .cow_end size) will be 0).
So this test checks that on such large requests, the image will not be
corrupted. Unfortunately, we cannot check whether COW will be handled
correctly, because that data is discarded when it is written to null-co
(but we have to use null-co, because writing 2 GB of data in a test is
not quite reasonable).
Max Reitz [Thu, 10 Oct 2019 10:08:57 +0000 (12:08 +0200)]
qcow2: Limit total allocation range to INT_MAX
When the COW areas are included, the size of an allocation can exceed
INT_MAX. This is kind of limited by handle_alloc() in that it already
caps avail_bytes at INT_MAX, but the number of clusters still reflects
the original length.
This can have all sorts of effects, ranging from the storage layer write
call failing to image corruption. (If there were no image corruption,
then I suppose there would be data loss because the .cow_end area is
forced to be empty, even though there might be something we need to
COW.)
Fix all of it by limiting nb_clusters so the equivalent number of bytes
will not exceed INT_MAX.
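The arithmetic at the heart of both commits, reduced to two lines
(assuming 64k clusters, so a 2 GB allocation means i reaches 32768):

    int i = 32768, cluster_bits = 16;
    uint64_t bad  = i << cluster_bits;            /* int overflow, then widened */
    uint64_t good = (uint64_t)i << cluster_bits;  /* widen first: 0x80000000 */

The cap itself is then of the form
nb_clusters = MIN(nb_clusters, INT_MAX >> cluster_bits), so that
nb_clusters << cluster_bits can never exceed INT_MAX.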
Kevin Wolf [Fri, 11 Oct 2019 19:49:17 +0000 (21:49 +0200)]
qemu-nbd: Support help options for --object
Instead of parsing help options as normal object properties and
returning an error, provide the same help functionality as the system
emulator in qemu-nbd, too.
Kevin Wolf [Fri, 11 Oct 2019 19:49:17 +0000 (21:49 +0200)]
qemu-img: Support help options for --object
Instead of parsing help options as normal object properties and
returning an error, provide the same help functionality as the system
emulator in qemu-img, too.
Kevin Wolf [Fri, 11 Oct 2019 19:49:17 +0000 (21:49 +0200)]
qemu-io: Support help options for --object
Instead of parsing help options as normal object properties and
returning an error, provide the same help functionality as the system
emulator in qemu-io, too.
Kevin Wolf [Fri, 11 Oct 2019 17:20:12 +0000 (19:20 +0200)]
vl: Split off user_creatable_print_help()
Printing help for --object is something that we not only want in the
system emulator, but also in tools that support --object. Move it into a
separate function in qom/object_interfaces.c to make the code accessible
for tools.
Max Reitz [Fri, 11 Oct 2019 12:18:08 +0000 (14:18 +0200)]
iotests/028: Fix for long $TEST_DIRs
For long test image paths, the order of the "Formatting" line and the
"(qemu)" prompt after a drive_backup HMP command may be reversed. In
fact, the interaction between the prompt and the line may lead to
the "Formatting" line not being greppable at all after "read"-ing it (if
the prompt injects an IFS character into the "Formatting" string).
So just wait until we get a prompt. At that point, the block job must
have been started, so "info block-jobs" will only return "No active
jobs" once it is done.
block: Reject misaligned write requests with BDRV_REQ_NO_FALLBACK
The reason is that when writing to an unallocated cluster we try to
skip the copy-on-write part and zeroize it using BDRV_REQ_NO_FALLBACK
instead, resulting in a write request that is too small (2KB cluster
size vs 4KB required alignment).
Pavel Dovgalyuk [Tue, 17 Sep 2019 11:58:19 +0000 (14:58 +0300)]
replay: add BH oneshot event for block layer
Replay is capable of recording normal BH events, but sometimes
single-use callbacks are scheduled with the aio_bh_schedule_oneshot
function. This patch enables recording and replaying such callbacks.
The block layer uses these events for calling the completion function,
and replaying these calls makes the execution deterministic.
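The call-site change in the block layer, as a hedged sketch (the wrapper
name is from this series; its signature is assumed to mirror
aio_bh_schedule_oneshot):

    /* Before: scheduled directly, invisible to record/replay. */
    aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), cb, opaque);

    /* After: the replay-aware wrapper records or replays the event. */
    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs), cb, opaque);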
Pavel Dovgalyuk [Tue, 17 Sep 2019 11:58:13 +0000 (14:58 +0300)]
replay: finish record/replay before closing the disks
After recent updates, block devices cannot be closed on qemu exit,
because block requests are still being polled while replay is not finished.
Therefore we now stop recording the execution before closing the block devices.
Pavel Dovgalyuk [Tue, 17 Sep 2019 11:58:08 +0000 (14:58 +0300)]
replay: don't drain/flush bdrv queue while RR is working
In record/replay mode the bdrv queue is controlled by the replay
mechanism, which does not allow saving or loading snapshots
while the bdrv queue is not empty. Stopping the VM is not blocked by a
nonempty queue, but flushing the queue there is still impossible,
because it may cause deadlocks in replay mode.
This patch disables bdrv_drain_all and bdrv_flush_all in
record/replay mode.
Stopping the machine while IO requests are not finished is needed
for debugging: e.g., a breakpoint may be set at a specified step,
and forcing the IO requests to finish may break the determinism
of the execution.
Pavel Dovgalyuk [Tue, 17 Sep 2019 11:57:51 +0000 (14:57 +0300)]
block: implement bdrv_snapshot_goto for blkreplay
This patch enables making snapshots of block devices that use
blkreplay.
Implementing bdrv_snapshot_goto is required so that the snapshot can
be loaded without calling .bdrv_open, which blkreplay does not implement.
Peter Lieven [Tue, 10 Sep 2019 15:26:22 +0000 (17:26 +0200)]
block/vhdx: add check for truncated image files
qemu is currently not able to detect truncated vhdx image files.
Add a basic check that all allocated blocks are reachable at open
time, and report all errors during bdrv_co_check.
Peter Maydell [Mon, 14 Oct 2019 15:09:52 +0000 (16:09 +0100)]
Merge remote-tracking branch 'remotes/dgilbert/tags/pull-migration-20191011a' into staging
Migration pull 2019-10-11
Mostly cleanups and minor fixes
[Note I'm seeing a hang on the aarch64 hosted x86-64 tcg migration
test in xbzrle; but I'm seeing that on current head as well]
# gpg: Signature made Fri 11 Oct 2019 20:14:31 BST
# gpg: using RSA key 45F5C71B4A0CB7FB977A9FA90516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <[email protected]>" [full]
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A 9FA9 0516 331E BC5B FDE7
* remotes/dgilbert/tags/pull-migration-20191011a: (21 commits)
migration: Support gtree migration
migration/multifd: pages->used would be cleared when attach to multifd_send_state
migration/multifd: initialize packet->magic/version once at setup stage
migration/multifd: use pages->allocated instead of the static max
migration/multifd: fix a typo in comment of multifd_recv_unfill_packet()
migration/postcopy: check PostcopyState before setting to POSTCOPY_INCOMING_RUNNING
migration/postcopy: rename postcopy_ram_enable_notify to postcopy_ram_incoming_setup
migration/postcopy: postpone setting PostcopyState to END
migration/postcopy: mis->have_listen_thread check will never be touched
migration: report SaveStateEntry id and name on failure
migration: pass in_postcopy instead of check state again
migration/postcopy: fix typo in mark_postcopy_blocktime_begin's comment
migration/postcopy: map large zero page in postcopy_ram_incoming_setup()
migration/postcopy: allocate tmp_page in setup stage
migration: Don't try and recover return path in non-postcopy
rcu: Use automatic rc_read unlock in core memory/exec code
migration: Use automatic rcu_read unlock in rdma.c
migration: Use automatic rcu_read unlock in ram.c
migration: Fix missing rcu_read_unlock
rcu: Add automatically released rcu_read_lock variants
...
Alex Bennée [Tue, 1 Oct 2019 16:04:26 +0000 (17:04 +0100)]
cpus: kick all vCPUs when running thread=single
qemu_cpu_kick is used for a number of reasons including to indicate
there is work to be done. However when thread=single the old
qemu_cpu_kick_rr_cpu only advanced the vCPU to the next executing one
which can lead to a hang in the case that:
a) the kick is from outside the vCPUs (e.g. iothread)
b) the timers are paused (i.e. iothread calling run_on_cpu)
To avoid this, let's split qemu_cpu_kick_rr_cpu into two functions:
one for the timer, which continues to advance to the next timeslice,
and another for all other kicks.
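A sketch of the resulting split (the function names here are illustrative
guesses; bodies hedged, not verbatim):

    /* Timer path: keep advancing round-robin to the next vCPU. */
    static void qemu_cpu_kick_rr_next_cpu(void)
    {
        CPUState *cpu;
        do {
            cpu = atomic_mb_read(&tcg_current_rr_cpu);
            if (cpu) {
                cpu_exit(cpu);
            }
        } while (cpu != atomic_mb_read(&tcg_current_rr_cpu));
    }

    /* All other kicks: kick every vCPU so none can be missed. */
    static void qemu_cpu_kick_rr_cpus(void)
    {
        CPUState *cpu;
        CPU_FOREACH(cpu) {
            cpu_exit(cpu);
        }
    }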
tcg/ppc: Update vector support for v3.00 load/store
These new instructions are a mix of those like LXSD that are
conditional only on MSR.VEC and those like LXV that are
conditional on MSR.VEC for TX=1. Thus, in the end, we can
consider all of these as Altivec instructions.
These new instructions are conditional only on MSR.VEC and
are thus part of the Altivec instruction set, and not VSX.
This includes negation and compare not equal.
These new instructions are conditional on MSR.FP when TX=0 and
MSR.VEC when TX=1. Since we only care about the Altivec registers,
and force TX=1, we can consider these to be Altivec instructions.
Since Altivec is true for any use of vector types, we only need
to test have_isa_2_07.
This includes moves to and from the integer registers.
These new instructions are conditional only on MSR.VSX and
are thus part of the VSX instruction set, and not Altivec.
This includes double-word loads and stores.
These new instructions are conditional only on MSR.VEC and
are thus part of the Altivec instruction set, and not VSX.
This includes lots of double-word arithmetic and a few extra
logical operations.
The VSX instruction set instructions include double-word loads and
stores, double-word load and splat, double-word permute, and bit
select. All of which require multiple operations in the Altivec
instruction set.
Because the VSX registers map %vsr32 to %vr0, and we have no current
intention or need to use vector registers outside %vr0-%vr19, force
on the {ax,bx,cx,tx} bits within the added VSX insns so that we don't
have to otherwise modify the VR[TABC] macros.
tcg/ppc: Add support for load/store/logic/comparison
Add various bits and pieces related mostly to load and store
operations. In that context, logic, compare, and splat Altivec
instructions are used, and, therefore, the support for emitting
them is included in this patch too.
Introduce all of the flags required to enable tcg backend vector support,
and a runtime flag to indicate the host supports Altivec instructions.
For now, do not actually set have_isa_altivec to true, because we have not
yet added all of the code to actually generate all of the required insns.
However, we must define these flags in order to disable ifndefs that create
stub versions of the functions added here.
The change to tcg_out_movi works around a buglet in tcg.c wherein if we
do not define tcg_out_dupi_vec we get a declared but not defined Werror,
but if we only declare it we get a defined but not used Werror. We need
this change to tcg_out_movi eventually anyway, so it's no biggie.
Previously we've been hard-coding knowledge that Power7 has ISEL, but
it was an optional instruction before that. Use the AT_HWCAP2 bit,
when present, to properly determine support.
Peter Maydell [Mon, 14 Oct 2019 12:34:39 +0000 (13:34 +0100)]
Merge remote-tracking branch 'remotes/gkurz/tags/9p-next-2019-10-10' into staging
The most notable change is that we now detect cross-device setups in the
host since it may cause inode number collision and mayhem in the guest.
A new fsdev property is added for the user to choose the appropriate
policy to handle that: either remap all inode numbers or fail I/Os to
another host device or just print out a warning (default behaviour).
This is also my last PR as _active_ maintainer of 9pfs.
# gpg: Signature made Thu 10 Oct 2019 12:14:07 BST
# gpg: using RSA key B4828BAF943140CEF2A3491071D4D5E5822F73D6
# gpg: Good signature from "Greg Kurz <[email protected]>" [full]
# gpg: aka "Gregory Kurz <[email protected]>" [full]
# gpg: aka "[jpeg image of size 3330]" [full]
# Primary key fingerprint: B482 8BAF 9431 40CE F2A3 4910 71D4 D5E5 822F 73D6
* remotes/gkurz/tags/9p-next-2019-10-10:
MAINTAINERS: Downgrade status of virtio-9p to "Odd Fixes"
9p: Use variable length suffixes for inode remapping
9p: stat_to_qid: implement slow path
9p: Added virtfs option 'multidevs=remap|forbid|warn'
9p: Treat multiple devices on one export as an error
fsdev: Add return value to fsdev_throttle_parse_opts()
9p: Simplify error path of v9fs_device_realize_common()
9p: unsigned type for type, version, path
Peter Maydell [Mon, 14 Oct 2019 11:26:37 +0000 (12:26 +0100)]
Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-10-10' into staging
Block patches:
- Parallelized request handling for qcow2
- Backup job refactoring to use a filter node instead of before-write
notifiers
- Add discard accounting information to file-posix nodes
- Allow trivial reopening of nbd nodes
- Some iotest fixes
Peter Maydell [Mon, 14 Oct 2019 09:42:35 +0000 (10:42 +0100)]
Merge remote-tracking branch 'remotes/davidhildenbrand/tags/s390x-tcg-2019-10-10' into staging
- MMU DAT translation rewrite and cleanup
- Implement more TCG CPU features related to the MMU (e.g., IEP)
- Add the current instruction length to unwind data and clean up
- Resolve one TODO for the MVCL instruction
* remotes/davidhildenbrand/tags/s390x-tcg-2019-10-10: (31 commits)
s390x/tcg: MVCL: Exit to main loop if requested
target/s390x: Remove ILEN_UNWIND
target/s390x: Remove ilen argument from trigger_pgm_exception
target/s390x: Remove ilen argument from trigger_access_exception
target/s390x: Remove ILEN_AUTO
target/s390x: Rely on unwinding in s390_cpu_virt_mem_rw
target/s390x: Rely on unwinding in s390_cpu_tlb_fill
target/s390x: Simplify helper_lra
target/s390x: Remove fail variable from s390_cpu_tlb_fill
target/s390x: Return exception from translate_pages
target/s390x: Return exception from mmu_translate
target/s390x: Remove exc argument to mmu_translate_asce
target/s390x: Return exception from mmu_translate_real
target/s390x: Handle tec in s390_cpu_tlb_fill
target/s390x: Push trigger_pgm_exception lower in s390_cpu_tlb_fill
target/s390x: Use tcg_s390_program_interrupt in TCG helpers
target/s390x: Remove ilen parameter from s390_program_interrupt
target/s390x: Remove ilen parameter from tcg_s390_program_interrupt
target/s390x: Add ilen to unwind data
s390x/cpumodel: Add new TCG features to QEMU cpu model
...
If iothread_run() checks iothread->stopping before the iothread_join()
thread sets stopping to true, then aio_notify() may be optimized away
and iothread_run() hangs forever in aio_poll().
The correct way to change iothread->stopping is from a BH that executes
within iothread_run(). This ensures that iothread->stopping is checked
after we set it to true.
This was already fixed for ./iothread.c (note this is a different source
file!) by commit 2362a28ea11c145e1a13ae79342d76dc118a72a6 ("iothread:
fix iothread_stop() race condition"), but not for tests/iothread.c.
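The BH-based pattern, mirroring the cited fix as a hedged sketch for
tests/iothread.c:

    static void iothread_stop_bh(void *opaque)
    {
        IOThread *iothread = opaque;
        /* Runs inside iothread_run()'s context, so the flag is always
         * observed after being set and the event loop exits. */
        iothread->stopping = true;
    }

    void iothread_join(IOThread *iothread)
    {
        aio_bh_schedule_oneshot(iothread->ctx, iothread_stop_bh, iothread);
        qemu_thread_join(&iothread->thread);
    }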
Eric Auger [Fri, 11 Oct 2019 12:17:24 +0000 (14:17 +0200)]
migration: Support gtree migration
Introduce support for GTree migration. A custom save/restore
is implemented. Each item consists of a key and a data value.
If the key is a pointer to an object, 2 VMSDs are passed into
the GTree VMStateField.
When putting the items, the tree is traversed in sorted order by
g_tree_foreach.
On the get() path, gtrees must be allocated using the proper
key compare, key destroy and value destroy. This must be handled
beforehand, for example in a pre_load method.
Tests are added to test save/dump of structs containing gtrees
including the virtio-iommu domain/mappings scenario.
Wei Yang [Fri, 11 Oct 2019 08:50:48 +0000 (16:50 +0800)]
migration/multifd: use pages->allocated instead of the static max
multifd_send_fill_packet() prepares the metadata for the following
pages to transfer. It is more proper to fill in pages->allocated instead
of the static maximum value, especially as we want to support flexible
packet sizes.
Wei Yang [Thu, 10 Oct 2019 01:13:16 +0000 (09:13 +0800)]
migration/postcopy: check PostcopyState before setting to POSTCOPY_INCOMING_RUNNING
Currently, we blindly set PostcopyState to RUNNING, even if we found that
the previous state is not LISTENING. This will lead to a corner case.
First let's look at the code flow:
qemu_loadvm_state_main()
    ret = loadvm_process_command()
        loadvm_postcopy_handle_run()
            return -1;
    if (ret < 0) {
        if (postcopy_state_get() == POSTCOPY_INCOMING_RUNNING)
            ...
    }
From the above snippet, the corner case is that loadvm_postcopy_handle_run()
always sets the state to RUNNING and only then checks the previous state. If
the previous state is not LISTENING, it will return -1. But at that
moment, PostcopyState has already been set to RUNNING.
Then ret is checked in qemu_loadvm_state_main(); when it is -1,
PostcopyState is checked. The current logic would pause postcopy and retry
if PostcopyState is RUNNING. This is not what we expect, because
postcopy is not active yet.
This patch makes sure the state is set to RUNNING only if the previous
state is LISTENING, by checking the state first.
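The reordered logic, as a hedged sketch (check the old state first and
transition only on success; return values simplified):

    static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
    {
        PostcopyState ps = postcopy_state_get();

        trace_loadvm_postcopy_handle_run();
        if (ps != POSTCOPY_INCOMING_LISTENING) {
            error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
            return -1;              /* PostcopyState left untouched now */
        }

        postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
        /* ... schedule the bottom half that starts the VM ... */
        return 0;
    }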
Wei Yang [Sun, 6 Oct 2019 00:02:48 +0000 (08:02 +0800)]
migration/postcopy: postpone setting PostcopyState to END
There are two places that call postcopy_ram_incoming_cleanup():
    postcopy_ram_listen_thread, on migration success
    loadvm_postcopy_handle_listen, on setup failure
On success, the vm will never accept another migration. On failure,
PostcopyState is transited from LISTENING to END and would be checked in
qemu_loadvm_state_main(). If PostcopyState is RUNNING, migration would
be paused and retried.
Currently PostcopyState is set to END in function
postcopy_ram_incoming_cleanup(). With above analysis, we can take this
step out and postpone this till the end of listen thread to indicate the
listen thread is done.
Wei Yang [Sun, 6 Oct 2019 00:02:47 +0000 (08:02 +0800)]
migration/postcopy: mis->have_listen_thread check will never be touched
If mis->have_listen_thread is true, the current PostcopyState
must be LISTENING or RUNNING, while the check at the beginning of the
function makes sure the state transition only happens when the previous
PostcopyState is ADVISE or DISCARD.
Wei Yang [Sat, 5 Oct 2019 13:50:21 +0000 (21:50 +0800)]
migration/postcopy: map large zero page in postcopy_ram_incoming_setup()
postcopy_ram_incoming_setup() and postcopy_ram_incoming_cleanup() are
counterparts. It is reasonable to map/unmap the large zero page in these
two functions respectively.
Wei Yang [Sat, 5 Oct 2019 13:50:20 +0000 (21:50 +0800)]
migration/postcopy: allocate tmp_page in setup stage
During migration, a tmp page is allocated so that a whole host page
can be placed there during postcopy.
Currently the page is allocated during the load stage, which is a little
late. More importantly, if the allocation fails, the error is not
checked properly: even if the page is NULL, we would still use it.
This patch moves the allocation to the setup stage; if it fails, an
error message is printed and the caller will notice it.
migration: Don't try and recover return path in non-postcopy
In normal precopy we can't do reconnection recovery - but we also
don't need to, since you can just rerun migration.
At the moment if the 'return-path' capability is on, we use
the return path in precopy to give a positive 'OK' to the end
of migration; however if migration fails then we fall into
the postcopy recovery path and hang. This fixes it by only
running the return path in the postcopy case.