David Gibson [Tue, 3 Apr 2018 04:55:11 +0000 (14:55 +1000)]
Make qemu_mempath_getpagesize() accept NULL
qemu_mempath_getpagesize() gets the effective (host side) page size for
a block of memory backed by an mmap()ed file on the host. It requires
the mem_path parameter to be non-NULL.
As a result, every caller needs a separate case for handling
anonymous memory (for memory-backend-ram, or for default memory when
-mem-path is not specified).
We can make all those callers a little simpler by having
qemu_mempath_getpagesize() accept NULL, and treat that as the anonymous
memory case.
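A minimal sketch of the NULL-accepting behaviour described above, not the QEMU implementation. The name mempath_pagesize_sketch is hypothetical, and detecting hugetlbfs through statfs() is an assumption about Linux hosts; on other hosts the sketch simply falls back to the normal page size.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/vfs.h>
#endif

/* hugetlbfs magic number as reported in statfs.f_type on Linux. */
#define SKETCH_HUGETLBFS_MAGIC 0x958458f6

/*
 * Hypothetical helper: return the host page size backing mem_path.
 * A NULL mem_path means anonymous memory, which uses the normal page size.
 */
size_t mempath_pagesize_sketch(const char *mem_path)
{
    long base = sysconf(_SC_PAGESIZE);
    assert(base > 0);

    if (mem_path == NULL) {
        return (size_t)base;          /* anonymous memory case */
    }
#ifdef __linux__
    struct statfs fs;
    if (statfs(mem_path, &fs) == 0 &&
        (unsigned long)fs.f_type == SKETCH_HUGETLBFS_MAGIC) {
        return (size_t)fs.f_bsize;    /* hugetlbfs reports the huge page size */
    }
#endif
    return (size_t)base;              /* regular file or lookup failure */
}
```

With this shape a caller can pass its mem_path through unchanged, whether or not one was configured.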
BALATON Zoltan [Sun, 25 Mar 2018 23:54:28 +0000 (01:54 +0200)]
target/ppc: Fix reserved bit mask of dstst instruction
According to the Vector/SIMD extension documentation, bit 6, which is
currently masked, is valid (listed as the transient bit), but bits 7 and 8
should be reserved instead. Fix the mask to match this.
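The bit arithmetic can be sketched as follows. This assumes the documentation uses IBM (big-endian) bit numbering, where bit 0 is the most significant bit of the 32-bit instruction word. The function names are hypothetical, and the full mask in the decoder may contain further bits that are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask for one bit of a 32-bit instruction word using IBM (big-endian)
 * bit numbering, where bit 0 is the most significant bit.
 */
uint32_t ibm_bit32(unsigned int n)
{
    assert(n < 32);
    return UINT32_C(1) << (31 - n);
}

/* Reserved-field mask covering bits 7 and 8, leaving bit 6 (transient) free. */
uint32_t dstst_reserved_bits_sketch(void)
{
    return ibm_bit32(7) | ibm_bit32(8);
}
```

Under that numbering, bit 6 is 0x02000000 and bits 7 and 8 together are 0x01800000.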
Michael Matz [Fri, 23 Feb 2018 17:29:56 +0000 (17:29 +0000)]
ppc: Fix size of ppc64 xer register
The normal gdb definition of the XER register is only 32 bits wide,
and that's what the current version of power64-core.xml also
says (seems copied from gdb's). But qemu's idea of the XER register
is target_ulong (in CPUPPCState, ppc_gdb_register_len and
ppc_cpu_gdb_read_register).
That mismatch leads to the following message when attaching
with gdb:
Truncated register 32 in remote 'g' packet
(and following on that qemu stops responding). The simple fix is
to say the truth in the .xml file. But the better fix is to
actually make it 32bit on the wire, as old gdbs don't support
XML files for describing registers. Also the XER state in qemu
doesn't seem to use the high 32 bits, so sending it off to gdb
doesn't seem worthwhile.
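A minimal sketch of "32bit on the wire": only the low half of a 64-bit XER value is encoded. The function name is hypothetical and big-endian byte order is an assumption made for illustration; this is not the code in ppc_cpu_gdb_read_register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical encoder: write the low 32 bits of a 64-bit XER value
 * into buf in big-endian byte order and return the number of bytes written.
 */
size_t encode_xer32_sketch(uint64_t xer, uint8_t *buf)
{
    assert(buf != NULL);
    uint32_t low = (uint32_t)xer;     /* high 32 bits are dropped */
    buf[0] = (uint8_t)(low >> 24);
    buf[1] = (uint8_t)(low >> 16);
    buf[2] = (uint8_t)(low >> 8);
    buf[3] = (uint8_t)low;
    return 4;
}
```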
uninorth: move PCI IO (ISA) memory region into the uninorth device
Do this for both the uninorth main and uninorth u3 AGP buses, using the main
PCI bus for each machine (this ensures the IO addresses still match those
used by OpenBIOS).
uninorth: remove obsolete pci_pmac_u3_init() function
Instead wire up the PCI/AGP host bridges in mac_newworld.c. Now that this is
complete, it is possible to move the initialisation of the PCI hole alias into
pci_u3_agp_init().
uninorth: remove obsolete pci_pmac_init() function
Instead wire up the PCI/AGP host bridges in mac_newworld.c. Now that this is
complete, it is possible to move the initialisation of the PCI hole alias into
pci_unin_main_init().
Somewhere in the history of time, the initialisation of the PCI buses for the
AGP and PCI host bridges got mixed up in that the PCI host bridge was
creating an instance of the AGP PCI bus, and the AGP PCI bus was missing.
Swap the PCI host bridge over to use the correct PCI bus (including setting
the kMacRISCPCIAddressSelect register used by MacOS X) and add the missing
reference to the AGP PCI bus.
uninorth: move PCI host bridge bus initialisation into device realize
Since the IO address space is fixed to use the standard system IO address
space, we can also use the opportunity to remove the address_space_io
parameter from pci_pmac_init() and pci_pmac_u3_init().
Note we also move the default mac99 PCI bus to the end of the initialisation
list so that it becomes the default destination for any devices specified
via -device without an explicit PCI bus provided.
This is in preparation for moving the PCI bus wiring inside the uninorth
host bridge devices. In the future it will be possible to remove this once the
PICs have been switched to use qdev GPIOs.
mac_oldworld: move wiring of macio IRQs to macio_oldworld_realize()
Since the macio device has a link to the PIC device, we can now wire up the
IRQs directly via qdev GPIOs rather than having to use an intermediate array.
This is the first step towards removing the old-style pci_grackle_init()
function. Following on from the previous commit we can now pass the heathrow
device as an object link and wire up the heathrow IRQs via qdev GPIOs.
uninorth: move uninorth definitions into uninorth.h
Signed-off-by: Mark Cave-Ayland <[email protected]>
[dwg: Added hw/hw.h #include as suggested by Philippe Mathieu-Daudé]
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: David Gibson <[email protected]>
uninorth: remove second set of uninorth token registers
Commit 593c181160: "PPC: Newworld: Add second uninorth control register set"
added a second set of uninorth registers at 0xf3000000.
Testing MacOS 9.2 to MacOS X 10.4 reveals no accesses to this address and I
can't find any reference to it in Apple's Core99.cpp source so I'm assuming
that this was the result of another bug that has now been fixed.
Peter Maydell [Thu, 26 Apr 2018 18:22:09 +0000 (19:22 +0100)]
Merge remote-tracking branch 'remotes/iwj/tags/for-upstream.depriv-2' into staging
xen: xen-domid-restrict improvements
# gpg: Signature made Thu 26 Apr 2018 19:11:22 BST
# gpg: using RSA key E3E3392348B50D39
# gpg: Good signature from "Ian Jackson (new general purpose key) <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 559A E46C 2D6B 6D32 65E7 CBA1 E3E3 3923 48B5 0D39
* remotes/iwj/tags/for-upstream.depriv-2:
configure: do_compiler: Dump some extra info under bash
os-posix: cleanup: Replace perror with error_report
os-posix: cleanup: Replace fprintf with error_report in remaining call sites
xen: Expect xenstore write to fail when restricted
xen: Remove now-obsolete xen_xc_domain_add_to_physmap
xen: Use newly added dmops for mapping VGA memory
os-posix: Provide new -runas <uid>:<gid> facility
os-posix: cleanup: Replace fprintfs with error_report in change_process_uid
xen: destroy_hvm_domain: Try xendevicemodel_shutdown
xen: move xc_interface compatibility fallback further up the file
xen: destroy_hvm_domain: Move reason into a variable
xen: defer call to xen_restrict until just before os_setup_post
xen: restrict: use xentoolcore_restrict_all
xen: link against xentoolcore
AccelClass: Introduce accel_setup_post
checkpatch: Add xendevicemodel_handle to the list of types
Ian Jackson [Mon, 25 Sep 2017 15:41:03 +0000 (16:41 +0100)]
configure: do_compiler: Dump some extra info under bash
This makes it much easier to find a particular thing in config.log.
We have to use the ${BASH_LINENO[*]} syntax which is a syntax error in
other shells, so test what shell we are running and use eval.
The extra output is only printed if configure is run with bash. On
systems where /bin/sh is not bash, it is necessary to say bash
./configure to get the extra debug info in the log.
Ross Lagerwall [Mon, 5 Mar 2018 10:07:46 +0000 (10:07 +0000)]
xen: Expect xenstore write to fail when restricted
Saving the current state to xenstore may fail when running restricted
(in particular, after a migration). Therefore, don't report the error or
exit when running restricted. Toolstacks that want to allow running
QEMU restricted should instead make use of QMP events to listen for
state changes.
Ross Lagerwall [Mon, 23 Oct 2017 09:27:27 +0000 (10:27 +0100)]
xen: Use newly added dmops for mapping VGA memory
Xen unstable (to be in 4.11) has two new dmops, relocate_memory and
pin_memory_cacheattr. Use these to set up the VGA memory, replacing the
previous calls to libxc. This allows the VGA console to work properly
when QEMU is running restricted (-xen-domid-restrict).
Wrapper functions are provided to allow QEMU to work with older versions
of Xen.
Tweak the error handling while making this change:
* Report pin_memory_cacheattr errors.
* Report errors even when DEBUG_HVM is not set. This is useful for
trying to understand why VGA is not working, since otherwise it just
fails silently.
* Fix the return values when an error occurs. The functions now
consistently return -1 and set errno.
Ian Jackson [Fri, 15 Sep 2017 17:10:44 +0000 (18:10 +0100)]
os-posix: Provide new -runas <uid>:<gid> facility
This allows the caller to specify a uid and gid to use, even if there
is no corresponding password entry. This will be useful in certain
Xen configurations.
We don't support just -runas <uid> because: (i) deprivileging without
calling setgroups would be ineffective; (ii) given only a uid we don't
know what gid we ought to use (since uids may appear in multiple
passwd file entries with different gids).
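A minimal sketch of parsing a "<uid>:<gid>" argument that rejects a bare "<uid>", as the reasoning above requires. The function name and the digits-only rule are assumptions for illustration; this is not the parser used in os-posix.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical parser for a "<uid>:<gid>" argument. Both parts must be
 * non-empty decimal numbers; a bare "<uid>" is rejected.
 */
bool parse_uid_gid_sketch(const char *arg, unsigned long *uid,
                          unsigned long *gid)
{
    char *end;

    assert(arg != NULL && uid != NULL && gid != NULL);

    /* Require a leading digit so strtoul cannot accept a sign or spaces. */
    if (!isdigit((unsigned char)arg[0])) {
        return false;
    }
    errno = 0;
    unsigned long u = strtoul(arg, &end, 10);
    if (errno != 0 || *end != ':') {
        return false;                 /* overflow or missing ':' */
    }

    const char *gid_str = end + 1;
    if (!isdigit((unsigned char)gid_str[0])) {
        return false;                 /* missing gid */
    }
    errno = 0;
    unsigned long g = strtoul(gid_str, &end, 10);
    if (errno != 0 || *end != '\0') {
        return false;                 /* overflow or trailing junk */
    }

    *uid = u;
    *gid = g;
    return true;
}
```

A caller would then pass both values to the gid and uid switching code, since neither can be derived from the other.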
Ian Jackson [Tue, 3 Oct 2017 17:51:05 +0000 (18:51 +0100)]
xen: move xc_interface compatibility fallback further up the file
We are going to want to use the dummy xendevicemodel_handle type in
new stub functions in the CONFIG_XEN_CTRL_INTERFACE_VERSION < 41000
section. So we need to provide that definition, or (as applicable)
include the appropriate header, earlier in the file.
(Ideally the newer compatibility layers would be at the bottom of the
file, so that they can naturally benefit from the compatibility layers
for earlier versions. But that's rather too much for this series.)
Ian Jackson [Fri, 15 Sep 2017 15:02:24 +0000 (16:02 +0100)]
xen: defer call to xen_restrict until just before os_setup_post
We need to restrict *all* the control fds that qemu opens. Looking in
/proc/PID/fd shows there are many; their allocation seems scattered
throughout Xen support code in qemu.
We must postpone the restrict call until roughly the same time as qemu
changes its uid, chroots (if applicable), and so on.
There doesn't seem to be an appropriate hook already. The RunState
change hook fires at different times depending on exactly what mode
qemu is operating in.
And it appears that no-one but the Xen code wants a hook at this phase
of execution. So, introduce a bare call to a new function
xen_setup_post, just before os_setup_post. Also provide the
appropriate stub for when Xen compilation is disabled.
We do the restriction before rather than after os_setup_post, because
xen_restrict may need to open /dev/null, and os_setup_post might have
called chroot.
Currently this does not work with migration, because when running as
the Xen device model qemu needs to signal to the toolstack that it is
ready. It currently does this using xenstore, and for incoming
migration (but not for ordinary startup) that happens after
os_setup_post.
It is correct that this happens late: we want the incoming migration
stream to be processed by a restricted qemu. The fix for this will be
to do the startup notification a different way, without using
xenstore. (QMP is probably a reasonable choice.)
So for now this restriction feature cannot be used in conjunction with
migration. (Note that this is not a regression in this patch, because
previously the -xen-domid-restrict call was, in fact, simply
ineffective!) We will revisit this in the Xen 4.11 release cycle.
Ian Jackson [Fri, 15 Sep 2017 15:03:14 +0000 (16:03 +0100)]
xen: restrict: use xentoolcore_restrict_all
And insist that it works.
Drop individual use of xendevicemodel_restrict and
xenforeignmemory_restrict. These are not actually effective in this
version of qemu, because qemu has a large number of fds open onto
various Xen control devices.
The restriction arrangements are still not right, because the
restriction needs to be done very late - after qemu has opened all of
its control fds.
xentoolcore_restrict_all and xentoolcore.h are available in Xen 4.10
and later, only. Provide a compatibility stub. And drop the
compatibility stubs for the old functions.
Ian Jackson [Thu, 8 Mar 2018 18:07:26 +0000 (18:07 +0000)]
checkpatch: Add xendevicemodel_handle to the list of types
This avoids checkpatch misparsing (as statements) long function
definitions or declarations, which sometimes start with constructs
like this:
static inline int xendevicemodel_relocate_memory(
xendevicemodel_handle *dmod, domid_t domid, ...
The type xendevicemodel_handle does not conform to Qemu CODING_STYLE,
which would suggest CamelCase. However, it is a type defined by the
Xen Project in xen.git. It would be possible to introduce a typedef
to allow the qemu code to refer to it by a differently-spelled name,
but that would obfuscate more than it would clarify.
Peter Maydell [Fri, 20 Apr 2018 14:52:48 +0000 (15:52 +0100)]
vl.c: Remove compile time limit on number of serial ports
Instead of having a fixed sized global serial_hds[] array,
use a local dynamically reallocated one, so we don't have
a compile time limit on how many serial ports a system has.
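A minimal sketch of a dynamically reallocated port table with a bounds-safe accessor. All names here are hypothetical stand-ins, and the growth policy (grow exactly to the highest index used) is an assumption; this is not the code in vl.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a character backend; the real type is opaque here. */
typedef struct BackendSketch { int id; } BackendSketch;

static BackendSketch **ports;   /* grows on demand */
static int num_ports;

/* Record the backend for port `index`, growing the array if needed. */
int set_port_sketch(int index, BackendSketch *be)
{
    assert(index >= 0);
    if (index >= num_ports) {
        BackendSketch **grown =
            realloc(ports, ((size_t)index + 1) * sizeof(*ports));
        if (grown == NULL) {
            return -1;
        }
        for (int i = num_ports; i <= index; i++) {
            grown[i] = NULL;    /* unconfigured slots stay empty */
        }
        ports = grown;
        num_ports = index + 1;
    }
    ports[index] = be;
    return 0;
}

/* Accessor: NULL for any index that was never configured. */
BackendSketch *get_port_sketch(int index)
{
    assert(index >= 0);
    return index < num_ports ? ports[index] : NULL;
}
```

Because the accessor hides the array size, callers need no compile-time constant to guard their lookups.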
Peter Maydell [Fri, 20 Apr 2018 14:52:47 +0000 (15:52 +0100)]
superio: Don't use MAX_SERIAL_PORTS for serial port limit
The superio device has a limit on the number of serial
ports it supports which is really only there because
it has a fixed-size array serial[]. This limit isn't
related particularly to the global MAX_SERIAL_PORTS limit,
so use a different #define for it.
(In practice the users of superio only ever want 2 serial ports.)
Peter Maydell [Fri, 20 Apr 2018 14:52:46 +0000 (15:52 +0100)]
serial-isa: Use MAX_ISA_SERIAL_PORTS instead of MAX_SERIAL_PORTS
The ISA serial port handling in serial-isa.c imposes a limit
of 4 serial ports. This is because we only know of 4 IO port
and IRQ settings for them, and is unrelated to the generic
MAX_SERIAL_PORTS limit, though they happen to both be set at
4 currently.
Use a new MAX_ISA_SERIAL_PORTS wherever that is the correct
limit to be checking against.
Peter Maydell [Fri, 20 Apr 2018 14:52:45 +0000 (15:52 +0100)]
hw/char/exynos4210_uart.c: Remove unneeded handling of NULL chardev
The handling of NULL chardevs in exynos4210_uart_create() is now
all unnecessary: we don't need to create 'null' chardevs, and we
don't need to enforce a bounds check on serial_hd().
Peter Maydell [Fri, 20 Apr 2018 14:52:44 +0000 (15:52 +0100)]
Remove checks on MAX_SERIAL_PORTS that are just bounds checks
Remove checks on MAX_SERIAL_PORTS that were just checking whether
they were within bounds for the serial_hds[] array and falling
back to NULL if not. This isn't needed with the serial_hd()
function, which returns NULL for all indexes beyond what the
user set up.
Peter Maydell [Fri, 20 Apr 2018 14:52:43 +0000 (15:52 +0100)]
Change references to serial_hds[] to serial_hd()
Change all the uses of serial_hds[] to go via the new
serial_hd() function. Code change produced with:
find hw -name '*.[ch]' | xargs sed -i -e 's/serial_hds\[\([^]]*\)\]/serial_hd(\1)/g'
Peter Maydell [Fri, 20 Apr 2018 14:52:42 +0000 (15:52 +0100)]
vl.c: Provide accessor function serial_hd() for serial_hds[] array
Provide an accessor function serial_hd() to return the Chardev
(if any) associated with the numbered serial port. This will
be used to replace direct accesses to the serial_hds[] array,
so that calling code doesn't need to care about the size of
that array.
Peter Maydell [Fri, 20 Apr 2018 14:52:41 +0000 (15:52 +0100)]
hw/xtensa/xtfpga.c: Don't create "null" chardevs for serial devices
Following commit 12051d82f004024, UART devices should handle
being passed a NULL pointer chardev, so we don't need to
create "null" backends in board code. Remove the code that
does this and updates serial_hds[].
Peter Maydell [Fri, 20 Apr 2018 14:52:40 +0000 (15:52 +0100)]
hw/mips/mips_malta: Don't create "null" chardevs for serial devices
Following commit 12051d82f004024, UART devices should handle
being passed a NULL pointer chardev, so we don't need to
create "null" backends in board code. Remove the code that
does this and updates serial_hds[].
Peter Maydell [Fri, 20 Apr 2018 14:52:39 +0000 (15:52 +0100)]
hw/mips/boston.c: Don't create "null" chardevs for serial devices
Following commit 12051d82f004024, UART devices should handle
being passed a NULL pointer chardev, so we don't need to
create "null" backends in board code. Remove the code that
does this and updates serial_hds[].
Peter Maydell [Fri, 20 Apr 2018 14:52:38 +0000 (15:52 +0100)]
hw/arm/fsl-imx*: Don't create "null" chardevs for serial devices
Following commit 12051d82f004024, UART devices should handle
being passed a NULL pointer chardev, so we don't need to
create "null" backends in board code. Remove the code that
does this and updates serial_hds[].
Peter Maydell [Fri, 20 Apr 2018 14:52:37 +0000 (15:52 +0100)]
hw/char/serial: Allow disconnected chardevs
Currently the serial.c realize code has an explicit check that it is not
connected to a disconnected backend (ie one with a NULL chardev).
This isn't what we want -- you should be able to create a serial device
even if it isn't attached to anything. Remove the check.
Peter Maydell [Thu, 26 Apr 2018 10:56:57 +0000 (11:56 +0100)]
Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20180426' into staging
target-arm queue:
* xilinx_spips: Correct SNOOP_NONE state when flushing the txfifo
* timer/aspeed: fix vmstate version id
* hw/arm/aspeed_soc: don't use vmstate_register_ram_global for SRAM
* hw/arm/aspeed: don't make 'boot_rom' region 'nomigrate'
* hw/arm/highbank: don't make sysram 'nomigrate'
* hw/arm/raspi: Don't bother setting default_cpu_type
* PMU emulation: some minor bugfixes and preparation for
support of other events than just the cycle counter
* target/arm: Use v7m_stack_read() for reading the frame signature
* target/arm: Remove stale TODO comment
* arm: always start from first_cpu when registering loader cpu reset callback
* device_tree: Increase FDT_MAX_SIZE to 1 MiB
* remotes/pmaydell/tags/pull-target-arm-20180426:
xilinx_spips: Correct SNOOP_NONE state when flushing the txfifo
timer/aspeed: fix vmstate version id
hw/arm/aspeed_soc: don't use vmstate_register_ram_global for SRAM
hw/arm/aspeed: don't make 'boot_rom' region 'nomigrate'
hw/arm/highbank: don't make sysram 'nomigrate'
hw/arm/raspi: Don't bother setting default_cpu_type
target/arm: Make PMOVSCLR and PMUSERENR 64 bits wide
target/arm: Fix bitmask for PMCCFILTR writes
target/arm: Allow EL change hooks to do IO
target/arm: Add pre-EL change hooks
target/arm: Support multiple EL change hooks
target/arm: Fetch GICv3 state directly from CPUARMState
target/arm: Mask PMU register writes based on PMCR_EL0.N
target/arm: Treat PMCCNTR as alias of PMCCNTR_EL0
target/arm: Check PMCNTEN for whether PMCCNTR is enabled
target/arm: Use v7m_stack_read() for reading the frame signature
target/arm: Remove stale TODO comment
arm: always start from first_cpu when registering loader cpu reset callback
device_tree: Increase FDT_MAX_SIZE to 1 MiB
Peter Maydell [Thu, 26 Apr 2018 10:48:20 +0000 (11:48 +0100)]
Open 2.13 development tree
Unfortunately I forgot to do this before applying the merge
in commit 8e383d19b44863556, so that commit will incorrectly
claim to be 2.12 even though it isn't in the official 2.12
release. Oops.
Commit 1d3e65aa7ac5 ("hw/timer: Add value matching support to
aspeed_timer") increased the vmstate version of aspeed.timer because
the state had changed, but it also bumped the version of the
VMSTATE_STRUCT_ARRAY under the aspeed.timerctrl, which did not need to change.
Peter Maydell [Thu, 26 Apr 2018 10:04:39 +0000 (11:04 +0100)]
hw/arm/aspeed_soc: don't use vmstate_register_ram_global for SRAM
Currently we use vmstate_register_ram_global() for the SRAM;
this is not a good idea for devices, because it means that
you can only ever create one instance of the device, as
the second instance would get a RAM block name clash.
Instead, use memory_region_init_ram(), which automatically
registers the RAM block with a local-to-the-device name.
Note that this would be a cross-version migration compatibility break
for the "palmetto-bmc", "ast2500-evb" and "romulus-bmc" machines,
but migration is currently broken for them.
Peter Maydell [Thu, 26 Apr 2018 10:04:39 +0000 (11:04 +0100)]
hw/arm/aspeed: don't make 'boot_rom' region 'nomigrate'
Currently we use memory_region_init_ram_nomigrate() to create
the "aspeed.boot_rom" memory region, and we don't manually
register it with vmstate_register_ram(). This currently
means that its contents are migrated but as a ram block
whose name is the empty string; in future it may mean they
are not migrated at all. Use memory_region_init_ram() instead.
Note that this would be a cross-version migration compatibility break
for the "palmetto-bmc", "ast2500-evb" and "romulus-bmc" machines,
but migration is currently broken for them.
Peter Maydell [Thu, 26 Apr 2018 10:04:39 +0000 (11:04 +0100)]
hw/arm/highbank: don't make sysram 'nomigrate'
Currently we use memory_region_init_ram_nomigrate() to create
the "highbank.sysram" memory region, and we don't manually
register it with vmstate_register_ram(). This currently
means that its contents are migrated but as a ram block
whose name is the empty string; in future it may mean they
are not migrated at all. Use memory_region_init_ram() instead.
Note that this is a cross-version migration compatibility
break for the "highbank" and "midway" machines.
In commit 210f47840dd62, we changed the bcm2836 SoC object to
always create a CPU of the correct type for that SoC model. This
makes the default_cpu_type settings in the MachineClass structs
for the raspi2 and raspi3 boards redundant. We didn't change
those at the time because it would have meant a temporary
regression in a corner case of error handling if the user
requested a non-existing CPU type. The -cpu parse handling
changes in 2278b93941d42c3 mean that it no longer implicitly
depends on default_cpu_type for this to work, so we can now
delete the redundant default_cpu_type fields.
During code generation, surround CPSR writes and exception returns which
call the EL change hooks with gen_io_start/end. The immediate need is
for the PMU to access the clock and icount during EL change to support
mode filtering.
Because the design of the PMU requires that the counter values be
converted between their delta and guest-visible forms for mode
filtering, an additional hook which occurs before the EL is changed is
necessary.
target/arm: Fetch GICv3 state directly from CPUARMState
This eliminates the need for fetching it from el_change_hook_opaque, and
allows for supporting multiple el_change_hooks without having to hack
something together to find the registered opaque belonging to GICv3.
Peter Maydell [Thu, 26 Apr 2018 10:04:38 +0000 (11:04 +0100)]
target/arm: Use v7m_stack_read() for reading the frame signature
In commit 95695effe8caa552b8f2 we changed the v7M/v8M stack
pop code to use a new v7m_stack_read() function that checks
whether the read should fail due to an MPU or bus abort.
We missed one call though, the one which reads the signature
word for the callee-saved register part of the frame.
Peter Maydell [Thu, 26 Apr 2018 10:04:38 +0000 (11:04 +0100)]
target/arm: Remove stale TODO comment
Remove a stale TODO comment -- we have now made the arm_ldl_ptw()
and arm_ldq_ptw() functions propagate physical memory read errors
out to their callers.
Igor Mammedov [Thu, 26 Apr 2018 10:04:38 +0000 (11:04 +0100)]
arm: always start from first_cpu when registering loader cpu reset callback
If arm_load_kernel() were passed a CPU other than first_cpu, QEMU would
end up with the do_cpu_reset() callback only partially registered, leaving
some CPUs without it.
Make sure that do_cpu_reset() is registered for all CPUs by enumerating
CPUs from first_cpu.
(In practice every board that we have was passing us the first CPU
as the boot CPU, either directly or indirectly, so this wasn't
causing incorrect behaviour.)
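The difference between walking from the boot CPU and walking from first_cpu can be sketched with a plain linked list. CpuSketch and both function names are hypothetical; this is not QEMU's CPUState or its reset registration API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU list node; not QEMU's CPUState. */
typedef struct CpuSketch {
    struct CpuSketch *next;
    bool reset_registered;
} CpuSketch;

/* Mark every CPU from `start` to the end of the list; return the count. */
int register_resets_from(CpuSketch *start)
{
    int n = 0;
    for (CpuSketch *c = start; c != NULL; c = c->next) {
        c->reset_registered = true;
        n++;
    }
    return n;
}

/* Fixed behaviour: ignore which CPU boots and always walk from the head. */
int register_resets_all(CpuSketch *first_cpu, CpuSketch *boot_cpu)
{
    assert(first_cpu != NULL);
    (void)boot_cpu;
    return register_resets_from(first_cpu);
}
```

Starting the walk at a CPU in the middle of the list skips every CPU before it, which is the partial registration the commit guards against.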
# gpg: Signature made Wed 25 Apr 2018 20:21:13 BST
# gpg: using RSA key 0516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <[email protected]>"
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A 9FA9 0516 331E BC5B FDE7
* remotes/dgilbert/tags/pull-migration-20180425a:
migration: remove ram_save_compressed_page()
migration: introduce save_normal_page()
migration: move calling save_zero_page to the common place
migration: move calling control_save_page to the common place
migration: move some code to ram_save_host_page
migration: introduce control_save_page()
migration: detect compression and decompression errors
migration: stop decompression to allocate and free memory frequently
migration: stop compression to allocate and free memory frequently
migration: stop compressing page in migration thread
migration: add postcopy total blocktime into query-migrate
migration: add blocktime calculation into migration-test
migration: postcopy_blocktime documentation
migration: calculate vCPU blocktime on dst side
migration: add postcopy blocktime ctx into MigrationIncomingState
migration: introduce postcopy-blocktime capability
Xiao Guangrong [Fri, 30 Mar 2018 07:51:28 +0000 (15:51 +0800)]
migration: remove ram_save_compressed_page()
Now we can reuse the path in ram_save_page() to post the page out
as normal. The only thing remaining in ram_save_compressed_page()
is the compression itself, which we can move out to the caller.
Xiao Guangrong [Fri, 30 Mar 2018 07:51:24 +0000 (15:51 +0800)]
migration: move some code to ram_save_host_page
Move some code from ram_save_target_page() to ram_save_host_page()
to make it more readable for later patches that dramatically
clean up ram_save_target_page().
Xiao Guangrong [Fri, 30 Mar 2018 07:51:22 +0000 (15:51 +0800)]
migration: detect compression and decompression errors
Currently the page being compressed is allowed to be updated by
the VM on the source QEMU; correspondingly, the destination QEMU
just ignores decompression errors. However, this means we completely
miss the chance to catch real errors, and the VM is then corrupted silently.
To make migration more robust, we copy the page to a buffer
first to avoid it being written by the VM, then detect and properly
handle both compression and decompression errors.
Xiao Guangrong [Fri, 30 Mar 2018 07:51:21 +0000 (15:51 +0800)]
migration: stop decompression to allocate and free memory frequently
Current code uses uncompress() to decompress memory, which manages
memory internally. That causes a large amount of memory to be allocated
and freed very frequently. Worse, frequently returning memory to the
kernel will flush TLBs.
So we maintain the memory ourselves and reuse it for each
decompression.
Xiao Guangrong [Fri, 30 Mar 2018 07:51:20 +0000 (15:51 +0800)]
migration: stop compression to allocate and free memory frequently
Current code uses compress2() to compress memory, which manages memory
internally. That causes a large amount of memory to be allocated and
freed very frequently.
Worse, frequently returning memory to the kernel will flush TLBs
and trigger invalidation callbacks on mmu-notification, which
interacts with the KVM MMU. That dramatically reduces the performance
of the VM.
So we maintain the memory ourselves and reuse it for each
compression.
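The buffer-reuse idea can be sketched independently of zlib. In the sketch below a toy run-length encoder stands in for the real compressor, and WorkspaceSketch and all function names are hypothetical. The point is only that the working memory is allocated once and that the per-page call never allocates or frees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread workspace, allocated once and reused. */
typedef struct {
    uint8_t *buf;
    size_t cap;
} WorkspaceSketch;

int workspace_init(WorkspaceSketch *ws, size_t cap)
{
    ws->buf = malloc(cap);
    ws->cap = ws->buf ? cap : 0;
    return ws->buf ? 0 : -1;
}

void workspace_destroy(WorkspaceSketch *ws)
{
    free(ws->buf);
    ws->buf = NULL;
    ws->cap = 0;
}

/*
 * Toy run-length encoder standing in for the real compressor. It writes
 * (count, byte) pairs into the reused workspace and never allocates.
 * Returns the encoded length; 0 means empty input or workspace too small.
 */
size_t encode_page_sketch(WorkspaceSketch *ws, const uint8_t *page, size_t len)
{
    size_t out = 0;
    assert(ws->buf != NULL);
    for (size_t i = 0; i < len; ) {
        size_t run = 1;
        while (i + run < len && run < 255 && page[i + run] == page[i]) {
            run++;
        }
        if (out + 2 > ws->cap) {
            return 0;
        }
        ws->buf[out++] = (uint8_t)run;
        ws->buf[out++] = page[i];
        i += run;
    }
    return out;
}
```

Each compression thread would call workspace_init() once at start-up and workspace_destroy() once at shutdown, so no memory is returned to the kernel between pages.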
Alexey Perevalov [Thu, 22 Mar 2018 18:17:27 +0000 (21:17 +0300)]
migration: add postcopy total blocktime into query-migrate
Postcopy total blocktime is available on the destination side only,
but query-migrate was possible only on the source. This patch
adds the ability to call query-migrate on the destination.
To see postcopy blocktime, the postcopy-blocktime capability must be
requested.
The query-migrate command will show the following sample result:
{"return": {
"postcopy-vcpu-blocktime": [115, 100],
"status": "completed",
"postcopy-blocktime": 100
}}
postcopy-vcpu-blocktime contains a list, where the first item corresponds to
the first vCPU in QEMU.
This patch has a drawback: it combines the states of incoming and
outgoing migration. Ongoing migration state will overwrite incoming
state. It would probably be better to separate query-migrate for incoming and
outgoing migration, or to add a parameter to indicate the type of migration.
Alexey Perevalov [Thu, 22 Mar 2018 18:17:24 +0000 (21:17 +0300)]
migration: calculate vCPU blocktime on dst side
This patch provides blocktime calculation per vCPU,
both as a summary and as an overlapped value for all vCPUs.
This approach was suggested by Peter Xu as an improvement on the
previous approach, where QEMU kept a tree with the faulted page address and a
CPU bitmask in it. Now QEMU keeps an array with the faulted page address as
the value and the vCPU as the index. This helps to find the proper vCPU at
UFFD_COPY time. It also keeps a list of blocktime per vCPU (which could be
traced with page_fault_addr).
Blocktime will not be calculated if the postcopy_blocktime field of
MigrationIncomingState wasn't initialized.
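A simplified sketch of the array-indexed-by-vCPU bookkeeping described above. It tracks only the per-vCPU total and omits the overlapped value across all vCPUs. The names, the fixed vCPU count, and the millisecond timestamps passed in by the caller are all assumptions, and this is not the code in postcopy-ram.c.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_VCPUS 4

/* Indexed by vCPU: address the vCPU is blocked on (0 = not blocked). */
static uint64_t fault_addr[SKETCH_MAX_VCPUS];
static uint32_t fault_time_ms[SKETCH_MAX_VCPUS];   /* when it blocked */
static uint32_t blocktime_ms[SKETCH_MAX_VCPUS];    /* accumulated total */

/* A vCPU faulted on `addr` at time `now_ms`. */
void mark_fault_sketch(int vcpu, uint64_t addr, uint32_t now_ms)
{
    assert(vcpu >= 0 && vcpu < SKETCH_MAX_VCPUS && addr != 0);
    fault_addr[vcpu] = addr;
    fault_time_ms[vcpu] = now_ms;
}

/* The page at `addr` was copied at `now_ms`: unblock every waiter on it. */
void mark_copied_sketch(uint64_t addr, uint32_t now_ms)
{
    assert(addr != 0);
    for (int i = 0; i < SKETCH_MAX_VCPUS; i++) {
        if (fault_addr[i] == addr) {
            blocktime_ms[i] += now_ms - fault_time_ms[i];
            fault_addr[i] = 0;
        }
    }
}

/* Total time the given vCPU has spent blocked so far. */
uint32_t vcpu_blocktime_sketch(int vcpu)
{
    assert(vcpu >= 0 && vcpu < SKETCH_MAX_VCPUS);
    return blocktime_ms[vcpu];
}
```

Finding the waiting vCPUs at copy time is a linear scan of the array here, which replaces the tree lookup of the earlier approach.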
Alexey Perevalov [Thu, 22 Mar 2018 18:17:23 +0000 (21:17 +0300)]
migration: add postcopy blocktime ctx into MigrationIncomingState
This patch adds a request to kernel space for UFFD_FEATURE_THREAD_ID, in
case this feature is provided by the kernel.
PostcopyBlocktimeContext is encapsulated inside postcopy-ram.c,
because it is a postcopy-only feature.
It also defines the lifetime of the PostcopyBlocktimeContext instance.
Information from the PostcopyBlocktimeContext instance will be provided
long after postcopy migration ends, so the instance
will live until QEMU exits, but the part of it (vcpu_addr,
page_fault_vcpu_time) used only during calculation will be released
when postcopy ends or fails.
To enable postcopy blocktime calculation on the destination, the
proper capability must be requested (the patch for documentation will be at
the tail of the patch set).
As an example, the following command enables that capability, assuming QEMU was
started with the
-chardev socket,id=charmonitor,path=/var/lib/migrate-vm-monitor.sock
option to control it
Right now it can be used on the destination side to
enable vCPU blocktime calculation for postcopy live migration.
vCPU blocktime is the time from when the vCPU thread was put into
interruptible sleep until the memory page was copied and the thread woke up.
Unfortunately this fix regresses console handling on MIPS Malta;
since the mux ctrl-a b bug is not a regression since 2.11, we
take the conservative approach and just drop it from 2.12.
Without bounding the increment, we can overflow exp either here
in scalbn_decomposed or when adding the bias in round_canonical.
This can result in e.g. underflowing to 0 instead of overflowing
to infinity.
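The bounding step can be sketched as a clamp applied before the addition. The function name and the bound of 0x10000 are assumptions chosen for illustration (the bound is assumed to exceed the exponent range of the formats involved); this is not the softfloat code itself.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative bound, assumed larger than any supported exponent range. */
#define SKETCH_EXP_BOUND 0x10000

/*
 * Add a scaling increment n to an exponent, clamping n first so the
 * addition cannot overflow int even for n == INT_MAX or INT_MIN.
 */
int scale_exp_sketch(int exp, int n)
{
    assert(exp > -SKETCH_EXP_BOUND && exp < SKETCH_EXP_BOUND);
    if (n > SKETCH_EXP_BOUND) {
        n = SKETCH_EXP_BOUND;
    } else if (n < -SKETCH_EXP_BOUND) {
        n = -SKETCH_EXP_BOUND;
    }
    return exp + n;
}
```

After clamping, a huge positive n still pushes the exponent far past the format's maximum, so later rounding sees an overflow rather than a wrapped value.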
Peter Maydell [Mon, 16 Apr 2018 15:19:23 +0000 (16:19 +0100)]
linux-user: check that all of AArch64 SVE extended sigframe is writable
In commit 8c5931de0ac7738809 we added support for SVE extended
sigframe records. These mean that the signal frame might now be
larger than the size of the target_rt_sigframe record, so make sure
we call lock_user on the entire frame size when we're creating it.
(The code for restoring the signal frame already correctly handles
the extended records by locking the 'extra' section separately to the
main section.)
In particular, this fixes a bug even for non-SVE signal frames,
because it extends the locked section to cover the
target_rt_frame_record. Previously this was part of 'struct
target_rt_sigframe', but in commit e1eecd1d9d4c1ade3 we pulled
it out into its own struct, and so locking the target_rt_sigframe
alone doesn't cover it. This bug would mean that we would fail
to correctly handle the case where a signal was taken with
SP pointing 16 bytes into an unwritable page, with the page
immediately below it in memory being writable.