hw/arm/msf2-som: Exit when the cpu is not the expected one
This machine correctly defines its default_cpu_type to cortex-m3
and report an error if the user requested another cpu_type,
however it does not exit, and this can confuse users trying
to use another core:
$ qemu-system-arm -M emcraft-sf2 -cpu cortex-m4 -kernel test-m4.elf
qemu-system-arm: This board can only be used with CPU cortex-m3-arm-cpu
[output related to M3 core ...]
Andrew Jones [Mon, 1 Jul 2019 16:26:14 +0000 (17:26 +0100)]
hw/arm/boot: fix direct kernel boot with initrd
Fix the condition used to check whether the initrd fits
into RAM; in some cases if an initrd was also passed on
the command line we would get an error stating that it
was too big to fit into RAM after the kernel. Despite the
error the loader continued anyway, though, so also add an
exit(1) when the initrd is actually too big.
* remotes/vivier2/tags/linux-user-for-4.1-pull-request:
linux-user: set default PPC64 CPU
linux-user: update PPC64 HWCAP2 feature list
linux-user: Add support for setsockopt() options IPV6_<ADD|DROP>_MEMBERSHIP
linux-user: Add support for setsockopt() option SOL_ALG
linux-user: emulate msgsnd(), msgrcv() and semtimedop()
util/path: Do not cache all filenames at startup
Peter Maydell [Mon, 1 Jul 2019 13:39:45 +0000 (14:39 +0100)]
Merge remote-tracking branch 'remotes/amarkovic/tags/mips-queue-jun-26-2019' into staging
MIPS queue for June 2016th, 2019
# gpg: Signature made Wed 26 Jun 2019 12:38:58 BST
# gpg: using RSA key D4972A8967F75A65
# gpg: Good signature from "Aleksandar Markovic <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8526 FBF1 5DA3 811F 4A01 DD75 D497 2A89 67F7 5A65
* remotes/amarkovic/tags/mips-queue-jun-26-2019:
target/mips: Fix big endian host behavior for interleave MSA instructions
tests/tcg: target/mips: Fix some test cases for pack MSA instructions
tests/tcg: target/mips: Add support for MSA MIPS32R6 testings
tests/tcg: target/mips: Add support for MSA big-endian target testings
tests/tcg: target/mips: Amend tests for MSA int multiply instructions
tests/tcg: target/mips: Amend tests for MSA int dot product instructions
tests/tcg: target/mips: Add tests for MSA move instructions
tests/tcg: target/mips: Add tests for MSA bit move instructions
dma/rc4030: Minor code style cleanup
dma/rc4030: Fix off-by-one error in specified memory region size
hw/mips/gt64xxx_pci: Align the pci0-mem size
hw/mips/gt64xxx_pci: Convert debug printf()s to trace events
hw/mips/gt64xxx_pci: Use qemu_log_mask() instead of debug printf()
hw/mips/gt64xxx_pci: Fix 'spaces' coding style issues
hw/mips/gt64xxx_pci: Fix 'braces' coding style issues
hw/mips/gt64xxx_pci: Fix 'tabs' coding style issues
hw/mips/gt64xxx_pci: Fix multiline comment syntax
Peter Maydell [Mon, 1 Jul 2019 12:03:51 +0000 (13:03 +0100)]
Merge remote-tracking branch 'remotes/aperard/tags/pull-xen-20190624' into staging
Xen queue
* Fix build
* xen-block: support feature-large-sector-size
* xen-block: Support IOThread polling for PV shared rings
* Avoid usage of a VLA
* Cleanup Xen headers usage
# gpg: Signature made Mon 24 Jun 2019 16:30:32 BST
# gpg: using RSA key F80C006308E22CFD8A92E7980CF5572FD7FB55AF
# gpg: issuer "[email protected]"
# gpg: Good signature from "Anthony PERARD <[email protected]>" [marginal]
# gpg: aka "Anthony PERARD <[email protected]>" [marginal]
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 5379 2F71 024C 600F 778A 7161 D8D5 7199 DF83 42C8
# Subkey fingerprint: F80C 0063 08E2 2CFD 8A92 E798 0CF5 572F D7FB 55AF
* remotes/aperard/tags/pull-xen-20190624:
xen: Import other xen/io/*.h
Revert xen/io/ring.h of "Clean up a few header guard symbols"
xen: Drop includes of xen/hvm/params.h
xen: Avoid VLA
xen-bus / xen-block: add support for event channel polling
xen-bus: allow AioContext to be specified for each event channel
xen-bus: use a separate fd for each event channel
xen-block: support feature-large-sector-size
Peter Maydell [Mon, 1 Jul 2019 10:28:28 +0000 (11:28 +0100)]
Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-06-24' into staging
Block patches:
- The SSH block driver now uses libssh instead of libssh2
- The VMDK block driver gets read-only support for the seSparse
subformat
- Various fixes
* remotes/maxreitz/tags/pull-block-2019-06-24:
iotests: Fix 205 for concurrent runs
ssh: switch from libssh2 to libssh
vmdk: Add read-only support for seSparse snapshots
vmdk: Reduce the max bound for L1 table size
vmdk: Fix comment regarding max l1_size coverage
iotest 134: test cluster-misaligned encrypted write
blockdev: enable non-root nodes for transaction drive-backup source
nvme: do not advertise support for unsupported arbitration mechanism
target/mips: Fix big endian host behavior for interleave MSA instructions
Fix big endian host behavior for interleave MSA instructions. Previous
fix used TARGET_WORDS_BIGENDIAN instead of HOST_WORDS_BIGENDIAN, which
was a mistake.
Neng Chen [Wed, 19 Jun 2019 14:17:10 +0000 (16:17 +0200)]
linux-user: Add support for setsockopt() options IPV6_<ADD|DROP>_MEMBERSHIP
Add support for the option IPV6_<ADD|DROP>_MEMBERSHIP of the syscall
setsockopt(). This option controls membership in multicast groups.
Argument is a pointer to a struct ipv6_mreq.
The glibc <netinet/in.h> header defines the ipv6_mreq structure,
which includes the following members:
struct in6_addr ipv6mr_multiaddr;
unsigned int ipv6mr_interface;
Whereas the kernel in its <linux/in6.h> header defines following
members of the same structure:
struct in6_addr ipv6mr_multiaddr;
int ipv6mr_ifindex;
POSIX defines ipv6mr_interface [1].
__UAPI_DEF_IVP6_MREQ appears in kernel headers with v3.12:
Without __UAPI_DEF_IVP6_MREQ, kernel defines ipv6mr_ifindex, and
this is explained in cfd280c91253:
"If you include the kernel headers first you get those,
and if you include the glibc headers first you get those,
and the following patch arranges a coordination and
synchronization between the two."
So before 3.12, a program can't include both <netinet/in.h> and
<linux/in6.h>.
In linux-user/syscall.c, we only include <netinet/in.h> (glibc) and
not <linux/in6.h> (kernel headers), so ipv6mr_interface is the one
to use.
Yunqiang Su [Wed, 19 Jun 2019 14:17:11 +0000 (16:17 +0200)]
linux-user: Add support for setsockopt() option SOL_ALG
Add support for options SOL_ALG of the syscall setsockopt(). This
option is used in relation to Linux kernel Crypto API, and allows
a user to set additional information for the cipher operation via
syscall setsockopt(). The field "optname" must be one of the
following:
- ALG_SET_KEY – seting the key
- ALG_SET_AEAD_AUTHSIZE – set the authentication tag size
SOL_ALG is relatively newer setsockopt() option. Therefore, the
code that handles SOL_ALG is enclosed in "ifdef" so that the build
does not fail for older kernels that do not contain support for
SOL_ALG. "ifdef" also contains check if ALG_SET_KEY and
ALG_SET_AEAD_AUTHSIZE are defined.
Laurent Vivier [Wed, 29 May 2019 08:48:04 +0000 (10:48 +0200)]
linux-user: emulate msgsnd(), msgrcv() and semtimedop()
When we have updated kernel headers to 5.2-rc1 we have introduced
new syscall numbers that can be not supported by older kernels
and fail with ENOSYS while the guest emulation succeeded before
because the syscalls were emulated with ipc().
This patch fixes the problem by using ipc() if the new syscall
returns ENOSYS.
If one uses -L $PATH to point to a full chroot, the startup time
is significant. In addition, the existing probing algorithm fails
to handle symlink loops.
Instead, probe individual paths on demand. Cache both positive
and negative results within $PATH, so that any one filename is
probed only once.
Max Reitz [Tue, 18 Jun 2019 21:02:38 +0000 (23:02 +0200)]
iotests: Fix 205 for concurrent runs
Tests should place their files into the test directory. This includes
Unix sockets. 205 currently fails to do so, which prevents it from
being run concurrently.
Pino Toscano [Thu, 20 Jun 2019 20:08:40 +0000 (22:08 +0200)]
ssh: switch from libssh2 to libssh
Rewrite the implementation of the ssh block driver to use libssh instead
of libssh2. The libssh library has various advantages over libssh2:
- easier API for authentication (for example for using ssh-agent)
- easier API for known_hosts handling
- supports newer types of keys in known_hosts
Use APIs/features available in libssh 0.8 conditionally, to support
older versions (which are not recommended though).
Adjust the iotest 207 according to the different error message, and to
find the default key type for localhost (to properly compare the
fingerprint with). Contributed-by: Max Reitz <[email protected]>
Adjust the various Docker/Travis scripts to use libssh when available
instead of libssh2. The mingw/mxe testing is dropped for now, as there
are no packages for it.
Sam Eiderman [Thu, 20 Jun 2019 09:10:57 +0000 (12:10 +0300)]
vmdk: Add read-only support for seSparse snapshots
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).
This format was lacking in the following:
* Grain directory (L1) and grain table (L2) entries were 32-bit,
allowing access to only 2TB (slightly less) of data.
* The grain size (default) was 512 bytes - leading to data
fragmentation and many grain tables.
* For space reclamation purposes, it was necessary to find all the
grains which are not pointed to by any grain table - so a reverse
mapping of "offset of grain in vmdk" to "grain table" must be
constructed - which takes large amounts of CPU/RAM.
The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).
This format fixes the above issues:
* All entries are now 64-bit.
* The grain size (default) is 4KB.
* Grain directory and grain tables are now located at the beginning
of the file.
+ seSparse format reserves space for all grain tables.
+ Grain tables can be addressed using an index.
+ Grains are located in the end of the file and can also be
addressed with an index.
- seSparse vmdks of large disks (64TB) have huge preallocated
headers - mainly due to L2 tables, even for empty snapshots.
* The header contains a reverse mapping ("backmap") of "offset of
grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
specifies for each grain - whether it is allocated or not.
Using these data structures we can implement space reclamation
efficiently.
* Due to the fact that the header now maintains two mappings:
* The regular one (grain directory & grain tables)
* A reverse one (backmap and free bitmap)
These data structures can lose consistency upon crash and result
in a corrupted VMDK.
Therefore, a journal is also added to the VMDK and is replayed
when the VMware reopens the file after a crash.
Since ESXi 6.7 - SESparse is the only snapshot format available.
Unfortunately, VMware does not provide documentation regarding the new
seSparse format.
This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.
The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:
/var/log/hostd.log
Whenever an seSparse snapshot is created - the log is being populated
with seSparse records.
The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>
This commit is a strict implementation which enforces:
* magics
* version number 2.1
* grain size of 8 sectors (4KB)
* grain table size of 64 sectors
* zero flags
* extent locations
Additionally, this commit proivdes only a subset of the functionality
offered by seSparse's format:
* Read-only
* No journal replay
* No space reclamation
* No unmap support
Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.
However there are several differences in the grain access itself.
Grain directory (L1):
* Grain directory entries are indexes (not offsets) to grain
tables.
* Valid grain directory entries have their highest nibble set to
0x1.
* Since grain tables are always located in the beginning of the
file - the index can fit into 32 bits - so we can use its low
part if it's valid.
Grain table (L2):
* Grain table entries are indexes (not offsets) to grains.
* If the highest nibble of the entry is:
0x0:
The grain in not allocated.
The rest of the bytes are 0.
0x1:
The grain is unmapped - guest sees a zero grain.
The rest of the bits point to the previously mapped grain,
see 0x3 case.
0x2:
The grain is zero.
0x3:
The grain is allocated - to get the index calculate:
((entry & 0x0fff000000000000) >> 48) |
((entry & 0x0000ffffffffffff) << 12)
* The difference between 0x1 and 0x2 is that 0x1 is an unallocated
grain which results from the guest using sg_unmap to unmap the
grain - but the grain itself still exists in the grain extent - a
space reclamation procedure should delete it.
Unmapping a zero grain has no effect (0x2 will not change to 0x1)
but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.
Sam Eiderman [Thu, 20 Jun 2019 09:10:55 +0000 (12:10 +0300)]
vmdk: Fix comment regarding max l1_size coverage
Commit b0651b8c246d ("vmdk: Move l1_size check into vmdk_add_extent")
extended the l1_size check from VMDK4 to VMDK3 but did not update the
default coverage in the moved comment.
In any case, VMware does not offer virtual disks more than 2TB for
vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is
not implemented yet in qemu.
blockdev: enable non-root nodes for transaction drive-backup source
We forget to enable it for transaction .prepare, while it is already
enabled in do_drive_backup since commit a2d665c1bc362
"blockdev: loosen restrictions on drive-backup source node"
Anthony PERARD [Fri, 21 Jun 2019 10:54:41 +0000 (11:54 +0100)]
xen: Import other xen/io/*.h
A Xen public header have been imported into QEMU (by f65eadb639 "xen: import ring.h from xen"), but there are other header
that depends on ring.h which come from the system when building QEMU.
This patch resolves the issue of having headers from the system
importing a different copie of ring.h.
This patch is prompt by the build issue described in the previous
patch: 'Revert xen/io/ring.h of "Clean up a few header guard symbols"'
ring.h and the new imported headers are moved to
"include/hw/xen/interface" as those describe interfaces with a guest.
The imported headers are cleaned up a bit while importing them: some
part of the file that QEMU doesn't use are removed (description
of how to make hypercall in grant_table.h have been removed).
Other cleanup:
- xen-mapcache.c and xen-legacy-backend.c don't need grant_table.h.
- xenfb.c doesn't need event_channel.h.
Following 37677d7db3 "Clean up a few header guard symbols", QEMU start
to fail to build:
In file included from ~/xen/tools/../tools/include/xen/io/blkif.h:31:0,
from ~/xen/tools/qemu-xen-dir/hw/block/xen_blkif.h:5,
from ~/xen/tools/qemu-xen-dir/hw/block/xen-block.c:22:
~/xen/tools/../tools/include/xen/io/ring.h:68:0: error: "__CONST_RING_SIZE" redefined [-Werror]
#define __CONST_RING_SIZE(_s, _sz) \
In file included from ~/xen/tools/qemu-xen-dir/hw/block/xen_blkif.h:4:0,
from ~/xen/tools/qemu-xen-dir/hw/block/xen-block.c:22:
~/xen/tools/qemu-xen-dir/include/hw/xen/io/ring.h:66:0: note: this is the location of the previous definition
#define __CONST_RING_SIZE(_s, _sz) \
The issue is that some public xen headers have been imported (by f65eadb639 "xen: import ring.h from xen") but not all. With the change
in the guards symbole, the ring.h header start to be imported twice.
Anthony PERARD [Tue, 18 Jun 2019 11:23:40 +0000 (12:23 +0100)]
xen: Drop includes of xen/hvm/params.h
xen-mapcache.c doesn't needs params.h.
xen-hvm.c uses defines available in params.h but so is xen_common.h
which is included before. HVM_PARAM_* flags are only needed to make
xc_hvm_param_{get,set} calls so including only xenctrl.h, which is
where the definition the function is, should be enough.
(xenctrl.h does include params.h)
Paul Durrant [Mon, 8 Apr 2019 15:16:17 +0000 (16:16 +0100)]
xen-bus / xen-block: add support for event channel polling
This patch introduces a poll callback for event channel fd-s and uses
this to invoke the channel callback function.
To properly support polling, it is necessary for the event channel callback
function to return a boolean saying whether it has done any useful work or
not. Thus xen_block_dataplane_event() is modified to directly invoke
xen_block_handle_requests() and the latter only returns true if it actually
processes any requests. This also means that the call to qemu_bh_schedule()
is moved into xen_block_complete_aio(), which is more intuitive since the
only reason for doing a deferred poll of the shared ring should be because
there were previously insufficient resources to fully complete a previous
poll.
Paul Durrant [Mon, 8 Apr 2019 15:16:16 +0000 (16:16 +0100)]
xen-bus: allow AioContext to be specified for each event channel
This patch adds an AioContext parameter to xen_device_bind_event_channel()
and then uses aio_set_fd_handler() to set the callback rather than
qemu_set_fd_handler().
Paul Durrant [Mon, 8 Apr 2019 15:16:15 +0000 (16:16 +0100)]
xen-bus: use a separate fd for each event channel
To better support use of IOThread-s it will be necessary to be able to set
the AioContext for each XenEventChannel and hence it is necessary to open a
separate handle to libxenevtchan for each channel.
This patch stops using NotifierList for event channel callbacks, replacing
that construct by a list of complete XenEventChannel structures. Each of
these now has a xenevtchn_handle pointer in place of the single pointer
previously held in the XenDevice structure. The individual handles are
opened/closed in xen_device_bind/unbind_event_channel(), replacing the
single open/close in xen_device_realize/unrealize().
NOTE: This patch does not add an AioContext parameter to
xen_device_bind_event_channel(). That will be done in a subsequent
patch.
Paul Durrant [Tue, 9 Apr 2019 16:40:38 +0000 (17:40 +0100)]
xen-block: support feature-large-sector-size
A recent Xen commit [1] clarified the semantics of sector based quantities
used in the blkif protocol such that it is now safe to create a xen-block
device with a logical_block_size != 512, as long as the device only
connects to a frontend advertizing 'feature-large-block-size'.
This patch modifies xen-block accordingly. It also uses a stack variable
for the BlockBackend in xen_block_realize() to avoid repeated dereferencing
of the BlockConf pointer, and changes the parameters of
xen_block_dataplane_create() so that the BlockBackend pointer and sector
size are passed expicitly rather than implicitly via the BlockConf.
These modifications have been tested against a recent Windows PV XENVBD
driver [2] using a xen-disk device with a 4kB logical block size.
Peter Maydell [Fri, 21 Jun 2019 14:40:50 +0000 (15:40 +0100)]
Merge remote-tracking branch 'remotes/amarkovic/tags/mips-queue-jun-21-2019' into staging
MIPS queue for June 21st, 2019
# gpg: Signature made Fri 21 Jun 2019 10:46:57 BST
# gpg: using RSA key D4972A8967F75A65
# gpg: Good signature from "Aleksandar Markovic <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8526 FBF1 5DA3 811F 4A01 DD75 D497 2A89 67F7 5A65
* remotes/amarkovic/tags/mips-queue-jun-21-2019:
target/mips: Fix emulation of ILVR.<B|H|W> on big endian host
target/mips: Fix emulation of ILVL.<B|H|W> on big endian host
target/mips: Fix emulation of ILVOD.<B|H|W> on big endian host
target/mips: Fix emulation of ILVEV.<B|H|W> on big endian host
tests/tcg: target/mips: Amend tests for MSA pack instructions
tests/tcg: target/mips: Include isa/ase and group name in test output
target/mips: Fix if-else-switch-case arms checkpatch errors in translate.c
target/mips: Fix some space checkpatch errors in translate.c
MAINTAINERS: Consolidate MIPS disassembler-related items
MAINTAINERS: Update file items for MIPS Malta board
* remotes/bonzini/tags/for-upstream: (25 commits)
hw: Nuke hw_compat_4_0_1 and pc_compat_4_0_1
util/main-loop: Fix incorrect assertion
sd: Fix out-of-bounds assertions
target/i386: kvm: Add nested migration blocker only when kernel lacks required capabilities
target/i386: kvm: Add support for KVM_CAP_EXCEPTION_PAYLOAD
target/i386: kvm: Add support for save and restore nested state
vmstate: Add support for kernel integer types
linux-headers: sync with latest KVM headers from Linux 5.2
target/i386: kvm: Block migration for vCPUs exposed with nested virtualization
target/i386: kvm: Re-inject #DB to guest with updated DR6
target/i386: kvm: Use symbolic constant for #DB/#BP exception constants
KVM: Introduce kvm_arch_destroy_vcpu()
target/i386: kvm: Delete VMX migration blocker on vCPU init failure
target/i386: define a new MSR based feature word - FEAT_CORE_CAPABILITY
i386/kvm: add support for Direct Mode for Hyper-V synthetic timers
i386/kvm: hv-evmcs requires hv-vapic
i386/kvm: hv-tlbflush/ipi require hv-vpindex
i386/kvm: hv-stimer requires hv-time and hv-synic
i386/kvm: implement 'hv-passthrough' mode
i386/kvm: document existing Hyper-V enlightenments
...
Greg Kurz [Fri, 14 Jun 2019 13:09:02 +0000 (15:09 +0200)]
hw: Nuke hw_compat_4_0_1 and pc_compat_4_0_1
Commit c87759ce876a fixed a regression affecting pc-q35 machines by
introducing a new pc-q35-4.0.1 machine version to be used instead
of pc-q35-4.0. The only purpose was to revert the default behaviour
of not using split irqchip, but the change also introduced the usual
hw_compat and pc_compat bits, and wired them for pc-q35 only.
This raises questions when it comes to add new compat properties for
4.0* machine versions of any architecture. Where to add them ? In
4.0, 4.0.1 or both ? Error prone. Another possibility would be to teach
all other architectures about 4.0.1. This solution isn't satisfying,
especially since this is a pc-q35 specific issue.
It turns out that the split irqchip default is handled in the machine
option function and doesn't involve compat lists at all.
Drop all the 4.0.1 compat lists and use the 4.0 ones instead in the 4.0.1
machine option function.
Move the compat props that were added to the 4.0.1 since c87759ce876a to
4.0.
Even if only hw_compat_4_0_1 had an impact on other architectures,
drop pc_compat_4_0_1 as well for consistency.
Lidong Chen [Wed, 19 Jun 2019 19:14:47 +0000 (15:14 -0400)]
util/main-loop: Fix incorrect assertion
The check for poll_fds in g_assert() was incorrect. The correct assertion
should check "n_poll_fds + w->num <= ARRAY_SIZE(poll_fds)" because the
subsequent for-loop is doing access to poll_fds[n_poll_fds + i] where i
is in [0, w->num). This could happen with a very high number of file
descriptors and/or wait objects.
Lidong Chen [Wed, 19 Jun 2019 19:14:46 +0000 (15:14 -0400)]
sd: Fix out-of-bounds assertions
Due to an off-by-one error, the assert statements allow an
out-of-bound array access. This doesn't happen in practice,
but the static analyzer notices.
Liran Alon [Wed, 19 Jun 2019 16:21:40 +0000 (19:21 +0300)]
target/i386: kvm: Add nested migration blocker only when kernel lacks required capabilities
Previous commits have added support for migration of nested virtualization
workloads. This was done by utilising two new KVM capabilities:
KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD. Both which are
required in order to correctly migrate such workloads.
Therefore, change code to add a migration blocker for vCPUs exposed with
Intel VMX or AMD SVM in case one of these kernel capabilities is
missing.
Liran Alon [Wed, 19 Jun 2019 16:21:39 +0000 (19:21 +0300)]
target/i386: kvm: Add support for KVM_CAP_EXCEPTION_PAYLOAD
Kernel commit c4f55198c7c2 ("kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD")
introduced a new KVM capability which allows userspace to correctly
distinguish between pending and injected exceptions.
This distinguish is important in case of nested virtualization scenarios
because a L2 pending exception can still be intercepted by the L1 hypervisor
while a L2 injected exception cannot.
Furthermore, when an exception is attempted to be injected by QEMU,
QEMU should specify the exception payload (CR2 in case of #PF or
DR6 in case of #DB) instead of having the payload already delivered in
the respective vCPU register. Because in case exception is injected to
L2 guest and is intercepted by L1 hypervisor, then payload needs to be
reported to L1 intercept (VMExit handler) while still preserving
respective vCPU register unchanged.
This commit adds support for QEMU to properly utilise this new KVM
capability (KVM_CAP_EXCEPTION_PAYLOAD).
Liran Alon [Wed, 19 Jun 2019 16:21:38 +0000 (19:21 +0300)]
target/i386: kvm: Add support for save and restore nested state
Kernel commit 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
introduced new IOCTLs to extract and restore vCPU state related to
Intel VMX & AMD SVM.
Utilize these IOCTLs to add support for migration of VMs which are
running nested hypervisors.
Liran Alon [Wed, 19 Jun 2019 16:21:36 +0000 (19:21 +0300)]
linux-headers: sync with latest KVM headers from Linux 5.2
Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
of VMX nested state data in a struct.
In order to avoid changing the ioctl values of
KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
sizeof(struct kvm_nested_state). This is done by defining the data
struct as "data.vmx[0]". It was the most elegant way I found to
preserve struct size while still keeping struct readable and easy to
maintain. It does have a misfortunate side-effect that now it has to be
accessed as "data.vmx[0]" rather than just "data.vmx".
Because we are already modifying these structs, I also modified the
following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.
Liran Alon [Wed, 19 Jun 2019 16:21:35 +0000 (19:21 +0300)]
target/i386: kvm: Block migration for vCPUs exposed with nested virtualization
Commit d98f26073beb ("target/i386: kvm: add VMX migration blocker")
added a migration blocker for vCPU exposed with Intel VMX.
However, migration should also be blocked for vCPU exposed with
AMD SVM.
Both cases should be blocked because QEMU should extract additional
vCPU state from KVM that should be migrated as part of vCPU VMState.
E.g. Whether vCPU is running in guest-mode or host-mode.
tests/tcg: target/mips: Include isa/ase and group name in test output
For better appearance and usefullnes, include ISA/ASE name and
instruction group name in the output of tests. For example, all
this data will be displayed for FMAX_A.W test:
| MSA | Float Max Min | FMAX_A.W |
| PASS: 80 | FAIL: 0 | elapsed time: 0.16 ms |
(the data will be displayed in one row; they are presented here in two
rows not to exceed the width of the commit message)
Liran Alon [Wed, 19 Jun 2019 16:21:34 +0000 (19:21 +0300)]
target/i386: kvm: Re-inject #DB to guest with updated DR6
If userspace (QEMU) debug guest, when #DB is raised in guest and
intercepted by KVM, KVM forwards information on #DB to userspace
instead of injecting #DB to guest.
While doing so, KVM don't update vCPU DR6 but instead report the #DB DR6
value to userspace for further handling.
See KVM's handle_exception() DB_VECTOR handler.
QEMU handler for this case is kvm_handle_debug(). This handler basically
checks if #DB is related to one of user set hardware breakpoints and if
not, it re-inject #DB into guest.
The re-injection is done by setting env->exception_injected to #DB which
will later be passed as events.exception.nr to KVM_SET_VCPU_EVENTS ioctl
by kvm_put_vcpu_events().
However, in case userspace re-injects #DB, KVM expects userspace to set
vCPU DR6 as reported to userspace when #DB was intercepted! Otherwise,
KVM_REQ_EVENT handler will inject #DB with wrong DR6 to guest.
Fix this issue by updating vCPU DR6 appropriately when re-inject #DB to
guest.
Liran Alon [Wed, 19 Jun 2019 16:21:32 +0000 (19:21 +0300)]
KVM: Introduce kvm_arch_destroy_vcpu()
Simiar to how kvm_init_vcpu() calls kvm_arch_init_vcpu() to perform
arch-dependent initialisation, introduce kvm_arch_destroy_vcpu()
to be called from kvm_destroy_vcpu() to perform arch-dependent
destruction.
This was added because some architectures (Such as i386)
currently do not free memory that it have allocated in
kvm_arch_init_vcpu().
Liran Alon [Wed, 19 Jun 2019 16:21:31 +0000 (19:21 +0300)]
target/i386: kvm: Delete VMX migration blocker on vCPU init failure
Commit d98f26073beb ("target/i386: kvm: add VMX migration blocker")
added migration blocker for vCPU exposed with Intel VMX because QEMU
doesn't yet contain code to support migration of nested virtualization
workloads.
However, that commit missed adding deletion of the migration blocker in
case init of vCPU failed. Similar to invtsc_mig_blocker. This commit fix
that issue.
Vitaly Kuznetsov [Fri, 17 May 2019 14:19:24 +0000 (16:19 +0200)]
i386/kvm: add support for Direct Mode for Hyper-V synthetic timers
Hyper-V on KVM can only use Synthetic timers with Direct Mode (opting for
an interrupt instead of VMBus message). This new capability is only
announced in KVM_GET_SUPPORTED_HV_CPUID.
Vitaly Kuznetsov [Fri, 17 May 2019 14:19:20 +0000 (16:19 +0200)]
i386/kvm: implement 'hv-passthrough' mode
In many case we just want to give Windows guests all currently supported
Hyper-V enlightenments and that's where this new mode may come handy. We
pass through what was returned by KVM_GET_SUPPORTED_HV_CPUID.
hv_cpuid_check_and_set() is modified to also set cpu->hyperv_* flags as
we may want to check them later (and we actually do for hv_runtime,
hv_synic,...).
'hv-passthrough' is a development only feature, a migration blocker is
added to prevent issues while migrating between hosts with different
feature sets.
Currently, there is no doc describing hv-* CPU flags, people are
encouraged to get the information from Microsoft Hyper-V Top Level
Functional specification (TLFS). There is, however, a bit of QEMU
specifics.
Vitaly Kuznetsov [Fri, 17 May 2019 14:19:18 +0000 (16:19 +0200)]
i386/kvm: move Hyper-V CPUID filling to hyperv_handle_properties()
Let's consolidate Hyper-V features handling in hyperv_handle_properties().
The change is necessary to support 'hv-passthrough' mode as we'll be just
copying CPUIDs from KVM instead of filling them in.
Vitaly Kuznetsov [Fri, 17 May 2019 14:19:17 +0000 (16:19 +0200)]
i386/kvm: add support for KVM_GET_SUPPORTED_HV_CPUID
KVM now supports reporting supported Hyper-V features through CPUID
(KVM_GET_SUPPORTED_HV_CPUID ioctl). Going forward, this is going to be
the only way to announce new functionality and this has already happened
with Direct Mode stimers.
While we could just support KVM_GET_SUPPORTED_HV_CPUID for new features,
it seems to be beneficial to use it for all Hyper-V enlightenments when
possible. This way we can implement 'hv-all' pass-through mode giving the
guest all supported Hyper-V features even when QEMU knows nothing about
them.
Implementation-wise we create a new kvm_hyperv_properties structure
defining Hyper-V features, get_supported_hv_cpuid()/
get_supported_hv_cpuid_legacy() returning the supported CPUID set and
a bit over-engineered hv_cpuid_check_and_set() which we will also be
used to set cpu->hyperv_* properties for 'hv-all' mode.
Colin Xu [Mon, 10 Jun 2019 02:19:39 +0000 (10:19 +0800)]
hax: Honor CPUState::halted
QEMU tracks whether a vcpu is halted using CPUState::halted. E.g.,
after initialization or reset, halted is 0 for the BSP (vcpu 0)
and 1 for the APs (vcpu 1, 2, ...). A halted vcpu should not be
handed to the hypervisor to run (e.g. hax_vcpu_run()).
Under HAXM, Android Emulator sometimes boots into a "vcpu shutdown
request" error while executing in SeaBIOS, with the HAXM driver
logging a guest triple fault in vcpu 1, 2, ... at RIP 0x3. That is
ultimately because the HAX accelerator asks HAXM to run those APs
when they are still in the halted state.
Normally, the vcpu thread for an AP will start by looping in
qemu_wait_io_event(), until the BSP kicks it via a pair of IPIs
(INIT followed by SIPI). But because the HAX accelerator does not
honor cpu->halted, it allows the AP vcpu thread to proceed to
hax_vcpu_run() as soon as it receives any kick, even if the kick
does not come from the BSP. It turns out that emulator has a
worker thread which periodically kicks every vcpu thread (possibly
to collect CPU usage data), and if one of these kicks comes before
those by the BSP, the AP will start execution from the wrong RIP,
resulting in the aforementioned SMP boot failure.
The solution is inspired by the KVM accelerator (credit to
Chuanxiao Dong <[email protected]> for the pointer):
1. Get rid of questionable logic that unconditionally resets
cpu->halted before hax_vcpu_run(). Instead, only reset it at the
right moments (there are only a few "unhalt" events).
2. Add a check for cpu->halted before hax_vcpu_run().
Note that although the non-Unrestricted Guest (!ug_platform) code
path also forcibly resets cpu->halted, it is left untouched,
because only the UG code path supports SMP guests.
The patch is first merged to android emulator with Change-Id:
I9c5752cc737fd305d7eace1768ea12a07309d716
# gpg: Signature made Tue 18 Jun 2019 15:44:12 BST
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <[email protected]>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
block/null: Expose read-zeroes option in QAPI schema
iotests: Test failure to loosen restrictions
block: Ignore loosening perm restrictions failures
block: Add *tighten_restrictions to *check*_perm()
block: Fix order in bdrv_replace_child()
block/commit: Drop bdrv_child_try_set_perm()
block/mirror: Fix child permissions
block: Add bdrv_child_refresh_perms()
file-posix: Update open_flags in raw_set_perm()
block: drop bs->job
blockdev: blockdev_mark_auto_del: drop usage of bs->job
block/block-backend: blk_iostatus_reset: drop usage of bs->job
block/replication: drop usage of bs->job
iotests: Hide timestamps for skipped tests
Peter Maydell [Tue, 18 Jun 2019 14:47:16 +0000 (15:47 +0100)]
Merge remote-tracking branch 'remotes/ehabkost/tags/python-next-pull-request' into staging
Python queue, 2019-06-18
Use a different method to dump avocado job log, to work around
timing-dependent issues in the arm test cases.
# gpg: Signature made Tue 18 Jun 2019 15:39:31 BST
# gpg: using RSA key 2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>" [full]
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/python-next-pull-request:
Travis: print acceptance tests logs in case of job failure
Revert "travis: Make check-acceptance job more verbose"
Kevin Wolf [Mon, 17 Jun 2019 11:54:48 +0000 (13:54 +0200)]
block/null: Expose read-zeroes option in QAPI schema
Commit cd219eb1e55 added the read-zeroes option for the null-co and
null-aio block driver, but forgot to add them to the QAPI schema.
Therefore, this option wasn't available in -blockdev and blockdev-add
until now.
Add the missing option in the schema to make it available there, too.
We generally assume that loosening permission restrictions can never
fail. We have seen in the past that this assumption is wrong. This has
led to crashes because we generally pass &error_abort when loosening
permissions.
However, a failure in such a case should actually be handled in quite
the opposite way: It is very much not fatal, so qemu may report it, but
still consider the operation successful. The only realistic problem is
that qemu may then retain permissions and thus locks on images it
actually does not require. But again, that is not fatal.
To implement this behavior, we make all functions that change
permissions and that pass &error_abort to the initiating function
(bdrv_check_perm() or bdrv_child_check_perm()) evaluate the
@loosen_restrictions value introduced in the previous patch. If it is
true and an error did occur, we abort the permission update, discard the
error, and instead report success to the caller.
bdrv_child_try_set_perm() itself does not pass &error_abort, but it is
the only public function to change permissions. As such, callers may
pass &error_abort to it, expecting dropping permission restrictions to
never fail.
Max Reitz [Wed, 22 May 2019 17:03:50 +0000 (19:03 +0200)]
block: Add *tighten_restrictions to *check*_perm()
This patch makes three functions report whether the necessary permission
change tightens restrictions or not. These functions are:
- bdrv_check_perm()
- bdrv_check_update_perm()
- bdrv_child_check_perm()
Callers can use this result to decide whether a failure is fatal or not
(see the next patch).
Max Reitz [Wed, 22 May 2019 17:03:47 +0000 (19:03 +0200)]
block/mirror: Fix child permissions
We cannot use bdrv_child_try_set_perm() to give up all restrictions on
the child edge, and still have bdrv_mirror_top_child_perm() request
BLK_PERM_WRITE. Fix this by making bdrv_mirror_top_child_perm() return
0/BLK_PERM_ALL when we want to give up all permissions, and replacing
bdrv_child_try_set_perm() by bdrv_child_refresh_perms().
The bdrv_child_try_set_perm() before removing the node with
bdrv_replace_node() is then unnecessary. No permissions have changed
since the previous invocation of bdrv_child_try_set_perm().