Igor Mammedov [Fri, 11 May 2018 17:24:43 +0000 (19:24 +0200)]
cli: add --preconfig option
This option allows pausing QEMU in the new RUN_STATE_PRECONFIG state,
allowing the configuration of QEMU from QMP before the machine jumps
into board initialization code of machine_run_board_init()
The intent is to allow management to query machine state and additionally
configure it using previous query results within one QEMU instance
(i.e. eliminate the need to start QEMU twice, 1st to query board specific
parameters and 2nd for actual VM start using query results for
additional parameters).
The new option complements -S option and could be used with or without
it. The difference is that -S pauses QEMU when the machine is completely
initialized with all devices wired up and ready to execute guest code
(QEMU needs only to unpause VCPUs to let guest execute its code),
while the "preconfig" option pauses QEMU early before board specific init
callback (machine_run_board_init) is executed and allows the configuration
of machine parameters which will be used by board init code.
When early introspection/configuration is done, command 'exit-preconfig'
should be used to exit RUN_STATE_PRECONFIG and transition to the next
requested state (i.e. if -S is used then QEMU will pause the second
time when board/device initialization is completed or start guest
execution if -S isn't provided on CLI)
PS:
Initially 'preconfig' is planned to be used for configuring numa
topology depending on board specified possible cpus layout.
Igor Mammedov [Fri, 11 May 2018 16:51:43 +0000 (18:51 +0200)]
qapi: introduce new cmd option "allow-preconfig"
New option will be used to allow commands, which are prepared/need
to run, during preconfig state. Other commands that should be able
to run in preconfig state, should be amended to not expect machine
in initialized state or deal with it.
For compatibility reasons, commands that don't use new flag
'allow-preconfig' explicitly are not permitted to run in
preconfig state but allowed in all other states like they used
to be.
Within this patch allow following commands in preconfig state:
qmp_capabilities
query-qmp-schema
query-commands
query-command-line-options
query-status
exit-preconfig
to allow qmp connection, basic introspection and moving to the next
state.
PS:
set-numa-node and query-hotpluggable-cpus will be enabled later in
a separate patches.
Igor Mammedov [Fri, 4 May 2018 08:37:41 +0000 (10:37 +0200)]
qapi: introduce preconfig runstate
New preconfig runstate will be used in follow up patches
related to introducing --preconfig CLI option and is
intended to replace prelaunch runstate from QEMU start
up to machine_init callback.
Igor Mammedov [Fri, 4 May 2018 08:37:39 +0000 (10:37 +0200)]
numa: postpone options post-processing till machine_run_board_init()
in preparation for numa options to being handled via QMP before
machine_run_board_init(), move final numa configuration checks
and processing to machine_run_board_init() so it could take into
account both CLI (via parse_numa_opts()) and QMP input
Igor Mammedov [Wed, 16 May 2018 15:06:14 +0000 (17:06 +0200)]
numa: clarify error message when node index is out of range in -numa dist, ...
When using following CLI:
-numa dist,src=128,dst=1,val=20
user gets a rather confusing error message:
"Invalid node 128, max possible could be 128"
Where 128 is number of nodes that QEMU supports (MAX_NODES),
while src/dst is an index up to that limit, so it should be
MAX_NODES - 1 in error message.
Make error message to explicitly state valid range for node
index to be more clear.
Peter Maydell [Tue, 29 May 2018 08:57:09 +0000 (09:57 +0100)]
Merge remote-tracking branch 'remotes/stefanberger/tags/pull-tpm-2018-05-23-4' into staging
Merge tpm 2018/05/23 v4
# gpg: Signature made Sat 26 May 2018 03:52:12 BST
# gpg: using RSA key 75AD65802A0B4211
# gpg: Good signature from "Stefan Berger <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B818 B9CA DF90 89C2 D5CE C66B 75AD 6580 2A0B 4211
* remotes/stefanberger/tags/pull-tpm-2018-05-23-4:
test: Add test cases that use the external swtpm with CRB interface
docs: tpm: add VM save/restore example and troubleshooting guide
tpm: extend TPM TIS with state migration support
tpm: extend TPM emulator with state migration support
target-microblaze: mmu: Remove unused register state
Add explicit handling for MMU_R_TLBX and log accesses to
invalid MMU registers. We can now remove the state for
all regs but PID, ZPR and TLBX (0 - 2).
Break out trap_userspace() to avoid open coding it everywhere.
For privileged insns, we now always stop translation of the
current insn for cores without exceptions.
target-microblaze: Conditionalize setting of PVR11_USE_MMU
Conditionalize setting of PVR11_USE_MMU on the use_mmu
CPU property, otherwise we may incorrectly advertise an
MMU via PVR when the core in fact has none.
Stefan Berger [Wed, 7 Mar 2018 19:45:06 +0000 (14:45 -0500)]
test: Add test cases that use the external swtpm with CRB interface
Add a test program for testing the CRB with the external swtpm.
The 1st test case extends a PCR and reads back the value and compares
it against an expected return packet.
The 2nd test case repeats the 1st test case and then migrates the
external swtpm's state along with the VM state to a destination
QEMU and swtpm and checks that the PCR has the expected value now.
The test cases require 'swtpm' to be installed on the system and
in the PATH and 'swtpm' must support the --tpm2 option. If this is
not the case, the test will be skipped.
Peter Xu [Fri, 25 May 2018 01:50:42 +0000 (09:50 +0800)]
migration: use g_free for ram load bitmap
Buffers allocated with bitmap_new() should be freed with g_free().
Both reported by Coverity:
*** CID 1391300: API usage errors (ALLOC_FREE_MISMATCH)
/migration/ram.c: 3517 in ram_dirty_bitmap_reload()
3511 * the last one to sync, we need to notify the main send thread.
3512 */
3513 ram_dirty_bitmap_reload_notify(s);
3514
3515 ret = 0;
3516 out:
>>> CID 1391300: API usage errors (ALLOC_FREE_MISMATCH)
>>> Calling "free" frees "le_bitmap" using "free" but it should have been freed using "g_free".
3517 free(le_bitmap);
3518 return ret;
3519 }
3520
3521 static int ram_resume_prepare(MigrationState *s, void *opaque)
3522 {
*** CID 1391292: API usage errors (ALLOC_FREE_MISMATCH)
/migration/ram.c: 249 in ramblock_recv_bitmap_send()
243 * Mark as an end, in case the middle part is screwed up due to
244 * some "misterious" reason.
245 */
246 qemu_put_be64(file, RAMBLOCK_RECV_BITMAP_ENDING);
247 qemu_fflush(file);
248
>>> CID 1391292: API usage errors (ALLOC_FREE_MISMATCH)
>>> Calling "free" frees "le_bitmap" using "free" but it should have been freed using "g_free".
249 free(le_bitmap);
250
251 if (qemu_file_get_error(file)) {
252 return qemu_file_get_error(file);
253 }
254
* remotes/vivier2/tags/linux-user-for-2.13-pull-request:
gdbstub: Clarify what gdb_handlesig() is doing
linux-user: define TARGET_SO_REUSEPORT
linux-user: copy sparc/sockbits.h definitions from linux
linux-user: update ARCH_HAS_SOCKET_TYPES use
linux-user: move ppc socket.h definitions to ppc/sockbits.h
linux-user: move socket.h generic definitions to generic/sockbits.h
linux-user: move sparc/sparc64 socket.h definitions to sparc/sockbits.h
linux-user: move alpha socket.h definitions to alpha/sockbits.h
linux-user: move mips socket.h definitions to mips/sockbits.h
linux-user: Fix payload size logic in host_to_target_cmsg()
linux-user: update comments to point to tcg_exec_init()
linux-user: update netlink emulation
linux-user: Assert on bad type in thunk_type_align() and thunk_type_size()
Peter Maydell [Tue, 15 May 2018 18:19:58 +0000 (19:19 +0100)]
gdbstub: Clarify what gdb_handlesig() is doing
gdb_handlesig()'s behaviour is not entirely obvious at first
glance. Add a doc comment for it, and also add a comment
explaining why it's ok for gdb_do_syscallv() to ignore
gdb_handlesig()'s return value. (Coverity complains about
this: CID 1390850.)
Peter Maydell [Fri, 18 May 2018 18:47:15 +0000 (19:47 +0100)]
linux-user: Fix payload size logic in host_to_target_cmsg()
Coverity points out that there's a missing break in the switch in
host_to_target_cmsg() where we update tgt_len for
cmsg_level/cmsg_type combinations which require a different length
for host and target (CID 1385425). To avoid duplicating the default
case (target length same as host) in both switches, set that before
the switch so that only the cases which want to override it need any
code.
This fixes a bug where we would have used the wrong length
for SOL_SOCKET/SO_TIMESTAMP messages where the target and
host have differently sized 'struct timeval' (ie one is 32
bit and the other is 64 bit).
Igor Mammedov [Thu, 17 May 2018 11:51:17 +0000 (13:51 +0200)]
linux-user: update comments to point to tcg_exec_init()
cpu_init() was replaced by cpu_create() since 2.12 but comments
weren't updated. So update stale comments to point that page
sizes arei actually initialized by tcg_exec_init(). Also move
another qemu_host_page_size related comment before tcg_exec_init()
where it belongs.
Peter Maydell [Mon, 14 May 2018 17:46:16 +0000 (18:46 +0100)]
linux-user: Assert on bad type in thunk_type_align() and thunk_type_size()
In thunk_type_align() and thunk_type_size() we currently return
-1 if the value at the type_ptr isn't one of the TYPE_* values
we understand. However, this should never happen, and if it does
then the calling code will go confusingly wrong because none
of the callsites try to handle an error return. Switch to an
assertion instead, so that if this does somehow happen we'll have
a nice clear backtrace of what happened rather than a weird crash
or misbehaviour.
This also silences various Coverity complaints about not handling
the negative return value (CID 1005735, 1005736, 1005738, 1390582).
Stefan Berger [Wed, 11 Oct 2017 14:36:53 +0000 (10:36 -0400)]
tpm: extend TPM TIS with state migration support
Extend the TPM TIS interface with state migration support.
We need to synchronize with the backend thread to make sure that a command
being processed by the external TPM emulator has completed and its
response been received.
Stefan Berger [Wed, 11 Oct 2017 14:36:53 +0000 (10:36 -0400)]
tpm: extend TPM emulator with state migration support
Extend the TPM emulator backend device with state migration support.
The external TPM emulator 'swtpm' provides a protocol over
its control channel to retrieve its state blobs. We implement
functions for getting and setting the different state blobs.
In case the setting of the state blobs fails, we return a
negative errno code to fail the start of the VM.
Since we have an external TPM emulator, we need to make sure
that we do not migrate the state for as long as it is busy
processing a request. We need to wait for notification that
the request has completed processing.
Peter Maydell [Thu, 24 May 2018 13:22:23 +0000 (14:22 +0100)]
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc, pci, virtio, vhost: fixes, features
Beginning of merging vDPA, new PCI ID, a new virtio balloon stat, intel
iommu rework fixing a couple of security problems (no CVEs yet), fixes
all over the place.
Signed-off-by: Michael S. Tsirkin <[email protected]>
# gpg: Signature made Wed 23 May 2018 15:41:32 BST
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <[email protected]>"
# gpg: aka "Michael S. Tsirkin <[email protected]>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (28 commits)
intel-iommu: rework the page walk logic
util: implement simple iova tree
intel-iommu: trace domain id during page walk
intel-iommu: pass in address space when page walk
intel-iommu: introduce vtd_page_walk_info
intel-iommu: only do page walk for MAP notifiers
intel-iommu: add iommu lock
intel-iommu: remove IntelIOMMUNotifierNode
intel-iommu: send PSI always even if across PDEs
nvdimm: fix typo in label-size definition
contrib/vhost-user-blk: enable protocol feature for vhost-user-blk
hw/virtio: Fix brace Werror with clang 6.0.0
libvhost-user: Send messages with no data
vhost-user+postcopy: Use qemu_set_nonblock
virtio: support setting memory region based host notifier
vhost-user: support receiving file descriptors in slave_read
vhost-user: add Net prefix to internal state structure
linux-headers: add kvm header for mips
linux-headers: add unistd.h on all arches
update-linux-headers.sh: unistd.h, kvm consistency
...
Peter Maydell [Thu, 24 May 2018 10:30:59 +0000 (11:30 +0100)]
Merge remote-tracking branch 'remotes/sstabellini-http/tags/xen-20180522-tag' into staging
Xen 2018/05/22
# gpg: Signature made Tue 22 May 2018 19:44:06 BST
# gpg: using RSA key 894F8F4870E1AE90
# gpg: Good signature from "Stefano Stabellini <[email protected]>"
# gpg: aka "Stefano Stabellini <[email protected]>"
# Primary key fingerprint: D04E 33AB A51F 67BA 07D3 0AEA 894F 8F48 70E1 AE90
* remotes/sstabellini-http/tags/xen-20180522-tag:
xen_disk: be consistent with use of xendev and blkdev->xendev
xen_disk: use a single entry iovec
xen_backend: make the xen_feature_grant_copy flag private
xen_disk: remove use of grant map/unmap
xen_backend: add an emulation of grant copy
xen: remove other open-coded use of libxengnttab
xen_disk: remove open-coded use of libxengnttab
xen_backend: add grant table helpers
xen: add a meaningful declaration of grant_copy_segment into xen_common.h
checkpatch: generalize xen handle matching in the list of types
xen-hvm: create separate function for ioreq server initialization
xen_pt: Present the size of 64 bit BARs correctly
configure: Add explanation for --enable-xen-pci-passthrough
xen/pt: use address_space_memory object for memory region hooks
xen-pvdevice: Introduce a simplistic xen-pvdevice save state
Peter Maydell [Thu, 24 May 2018 09:25:43 +0000 (10:25 +0100)]
Merge remote-tracking branch 'remotes/mwalle/tags/lm32-queue/20180521' into staging
target/lm32: BQL patch
# gpg: Signature made Tue 22 May 2018 19:25:30 BST
# gpg: using RSA key B458ABB0D8D378E3
# gpg: Good signature from "Michael Walle <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 2190 3E48 4537 A7C2 90CE 3EB2 B458 ABB0 D8D3 78E3
* remotes/mwalle/tags/lm32-queue/20180521:
lm32: take BQL before writing IP/IM register
Gerd Hoffmann [Tue, 22 May 2018 16:50:55 +0000 (18:50 +0200)]
hw/display: add new bochs-display device
After writing up the virtual mdev device emulating a display supporting
the bochs vbe dispi interface (mbochs.ko) and seeing how simple it
actually is I've figured that would be useful for qemu too.
So, here it is, -device bochs-display. It is basically -device VGA
without legacy vga emulation. PCI bar 0 is the framebuffer, PCI bar 2
is mmio with the registers. The vga registers are simply not there
though, neither in the legacy ioport location nor in the mmio bar.
Consequently it is PCI class DISPLAY_OTHER not DISPLAY_VGA.
So there is no text mode emulation, no weird video modes (planar,
256color palette), no memory window at 0xa0000. Just a linear
framebuffer in the pci memory bar. And the amount of code to emulate
this (and therefore the attack surface) is an order of magnitude smaller
when compared to vga emulation.
Compatibility wise it works with OVMF (latest git master).
The bochs-drm.ko linux kernel module can handle it just fine too.
So UEFI guests should not see any functional difference to VGA.
Gerd Hoffmann [Mon, 14 May 2018 10:31:17 +0000 (12:31 +0200)]
vga: catch depth 0
depth == 0 is used to indicate 256 color modes. Our region calculation
goes wrong in that case. So detect that and just take the safe code
path we already have for the wraparound case.
While being at it also catch depth == 15 (where our region size
calculation goes wrong too). And make the comment more verbose,
explaining what is going on here.
Without this windows guest install might trigger an assert due to trying
to check dirty bitmap outside the snapshot region.
Peter Xu [Fri, 18 May 2018 07:25:17 +0000 (15:25 +0800)]
intel-iommu: rework the page walk logic
This patch fixes a potential small window that the DMA page table might
be incomplete or invalid when the guest sends domain/context
invalidations to a device. This can cause random DMA errors for
assigned devices.
This is a major change to the VT-d shadow page walking logic. It
includes but is not limited to:
- For each VTDAddressSpace, now we maintain what IOVA ranges we have
mapped and what we have not. With that information, now we only send
MAP or UNMAP when necessary. Say, we don't send MAP notifies if we
know we have already mapped the range, meanwhile we don't send UNMAP
notifies if we know we never mapped the range at all.
- Introduce vtd_sync_shadow_page_table[_range] APIs so that we can call
in any places to resync the shadow page table for a device.
- When we receive domain/context invalidation, we should not really run
the replay logic, instead we use the new sync shadow page table API to
resync the whole shadow page table without unmapping the whole
region. After this change, we'll only do the page walk once for each
domain invalidations (before this, it can be multiple, depending on
number of notifiers per address space).
While at it, the page walking logic is also refactored to be simpler.
Peter Xu [Fri, 18 May 2018 07:25:15 +0000 (15:25 +0800)]
intel-iommu: trace domain id during page walk
This patch only modifies the trace points.
Previously we were tracing page walk levels. They are redundant since
we have page mask (size) already. Now we trace something much more
useful which is the domain ID of the page walking. That can be very
useful when we trace more than one devices on the same system, so that
we can know which map is for which domain.
Peter Xu [Fri, 18 May 2018 07:25:13 +0000 (15:25 +0800)]
intel-iommu: introduce vtd_page_walk_info
During the recursive page walking of IOVA page tables, some stack
variables are constant variables and never changed during the whole page
walking procedure. Isolate them into a struct so that we don't need to
pass those contants down the stack every time and multiple times.
Peter Xu [Fri, 18 May 2018 07:25:12 +0000 (15:25 +0800)]
intel-iommu: only do page walk for MAP notifiers
For UNMAP-only IOMMU notifiers, we don't need to walk the page tables.
Fasten that procedure by skipping the page table walk. That should
boost performance for UNMAP-only notifiers like vhost.
Peter Xu [Fri, 18 May 2018 07:25:11 +0000 (15:25 +0800)]
intel-iommu: add iommu lock
SECURITY IMPLICATION: this patch fixes a potential race when multiple
threads access the IOMMU IOTLB cache.
Add a per-iommu big lock to protect IOMMU status. Currently the only
thing to be protected is the IOTLB/context cache, since that can be
accessed even without BQL, e.g., in IO dataplane.
Note that we don't need to protect device page tables since that's fully
controlled by the guest kernel. However there is still possibility that
malicious drivers will program the device to not obey the rule. In that
case QEMU can't really do anything useful, instead the guest itself will
be responsible for all uncertainties.
Peter Xu [Fri, 18 May 2018 07:25:10 +0000 (15:25 +0800)]
intel-iommu: remove IntelIOMMUNotifierNode
That is not really necessary. Removing that node struct and put the
list entry directly into VTDAddressSpace. It simplfies the code a lot.
Since at it, rename the old notifiers_list into vtd_as_with_notifiers.
Peter Xu [Fri, 18 May 2018 07:25:09 +0000 (15:25 +0800)]
intel-iommu: send PSI always even if across PDEs
SECURITY IMPLICATION: without this patch, any guest with both assigned
device and a vIOMMU might encounter stale IO page mappings even if guest
has already unmapped the page, which may lead to guest memory
corruption. The stale mappings will only be limited to the guest's own
memory range, so it should not affect the host memory or other guests on
the host.
During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries) or more. In the past, we skip that entry and we don't
notify the IOMMU notifiers. This is not correct. We should send UNMAP
notification to registered UNMAP notifiers in this case.
For UNMAP only notifiers, this might cause IOTLBs cached in the devices
even if they were already invalid. For MAP/UNMAP notifiers like
vfio-pci, this will cause stale page mappings.
This special case doesn't trigger often, but it is very easy to be
triggered by nested device assignments, since in that case we'll
possibly map the whole L2 guest RAM region into the device's IOVA
address space (several GBs at least), which is far bigger than normal
kernel driver usages of the device (tens of MBs normally).
Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:
hw/virtio/vhost-user.c:1319:26: error: suggest braces
around initialization of subobject [-Werror,-Wmissing-braces]
VhostUserMsg msg = { 0 };
^
{}
While the original code is correct, and technically exactly correct
as per ISO C89, both GCC and Clang support plain empty set of braces
as an extension.
The response to a VHOST_USER_POSTCOPY_ADVISE contains a fd but doesn't
actually contain any data. FIx vu_message_write so that it doesn't
do a 0-byte write() call, since this was ending up with rc=0
that was confusing the error handling code.