Pavel Butsykin [Mon, 18 Sep 2017 12:42:29 +0000 (15:42 +0300)]
qcow2: add shrink image support
This patch add shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. Image can be fragmented and shrink is done by punching holes
in the image file.
Pavel Butsykin [Mon, 18 Sep 2017 12:42:28 +0000 (15:42 +0300)]
qcow2: add qcow2_cache_discard
Whenever l2/refcount table clusters are discarded from the file we can
automatically drop unnecessary content of the cache tables. This reduces
the chance of eviction useful cache data and eliminates inconsistent data
in the cache with the data in the file.
Pavel Butsykin [Mon, 18 Sep 2017 12:42:27 +0000 (15:42 +0300)]
qemu-img: add --shrink flag for resize
The flag is additional precaution against data loss. Perhaps in the future the
operation shrink without this flag will be blocked for all formats, but for now
we need to maintain compatibility with raw.
Kevin Wolf [Fri, 15 Sep 2017 16:56:41 +0000 (18:56 +0200)]
qemu-iotests: Test change-backing-file command
This involves a temporary read-write reopen if the backing file link in
the middle of a backing file chain should be changed and is therefore a
good test for the latest bdrv_reopen() vs. op blockers fixes.
Kevin Wolf [Mon, 3 Jul 2017 15:07:35 +0000 (17:07 +0200)]
block: Fix permissions after bdrv_reopen()
If we switch between read-only and read-write, the permissions that
image format drivers need on bs->file change, too. Make sure to update
the permissions during bdrv_reopen().
Kevin Wolf [Thu, 14 Sep 2017 12:53:46 +0000 (14:53 +0200)]
block: reopen: Queue children after their parents
We will calculate the required new permissions in the prepare stage of a
reopen. Required permissions of children can be influenced by the
changes made to their parents, but parents are independent from their
children. This means that permissions need to be calculated top-down. In
order to achieve this, queue parents before their children rather than
queuing the children first.
Kevin Wolf [Thu, 14 Sep 2017 12:32:04 +0000 (14:32 +0200)]
block: Base permissions on rw state after reopen
When new permissions are calculated during bdrv_reopen(), they need to
be based on the state of the graph as it will be after the reopen has
completed, not on the current state of the involved nodes.
This patch makes bdrv_is_writable() optionally accept a BlockReopenQueue
from which the new flags are taken. This is then used for determining
the new bs->file permissions of format drivers as soon as we add the
code to actually pass a non-NULL reopen queue to the .bdrv_child_perm
callbacks.
While moving bdrv_is_writable(), make it static. It isn't used outside
block.c.
Kevin Wolf [Thu, 14 Sep 2017 12:42:12 +0000 (14:42 +0200)]
block: Add reopen queue to bdrv_check_perm()
In the context of bdrv_reopen(), we'll have to look at the state of the
graph as it will be after the reopen. This interface addition is in
preparation for the change.
Kevin Wolf [Thu, 14 Sep 2017 10:47:11 +0000 (12:47 +0200)]
block: Add reopen_queue to bdrv_child_perm()
In the context of bdrv_reopen(), we'll have to look at the state of the
graph as it will be after the reopen. This interface addition is in
preparation for the change.
Kevin Wolf [Fri, 22 Sep 2017 12:50:12 +0000 (14:50 +0200)]
qemu-io: Drop write permissions before read-only reopen
qemu-io provides a 'reopen' command that allows switching from writable
to read-only access. We need to make sure that we don't try to keep
write permissions to a BlockBackend that becomes read-only, otherwise
things are going to fail.
This requires a bdrv_drain() call because otherwise in-flight AIO
write requests could issue new internal requests while the permission
has already gone away, which would cause assertion failures. Draining
the queue doesn't break AIO requests in any new way, bdrv_reopen() would
drain it anyway only a few lines later.
Thomas Huth [Wed, 13 Sep 2017 10:21:28 +0000 (12:21 +0200)]
block: Clean up some bad code in the vvfat driver
Remove the unnecessary home-grown redefinition of the assert() macro here,
and remove the unusable debug code at the end of the checkpoint() function.
The code there uses assert() with side-effects (assignment to the "mapping"
variable), which should be avoided. Looking more closely, it seems as it is
apparently also only usable for one certain directory layout (with a file
named USB.H in it) and thus is of no use for the rest of the world.
block/throttle-groups.c: allocate RestartData on the heap
RestartData is the opaque data of the throttle_group_restart_queue_entry
coroutine. By being stack allocated, it isn't available anymore if
aio_co_enter schedules the coroutine with a bottom half and runs after
throttle_group_restart_queue returns.
Alberto Garcia [Wed, 13 Sep 2017 08:28:17 +0000 (11:28 +0300)]
throttle: Assert that bkt->max is valid in throttle_compute_wait()
If bkt->max == 0 and bkt->burst_length > 1 then we could have a
division by 0 in throttle_do_compute_wait(). That configuration is
however not permitted and is already detected by throttle_is_valid(),
but let's assert it in throttle_compute_wait() to make it explicit.
The default cpu model on s390x does not provide zPCI, which is
not yet wired up on tcg. Moreover, virtio-ccw is the standard
on s390x, so use the -ccw instead of the -pci versions of virtio
devices on s390x.
The default cpu model on s390x does not provide zPCI, which is
not yet wired up on tcg. Moreover, virtio-ccw is the standard
on s390x, so use the -ccw instead of the -pci versions of virtio
devices on s390x.
Stefan Hajnoczi [Fri, 8 Sep 2017 08:39:41 +0000 (09:39 +0100)]
docs: add qemu-block-drivers(7) man page
Block driver documentation is available in qemu-doc.html. It would be
convenient to have documentation for formats, protocols, and filter
drivers in a man page.
Extract the relevant part of qemu-doc.html into a new file called
docs/qemu-block-drivers.texi. This file can also be built as a
stand-alone document (man, html, etc).
People get surprised when, after "qemu-img create -f raw /dev/sdX", they
still see qcow2 with "qemu-img info", if previously the bdev had a qcow2
header. While this is natural because raw doesn't need to write any
magic bytes during creation, hdev_create is free to clear out the first
sector to make sure the stale qcow2 header doesn't cause such confusion.
Fam Zheng [Fri, 4 Aug 2017 14:36:58 +0000 (22:36 +0800)]
qemu-img: Clarify about relative backing file options
It's not too surprising when a user specifies the backing file relative
to the current working directory instead of the top layer image. This
causes error when they differ. Though the error message has enough
information to infer the fact about the misunderstanding, it is better
if we document this explicitly, so that users don't have to learn from
mistakes.
However, two test cases (172 and 186) overwrite QEMU_OPTIONS and neglect
to manually set '-machine accel=qtest'. Add the missing option for 172.
186 probably only copied the code from 172, it doesn't actually need to
overwrite QEMU_OPTIONS, so remove that in 186.
Peter Maydell [Mon, 25 Sep 2017 19:31:24 +0000 (20:31 +0100)]
Merge remote-tracking branch 'remotes/thibault/tags/samuel-thibault' into staging
slirp updates
# gpg: Signature made Sun 24 Sep 2017 19:07:51 BST
# gpg: using RSA key 0x9E511E01C737F075
# gpg: Good signature from "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: aka "Samuel Thibault <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 900C B024 B679 31D4 0F82 304B D017 8C76 7D06 9EE6
# Subkey fingerprint: 9A37 3D36 64A8 DC62 DA0A 34FD 9E51 1E01 C737 F075
* remotes/thibault/tags/samuel-thibault:
slirp: Add a special case for the NULL socket
slirp: Fix intermittent send queue hangs on a socket
slirp: Add explanation for hostfwd parsing failure
Kevin Cernekee [Wed, 20 Sep 2017 20:42:05 +0000 (13:42 -0700)]
slirp: Add a special case for the NULL socket
NULL sockets are used for NDP, BOOTP, and other critical operations.
If the topmost mbuf in a NULL session is blocked pending resolution,
it may cause problems if it blocks other packets with a NULL socket.
So do not add mbufs with a NULL socket field to the same session.
Kevin Cernekee [Wed, 20 Sep 2017 20:42:04 +0000 (13:42 -0700)]
slirp: Fix intermittent send queue hangs on a socket
if_output() originally sent one mbuf per call and used the slirp->next_m
variable to keep track of where it left off. But nowadays it tries to
send all of the mbufs from the fastq, and one mbuf from each session on
the batchq. The next_m variable is both redundant and harmful: there is
a case[0] involving delayed packets in which next_m ends up pointing
to &slirp->if_batchq when an active session still exists, and this
blocks all traffic for that session until qemu is restarted.
The test case was created to reproduce a problem that was seen on
long-running Chromium OS VM tests[1] which rapidly create and
destroy ssh connections through hostfwd.
* remotes/bonzini/tags/for-upstream: (32 commits)
chardev: remove context in chr_update_read_handler
chardev: use per-dev context for io_add_watch_poll
chardev: add Chardev.gcontext field
chardev: new qemu_chr_be_update_read_handlers()
scsi: add persistent reservation manager using qemu-pr-helper
scsi: add multipath support to qemu-pr-helper
scsi: build qemu-pr-helper
scsi, file-posix: add support for persistent reservation management
memory: Share special empty FlatView
memory: seek FlatView sharing candidates among children subregions
memory: trace FlatView creation and destruction
memory: Create FlatView directly
memory: Get rid of address_space_init_shareable
memory: Rework "info mtree" to print flat views and dispatch trees
memory: Do not allocate FlatView in address_space_init
memory: Share FlatView's and dispatch trees between address spaces
memory: Move address_space_update_ioeventfds
memory: Alloc dispatch tree where topology is generared
memory: Store physical root MR in FlatView
memory: Rename mem_begin/mem_commit/mem_add helpers
...
Peter Xu [Thu, 21 Sep 2017 06:35:54 +0000 (14:35 +0800)]
chardev: remove context in chr_update_read_handler
We had a per-chardev cache for context, then we don't need this
parameter to be passed in every time when chr_update_read_handler()
called. As long as we are calling chr_update_read_handler() using
qemu_chr_be_update_read_handlers() we'll be fine.
Peter Xu [Thu, 21 Sep 2017 06:35:53 +0000 (14:35 +0800)]
chardev: use per-dev context for io_add_watch_poll
It was only passed in by chr_update_read_handlers(). However when
reconnect, we'll lose that context information. So if a chardev was
running on another context (rather than the default context, the NULL
pointer), it'll switch back to the default context if reconnection
happens. But, it should really stick to the old context.
Convert all the callers of io_add_watch_poll() to use the internally
cached gcontext. Then the context should be able to survive even after
reconnections.
Peter Xu [Thu, 21 Sep 2017 06:35:52 +0000 (14:35 +0800)]
chardev: add Chardev.gcontext field
It caches the gcontext that is used to poll the chardev IO. Before this
patch, we only passed it in via chr_update_read_handlers(). However
that may not be enough if the char backend is disconnected and
reconnected afterward. There are chardev codes that still assumed the
context be NULL (which is the main context). Will fix that up in
following up patches.
Paolo Bonzini [Tue, 22 Aug 2017 04:50:55 +0000 (06:50 +0200)]
scsi: add multipath support to qemu-pr-helper
Proper support of persistent reservation for multipath devices requires
communication with the multipath daemon, so that the reservation is
registered and applied when a path comes up. The device mapper
utilities provide a library to do so; this patch makes qemu-pr-helper.c
detect multipath devices and, when one is found, delegate the operation
to libmpathpersist.
Paolo Bonzini [Tue, 22 Aug 2017 04:50:18 +0000 (06:50 +0200)]
scsi: build qemu-pr-helper
Introduce a privileged helper to run persistent reservation commands.
This lets virtual machines send persistent reservations without using
CAP_SYS_RAWIO or out-of-tree patches. The helper uses Unix permissions
and SCM_RIGHTS to restrict access to processes that can access its socket
and prove that they have an open file descriptor for a raw SCSI device.
The next patch will also correct the usage of persistent reservations
with multipath devices.
It would also be possible to support for Linux's IOC_PR_* ioctls in
the future, to support NVMe devices. For now, however, only SCSI is
supported.
Not all scripts using qemu.py configure the Python logging
module, and end up generating a "No handlers could be found for
logger" message instead of actual log messages.
To avoid requiring every script using qemu.py to configure
logging manually, call basicConfig() when creating a QEMUMachine
object. This won't affect scripts that already set up logging,
but will ensure that scripts that don't configure logging keep
working.
migration: split ufd_version_check onto receive/request features part
This modification is necessary for userfault fd features which are
required to be requested from userspace.
UFFD_FEATURE_THREAD_ID is a one of such "on demand" feature, which will
be introduced in the next patch.
QEMU have to use separate userfault file descriptor, due to
userfault context has internal state, and after first call of
ioctl UFFD_API it changes its state to UFFD_STATE_RUNNING (in case of
success), but kernel while handling ioctl UFFD_API expects UFFD_STATE_WAIT_API.
So only one ioctl with UFFD_API is possible per ufd.
migration: pass MigrationIncomingState* into migration check functions
That tiny refactoring is necessary to be able to set
UFFD_FEATURE_THREAD_ID while requesting features, and then
to create downtime context in case when kernel supports it.
Fill postcopy-able pending only if ram postcopy is enabled.
It is necessary because of there will be other postcopy-able states and
when ram postcopy is disabled, it should not spoil common postcopy
related pending.
Now postcopy-able states are recognized by not NULL
save_live_complete_postcopy handler. But when we have several different
postcopy-able states, it is not convenient. Ram postcopy may be
disabled, while some other postcopy enabled, in this case Ram state
should behave as it is not postcopy-able.
This patch add separate has_postcopy handler to specify behaviour of
savevm state.
Peter Xu [Wed, 30 Aug 2017 08:32:00 +0000 (16:32 +0800)]
bitmap: provide to_le/from_le helpers
Provide helpers to convert bitmaps to little endian format. It can be
used when we want to send one bitmap via network to some other hosts.
One thing to mention is that, these helpers only solve the problem of
endianess, but it does not solve the problem of different word size on
machines (the bitmaps managing same count of bits may contains different
size when malloced). So we need to take care of the size alignment issue
on the callers for now.
Catch inconsistent defaults (eric).
Improve comment stating that number of threads is the same than number
of sockets
Use new DEFIN_PROP_*
Rename x-multifd-threads to x-multifd-threads
We pass the ioc instead of the fd. This will allow us to have more
than one channel open. We also make sure that we set the
from_src_file sooner, so we don't need to pass it as a parameter.
Peter Maydell [Fri, 22 Sep 2017 11:14:27 +0000 (12:14 +0100)]
Merge remote-tracking branch 'remotes/famz/tags/build-and-test-automation-pull-request' into staging
# gpg: Signature made Fri 22 Sep 2017 08:28:38 BST
# gpg: using RSA key 0xCA35624C6A9171C6
# gpg: Good signature from "Fam Zheng <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 5003 7CB7 9706 0F76 F021 AD56 CA35 624C 6A91 71C6
* remotes/famz/tags/build-and-test-automation-pull-request: (36 commits)
docker: Drop 'set -e' from run script
docker: Use archive-source.py
tests: Add README for vm tests
MAINTAINERS: Add tests/vm entry
Makefile: Add rules to run vm tests
tests: Add OpenBSD image
tests: Add NetBSD image
tests: Add FreeBSD image
tests: Add ubuntu.i386 image
tests: Add vm test lib
tests: Add a test key pair
scripts: Add archive-source.sh
qemu.py: Add "wait()" method
gitignore: Ignore vm test images
MAINTAINERS: Fix subsystem name for "Build and test automation"
buildsys: Move rdma libs to per object
buildsys: Move brlapi libs to per object
buildsys: Move usb redir cflags/libs to per object
buildsys: Move libusb cflags/libs to per object
buildsys: Move libcacard cflags/libs to per object
...
seccomp: Don't include libseccomp from QEMU header
The only prototype doesn't need anything from the lib header, and not
including it here allows files that include this header, for example
vl.c, to compile without the libseccomp cflags.
The breakage is since c3883e1f93 for environments where `pkg-config
--cflags libseccomp" is non-empty.
The migration interface for ais was introduced with kernel 4.13
but the capability itself had been active since 4.12. As migration
support is considered necessary lets disable ais in the 2.10
stable version. A proper fix and re-enablement will be done
for qemu 2.11.
Alex Bennée [Tue, 25 Jul 2017 13:34:23 +0000 (14:34 +0100)]
docker: docker.py make --no-cache skip checksum test
If you invoke with NOCACHE=1 we pass --no-cache in the argv to
docker.py but may still not force a rebuild if the dockerfile checksum
hasn't changed. By testing for its presence we can force builds
without having to manually remove the docker image.
Paolo Bonzini [Mon, 21 Aug 2017 16:58:56 +0000 (18:58 +0200)]
scsi, file-posix: add support for persistent reservation management
It is a common requirement for virtual machine to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.
As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.
The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands. For example:
Multiple pr-manager implementations are conceivable and possible, though
only one is implemented right now. For example, a pr-manager could:
- talk directly to the multipath daemon from a privileged QEMU
(i.e. QEMU links to libmpathpersist); this makes reservation work
properly with multipath, but still requires CAP_SYS_RAWIO
- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)
- more interestingly, implement reservations directly in QEMU
through file system locks or a shared database (e.g. sqlite)
This shares an cached empty FlatView among address spaces. The empty
FV is used every time when a root MR renders into a FV without memory
sections which happens when MR or its children are not enabled or
zero-sized. The empty_view is not NULL to keep the rest of memory
API intact; it also has a dispatch tree for the same reason.
On POWER8 with 255 CPUs, 255 virtio-net, 40 PCI bridges guest this halves
the amount of FlatView's in use (557 -> 260) and dispatch tables
(~800000 -> ~370000). In an unrelated experiment with 112 non-virtio
devices on x86 ("-M pc"), only 4 FlatViews are alive, and about ~2000
are created at startup.
Paolo Bonzini [Thu, 21 Sep 2017 10:28:16 +0000 (12:28 +0200)]
memory: seek FlatView sharing candidates among children subregions
A container can be used instead of an alias to allow switching between
multiple subregions. In this case we cannot directly share the
subregions (since they only belong to a single parent), but if the
subregions are aliases we can in turn walk those.
This is not enough to remove all source of quadratic FlatView creation,
but it enables sharing of the PCI bus master FlatViews (and their
AddressSpaceDispatch structures) across all PCI devices. For 112
virtio-net-pci devices, boot time is reduced from 25 to 10 seconds and
memory consumption from 1.4 to 1 G.