Kevin Wolf [Wed, 4 Dec 2013 10:06:36 +0000 (11:06 +0100)]
qcow2: Zero-initialise first cluster for new images
Strictly speaking, this is only required for has_zero_init() == false,
but it's easy enough to just do a cluster-aligned write that is padded
with zeros after the header.
This fixes that after 'qemu-img create' header extensions are attempted
to be parsed that are really just random leftover data.
Max Reitz [Tue, 3 Dec 2013 13:57:52 +0000 (14:57 +0100)]
block: Close backing file early in bdrv_img_create
Leaving the backing file open although it is not needed anymore can
cause problems if it is opened through a block driver which allows
exclusive access only and if the create function of the block driver
used for the top image (the one being created) tries to close and reopen
the image file (which will include opening the backing file a second
time).
In particular, this will happen with a backing file opened through
qemu-nbd and using qcow2 as the top image file format (which reopens the
image to flush it to disk).
In addition, the BlockDriverState in bdrv_img_create() is used for the
backing file only; it should therefore be made local to the respective
block.
Paolo Bonzini [Fri, 22 Nov 2013 12:40:01 +0000 (13:40 +0100)]
scsi-disk: correctly implement WRITE SAME
Fetch the data to be written from the input buffer. If it is all zeroes,
we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
Otherwise, do as many write cycles as needed, writing 512k at a time.
Strictly speaking, this is still incorrect because a zero cluster should
only be written if the MAY_UNMAP flag is set. But this is a bug in qcow2
and the other formats, not in the SCSI code.
Paolo Bonzini [Fri, 22 Nov 2013 12:40:00 +0000 (13:40 +0100)]
scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
Since we report ANC_SUP==0 in VPD page B2h, we need to return
an error (ILLEGAL REQUEST/INVALID FIELD IN CDB) for all WRITE SAME
requests with ANCHOR==1.
Inspired by a similar patch to the LIO in-kernel target.
Paolo Bonzini [Fri, 22 Nov 2013 12:39:57 +0000 (13:39 +0100)]
raw-posix: add support for write_zeroes on XFS and block devices
The code is similar to the implementation of discard and write_zeroes
with UNMAP. However, failure must be propagated up to block.c.
The stale page cache problem can be reproduced as follows:
# modprobe scsi-debug lbpws=1 lbprz=1
# ./qemu-io /dev/sdXX
qemu-io> write -P 0xcc 0 2M
qemu-io> write -z 0 1M
qemu-io> read -P 0x00 0 512
Pattern verification failed at offset 0, 512 bytes
qemu-io> read -v 0 512 00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................
...
Paolo Bonzini [Fri, 22 Nov 2013 12:39:48 +0000 (13:39 +0100)]
block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
Right now, bdrv_co_do_write_zeroes will only try to align the
beginning of the request. However, it is simpler for many
formats to expect the block layer to separate both the head *and*
the tail. This makes sure that the format's bdrv_co_write_zeroes
function will be called with aligned sector_num and nb_sectors for
the bulk of the request.
Paolo Bonzini [Fri, 22 Nov 2013 12:39:47 +0000 (13:39 +0100)]
block: handle ENOTSUP from discard in generic code
Similar to write_zeroes, let the generic code receive a ENOTSUP for
discard operations. Since bdrv_discard has advisory semantics,
we can just swallow the error.
Paolo Bonzini [Fri, 22 Nov 2013 12:39:43 +0000 (13:39 +0100)]
block: generalize BlockLimits handling to cover bdrv_aio_discard too
bdrv_co_discard is only covering drivers which have a .bdrv_co_discard()
implementation, but not those with .bdrv_aio_discard(). Not very nice,
and easy to avoid.
Fam Zheng [Tue, 3 Dec 2013 02:41:05 +0000 (10:41 +0800)]
vmdk: Fix creating big description file
The buffer for description file was 4096 which only covers a few
hundred of extents. This changes the buffer to dynamic allocated with
g_strdup_printf in order to support bigger cases.
Kevin Wolf [Thu, 28 Nov 2013 10:58:02 +0000 (11:58 +0100)]
block: Use BDRV_O_NO_BACKING where appropriate
If you open an image temporarily just because you want to check its size
or get it flushed, there's no real reason to open the whole backing file
chain.
Kevin Wolf [Thu, 14 Nov 2013 14:37:12 +0000 (15:37 +0100)]
block: Enable BDRV_O_SNAPSHOT with driver-specific options
In the case of snapshot=on, don't rely on the backing file path in the
temporary image any more, but override the backing file with the given
set of options. This way, block drivers that don't use a file name can
be accessed with snapshot=on, for example:
Fam Zheng [Wed, 20 Nov 2013 02:01:54 +0000 (10:01 +0800)]
blkdebug: add "remove_break" command
This adds "remove_break" command which is the reverse of blkdebug
command "break": it removes all breakpoints with given tag and resumes
all the requests.
Kevin Wolf [Wed, 20 Nov 2013 12:09:21 +0000 (13:09 +0100)]
qdict: Optimise qdict_do_flatten()
Nested QDicts used to be both entered recursively in order to move their
entries to the target QDict and also be moved themselves to the target
QDict like all other objects. This is harmless because for the top
level, qdict_do_flatten() will encounter the (now empty) QDict for a
second time and then delete it, but at the same time it's obviously
unnecessary overhead. Just delete nested QDicts directly after moving
all of their entries.
Charlie Shepherd [Fri, 15 Nov 2013 18:47:02 +0000 (19:47 +0100)]
COW: Extend checking allocated bits to beyond one sector
cow_co_is_allocated() only checks one sector's worth of allocated bits
before returning. This is allowed but (slightly) inefficient, so extend
it to check all of the file's metadata sectors.
Charlie Shepherd [Fri, 15 Nov 2013 18:47:01 +0000 (19:47 +0100)]
COW: Speed up writes
Process a whole sector's worth of COW bits by reading a sector, setting
the bits after skipping any already set bits, then writing it out again.
Make sure we only flush once before writing metadata, and only if we
need to write metadata.
Fam Zheng [Wed, 13 Nov 2013 10:29:43 +0000 (18:29 +0800)]
block: per caller dirty bitmap
Previously a BlockDriverState has only one dirty bitmap, so only one
caller (e.g. a block job) can keep track of writing. This changes the
dirty bitmap to a list and creates a BdrvDirtyBitmap for each caller, the
lifecycle is managed with these new functions:
Max Reitz [Wed, 13 Nov 2013 19:37:58 +0000 (20:37 +0100)]
block/stream: Don't stream unbacked devices
If a block device is unbacked, a streaming blockjob should immediately
finish instead of beginning to try to stream, then noticing the backing
file does not contain even the first sector (since it does not exist)
and then finishing normally.
Max Reitz [Thu, 7 Nov 2013 19:10:29 +0000 (20:10 +0100)]
util/error: Save errno from clobbering
There may be calls to error_setg() and especially error_setg_errno()
which blindly (and until now wrongly) assume these functions not to
clobber errno (e.g., they pass errno to error_setg_errno() and return
-errno afterwards). Instead of trying to find and fix all of these
constructs, just make sure error_setg() and error_setg_errno() indeed do
not clobber errno.
Peter Lieven [Thu, 24 Oct 2013 10:07:06 +0000 (12:07 +0200)]
qemu-img: conditionally zero out target on convert
If the target has_zero_init = 0, but supports efficiently
writing zeroes by unmapping we call bdrv_make_zero to
avoid fully allocating the target. This currently works
only for iscsi. It can be extended to raw with
BLKDISCARDZEROES for example.
Peter Lieven [Thu, 24 Oct 2013 10:07:04 +0000 (12:07 +0200)]
block/get_block_status: fix BDRV_BLOCK_ZERO for unallocated blocks
this patch does 2 things:
a) only do additional call outs if BDRV_BLOCK_ZERO is not already set.
b) use the newly introduced bdrv_unallocated_blocks_are_zero()
to return the zero state of an unallocated block. the used callout
to bdrv_has_zero_init() is only valid right after bdrv_create.
Peter Lieven [Thu, 24 Oct 2013 10:07:03 +0000 (12:07 +0200)]
block: introduce bdrv_make_zero
this patch adds a call to completely zero out a block device.
the operation is sped up by checking the block status and
only writing zeroes to the device if they currently do not
return zeroes. optionally the zero writing can be sped up
by setting the flag BDRV_REQ_MAY_UNMAP to emulate the zero
write by unmapping if the driver supports it.
Peter Lieven [Thu, 24 Oct 2013 10:06:54 +0000 (12:06 +0200)]
block: add wrappers for logical block provisioning information
This adds 2 wrappers to read the unallocated_blocks_are_zero and
can_write_zeroes_with_unmap info from the BDI. The wrappers are
required to check for the existence of a backing_hd and
if the devices are opened with the correct flags.
Max Reitz [Mon, 25 Nov 2013 19:28:56 +0000 (20:28 +0100)]
qemu-iotests: Fix test 041
Performing multiple drive-mirror blockjobs on the same qemu instance
results in the image file used for the block device being replaced by
the newly mirrored file, which is not what we want.
Fix this by performing one dedicated test per sync mode.
Max Reitz [Mon, 25 Nov 2013 19:28:55 +0000 (20:28 +0100)]
block/drive-mirror: Reuse backing HD for sync=none
For "none" sync mode in "absolute-paths" mode, the current image should
be used as the backing file for the newly created image.
The current behavior is:
a) If the image to be mirrored has a backing file, use that (which is
wrong, since the operations recorded by "none" are applied to the
image itself, not to its backing file).
b) If the image to be mirrored lacks a backing file, the target doesn't
have one either (which is not really wrong, but not really right,
either; "none" records a set of operations executed on the image
file, therefore having no backing file to apply these operations on
seems rather pointless).
For a, this is clearly a bugfix. For b, it is still a bugfix, although
it might break existing API - but since that case crashed qemu just
three weeks ago (before 1452686495922b81d6cf43edf025c1aef15965c0), we
can safely assume there is no such API relying on that case yet.
Alexander Graf [Mon, 25 Nov 2013 21:46:55 +0000 (22:46 +0100)]
PPC: BookE: Make FIT/WDT timers at best millisecond grained
The default granularity for the FIT timer on 440 is on every 0x1000th
transition of TB from 0 to 1. Translated that means 48828 times a second.
Since interrupts are quite expensive for 440 and we don't really care
about the accuracy of the FIT to that significance, let's force FIT and
WDT to at best millisecond granularity.
This basically restores behavior as it was in QEMU 1.6, where timers
could only deal with millisecond granularities at all.
This patch greatly improves performance with the 440 target and restores
roughly the same performance level that QEMU 1.6 had for me.
Alexander Graf [Mon, 25 Nov 2013 21:46:54 +0000 (22:46 +0100)]
PPC: Make BookE FIT/WDT timers more lazy
Today we fire FIT and WDT timer events every time the respective bit
position in TB flips from 0 -> 1.
However, there is no need to do this if the end result would be that
we're changing a TSR bit that is set to 1 to 1 again. No guest visible
change would have occured.
So whenever we see that the TSR bit to our timer is already set, don't
even bother to update the timer that would potentially fire it off.
However, we do need to make sure that we update our timer that notifies
us of the TB flip when the respective TSR bit gets unset. In that case
we do care about the flip and need to notify the guest again. So add
a callback into our timer handlers when TSR bits get unset.
This improves performance for me when the guest is busy processing things.
Anthony Liguori [Mon, 25 Nov 2013 17:49:42 +0000 (09:49 -0800)]
Merge remote-tracking branch 'mst/tags/for_anthony' into staging
pc very last minute fixes for 1.7
This has a fix for a crasher bug with pci bridges,
boot failure fix for s390 on 32 bit hosts,
and fixes build for hosts with old glib.
There's also a fix for --iasl configure flag - it can be used
to work around broken iasl on some systems either
by using a non-standard iasl or by disabling it.
I've also reverted a e1000/rtl mac programming change
that seems slightly wrong and too risky for 1.8.
Signed-off-by: Michael S. Tsirkin <[email protected]>
# gpg: Signature made Mon 25 Nov 2013 03:40:07 AM PST using RSA key ID D28D5469
# gpg: Can't check signature: public key not found
# By Michael S. Tsirkin (5) and Bandan Das (1)
# Via Michael S. Tsirkin
* mst/tags/for_anthony:
configure: make --iasl option actually work
Revert "e1000/rtl8139: update HMP NIC when every bit is written"
acpi-build: fix build on glib < 2.14
acpi-build: fix build on glib < 2.22
pci: unregister vmstate_pcibus on unplug
s390x: fix flat file load on 32 bit systems
Anthony Liguori [Mon, 25 Nov 2013 17:41:24 +0000 (09:41 -0800)]
Merge remote-tracking branch 'bonzini/tags/for-anthony' into staging
Here are a bunch of 1.7-tagged patches that I was afraid
were getting forgotten or that did not have a clear maintainer responsible
for making a pull request.
# gpg: Signature made Thu 21 Nov 2013 08:40:59 AM PST using RSA key ID 9B4D86F2
# gpg: Can't check signature: public key not found
# By Peter Maydell (3) and others
# Via Paolo Bonzini
* bonzini/tags/for-anthony:
qga: Fix compiler warnings (missing format attribute, wrong format strings)
mips jazz: do not raise data bus exception when accessing invalid addresses
target-i386: yield to another VCPU on PAUSE
rng-egd: offset the point when repeatedly read from the buffer
rng-egd: remove redundant free
target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
vfio-pci: Fix multifunction=on
atomic.h: Fix build with clang
pc: get rid of builtin pvpanic for "-M pc-1.5"
configure: Explicitly set ARFLAGS so we can build with GNU Make 4.0
sun4m: Add FCode ROM for TCX framebuffer
Tomoki Sekiyama [Fri, 1 Nov 2013 21:47:25 +0000 (17:47 -0400)]
qemu-ga: vss-win32: Install VSS provider COM+ application service
Currently, qemu-ga for Windows fails to execute guset-fsfreeze-freeze when
no user is logging in to Windows, with an error message:
{"error":{"class":"GenericError",
"desc":"failed to add C:\\ to snapshotset: (error: 8004230f)"}}
To enable guest-fsfreeze-freeze/thaw without logging in users, this installs
a service to execute qemu-ga VSS provider COM+ application that has full
access privileges to the local system. The service will automatically be
removed when the COM+ application is deregistered.
This patch replaces ICOMAdminCatalog interface with ICOMAdminCatalog2
interface that contains CreateServiceForApplication() method in addition.
Vlad Yasevich [Fri, 8 Nov 2013 02:13:09 +0000 (21:13 -0500)]
qdev-properties-system.c: Allow vlan or netdev for -device, not both
It is currently possible to specify things like:
-device e1000,netdev=foo,vlan=1
With this usage, whichever argument was specified last (vlan or netdev)
overwrites what was previousely set and results in a non-working
configuration. Even worse, when used with multiqueue devices,
it causes a segmentation fault on exit in qemu_free_net_client.
That patch treates the above command line options as invalid and
generates an error at start-up.
Stefan Weil [Sun, 17 Nov 2013 18:19:52 +0000 (19:19 +0100)]
qga: Fix compiler warnings (missing format attribute, wrong format strings)
gcc 4.8.2 reports this warning when extra warnings are enabled (-Wextra):
CC qga/commands.o
qga/commands.c: In function ‘slog’:
qga/commands.c:28:5: error:
function might be possible candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
g_logv("syslog", G_LOG_LEVEL_INFO, fmt, ap);
^
gcc 4.8.2 reports this warning when slog is declared with the
gnu_printf format attribute:
qga/commands-posix.c: In function ‘qmp_guest_file_open’:
qga/commands-posix.c:404:5: warning:
format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ [-Wformat=]
slog("guest-file-open, handle: %d", handle);
^
On 32 bit hosts there are three more warnings which are also fixed here.
mips jazz: do not raise data bus exception when accessing invalid addresses
MIPS Jazz chipset doesn't seem to raise data bus exceptions on invalid accesses.
However, there is no easy way to prevent them. Creating a big memory region
for the whole address space doesn't prevent memory core to directly call
unassigned_mem_read/write which in turn call cpu->do_unassigned_access,
which (for MIPS CPU) raise an data bus exception.
Paolo Bonzini [Wed, 20 Nov 2013 11:54:02 +0000 (12:54 +0100)]
target-i386: yield to another VCPU on PAUSE
After commit b1bbfe7 (aio / timers: On timer modification, qemu_notify
or aio_notify, 2013-08-21) FreeBSD guests report a huge slowdown.
The problem shows up as soon as FreeBSD turns out its periodic (~1 ms)
tick, but the timers are only the trigger for a pre-existing problem.
Before the offending patch, setting a timer did a timer_settime system call.
After, setting the timer exits the event loop (which uses poll) and
reenters it with a new deadline. This does not cause any slowdown; the
difference is between one system call (timer_settime and a signal
delivery (SIGALRM) before the patch, and two system calls afterwards
(write to a pipe or eventfd + calling poll again when re-entering the
event loop).
Unfortunately, the exit/enter causes the main loop to grab the iothread
lock, which in turns kicks the VCPU thread out of execution. This
causes TCG to execute the next VCPU in its round-robin scheduling of
VCPUS. When the second VCPU is mostly unused, FreeBSD runs a "pause"
instruction in its idle loop which only burns cycles without any
progress. As soon as the timer tick expires, the first VCPU runs
the interrupt handler but very soon it sets it again---and QEMU
then goes back doing nothing in the second VCPU.
The fix is to make the pause instruction do "cpu_loop_exit".
Peter Maydell [Wed, 13 Nov 2013 23:09:07 +0000 (23:09 +0000)]
target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
Fix build failures with clang when KVM is not enabled by
providing a stub version of kvm_arch_get_supported_cpuid().
We retain the compile time check that this function isn't
called when CONFIG_KVM is not set by guarding the stub with
ifndef __OPTIMIZE__ (we assume that an optimizing build will
do sufficient constant folding and dead code elimination to
remove the calls before linking).
Alex Williamson [Tue, 12 Nov 2013 18:53:24 +0000 (11:53 -0700)]
vfio-pci: Fix multifunction=on
When an assigned device is initialized it copies the device config
space into the emulated config space. Unfortunately multifunction is
setup prior to the device initfn and gets clobbered. We need to
restore it just like pci-assign does.
Peter Maydell [Tue, 22 Oct 2013 09:58:41 +0000 (10:58 +0100)]
atomic.h: Fix build with clang
clang defines __ATOMIC_SEQ_CST but its implementation of the
__atomic_exchange() builtin differs from that of gcc. Move the
__clang__ branch of the ifdef ladder to the top and fix its
implementation (there is no such builtin as __sync_exchange),
so we can compile with clang again.
Paolo Bonzini [Mon, 4 Nov 2013 13:30:48 +0000 (14:30 +0100)]
pc: get rid of builtin pvpanic for "-M pc-1.5"
This causes two slight backwards-incompatibilities between "-M pc-1.5"
and 1.5's "-M pc":
(1) a fw_cfg file is removed with this patch. This is only a problem
if migration stops the virtual machine exactly during fw_cfg enumeration.
(2) after migration, a VM created without an explicit "-device pvpanic"
will stop reporting panics to management.
The first problem only occurs if migration is done at a very, very
early point (and I'm not sure it can happen in practice for reasonable-size
VMs, since it will likely take more time to send the RAM to destination,
than it will take for BIOS to scan fw_cfg).
The second problem only occurs if the guest panics _and_ has a guest
driver _and_ management knows to look at the crash event, so it is
mostly theoretical at this point in time.
Thus keep the code simple, and pretend it was never broken.
Peter Maydell [Mon, 21 Oct 2013 20:03:06 +0000 (21:03 +0100)]
configure: Explicitly set ARFLAGS so we can build with GNU Make 4.0
Our rules.mak adds '-rR' to MAKEFLAGS to indicate that we will be
explicitly specifying everything and not relying on any default
variables or rules. However we were accidentally relying on the
default ARFLAGS ("rv"). This went unnoticed because of a bug in
GNU Make 3.82 and earlier which meant that adding -rR to MAKEFLAGS
only affected submakes, not the currently running instance.
Explicitly set ARFLAGS in config-host.mak, in the same way we
handle CFLAGS and LDFLAGS; this will allow us to work with
Make 4.0.
Thanks to Paul Smith for analyzing this bug for us.
Upstream OpenBIOS now implements SBus probing in order to determine the
contents of a physical bus slot, which is required to allow OpenBIOS to
identify the framebuffer without help from the fw_cfg interface.
SBus probing works by detecting the presence of an FCode program
(effectively tokenised Forth) at the base address of each slot, and if
present executes it so that it creates its own device node in the
OpenBIOS device tree.
The FCode ROM is generated as part of the OpenBIOS build and should
generally be updated at the same time.
Alex Williamson [Tue, 12 Nov 2013 18:53:24 +0000 (11:53 -0700)]
vfio-pci: Fix multifunction=on
When an assigned device is initialized it copies the device config
space into the emulated config space. Unfortunately multifunction is
setup prior to the device initfn and gets clobbered. We need to
restore it just like pci-assign does.
Peter Maydell [Tue, 22 Oct 2013 09:58:41 +0000 (10:58 +0100)]
atomic.h: Fix build with clang
clang defines __ATOMIC_SEQ_CST but its implementation of the
__atomic_exchange() builtin differs from that of gcc. Move the
__clang__ branch of the ifdef ladder to the top and fix its
implementation (there is no such builtin as __sync_exchange),
so we can compile with clang again.
Paolo Bonzini [Tue, 19 Nov 2013 16:49:46 +0000 (17:49 +0100)]
target-i386: do not override nr_cores for -cpu host
Commit 787aaf5 (target-i386: forward CPUID cache leaves when -cpu host is
used, 2013-09-02) brings bits 31..26 of CPUID leaf 04h out of sync with
the APIC IDs that QEMU reserves for each package. This number must come
from "-smp" options rather than from the host CPUID.
It also turns out that this unsyncing makes Windows Server 2012R2 fail
to boot.
mips jazz: do not raise data bus exception when accessing invalid addresses
MIPS Jazz chipset doesn't seem to raise data bus exceptions on invalid accesses.
However, there is no easy way to prevent them. Creating a big memory region
for the whole address space doesn't prevent memory core to directly call
unassigned_mem_read/write which in turn call cpu->do_unassigned_access,
which (for MIPS CPU) raise an data bus exception.
Paolo Bonzini [Wed, 20 Nov 2013 11:54:02 +0000 (12:54 +0100)]
target-i386: yield to another VCPU on PAUSE
After commit b1bbfe7 (aio / timers: On timer modification, qemu_notify
or aio_notify, 2013-08-21) FreeBSD guests report a huge slowdown.
The problem shows up as soon as FreeBSD turns out its periodic (~1 ms)
tick, but the timers are only the trigger for a pre-existing problem.
Before the offending patch, setting a timer did a timer_settime system call.
After, setting the timer exits the event loop (which uses poll) and
reenters it with a new deadline. This does not cause any slowdown; the
difference is between one system call (timer_settime and a signal
delivery (SIGALRM) before the patch, and two system calls afterwards
(write to a pipe or eventfd + calling poll again when re-entering the
event loop).
Unfortunately, the exit/enter causes the main loop to grab the iothread
lock, which in turns kicks the VCPU thread out of execution. This
causes TCG to execute the next VCPU in its round-robin scheduling of
VCPUS. When the second VCPU is mostly unused, FreeBSD runs a "pause"
instruction in its idle loop which only burns cycles without any
progress. As soon as the timer tick expires, the first VCPU runs
the interrupt handler but very soon it sets it again---and QEMU
then goes back doing nothing in the second VCPU.
The fix is to make the pause instruction do "cpu_loop_exit".